SGLang accelerates LLM inference with RadixAttention and structured generation. Install with `pip install "sglang[all]"`, then `import sglang as sgl`. Define a program with the `@sgl.function` decorator, e.g. `def classify(s, document):` whose body appends `sgl.system("You are a classifier.")`, `sgl.user(document)`, and `sgl.assistant(sgl.gen("label", choices=["positive", "neutral", "negative"]))` to `s`. Run locally by creating `runtime = sgl.Runtime(model_path="meta-llama/Llama-3.1-8B-Instruct", tp_size=1)` and registering it with `sgl.set_default_backend(runtime)`. Call a program with `state = classify.run(document="Great product!")` and read results via `state["label"]`; batch with `states = classify.run_batch([{"document": d} for d in docs], num_threads=8)`. Constrain output with a regex, e.g. `sgl.gen("json_out", max_tokens=256, regex=r'\{"name":".+","age":\d+\}')`, or a JSON schema: `sgl.gen("out", json_schema=User.model_json_schema())`. For fork/join parallelism, call `forks = s.fork(3)`, append a `sgl.user(...)` prompt and `sgl.assistant(sgl.gen("view"))` to each fork, then `s.join(forks)` before a final synthesis `sgl.gen`. Pass images to vision models with `sgl.image(path_or_url)`. Switch to a remote backend with `sgl.set_default_backend(sgl.OpenAI("gpt-4o"))`. Launch an OpenAI-compatible server at http://localhost:30000 with `python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000 --tp 2`; add speculative decoding with `--speculative-algorithm EAGLE --speculative-draft-model-path draft_model`, and `--quantization fp8` for memory efficiency. RadixAttention automatically shares KV-cache prefixes across requests for 3-5x throughput on workloads with shared prefixes. Claude Code generates SGLang programs, constrained decoders, fork/join pipelines, server launch configs, and batch inference scripts.
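The regex constraint shown above can be sanity-checked offline before it gates generation. A minimal sketch (standard library only, no SGLang required) confirming that strings matching the pattern also parse as JSON; the helper name `is_valid_output` is illustrative:

```python
import json
import re

# The constraint pattern from the summary above.
PATTERN = re.compile(r'\{"name":".+","age":\d+\}')

def is_valid_output(text: str) -> bool:
    """True if text matches the regex constraint AND parses as JSON."""
    if not PATTERN.fullmatch(text):
        return False
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        # The greedy `.+` admits unescaped quotes, so a regex match
        # alone does not guarantee valid JSON.
        return False
```

For example, `is_valid_output('{"name":"Ada","age":36}')` is true, while `'{"age":36}'` fails the pattern. In practice `json_schema=` is the safer constraint for anything beyond flat, fixed-key objects.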
# CLAUDE.md for SGLang
## SGLang Stack
- Version: sglang >= 0.3
- Backend: Runtime(model_path, tp_size) for local | sgl.OpenAI("gpt-4o") for remote
- Program: @sgl.function def fn(s, **kwargs): s += sgl.system(...) + sgl.user(...) + sgl.assistant(sgl.gen("key"))
- Constrained: sgl.gen("key", choices=[...]) | regex=r"..." | json_schema=Model.model_json_schema()
- Batch: fn.run_batch([{"arg": val}, ...], num_threads=N)
- Fork/join: s.fork(N) → [f1, f2, ...] → s.join(forks) for parallel branches
- Vision: sgl.image(path_or_url) in sgl.user(...)
- Server: python -m sglang.launch_server --model-path ... --port 30000 --tp N
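The launched server speaks the OpenAI chat-completions protocol, so any HTTP client works. A minimal stdlib sketch, assuming a server started as in the bullet above; the `"default"` model name and the `/v1/chat/completions` path follow the OpenAI convention, so adjust both for your deployment:

```python
import json
from urllib import request

def build_chat_request(prompt: str, model: str = "default",
                       max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def query_server(prompt: str, base_url: str = "http://localhost:30000") -> str:
    """POST to the OpenAI-compatible endpoint of a running SGLang server."""
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        data = json.load(resp)
    # Standard OpenAI response shape: first choice's message content.
    return data["choices"][0]["message"]["content"]
```

The official `openai` Python client works the same way by pointing `base_url` at `http://localhost:30000/v1`.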
## SGLang Inference Programs
# inference/sglang_programs.py — high-throughput LLM inference with SGLang
from __future__ import annotations
import sglang as sgl
from pydantic import BaseModel
# ── 1. Backend setup ──────────────────────────────────────────────────────────
def setup_local_backend(
model_path: str = "meta-llama/Llama-3.1-8B-Instruct",
tp_size: int = 1,
port: int = 30000,
) -> sgl.Runtime:
"""
Launch local SGLang runtime with RadixAttention.
tp_size: tensor parallel degree (number of GPUs).
"""
runtime = sgl.Runtime(
model_path=model_path,
tp_size=tp_size,
port=port,
mem_fraction_static=0.88, # GPU memory for KV cache
disable_radix_cache=False, # Keep RadixAttention enabled
)
sgl.set_default_backend(runtime)
print(f"SGLang backend: {model_path} on {tp_size} GPU(s), port {port}")
return runtime
def setup_openai_backend(model: str = "gpt-4o-mini"):
"""Use OpenAI as backend — same programs work unchanged."""
backend = sgl.OpenAI(model)
sgl.set_default_backend(backend)
# ── 2. Simple generation programs ─────────────────────────────────────────────
@sgl.function
def simple_qa(s, question: str, system: str = "You are a helpful assistant."):
"""Basic Q&A program."""
s += sgl.system(system)
s += sgl.user(question)
s += sgl.assistant(sgl.gen("answer", max_tokens=512, temperature=0.7))
@sgl.function
def summarize(s, document: str, max_words: int = 50):
"""Summarize a document in a fixed word count."""
s += sgl.system("You are a concise summarizer.")
s += sgl.user(
f"Summarize the following in at most {max_words} words:\n\n{document}"
)
s += sgl.assistant(sgl.gen("summary", max_tokens=max_words * 2))
# ── 3. Constrained generation ─────────────────────────────────────────────────
@sgl.function
def classify_sentiment(s, text: str):
"""Classify sentiment — guaranteed to produce one of three labels."""
s += sgl.system("Classify the sentiment of the given text.")
s += sgl.user(f"Text: {text}\n\nSentiment:")
s += sgl.assistant(sgl.gen(
"label",
choices=["positive", "neutral", "negative"],
))
@sgl.function
def classify_priority(s, ticket: str):
"""Support ticket priority classification."""
s += sgl.system("You are a support ticket triage system.")
s += sgl.user(f"Ticket: {ticket}\n\nPriority:")
s += sgl.assistant(sgl.gen(
"priority",
choices=["critical", "high", "medium", "low"],
))
s += sgl.user("Department to route to:")
s += sgl.assistant(sgl.gen(
"department",
choices=["engineering", "billing", "account", "product"],
))
@sgl.function
def extract_date(s, text: str):
"""Extract date in ISO format — regex-constrained output."""
s += sgl.user(f"Extract the date from: '{text}'\n\nDate (YYYY-MM-DD):")
s += sgl.assistant(sgl.gen(
"date",
regex=r"\d{4}-\d{2}-\d{2}",
max_tokens=12,
))
# ── 4. JSON schema–constrained extraction ────────────────────────────────────
class ContactInfo(BaseModel):
name: str
email: str | None = None
company: str | None = None
role: str | None = None
@sgl.function
def extract_contact_json(s, text: str):
"""Extract contact info as validated JSON."""
s += sgl.system("Extract structured contact information as JSON.")
s += sgl.user(f"Text: {text}\n\nContact JSON:")
s += sgl.assistant(sgl.gen(
"contact",
json_schema=ContactInfo.model_json_schema(),
max_tokens=256,
))
def get_contact(text: str) -> ContactInfo:
state = extract_contact_json.run(text=text)
return ContactInfo.model_validate_json(state["contact"])
# ── 5. Multi-turn conversation ────────────────────────────────────────────────
@sgl.function
def multi_turn_chat(s, messages: list[dict]):
"""
Multi-turn conversation from a message history.
messages: [{"role": "user"|"assistant", "content": "..."}]
"""
s += sgl.system("You are a helpful assistant.")
for msg in messages:
if msg["role"] == "user":
s += sgl.user(msg["content"])
elif msg["role"] == "assistant":
s += sgl.assistant(msg["content"])
# Generate next assistant turn
s += sgl.assistant(sgl.gen("response", max_tokens=512))
# ── 6. Fork/join for parallel generation ─────────────────────────────────────
@sgl.function
def multi_perspective_analysis(s, topic: str):
"""
Generate multiple perspectives in parallel, then synthesize.
Fork creates independent generation branches; join merges them.
"""
s += sgl.system(
"You analyze topics from multiple expert perspectives."
)
s += sgl.user(f"Topic: {topic}")
# Parallel branches — each fork generates independently
perspectives = ["technical", "business", "ethical"]
forks = s.fork(len(perspectives))
for fork, perspective in zip(forks, perspectives):
fork += sgl.user(f"Provide the {perspective} perspective in 2 sentences:")
fork += sgl.assistant(sgl.gen(f"{perspective}_view", max_tokens=128))
# Join merges all fork states back
s.join(forks)
# Synthesize all perspectives
views = "\n".join(
f"{p}: {forks[i][f'{p}_view']}"
for i, p in enumerate(perspectives)
)
s += sgl.user(f"Synthesize these perspectives:\n{views}")
s += sgl.assistant(sgl.gen("synthesis", max_tokens=256))
@sgl.function
def parallel_qa(s, question: str, n_attempts: int = 3):
"""
Generate N independent answers in parallel and select the best.
Useful for self-consistency decoding.
"""
forks = s.fork(n_attempts)
for fork in forks:
fork += sgl.user(question)
fork += sgl.assistant(sgl.gen("answer", max_tokens=256, temperature=0.8))
s.join(forks)
answers = [fork["answer"] for fork in forks]
# Let the model pick the best answer
s += sgl.user(
f"Given these {n_attempts} answers to '{question}':\n"
+ "\n".join(f"{i+1}. {a}" for i, a in enumerate(answers))
+ "\n\nWhich is most accurate and complete? Provide the final answer:"
)
s += sgl.assistant(sgl.gen("best_answer", max_tokens=256, temperature=0.0))
# ── 7. Batch inference ────────────────────────────────────────────────────────
def batch_classify(texts: list[str]) -> list[dict[str, str]]:
"""
Classify a batch of texts for sentiment.
num_threads controls parallelism — SGLang handles scheduling.
"""
states = classify_sentiment.run_batch(
[{"text": t} for t in texts],
num_threads=min(32, len(texts)),
progress_bar=True,
)
return [{"text": t, "sentiment": s["label"]} for t, s in zip(texts, states)]
def batch_extract_contacts(texts: list[str]) -> list[ContactInfo | None]:
"""Batch contact extraction — processes all texts in parallel."""
states = extract_contact_json.run_batch(
[{"text": t} for t in texts],
num_threads=16,
)
results = []
for state in states:
try:
results.append(ContactInfo.model_validate_json(state["contact"]))
except Exception:
results.append(None)
return results
# ── 8. Vision / multimodal ────────────────────────────────────────────────────
@sgl.function
def describe_image(s, image_path: str, question: str = "Describe this image in detail."):
"""Multimodal image understanding — requires a vision model backend."""
s += sgl.user(sgl.image(image_path) + question)
s += sgl.assistant(sgl.gen("description", max_tokens=512))
@sgl.function
def visual_qa(s, image_url: str, question: str):
"""Visual question answering with constrained answer length."""
s += sgl.system("Answer questions about images concisely.")
s += sgl.user(sgl.image(image_url) + f"\nQuestion: {question}")
s += sgl.assistant(sgl.gen("answer", max_tokens=128))
# ── Demo ──────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
# Setup — use OpenAI for quick demo (no local GPU needed)
setup_openai_backend("gpt-4o-mini")
# Simple QA
state = simple_qa.run(question="What is RadixAttention?")
print(f"Answer: {state['answer'][:100]}...")
# Constrained classification
state = classify_sentiment.run(text="This library significantly improved our throughput!")
print(f"Sentiment: {state['label']}")
state = classify_priority.run(ticket="Payment processing is down for all enterprise customers")
print(f"Priority: {state['priority']}, Dept: {state['department']}")
# JSON extraction
contact = get_contact("Reach out to Sarah Chen, CTO at NeuralTech — [email protected]")
print(f"Contact: {contact.name}, {contact.email}, {contact.role}")
# Multi-perspective
state = multi_perspective_analysis.run(topic="Large language models in healthcare")
print(f"Technical view: {state['technical_view'][:80]}...")
print(f"Synthesis: {state['synthesis'][:100]}...")
# Batch
texts = [
"Outstanding product, exceeded all expectations!",
"Average quality, nothing special",
"Terrible experience, demanded a refund",
]
results = batch_classify(texts)
for r in results:
print(f" '{r['text'][:40]}...' → {r['sentiment']}")
Consider the vLLM alternative when you need OpenAI-compatible serving with LoRA hot-swapping, GPTQ/AWQ quantization support, and the largest community ecosystem. vLLM offers broader model and quantization-format coverage, while SGLang's @sgl.function programs, with fork/join parallelism and multiple constraint types (choices, regex, JSON schema), provide a higher-level abstraction designed for multi-step generation pipelines that would otherwise require several sequential API calls. Consider the TGI (Text Generation Inference) alternative when deploying in a Kubernetes environment with Hugging Face Hub integration and built-in model sharding. TGI excels at production Kubernetes deployments, while SGLang's RadixAttention KV-cache prefix sharing delivers 3-5x throughput improvements on workloads with shared prefixes (system prompts, few-shot examples, document contexts) without any code changes. The Claude Skills 360 bundle includes SGLang skill sets covering Runtime setup, constrained generation programs, fork/join parallelism, JSON schema extraction, batch inference, vision inputs, and server deployment configuration. Start with the free tier to try high-throughput LLM inference code generation.
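The shared-prefix speedup mentioned above can be estimated with a back-of-envelope model. This toy calculation is illustrative only and does not reflect SGLang internals:

```python
def prefix_reuse_fraction(shared_prefix_tokens: int,
                          unique_suffix_tokens: list[int]) -> float:
    """Fraction of prefill tokens a prefix cache can skip when every
    request in a batch shares the same prefix (e.g. a system prompt).

    The first request pays for the full prefix; each subsequent request
    reuses the cached prefix and only prefills its own unique suffix.
    """
    n = len(unique_suffix_tokens)
    total = sum(shared_prefix_tokens + u for u in unique_suffix_tokens)
    cached = shared_prefix_tokens * (n - 1)  # prefix computed once, reused n-1 times
    return cached / total

# A 1000-token system prompt shared by 8 requests with ~100 unique tokens
# each: roughly 80% of prefill work is avoidable, which is why prefix-heavy
# workloads see the largest gains.
```

Real speedups depend on decode length, batch scheduling, and cache eviction, so treat this only as intuition for why shared-prefix workloads benefit most.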