SGLang accelerates LLM inference with RadixAttention and structured generation. Install with `pip install "sglang[all]"`, then `import sglang as sgl`. Define a program with the `@sgl.function` decorator, e.g. `def classify(s, document):` whose body appends `sgl.system("You are a classifier.")`, `sgl.user(document)`, and `sgl.assistant(sgl.gen("label", choices=["positive", "neutral", "negative"]))` to `s`. Run locally by creating `runtime = sgl.Runtime(model_path="meta-llama/Llama-3.1-8B-Instruct", tp_size=1)` and registering it with `sgl.set_default_backend(runtime)`. Call a program with `state = classify.run(document="Great product!")` and read results via `state["label"]`; batch with `states = classify.run_batch([{"document": d} for d in docs], num_threads=8)`. Constrain output with a regex, e.g. `sgl.gen("json_out", max_tokens=256, regex=r'\{"name":".+","age":\d+\}')`, or a JSON schema: `sgl.gen("out", json_schema=User.model_json_schema())`. For fork/join parallelism, call `forks = s.fork(3)`, append a `sgl.user(...)` prompt and `sgl.assistant(sgl.gen("view"))` to each fork, then `s.join(forks)` before a final synthesis `sgl.gen`. Pass images to vision models with `sgl.image(path_or_url)`. Switch to a remote backend with `sgl.set_default_backend(sgl.OpenAI("gpt-4o"))`. Launch an OpenAI-compatible server at http://localhost:30000 with `python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000 --tp 2`; add speculative decoding with `--speculative-algorithm EAGLE --speculative-draft-model-path draft_model`, and `--quantization fp8` for memory efficiency. RadixAttention automatically shares KV-cache prefixes across requests for 3-5x throughput on workloads with shared prefixes. Claude Code generates SGLang programs, constrained decoders, fork/join pipelines, server launch configs, and batch inference scripts.
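The regex constraint shown above can be sanity-checked offline before it gates generation. A minimal sketch (standard library only, no SGLang required) confirming that strings matching the pattern also parse as JSON; the helper name `is_valid_output` is illustrative:

```python
import json
import re

# The constraint pattern from the summary above.
PATTERN = re.compile(r'\{"name":".+","age":\d+\}')

def is_valid_output(text: str) -> bool:
    """True if text matches the regex constraint AND parses as JSON."""
    if not PATTERN.fullmatch(text):
        return False
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        # The greedy `.+` admits unescaped quotes, so a regex match
        # alone does not guarantee valid JSON.
        return False
```

For example, `is_valid_output('{"name":"Ada","age":36}')` is true, while `'{"age":36}'` fails the pattern. In practice `json_schema=` is the safer constraint for anything beyond flat, fixed-key objects.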
# CLAUDE.md for SGLang
## SGLang Stack
- Version: sglang >= 0.3
- Backend: Runtime(model_path, tp_size) for local | sgl.OpenAI("gpt-4o") for remote
- Program: @sgl.function def fn(s, **kwargs): s += sgl.system(...) + sgl.user(...) + sgl.assistant(sgl.gen("key"))
- Constrained: sgl.gen("key", choices=[...]) | regex=r"..." | json_schema=Model.model_json_schema()
- Batch: fn.run_batch([{"arg": val}, ...], num_threads=N)
- Fork/join: s.fork(N) → [f1, f2, ...] → s.join(forks) for parallel branches
- Vision: sgl.image(path_or_url) in sgl.user(...)
- Server: python -m sglang.launch_server --model-path ... --port 30000 --tp N
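The launched server speaks the OpenAI chat-completions protocol, so any HTTP client works. A minimal stdlib sketch, assuming a server started as in the bullet above; the `"default"` model name and the `/v1/chat/completions` path follow the OpenAI convention, so adjust both for your deployment:

```python
import json
from urllib import request

def build_chat_request(prompt: str, model: str = "default",
                       max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def query_server(prompt: str, base_url: str = "http://localhost:30000") -> str:
    """POST to the OpenAI-compatible endpoint of a running SGLang server."""
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        data = json.load(resp)
    # Standard OpenAI response shape: first choice's message content.
    return data["choices"][0]["message"]["content"]
```

The official `openai` Python client works the same way by pointing `base_url` at `http://localhost:30000/v1`.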
## SGLang Inference Programs
# inference/sglang_programs.py — high-throughput LLM inference with SGLang
from __future__ import annotations
import sglang as sgl
from pydantic import BaseModel
# ── 1. Backend setup ──────────────────────────────────────────────────────────
def setup_local_backend(
model_path: str = "meta-llama/Llama-3.1-8B-Instruct",
tp_size: int = 1,
port: int = 30000,
) -> sgl.Runtime:
"""
Launch local SGLang runtime with RadixAttention.
tp_size: tensor parallel degree (number of GPUs).
"""
runtime = sgl.Runtime(
model_path=model_path,
tp_size=tp_size,
port=port,
mem_fraction_static=0.88, # GPU memory for KV cache
disable_radix_cache=False, # Keep RadixAttention enabled
)
sgl.set_default_backend(runtime)
print(f"SGLang backend: {model_path} on {tp_size} GPU(s), port {port}")
return runtime
def setup_openai_backend(model: str = "gpt-4o-mini"):
"""Use OpenAI as backend — same programs work unchanged."""
backend = sgl.OpenAI(model)
sgl.set_default_backend(backend)
# ── 2. Simple generation programs ─────────────────────────────────────────────
@sgl.function
def simple_qa(s, question: str, system: str = "You are a helpful assistant."):
"""Basic Q&A program."""
s += sgl.system(system)
s += sgl.user(question)
s += sgl.assistant(sgl.gen("answer", max_tokens=512, temperature=0.7))
@sgl.function
def summarize(s, document: str, max_words: int = 50):
"""Summarize a document in a fixed word count."""
s += sgl.system("You are a concise summarizer.")
s += sgl.user(
f"Summarize the following in at most {max_words} words:\n\n{document}"
)
s += sgl.assistant(sgl.gen("summary", max_tokens=max_words * 2))
# ── 3. Constrained generation ─────────────────────────────────────────────────
@sgl.function
def classify_sentiment(s, text: str):
"""Classify sentiment — guaranteed to produce one of three labels."""
s += sgl.system("Classify the sentiment of the given text.")
s += sgl.user(f"Text: {text}\n\nSentiment:")
s += sgl.assistant(sgl.gen(
"label",
choices=["positive", "neutral", "negative"],
))
@sgl.function
def classify_priority(s, ticket: str):
"""Support ticket priority classification."""
s += sgl.system("You are a support ticket triage system.")
s += sgl.user(f"Ticket: {ticket}\n\nPriority:")
s += sgl.assistant(sgl.gen(
"priority",
choices=["critical", "high", "medium", "low"],
))
s += sgl.user("Department to route to:")
s += sgl.assistant(sgl.gen(
"department",
choices=["engineering", "billing", "account", "product"],
))
@sgl.function
def extract_date(s, text: str):
"""Extract date in ISO format — regex-constrained output."""
s += sgl.user(f"Extract the date from: '{text}'\n\nDate (YYYY-MM-DD):")
s += sgl.assistant(sgl.gen(
"date",
regex=r"\d{4}-\d{2}-\d{2}",
max_tokens=12,
))
# ── 4. JSON schema–constrained extraction ────────────────────────────────────
class ContactInfo(BaseModel):
name: str
email: str | None = None
company: str | None = None
role: str | None = None
@sgl.function
def extract_contact_json(s, text: str):
"""Extract contact info as validated JSON."""
s += sgl.system("Extract structured contact information as JSON.")
s += sgl.user(f"Text: {text}\n\nContact JSON:")
s += sgl.assistant(sgl.gen(
"contact",
json_schema=ContactInfo.model_json_schema(),
max_tokens=256,
))
def get_contact(text: str) -> ContactInfo:
state = extract_contact_json.run(text=text)
return ContactInfo.model_validate_json(state["contact"])
# ── 5. Multi-turn conversation ────────────────────────────────────────────────
@sgl.function
def multi_turn_chat(s, messages: list[dict]):
"""
Multi-turn conversation from a message history.
messages: [{"role": "user"|"assistant", "content": "..."}]
"""
s += sgl.system("You are a helpful assistant.")
for msg in messages:
if msg["role"] == "user":
s += sgl.user(msg["content"])
elif msg["role"] == "assistant":
s += sgl.assistant(msg["content"])
# Generate next assistant turn
s += sgl.assistant(sgl.gen("response", max_tokens=512))
# ── 6. Fork/join for parallel generation ─────────────────────────────────────
@sgl.function
def multi_perspective_analysis(s, topic: str):
"""
Generate multiple perspectives in parallel, then synthesize.
Fork creates independent generation branches; join merges them.
"""
s += sgl.system(
"You analyze topics from multiple expert perspectives."
)
s += sgl.user(f"Topic: {topic}")
# Parallel branches — each fork generates independently
perspectives = ["technical", "business", "ethical"]
forks = s.fork(len(perspectives))
for fork, perspective in zip(forks, perspectives):
fork += sgl.user(f"Provide the {perspective} perspective in 2 sentences:")
fork += sgl.assistant(sgl.gen(f"{perspective}_view", max_tokens=128))
# Join merges all fork states back
s.join(forks)
# Synthesize all perspectives
views = "\n".join(
f"{p}: {forks[i][f'{p}_view']}"
for i, p in enumerate(perspectives)
)
s += sgl.user(f"Synthesize these perspectives:\n{views}")
s += sgl.assistant(sgl.gen("synthesis", max_tokens=256))
@sgl.function
def parallel_qa(s, question: str, n_attempts: int = 3):
"""
Generate N independent answers in parallel and select the best.
Useful for self-consistency decoding.
"""
forks = s.fork(n_attempts)
for fork in forks:
fork += sgl.user(question)
fork += sgl.assistant(sgl.gen("answer", max_tokens=256, temperature=0.8))
s.join(forks)
answers = [fork["answer"] for fork in forks]
# Let the model pick the best answer
s += sgl.user(
f"Given these {n_attempts} answers to '{question}':\n"
+ "\n".join(f"{i+1}. {a}" for i, a in enumerate(answers))
+ "\n\nWhich is most accurate and complete? Provide the final answer:"
)
s += sgl.assistant(sgl.gen("best_answer", max_tokens=256, temperature=0.0))
# ── 7. Batch inference ────────────────────────────────────────────────────────
def batch_classify(texts: list[str]) -> list[dict[str, str]]:
"""
Classify a batch of texts for sentiment.
num_threads controls parallelism — SGLang handles scheduling.
"""
states = classify_sentiment.run_batch(
[{"text": t} for t in texts],
num_threads=min(32, len(texts)),
progress_bar=True,
)
return [{"text": t, "sentiment": s["label"]} for t, s in zip(texts, states)]
def batch_extract_contacts(texts: list[str]) -> list[ContactInfo | None]:
"""Batch contact extraction — processes all texts in parallel."""
states = extract_contact_json.run_batch(
[{"text": t} for t in texts],
num_threads=16,
)
results = []
for state in states:
try:
results.append(ContactInfo.model_validate_json(state["contact"]))
except Exception:
results.append(None)
return results
# ── 8. Vision / multimodal ────────────────────────────────────────────────────
@sgl.function
def describe_image(s, image_path: str, question: str = "Describe this image in detail."):
"""Multimodal image understanding — requires a vision model backend."""
s += sgl.user(sgl.image(image_path) + question)
s += sgl.assistant(sgl.gen("description", max_tokens=512))
@sgl.function
def visual_qa(s, image_url: str, question: str):
"""Visual question answering with constrained answer length."""
s += sgl.system("Answer questions about images concisely.")
s += sgl.user(sgl.image(image_url) + f"\nQuestion: {question}")
s += sgl.assistant(sgl.gen("answer", max_tokens=128))
# ── Demo ──────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
# Setup — use OpenAI for quick demo (no local GPU needed)
setup_openai_backend("gpt-4o-mini")
# Simple QA
state = simple_qa.run(question="What is RadixAttention?")
print(f"Answer: {state['answer'][:100]}...")
# Constrained classification
state = classify_sentiment.run(text="This library significantly improved our throughput!")
print(f"Sentiment: {state['label']}")
state = classify_priority.run(ticket="Payment processing is down for all enterprise customers")
print(f"Priority: {state['priority']}, Dept: {state['department']}")
# JSON extraction
contact = get_contact("Reach out to Sarah Chen, CTO at NeuralTech — [email protected]")
print(f"Contact: {contact.name}, {contact.email}, {contact.role}")
# Multi-perspective
state = multi_perspective_analysis.run(topic="Large language models in healthcare")
print(f"Technical view: {state['technical_view'][:80]}...")
print(f"Synthesis: {state['synthesis'][:100]}...")
# Batch
texts = [
"Outstanding product, exceeded all expectations!",
"Average quality, nothing special",
"Terrible experience, demanded a refund",
]
results = batch_classify(texts)
for r in results:
print(f" '{r['text'][:40]}...' → {r['sentiment']}")
Consider the vLLM alternative when you need OpenAI-compatible serving with LoRA hot-swapping, GPTQ/AWQ quantization support, and the largest community ecosystem. vLLM offers broader model and quantization-format coverage, while SGLang's @sgl.function programs, with fork/join parallelism and multiple constraint types (choices, regex, JSON schema), provide a higher-level abstraction designed for multi-step generation pipelines that would otherwise require several sequential API calls. Consider the TGI (Text Generation Inference) alternative when deploying in a Kubernetes environment with Hugging Face Hub integration and built-in model sharding. TGI excels at production Kubernetes deployments, while SGLang's RadixAttention KV-cache prefix sharing delivers 3-5x throughput improvements on workloads with shared prefixes (system prompts, few-shot examples, document contexts) without any code changes. The Claude Skills 360 bundle includes SGLang skill sets covering Runtime setup, constrained generation programs, fork/join parallelism, JSON schema extraction, batch inference, vision inputs, and server deployment configuration. Start with the free tier to try high-throughput LLM inference code generation.
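The shared-prefix speedup mentioned above can be estimated with a back-of-envelope model. This toy calculation is illustrative only and does not reflect SGLang internals:

```python
def prefix_reuse_fraction(shared_prefix_tokens: int,
                          unique_suffix_tokens: list[int]) -> float:
    """Fraction of prefill tokens a prefix cache can skip when every
    request in a batch shares the same prefix (e.g. a system prompt).

    The first request pays for the full prefix; each subsequent request
    reuses the cached prefix and only prefills its own unique suffix.
    """
    n = len(unique_suffix_tokens)
    total = sum(shared_prefix_tokens + u for u in unique_suffix_tokens)
    cached = shared_prefix_tokens * (n - 1)  # prefix computed once, reused n-1 times
    return cached / total

# A 1000-token system prompt shared by 8 requests with ~100 unique tokens
# each: roughly 80% of prefill work is avoidable, which is why prefix-heavy
# workloads see the largest gains.
```

Real speedups depend on decode length, batch scheduling, and cache eviction, so treat this only as intuition for why shared-prefix workloads benefit most.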