Text Generation Inference (TGI) serves LLMs in production with continuous batching. A single Docker command starts a server: `docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest --model-id meta-llama/Llama-3.1-8B-Instruct`. Shard across GPUs with `--num-shard 4` (tensor parallelism) and quantize with `--quantize gptq`, `awq`, `eetq`, or `bitsandbytes`; flash attention is used automatically on supported GPUs, and `--dtype float16` controls precision.

The native HTTP API accepts `curl -X POST http://localhost:8080/generate -d '{"inputs":"Hello","parameters":{"max_new_tokens":50}}'`, and `http://localhost:8080/v1/chat/completions` is a drop-in OpenAI-compatible endpoint. From Python: `from huggingface_hub import InferenceClient; client = InferenceClient("http://localhost:8080"); response = client.text_generation("What is TGI?", max_new_tokens=256)`. Chat goes through `client.chat_completion(messages=[{"role":"user","content":"Hello"}], max_tokens=256)`, and streaming through `for token in client.text_generation(prompt, max_new_tokens=512, stream=True): print(token, end="")`. The OpenAI SDK also works directly: `OpenAI(base_url="http://localhost:8080/v1", api_key="na")`; async use goes through `AsyncInferenceClient`.

Load LoRA adapters with `--lora-adapters lora_name=path/to/adapter`, constrain output with `client.text_generation(prompt, grammar={"type":"json","value":User.model_json_schema()})`, and enable speculative decoding with `--speculate N`. For operations, `GET /metrics` exposes Prometheus metrics, `GET /health` is the health check, and `GET /info` returns the model config; on Kubernetes, autoscale with an HPA on `tgi_request_queue_size_sum`. Claude Code generates TGI Docker configs, Python client code, streaming handlers, grammar-constrained generation, and Kubernetes deployment manifests.
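The launch and smoke-test commands above can be put together as one session. This is a deployment sketch, not a tested recipe: gated models such as Llama 3.1 additionally require a Hugging Face token, which is assumed here to be in the `HF_TOKEN` environment variable.

```shell
# Launch TGI on a single GPU. --shm-size 1g is needed for NCCL shared memory;
# HF_TOKEN is only required for gated models (older images read
# HUGGING_FACE_HUB_TOKEN instead).
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct

# In another terminal, once the model has loaded:

# Native TGI endpoint
curl -X POST http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 50}}'

# OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "tgi", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'
```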
# CLAUDE.md for TGI
## TGI Stack
- Version: text-generation-inference >= 2.4
- Run: docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest --model-id <id>
- Multi-GPU: --num-shard N (tensor parallel)
- Quantize: --quantize gptq | awq | eetq | bitsandbytes
- Python: InferenceClient(base_url) → text_generation | chat_completion
- OpenAI: OpenAI(base_url="http://localhost:8080/v1", api_key="na")
- Stream: stream=True → generator of token strings
- Grammar: grammar={"type":"json","value": json_schema_dict}
- LoRA: --lora-adapters name=path → set adapter_id in requests
- Metrics: GET /metrics (Prometheus) | GET /health | GET /info
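The cheat sheet above maps directly onto TGI's raw HTTP API. As a standard-library-only sketch, here is how a grammar-constrained `/generate` request is assembled without the `huggingface_hub` client; the contact schema and the `build_generate_request` helper are illustrative inventions, and the request will only succeed against a running server.

```python
import json
import urllib.request

# Hand-written JSON Schema, standing in for what a pydantic model's
# model_json_schema() would emit.
CONTACT_SCHEMA = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "email": {"type": "string"}},
    "required": ["name"],
}


def build_generate_request(
    prompt: str, base_url: str = "http://localhost:8080"
) -> urllib.request.Request:
    """Build a grammar-constrained request for TGI's native /generate endpoint."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 128,
            # TGI's guidance feature: constrain decoding to this JSON Schema
            "grammar": {"type": "json", "value": CONTACT_SCHEMA},
        },
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Against a live server:
# with urllib.request.urlopen(build_generate_request("Extract: Alice <[email protected]>")) as resp:
#     print(json.loads(resp.read())["generated_text"])
```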
## TGI Client and Deployment
```python
# serving/tgi_client.py — production LLM serving with HuggingFace TGI
from __future__ import annotations

import asyncio
import json
import os
from typing import Generator, Optional

from huggingface_hub import AsyncInferenceClient, InferenceClient
from pydantic import BaseModel

TGI_BASE_URL = os.environ.get("TGI_BASE_URL", "http://localhost:8080")


# ── 1. Synchronous client ─────────────────────────────────────────────────────
def get_client(base_url: str = TGI_BASE_URL) -> InferenceClient:
    """Build a TGI sync client."""
    return InferenceClient(model=base_url)


def get_async_client(base_url: str = TGI_BASE_URL) -> AsyncInferenceClient:
    """Build a TGI async client."""
    return AsyncInferenceClient(model=base_url)


# ── 2. Basic text generation ──────────────────────────────────────────────────
def generate(
    prompt: str,
    max_tokens: int = 512,
    temperature: float = 0.7,
    top_p: float = 0.9,
    repetition: float = 1.05,
    client: Optional[InferenceClient] = None,
) -> str:
    """Single-call text generation."""
    c = client or get_client()
    return c.text_generation(
        prompt,
        max_new_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repetition,
        do_sample=temperature > 0,
        return_full_text=False,
    )


def stream_generate(
    prompt: str,
    max_tokens: int = 512,
    temperature: float = 0.7,
    client: Optional[InferenceClient] = None,
) -> Generator[str, None, None]:
    """Streaming token-by-token generation."""
    c = client or get_client()
    for token in c.text_generation(
        prompt,
        max_new_tokens=max_tokens,
        temperature=temperature,
        stream=True,
        return_full_text=False,
    ):
        yield token


def print_stream(prompt: str, **kwargs):
    """Print streaming output in real time."""
    print(f"Prompt: {prompt}\nResponse: ", end="", flush=True)
    for token in stream_generate(prompt, **kwargs):
        print(token, end="", flush=True)
    print()


# ── 3. Chat completions (OpenAI-compatible) ───────────────────────────────────
def chat(
    messages: list[dict],
    max_tokens: int = 512,
    temperature: float = 0.7,
    system: Optional[str] = None,
    client: Optional[InferenceClient] = None,
) -> str:
    """Chat completion using TGI's OpenAI-compatible endpoint."""
    c = client or get_client()
    if system:
        messages = [{"role": "system", "content": system}] + messages
    response = c.chat_completion(
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature,
    )
    return response.choices[0].message.content


def stream_chat(
    messages: list[dict],
    max_tokens: int = 512,
    temperature: float = 0.7,
    client: Optional[InferenceClient] = None,
) -> Generator[str, None, None]:
    """Streaming chat completion."""
    c = client or get_client()
    for chunk in c.chat_completion(
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature,
        stream=True,
    ):
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta


# ── 4. OpenAI SDK with TGI backend ───────────────────────────────────────────
def get_openai_client(base_url: str = TGI_BASE_URL) -> "openai.OpenAI":
    """
    Use the standard OpenAI SDK against TGI's OpenAI-compatible endpoint.
    Drop-in replacement for any code using openai.OpenAI().
    """
    from openai import OpenAI

    return OpenAI(
        base_url=f"{base_url}/v1",
        api_key="not-needed",  # TGI doesn't require auth by default
    )


def openai_chat(
    messages: list[dict],
    max_tokens: int = 512,
    temperature: float = 0.7,
    base_url: str = TGI_BASE_URL,
) -> str:
    """Chat using the OpenAI SDK pointing at TGI."""
    client = get_openai_client(base_url)
    response = client.chat.completions.create(
        model="tgi",  # Model name is ignored by TGI; any string works
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature,
    )
    return response.choices[0].message.content


# ── 5. Grammar-constrained generation ────────────────────────────────────────
class ContactInfo(BaseModel):
    name: str
    email: Optional[str] = None
    company: Optional[str] = None


class TicketClassification(BaseModel):
    priority: str  # critical | high | medium | low
    category: str
    summary: str


def extract_with_grammar(
    prompt: str,
    schema: type[BaseModel],
    max_tokens: int = 256,
    client: Optional[InferenceClient] = None,
) -> BaseModel:
    """
    Grammar-constrained JSON extraction — TGI guarantees valid JSON output.
    Uses FSM-based constrained decoding (same approach as Outlines).
    """
    c = client or get_client()
    raw = c.text_generation(
        prompt,
        max_new_tokens=max_tokens,
        grammar={"type": "json", "value": schema.model_json_schema()},
        temperature=0.1,
        return_full_text=False,
    )
    return schema.model_validate_json(raw)


def classify_ticket(ticket_text: str) -> TicketClassification:
    """Classify a support ticket with guaranteed valid JSON output."""
    prompt = (
        "Classify this support ticket as JSON:\n\n"
        f"{ticket_text}\n\n"
        "Classification:"
    )
    return extract_with_grammar(prompt, TicketClassification)


def extract_contact(text: str) -> ContactInfo:
    """Extract contact info with grammar-constrained JSON."""
    prompt = f"Extract contact info as JSON:\n{text}\n\nContact:"
    return extract_with_grammar(prompt, ContactInfo)


# ── 6. LoRA adapter selection ─────────────────────────────────────────────────
def generate_with_adapter(
    prompt: str,
    adapter_id: str,  # Name defined in the --lora-adapters flag
    max_tokens: int = 256,
    client: Optional[InferenceClient] = None,
) -> str:
    """Use a specific LoRA adapter for this request."""
    c = client or get_client()
    return c.text_generation(
        prompt,
        max_new_tokens=max_tokens,
        adapter_id=adapter_id,
        return_full_text=False,
    )


# ── 7. Async batch inference ──────────────────────────────────────────────────
async def abatch_generate(
    prompts: list[str],
    max_tokens: int = 256,
    temperature: float = 0.7,
    concurrency: int = 20,
) -> list[str]:
    """
    Async batch generation — concurrency caps parallel in-flight requests.
    TGI's continuous batching handles optimal GPU scheduling.
    """
    client = get_async_client()
    semaphore = asyncio.Semaphore(concurrency)

    async def single(prompt: str) -> str:
        async with semaphore:
            return await client.text_generation(
                prompt,
                max_new_tokens=max_tokens,
                temperature=temperature,
                return_full_text=False,
            )

    results = await asyncio.gather(*[single(p) for p in prompts], return_exceptions=True)
    return [r if isinstance(r, str) else f"Error: {r}" for r in results]


async def abatch_chat(
    conversations: list[list[dict]],
    max_tokens: int = 256,
    concurrency: int = 20,
) -> list[str]:
    """Async batch chat completion."""
    client = get_async_client()
    semaphore = asyncio.Semaphore(concurrency)

    async def single(messages: list[dict]) -> str:
        async with semaphore:
            response = await client.chat_completion(
                messages=messages,
                max_tokens=max_tokens,
            )
            return response.choices[0].message.content

    results = await asyncio.gather(*[single(c) for c in conversations], return_exceptions=True)
    return [r if isinstance(r, str) else f"Error: {r}" for r in results]


# ── 8. Health and monitoring ──────────────────────────────────────────────────
def check_health(base_url: str = TGI_BASE_URL) -> dict:
    """Check TGI server health and model info."""
    import urllib.request

    def get(path: str) -> dict:
        with urllib.request.urlopen(f"{base_url}{path}", timeout=5) as resp:
            return json.loads(resp.read())

    try:
        info = get("/info")
        return {
            "status": "healthy",
            "model": info.get("model_id"),
            "max_tokens": info.get("max_total_tokens"),
            "dtype": info.get("dtype"),
        }
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}


# ── Demo ──────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    # Check the server (start TGI with docker first)
    health = check_health()
    print(f"Server: {health}")
    if health["status"] != "healthy":
        print("Start TGI first:")
        print("  docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference "
              "--model-id TinyLlama/TinyLlama-1.1B-Chat-v1.0")
        raise SystemExit(0)

    # Basic generation
    response = generate("Explain continuous batching in 2 sentences.")
    print(f"Response: {response}")

    # Streaming chat
    messages = [{"role": "user", "content": "What makes TGI fast for production?"}]
    print("\nStreaming chat:")
    for token in stream_chat(messages, max_tokens=200):
        print(token, end="", flush=True)
    print()

    # Grammar-constrained extraction
    contact = extract_contact("Contact Alice Brown, VP Engineering at DataCo — [email protected]")
    print(f"\nExtracted: {contact.name}, {contact.email}")
    ticket = classify_ticket("Production login is broken. All users locked out since 2pm.")
    print(f"Ticket: priority={ticket.priority}, summary={ticket.summary[:60]}")

    # Async batch
    prompts = [f"Summarize topic {i}: {topic}" for i, topic in enumerate(
        ["attention mechanisms", "tokenization", "RLHF training"]
    )]
    results = asyncio.run(abatch_generate(prompts, max_tokens=100))
    for p, r in zip(prompts, results):
        print(f"\nQ: {p[:50]}...\nA: {r[:80]}...")
```
Choose the vLLM alternative when you need the fastest-moving CUDA kernels, more mature speculative decoding, and the widest quantization format coverage, including GGUF; vLLM leads raw-throughput benchmarks and format support. TGI's native Hugging Face Hub integration (a single `--model-id` flag downloads and serves any Hub model), built-in grammar-constrained decoding, and official Hugging Face support make it the simplest path to production for teams already in the Hugging Face ecosystem. Choose the SGLang alternative when you need complex multi-step generation programs with fork/join parallelism and RadixAttention prefix caching; SGLang provides higher-level abstractions for structured programs, while TGI's continuous batching, automatic Prometheus metrics, and Kubernetes-native deployment patterns make it the better choice for serving a single model at high QPS in production infrastructure.

The Claude Skills 360 bundle includes TGI skill sets covering Docker server setup, Python `InferenceClient` usage, OpenAI-compatible integration, streaming, grammar-constrained extraction, LoRA adapter switching, async batch inference, and health monitoring. Start with the free tier to try production LLM server code generation.