LMDeploy deploys LLMs with the TurboMind engine for high-throughput inference. Install with `pip install lmdeploy`. Chat from the CLI: `lmdeploy chat meta-llama/Llama-3.1-8B-Instruct`. Convert and quantize to W4A16 AWQ: `lmdeploy lite auto_awq meta-llama/Llama-3.1-8B-Instruct --work-dir ./llama3-awq-4bit`. Serve an OpenAI-compatible API at http://localhost:23333: `lmdeploy serve api_server ./llama3-awq-4bit --server-port 23333 --tp 2`. In Python, build a pipeline with `from lmdeploy import pipeline, TurbomindEngineConfig` and `pipe = pipeline("meta-llama/Llama-3.1-8B-Instruct", backend_config=TurbomindEngineConfig(max_batch_size=64, cache_max_entry_count=0.8))`, then call `response = pipe("What is LMDeploy?")` for a single prompt, `responses = pipe(["Question 1", "Question 2"])` for a batch, or `pipe([{"role": "user", "content": "Hello"}])` for chat messages. Control sampling with `from lmdeploy import GenerationConfig`: `gen_config = GenerationConfig(max_new_tokens=512, temperature=0.7, top_p=0.9, repetition_penalty=1.05)`, then `response = pipe(prompt, gen_config=gen_config)`. Use `TurbomindEngineConfig(tp=4)` for 4-GPU tensor-parallel inference. Quantize the KV cache to INT8 with `lmdeploy lite kv_qparams ./model --work-dir ./model-kv-int8`, or apply W8A8 with `lmdeploy lite smooth_quant ./model --work-dir ./model-w8a8`. For vision models: `from lmdeploy.vl import load_image`, `pipe = pipeline("OpenGVLab/InternVL2-8B")`, `image = load_image("image.jpg")`, `response = pipe(("Describe this image", image))`. To query a running server: `from lmdeploy.serve.openai.api_client import APIClient`, `client = APIClient("http://localhost:23333")`, `model_name = client.available_models[0]`, then iterate `for chunk in client.chat_completions_v1(...)`. Claude Code generates LMDeploy pipelines, quantization scripts, API server configs, and vision model inference code.
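The `lmdeploy lite` commands above can also be driven from a Python script, which is handy when quantization is one step of a larger deployment job. A minimal sketch: the helper names are illustrative, and the argv mirrors the `lmdeploy lite auto_awq <model> --work-dir <out>` command shown above.

```python
# Sketch: invoke the LMDeploy AWQ quantization CLI from Python.
# Helper names are illustrative; flags mirror the CLI usage above.
import subprocess


def awq_quantize_cmd(model_path: str, work_dir: str) -> list[str]:
    """Build the argv for one-command W4A16 AWQ quantization."""
    return ["lmdeploy", "lite", "auto_awq", model_path, "--work-dir", work_dir]


def awq_quantize(model_path: str, work_dir: str) -> None:
    """Run the CLI; raises CalledProcessError if quantization fails."""
    subprocess.run(awq_quantize_cmd(model_path, work_dir), check=True)
```

Usage: `awq_quantize("meta-llama/Llama-3.1-8B-Instruct", "./llama3-awq-4bit")` produces an AWQ work dir that `pipeline()` or `api_server` can load directly.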
# CLAUDE.md for LMDeploy
## LMDeploy Stack
- Version: lmdeploy >= 0.6
- Engine: TurbomindEngineConfig(max_batch_size, cache_max_entry_count, tp)
- Pipeline: pipeline(model_path, backend_config=TurbomindEngineConfig(...))
- Call: pipe(prompt) | pipe([prompt1, prompt2]) | pipe([{"role":"user","content":"..."}])
- GenConfig: GenerationConfig(max_new_tokens, temperature, top_p, repetition_penalty)
- Quantize: lmdeploy lite auto_awq model_path --work-dir output (W4A16)
- Serve: lmdeploy serve api_server model_path --server-port 23333 --tp N
- Vision: pipeline("InternVL2-8B") + pipe(("text prompt", load_image(path)))
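The Serve entry above pairs with a client-side flow. A minimal sketch, assuming lmdeploy is installed and an `api_server` is already running at `base_url`; the helper names are illustrative, and the streamed chunks are assumed to follow the OpenAI chat-completion delta shape:

```python
# Sketch of querying a running `lmdeploy serve api_server` endpoint.
# Assumes lmdeploy is installed and the server is reachable at base_url.


def build_messages(user: str, system: str = "") -> list[dict]:
    """Assemble an OpenAI-style message list."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user})
    return messages


def stream_chat(base_url: str, prompt: str) -> str:
    """Stream a chat completion from the server and return the full text."""
    # Deferred import: requires lmdeploy plus a live server.
    from lmdeploy.serve.openai.api_client import APIClient

    client = APIClient(base_url)
    model_name = client.available_models[0]
    text = ""
    for chunk in client.chat_completions_v1(
        model=model_name,
        messages=build_messages(prompt),
        stream=True,
    ):
        delta = chunk["choices"][0].get("delta", {})
        text += delta.get("content") or ""
    return text
```

Usage against a local server: `stream_chat("http://localhost:23333", "What is LMDeploy?")`.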
## LMDeploy Inference Pipeline
# inference/lmdeploy_pipeline.py — efficient LLM deployment with LMDeploy TurboMind
from __future__ import annotations
import time
from typing import Generator
from lmdeploy import (
pipeline,
GenerationConfig,
TurbomindEngineConfig,
)
from lmdeploy.messages import Response
# ── 1. Pipeline setup ─────────────────────────────────────────────────────────
def build_pipeline(
model_path: str = "meta-llama/Llama-3.1-8B-Instruct",
max_batch_size: int = 128,
tp: int = 1,
cache_ratio: float = 0.8,
    quant_policy: int = 0,  # KV-cache quantization: 0=fp16, 4=int4, 8=int8
) -> "pipeline":
    """
    Build a TurboMind pipeline with PagedAttention.
    cache_ratio: fraction of free GPU memory reserved for the KV cache.
    quant_policy: KV-cache quantization policy (0=fp16, 4=int4, 8=int8);
    4-bit AWQ weight quantization is selected via model_format="awq", not here.
    """
    backend_config = TurbomindEngineConfig(
        max_batch_size=max_batch_size,
        cache_max_entry_count=cache_ratio,
        tp=tp,
        quant_policy=quant_policy,
        session_len=4096,  # Max sequence length per session
)
pipe = pipeline(model_path, backend_config=backend_config)
print(f"Pipeline ready: {model_path} | tp={tp} | batch={max_batch_size}")
return pipe
def build_pipeline_quantized(
awq_model_path: str = "./llama3-awq-4bit",
tp: int = 1,
) -> "pipeline":
"""Load a pre-quantized AWQ model for 2-4x memory savings."""
backend_config = TurbomindEngineConfig(
max_batch_size=256,
cache_max_entry_count=0.85,
tp=tp,
quant_policy=4, # AWQ 4-bit weights
)
return pipeline(awq_model_path, backend_config=backend_config)
# ── 2. Generation config ──────────────────────────────────────────────────────
def make_gen_config(
max_tokens: int = 512,
temperature: float = 0.7,
top_p: float = 0.9,
top_k: int = 50,
repetition: float = 1.05,
) -> GenerationConfig:
    return GenerationConfig(
        max_new_tokens=max_tokens,
        do_sample=True,  # needed for temperature/top_p/top_k to apply in lmdeploy >= 0.6
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        repetition_penalty=repetition,
    )
GREEDY_CONFIG = GenerationConfig(max_new_tokens=512)  # do_sample defaults to False → greedy
CREATIVE_CONFIG = GenerationConfig(max_new_tokens=1024, do_sample=True, temperature=0.9, top_p=0.95)
PRECISE_CONFIG = GenerationConfig(max_new_tokens=256, do_sample=True, temperature=0.1, top_p=0.9)
# ── 3. Single and batch inference ─────────────────────────────────────────────
def chat(pipe, prompt: str, gen_config: GenerationConfig = None) -> str:
"""Single-turn text generation."""
response: Response = pipe(
prompt,
gen_config=gen_config or GREEDY_CONFIG,
)
return response.text
def chat_turns(pipe, messages: list[dict], gen_config: GenerationConfig = None) -> str:
"""
Multi-turn chat with message history.
messages: [{"role": "user"|"assistant"|"system", "content": "..."}]
"""
response: Response = pipe(
messages,
gen_config=gen_config or GREEDY_CONFIG,
)
return response.text
def batch_inference(
pipe,
prompts: list[str],
gen_config: GenerationConfig = None,
) -> list[str]:
"""
Batch inference — LMDeploy schedules all prompts concurrently.
Throughput scales with batch size up to max_batch_size.
"""
responses: list[Response] = pipe(
prompts,
gen_config=gen_config or GREEDY_CONFIG,
)
return [r.text for r in responses]
def batch_chat(
pipe,
conversations: list[list[dict]],
gen_config: GenerationConfig = None,
) -> list[str]:
"""Batch multi-turn chat — each item is a conversation history."""
responses = pipe(conversations, gen_config=gen_config or GREEDY_CONFIG)
return [r.text for r in responses]
# ── 4. Streaming generation ───────────────────────────────────────────────────
def stream_response(
pipe,
prompt: str,
gen_config: GenerationConfig = None,
) -> Generator[str, None, None]:
"""Stream tokens as they're generated."""
for response in pipe.stream_infer(
prompt,
gen_config=gen_config or make_gen_config(max_tokens=512),
):
if response.text:
yield response.text
def print_stream(pipe, prompt: str):
"""Demo: print streaming output."""
print(f"Q: {prompt}\nA: ", end="", flush=True)
for chunk in stream_response(pipe, prompt):
print(chunk, end="", flush=True)
print()
# ── 5. Document processing pipeline ──────────────────────────────────────────
def summarize_documents(
pipe,
documents: list[str],
max_words: int = 50,
) -> list[str]:
"""Batch document summarization — efficient with continuous batching."""
prompts = [
f"Summarize the following document in at most {max_words} words:\n\n{doc}\n\nSummary:"
for doc in documents
]
return batch_inference(pipe, prompts, PRECISE_CONFIG)
def classify_texts(
pipe,
texts: list[str],
labels: list[str],
) -> list[str]:
"""
Batch text classification.
Returns one label per text — post-process by matching label names in output.
"""
label_str = ", ".join(f'"{l}"' for l in labels)
prompts = [
f"Classify the following text into exactly one of these categories: {label_str}.\n"
f"Text: {text}\n"
f"Category (respond with just the category name):"
for text in texts
]
raw_outputs = batch_inference(
pipe,
prompts,
        GenerationConfig(max_new_tokens=10),  # greedy by default → deterministic labels
)
# Match output to closest label
results = []
for output in raw_outputs:
output_lower = output.strip().lower()
matched = next(
(l for l in labels if l.lower() in output_lower),
labels[0], # Default to first label if no match
)
results.append(matched)
return results
def extract_structured(
pipe,
texts: list[str],
schema: str,
) -> list[str]:
"""Extract structured info as JSON strings."""
prompts = [
f"Extract information matching this JSON schema:\n{schema}\n\n"
f"Text: {text}\n\nJSON:"
for text in texts
]
    # Greedy decoding (do_sample defaults to False) keeps extraction deterministic.
    return batch_inference(pipe, prompts, GenerationConfig(max_new_tokens=256))
# ── 6. Vision model pipeline ──────────────────────────────────────────────────
def build_vision_pipeline(
model_path: str = "OpenGVLab/InternVL2-8B",
tp: int = 1,
) -> "pipeline":
"""Build a vision-language model pipeline."""
backend_config = TurbomindEngineConfig(
max_batch_size=32,
cache_max_entry_count=0.75,
tp=tp,
)
return pipeline(model_path, backend_config=backend_config)
def describe_image(vision_pipe, image_path: str) -> str:
"""Generate image description."""
from lmdeploy.vl import load_image
image = load_image(image_path)
response = vision_pipe(
("Describe this image in detail, noting key objects, colors, and any text.", image)
)
return response.text
def visual_qa(vision_pipe, image_path: str, question: str) -> str:
"""Answer a question about an image."""
from lmdeploy.vl import load_image
image = load_image(image_path)
response = vision_pipe((question, image))
return response.text
def batch_image_captioning(
vision_pipe,
image_paths: list[str],
) -> list[str]:
"""Generate captions for multiple images in parallel."""
from lmdeploy.vl import load_image
inputs = [
("Generate a concise one-sentence caption for this image.", load_image(p))
for p in image_paths
]
responses = vision_pipe(inputs)
return [r.text for r in responses]
# ── 7. Benchmark ──────────────────────────────────────────────────────────────
def benchmark_throughput(
pipe,
prompt: str = "Explain the attention mechanism in transformers.",
    batch_sizes: list[int] | None = None,  # avoid a mutable default argument
    n_iters: int = 3,
):
    """Benchmark approximate tokens/sec (whitespace-word proxy) at different batch sizes."""
    batch_sizes = batch_sizes or [1, 8, 32, 64]
print("\n=== LMDeploy Throughput Benchmark ===")
print(f"{'Batch':>8} {'Tokens/s':>12} {'Latency(ms)':>14}")
print("-" * 38)
    gen_config = GenerationConfig(max_new_tokens=64)  # greedy by default
for batch_size in batch_sizes:
prompts = [prompt] * batch_size
latencies = []
for _ in range(n_iters):
t0 = time.perf_counter()
responses = pipe(prompts, gen_config=gen_config)
elapsed = time.perf_counter() - t0
latencies.append(elapsed)
avg_latency = sum(latencies) / len(latencies)
        # sum() already spans every response in the batch — don't multiply by batch_size,
        # which double-counts. Whitespace words are a rough proxy for true token counts.
        total_tokens = sum(len(r.text.split()) for r in responses)
        tps = total_tokens / avg_latency
print(f"{batch_size:>8} {tps:>12.0f} {avg_latency*1000:>14.1f}")
# ── Demo ──────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
# Build pipeline (adjust model_path for your environment)
pipe = build_pipeline(
model_path="meta-llama/Llama-3.1-8B-Instruct",
max_batch_size=64,
tp=1,
)
# Single chat
answer = chat(pipe, "What is PagedAttention and why does it improve LLM throughput?")
print(f"Answer: {answer[:150]}...")
# Batch inference
questions = [
"What is quantization in neural networks?",
"Explain continuous batching for LLM serving.",
"What is the difference between AWQ and GPTQ quantization?",
]
answers = batch_inference(pipe, questions)
for q, a in zip(questions, answers):
print(f"\nQ: {q}\nA: {a[:100]}...")
# Document summarization
docs = [
"LMDeploy is an inference toolkit for compressing and deploying large language models. "
"It provides TurboMind engine with PagedAttention for high throughput and low latency. "
"Features include AWQ quantization, tensor parallelism, and vision model support.",
]
summaries = summarize_documents(pipe, docs, max_words=20)
print(f"\nSummary: {summaries[0]}")
# Benchmark
benchmark_throughput(pipe, batch_sizes=[1, 4, 16])
Choose vLLM as the alternative when you need the broadest model-format support (GGUF, GPTQ, Marlin kernels), the largest open-source community, and the most active LoRA hot-swapping — vLLM covers the widest compatibility matrix, while LMDeploy's TurboMind engine with W4A16 AWQ and KV-cache INT8 quantization offers better memory efficiency for Qwen, InternLM, and LLaMA architectures deployed on limited VRAM. Choose SGLang when you need complex multi-step generation programs with fork/join parallelism and RadixAttention for shared-prefix workloads — SGLang provides a higher-level programming model, while LMDeploy's one-command `lmdeploy lite auto_awq` quantization pipeline and native vision-language support (InternVL2) make it the faster path to production for teams optimizing inference cost with quantized models. The Claude Skills 360 bundle includes LMDeploy skill sets covering TurbomindEngineConfig setup, AWQ quantization, batch pipeline inference, streaming, vision model support, API server deployment, and throughput benchmarking. Start with the free tier to try efficient LLM deployment code generation.