Modal runs Python functions on serverless GPUs — import modal; app = modal.App("my-app") defines an app. image = modal.Image.debian_slim().pip_install(["torch", "transformers"]) builds the container image. @app.function(gpu="A100", memory=16384, timeout=300, image=image) decorates a function to run on a Modal GPU. modal run my_script.py triggers the app from your machine; the function itself executes in Modal's cloud. Deploy: modal deploy my_script.py makes functions persistent. Web endpoint: stacking @modal.web_endpoint(method="POST") under @app.function(gpu="T4") exposes an HTTPS URL. fn.remote(args) calls a deployed function from Python; other languages reach deployed web endpoints over plain HTTPS. fn.map(items) runs a function in parallel over a list — ideal for batch inference — and fn.for_each(items) is the fire-and-forget variant that discards results. modal.Volume.from_name("weights-vol", create_if_missing=True) persists files across runs for model caching. Scheduled: @app.function(schedule=modal.Cron("0 6 * * *")) runs on a schedule. Custom image: modal.Image.from_registry("nvcr.io/nvidia/pytorch:24.01-py3").pip_install(...). TypeScript calls Modal via REST: deployed web endpoints have plain HTTPS URLs. Claude Code generates Modal GPU inference APIs, batch processing pipelines, and fine-tuning jobs.
# CLAUDE.md for Modal
## Modal Stack
- Language: Python 3.11+, modal SDK >= 0.64
- App: app = modal.App("my-app")
- Image: image = modal.Image.debian_slim().pip_install(["torch==2.3.0", "transformers", "accelerate"])
- GPU function: @app.function(gpu="A10G", memory=20480, timeout=600, image=image)
- Run: modal run script.py — triggered from your machine, executes in Modal's cloud
- Deploy: modal deploy script.py — makes the app persistent with a stable URL
- Web endpoint: @modal.web_endpoint(method="POST") on a @app.function creates HTTPS URL
- Parallel map: results = list(fn.map(items, order_outputs=False)) — concurrent execution
- Volume: vol = modal.Volume.from_name("my-vol", create_if_missing=True); mount as /cache
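
The Volume-as-cache idea in the last bullet is, at its core, a check-compute-write pattern against a mounted directory. A minimal standalone sketch (no Modal required; the cache key, paths, and compute step here are hypothetical — in a real Modal function cache_dir would be the mounted Volume path such as /cache):

```python
# Sketch of the Volume cache pattern: compute once, reuse on later runs.
import json
from pathlib import Path


def cached_compute(cache_dir: Path, key: str, compute) -> dict:
    """Return the cached result for `key`, computing and persisting it on a miss."""
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())  # cache hit: skip the expensive step
    result = compute()                       # cache miss: pay the cost once
    path.write_text(json.dumps(result))      # persist for the next run
    return result
```

With a Modal Volume mounted at /cache, every container that starts after the first one takes the cache-hit branch, which is exactly why HF_HOME is pointed into the Volume in the examples below.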
## Modal GPU Inference Server

```python
# modal_app.py — PyTorch model served on Modal GPU
import modal

app = modal.App("llm-inference")
vol = modal.Volume.from_name("model-weights", create_if_missing=True)

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(["torch==2.3.0", "transformers>=4.41", "accelerate", "fastapi[standard]", "sentencepiece"])
    .env({"HF_HOME": "/cache/huggingface"})
)


@app.function(
    gpu="A10G",
    memory=20480,
    timeout=600,
    image=image,
    volumes={"/cache": vol},
)
def load_and_generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Run Llama on an A10G GPU — weights download once, then load from the Volume cache."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

    model_id = "meta-llama/Llama-3.2-3B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        do_sample=True,
    )
    messages = [{"role": "user", "content": prompt}]
    result = pipe(messages)
    vol.commit()  # persist freshly downloaded weights to the Volume
    return result[0]["generated_text"][-1]["content"]


@app.function(
    gpu="A10G",
    memory=20480,
    image=image,
    volumes={"/cache": vol},
)
def batch_embed(texts: list[str]) -> list[list[float]]:
    """Generate embeddings for a batch of texts."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    model_id = "BAAI/bge-large-en-v1.5"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).cuda()
    model.eval()

    BATCH = 64
    all_embeddings: list[list[float]] = []
    with torch.no_grad():
        for i in range(0, len(texts), BATCH):
            batch = texts[i : i + BATCH]
            enc = tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="pt").to("cuda")
            out = model(**enc)
            emb = out.last_hidden_state[:, 0, :]  # CLS token
            emb = torch.nn.functional.normalize(emb, dim=-1)
            all_embeddings.extend(emb.cpu().tolist())
    return all_embeddings
```
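
The fixed-size batching loop in batch_embed is a general pattern worth isolating; a minimal standalone sketch (no torch required):

```python
# Minimal sketch of the fixed-size batching loop used in batch_embed.
def chunks(items: list, size: int) -> list[list]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i : i + size] for i in range(0, len(items), size)]


def batched_process(items: list, size: int, process) -> list:
    """Apply `process` to each batch and flatten the per-batch results."""
    out: list = []
    for batch in chunks(items, size):
        out.extend(process(batch))
    return out
```

The last batch is allowed to be smaller than `size`, which is why the GPU loop above pads within each batch (padding=True) rather than assuming a fixed shape.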
## Modal FastAPI Web Endpoint

```python
# modal_web.py — FastAPI on Modal with GPU backing
import modal
from pydantic import BaseModel

app = modal.App("inference-api")
image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(["torch==2.3.0", "transformers>=4.41", "accelerate", "fastapi[standard]"])
)


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7


class GenerateResponse(BaseModel):
    text: str
    tokens_generated: int


@app.cls(gpu="A10G", memory=16384, image=image, concurrency_limit=5)
class InferenceServer:
    """Persistent class: the model is loaded once per container instance."""

    @modal.enter()  # runs once per container start
    def load_model(self):
        import torch
        from transformers import pipeline

        self.pipe = pipeline(
            "text-generation",
            model="meta-llama/Llama-3.2-3B-Instruct",
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )

    @modal.web_endpoint(method="POST")
    def generate(self, req: GenerateRequest) -> GenerateResponse:
        result = self.pipe(
            [{"role": "user", "content": req.prompt}],
            max_new_tokens=req.max_tokens,
            temperature=req.temperature,
            do_sample=True,
        )
        text = result[0]["generated_text"][-1]["content"]
        # Whitespace split approximates token count; use the tokenizer for exact numbers.
        return GenerateResponse(text=text, tokens_generated=len(text.split()))
```
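
@modal.enter amortizes model loading across requests: the loader runs once per container, then every request served by that container reuses the loaded model. The same load-once, serve-many shape in plain Python (the Server class and load_fn here are an illustrative stand-in, not Modal API):

```python
# Plain-Python sketch of the pattern @modal.enter provides: the expensive
# loader runs once per server instance, not once per request.
class Server:
    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._model = None

    def handle(self, request: str) -> str:
        if self._model is None:          # only the first request pays the load cost
            self._model = self._load_fn()
        return f"{self._model}:{request}"
```

This is why cold starts are slow (container boot + model load) while warm requests are fast, and why concurrency_limit bounds how many loaded-model containers can exist at once.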
## TypeScript Client for Modal Web Endpoint

```ts
// lib/modal/client.ts — call Modal web endpoints from TypeScript
const MODAL_GENERATE_URL = process.env.MODAL_GENERATE_URL! // from `modal deploy` output

export type GenerateInput = {
  prompt: string
  max_tokens?: number
  temperature?: number
}

export type GenerateOutput = {
  text: string
  tokens_generated: number
}

/** Call a Modal GPU function via its HTTPS web endpoint */
export async function modalGenerate(input: GenerateInput): Promise<GenerateOutput> {
  const res = await fetch(MODAL_GENERATE_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(input),
  })
  if (!res.ok) {
    throw new Error(`Modal inference error ${res.status}: ${await res.text()}`)
  }
  return res.json()
}

// Next.js API route
// import { NextResponse } from "next/server"
// export async function POST(req: Request) {
//   const { prompt } = await req.json()
//   const result = await modalGenerate({ prompt, max_tokens: 512 })
//   return NextResponse.json(result)
// }
```
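
Python callers can hit the same endpoint with any HTTP library. A transport-agnostic sketch (the post callable is injected so the logic is testable offline; the URL and payload shape mirror the TypeScript client and are assumptions about your deployment):

```python
# Python counterpart of the TypeScript client: POST JSON to the deployed endpoint.
# The HTTP transport is injected so the error handling is testable without a network.
import json


def modal_generate(post, url: str, prompt: str, max_tokens: int = 256) -> dict:
    """post(url, body_json) must return (status_code, response_text),
    e.g. a thin wrapper around requests.post."""
    status, text = post(url, json.dumps({"prompt": prompt, "max_tokens": max_tokens}))
    if status != 200:
        raise RuntimeError(f"Modal inference error {status}: {text}")
    return json.loads(text)
```

In production you would wrap requests.post (or httpx) as the transport and read the endpoint URL from the modal deploy output, just as the TypeScript client does with MODAL_GENERATE_URL.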
## Scheduled Batch Job

```python
# modal_batch.py — nightly batch processing with cron
import modal

app = modal.App("nightly-pipeline")
image = modal.Image.debian_slim().pip_install(["requests", "pandas"])


@app.function(schedule=modal.Cron("0 2 * * *"), memory=4096, timeout=3600, image=image)
def nightly_batch():
    """Runs every night at 2am UTC — generate reports, process queued jobs."""
    import requests

    # Fetch pending items from your API
    items = requests.get("https://myapp.com/api/jobs/pending").json()
    print(f"Processing {len(items)} items")
    # Parallel fan-out with .map — runs many process_item containers concurrently
    results = list(process_item.map(items, order_outputs=False))
    print(f"Completed {len(results)} items")


@app.function(cpu=2, memory=2048, image=image)
def process_item(item: dict) -> dict:
    """Process a single item — called in parallel by nightly_batch."""
    # Do work...
    return {"id": item["id"], "status": "done"}
```
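
The fan-out/fan-in that fn.map performs across containers has the same shape as Python's own concurrent.futures; a local sketch of what nightly_batch does (process_item here is a stand-in for the Modal function, running in threads instead of containers):

```python
# Local sketch of the .map fan-out: process items concurrently, gather results.
from concurrent.futures import ThreadPoolExecutor


def process_item(item: dict) -> dict:
    """Stand-in for the Modal function; in Modal each call runs in its own container."""
    return {"id": item["id"], "status": "done"}


def run_batch(items: list[dict]) -> list[dict]:
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(process_item, items))
```

The difference in Modal is the scale and billing: each mapped call can get its own CPU/GPU container, and with order_outputs=False results stream back as they complete rather than in input order.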
Prefer Replicate when you want to run community open-source models (image generation, video, audio, specialized NLP) without writing Python infrastructure code, or when you want versioned model deployments through Replicate's web UI. Replicate has the larger model catalog, while Modal gives you full Python control over the GPU environment, custom Docker images, persistent volumes, and faster iteration on custom models; see the Replicate guide. Prefer Vast.ai or Lambda Labs when you need dedicated long-running GPU instances for training runs rather than serverless per-function billing: those platforms rent persistent GPU VMs, while Modal's serverless model is ideal for inference (pay per second, scale to zero, auto-scale concurrency); see the GPU cloud comparison guide. The Claude Skills 360 bundle includes Modal skill sets covering GPU inference, batch processing, and FastAPI web endpoints. Start with the free tier to try serverless GPU generation.