KServe is the Kubernetes-native model serving platform, formerly KFServing. A model is deployed with kubectl apply -f inference-service.yaml. The InferenceService CRD (apiVersion: serving.kserve.io/v1beta1, kind: InferenceService) declares the serving spec, and spec.predictor selects the model server: sklearn, tensorflow, pytorch, onnx, xgboost, huggingface, or a custom container. Models load from storage URIs such as storageUri: s3://bucket/model/, gs://, or pvc://. In Serverless mode, Knative autoscaling scales to zero when idle, and autoscaling.knative.dev/target: "10" sets the concurrent-request target; the serving.kserve.io/deploymentMode: RawDeployment annotation skips Knative for always-on deployments. Canary rollout: set spec.predictor.canaryTrafficPercent: 20 while updating the predictor spec — 20% of traffic goes to the new revision while the previous revision keeps serving the rest. Transformer: spec.transformer adds a custom pre/postprocessing container that proxies requests to the predictor. Explainer: spec.explainer with alibi: { type: AnchorTabular } serves Anchor explanations via the /explain endpoint. kubectl get inferenceservice shows status, URL, and traffic split. Inference: POST https://{service}.{namespace}.{domain}/v1/models/{name}:predict with { "instances": [...] }; the V2 (Open Inference Protocol) equivalent is POST /v2/models/{name}/infer with { "inputs": [{ "name": "input-0", "shape": [...], "datatype": "FP32", "data": [...] }] }. Python SDK: from kserve import KServeClient; KServeClient().create(isvc). ModelMesh co-locates many models on fewer pods for multi-model serving. Claude Code generates InferenceService YAMLs, transformer handlers, canary configs, Python SDK scripts, and TypeScript inference clients.
CLAUDE.md for KServe
## KServe Stack
- Version: kserve >= 0.12 (KFServing successor)
- CRD: InferenceService (serving.kserve.io/v1beta1) — spec.predictor + optional transformer/explainer
- Predictors: sklearn, tensorflow, pytorch, onnx, xgboost, huggingface, triton, custom
- Storage: storageUri: s3:// gs:// pvc:// (cluster StorageClass) + ServiceAccountName
- API: POST /v1/models/{name}:predict {"instances": [...]} or V2 /v2/models/{name}/infer
- Canary: spec.predictor.canaryTrafficPercent routes a percentage of traffic to the updated predictor revision for progressive rollout
- Python SDK: from kserve import KServeClient, V1beta1InferenceService
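The V1 predict route above can be exercised from Python. A minimal sketch — the base URL and model name are placeholders for your cluster's ingress and InferenceService, and the `requests.post` call is commented out because it needs a live cluster:

```python
# Build a V1 predict request (placeholders: BASE_URL, MODEL).
BASE_URL = "http://churn-classifier.ml-serving.example.com"
MODEL = "churn-classifier"

def v1_predict_url(base_url: str, model: str) -> str:
    """V1 protocol route: /v1/models/{name}:predict"""
    return f"{base_url}/v1/models/{model}:predict"

payload = {"instances": [[34.0, 420.0, 79.99, 2.0, 3.0]]}
url = v1_predict_url(BASE_URL, MODEL)
# import requests
# resp = requests.post(url, json=payload, timeout=10)  # needs a live cluster
print(url)
```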
## InferenceService YAML Specs
```yaml
# k8s/sklearn-isvc.yaml — scikit-learn model on S3
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-classifier
  namespace: ml-serving
  annotations:
    serving.kserve.io/deploymentMode: Serverless
    autoscaling.knative.dev/target: "20"
    autoscaling.knative.dev/minScale: "0"
    autoscaling.knative.dev/maxScale: "5"
spec:
  predictor:
    serviceAccountName: kserve-s3-sa
    sklearn:
      storageUri: s3://my-models/churn/v1/
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "2"
          memory: "2Gi"
---
# k8s/huggingface-isvc.yaml — HuggingFace transformer model (GPU)
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment-service
  namespace: ml-serving
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment  # Always-on, no Knative
spec:
  predictor:
    serviceAccountName: kserve-s3-sa
    minReplicas: 1
    maxReplicas: 4
    scaleTarget: 10
    huggingface:
      storageUri: s3://my-models/sentiment/v2/
      protocolVersion: v2
      args:
        - "--model_name=sentiment"
        - "--model_id=cardiffnlp/twitter-roberta-base-sentiment-latest"
      resources:
        requests:
          cpu: "2"
          memory: "8Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "4"
          memory: "16Gi"
          nvidia.com/gpu: "1"
---
# k8s/canary-isvc.yaml — canary rollout: 20% to the updated spec.
# KServe has no separate canary-model field: set canaryTrafficPercent and
# update the predictor spec; the previous revision keeps the remaining 80%.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: recommender
  namespace: ml-serving
spec:
  predictor:
    canaryTrafficPercent: 20
    pytorch:
      storageUri: s3://my-models/recommender/v2/  # New version under test
      protocolVersion: v2
```
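Promotion is just another patch against the same spec: once the canary looks healthy, raise canaryTrafficPercent to 100 so the new revision takes all traffic. A sketch of the JSON merge-patch bodies — field names follow the KServe v1beta1 spec, storage URIs are placeholders:

```python
# JSON merge-patch bodies for a staged canary rollout.
def canary_patch(new_storage_uri: str, percent: int) -> dict:
    """Route `percent` of traffic to the updated predictor spec."""
    return {
        "spec": {
            "predictor": {
                "canaryTrafficPercent": percent,
                "pytorch": {"storageUri": new_storage_uri},
            }
        }
    }

stage = canary_patch("s3://my-models/recommender/v2/", 20)     # 20% canary
promote = canary_patch("s3://my-models/recommender/v2/", 100)  # full rollout
```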
## Custom Transformer
```python
# transformer/transformer.py — KServe Transformer for pre/postprocessing
from __future__ import annotations

import logging

import kserve

logger = logging.getLogger(__name__)


class FeatureTransformer(kserve.Model):
    """
    KServe Transformer: preprocesses raw API requests before sending
    them to the predictor, and formats raw predictions for the client.
    """

    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        self.predictor_host = predictor_host
        self.ready = False
        self.feature_names: list[str] = [
            "age", "tenure_days", "monthly_spend",
            "support_tickets", "last_login_days",
        ]

    def load(self) -> bool:
        # Load preprocessing artifacts (scalers, encoders).
        # In production these would come from a model store.
        import joblib

        try:
            self.scaler = joblib.load("/mnt/models/scaler.pkl")
        except FileNotFoundError:
            self.scaler = None
            logger.warning("No scaler found — using raw features")
        self.ready = True
        return self.ready

    def preprocess(self, payload: dict, headers: dict[str, str] | None = None) -> dict:
        """Transform a raw request into the model input format."""
        instances = payload.get("instances", [])
        processed = []
        for instance in instances:
            if isinstance(instance, dict):
                features = [float(instance.get(f, 0.0)) for f in self.feature_names]
            else:
                features = [float(v) for v in instance]
            if self.scaler:
                features = self.scaler.transform([features])[0].tolist()
            processed.append(features)
        return {"instances": processed}

    def postprocess(self, response: dict, headers: dict[str, str] | None = None) -> dict:
        """Format raw model output for API consumers."""
        predictions = response.get("predictions", [])
        results = []
        for pred in predictions:
            if isinstance(pred, list):
                churn_prob = float(pred[1]) if len(pred) > 1 else float(pred[0])
            else:
                churn_prob = float(pred)
            results.append({
                "churn_probability": round(churn_prob, 4),
                "risk_tier": "HIGH" if churn_prob > 0.7 else "MEDIUM" if churn_prob > 0.3 else "LOW",
            })
        return {"predictions": results}


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", type=str, required=True)
    parser.add_argument("--predictor_host", type=str, required=True)
    args = parser.parse_args()

    model = FeatureTransformer(args.model_name, args.predictor_host)
    model.load()
    kserve.ModelServer().start([model])
```
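The postprocess step can be checked locally without a cluster or the kserve package. The sketch below mirrors the transformer's probability-to-tier mapping (same 0.3 / 0.7 thresholds) with plain dicts:

```python
# Local check of the transformer's postprocess logic (no kserve needed).
def to_risk_tier(churn_prob: float) -> str:
    # Same thresholds as FeatureTransformer.postprocess above.
    return "HIGH" if churn_prob > 0.7 else "MEDIUM" if churn_prob > 0.3 else "LOW"

def postprocess(response: dict) -> dict:
    results = []
    for pred in response.get("predictions", []):
        if isinstance(pred, list):
            prob = float(pred[1]) if len(pred) > 1 else float(pred[0])
        else:
            prob = float(pred)
        results.append({"churn_probability": round(prob, 4), "risk_tier": to_risk_tier(prob)})
    return {"predictions": results}

out = postprocess({"predictions": [[0.2, 0.8], 0.15]})
```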
## Python SDK
```python
# scripts/manage_isvc.py — KServe Python SDK for InferenceService management
from __future__ import annotations

from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
    constants,
)
from kubernetes import client


def create_sklearn_service(
    name: str,
    namespace: str,
    storage_uri: str,
    sa_name: str = "kserve-s3-sa",
) -> V1beta1InferenceService:
    """Create an sklearn InferenceService."""
    isvc = V1beta1InferenceService(
        api_version=constants.KSERVE_V1BETA1,
        kind=constants.KSERVE_KIND,
        metadata=client.V1ObjectMeta(
            name=name,
            namespace=namespace,
            annotations={
                "serving.kserve.io/deploymentMode": "Serverless",
            },
        ),
        spec=V1beta1InferenceServiceSpec(
            predictor=V1beta1PredictorSpec(
                service_account_name=sa_name,
                sklearn=V1beta1SKLearnSpec(
                    storage_uri=storage_uri,
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "500m", "memory": "512Mi"},
                        limits={"cpu": "2", "memory": "2Gi"},
                    ),
                ),
            )
        ),
    )
    kserve_client = KServeClient()
    kserve_client.create(isvc, namespace=namespace)
    print(f"Created InferenceService: {name} in {namespace}")
    return isvc


def wait_for_ready(name: str, namespace: str, timeout: int = 300) -> str:
    """Poll until the InferenceService is ready and return its URL."""
    kserve_client = KServeClient()
    kserve_client.wait_isvc_ready(name, namespace=namespace, timeout_seconds=timeout)
    isvc = kserve_client.get(name, namespace=namespace)
    url = isvc["status"]["url"]
    print(f"Ready: {url}")
    return url


def rollout_canary(
    name: str,
    namespace: str,
    new_storage: str,
    canary_pct: int = 20,
) -> None:
    """Patch the InferenceService: point the predictor at the new model and
    route canary_pct of traffic to it; the prior revision serves the rest."""
    kserve_client = KServeClient()
    patch = {
        "spec": {
            "predictor": {
                "canaryTrafficPercent": canary_pct,
                "sklearn": {"storageUri": new_storage},
            }
        }
    }
    kserve_client.patch(name, patch, namespace=namespace)
    print(f"Canary rollout: {canary_pct}% to {new_storage} for {name}")


if __name__ == "__main__":
    create_sklearn_service(
        name="churn-classifier",
        namespace="ml-serving",
        storage_uri="s3://my-models/churn/v1/",
    )
    wait_for_ready("churn-classifier", "ml-serving")
```
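Predictors running with `protocolVersion: v2` expect an Open Inference Protocol request body; it can be built generically from a batch of feature rows. A sketch that infers the tensor shape from a nested list — the input name and FP32 datatype are illustrative:

```python
# Build a V2 /infer request body from a batch of float feature rows.
def build_v2_request(rows: list[list[float]], name: str = "input-0") -> dict:
    """Flatten rows into a single FP32 tensor with shape [batch, features]."""
    if not rows:
        raise ValueError("rows must be non-empty")
    n_features = len(rows[0])
    if any(len(r) != n_features for r in rows):
        raise ValueError("all rows must have the same length")
    return {
        "inputs": [{
            "name": name,
            "shape": [len(rows), n_features],
            "datatype": "FP32",
            "data": [x for row in rows for x in row],  # row-major flatten
        }]
    }

body = build_v2_request([[34.0, 420.0, 79.99], [51.0, 30.0, 12.5]])
```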
## TypeScript Client
```typescript
// lib/kserve/client.ts — TypeScript client for KServe V1/V2 inference
const KSERVE_URL = process.env.KSERVE_URL ?? "http://churn-classifier.ml-serving.svc.cluster.local"

// V1 inference protocol
export async function predictV1<T>(
  modelName: string,
  instances: unknown[],
): Promise<{ predictions: T[] }> {
  const res = await fetch(`${KSERVE_URL}/v1/models/${modelName}:predict`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ instances }),
  })
  if (!res.ok) throw new Error(`KServe ${res.status}: ${await res.text()}`)
  return res.json()
}

// V2 inference protocol (Open Inference Protocol)
export async function predictV2(
  modelName: string,
  inputs: Array<{
    name: string
    shape: number[]
    datatype: "FP32" | "FP64" | "INT32" | "INT64" | "BYTES"
    data: number[] | string[]
  }>,
): Promise<{ outputs: Array<{ name: string; shape: number[]; datatype: string; data: unknown[] }> }> {
  const res = await fetch(`${KSERVE_URL}/v2/models/${modelName}/infer`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ inputs }),
  })
  if (!res.ok) throw new Error(`KServe V2 ${res.status}: ${await res.text()}`)
  return res.json()
}

export async function getModelMetadata(modelName: string) {
  const res = await fetch(`${KSERVE_URL}/v2/models/${modelName}`)
  return res.json()
}
```
Consider the Seldon Core alternative when you need a full MLOps platform: built-in A/B testing orchestration, multi-armed bandit routing, outlier detection, model explainability, and the Seldon Deploy UI give it a richer operator ecosystem for enterprise ML governance. KServe remains the CNCF-preferred standard, with a lighter footprint and tighter Knative integration for serverless autoscaling to zero. Consider the BentoML alternative when you want models packaged into self-contained Docker images without Kubernetes orchestration; its bentofile.yaml approach is simpler for teams without Kubernetes expertise. KServe is the right choice for teams already running workloads on Kubernetes who want declarative kubectl apply model deployments with built-in canary rollouts and scale-to-zero.

The Claude Skills 360 bundle includes KServe skill sets covering InferenceService YAMLs, transformer handlers, Python SDK management, and canary rollouts. Start with the free tier to try Kubernetes model serving generation.