Kubernetes YAML is notoriously verbose and easy to get wrong — a missing label selector, a mismatched containerPort, or an incorrect readinessProbe path silently breaks a deployment. Claude Code writes correct Kubernetes manifests, debugs failing workloads from kubectl output, and generates Helm charts with proper templating.
This guide covers practical Kubernetes workflows with Claude Code: writing manifests, debugging pods, Helm chart creation, resource management, and cluster administration patterns.
Setting Up Claude Code for Kubernetes Projects
Cluster context matters for Kubernetes work. Put your setup in CLAUDE.md:
# Kubernetes Context
## Cluster
- EKS 1.29 on AWS, 3 node groups (general, compute, memory)
- Namespace per service (not per environment)
- Environments: dev/staging/prod clusters (not namespaces)
## Stack
- Helm 3.13 for all deployments — no raw kubectl apply in CI
- Container registry: ECR
- Secrets: External Secrets Operator (not raw K8s secrets)
- Ingress: nginx-ingress-controller
- Monitoring: Prometheus + Grafana via kube-prometheus-stack
## Conventions
- All deployments: 2 replicas minimum (even in dev)
- Resources: always set requests AND limits — no unbounded containers
- Health checks: readiness + liveness probes required on every container
- Labels: app, version, component, part-of on every resource
## Never
- Hardcode image tags as 'latest' — always pin to a digest or semver
- Run as root or omit securityContext — explicitly set runAsNonRoot: true
- Store secrets in ConfigMaps or env literals — use ESO
See the CLAUDE.md guide for full setup.
Writing Kubernetes Manifests
Production Deployment
Write a Kubernetes Deployment for the API service.
Image: my-registry/api:1.2.3
2 replicas minimum, 4 max (HPA will manage scaling).
Resources: 250m/512Mi requests, 1000m/1Gi limits.
Readiness probe on /health/ready, liveness on /health/live.
Run as non-root.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  labels:
    app: api
    version: "1.2.3"
    component: backend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Zero-downtime: always have 2 running
  template:
    metadata:
      labels:
        app: api
        version: "1.2.3"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        fsGroup: 1001
      containers:
        - name: api
          image: my-registry/api:1.2.3
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
          env:
            - name: PORT
              value: "8080"
          envFrom:
            - secretRef:
                name: api-secrets
Key things Claude includes without prompting: maxUnavailable: 0 for zero-downtime rollouts, securityContext with explicit non-root UID, separate readiness and liveness probes (different timeouts), and resource limits on every container.
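Note that maxUnavailable: 0 only protects rollouts the Deployment itself performs; voluntary disruptions such as node drains go through the eviction API instead. A PodDisruptionBudget keeps the replica floor through those as well. A minimal sketch, assuming the same app: api label as the Deployment above:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
spec:
  minAvailable: 1  # evictions are refused if they would drop below 1 ready pod
  selector:
    matchLabels:
      app: api
```

With 2 replicas, minAvailable: 1 lets a drain proceed one pod at a time instead of taking both down at once.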
Service and Ingress
Write the Service and Ingress for the API deployment.
Domain: api.example.com
TLS via cert-manager (ClusterIssuer: letsencrypt-prod)
Rate limiting: 100 req/s per IP
apiVersion: v1
kind: Service
metadata:
  name: api
  labels:
    app: api
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/limit-rps: "100"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 80
Horizontal Pod Autoscaler
Add an HPA for the API deployment.
Scale between 2 and 20 pods.
Target: 70% CPU, 80% memory.
Scale up fast, scale down slowly.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min before scale-down
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
The behavior block is what makes this production-grade. Aggressive scale-up (double every 60s) with conservative scale-down (only 10% per minute) prevents thrash during traffic spikes.
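If percentage-based scale-down still moves too fast at low replica counts, the behavior block also accepts Pods-type policies. A variant of the scaleDown section, assuming you want at most one pod removed per minute:

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Pods
        value: 1
        periodSeconds: 60
```

When multiple policies are listed, the HPA applies the one allowing the largest change by default, so a Pods policy is most useful on its own or combined with selectPolicy: Min.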
Debugging Failing Pods
This is where Claude Code adds the most value — reading kubectl output and diagnosing problems:
CrashLoopBackOff
My pod is in CrashLoopBackOff.
kubectl describe pod output: [paste]
kubectl logs output: [paste]
Claude reads the event timeline and log output together. Common diagnoses:
- OOMKilled: memory limit too low, or memory leak — increase limit or profile the app
- Exit code 1 at startup: configuration error — usually a missing env var or bad connection string
- Exit code 137: SIGKILL (OOM at the node level, or the kubelet force-killing the container after a failed liveness probe)
- Readiness probe failing: the container starts but the app isn’t healthy — usually a database connection issue
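For the OOMKilled case specifically, the usual first step is raising the limit while checking real usage with kubectl top pod. A hypothetical patch to the Deployment's resources block, assuming the previous limit was 1Gi:

```yaml
resources:
  requests:
    memory: 512Mi
  limits:
    memory: 2Gi  # hypothetical: doubled after repeated OOMKilled events
```

If usage keeps climbing toward whatever limit you set, that points at a leak rather than an undersized limit.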
Pending Pods
Pods are stuck in Pending state.
kubectl describe pod shows: "0/3 nodes are available: 3 Insufficient cpu."
Claude diagnoses: the requested CPU across all pods exceeds what’s available. Options: reduce pod resources.requests.cpu, add nodes, or check if existing nodes have pods that should be drained. It explains that resource requests (not limits) govern scheduling.
ImagePullBackOff
Getting ImagePullBackOff.
Events show: "Failed to pull image: unauthorized"
Claude walks through the diagnosis: ECR token expired (ECR tokens expire every 12 hours), missing imagePullSecrets in the service account, or the node role lacks ECR pull permissions. It generates the fix for each case.
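For the missing-imagePullSecrets case, the fix attaches registry credentials to the pod's service account. A sketch, with a hypothetical secret name:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api
imagePullSecrets:
  - name: ecr-registry-credentials  # hypothetical; created from an ECR auth token
```

On EKS, nodes whose IAM role has ECR pull permissions don't need this; it matters when pulling from a registry the node role can't reach.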
Helm Charts
Creating a Chart from Scratch
Create a Helm chart for the API service.
Parameterize: image tag, replica count, resource limits,
ingress hostname, and environment-specific secrets.
Claude generates a complete chart structure:
api-chart/
├── Chart.yaml
├── values.yaml
├── values-staging.yaml
├── values-prod.yaml
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── ingress.yaml
    ├── hpa.yaml
    ├── serviceaccount.yaml
    └── _helpers.tpl
The values.yaml for the deployment template:
image:
  repository: my-registry/api
  tag: "1.2.3"  # pinned per the repo convention (never "latest"); CI overrides with --set image.tag
  pullPolicy: IfNotPresent

replicaCount: 2

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

ingress:
  enabled: true
  hostname: api.example.com
  tlsEnabled: true
  clusterIssuer: letsencrypt-prod

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
Claude generates correct Helm template syntax — {{ .Values.image.tag | quote }}, {{- if .Values.ingress.enabled }} blocks, and the _helpers.tpl with standard fullname and labels templates.
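A trimmed sketch of what such a templates/deployment.yaml typically looks like; the helper names (api.fullname, api.name) follow the helm create convention and are not necessarily Claude's exact output:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "api.fullname" . }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      app: {{ include "api.name" . }}
  template:
    metadata:
      labels:
        app: {{ include "api.name" . }}
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
```

The if not .Values.autoscaling.enabled guard stops Helm from fighting the HPA over replica count on every upgrade.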
Helm Hooks for Migrations
Add a Helm pre-upgrade hook that runs database migrations
before the new Pods start. If migrations fail, the upgrade should fail.
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "api.fullname" . }}-migrations
  annotations:
    "helm.sh/hook": pre-upgrade,pre-install
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrations
          image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
          command: ["python", "manage.py", "migrate", "--run-syncdb"]
          envFrom:
            - secretRef:
                name: {{ include "api.fullname" . }}-secrets
  backoffLimit: 0  # Fail immediately — don't retry broken migrations
hook-delete-policy: before-hook-creation cleans up old migration Jobs before creating new ones. backoffLimit: 0 ensures a failed migration fails the Helm release immediately.
Resource Management
Namespace Resource Quotas
Add resource quotas to prevent a single namespace from
consuming all cluster resources.
Hard limits: 16 CPUs, 32Gi memory, 50 pods max.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 32Gi
    limits.cpu: "32"
    limits.memory: 64Gi
    pods: "50"
    services: "20"
LimitRange Defaults
Set default resource requests and limits for pods in the namespace
that don't specify their own.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      max:
        cpu: "4"
        memory: 4Gi
A LimitRange closes the gap for pods that ship without resource specs: containers that omit requests and limits get the defaults injected at admission, and anything above max is rejected. Unbounded containers are one of the most common causes of node exhaustion.
RBAC and Security
Create a service account for the API pod that has:
- Read access to secrets in its own namespace
- No cross-namespace access
- Ability to list pods (for health check integration)
Claude generates the ServiceAccount, Role (namespace-scoped, not ClusterRole), and RoleBinding — minimal-privilege access. It explains why each permission is needed and flags if any requested permissions are broader than necessary.
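The namespace-scoped Role at the core of that output looks roughly like this sketch (the name is illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: api-reader
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list"]
```

A RoleBinding in the same namespace ties it to the ServiceAccount; because a Role can only grant access within its own namespace, the no-cross-namespace requirement comes for free.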
Kubernetes + CI/CD
For the complete container-to-cluster pipeline, see the Docker and DevOps guide, which covers multi-stage builds, and the CI/CD guide for automating deployments on merge.
The typical pattern Claude generates for Kubernetes deployments in CI:
# Update the image tag and deploy via Helm
helm upgrade --install api ./api-chart \
  --namespace production \
  --values ./api-chart/values-prod.yaml \
  --set image.tag=$GIT_SHA \
  --wait \
  --timeout 5m \
  --atomic  # Rolls back automatically on failure
--atomic is the important flag — if the deployment fails (pods crash, health checks fail), Helm automatically rolls back to the previous release.
Working with Kubernetes Day-to-Day
Claude Code is most effective for Kubernetes when it can see both the manifest and the error. The debugging workflow:
- kubectl describe pod <name> — events and status
- kubectl logs <name> --previous — logs from the crashed container
- Paste both to Claude with context about what you changed
Claude reads the event timeline (image pull, container start, probe failures) and gives specific diagnoses rather than generic suggestions. This is the most time-saving Kubernetes use case — debugging without having to page through documentation for every cryptic error message.
For production-ready Kubernetes patterns, the Claude Skills 360 bundle includes infrastructure skills covering EKS, GKE, and AKS deployment patterns, GitOps workflows with ArgoCD and Flux, and observability setup with Prometheus. Start with the free tier to explore the DevOps skill collection.