Kubernetes YAML is notoriously verbose and easy to get wrong — a missing label selector, a mismatched containerPort, or an incorrect readinessProbe path silently breaks a deployment. Claude Code writes correct Kubernetes manifests, debugs failing workloads from kubectl output, and generates Helm charts with proper templating.
This guide covers practical Kubernetes workflows with Claude Code: writing manifests, debugging pods, Helm chart creation, resource management, and cluster administration patterns.
Setting Up Claude Code for Kubernetes Projects
Cluster context matters for Kubernetes work. Put your setup in CLAUDE.md:
# Kubernetes Context
## Cluster
- EKS 1.29 on AWS, 3 node groups (general, compute, memory)
- Namespace per service (not per environment)
- Environments: dev/staging/prod clusters (not namespaces)
## Stack
- Helm 3.13 for all deployments — no raw kubectl apply in CI
- Container registry: ECR
- Secrets: External Secrets Operator (not raw K8s secrets)
- Ingress: nginx-ingress-controller
- Monitoring: Prometheus + Grafana via kube-prometheus-stack
## Conventions
- All deployments: 2 replicas minimum (even in dev)
- Resources: always set requests AND limits — no unbounded containers
- Health checks: readiness + liveness probes required on every container
- Labels: app, version, component, part-of on every resource
## Never
- Hardcode image tags as 'latest' — always pin to a digest or semver
- Run as root or omit securityContext — explicitly set runAsNonRoot: true
- Store secrets in ConfigMaps or env literals — use ESO
See the CLAUDE.md guide for full setup.
Writing Kubernetes Manifests
Production Deployment
Write a Kubernetes Deployment for the API service.
Image: my-registry/api:1.2.3
2 replicas minimum, 4 max (HPA will manage scaling).
Resources: 250m/512Mi requests, 1000m/1Gi limits.
Readiness probe on /health/ready, liveness on /health/live.
Run as non-root.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  labels:
    app: api
    version: "1.2.3"
    component: backend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Zero-downtime: always have 2 running
  template:
    metadata:
      labels:
        app: api
        version: "1.2.3"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        fsGroup: 1001
      containers:
        - name: api
          image: my-registry/api:1.2.3
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
          env:
            - name: PORT
              value: "8080"
          envFrom:
            - secretRef:
                name: api-secrets
Key things Claude includes without prompting: maxUnavailable: 0 for zero-downtime rollouts, securityContext with explicit non-root UID, separate readiness and liveness probes (different timeouts), and resource limits on every container.
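Note that maxUnavailable: 0 only protects rollouts the Deployment itself performs; voluntary disruptions such as node drains go through the eviction API instead. A PodDisruptionBudget keeps the replica floor through those as well. A minimal sketch, assuming the same app: api label as the Deployment above:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
spec:
  minAvailable: 1  # evictions are refused if they would drop below 1 ready pod
  selector:
    matchLabels:
      app: api
```

With 2 replicas, minAvailable: 1 lets a drain proceed one pod at a time instead of taking both down at once.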
Service and Ingress
Write the Service and Ingress for the API deployment.
Domain: api.example.com
TLS via cert-manager (ClusterIssuer: letsencrypt-prod)
Rate limiting: 100 req/s per IP
apiVersion: v1
kind: Service
metadata:
  name: api
  labels:
    app: api
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/limit-rps: "100"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 80
Horizontal Pod Autoscaler
Add an HPA for the API deployment.
Scale between 2 and 20 pods.
Target: 70% CPU, 80% memory.
Scale up fast, scale down slowly.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min before scale-down
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
The behavior block is what makes this production-grade. Aggressive scale-up (double every 60s) with conservative scale-down (only 10% per minute) prevents thrash during traffic spikes.
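If percentage-based scale-down still moves too fast at low replica counts, the behavior block also accepts Pods-type policies. A variant of the scaleDown section, assuming you want at most one pod removed per minute:

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Pods
        value: 1
        periodSeconds: 60
```

When multiple policies are listed, the HPA applies the one allowing the largest change by default, so a Pods policy is most useful on its own or combined with selectPolicy: Min.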
Debugging Failing Pods
This is where Claude Code adds the most value — reading kubectl output and diagnosing problems:
CrashLoopBackOff
My pod is in CrashLoopBackOff.
kubectl describe pod output: [paste]
kubectl logs output: [paste]
Claude reads the event timeline and log output together. Common diagnoses:
- OOMKilled: memory limit too low, or memory leak — increase limit or profile the app
- Exit code 1 at startup: configuration error — usually a missing env var or bad connection string
- Exit code 137: SIGKILL (OOM at the node level, or the kubelet force-killing the container after a failed liveness probe)
- Readiness probe failing: the container starts but the app isn’t healthy — usually a database connection issue
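For the OOMKilled case specifically, the usual first step is raising the limit while checking real usage with kubectl top pod. A hypothetical patch to the Deployment's resources block, assuming the previous limit was 1Gi:

```yaml
resources:
  requests:
    memory: 512Mi
  limits:
    memory: 2Gi  # hypothetical: doubled after repeated OOMKilled events
```

If usage keeps climbing toward whatever limit you set, that points at a leak rather than an undersized limit.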
Pending Pods
Pods are stuck in Pending state.
kubectl describe pod shows: "0/3 nodes are available: 3 Insufficient cpu."
Claude diagnoses: the requested CPU across all pods exceeds what’s available. Options: reduce pod resources.requests.cpu, add nodes, or check if existing nodes have pods that should be drained. It explains that resource requests (not limits) govern scheduling.
ImagePullBackOff
Getting ImagePullBackOff.
Events show: "Failed to pull image: unauthorized"
Claude walks through the diagnosis: ECR token expired (ECR tokens expire every 12 hours), missing imagePullSecrets in the service account, or the node role lacks ECR pull permissions. It generates the fix for each case.
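For the missing-imagePullSecrets case, the fix attaches registry credentials to the pod's service account. A sketch, with a hypothetical secret name:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api
imagePullSecrets:
  - name: ecr-registry-credentials  # hypothetical; created from an ECR auth token
```

On EKS, nodes whose IAM role has ECR pull permissions don't need this; it matters when pulling from a registry the node role can't reach.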
Helm Charts
Creating a Chart from Scratch
Create a Helm chart for the API service.
Parameterize: image tag, replica count, resource limits,
ingress hostname, and environment-specific secrets.
Claude generates a complete chart structure:
api-chart/
├── Chart.yaml
├── values.yaml
├── values-staging.yaml
├── values-prod.yaml
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── ingress.yaml
    ├── hpa.yaml
    ├── serviceaccount.yaml
    └── _helpers.tpl
The values.yaml for the deployment template:
image:
  repository: my-registry/api
  tag: "1.2.3"  # pinned per the repo convention (never "latest"); CI overrides with --set image.tag
  pullPolicy: IfNotPresent

replicaCount: 2

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

ingress:
  enabled: true
  hostname: api.example.com
  tlsEnabled: true
  clusterIssuer: letsencrypt-prod

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
Claude generates correct Helm template syntax — {{ .Values.image.tag | quote }}, {{- if .Values.ingress.enabled }} blocks, and the _helpers.tpl with standard fullname and labels templates.
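A trimmed sketch of what such a templates/deployment.yaml typically looks like; the helper names (api.fullname, api.name) follow the helm create convention and are not necessarily Claude's exact output:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "api.fullname" . }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      app: {{ include "api.name" . }}
  template:
    metadata:
      labels:
        app: {{ include "api.name" . }}
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
```

The if not .Values.autoscaling.enabled guard stops Helm from fighting the HPA over replica count on every upgrade.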
Helm Hooks for Migrations
Add a Helm pre-upgrade hook that runs database migrations
before the new Pods start. If migrations fail, the upgrade should fail.
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "api.fullname" . }}-migrations
  annotations:
    "helm.sh/hook": pre-upgrade,pre-install
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrations
          image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
          command: ["python", "manage.py", "migrate", "--run-syncdb"]
          envFrom:
            - secretRef:
                name: {{ include "api.fullname" . }}-secrets
  backoffLimit: 0  # Fail immediately — don't retry broken migrations
hook-delete-policy: before-hook-creation cleans up old migration Jobs before creating new ones. backoffLimit: 0 ensures a failed migration fails the Helm release immediately.
Resource Management
Namespace Resource Quotas
Add resource quotas to prevent a single namespace from
consuming all cluster resources.
Hard limits: 16 CPUs, 32Gi memory, 50 pods max.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 32Gi
    limits.cpu: "32"
    limits.memory: 64Gi
    pods: "50"
    services: "20"
LimitRange Defaults
Set default resource requests and limits for pods in the namespace
that don't specify their own.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      max:
        cpu: "4"
        memory: 4Gi
A LimitRange closes the gap for pods that ship without resource specs: containers that omit requests and limits get the defaults injected at admission, and anything above max is rejected. Unbounded containers are one of the most common causes of node exhaustion.
RBAC and Security
Create a service account for the API pod that has:
- Read access to secrets in its own namespace
- No cross-namespace access
- Ability to list pods (for health check integration)
Claude generates the ServiceAccount, Role (namespace-scoped, not ClusterRole), and RoleBinding — minimal-privilege access. It explains why each permission is needed and flags if any requested permissions are broader than necessary.
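The namespace-scoped Role at the core of that output looks roughly like this sketch (the name is illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: api-reader
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list"]
```

A RoleBinding in the same namespace ties it to the ServiceAccount; because a Role can only grant access within its own namespace, the no-cross-namespace requirement comes for free.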
Kubernetes + CI/CD
For the complete container-to-cluster pipeline, see the Docker and DevOps guide, which covers multi-stage builds, and the CI/CD guide for automating deployments on merge.
The typical pattern Claude generates for Kubernetes deployments in CI:
# Update the image tag and deploy via Helm
helm upgrade --install api ./api-chart \
  --namespace production \
  --values ./api-chart/values-prod.yaml \
  --set image.tag=$GIT_SHA \
  --wait \
  --timeout 5m \
  --atomic  # Rolls back automatically on failure
--atomic is the important flag — if the deployment fails (pods crash, health checks fail), Helm automatically rolls back to the previous release.
Working with Kubernetes Day-to-Day
Claude Code is most effective for Kubernetes when it can see both the manifest and the error. The debugging workflow:
- kubectl describe pod <name> — events and status
- kubectl logs <name> --previous — logs from the crashed container
- Paste both to Claude with context about what you changed
Claude reads the event timeline (image pull, container start, probe failures) and gives specific diagnoses rather than generic suggestions. This is the most time-saving Kubernetes use case — debugging without having to page through documentation for every cryptic error message.
For production-ready Kubernetes patterns, the Claude Skills 360 bundle includes infrastructure skills covering EKS, GKE, and AKS deployment patterns, GitOps workflows with ArgoCD and Flux, and observability setup with Prometheus. Start with the free tier to explore the DevOps skill collection.