Claude Code for Incident Response: Runbooks, Postmortems, and Chaos Engineering — Claude Skills 360 Blog

Claude Code for Incident Response: Runbooks, Postmortems, and Chaos Engineering

Published: September 3, 2026
Read time: 8 min read
By: Claude Skills 360

Incident response quality determines how fast you recover when things go wrong. Claude Code helps write runbooks that are actually used during incidents (not just referenced once then ignored), generate postmortem templates from incident data, set up SLO tracking, and design chaos experiments that find weaknesses before they become incidents.

This guide covers reliability engineering with Claude Code: runbooks, incident playbooks, postmortems, and SLO alert setup.

Runbook Generation

We had three incidents in the last month where the on-call engineer
didn't know what to do when the payment service was slow.
Generate a runbook for payment service degradation.
# Runbook: Payment Service Degradation

**Severity**: P1 (affects all purchases) / P2 (intermittent failures)  
**Owner**: Platform team  
**Last verified**: 2026-04-01

## Symptoms
- Alert: `payment_service_p99_latency > 2000ms` for 5+ minutes
- Alert: `payment_failures_rate > 5%`
- User reports: "payment not going through"

## Immediate Actions (first 5 minutes)

### 1. Confirm the scope
```bash
# Check error rate across all payment service instances
kubectl logs -l app=payment-service -n production --since=15m | grep -c ERROR

# Check if it's all users or specific regions
curl -s "https://metrics.internal/api/query?q=payment_errors_by_region" | jq .
```

### 2. Check Stripe status

- https://status.stripe.com — confirm whether the upstream provider is degraded
- If Stripe has an active incident: set the maintenance message and wait for resolution

### 3. Check recent deployments

```bash
kubectl rollout history deployment/payment-service -n production | head -5
```

If a deploy landed in the last 30 min → immediately consider a rollback:

```bash
kubectl rollout undo deployment/payment-service -n production
```

### 4. Check database connection pool

```bash
# Is the payment DB pool exhausted?
kubectl exec -n production deployment/payment-service -- \
  curl -s localhost:9090/metrics | grep -E 'db_pool_(size|idle|used)'
```

Expected: idle > 5. If idle = 0 → connection pool exhausted (see escalation).

## Escalation

- Stripe issue: tag @stripe-account-manager in #incidents
- DB pool exhausted: page @dba-oncall
- Unknown cause after 15 min: page @payment-team-lead

## Resolution Verification

```bash
# Confirm metrics are back to normal
watch -n 10 'curl -s "https://metrics.internal/api/query?q=payment_p99_latency" | jq .value'
```

Resolution: p99 < 500ms for 5 consecutive minutes.

## Post-incident

1. Create a postmortem issue in GitHub within 24h
2. Update this runbook if you discovered new diagnostic steps
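The resolution criterion above (p99 under 500ms for 5 consecutive minutes) can be checked mechanically rather than by eyeballing `watch` output. A minimal sketch of the check itself — the `fetch_p99()` poller that would feed it one sample per minute from your metrics API is assumed, not shown:

```python
P99_THRESHOLD_MS = 500
REQUIRED_CONSECUTIVE = 5  # one sample per minute → 5 consecutive minutes

def is_resolved(samples_ms, threshold_ms=P99_THRESHOLD_MS, needed=REQUIRED_CONSECUTIVE):
    """Return True once the trailing `needed` samples are all below threshold."""
    if len(samples_ms) < needed:
        return False  # not enough history yet to declare resolution
    return all(s < threshold_ms for s in samples_ms[-needed:])
```

Append each minute's p99 reading to a list and call `is_resolved` on it; the incident is only closed once five trailing samples in a row are under the threshold, so a single good reading after a spike does not end the incident early.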

## Automated Runbook from Alerts

Generate runbooks for all my PagerDuty alert policies automatically. Use the alert conditions and historical incidents to suggest diagnostic steps.


```typescript
// scripts/generate-runbooks.ts
import { AlertPolicy, Incident } from './types';

async function generateRunbook(policy: AlertPolicy, recentIncidents: Incident[]): Promise<string> {
  // analyzeIncidentPatterns (defined elsewhere) extracts recurring resolution
  // actions and their frequencies from the incident history
  const incidentPatterns = analyzeIncidentPatterns(recentIncidents);

  const prompt = `Generate a runbook for this alert:
Alert: ${policy.name}
Condition: ${policy.condition}
Recent resolution actions from incidents:
${incidentPatterns.map(p => `- ${p.action}: ${p.frequency}x`).join('\n')}

Format: markdown with Immediate Actions, Escalation, and Verification sections.
Include specific kubectl/bash commands for this service.`;

  // callClaude (defined elsewhere) sends the prompt to the Claude API
  const runbook = await callClaude(prompt);
  return runbook;
}
```

SLO Setup and Alerting

Set up SLOs for our API: 99.9% availability and p99 < 500ms.
Create the Prometheus alerts that fire before we breach the SLO.
```yaml
# prometheus/rules/slo-alerts.yml
groups:
  - name: api-slo-alerts
    rules:

      # Availability SLO: 99.9% = 43.2 min downtime/month
      # Fast-burn alert: at a 14.4x burn rate the monthly budget is gone in ~50h
      - alert: APIAvailabilityHighBurnRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 14.4 * (1 - 0.999)
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "API error budget burning fast (1h burn rate)"
          description: "Current error rate {{ $value | humanizePercentage }} threatens the 99.9% SLO"
          runbook: "https://wiki.internal/runbooks/api-high-error-rate"

      # Latency SLO: p99 < 500ms
      - alert: APILatencyP99Breach
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[10m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
          slo: latency
        annotations:
          summary: "API p99 latency breaching SLO target"
          description: "p99 is {{ $value | humanizeDuration }} (target: 500ms)"

      # Error budget remaining (monthly)
      - alert: APIErrorBudgetLow
        expr: |
          (
            1 - (
              sum(increase(http_requests_total{status=~"5.."}[30d]))
              /
              sum(increase(http_requests_total[30d]))
            ) / 0.001
          ) < 0.1  # Less than 10% of monthly error budget remaining
        labels:
          severity: warning
        annotations:
          summary: "API error budget < 10% remaining this month"
```
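The 14.4 multiplier above comes from standard multiwindow burn-rate arithmetic (popularized by the Google SRE Workbook): a burn rate of 1 spends the error budget exactly over the 30-day window, so a 14.4x burn rate spends it in 720/14.4 = 50 hours. A quick sketch of the math in plain Python, no Prometheus required:

```python
PERIOD_HOURS = 30 * 24  # 30-day SLO window = 720 hours

def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Total allowed downtime (minutes) for an availability SLO over the period."""
    return (1 - slo) * period_days * 24 * 60

def burn_rate_threshold(slo: float, burn_rate: float) -> float:
    """Error-rate threshold for a burn-rate alert: burn_rate x the budget rate."""
    return burn_rate * (1 - slo)

def hours_to_exhaustion(burn_rate: float, period_hours: float = PERIOD_HOURS) -> float:
    """At a constant burn rate, hours until the whole budget is gone."""
    return period_hours / burn_rate
```

For a 99.9% SLO this gives a 43.2-minute monthly budget, a 1.44% error-rate threshold for the 14.4x fast-burn alert, and 50 hours to exhaustion at that rate — which is why the fast-burn alert pages immediately while the budget-remaining alert is only a warning.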

Postmortem Template

Fill out a postmortem for this incident (paste incident timeline).
Use the blameless format and include action items.

Generated postmortem structure:

# Postmortem: Payment Service Outage — 2026-04-15

**Duration**: 47 minutes (14:23 - 15:10 UTC)  
**Impact**: 100% payment failures for 47 minutes (~$28K revenue impact, 340 failed orders)  
**Severity**: P1  
**Status**: Resolved

## Summary
A deployment of payment-service v2.4.1 introduced a database connection pool exhaustion bug.
The connection pool limit was lowered from 50 to 5 in a configuration change that was reviewed
but the impact wasn't recognized. Stripe retries amplified the connection demand under the
smaller pool size, causing cascading failures.

## Timeline
| Time (UTC) | Event |
|---|---|
| 14:21 | payment-service v2.4.1 deployed |
| 14:23 | First alert fires: payment_failures_rate > 5% |
| 14:27 | On-call paged; begins investigation |
| 14:35 | DB connection pool exhaustion identified |
| 14:41 | Rollback initiated |
| 15:10 | Service restored; error rate returns to normal |

## Root Cause
The `DB_MAX_CONNECTIONS=5` configuration in the v2.4.1 deployment reduced the pool from 50 to 5.
This was introduced to fix a "too many connections" warning in staging (which had many services
sharing one DB), but applied to production config incorrectly.

## Contributing Factors
- Configuration change not covered by load test
- No alert for DB pool utilization
- Rollback took 29 min (runbook didn't have DB pool as diagnostic step)

## Action Items
| Action | Owner | Due |
|---|---|---|
| Add DB pool utilization monitoring alert | @platform-team | 2026-04-22 |
| Add DB pool check to payment runbook | @oncall-lead | 2026-04-19 |
| Separate staging/prod configuration validation | @devops | 2026-05-01 |
| Load test configuration changes before deployment | @platform-team | 2026-05-15 |

## What Went Well
- Alert fired within 2 minutes of the issue starting
- Rollback procedure was practiced and executed correctly
- No data loss occurred
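Action items are only useful if they are tracked. One lightweight option is to parse the postmortem's action-item table into structured records that a script can then sync to your issue tracker; the sync step is tracker-specific and left out. A minimal sketch for the three-column table format shown above:

```python
def parse_action_items(markdown: str) -> list[dict]:
    """Parse a markdown table with Action | Owner | Due columns into dicts."""
    items = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|") or set(line) <= {"|", "-", " "}:
            continue  # skip non-table lines and the |---|---| separator row
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 3 or cells[0].lower() == "action":
            continue  # skip the header row and malformed lines
        items.append({"action": cells[0], "owner": cells[1], "due": cells[2]})
    return items
```

Each record carries an owner handle and a due date, so a follow-up job can open one tracker issue per row and flag items that blow past their deadline.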

Chaos Engineering

Design chaos experiments for our payment service.
What should we test and how do we do it safely?
```python
# chaos-experiments/payment-service.py
# A Chaos Toolkit experiment declared as a plain dict; Chaos Toolkit normally
# reads this structure from a JSON/YAML file passed to `chaos run`.

# Experiment: What happens when the payment DB is slow?
experiment = {
    "title": "Payment service degrades gracefully when DB latency spikes",
    "description": "Inject 500ms latency on payment DB connections; verify API returns errors for new requests while in-flight requests complete",

    "steady-state-hypothesis": {
        "title": "Payment API is healthy",
        "probes": [
            {
                "name": "payment-p99-under-500ms",
                "type": "probe",
                "provider": {
                    "type": "http",
                    "url": "https://api.internal/metrics/payment_p99",
                },
                "tolerance": {"type": "lt", "value": 500},
            }
        ],
    },

    "method": [
        {
            # tc needs NET_ADMIN and must run on the host carrying DB traffic
            "name": "inject-db-latency",
            "type": "action",
            "provider": {
                "type": "process",
                "path": "tc",
                "arguments": "qdisc add dev eth0 root netem delay 500ms",
                "timeout": 30,
            },
        }
    ],

    "rollbacks": [
        {
            "name": "remove-db-latency",
            "type": "action",
            "provider": {
                "type": "process",
                "path": "tc",
                "arguments": "qdisc del dev eth0 root",
            },
        }
    ],
}
```
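It helps to be explicit about how the steady-state hypothesis is judged before running this against real traffic. Chaos Toolkit compares each probe's observed value against its declared tolerance; the core of that comparison for the tolerance forms used above can be sketched in a few lines (simplified — the real library supports many more tolerance types):

```python
def within_tolerance(value: float, tolerance: dict) -> bool:
    """Simplified tolerance check in the spirit of Chaos Toolkit's
    {"type": "lt", "value": X} form; only a few types are handled here."""
    kind = tolerance["type"]
    if kind == "lt":
        return value < tolerance["value"]
    if kind == "gt":
        return value > tolerance["value"]
    if kind == "range":
        low, high = tolerance["value"]
        return low <= value <= high
    raise ValueError(f"unsupported tolerance type: {kind}")
```

If the hypothesis fails after the method runs — here, p99 no longer under 500ms — the experiment is recorded as deviated, and the rollback removes the injected latency either way. Always run the rollback path in a drill before trusting it during a real experiment.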

For proactive cost reduction before incidents become expensive, see the cost optimization guide. For the observability setup that provides data during incident response, see the observability guide. The Claude Skills 360 bundle includes reliability engineering skill sets covering SLO design, runbook generation, and chaos testing. Start with the free tier to try incident response code generation.

Put these ideas into practice

Claude Skills 360 gives you production-ready skills for everything in this article — and 2,350+ more. Start free or go all-in.
