Site Reliability Engineering replaces vague “system is healthy” monitoring with concrete, business-aligned commitments: “99.9% of checkout requests succeed within 2 seconds.” These Service Level Objectives (SLOs) connect reliability to business impact, give teams an error budget to work with, and create clear escalation thresholds. Claude Code helps define SLOs, write the Prometheus recording rules and alerts, build dashboards, and structure postmortems.
## Defining SLIs and SLOs

> Define SLOs for our e-commerce checkout service.
> We want to measure what actually matters to users.
- **SLI (Service Level Indicator)**: the metric you measure
- **SLO (Service Level Objective)**: the threshold you commit to
- **Error budget**: 100% − SLO = how much failure is allowed
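The error-budget arithmetic is worth making concrete. A minimal sketch (Python, illustrative only) that converts an SLO target and window into an allowed-downtime budget:

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed full-outage minutes in the window for a given SLO target."""
    budget_fraction = 1.0 - slo_target        # e.g. 0.001 for a 99.9% SLO
    window_minutes = window_days * 24 * 60
    return budget_fraction * window_minutes

# A 99.9% SLO over 30 days allows about 43.2 minutes of complete downtime
print(round(error_budget_minutes(0.999, 30), 1))  # 43.2
```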
```yaml
# slo-definitions.yaml — human-readable SLO definitions
service: checkout-api
owner: payments-team
review_schedule: quarterly

slos:
  - name: checkout_availability
    description: "Successful checkout completions as a fraction of all attempts"
    # SLI: count of successful responses / total requests
    sli:
      numerator: "requests with status < 500"
      denominator: "all requests to /api/checkout"
    target: 99.9%  # 43.2 minutes of downtime allowed per 30 days
    window: 30d

  - name: checkout_latency
    description: "95% of checkout requests complete within 2 seconds"
    sli:
      metric: "fraction of POST /api/checkout requests served within 2s"
    target: 95%    # equivalently: p95 latency stays at or below 2s
    threshold: 2000ms
    window: 30d

  - name: payment_success_rate
    description: "Payment processing success rate (excludes declined cards)"
    sli:
      numerator: "payment_intents with status=succeeded"
      denominator: "payment_intents where status != 'requires_payment_method'"
    target: 99.5%
    window: 7d
```
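Definitions like these are easiest to keep honest when a script turns them into numbers the team can review. A sketch that mirrors two of the SLOs above as plain dicts (hypothetical structure, avoiding a YAML-parser dependency):

```python
# Plain-dict mirror of slo-definitions.yaml (illustrative subset)
SLOS = [
    {"name": "checkout_availability", "target": 0.999, "window_days": 30},
    {"name": "payment_success_rate",  "target": 0.995, "window_days": 7},
]

def budget_report(slos):
    """Map each SLO to its allowed bad-event fraction and full-outage minutes."""
    report = {}
    for slo in slos:
        budget = 1.0 - slo["target"]
        report[slo["name"]] = {
            "budget_fraction": round(budget, 4),
            "outage_minutes": round(budget * slo["window_days"] * 24 * 60, 1),
        }
    return report

# checkout_availability → 43.2 min/30d; payment_success_rate → 50.4 min/7d
print(budget_report(SLOS))
```

Reviewing budgets in minutes rather than percentages makes quarterly SLO reviews far more concrete for stakeholders.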
## Prometheus Recording Rules

> Implement the SLOs as Prometheus recording rules and burn-rate alerts.
```yaml
# prometheus/rules/checkout-slo.yaml
groups:
  - name: checkout_slo_recording
    interval: 30s
    rules:
      # --- Availability SLI ---
      # Request rate: good requests (non-5xx)
      - record: job:checkout_requests_success:rate5m
        expr: |
          sum(rate(http_requests_total{
            job="checkout-api",
            status!~"5.."
          }[5m]))

      # Total request rate
      - record: job:checkout_requests_total:rate5m
        expr: |
          sum(rate(http_requests_total{
            job="checkout-api"
          }[5m]))

      # Error rate (complement of availability)
      - record: job:checkout_error_rate:rate5m
        expr: |
          1 - (job:checkout_requests_success:rate5m / job:checkout_requests_total:rate5m)

      # Multi-window error rates (for multi-window burn rate alerts)
      - record: job:checkout_error_rate:rate1h
        expr: |
          1 - (
            sum(rate(http_requests_total{job="checkout-api",status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total{job="checkout-api"}[1h]))
          )

      - record: job:checkout_error_rate:rate6h
        expr: |
          1 - (
            sum(rate(http_requests_total{job="checkout-api",status!~"5.."}[6h]))
            /
            sum(rate(http_requests_total{job="checkout-api"}[6h]))
          )

      # Error rate over the full 30-day SLO window
      - record: job:checkout_error_rate:rate30d
        expr: |
          1 - (
            sum(rate(http_requests_total{job="checkout-api",status!~"5.."}[30d]))
            /
            sum(rate(http_requests_total{job="checkout-api"}[30d]))
          )

      # Error budget remaining (1 = full budget, 0 = fully consumed)
      - record: job:checkout_error_budget_remaining:30d
        expr: |
          1 - (
            job:checkout_error_rate:rate30d
            /
            (1 - 0.999)  # SLO = 99.9%
          )

  - name: checkout_slo_alerts
    rules:
      # Burn rate alert: fast burn (page now)
      # A 14.4x burn rate consumes 2% of the 30-day budget per hour
      # and would exhaust the full budget in ~2 days.
      - alert: CheckoutSLOFastBurn
        expr: |
          job:checkout_error_rate:rate1h > (14.4 * (1 - 0.999))
          and
          job:checkout_error_rate:rate5m > (14.4 * (1 - 0.999))
        for: 2m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Checkout SLO burning budget at 14.4x rate"
          description: |
            Error rate {{ $value | humanizePercentage }} is 14.4x the rate the SLO allows.
            At this rate, the full 30-day error budget will be consumed in ~2 days.
          runbook: https://runbooks.internal/checkout-high-error-rate

      # Slow burn (alert for on-call awareness, not an immediate page)
      # A 6x burn rate exhausts the full 30-day budget in ~5 days.
      - alert: CheckoutSLOSlowBurn
        expr: |
          job:checkout_error_rate:rate6h > (6 * (1 - 0.999))
          and
          job:checkout_error_rate:rate1h > (6 * (1 - 0.999))
        for: 15m
        labels:
          severity: warning
          team: payments
        annotations:
          summary: "Checkout SLO slow burn — investigate during business hours"
          description: "Error rate elevated for 6+ hours. Budget burning at 6x the sustainable rate."

      # Latency SLO: p95 > 2s for 5 minutes
      - alert: CheckoutLatencySLOBreach
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{
              job="checkout-api",
              path="/api/checkout"
            }[5m])) by (le)
          ) > 2.0
        for: 5m
        labels:
          severity: critical
```
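The 14.4x and 6x multipliers come from straightforward arithmetic: burn rate is the ratio of the observed error rate to the error rate the SLO allows, and the window length divided by the burn rate gives time to budget exhaustion. A minimal sketch with illustrative numbers:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'allowed' the budget is being consumed."""
    return error_rate / (1.0 - slo_target)

def hours_to_exhaustion(burn: float, window_days: int = 30) -> float:
    """Hours until the full window's budget is gone at this burn rate."""
    return (window_days * 24) / burn

# A 1.44% error rate against a 99.9% SLO is a 14.4x burn:
b = burn_rate(0.0144, 0.999)
print(round(b, 1))                       # 14.4
print(round(hours_to_exhaustion(b), 0))  # 50.0 hours, i.e. about two days
```

This is why the fast-burn alert pages immediately while the slow-burn alert (6x, ~5 days to exhaustion) can wait for business hours.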
## Error Budget Tracking Dashboard

> Build a Grafana dashboard that shows our error budget consumption over the 30-day rolling window.
```json
{
  "panels": [
    {
      "title": "Checkout SLO - 30 Day Error Budget Remaining",
      "type": "gauge",
      "targets": [{
        "expr": "job:checkout_error_budget_remaining:30d * 100",
        "legendFormat": "Budget Remaining %"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {
            "steps": [
              {"color": "red", "value": 0},
              {"color": "yellow", "value": 25},
              {"color": "green", "value": 50}
            ]
          }
        }
      }
    },
    {
      "title": "Error Rate - 1h vs SLO",
      "type": "timeseries",
      "targets": [
        {
          "expr": "job:checkout_error_rate:rate1h * 100",
          "legendFormat": "Error Rate %"
        },
        {
          "expr": "vector((1 - 0.999) * 100)",
          "legendFormat": "SLO Threshold (0.1%)"
        }
      ]
    }
  ]
}
```
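The gauge is just the budget-remaining recording rule scaled to percent. The same computation, sketched in Python so dashboard values can be sanity-checked by hand:

```python
def budget_remaining_pct(error_rate_30d: float, slo_target: float = 0.999) -> float:
    """Percent of the error budget left, matching the recording rule's formula."""
    allowed_error_rate = 1.0 - slo_target
    return (1.0 - error_rate_30d / allowed_error_rate) * 100.0

# A 0.05% error rate over the window leaves half the budget:
print(round(budget_remaining_pct(0.0005), 1))  # 50.0

# A 0.1% error rate consumes the whole budget:
print(round(budget_remaining_pct(0.001), 1))   # 0.0
```

Note the value goes negative once the SLO is breached, which is exactly what you want the gauge's red band to make visible.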
## Blameless Postmortem Template

> We had an outage. Help me write a blameless postmortem that focuses on systemic issues.
```markdown
# Postmortem: Checkout Service Outage — 2026-09-25

**Status**: Complete
**Severity**: SEV1 (complete checkout unavailability, 47 minutes)
**Error Budget Impact**: ~109% of the 30-day budget consumed — the availability SLO was breached for this window

## Impact

- Duration: 2026-09-25 14:23 UTC → 15:10 UTC (47 minutes)
- Users affected: ~8,400 checkout attempts (100% failure rate during window)
- Revenue impact: estimated $42,000 in lost transactions
- SLO: the 47-minute outage exceeded the entire 43.2-minute 30-day error budget

## Timeline (UTC)

| Time | Event |
|------|-------|
| 14:18 | Database migration deployed to production |
| 14:23 | Checkout error rate spikes to 100% |
| 14:27 | Fast burn alert fires — on-call paged |
| 14:35 | On-call investigates, identifies DB connection pool exhausted |
| 14:51 | Root cause identified: missing index on new column causing table scans |
| 15:08 | Index added, performance restored |
| 15:10 | Error rate returns to baseline |

## Root Cause

A database migration added a `status_updated_at` column, and the application immediately started filtering on it. Without an index, every checkout request triggered a full table scan on the 18M-row orders table, saturating the DB connection pool.

## Contributing Factors

1. The migration ran in production without load testing the new query pattern
2. No canary rollout — the migration applied to 100% of traffic immediately
3. Alerting on `error_rate > fast_burn_threshold` caught the incident in 4 minutes, but the fix took 43 more minutes — the table scan wasn't immediately obvious from the connection pool error

## Action Items

| Action | Owner | Due |
|--------|-------|-----|
| Add pre-migration query plan check to CI pipeline | Platform | 2026-10-09 |
| Implement canary database migrations (5% → 100%) | Platform | 2026-10-16 |
| Add slow query logging alert for queries > 500ms | On-call | 2026-10-02 |
| Update runbook with "connection pool exhausted" diagnostic steps | Payments | 2026-10-02 |

## What Went Well

- Burn rate alerting correctly paged within 4 minutes of onset
- The on-call engineer correctly triaged the database layer first
- The remediation path was clear (the fix was additive — add an index — rather than a rollback)

## Lessons Learned

Schema changes that add filter columns require the index to exist before the application code that uses it ships: deploy the index first, ship the code second.
```
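Quantifying an incident's error-budget impact, as postmortem headers like this one do, is simple arithmetic: minutes of full outage divided by the window's total budget. A sketch (assumes roughly uniform traffic and a 100% failure rate during the outage):

```python
def incident_budget_impact(outage_minutes: float,
                           slo_target: float = 0.999,
                           window_days: int = 30) -> float:
    """Fraction of the window's error budget consumed by a full outage."""
    budget_minutes = (1.0 - slo_target) * window_days * 24 * 60  # 43.2 for 99.9%/30d
    return outage_minutes / budget_minutes

# A 10-minute full outage consumes roughly 23% of a 99.9%/30d budget:
print(round(incident_budget_impact(10), 2))  # 0.23
```

Partial outages scale the result by the failure rate during the window, so a 10-minute incident failing half of requests consumes roughly half as much budget.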
Distributed tracing provides the request-level context needed during SLO investigations; see the observability guide. Incident response runbooks and the chaos engineering that validates reliability before incidents strike are covered in the incident response guide. The Claude Skills 360 bundle includes SRE skill sets covering SLO definition, Prometheus rules, and postmortem processes. Start with the free tier to try the reliability engineering templates.