Observability — the ability to understand what’s happening inside your system from its outputs — requires three pillars: logs, traces, and metrics. Implementing these correctly is tedious but important: structured logs that can be queried, distributed traces that connect requests across services, and metrics that expose the right signals. Claude Code generates observability instrumentation that follows OpenTelemetry standards and integrates with your monitoring stack.
This guide covers adding observability with Claude Code: structured logging, distributed tracing with OpenTelemetry, Prometheus metrics, Grafana dashboards, and alerting.
Setting Up Claude Code for Observability Work
Giving Claude your stack context up front prevents it from instrumenting with the wrong libraries:
# Observability Context
## Stack
- Node.js services + Python workers
- Logs: structured JSON → Datadog (via fluentd)
- Traces: OpenTelemetry SDK → Jaeger (dev) / Datadog APM (prod)
- Metrics: Prometheus + Grafana (kube-prometheus-stack)
- Alerts: Alertmanager → PagerDuty
## Logging Conventions
- JSON format, not concatenated strings
- Required fields: level, msg, service, requestId, timestamp
- Never log: passwords, tokens, full card numbers, SSNs, PII beyond email
- Log levels: error (needs attention), warn (unexpected but handled), info (business events), debug (dev only)
## Tracing Conventions
- Trace every incoming HTTP request
- Create spans for: DB queries, external HTTP calls, cache operations
- Required span attributes: user.id (if authenticated), db.table, http.route
See the CLAUDE.md setup guide for complete configuration.
Structured Logging
Logger Setup
Set up structured JSON logging for an Express API.
Include: request ID correlation, service context, log level filtering.
No sensitive data in logs.
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  base: {
    service: process.env.SERVICE_NAME ?? 'api',
    version: process.env.APP_VERSION ?? 'unknown',
  },
  redact: {
    paths: ['req.headers.authorization', 'body.password', 'body.token', '*.secret'],
    censor: '[REDACTED]',
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Request-scoped logger with correlation ID
export function createRequestLogger(requestId: string) {
  return logger.child({ requestId });
}
pino uses redact to automatically scrub sensitive fields before output — even if application code accidentally passes them. The field list matches common secret patterns. child() loggers inherit parent context (service, version) and add request-scoped fields.
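The mechanism behind `redact` can be sketched without pino — a minimal, hypothetical redactor that walks dot-separated paths and censors the matched leaves (illustrative only; pino's real implementation also handles wildcard paths like `*.secret`):

```typescript
// Minimal sketch of path-based redaction: for each dot-separated path,
// walk into the object and replace the matched leaf with a censor value.
function redact(
  obj: Record<string, unknown>,
  paths: string[],
  censor = '[REDACTED]',
): Record<string, unknown> {
  const clone = structuredClone(obj); // don't mutate the caller's object
  for (const path of paths) {
    const keys = path.split('.');
    let cur: any = clone;
    for (let i = 0; i < keys.length - 1 && cur != null; i++) {
      cur = cur[keys[i]];
    }
    const leaf = keys[keys.length - 1];
    if (cur && typeof cur === 'object' && leaf in cur) {
      cur[leaf] = censor;
    }
  }
  return clone;
}
```

For example, `redact({ body: { password: 'hunter2', email: 'a@b.c' } }, ['body.password'])` censors the password and leaves the email intact; paths that don't exist are simply skipped.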
Request Logging Middleware
Write Express middleware that logs all requests and responses.
Include: method, path, status, duration, request ID.
Mask auth headers.
import { v4 as uuidv4 } from 'uuid';
import type { Request, Response, NextFunction } from 'express';
import { createRequestLogger } from './logger';
// (assumes an Express Request augmentation declaring requestId, log, and user)

export function requestLoggerMiddleware(req: Request, res: Response, next: NextFunction) {
  const requestId = (req.headers['x-request-id'] as string) ?? uuidv4();
  const start = Date.now();
  // Attach to request for use in handlers
  req.requestId = requestId;
  req.log = createRequestLogger(requestId);
  // Pass request ID to downstream services
  res.setHeader('x-request-id', requestId);
  res.on('finish', () => {
    const duration = Date.now() - start;
    const level = res.statusCode >= 500 ? 'error' : res.statusCode >= 400 ? 'warn' : 'info';
    req.log[level]({
      msg: 'HTTP request',
      method: req.method,
      path: req.route?.path ?? req.path,
      status: res.statusCode,
      durationMs: duration,
      contentLength: res.getHeader('content-length'),
      userAgent: req.headers['user-agent'],
      // userId if authenticated:
      userId: req.user?.id,
    });
  });
  next();
}
req.route?.path logs the route pattern (/users/:id) rather than the actual URL (/users/123) — making logs groupable without high-cardinality issues.
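When no route pattern is available (404s, or contexts outside Express routing), a fallback normalizer can collapse variable segments so raw URLs still group cleanly. A sketch (hypothetical helper, not part of the middleware above):

```typescript
// Collapses high-cardinality path segments into placeholders so that
// /users/123 and /users/456 produce the same log/metric label value.
const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function normalizePath(path: string): string {
  return path
    .split('/')
    .map((seg) => {
      if (/^\d+$/.test(seg)) return ':id'; // numeric IDs
      if (UUID_RE.test(seg)) return ':uuid'; // UUIDs
      return seg;
    })
    .join('/');
}
```

For example, `normalizePath('/users/123/orders/550e8400-e29b-41d4-a716-446655440000')` yields `/users/:id/orders/:uuid`, while static paths like `/health` pass through unchanged.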
OpenTelemetry Tracing
Instrumentation Setup
Add OpenTelemetry tracing to a Node.js service.
Auto-instrument: HTTP, Express, PostgreSQL.
Send traces to Datadog.
// otel.ts — must be required BEFORE other imports
import { NodeSDK } from '@opentelemetry/sdk-node';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: process.env.SERVICE_NAME ?? 'api',
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [
    new HttpInstrumentation({
      ignoreIncomingRequestHook: (req) =>
        req.url === '/health' || req.url === '/metrics', // Don't trace health checks
    }),
    new ExpressInstrumentation(),
    new PgInstrumentation({
      dbStatementSerializer: (statement) =>
        statement.length > 500 ? statement.slice(0, 500) + '...' : statement,
    }),
  ],
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown());
The dbStatementSerializer truncates long SQL queries to prevent bloating traces. The ignoreIncomingRequestHook skips health check tracing — without it, health checks generate noise in dashboards.
Custom Spans
Add custom spans for business logic operations.
I want to trace: user authentication check, cache lookups,
and external API calls.
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('api', '1.0.0');

async function getUserWithCache(userId: string): Promise<User | null> {
  return tracer.startActiveSpan('getUserWithCache', async (span) => {
    span.setAttributes({
      'user.id': userId,
      'cache.backend': 'redis',
    });
    try {
      // Cache lookup span — ended in finally so a Redis error can't leak it
      const cached = await tracer.startActiveSpan('cache.get', async (cacheSpan) => {
        try {
          cacheSpan.setAttribute('cache.key', `user:${userId}`);
          const result = await redis.get(`user:${userId}`);
          cacheSpan.setAttribute('cache.hit', result !== null);
          return result;
        } finally {
          cacheSpan.end();
        }
      });
      if (cached) {
        span.setAttribute('cache.hit', true);
        return JSON.parse(cached) as User;
      }
      // DB lookup on cache miss; repopulate with a 5-minute TTL
      const user = await db.user.findUnique({ where: { id: userId } });
      if (user) {
        await redis.setex(`user:${userId}`, 300, JSON.stringify(user));
      }
      span.setAttribute('cache.hit', false);
      return user;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}
span.recordException() captures the exception with its stack trace in the span. SpanStatusCode.ERROR marks the span red in trace viewers. Always call span.end() — use try/finally to ensure it even on exception.
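The try/finally pattern can be factored into a small reusable wrapper. The sketch below uses a minimal span interface (a hypothetical `MiniSpan`, mirroring only the parts of the OpenTelemetry Span API used here) so the shape is clear; with `@opentelemetry/api` you would pass real spans from `tracer.startActiveSpan`:

```typescript
// Minimal span shape — mirrors the Span methods used in this pattern.
interface MiniSpan {
  recordException(err: Error): void;
  setStatus(status: { code: number }): void;
  end(): void;
}

const ERROR_CODE = 2; // numeric value of SpanStatusCode.ERROR

// Runs `fn` inside a span and guarantees span.end() is called exactly once,
// whether fn resolves or throws — the try/finally pattern from the text.
async function withSpan<T>(span: MiniSpan, fn: () => Promise<T>): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    span.recordException(err as Error);
    span.setStatus({ code: ERROR_CODE });
    throw err; // rethrow so callers still see the failure
  } finally {
    span.end();
  }
}
```

Because `end()` lives in `finally`, even an exception thrown synchronously inside `fn` can't leave a dangling span.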
Prometheus Metrics
Metrics Setup
Add Prometheus metrics to an Express API.
Track: request rate, error rate, latency, active connections.
import { Registry, Counter, Histogram, Gauge, collectDefaultMetrics } from 'prom-client';

const register = new Registry();
collectDefaultMetrics({ register }); // Node.js process metrics

export const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register],
});

export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration histogram',
  labelNames: ['method', 'route'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [register],
});

export const activeConnections = new Gauge({
  name: 'http_active_connections',
  help: 'Active HTTP connections',
  registers: [register],
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Middleware to record metrics
export function metricsMiddleware(req: Request, res: Response, next: NextFunction) {
  // Start the timer unlabeled: req.route is only set after the router has
  // matched, so the route label is supplied when the response finishes.
  const end = httpRequestDuration.startTimer();
  activeConnections.inc();
  res.on('finish', () => {
    const route = req.route?.path ?? req.path;
    end({ method: req.method, route });
    activeConnections.dec();
    httpRequestsTotal.inc({
      method: req.method,
      route,
      status_code: res.statusCode.toString(),
    });
  });
  next();
}
The histogram buckets are chosen to match typical web API latency ranges. Claude explains bucket selection — too fine-grained (many buckets) is expensive; too coarse loses precision at the important thresholds.
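What PromQL later does with these buckets can be sketched: a Prometheus histogram stores cumulative counts per upper bound, and `histogram_quantile` linearly interpolates within the bucket where the target rank falls. A simplified version of that interpolation (ignoring the `+Inf` bucket and the `rate()` step):

```typescript
// Cumulative histogram buckets: counts[i] = observations <= bounds[i].
// Estimates a quantile by linear interpolation inside the bucket where the
// target rank falls — the same idea as PromQL's histogram_quantile().
function quantileFromBuckets(bounds: number[], counts: number[], q: number): number {
  const total = counts[counts.length - 1];
  const rank = q * total;
  for (let i = 0; i < bounds.length; i++) {
    if (counts[i] >= rank) {
      const lower = i === 0 ? 0 : bounds[i - 1];
      const prevCount = i === 0 ? 0 : counts[i - 1];
      const bucketCount = counts[i] - prevCount;
      if (bucketCount === 0) return lower;
      // Interpolate the rank's position within this bucket's range
      return lower + ((rank - prevCount) / bucketCount) * (bounds[i] - lower);
    }
  }
  return bounds[bounds.length - 1]; // rank beyond the last finite bucket
}
```

This is why bucket placement matters: the quantile estimate can never be more precise than the width of the bucket it lands in, so the bounds need to be dense around the latencies you care about.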
Grafana Dashboard Configuration
Write a Grafana dashboard JSON for the API metrics.
Panels: request rate, error rate, p99 latency,
and top slowest routes.
Claude generates the dashboard JSON with PromQL queries:
- Request rate: rate(http_requests_total[5m])
- Error rate: rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m])
- p99 latency: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
- Slowest routes: topk(5, histogram_quantile(0.95, sum by (route, le) (rate(http_request_duration_seconds_bucket[5m]))))

Note that histogram_quantile needs the le label to survive aggregation — sum by (route, le), not sum by (route) — or the quantile cannot be computed.
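The rate() function underlying all of these queries can also be sketched: given two samples of a monotonic counter, the per-second rate is the increase divided by the window. A simplified version (real rate() extrapolates to window boundaries and checks resets per-sample):

```typescript
// Sketch of PromQL rate(): per-second increase of a monotonic counter
// between two samples. A drop in value means the counter reset (e.g. a
// process restart), so the current value is treated as the increase.
function counterRate(prev: number, curr: number, windowSeconds: number): number {
  const increase = curr >= prev ? curr - prev : curr; // reset detected
  return increase / windowSeconds;
}
```

For example, a counter going from 100 to 400 over a 300-second window is 1 request/second; a counter that restarts at 20 over a 10-second window is treated as 2 requests/second.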
Alerting Rules
Write Prometheus alerting rules for:
- Error rate > 5% for 5 minutes
- p99 latency > 2 seconds for 5 minutes
- Service down for 1 minute
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 5%"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 2 seconds"
      - alert: ServiceDown
        # job="api" is an assumed scrape job name — match it to your Prometheus config
        expr: up{job="api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "API service is down"
for: 5m means the condition must be true for 5 minutes before alerting — prevents alerting on transient spikes.
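The for: behavior is a small state machine — inactive, pending, firing — and can be sketched outside Prometheus (a simplified model; real Prometheus evaluates rules on an interval and tracks per-series state):

```typescript
// Sketch of Prometheus' `for:` debounce: the alert becomes "pending" when
// the condition first turns true and "firing" only after it has stayed
// true for `forMs` — any false evaluation resets the timer.
type AlertState = 'inactive' | 'pending' | 'firing';

class ForDebounce {
  private since: number | null = null;
  private readonly forMs: number;

  constructor(forMs: number) {
    this.forMs = forMs;
  }

  evaluate(conditionTrue: boolean, nowMs: number): AlertState {
    if (!conditionTrue) {
      this.since = null; // condition cleared — reset the timer
      return 'inactive';
    }
    this.since ??= nowMs; // first true evaluation starts the clock
    return nowMs - this.since >= this.forMs ? 'firing' : 'pending';
  }
}
```

A 4-minute spike that briefly clears never fires: the single false evaluation resets the clock, which is exactly the transient-spike protection the text describes.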
Error Tracking
Integrate Sentry for error tracking.
Capture: unhandled exceptions, performance transactions,
and business errors with additional context.
import * as Sentry from '@sentry/node';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
  beforeSend(event) {
    // Scrub PII from exception events
    if (event.user?.email) {
      event.user.email = event.user.email.replace(/^.*@/, '***@');
    }
    return event;
  },
});

// Capture business errors with context
export function captureBusinessError(error: Error, context: Record<string, unknown>) {
  Sentry.withScope(scope => {
    scope.setExtras(context);
    scope.setTag('error_type', 'business_logic');
    Sentry.captureException(error);
  });
}
10% sampling rate in production (tracesSampleRate: 0.1) — 100% would be expensive. beforeSend hook scrubs PII before events leave the server.
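Trace sampling is typically deterministic by trace ID rather than a per-event coin flip, so every service that sees a trace makes the same keep/drop decision and traces are never half-sampled. A hash-based sketch (illustrative only — not Sentry's or OpenTelemetry's actual sampler):

```typescript
// Deterministic head sampling sketch: hash the trace ID into [0, 1) and
// keep the trace when the hash falls under the sample rate. The same ID
// always yields the same decision, across services and restarts.
function shouldSample(traceId: string, sampleRate: number): boolean {
  let hash = 0;
  for (let i = 0; i < traceId.length; i++) {
    hash = (hash * 31 + traceId.charCodeAt(i)) >>> 0; // unsigned 32-bit rolling hash
  }
  return hash / 0xffffffff < sampleRate;
}
```

With sampleRate 0.1, roughly 10% of trace IDs hash below the threshold; the key property is not the exact percentage but that `shouldSample(id, rate)` is stable for a given ID.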
Observability with Claude Code
The most effective observability workflow: show Claude your service code and describe what you want to be able to answer when something goes wrong (“why is this endpoint slow?”, “which users are seeing errors?”). Claude generates instrumentation targeted at answering those questions, not generic coverage.
For the full production setup — connecting tracing to your deployment pipeline and alerting to on-call workflows — the Docker guide and Kubernetes guide cover the infrastructure side. For interpreting slow query traces with database-level observability, see the database guide. The Claude Skills 360 bundle includes observability skill sets covering OpenTelemetry setup, PromQL queries, and Grafana dashboard templates for common service types. Start with the free tier.