Grafana Loki provides cost-efficient log aggregation by indexing only log labels, not log content, which keeps storage costs low while still enabling fast label-based filtering. LogQL combines label matchers with filter expressions and metric queries over log streams, and Grafana visualizes both Loki logs and Prometheus metrics in unified dashboards. Because a dashboard is a JSON model, dashboards can live in version control as code, and alerting rules defined in YAML fire on both log patterns and metrics. Promtail agents on each node tail log files and push them to Loki. Claude Code generates structured logging configurations, LogQL queries, Grafana dashboard JSON, Promtail pipeline stages, and alerting rule YAML for production observability stacks.
## CLAUDE.md for Loki/Grafana Stack
## Observability Stack
- Loki >= 3.2, Grafana >= 11, Promtail >= 3.2 (or Alloy/Vector as shipper)
- Log format: structured JSON with consistent field names across services
- Labels: cluster, namespace, app, pod — keep cardinality LOW (<100 values per label)
- LogQL: use label matchers first (fast), then regex (slow) — same principle as SQL indexes
- Dashboards: commit as JSON to git, use Grafonnet or raw JSON, never click-to-create in prod
- Alerts: use Grafana alerting with Loki data source for log-based alerts
- Retention: configure per-tenant limits in Loki ruler config
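The cardinality budget above can be enforced in CI or checked ad hoc. A minimal sketch, assuming you have already fetched each label's values (in practice from Loki's `GET /loki/api/v1/label/<name>/values` endpoint); the function name and the 100-value budget mirror the guideline but are otherwise illustrative:

```python
# check_cardinality.py — flag Loki labels whose distinct value count
# exceeds a budget. Takes an already-fetched mapping of label -> values.

MAX_VALUES_PER_LABEL = 100  # the budget from the guideline above

def high_cardinality_labels(label_values: dict[str, list[str]],
                            budget: int = MAX_VALUES_PER_LABEL) -> list[str]:
    """Return labels whose distinct value count exceeds the budget."""
    return sorted(
        label for label, values in label_values.items()
        if len(set(values)) > budget
    )
```

Run this against your label list periodically; a label like `request_id` showing up here means it should stay as a structured field inside the log line, not a label.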
## Structured Logging
# logging/structured_logger.py — produce Loki-friendly JSON logs
import logging
import json
import sys
import traceback
from datetime import datetime, timezone
from contextvars import ContextVar
# Context vars: automatically included in all log lines
request_id: ContextVar[str] = ContextVar('request_id', default='')
user_id: ContextVar[str] = ContextVar('user_id', default='')
class StructuredFormatter(logging.Formatter):
"""Emit JSON log lines with consistent fields for Loki parsing."""
SERVICE_NAME = "order-service"
VERSION = "1.0.0"
def format(self, record: logging.LogRecord) -> str:
log_entry = {
"timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
"level": record.levelname,
"logger": record.name,
"message": record.getMessage(),
"service": self.SERVICE_NAME,
"version": self.VERSION,
}
# Add context vars
if req_id := request_id.get():
log_entry["request_id"] = req_id
if uid := user_id.get():
log_entry["user_id"] = uid
# Add structured fields from extra={}
for key, value in record.__dict__.items():
if key not in {
"args", "asctime", "created", "exc_info", "exc_text",
"filename", "funcName", "id", "levelname", "levelno",
"lineno", "message", "module", "msecs", "msg", "name",
"pathname", "process", "processName", "relativeCreated",
"stack_info", "thread", "threadName", "taskName",
}:
log_entry[key] = value
# Include exception info
if record.exc_info:
log_entry["exception"] = {
"type": record.exc_info[0].__name__ if record.exc_info[0] else None,
"message": str(record.exc_info[1]),
"traceback": traceback.format_exception(*record.exc_info),
}
return json.dumps(log_entry, default=str)
def setup_logging(level: str = "INFO") -> None:
"""Configure structured logging for the application."""
root_logger = logging.getLogger()
root_logger.setLevel(getattr(logging, level))
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(StructuredFormatter())
root_logger.handlers = [handler]
# Usage in application code
logger = logging.getLogger("orders")
def create_order(customer_id: str, amount: int) -> dict:
logger.info(
"Creating order",
extra={
"customer_id": customer_id,
"amount_cents": amount,
"event": "order.create.start",
}
)
    try:
        start = datetime.now(timezone.utc)
        order = _do_create(customer_id, amount)
        duration_ms = (datetime.now(timezone.utc) - start).total_seconds() * 1000
        logger.info(
            "Order created",
            extra={
                "order_id": order["id"],
                "customer_id": customer_id,
                "amount_cents": amount,
                "event": "order.create.success",
                "duration_ms": round(duration_ms, 1),
            }
        )
return order
except Exception as e:
logger.error(
"Order creation failed",
extra={
"customer_id": customer_id,
"error_type": type(e).__name__,
"event": "order.create.failure",
},
exc_info=True,
)
raise
## LogQL Queries
# Common LogQL patterns for application debugging
# 1. Filter logs by label (fast — index scan)
{app="order-service", namespace="production"}
# 2. Filter by level within a stream
{app="order-service"} | json | level="ERROR"
# 3. Filter by specific field value
{app="order-service"} | json | event="order.create.failure"
# 4. Regex filter on message
{app="order-service"} |~ "payment.*failed"
# 5. Error rate per second over a 5-minute window (metric query)
rate({app="order-service"} | json | level="ERROR" [5m])
# 6. 95th percentile order creation latency
quantile_over_time(0.95,
{app="order-service"}
| json
| event="order.create.success"
| unwrap duration_ms [5m]
) by (pod)
# 7. Top error types in last hour
topk(10,
sum by (error_type) (
count_over_time(
{app="order-service"} | json | level="ERROR" [1h]
)
)
)
# 8. Request rate by endpoint with status codes
sum by (path, status_code) (
rate(
{app="api-gateway"} | json | event="request.complete" [1m]
)
)
## Promtail Configuration
# promtail-config.yaml — tail logs and push to Loki
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
tenant_id: default
batchwait: 1s
batchsize: 1048576
scrape_configs:
# Kubernetes pod logs
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
pipeline_stages:
# Parse JSON logs
      - json:
          expressions:
            timestamp: timestamp   # extracted so the timestamp stage below can use it
            level: level
            message: message
            event: event
            request_id: request_id
            duration_ms: duration_ms
# Promote parsed fields to labels (KEEP CARDINALITY LOW)
- labels:
level:
event:
# Set log timestamp from parsed field
- timestamp:
source: timestamp
format: RFC3339Nano
relabel_configs:
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
- source_labels: [__meta_kubernetes_pod_label_app]
target_label: app
- source_labels: [__meta_kubernetes_pod_container_name]
target_label: container
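To cut ingest volume at the source, Promtail can discard noisy lines before they reach Loki. A sketch of a `drop` stage placed after the `json` stage above (dropping DEBUG is an example policy; adjust per environment):

```yaml
# Drop DEBUG-level lines after JSON parsing; they never reach Loki.
- drop:
    source: level
    value: DEBUG
```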
## Grafana Dashboard JSON
{
"title": "Order Service Dashboard",
"uid": "order-service-v1",
"refresh": "30s",
"time": {"from": "now-1h", "to": "now"},
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"datasource": {"type": "loki", "uid": "loki"},
"query": "label_values(namespace)",
"label": "Namespace"
}
]
},
"panels": [
{
"title": "Error Rate",
"type": "timeseries",
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
"targets": [
{
"datasource": {"type": "loki", "uid": "loki"},
"expr": "sum(rate({app=\"order-service\", namespace=\"$namespace\"} | json | level=\"ERROR\" [5m]))",
"legendFormat": "Errors/sec"
}
]
},
{
"title": "Order Creation Latency (p95)",
"type": "gauge",
"gridPos": {"x": 12, "y": 0, "w": 6, "h": 8},
"targets": [
{
"datasource": {"type": "loki", "uid": "loki"},
"expr": "quantile_over_time(0.95, {app=\"order-service\"} | json | unwrap duration_ms [5m])",
"legendFormat": "p95 ms"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 500},
{"color": "red", "value": 2000}
]
}
}
}
},
{
"title": "Recent Errors",
"type": "logs",
"gridPos": {"x": 0, "y": 8, "w": 24, "h": 12},
"targets": [
{
"datasource": {"type": "loki", "uid": "loki"},
"expr": "{app=\"order-service\", namespace=\"$namespace\"} | json | level=\"ERROR\"",
"maxLines": 100
}
]
}
]
}
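Because dashboards live in git, a CI step can sanity-check the JSON model before provisioning. A minimal sketch; the required-key list is an assumption based on the fields used above, not an exhaustive schema check:

```python
# validate_dashboard.py — sanity-check committed dashboard JSON in CI.
import json

def validate_dashboard(raw: str) -> list[str]:
    """Return a list of problems found in a dashboard JSON document."""
    errors = []
    dash = json.loads(raw)
    for key in ("title", "uid", "panels"):
        if key not in dash:
            errors.append(f"missing top-level key: {key}")
    for i, panel in enumerate(dash.get("panels", [])):
        if "gridPos" not in panel:
            errors.append(f"panel {i} ({panel.get('title', '?')}) missing gridPos")
        if not panel.get("targets"):
            errors.append(f"panel {i} ({panel.get('title', '?')}) has no targets")
    return errors
```

Failing the build on a non-empty error list catches hand-edited JSON before it reaches the provisioned Grafana instance.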
## Alert Rules
# alerts/order-service.yaml — Grafana alerting rules
apiVersion: 1
groups:
- name: order-service
folder: Application Alerts
interval: 1m
rules:
- uid: high-error-rate
title: High Error Rate
condition: A
data:
- refId: A
queryType: range
relativeTimeRange:
from: 300
to: 0
datasourceUid: loki
model:
expr: |
sum(rate({app="order-service"} | json | level="ERROR" [5m])) > 0.1
noDataState: OK
execErrState: Error
for: 2m
annotations:
summary: "Order service error rate above 0.1/s"
runbook: "https://runbooks.internal.com/order-service-errors"
labels:
severity: warning
team: platform
- uid: order-creation-failures
title: Order Creation Failures Spike
condition: A
      data:
        - refId: A
          queryType: range
          relativeTimeRange:
            from: 300
            to: 0
          datasourceUid: loki
          model:
            expr: |
              sum(count_over_time({app="order-service"} | json | event="order.create.failure" [5m])) > 10
for: 1m
annotations:
summary: "More than 10 order creation failures in 5 minutes"
labels:
severity: critical
For the Prometheus metrics and alerting stack that pairs with Loki in a complete metrics-plus-logs platform, see the OpenTelemetry guide, which covers tracing, metrics, and log collection with the OTel Collector. The observability guide covers distributed tracing and the correlation IDs that feed structured logs into Loki. The Claude Skills 360 bundle includes Loki/Grafana skill sets covering LogQL queries, dashboard JSON, and Promtail configuration; start with the free tier to try Grafana dashboard generation.