## Why Status Code Observability Matters
A spike in 500 errors is obvious. But a slow drift from 0.1% to 2% errors over a week is invisible without proper instrumentation. Status code observability means you know the distribution of every response code your service returns, can alert on anomalies, and can trace individual error requests end-to-end.
The RED method (Rate, Errors, Duration) is the standard framework for service health monitoring:
- Rate — requests per second (total throughput)
- Errors — failed requests (5xx, plus whichever 4xx you treat as failures) as a fraction of total requests
- Duration — response time latency distribution
Status codes are the primary source of truth for the Errors dimension.
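The three signals can be sketched in plain Python over a single batch of request samples. This is a toy illustration (the `red_signals` helper is hypothetical), not a substitute for real time-series instrumentation:

```python
# Compute the three RED signals from (status_code, duration_seconds) samples.
def red_signals(samples, window_seconds):
    total = len(samples)
    rate = total / window_seconds                        # Rate: requests/second
    errors = sum(1 for s, _ in samples if s >= 500)      # Errors: 5xx count
    error_ratio = errors / total if total else 0.0
    durations = sorted(d for _, d in samples)
    # Duration: p95 latency via nearest-rank percentile
    p95 = durations[max(0, int(0.95 * total) - 1)] if durations else 0.0
    return rate, error_ratio, p95
```

Feeding it 98 fast successes and 2 slow errors over a 10-second window yields a rate of 10 req/s and an error ratio of 2%.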
## Status Code Metrics with Prometheus

### Counter by Status Code Class
The foundational metric is a counter labeled by status code. Most HTTP instrumentation libraries expose this automatically:
```python
# Django with django-prometheus
# pip install django-prometheus
# Adds a django_http_responses_total_by_status_total counter automatically

# Manual instrumentation with prometheus_client
from prometheus_client import Counter

http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'path', 'status_code']
)

# In your view or middleware (prefer the route template over the raw URL
# as the path label, to keep label cardinality bounded):
http_requests_total.labels(
    method=request.method,
    path=request.path,
    status_code=str(response.status_code)
).inc()
```
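One caveat worth encoding early: raw URL paths like `/users/12345` create unbounded label cardinality. A minimal sketch of a path normalizer (the `normalize_path` helper and its ID patterns are assumptions; adjust to your routing scheme):

```python
import re

# Collapse numeric and UUID path segments into a placeholder so each
# route contributes one label value, not one per entity ID.
_ID_SEGMENT = re.compile(
    r'/(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})(?=/|$)'
)

def normalize_path(path: str) -> str:
    return _ID_SEGMENT.sub('/{id}', path)
```

With this in place, `/users/12345/orders/67` and `/users/99/orders/1` both count toward `/users/{id}/orders/{id}`.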
### PromQL for Status Code Analysis
```promql
# Error rate (5xx fraction of total)
sum(rate(http_requests_total{status_code=~'5..'}[5m]))
/
sum(rate(http_requests_total[5m]))

# 4xx rate by path
topk(10,
  sum by (path) (
    rate(http_requests_total{status_code=~'4..'}[5m])
  )
)

# Success rate (availability SLI)
sum(rate(http_requests_total{status_code=~'2..'}[5m]))
/
sum(rate(http_requests_total[5m]))
```
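Under the hood, `rate()` computes the per-second increase of a cumulative counter between scrapes (plus counter-reset handling, elided here). The arithmetic in miniature, with hypothetical helper names:

```python
# A counter sample is a (timestamp, cumulative_value) pair.
def per_second_rate(sample_start, sample_end):
    (t0, v0), (t1, v1) = sample_start, sample_end
    return (v1 - v0) / (t1 - t0)

# The error-rate query above is just a ratio of two such rates.
def error_ratio(errors_rate, total_rate):
    return errors_rate / total_rate if total_rate else 0.0
```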
### Histogram for Latency by Status Code
Correlating latency with status codes reveals patterns: 504 Gateway Timeout requests cluster at the timeout boundary, while 200s have much lower p99:
```python
from prometheus_client import Histogram

http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'status_code'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
```
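Prometheus histogram buckets are cumulative: each `le` bucket counts every observation at or below its bound, and `+Inf` counts them all. A stdlib sketch of that semantics (the `bucket_counts` helper is illustrative only):

```python
# Cumulative bucket counting, as Prometheus histograms do it.
def bucket_counts(observations, bounds):
    counts = {le: sum(1 for o in observations if o <= le) for le in bounds}
    counts['+Inf'] = len(observations)
    return counts
```

Three observations of 0.03s, 0.2s, and 0.7s against bounds `[0.05, 0.25, 0.5, 1]` give counts 1, 2, 2, 3, which is why a 504 clustering at the timeout boundary shows up as a jump in the highest finite bucket.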
## Structured Logging
Metrics tell you *how many* errors; logs tell you *which requests* errored. The two systems complement each other and should share a common correlation ID.
### JSON Access Log Format
Configure Nginx to emit structured JSON logs instead of the Combined Log Format:
```nginx
log_format json_access escape=json
  '{'
    '"time":"$time_iso8601",'
    '"status":$status,'
    '"method":"$request_method",'
    '"path":"$request_uri",'
    '"duration":$request_time,'
    '"bytes":$body_bytes_sent,'
    '"upstream_status":"$upstream_status",'
    '"upstream_time":"$upstream_response_time",'
    '"request_id":"$http_x_request_id",'
    '"user_agent":"$http_user_agent",'
    '"referer":"$http_referer"'
  '}';

access_log /var/log/nginx/access.log json_access;
```

Note the format is built from concatenated single-line string fragments: embedding literal newlines in the format string would break the one-JSON-object-per-line contract that log shippers rely on.
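Once logs are line-delimited JSON, aggregation is trivial. A minimal stdlib sketch (the `status_class_counts` helper is hypothetical) that tallies status classes from log lines shaped like the format above:

```python
import json
from collections import Counter

# Tally 2xx/3xx/4xx/5xx classes from line-delimited JSON access logs.
def status_class_counts(log_lines):
    classes = Counter()
    for line in log_lines:
        record = json.loads(line)
        classes[f"{record['status'] // 100}xx"] += 1
    return classes
```

The same one-liner logic works in Loki's LogQL or a CloudWatch Logs Insights query, since the field names are stable.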
### Application-Level Structured Logging
```python
# Django middleware for structured request logging
import time

import structlog

logger = structlog.get_logger(__name__)

class RequestLoggingMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        start = time.monotonic()
        response = self.get_response(request)
        duration = time.monotonic() - start
        logger.info(
            'http_request',
            method=request.method,
            path=request.path,
            status_code=response.status_code,
            duration_ms=round(duration * 1000, 2),
            request_id=request.META.get('HTTP_X_REQUEST_ID', ''),
        )
        return response
```
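The middleware reads `X-Request-ID` but assumes an upstream proxy set it. A small sketch of generating one when it is missing, so logs and traces can share a correlation ID (`ensure_request_id` is a hypothetical helper you would call at the top of the middleware):

```python
import uuid

# Return the existing request ID, or mint one and store it so every
# downstream log line and span sees the same value.
def ensure_request_id(meta: dict) -> str:
    rid = meta.get('HTTP_X_REQUEST_ID')
    if not rid:
        rid = uuid.uuid4().hex
        meta['HTTP_X_REQUEST_ID'] = rid
    return rid
```

Echoing the same ID back in an `X-Request-ID` response header lets users quote it in support tickets, closing the loop from complaint to log line.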
## Dashboard Design
A well-designed status code dashboard answers three questions instantly: Is the service healthy right now? Has anything changed recently? Where are errors coming from?
### Recommended Panel Layout
| Panel | Type | Query |
|---|---|---|
| Current error rate | Stat | `rate(errors[5m]) / rate(total[5m])` |
| Status code distribution | Pie chart | Counts by 2xx/3xx/4xx/5xx |
| Error rate over time | Time series | 5xx and 4xx rates together |
| Top error paths | Table | 4xx+5xx grouped by path |
| Latency by status | Heatmap | p50/p95/p99 per status class |
| Upstream vs app errors | Time series | Compare Nginx upstream_status to app status |
Color conventions: green for 2xx, blue for 3xx, yellow for 4xx, red for 5xx.
## Alerting Rules

### Error Rate Threshold
```yaml
# Prometheus alerting rules
groups:
  - name: http_status_codes
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~'5..'}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Error rate exceeded 1% for 5 minutes'
          description: 'Current error rate: {{ $value | humanizePercentage }}'

      - alert: SuddenErrorSpike
        expr: |
          sum(rate(http_requests_total{status_code=~'5..'}[1m]))
          /
          sum(rate(http_requests_total{status_code=~'5..'}[1h] offset 1h)) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: '5xx rate spiked 10x compared with the previous hour'

      - alert: Unusual4xxRate
        expr: |
          sum(rate(http_requests_total{status_code=~'4..'}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: '4xx rate above 5% — possible client-side issue or scraper'
```
### Error Budget Alerts
If you have a 99.9% SLO (0.1% error budget), alert when the budget burn rate is too high to survive the month:
```promql
# Fast burn: a 14.4x burn rate consumes 2% of a 30-day budget per hour, so page immediately
# ('errors' and 'total' are placeholders for your real series)
(sum(rate(errors[5m])) / sum(rate(total[5m])))
/
(1 - 0.999) > 14.4
```
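The 14.4 threshold falls out of simple arithmetic: burn rate is the observed error ratio divided by the error budget, and the budget window divided by the burn rate gives time to exhaustion. A sketch with hypothetical helper names:

```python
# Burn rate: how many times faster than "exactly on budget" you are
# spending the error budget.
def burn_rate(error_ratio, slo=0.999):
    return error_ratio / (1 - slo)

# At a constant burn rate, how long until the budget window's budget is gone.
def hours_to_exhaustion(rate, window_days=30):
    return (window_days * 24) / rate
```

At 14.4x, a 30-day budget lasts about 50 hours and loses 2% of itself every hour, which is why this threshold pages rather than warns.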
## Distributed Tracing
Metrics and logs answer what and how many. Distributed tracing answers *why* — by recording the full request path across every service.
### Sampling by Error Status
Sample 100% of error traces and only 1% of successful requests:
```python
# OpenTelemetry custom sampler (simplified sketch)
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class ErrorAwareSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        # Always sample spans that already carry a 5xx status
        if attributes.get('http.status_code', 200) >= 500:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        # ~1% sample for everything else
        decision = Decision.RECORD_AND_SAMPLE if trace_id % 100 == 0 else Decision.DROP
        return SamplingResult(decision, attributes)

    def get_description(self):
        return 'ErrorAwareSampler'
```

Note that head sampling like this only sees attributes available at span creation; tail sampling (e.g. in the OpenTelemetry Collector) can decide after the status code is known.
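The `trace_id % 100` check keeps roughly 1 in 100 traces, since trace IDs are effectively uniform random integers. A quick stdlib illustration (sequential IDs stand in for random ones here); in production, OpenTelemetry's built-in `TraceIdRatioBased` sampler does ratio sampling for you:

```python
# Fraction of synthetic trace ids kept by the modulo check.
def sampled_fraction(n=100_000):
    kept = sum(1 for trace_id in range(n) if trace_id % 100 == 0)
    return kept / n
```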
### Jaeger and Zipkin
Both Jaeger and Zipkin let you search traces by status code. Tag your spans:
```python
from opentelemetry import trace

span = trace.get_current_span()
span.set_attribute('http.status_code', response.status_code)
span.set_attribute('http.method', request.method)
span.set_attribute('http.url', request.build_absolute_uri())
if response.status_code >= 500:
    span.set_status(trace.StatusCode.ERROR, f'HTTP {response.status_code}')
```
This lets you query Jaeger for all traces where `http.status_code=500` and see exactly which services were involved, how long each step took, and where the error originated.
The complete observability stack for HTTP status codes:
| Signal | Tool | Use Case |
|---|---|---|
| Metrics | Prometheus + Grafana | Error rate trends, alerting |
| Logs | Loki / CloudWatch / ELK | Individual request inspection |
| Traces | Jaeger / Zipkin / Tempo | Cross-service error path |
| Dashboards | Grafana | Unified status code view |