Observability for HTTP Status Codes: Metrics, Logs, and Alerts

How to monitor HTTP status code distributions using Prometheus counters, structured logging, dashboard design, and alerting rules — plus distributed tracing to correlate error status codes across service hops.

Why Status Code Observability Matters

A spike in 500 errors is obvious. But a slow drift from 0.1% to 2% errors over a week is invisible without proper instrumentation. Status code observability means you know the distribution of every response code your service returns, can alert on anomalies, and can trace individual error requests end-to-end.

The RED method (Rate, Errors, Duration) is the standard framework for service health monitoring:

  • Rate — requests per second (total throughput)
  • Errors — non-2xx responses as a fraction of total requests
  • Duration — response time latency distribution

Status codes are the primary source of truth for the Errors dimension.
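
Using the http_requests_total counter and http_request_duration_seconds histogram defined below, the three RED signals map onto PromQL roughly like this (a sketch):

# Rate: total requests per second
sum(rate(http_requests_total[5m]))

# Errors: non-2xx fraction of total
sum(rate(http_requests_total{status_code!~'2..'}[5m]))
/
sum(rate(http_requests_total[5m]))

# Duration: p99 latency across all requests
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)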

Status Code Metrics with Prometheus

Counter by Status Code Class

The foundational metric is a counter labeled by status code. Most HTTP instrumentation libraries expose this automatically:

# Django with django-prometheus
# pip install django-prometheus
# Adds django_http_responses_total_by_status_total counter automatically

# Manual instrumentation with prometheus_client
from prometheus_client import Counter

http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'path', 'status_code']
)

# In your view or middleware:
http_requests_total.labels(
    method=request.method,
    path=request.path,
    status_code=str(response.status_code)
).inc()
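
The counter is only useful once Prometheus can scrape it. In Django, django-prometheus can expose /metrics through its bundled URLs; outside Django, prometheus_client's built-in exporter is the quickest sketch:

from prometheus_client import start_http_server

# Expose the default registry (including http_requests_total) on port 8000
start_http_server(8000)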

PromQL for Status Code Analysis

# Error rate (5xx fraction of total)
sum(rate(http_requests_total{status_code=~'5..'}[5m]))
/
sum(rate(http_requests_total[5m]))

# 4xx rate by path
topk(10,
  sum by (path) (
    rate(http_requests_total{status_code=~'4..'}[5m])
  )
)

# Success rate (2xx fraction of total)
sum(rate(http_requests_total{status_code=~'2..'}[5m]))
/
sum(rate(http_requests_total[5m]))

Histogram for Latency by Status Code

Correlating latency with status codes reveals patterns: 504 Gateway Timeout requests cluster at the timeout boundary, while 200s have much lower p99:

from prometheus_client import Histogram

http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'status_code'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
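
Recording into the histogram happens once the response and its status code are known, for example in request middleware (assuming duration holds the elapsed request time in seconds):

http_request_duration_seconds.labels(
    method=request.method,
    status_code=str(response.status_code),
).observe(duration)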

Structured Logging

Metrics tell you *how many* errors; logs tell you *which requests* errored. The two systems complement each other and should share a common correlation ID.
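
One way to establish that shared ID is to let Nginx reuse an inbound X-Request-ID when present and mint one otherwise. A sketch using the built-in $request_id variable ($correlation_id is just an illustrative name):

# http context: prefer the client-supplied ID, fall back to a generated one
map $http_x_request_id $correlation_id {
    ""      $request_id;
    default $http_x_request_id;
}

# server/location context: forward the ID to the application
proxy_set_header X-Request-ID $correlation_id;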

JSON Access Log Format

Configure Nginx to emit structured JSON logs instead of the Combined Log Format:

log_format json_access escape=json
  '{'
    '"time":"$time_iso8601",'
    '"status":$status,'
    '"method":"$request_method",'
    '"path":"$request_uri",'
    '"duration":$request_time,'
    '"bytes":$body_bytes_sent,'
    '"upstream_status":"$upstream_status",'
    '"upstream_time":"$upstream_response_time",'
    '"request_id":"$http_x_request_id",'
    '"user_agent":"$http_user_agent",'
    '"referer":"$http_referer"'
  '}';

access_log /var/log/nginx/access.log json_access;

Application-Level Structured Logging

# Django middleware for structured request logging
import time

import structlog

logger = structlog.get_logger(__name__)

class RequestLoggingMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        start = time.monotonic()
        response = self.get_response(request)
        duration = time.monotonic() - start

        logger.info(
            'http_request',
            method=request.method,
            path=request.path,
            status_code=response.status_code,
            duration_ms=round(duration * 1000, 2),
            request_id=request.META.get('HTTP_X_REQUEST_ID', ''),
        )
        return response
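
For those entries to come out as one JSON object per line, structlog needs a renderer configured once at startup. A minimal sketch:

import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt='iso'),  # ISO-8601 timestamp field
        structlog.processors.add_log_level,           # include the log level
        structlog.processors.JSONRenderer(),          # one JSON object per line
    ],
)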

Dashboard Design

A well-designed status code dashboard answers three questions instantly: Is the service healthy right now? Has anything changed recently? Where are errors coming from?

Panel | Type | Query / description
----- | ---- | --------------------
Current error rate | Stat | `rate(errors[5m]) / rate(total[5m])`
Status code distribution | Pie chart | Counts by 2xx/3xx/4xx/5xx
Error rate over time | Time series | 5xx and 4xx rates together
Top error paths | Table | 4xx+5xx grouped by path
Latency by status | Heatmap | p50/p95/p99 per status class
Upstream vs app errors | Time series | Compare Nginx upstream_status to app status

Color conventions: green for 2xx, blue for 3xx, yellow for 4xx, red for 5xx.
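
Grafana can only group by labels that exist, so the status-class panels need the raw status_code label collapsed into a class. A PromQL sketch using label_replace (code_class is an illustrative label name):

sum by (code_class) (
  label_replace(
    rate(http_requests_total[5m]),
    'code_class', '${1}xx', 'status_code', '([0-9])..'
  )
)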

Alerting Rules

Error Rate Threshold

# Prometheus alerting rules
groups:
- name: http_status_codes
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status_code=~'5..'}[5m]))
      /
      sum(rate(http_requests_total[5m])) > 0.01
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: 'Error rate exceeded 1% for 5 minutes'
      description: 'Current error rate: {{ $value | humanizePercentage }}'

  - alert: SuddenErrorSpike
    expr: |
      rate(http_requests_total{status_code=~'5..'}[1m])
      /
      rate(http_requests_total{status_code=~'5..'}[1h] offset 1h) > 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: '5xx rate spiked 10x compared to the previous hour'

  - alert: Unusual4xxRate
    expr: |
      sum(rate(http_requests_total{status_code=~'4..'}[5m]))
      /
      sum(rate(http_requests_total[5m])) > 0.05
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: '4xx rate above 5% — possible client-side issue or scraper'

Error Budget Alerts

If you have a 99.9% SLO (0.1% error budget), alert when the burn rate (the observed error fraction divided by the allowed 0.1%) is high enough to exhaust the monthly budget well before the month ends:

# A burn rate above 14.4x spends ~2% of a 30-day budget per hour (14.4 / 720 hours); page immediately
(sum(rate(http_requests_total{status_code=~'5..'}[5m])) / sum(rate(http_requests_total[5m])))
/
(1 - 0.999) > 14.4

Distributed Tracing

Metrics and logs answer what and how many. Distributed tracing answers *why* — by recording the full request path across every service.

Sampling by Error Status

Sample 100% of error traces and only 1% of successful requests:

# OpenTelemetry sampler implementing the SDK Sampler interface
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class ErrorAwareSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        # Always sample if the span already carries an error status code
        if attributes.get('http.status_code', 200) >= 500:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        # 1% sample for successful requests
        decision = Decision.RECORD_AND_SAMPLE if trace_id % 100 == 0 else Decision.DROP
        return SamplingResult(decision, attributes)

    def get_description(self):
        return 'ErrorAwareSampler'
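
Installing the sampler happens once at SDK initialization (a sketch):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider(sampler=ErrorAwareSampler()))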

Jaeger and Zipkin

Both Jaeger and Zipkin let you search traces by status code. Tag your spans:

from opentelemetry import trace

span = trace.get_current_span()
span.set_attribute('http.status_code', response.status_code)
span.set_attribute('http.method', request.method)
span.set_attribute('http.url', request.build_absolute_uri())

if response.status_code >= 500:
    span.set_status(trace.StatusCode.ERROR, f'HTTP {response.status_code}')

This lets you query Jaeger for all traces where http.status_code=500 and see exactly which services were involved, how long each step took, and where the error originated.

The complete observability stack for HTTP status codes:

Signal | Tool | Use case
------ | ---- | ---------
Metrics | Prometheus + Grafana | Error rate trends, alerting
Logs | Loki / CloudWatch / ELK | Individual request inspection
Traces | Jaeger / Zipkin / Tempo | Cross-service error path
Dashboards | Grafana | Unified status code view
