Health Check Endpoint Design Guide

How to design health check endpoints that accurately report service status — liveness, readiness, and startup probes with proper response formats.

Why Health Checks Matter

Health check endpoints serve multiple critical purposes:

  • Load balancers route traffic away from unhealthy instances
  • Kubernetes restarts containers that fail liveness probes
  • Monitoring systems (UptimeRobot, Datadog) alert on downtime
  • Deployment pipelines verify a new version is ready before sending traffic

A poorly designed health check — one that always returns 200 regardless of actual state — defeats all of the above.

Liveness vs Readiness vs Startup

Kubernetes defines three distinct probe types, and the distinction matters even outside Kubernetes:

Liveness

GET /health/live — Is the process alive and not in a deadlock?

This probe should be extremely lightweight: check that the event loop is running and the process can respond at all. If this fails, the container is killed and restarted.

Do not check external dependencies in liveness — a database outage should not cause your containers to restart in a loop.

Readiness

GET /health/ready — Is the service ready to receive traffic?

This probe should check all dependencies the service needs to function:

  • Database connection and query execution
  • Cache connectivity (Redis, Memcached)
  • Required external APIs
  • Configuration loaded

If readiness fails, the load balancer stops routing traffic to this instance — but does not restart it.

Startup

GET /health/startup — Has the application finished initializing?

For slow-starting services (loading large ML models, running migrations), the startup probe buys time before liveness kicks in. Once the startup probe succeeds, liveness and readiness probes begin.
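
One way to back the startup probe is a process-wide flag that initialization code sets when it finishes. The sketch below is a minimal illustration, not a prescribed implementation. The names `_startup_complete`, `mark_startup_complete`, and `startup_status` are hypothetical, and the function returns a plain `(http_code, body)` pair so that any framework's view can wrap it.

```python
import threading

# Hypothetical process-wide flag; set once initialization has finished.
_startup_complete = threading.Event()


def mark_startup_complete() -> None:
    """Call at the end of initialization (model loaded, migrations applied)."""
    _startup_complete.set()


def startup_status() -> tuple[int, dict]:
    """Return (HTTP code, JSON-serializable body) for GET /health/startup."""
    if _startup_complete.is_set():
        return 200, {"status": "pass"}
    # 503 tells the prober that initialization has not finished yet.
    return 503, {"status": "fail", "output": "still initializing"}
```

A `threading.Event` is used here so the flag can be set from a background loader thread and read safely from request threads.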

What to Check in Readiness

Database

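import logging

from django.db import connection

logger = logging.getLogger(__name__)

def check_database() -> tuple[bool, str]:
    """Return (healthy, message) after running a trivial query."""
    try:
        connection.ensure_connection()
        with connection.cursor() as cursor:
            # SELECT 1 proves the database can execute queries, not just accept connections.
            cursor.execute('SELECT 1')
        return True, 'ok'
    except Exception:
        # Log details server-side; raw driver errors can expose hosts or credentials.
        logger.exception('Database health check failed')
        return False, 'database unavailable'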

Cache

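import logging

from django.core.cache import cache

logger = logging.getLogger(__name__)

def check_cache() -> tuple[bool, str]:
    """Return (healthy, message) after round-tripping a short-lived key."""
    try:
        cache.set('health_check', '1', timeout=10)
        # Explicit comparison instead of assert, which is stripped under python -O.
        if cache.get('health_check') != '1':
            return False, 'cache read-back mismatch'
        return True, 'ok'
    except Exception:
        logger.exception('Cache health check failed')
        return False, 'cache unavailable'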

External Services

Only check truly required external services in readiness. Optional services (email sending, analytics) should not affect readiness — use the graceful degradation pattern instead.
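
The required-versus-optional split can be sketched as a small aggregation step. This is an illustration under assumptions: the function name `overall_status` and the dependency names in the usage note are hypothetical, and the return values are the pass, warn, and fail statuses described in the Response Format section.

```python
def overall_status(results: dict[str, bool], required: set[str]) -> str:
    """Combine per-dependency results into 'pass', 'warn', or 'fail'.

    results maps a dependency name to True (healthy) or False (unhealthy).
    Names listed in `required` gate readiness; all others only degrade it.
    """
    failed = {name for name, ok in results.items() if not ok}
    if failed & required:
        return "fail"  # a required dependency is down: not ready
    if failed:
        return "warn"  # only optional dependencies are down: degraded
    return "pass"
```

For example, `overall_status({"database": True, "email": False}, required={"database"})` yields `"warn"`, so the instance keeps receiving traffic while email sending is degraded.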

Response Format

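A common response format is proposed in the Internet-Draft "Health Check Response Format for HTTP APIs", [draft-inadarei-api-health-check](https://inadarei.github.io/rfc-healthcheck/). It is a draft rather than a ratified standard, but it gives monitoring tools a consistent shape to parse: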

{
  "status": "pass",
  "version": "1.0.0",
  "releaseId": "v2026.2.25.1",
  "checks": {
    "database": [{
      "status": "pass",
      "responseTime": 12
    }],
    "cache": [{
      "status": "pass",
      "responseTime": 3
    }]
  }
}

Status values: pass (healthy), fail (unhealthy), warn (degraded but functional).

HTTP status codes:

| Health Status | HTTP Code |
|---------------|-----------|
| `pass`        | 200       |
| `warn`        | 200       |
| `fail`        | 503       |
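
The status-to-code mapping can be kept in one place so every endpoint answers consistently. The following is a minimal sketch: `HTTP_CODE_BY_STATUS` and `render_health` are hypothetical names, and the `application/health+json` media type is the one the draft proposes, so treat it as optional if your clients expect plain `application/json`.

```python
import json

# Mapping from health status to HTTP status code, as in the table above.
HTTP_CODE_BY_STATUS = {"pass": 200, "warn": 200, "fail": 503}


def render_health(status: str, checks: dict) -> tuple[int, dict[str, str], str]:
    """Return (HTTP code, headers, JSON body) for a health response."""
    code = HTTP_CODE_BY_STATUS[status]  # KeyError on an unknown status is intentional
    headers = {"Content-Type": "application/health+json"}
    body = json.dumps({"status": status, "checks": checks})
    return code, headers, body
```

Returning plain values rather than a framework response object keeps the mapping easy to unit test.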

Security Considerations

Health endpoints can leak sensitive information:

  • Liveness (/health/live): Safe to expose publicly — returns minimal information
  • Readiness (/health/ready): May leak internal topology (database host, service names). Restrict to internal network or require authentication
  • Never include: credentials, connection strings, stack traces in health responses
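
One way to follow the "never include" rule is to log the full exception server-side and return only a generic message. This is a sketch under assumptions: the helper name `safe_failure` and the logger name `health` are hypothetical.

```python
import logging

logger = logging.getLogger("health")


def safe_failure(component: str, exc: Exception) -> dict:
    """Log full exception details server-side; return a generic check entry."""
    # The exception text may contain hosts or credentials, so it goes to logs only.
    logger.error("health check failed for %s: %r", component, exc)
    return {"status": "fail", "output": f"{component} check failed"}
```

The response still names the failing component, which is usually enough for an operator to find the matching log entry.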

Anti-Patterns

  • Always returning 200: Defeats the purpose entirely
  • Checking non-required dependencies: A broken recommendation service should not make your product page fail readiness
  • Long-running checks: Health checks should complete in <100ms; use timeouts on dependency checks
  • No health check at all: Without an active check, a load balancer may keep routing traffic to dead instances
  • Same endpoint for liveness and readiness: A database outage should stop traffic, not restart all containers
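
To keep a slow dependency from stalling the whole probe, each check can be run with a deadline. The sketch below uses a thread pool and is an illustration only: `run_with_timeout`, the pool size of 4, and the 0.1 second default are assumptions, and the checks are expected to return `(healthy, message)` tuples like the ones shown earlier.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable

CheckResult = tuple[bool, str]

# Shared pool so each probe request does not create new threads (size is an assumption).
_executor = ThreadPoolExecutor(max_workers=4)


def run_with_timeout(check: Callable[[], CheckResult],
                     timeout_s: float = 0.1) -> CheckResult:
    """Run a dependency check, reporting failure if it exceeds timeout_s."""
    future = _executor.submit(check)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread keeps running; we only stop waiting for it.
        return False, "timed out"
    except Exception:
        # The check itself raised; report failure without exposing details.
        return False, "check raised an error"
```

Because a timed-out check keeps occupying a worker thread, client-side timeouts on the database or cache driver are still worth setting alongside this wrapper.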

Complete Django Example

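# apps/core/views.py
import logging
import time

from django.db import connection
from django.http import JsonResponse

logger = logging.getLogger(__name__)


def health_live(request):
    # Liveness: no dependency checks, only proof that the process can respond.
    return JsonResponse({'status': 'pass'}, status=200)


def health_ready(request):
    checks = {}
    overall = 'pass'
    # Database check: connect, run a trivial query, and time the round trip.
    t0 = time.monotonic()
    try:
        connection.ensure_connection()
        with connection.cursor() as cursor:
            cursor.execute('SELECT 1')
        checks['database'] = [{'status': 'pass',
                               'responseTime': int((time.monotonic() - t0) * 1000)}]
    except Exception:
        # Keep exception details in server logs, not in the response body.
        logger.exception('Readiness check failed: database')
        checks['database'] = [{'status': 'fail', 'output': 'database unavailable'}]
        overall = 'fail'
    status_code = 503 if overall == 'fail' else 200
    return JsonResponse(
        {'status': overall, 'checks': checks},
        status=status_code,
    )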

Summary

Design three separate health check endpoints: liveness (is it alive?), readiness (is it ready for traffic?), and startup (has it initialized?). Only check required dependencies in readiness, return structured JSON with pass/warn/fail statuses, restrict detailed checks to internal networks, and keep all health checks under 100ms.
