Error Handling Patterns

Implementing Retry with Exponential Backoff

How to implement safe retry logic with exponential backoff and jitter to recover from transient failures without overwhelming downstream services.

Why Retry Logic Matters

Distributed systems fail. Networks drop packets, servers restart, upstream services become temporarily overloaded. The difference between a resilient system and a fragile one often comes down to one question: does your client retry intelligently?

Naive retry — retrying immediately and repeatedly — makes failures worse. If 1,000 clients all retry a struggling server simultaneously, they create a retry storm that prevents recovery. Exponential backoff with jitter solves this.

Exponential Backoff Algorithm

The core idea: wait longer after each failed attempt, and add randomness to spread retries across time.

Base Formula

delay = min(base_delay * 2^attempt, max_delay)

For base_delay=1s, max_delay=32s:

Attempt   Delay (no jitter)
1         2s
2         4s
3         8s
4         16s
5         32s (capped)

AWS recommends full jitter — randomize the entire delay:

import random
import time

def exponential_backoff_delay(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 32.0,
) -> float:
    cap = min(base_delay * (2 ** attempt), max_delay)
    return random.uniform(0, cap)

Full jitter spreads retries uniformly across the backoff window, minimizing thundering herd effects.
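
For intuition, sample the helper a few times. With attempt=3 the cap is 8s, so the printed values (illustrative only) land anywhere in [0, 8):

# Sample five full-jitter delays for attempt 3 (cap = 8s).
samples = [exponential_backoff_delay(attempt=3) for _ in range(5)]
print([round(s, 2) for s in samples])  # e.g. [1.17, 6.42, 0.88, 7.95, 3.3]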

Equal Jitter (Alternative)

def equal_jitter_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 32.0) -> float:
    cap = min(base_delay * (2 ** attempt), max_delay)
    # Half the window is fixed, so the minimum wait is cap / 2.
    return (cap / 2) + random.uniform(0, cap / 2)

Equal jitter guarantees a minimum wait of half the backoff window, preventing clients from immediately re-hammering the server, though it spreads retries less evenly than full jitter.

Which Status Codes Trigger Retries?

Always Retry

  • 429 Too Many Requests — rate limited; respect the Retry-After header (a parsing sketch follows this list)
  • 503 Service Unavailable — server overloaded; respect Retry-After
  • 502 Bad Gateway — upstream failure, usually transient
  • 504 Gateway Timeout — upstream timeout, usually transient
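
Retry-After can carry either delta-seconds or an HTTP-date, so a robust client handles both. A minimal parsing sketch using only the standard library (the helper name parse_retry_after is ours):

import datetime
from email.utils import parsedate_to_datetime

def parse_retry_after(value: str) -> float:
    # Retry-After is either delta-seconds ("120") or an HTTP-date
    # ("Wed, 21 Oct 2026 07:28:00 GMT"); return seconds to wait.
    try:
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        target = parsedate_to_datetime(value)
        now = datetime.datetime.now(datetime.timezone.utc)
        return max(0.0, (target - now).total_seconds())
    except (TypeError, ValueError):
        return 0.0  # unparseable header: fall back to pure backoff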

Conditionally Retry (Idempotent Requests Only)

  • 500 Internal Server Error — may be transient
  • 408 Request Timeout — the server gave up waiting for the request
  • Network errors (connection refused, DNS failure, SSL error)

Never Retry

  • 400 Bad Request — malformed request; retrying won't help
  • 401 Unauthorized — fix credentials first
  • 403 Forbidden — authorization issue, not transient
  • 404 Not Found — resource doesn't exist
  • 422 Unprocessable Entity — validation failure

Critical rule: Only retry idempotent requests automatically (GET, HEAD, PUT, and DELETE by HTTP semantics). Never auto-retry a POST that creates a resource without an idempotency key — you'll create duplicates.
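
When a create must be retried, generate an idempotency key once and reuse it unchanged on every attempt, assuming the API deduplicates on such a key; the Idempotency-Key header and URL below are illustrative:

import uuid
import httpx

def create_order(client: httpx.Client, payload: dict) -> httpx.Response:
    # Hypothetical endpoint; assumes the server deduplicates on Idempotency-Key.
    key = str(uuid.uuid4())  # generated once, sent unchanged on every retry
    return client.post(
        'https://api.example.com/orders',
        json=payload,
        headers={'Idempotency-Key': key},
    )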

Complete Implementation

import httpx
import random
import time
from typing import Callable

RETRYABLE_STATUS_CODES = {429, 502, 503, 504}
MAX_RETRIES = 5
MAX_DELAY = 32.0

def with_retry(
    fn: Callable[[], httpx.Response],
    max_retries: int = MAX_RETRIES,
    base_delay: float = 1.0,
    max_delay: float = MAX_DELAY,
) -> httpx.Response:
    for attempt in range(max_retries + 1):
        try:
            response = fn()
            if response.status_code not in RETRYABLE_STATUS_CODES:
                return response
            try:
                # Only delta-seconds handled here; an HTTP-date value
                # (see the parsing sketch earlier) falls back to pure backoff.
                retry_after = float(response.headers.get('Retry-After', '0'))
            except ValueError:
                retry_after = 0.0
        except httpx.TransportError:
            # Connection refused, DNS failure, timeout: treat as retryable.
            retry_after = 0.0
        if attempt == max_retries:
            raise RuntimeError('Max retries exceeded')
        # Full jitter, but never sleep less than the server requested.
        cap = min(base_delay * (2 ** attempt), max_delay)
        delay = max(retry_after, random.uniform(0, cap))
        time.sleep(delay)
    raise RuntimeError('Unreachable')
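
Usage with an idempotent GET (the URL is a placeholder):

with httpx.Client(timeout=10.0) as client:
    response = with_retry(lambda: client.get('https://api.example.com/health'))
    print(response.status_code)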

Testing Retry Logic

  • Use a mock server that returns 503 for N requests then 200 (a fake-callable sketch follows this list)
  • Assert that total retries do not exceed max_retries
  • Verify delays grow between attempts (log timestamps)
  • Test that 400/404 responses do NOT trigger retries
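
A minimal sketch of the first two checks, substituting a fake callable for a real mock server:

import httpx

class FlakyUpstream:
    # Returns 503 for the first `failures` calls, then 200.
    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def __call__(self) -> httpx.Response:
        self.calls += 1
        status = 503 if self.calls <= self.failures else 200
        return httpx.Response(status)

def test_recovers_after_transient_failures():
    upstream = FlakyUpstream(failures=2)
    response = with_retry(upstream, base_delay=0.01)  # tiny delay keeps the test fast
    assert response.status_code == 200
    assert upstream.calls == 3  # two failures, then one success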

Summary

Exponential backoff with full jitter is the industry-standard pattern for retry logic. Cap your retries, respect Retry-After, never retry non-idempotent mutations, and always add jitter to prevent thundering herds.
