Error Handling Patterns

Implementing Retry with Exponential Backoff

How to implement safe retry logic with exponential backoff and jitter to recover from transient failures without overwhelming downstream services.

Why Retry Logic Matters

Distributed systems fail. Networks drop packets, servers restart, upstream services become temporarily overloaded. The difference between a resilient system and a fragile one often comes down to one question: does your client retry intelligently?

Naive retry — retrying immediately and repeatedly — makes failures worse. If 1,000 clients all retry a struggling server simultaneously, they create a retry storm that prevents recovery. Exponential backoff with jitter solves this.

Exponential Backoff Algorithm

The core idea: wait longer after each failed attempt, and add randomness to spread retries across time.

Base Formula

delay = min(base_delay * 2^attempt, max_delay)

For base_delay=1s, max_delay=32s:

Attempt   Delay (no jitter)
1         2s
2         4s
3         8s
4         16s
5         32s (capped)

AWS recommends full jitter — randomize the entire delay:

import random
import time

def exponential_backoff_delay(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 32.0,
) -> float:
    cap = min(base_delay * (2 ** attempt), max_delay)
    return random.uniform(0, cap)

Full jitter spreads retries uniformly across the backoff window, minimizing thundering herd effects.
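
For intuition, sample the helper a few times. With attempt=3 the cap is 8s, so the printed values (illustrative only) land anywhere in [0, 8):

# Sample five full-jitter delays for attempt 3 (cap = 8s).
samples = [exponential_backoff_delay(attempt=3) for _ in range(5)]
print([round(s, 2) for s in samples])  # e.g. [1.17, 6.42, 0.88, 7.95, 3.3]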

Equal Jitter (Alternative)

def equal_jitter_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 32.0) -> float:
    cap = min(base_delay * (2 ** attempt), max_delay)
    # Half the window is fixed, so the minimum wait is cap / 2.
    return (cap / 2) + random.uniform(0, cap / 2)

Equal jitter guarantees a minimum wait of half the backoff window, preventing clients from immediately re-hammering the server, though it spreads retries less evenly than full jitter.

Which Status Codes Trigger Retries?

Always Retry

  • 429 Too Many Requests — rate limited; respect the Retry-After header (a parsing sketch follows this list)
  • 503 Service Unavailable — server overloaded; respect Retry-After
  • 502 Bad Gateway — upstream failure, usually transient
  • 504 Gateway Timeout — upstream timeout, usually transient
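
Retry-After can carry either delta-seconds or an HTTP-date, so a robust client handles both. A minimal parsing sketch using only the standard library (the helper name parse_retry_after is ours):

import datetime
from email.utils import parsedate_to_datetime

def parse_retry_after(value: str) -> float:
    # Retry-After is either delta-seconds ("120") or an HTTP-date
    # ("Wed, 21 Oct 2026 07:28:00 GMT"); return seconds to wait.
    try:
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        target = parsedate_to_datetime(value)
        now = datetime.datetime.now(datetime.timezone.utc)
        return max(0.0, (target - now).total_seconds())
    except (TypeError, ValueError):
        return 0.0  # unparseable header: fall back to pure backoff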

Conditionally Retry (Idempotent Requests Only)

  • 500 Internal Server Error — may be transient
  • 408 Request Timeout — the server gave up waiting for the request
  • Network errors (connection refused, DNS failure, SSL error)

Never Retry

  • 400 Bad Request — malformed request; retrying won't help
  • 401 Unauthorized — fix credentials first
  • 403 Forbidden — authorization issue, not transient
  • 404 Not Found — resource doesn't exist
  • 422 Unprocessable Entity — validation failure

Critical rule: Only retry idempotent requests automatically (GET, HEAD, PUT, and DELETE by HTTP semantics). Never auto-retry a POST that creates a resource without an idempotency key — you'll create duplicates.
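
When a create must be retried, generate an idempotency key once and reuse it unchanged on every attempt, assuming the API deduplicates on such a key; the Idempotency-Key header and URL below are illustrative:

import uuid
import httpx

def create_order(client: httpx.Client, payload: dict) -> httpx.Response:
    # Hypothetical endpoint; assumes the server deduplicates on Idempotency-Key.
    key = str(uuid.uuid4())  # generated once, sent unchanged on every retry
    return client.post(
        'https://api.example.com/orders',
        json=payload,
        headers={'Idempotency-Key': key},
    )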

Complete Implementation

import httpx
import random
import time
from typing import Callable

RETRYABLE_STATUS_CODES = {429, 502, 503, 504}
MAX_RETRIES = 5
MAX_DELAY = 32.0

def with_retry(
    fn: Callable[[], httpx.Response],
    max_retries: int = MAX_RETRIES,
    base_delay: float = 1.0,
    max_delay: float = MAX_DELAY,
) -> httpx.Response:
    for attempt in range(max_retries + 1):
        try:
            response = fn()
            if response.status_code not in RETRYABLE_STATUS_CODES:
                return response
            try:
                # Only delta-seconds handled here; an HTTP-date value
                # (see the parsing sketch earlier) falls back to pure backoff.
                retry_after = float(response.headers.get('Retry-After', '0'))
            except ValueError:
                retry_after = 0.0
        except httpx.TransportError:
            # Connection refused, DNS failure, timeout: treat as retryable.
            retry_after = 0.0
        if attempt == max_retries:
            raise RuntimeError('Max retries exceeded')
        # Full jitter, but never sleep less than the server requested.
        cap = min(base_delay * (2 ** attempt), max_delay)
        delay = max(retry_after, random.uniform(0, cap))
        time.sleep(delay)
    raise RuntimeError('Unreachable')
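
Usage with an idempotent GET (the URL is a placeholder):

with httpx.Client(timeout=10.0) as client:
    response = with_retry(lambda: client.get('https://api.example.com/health'))
    print(response.status_code)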

Testing Retry Logic

  • Use a mock server that returns 503 for N requests then 200 (a fake-callable sketch follows this list)
  • Assert that total retries do not exceed max_retries
  • Verify delays grow between attempts (log timestamps)
  • Test that 400/404 responses do NOT trigger retries
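
A minimal sketch of the first two checks, substituting a fake callable for a real mock server:

import httpx

class FlakyUpstream:
    # Returns 503 for the first `failures` calls, then 200.
    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def __call__(self) -> httpx.Response:
        self.calls += 1
        status = 503 if self.calls <= self.failures else 200
        return httpx.Response(status)

def test_recovers_after_transient_failures():
    upstream = FlakyUpstream(failures=2)
    response = with_retry(upstream, base_delay=0.01)  # tiny delay keeps the test fast
    assert response.status_code == 200
    assert upstream.calls == 3  # two failures, then one success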

Summary

Exponential backoff with full jitter is the industry-standard pattern for retry logic. Cap your retries, respect Retry-After, never retry non-idempotent mutations, and always add jitter to prevent thundering herds.
