Why Retry Logic Matters
Distributed systems fail. Networks drop packets, servers restart, upstream services become temporarily overloaded. The difference between a resilient system and a fragile one often comes down to one question: does your client retry intelligently?
Naive retry — retrying immediately and repeatedly — makes failures worse. If 1,000 clients all retry a struggling server simultaneously, they create a retry storm that prevents recovery. Exponential backoff with jitter solves this.
Exponential Backoff Algorithm
The core idea: wait longer after each failed attempt, and add randomness to spread retries across time.
Base Formula
delay = min(base_delay * 2^attempt, max_delay)
For base_delay=1s, max_delay=32s:
| Attempt | Delay (no jitter) |
|---|---|
| 1 | 2s |
| 2 | 4s |
| 3 | 8s |
| 4 | 16s |
| 5 | 32s (capped) |
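The table above follows directly from the base formula. A quick sanity check (no jitter, base_delay=1s, max_delay=32s):

```python
# Reproduce the backoff table: delay = min(base_delay * 2^attempt, max_delay).
def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 32.0) -> float:
    return min(base_delay * (2 ** attempt), max_delay)

delays = [backoff_delay(a) for a in range(1, 6)]
print(delays)  # [2.0, 4.0, 8.0, 16.0, 32.0]
```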
Full Jitter (Recommended)
AWS recommends full jitter — randomize the entire delay:
```python
import random

def exponential_backoff_delay(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 32.0,
) -> float:
    # Cap the exponential window, then sample uniformly within it.
    cap = min(base_delay * (2 ** attempt), max_delay)
    return random.uniform(0, cap)
```
Full jitter spreads retries uniformly across the backoff window, minimizing thundering herd effects.
Equal Jitter (Alternative)
```python
def equal_jitter_delay(attempt: int, base_delay: float = 1.0) -> float:
    cap = min(base_delay * (2 ** attempt), 32.0)
    # Half the window is guaranteed; the other half is randomized.
    return (cap / 2) + random.uniform(0, cap / 2)
```
Equal jitter guarantees a minimum wait, preventing immediate re-hammering.
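The practical difference between the two strategies is the sampling window: full jitter draws from [0, cap], equal jitter from [cap/2, cap]. A self-contained sketch (both functions reproduced inline) makes the bounds concrete:

```python
import random

def full_jitter(attempt: int, base: float = 1.0, max_delay: float = 32.0) -> float:
    cap = min(base * (2 ** attempt), max_delay)
    return random.uniform(0, cap)

def equal_jitter(attempt: int, base: float = 1.0, max_delay: float = 32.0) -> float:
    cap = min(base * (2 ** attempt), max_delay)
    return (cap / 2) + random.uniform(0, cap / 2)

cap = 8.0  # attempt 3 with base 1s
assert all(0 <= full_jitter(3) <= cap for _ in range(1000))        # spans the whole window
assert all(cap / 2 <= equal_jitter(3) <= cap for _ in range(1000)) # never drops below cap/2
```

The trade-off: full jitter spreads load more evenly; equal jitter trades some spread for a guaranteed minimum wait.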
Which Status Codes Trigger Retries?
Always Retry
- 429 Too Many Requests — rate limited; respect the `Retry-After` header
- 503 Service Unavailable — server overloaded; respect `Retry-After`
- 502 Bad Gateway — upstream failure, usually transient
- 504 Gateway Timeout — upstream timeout, usually transient
Conditionally Retry (Idempotent Requests Only)
- 500 Internal Server Error — may be transient
- 408 Request Timeout — client-side timeout
- Network errors (connection refused, DNS failure, SSL error)
Never Retry
- 400 Bad Request — malformed request; retrying won't help
- 401 Unauthorized — fix credentials first
- 403 Forbidden — authorization issue, not transient
- 404 Not Found — resource doesn't exist
- 422 Unprocessable Entity — validation failure
Critical rule: Only auto-retry idempotent requests (GET, HEAD, PUT, DELETE). Note that idempotent does not mean non-mutating: PUT and DELETE mutate state, but repeating them is safe. Never auto-retry a POST that creates a resource without an idempotency key — you'll create duplicates.
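These rules can be collected into a single decision function. The sketch below is hypothetical (the names `should_retry` and `IDEMPOTENT_METHODS` are illustrative, not from any library):

```python
# Hypothetical helper classifying (method, status) pairs per the rules above.
ALWAYS_RETRY = {429, 502, 503, 504}
RETRY_IF_IDEMPOTENT = {408, 500}
IDEMPOTENT_METHODS = {'GET', 'HEAD', 'PUT', 'DELETE', 'OPTIONS'}

def should_retry(method: str, status: int) -> bool:
    if status in ALWAYS_RETRY:
        return True
    if status in RETRY_IF_IDEMPOTENT:
        return method.upper() in IDEMPOTENT_METHODS
    return False  # 400, 401, 403, 404, 422, etc.: retrying won't help

assert should_retry('GET', 503)
assert should_retry('GET', 500)
assert not should_retry('POST', 500)  # non-idempotent: don't auto-retry
assert not should_retry('GET', 404)
```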
Complete Implementation
```python
import httpx
import random
import time
from typing import Callable

RETRYABLE_STATUS_CODES = {429, 502, 503, 504}
MAX_RETRIES = 5
MAX_DELAY = 32.0

def with_retry(
    fn: Callable[[], httpx.Response],
    max_retries: int = MAX_RETRIES,
    base_delay: float = 1.0,
) -> httpx.Response:
    for attempt in range(max_retries + 1):
        try:
            response = fn()
            if response.status_code not in RETRYABLE_STATUS_CODES:
                return response
            # Retry-After may be absent or an HTTP-date rather than
            # delay-seconds; fall back to 0 if it doesn't parse.
            try:
                retry_after = float(response.headers.get('Retry-After', 0))
            except ValueError:
                retry_after = 0.0
        except httpx.TransportError:
            retry_after = 0.0
        if attempt == max_retries:
            raise RuntimeError('Max retries exceeded')
        # Full jitter, but never wait less than the server asked for.
        cap = min(base_delay * (2 ** attempt), MAX_DELAY)
        delay = max(retry_after, random.uniform(0, cap))
        time.sleep(delay)
    raise RuntimeError('Unreachable')
```
Testing Retry Logic
- Use a mock server that returns 503 for N requests, then 200
- Assert that total retries do not exceed `max_retries`
- Verify delays grow between attempts (log timestamps)
- Test that 400/404 responses do NOT trigger retries
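Beyond mock-server tests, the jitter window itself can be unit-tested without sleeping: every sampled delay for attempt N must fall inside its capped window. A minimal sketch (the delay function is reproduced inline so the test is self-contained):

```python
import random

def exponential_backoff_delay(
    attempt: int, base_delay: float = 1.0, max_delay: float = 32.0
) -> float:
    cap = min(base_delay * (2 ** attempt), max_delay)
    return random.uniform(0, cap)

# Each attempt's delay must stay within its growing, then capped, window.
for attempt in range(10):
    cap = min(1.0 * (2 ** attempt), 32.0)
    for _ in range(100):
        d = exponential_backoff_delay(attempt)
        assert 0 <= d <= cap
```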
Summary
Exponential backoff with full jitter is the industry-standard pattern for retry logic. Cap your retries, respect Retry-After, never retry non-idempotent mutations, and always add jitter to prevent thundering herds.