Why Load Balance?
A single server has finite CPU, memory, and network capacity. Load balancing distributes incoming requests across a pool of backend servers ("upstream" or "origin" servers) to:
- Scale horizontally — add servers instead of upgrading hardware
- Provide high availability — remove failed servers without downtime
- Improve geographic performance — route to nearest data center
- Enable zero-downtime deploys — drain one server at a time
Layer 4 vs Layer 7 Load Balancing
Load balancers operate at different layers of the network stack:
| Attribute | Layer 4 (Transport) | Layer 7 (Application) |
|---|---|---|
| Sees | TCP/UDP packets | HTTP headers, URLs, body |
| Routing basis | IP + port | Path, header, cookie, body |
| TLS | Passthrough or terminate | Must terminate to inspect traffic |
| Performance | Extremely fast | Slightly slower (parse overhead) |
| Use case | TCP proxying, gaming | HTTP APIs, microservices |
| Examples | AWS NLB, HAProxy TCP | AWS ALB, Nginx, Envoy |
For HTTP APIs, Layer 7 is almost always the right choice — it enables path-based routing, header inspection, and meaningful health checks.
Algorithm Comparison
Round Robin
Requests are distributed to servers in sequence: 1 → 2 → 3 → 1 → 2 → 3…
- Pros: Simple, no state required, predictable distribution
- Cons: Ignores server load — a slow server receives the same request rate as a fast one
- Best for: Homogeneous servers with similar capacity and request cost
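The rotation above can be sketched in a few lines of Python (server names are illustrative):

```python
from itertools import cycle

class RoundRobin:
    """Hand each request to the next server in a fixed rotation."""
    def __init__(self, servers):
        self._it = cycle(servers)

    def pick(self):
        return next(self._it)

lb = RoundRobin(["backend1", "backend2", "backend3"])
picks = [lb.pick() for _ in range(6)]
# picks == ["backend1", "backend2", "backend3",
#           "backend1", "backend2", "backend3"]
```

Note there is no feedback loop: a server that slows down keeps receiving its full share, which is exactly the weakness noted above.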
Least Connections
New requests go to the server with the fewest active connections.
- Pros: Naturally routes away from overloaded servers
- Cons: Connection count is only a proxy for load; a server holding a few expensive connections can still be overloaded
- Best for: Long-lived connections (WebSocket, gRPC streaming)
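A minimal sketch of the bookkeeping involved, assuming the balancer tracks active connections in memory (names illustrative):

```python
class LeastConnections:
    """Send each new connection to the server with the fewest active ones."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def acquire(self):
        # min() breaks ties by insertion order, so the first server wins a tie
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1

lb = LeastConnections(["backend1", "backend2"])
first = lb.acquire()    # "backend1"
second = lb.acquire()   # "backend2"
lb.release(first)
third = lb.acquire()    # "backend1" again: it now has fewer active connections
```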
Weighted Round Robin / Weighted Least Connections
Servers are assigned weights proportional to capacity. A server with weight 2 receives twice the traffic of one with weight 1.
- Use case: Heterogeneous fleets (e.g., c5.2xlarge + c5.4xlarge mix)
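Nginx implements weighted round robin with a "smooth" variant that interleaves picks instead of sending a heavy server its whole share in a burst. A Python sketch of that algorithm (weights and names illustrative):

```python
class SmoothWeightedRR:
    """Smooth weighted round robin: on each pick every server gains its
    weight; the current leader is chosen and pays back the total weight."""
    def __init__(self, weights):
        self.weights = dict(weights)           # {"server": weight}
        self.current = {s: 0 for s in weights}
        self.total = sum(weights.values())

    def pick(self):
        for server, weight in self.weights.items():
            self.current[server] += weight
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= self.total
        return chosen

lb = SmoothWeightedRR({"big": 2, "small": 1})
picks = [lb.pick() for _ in range(6)]
# picks == ["big", "small", "big", "big", "small", "big"]: 2:1, interleaved
```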
IP Hash
The client's IP address is hashed to consistently select the same server.
- Pros: Simple session affinity without cookies
- Cons: Poor distribution when many clients share a NAT IP; server removal disrupts a fraction of users
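The selection itself is one hash and one modulo (IPs and names illustrative):

```python
import hashlib

def ip_hash(client_ip: str, servers: list) -> str:
    """Deterministically map a client IP to one server in the pool."""
    digest = hashlib.md5(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]

servers = ["backend1", "backend2", "backend3"]
# The same client always lands on the same backend:
assert ip_hash("203.0.113.7", servers) == ip_hash("203.0.113.7", servers)
```

The modulo is the weak point: resizing the pool changes the result for most clients, which is why a server removal disrupts a fraction of users.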
Consistent Hashing
A hash ring distributes keys across servers such that only ~1/N of keys remap when a server is added or removed. Used extensively in distributed caches (e.g., Memcached client libraries via ketama hashing; Redis Cluster instead uses fixed hash slots rather than a ring).
- Best for: Stateful caches where locality matters
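A toy ring with virtual nodes makes the ~1/N remapping concrete (all names illustrative; production rings use far more virtual nodes):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring: each server owns many points ('virtual nodes')
    on the ring; a key goes to the first point clockwise from its hash."""
    def __init__(self, servers, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{s}#{v}"), s)
            for s in servers
            for v in range(vnodes)
        )
        self._points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def pick(self, key):
        # First point clockwise from the key's hash, wrapping past the end
        i = bisect.bisect(self._points, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

before = HashRing(["backend1", "backend2", "backend3"])
after = HashRing(["backend1", "backend2"])   # backend3 removed
keys = [f"user:{i}" for i in range(1000)]
moved = sum(before.pick(k) != after.pick(k) for k in keys)
# moved is roughly 1000/3: only keys owned by backend3 relocate
```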
Health Check Integration
Load balancers continuously probe backends to detect failures. Unhealthy backends are removed from rotation automatically.
```nginx
upstream api_servers {
    # Open-source Nginx does passive checks: after max_fails failed
    # attempts within fail_timeout, the server is skipped for fail_timeout.
    server backend1:8000 max_fails=3 fail_timeout=30s;
    server backend2:8000 max_fails=3 fail_timeout=30s;
    keepalive 32;
}
```
For active health checks (Nginx Plus, HAProxy, Envoy):
```yaml
# Envoy health check (under the cluster's health_checks field)
health_checks:
- timeout: 1s
  interval: 5s
  healthy_threshold: 1
  unhealthy_threshold: 3
  http_health_check:
    path: /health
    # expected_statuses takes ranges (end exclusive): [200, 201) = exactly 200
    expected_statuses:
    - start: 200
      end: 201
```
Design your /health endpoint to return 200 only when the instance is fully ready (migrations complete, warm-up done, dependencies reachable).
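One way to aggregate such readiness checks into a single status; `health_status` and the check names here are hypothetical, not a specific framework's API:

```python
def health_status(checks: dict) -> tuple[int, dict]:
    """Run each named readiness check; return 200 only if every one
    passes, otherwise 503 so the LB pulls this instance from rotation."""
    results = {name: bool(check()) for name, check in checks.items()}
    return (200 if all(results.values()) else 503), results

status, detail = health_status({
    "migrations": lambda: True,   # schema at the expected version
    "cache": lambda: True,        # e.g. a Redis PING succeeded
    "database": lambda: True,     # e.g. a SELECT 1 succeeded
})
# status == 200; if any check returned False, status would be 503
```

Returning 503 rather than 500 signals "temporarily out of rotation", which is exactly how load balancers interpret it.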
Session Affinity (Sticky Sessions)
When server-side state is not externalized (e.g., WebSocket connections, in-memory sessions), you need requests from the same client to always reach the same backend:
```nginx
upstream api_servers {
    ip_hash;  # Route by client IP
    server backend1:8000;
    server backend2:8000;
}
```
Cookie-based affinity (AWS ALB's AWSALB cookie) is more precise than IP hash and handles NAT correctly. Prefer externalizing state to Redis or a database to avoid needing affinity altogether.
LB Status Codes: 502, 503, 504
These three codes are generated by the load balancer, not your application:
| Code | Meaning | Common Cause |
|---|---|---|
| **502 Bad Gateway** | Backend returned an invalid response | App crashed mid-request, connection refused, non-HTTP reply |
| **503 Service Unavailable** | No healthy backends available | All servers marked unhealthy |
| **504 Gateway Timeout** | Backend took too long to respond | Slow query, deadlock, overload |
When you see a flood of 503s, check your health check endpoint and upstream health. 504s often point to database contention or a blocking external API call. Tune LB timeout to match your p99 response time budget.