Why Load Balance?
A single server has finite CPU, memory, and network capacity. Load balancing distributes incoming requests across a pool of backend servers ("upstream" or "origin" servers) to:
- Scale horizontally — add servers instead of upgrading hardware
- Provide high availability — remove failed servers without downtime
- Improve geographic performance — route to nearest data center
- Enable zero-downtime deploys — drain one server at a time
Layer 4 vs Layer 7 Load Balancing
Load balancers operate at different layers of the network stack:
| Attribute | Layer 4 (Transport) | Layer 7 (Application) |
|---|---|---|
| Sees | TCP/UDP packets | HTTP headers, URLs, body |
| Routing basis | IP + port | Path, header, cookie, body |
| TLS | Passthrough or terminate | Must terminate to inspect traffic |
| Performance | Extremely fast | Slightly slower (parse overhead) |
| Use case | TCP proxying, gaming | HTTP APIs, microservices |
| Examples | AWS NLB, HAProxy TCP | AWS ALB, Nginx, Envoy |
For HTTP APIs, Layer 7 is almost always the right choice — it enables path-based routing, header inspection, and meaningful health checks.
Algorithm Comparison
Round Robin
Requests are distributed to servers in sequence: 1 → 2 → 3 → 1 → 2 → 3…
- Pros: Simple, no state required, predictable distribution
- Cons: Ignores server load — a slow server receives the same request rate as a fast one
- Best for: Homogeneous servers with similar capacity and request cost
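The rotation above can be sketched in a few lines of Python (server names are illustrative):

```python
from itertools import cycle

class RoundRobin:
    """Hand each request to the next server in a fixed rotation."""
    def __init__(self, servers):
        self._it = cycle(servers)

    def pick(self):
        return next(self._it)

lb = RoundRobin(["backend1", "backend2", "backend3"])
picks = [lb.pick() for _ in range(6)]
# picks == ["backend1", "backend2", "backend3",
#           "backend1", "backend2", "backend3"]
```

Note there is no feedback loop: a server that slows down keeps receiving its full share, which is exactly the weakness noted above.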
Least Connections
New requests go to the server with the fewest active connections.
- Pros: Naturally routes away from overloaded servers
- Cons: Connection count is only a proxy for load; a server holding a few expensive connections can still be overloaded
- Best for: Long-lived connections (WebSocket, gRPC streaming)
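A minimal sketch of the bookkeeping involved, assuming the balancer tracks active connections in memory (names illustrative):

```python
class LeastConnections:
    """Send each new connection to the server with the fewest active ones."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def acquire(self):
        # min() breaks ties by insertion order, so the first server wins a tie
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1

lb = LeastConnections(["backend1", "backend2"])
first = lb.acquire()    # "backend1"
second = lb.acquire()   # "backend2"
lb.release(first)
third = lb.acquire()    # "backend1" again: it now has fewer active connections
```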
Weighted Round Robin / Weighted Least Connections
Servers are assigned weights proportional to capacity. A server with weight 2 receives twice the traffic of one with weight 1.
- Use case: Heterogeneous fleets (e.g., c5.2xlarge + c5.4xlarge mix)
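Nginx implements weighted round robin with a "smooth" variant that interleaves picks instead of sending a heavy server its whole share in a burst. A Python sketch of that algorithm (weights and names illustrative):

```python
class SmoothWeightedRR:
    """Smooth weighted round robin: on each pick every server gains its
    weight; the current leader is chosen and pays back the total weight."""
    def __init__(self, weights):
        self.weights = dict(weights)           # {"server": weight}
        self.current = {s: 0 for s in weights}
        self.total = sum(weights.values())

    def pick(self):
        for server, weight in self.weights.items():
            self.current[server] += weight
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= self.total
        return chosen

lb = SmoothWeightedRR({"big": 2, "small": 1})
picks = [lb.pick() for _ in range(6)]
# picks == ["big", "small", "big", "big", "small", "big"]: 2:1, interleaved
```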
IP Hash
The client's IP address is hashed to consistently select the same server.
- Pros: Simple session affinity without cookies
- Cons: Poor distribution when many clients share a NAT IP; server removal disrupts a fraction of users
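The selection itself is one hash and one modulo (IPs and names illustrative):

```python
import hashlib

def ip_hash(client_ip: str, servers: list) -> str:
    """Deterministically map a client IP to one server in the pool."""
    digest = hashlib.md5(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]

servers = ["backend1", "backend2", "backend3"]
# The same client always lands on the same backend:
assert ip_hash("203.0.113.7", servers) == ip_hash("203.0.113.7", servers)
```

The modulo is the weak point: resizing the pool changes the result for most clients, which is why a server removal disrupts a fraction of users.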
Consistent Hashing
A hash ring distributes keys across servers such that only ~1/N of keys remap when a server is added or removed. Used extensively in distributed caches (e.g., Memcached client libraries via ketama hashing; Redis Cluster instead uses fixed hash slots rather than a ring).
- Best for: Stateful caches where locality matters
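A toy ring with virtual nodes makes the ~1/N remapping concrete (all names illustrative; production rings use far more virtual nodes):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring: each server owns many points ('virtual nodes')
    on the ring; a key goes to the first point clockwise from its hash."""
    def __init__(self, servers, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{s}#{v}"), s)
            for s in servers
            for v in range(vnodes)
        )
        self._points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def pick(self, key):
        # First point clockwise from the key's hash, wrapping past the end
        i = bisect.bisect(self._points, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

before = HashRing(["backend1", "backend2", "backend3"])
after = HashRing(["backend1", "backend2"])   # backend3 removed
keys = [f"user:{i}" for i in range(1000)]
moved = sum(before.pick(k) != after.pick(k) for k in keys)
# moved is roughly 1000/3: only keys owned by backend3 relocate
```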
Health Check Integration
Load balancers continuously probe backends to detect failures. Unhealthy backends are removed from rotation automatically.
```nginx
upstream api_servers {
    # Open-source Nginx does passive checks: after max_fails failed
    # attempts within fail_timeout, the server is skipped for fail_timeout.
    server backend1:8000 max_fails=3 fail_timeout=30s;
    server backend2:8000 max_fails=3 fail_timeout=30s;
    keepalive 32;
}
```
For active health checks (Nginx Plus, HAProxy, Envoy):
```yaml
# Envoy health check (under the cluster's health_checks field)
health_checks:
- timeout: 1s
  interval: 5s
  healthy_threshold: 1
  unhealthy_threshold: 3
  http_health_check:
    path: /health
    # expected_statuses takes ranges (end exclusive): [200, 201) = exactly 200
    expected_statuses:
    - start: 200
      end: 201
```
Design your /health endpoint to return 200 only when the instance is fully ready (migrations complete, warm-up done, dependencies reachable).
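One way to aggregate such readiness checks into a single status; `health_status` and the check names here are hypothetical, not a specific framework's API:

```python
def health_status(checks: dict) -> tuple[int, dict]:
    """Run each named readiness check; return 200 only if every one
    passes, otherwise 503 so the LB pulls this instance from rotation."""
    results = {name: bool(check()) for name, check in checks.items()}
    return (200 if all(results.values()) else 503), results

status, detail = health_status({
    "migrations": lambda: True,   # schema at the expected version
    "cache": lambda: True,        # e.g. a Redis PING succeeded
    "database": lambda: True,     # e.g. a SELECT 1 succeeded
})
# status == 200; if any check returned False, status would be 503
```

Returning 503 rather than 500 signals "temporarily out of rotation", which is exactly how load balancers interpret it.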
Session Affinity (Sticky Sessions)
When server-side state is not externalized (e.g., WebSocket connections, in-memory sessions), you need requests from the same client to always reach the same backend:
```nginx
upstream api_servers {
    ip_hash;  # Route by client IP
    server backend1:8000;
    server backend2:8000;
}
```
Cookie-based affinity (AWS ALB's AWSALB cookie) is more precise than IP hash and handles NAT correctly. Prefer externalizing state to Redis or a database to avoid needing affinity altogether.
LB Status Codes: 502, 503, 504
These three codes are generated by the load balancer, not your application:
| Code | Meaning | Common Cause |
|---|---|---|
| **502 Bad Gateway** | Backend returned an invalid response | App crashed mid-request, connection refused, non-HTTP reply |
| **503 Service Unavailable** | No healthy backends available | All servers marked unhealthy |
| **504 Gateway Timeout** | Backend took too long to respond | Slow query, deadlock, overload |
When you see a flood of 503s, check your health check endpoint and upstream health. 504s often point to database contention or a blocking external API call. Tune LB timeout to match your p99 response time budget.