What Is DNS Failover?
DNS failover uses DNS record changes to redirect traffic away from a failed server or region to a healthy one. When a health check detects that a server is unhealthy, the DNS provider automatically removes or replaces its record, causing new connections to go to the failover target.
DNS-based failover operates at the routing layer, not the connection layer: existing TCP connections to the failed server are not migrated, and clients must resolve again to reach the new target. DNS failover therefore works best for stateless services or when combined with load balancers.
Health-Check-Based DNS
DNS providers with health checks monitor your endpoints and update records automatically:
AWS Route 53 Failover Routing:
Primary record: api.example.com → 203.0.113.10 (health check: HTTP GET /health)
Secondary record: api.example.com → 198.51.100.20 (failover target)
When the primary health check fails (e.g., returns non-2xx or times out for 3 consecutive checks), Route 53 automatically stops returning the primary record and returns the secondary.
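The decision logic amounts to a small state machine: count consecutive failed probes, and stop answering with the primary once a threshold is crossed. A minimal sketch (illustrative only; the IPs and threshold mirror the example above, and real providers run this logic inside their health-check infrastructure):

```python
PRIMARY = "203.0.113.10"
SECONDARY = "198.51.100.20"
FAILURE_THRESHOLD = 3  # matches the Route 53 example above

class FailoverMonitor:
    """Tracks consecutive health-check failures and picks the record to serve."""

    def __init__(self, threshold: int = FAILURE_THRESHOLD):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record_probe(self, healthy: bool) -> None:
        # A single successful probe resets the failure counter.
        self.consecutive_failures = 0 if healthy else self.consecutive_failures + 1

    def active_record(self) -> str:
        # Answer with the secondary only after the threshold is crossed.
        if self.consecutive_failures >= self.threshold:
            return SECONDARY
        return PRIMARY

monitor = FailoverMonitor()
for probe in [True, False, False, False]:  # three consecutive failures
    monitor.record_probe(probe)
print(monitor.active_record())  # 198.51.100.20
```

Note that recovery is symmetric: once probes succeed again, the counter resets and the primary record is served on the next resolution.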
Cloudflare Load Balancing monitors origins with configurable health checks and steers traffic via its global anycast network.
Weighted Routing
Weighted routing distributes traffic across multiple records proportionally:
api.example.com → 203.0.113.10 weight=90 (primary, 90% of traffic)
api.example.com → 198.51.100.20 weight=10 (canary, 10% of traffic)
Use weighted routing for:
- Canary deployments — send 5–10% of traffic to a new version
- A/B testing — split traffic between variants
- Capacity balancing — route proportionally to server capacity
Set weight=0 to drain traffic from a server without removing the record.
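Resolution under weighted routing is effectively a weighted random choice among the records, with weight=0 entries excluded. A sketch of that selection (illustrative only; providers implement this server-side):

```python
import random

def pick_record(records: dict) -> str:
    """Pick one IP with probability proportional to its weight.
    Records with weight 0 are drained: never returned."""
    candidates = {ip: w for ip, w in records.items() if w > 0}
    ips = list(candidates)
    return random.choices(ips, weights=[candidates[ip] for ip in ips])[0]

records = {"203.0.113.10": 90, "198.51.100.20": 10}
random.seed(0)
counts = {ip: 0 for ip in records}
for _ in range(10_000):
    counts[pick_record(records)] += 1
print(counts)  # roughly 9000 / 1000, matching the 90/10 weights

# weight=0 drains a server without deleting the record:
drained = {"203.0.113.10": 90, "198.51.100.20": 0}
assert all(pick_record(drained) == "203.0.113.10" for _ in range(100))
```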
Geolocation Routing
Route users to the nearest region based on the client's geographic location:
api.example.com (US users) → 203.0.113.10 (US-East)
api.example.com (EU users) → 198.51.100.20 (EU-West)
api.example.com (APAC users) → 192.0.2.30 (APAC-South)
api.example.com (default) → 203.0.113.10 (fallback)
Geolocation routing reduces latency by keeping requests close to the user. It is also used for data sovereignty (e.g., requirements that EU users' data stay in the EU).
Always configure a default (catch-all) record for regions not explicitly covered.
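The lookup itself is a region-to-record map with a catch-all fallback. A minimal sketch mirroring the records above (the region keys are hypothetical; real providers derive the client's region from resolver or client-subnet IP geolocation):

```python
# Hypothetical geolocation table mirroring the records above.
GEO_RECORDS = {
    "US": "203.0.113.10",    # US-East
    "EU": "198.51.100.20",   # EU-West
    "APAC": "192.0.2.30",    # APAC-South
}
DEFAULT_RECORD = "203.0.113.10"  # catch-all fallback

def resolve_by_geo(client_region: str) -> str:
    """Return the region-local record, or the default for uncovered regions."""
    return GEO_RECORDS.get(client_region, DEFAULT_RECORD)

print(resolve_by_geo("EU"))     # 198.51.100.20
print(resolve_by_geo("LATAM"))  # 203.0.113.10 (falls back to the default)
```

Without the default entry, clients in uncovered regions would receive no answer at all, which is why the catch-all record is mandatory.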
Failover with Low TTL
The TTL of the DNS record determines how quickly failover takes effect:
| TTL | Failover Time | Cache Load |
|---|---|---|
| 3600s | Up to 1 hour | Low |
| 300s | Up to 5 min | Medium |
| 60s | Up to 1 min | High |
| 30s | Up to 30 seconds | Very high |
Low TTLs increase DNS query volume (resolvers re-query more often). For production failover, 60s is a common balance. Do not go below 30s — it causes excessive resolver load and can trigger rate limiting.
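Note that the TTL is only part of the client-visible delay: the health check must first detect the failure (probe interval × failure threshold), and only then does the cached answer's TTL start to matter. A back-of-the-envelope calculation, assuming a 30-second probe interval and the threshold of 3 used earlier:

```python
def worst_case_failover_seconds(probe_interval: int, failure_threshold: int, ttl: int) -> int:
    """Worst-case time before clients reach the secondary:
    detection time (interval * threshold) plus the cached answer's full TTL."""
    return probe_interval * failure_threshold + ttl

# Assumed: 30 s probe interval, failure threshold of 3.
for ttl in (3600, 300, 60, 30):
    print(f"TTL {ttl:>4}s -> worst case {worst_case_failover_seconds(30, 3, ttl)}s")
```

With a 60-second TTL this gives a worst case of 150 seconds, which is why detection settings deserve as much attention as the TTL itself.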
Active-Active vs Active-Passive
Active-Active: multiple healthy records are returned at once, and clients receive different IPs via round-robin or weighted distribution, so all nodes serve traffic. This provides both load distribution and failover: when one node fails, its record is removed from the answer set.
Active-Passive: only the primary record is returned under normal conditions. The passive (standby) record only appears when the primary health check fails. Simpler to reason about; no load distribution in normal operation.
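The difference between the two modes is visible in the answer set a resolver receives. A sketch (illustrative only; assumes exactly one primary listed first and one standby):

```python
def answer_set(mode: str, health: dict) -> list:
    """Return the IPs a resolver would receive, given per-record health.
    'active-active': all healthy records. 'active-passive': the primary
    while healthy, otherwise the standby. Dict order encodes primary-first."""
    ips = list(health)
    if mode == "active-active":
        return [ip for ip in ips if health[ip]]
    primary, standby = ips[0], ips[1]
    return [primary] if health[primary] else [standby]

health = {"203.0.113.10": True, "198.51.100.20": True}
print(answer_set("active-active", health))   # ['203.0.113.10', '198.51.100.20']
print(answer_set("active-passive", health))  # ['203.0.113.10']

health["203.0.113.10"] = False
print(answer_set("active-passive", health))  # ['198.51.100.20']
```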
Implementation with Route 53 / Cloudflare
Route 53 CLI:
# Create a health check
aws route53 create-health-check \
  --caller-reference "$(date +%s)" \
  --health-check-config '{
    "IPAddress": "203.0.113.10",
    "Port": 443,
    "Type": "HTTPS",
    "ResourcePath": "/health",
    "FailureThreshold": 3
  }'
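The health check does nothing on its own: it must be attached to a failover record set via `change-resource-record-sets`. A sketch that builds the corresponding change batch (field names follow the Route 53 `ChangeResourceRecordSets` API; the health-check ID is a placeholder):

```python
def failover_record_change(name, ip, role, health_check_id=None, ttl=60):
    """Build one UPSERT for a Route 53 failover record set.
    role is "PRIMARY" or "SECONDARY"; only the primary carries a health check."""
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

change_batch = {"Changes": [
    failover_record_change("api.example.com", "203.0.113.10", "PRIMARY",
                           health_check_id="hc-placeholder-id"),
    failover_record_change("api.example.com", "198.51.100.20", "SECONDARY"),
]}
# Pass change_batch to route53.change_resource_record_sets(...) via boto3,
# or serialize it as JSON for
# `aws route53 change-resource-record-sets --change-batch`.
```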
Cloudflare Terraform:
resource "cloudflare_load_balancer" "api" {
  zone_id          = var.zone_id
  name             = "api.example.com"
  default_pool_ids = [cloudflare_load_balancer_pool.primary.id]
  fallback_pool_id = cloudflare_load_balancer_pool.secondary.id
  ttl              = 30
}
Monitoring
Configure alerts for:
- Health check failures (primary server down)
- Failover events (traffic switched to secondary)
- DNS query volume spikes (may indicate resolver misconfiguration)
- TTL-related cache consistency issues during deploys