Debugging & Troubleshooting

Debugging 503 Service Unavailable Errors in Production

A systematic approach to diagnosing 503 errors — server overload, deployment issues, health check failures, and auto-scaling gaps.

What Does 503 Mean?

503 Service Unavailable means the server cannot handle the request right now. Unlike 500 (a bug), 503 indicates a temporary condition — the server is overloaded, undergoing maintenance, or not yet ready to accept traffic.
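Because 503 is transient, well-behaved clients retry with backoff, honoring the Retry-After header when the server sends one. A minimal sketch of the delay calculation (`retry_delay` is an illustrative helper, not part of any library; base and cap values are assumptions to tune):

```python
def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0):
    """Seconds to wait before the next retry.

    Honor the server's Retry-After value if present; otherwise
    fall back to capped exponential backoff (base * 2**attempt).
    """
    if retry_after is not None:
        return float(retry_after)
    return min(cap, base * (2 ** attempt))
```

In a retry loop you would sleep for `retry_delay(attempt, response.headers.get("Retry-After"))` between attempts, giving up after a fixed budget.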

Common Causes

1. Server Overload

Too many requests for the available resources. All worker processes are busy.

# Check system resources
htop                          # CPU and memory usage
ss -tlnp | grep :8000         # Active connections to app server
cat /proc/loadavg             # Load average
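The same overload check can run from monitoring code: `os.getloadavg()` returns the 1-, 5-, and 15-minute load averages, and a sustained 1-minute load above the CPU count suggests saturation. A sketch (the per-CPU threshold is an assumption; tune it for your workload):

```python
import os

def is_overloaded(threshold_per_cpu=1.0):
    """True if the 1-minute load average exceeds the CPU count
    scaled by threshold_per_cpu (Unix only)."""
    load1, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    return load1 > cpus * threshold_per_cpu
```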

2. Application Not Started

The service is configured but the process hasn't started, or it crashed during startup.

# Check service status
sudo systemctl status gunicorn
sudo systemctl status your-app

# Check logs for startup failures
sudo journalctl -u gunicorn --no-pager -n 50

3. Health Check Failures

Load balancers take servers out of rotation when health checks fail, returning 503 to clients.

# Test the health endpoint directly
curl -v http://localhost:8000/health/
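A health endpoint should be cheap and should fail when a critical dependency fails, so the load balancer pulls the instance before users see errors. A minimal stdlib sketch (`check_database` is a placeholder for your real dependency check; the handler class and route are illustrative):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database():
    # Placeholder: replace with a real, fast dependency check
    # (e.g. SELECT 1 against the primary connection pool).
    return True

def health_status(path):
    """Return (status_code, body) for a health probe."""
    if path.rstrip("/") == "/health" and check_database():
        return 200, b"ok"
    return 503, b"unavailable"

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        code, body = health_status(self.path)
        self.send_response(code)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep probe traffic out of the access log
```

Serve it with `HTTPServer(("", 8000), HealthHandler).serve_forever()`; keeping the probe logic in a plain function makes it easy to test without a running server.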

4. Deployment In Progress

During rolling deployments, old instances may be stopped before new ones are ready to serve traffic, leaving a brief window with reduced or zero healthy capacity.

  • Add a readiness check endpoint
  • Use graceful shutdown with drain timeout
  • Configure connection draining in your load balancer
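The "graceful shutdown with drain timeout" item can be sketched as a small tracker: on SIGTERM the process stops accepting new requests (so readiness probes fail and the load balancer routes elsewhere), then waits for in-flight requests to finish before exiting. Class and method names here are illustrative:

```python
import signal
import threading
import time

class GracefulShutdown:
    """Stop accepting new requests on shutdown, then wait up to
    drain_timeout seconds for in-flight requests to finish."""

    def __init__(self, drain_timeout=30.0):
        self.drain_timeout = drain_timeout
        self.accepting = True
        self.in_flight = 0
        self._lock = threading.Lock()

    def request_started(self):
        with self._lock:
            if not self.accepting:
                return False  # reject; the balancer retries elsewhere
            self.in_flight += 1
            return True

    def request_finished(self):
        with self._lock:
            self.in_flight -= 1

    def drain(self):
        self.accepting = False  # readiness probe should now fail
        deadline = time.monotonic() + self.drain_timeout
        while self.in_flight > 0 and time.monotonic() < deadline:
            time.sleep(0.05)

def install(shutdown):
    """Wire drain() to SIGTERM (call from the main thread)."""
    signal.signal(signal.SIGTERM, lambda *_: shutdown.drain())
```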

5. Database Connection Exhaustion

All database connections in the pool are in use, so new requests block or fail while waiting for a free connection.

# PostgreSQL: check active connections
psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
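Pool exhaustion surfaces as requests hanging on connection checkout. A bounded pool with a checkout timeout lets the application fail fast and return 503 instead of queueing indefinitely. A stdlib sketch (`connect_fn` stands in for your real connection factory; sizes and timeouts are assumptions):

```python
import queue

class PoolExhausted(Exception):
    pass

class ConnectionPool:
    """Fixed-size pool: acquire() blocks up to `timeout` seconds,
    then raises so the caller can shed load (e.g. return 503)."""

    def __init__(self, connect_fn, size=10, timeout=2.0):
        self.timeout = timeout
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect_fn())

    def acquire(self):
        try:
            return self._pool.get(timeout=self.timeout)
        except queue.Empty:
            raise PoolExhausted("no free connection within timeout")

    def release(self, conn):
        self._pool.put(conn)
```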

6. Dependency Failure

A critical downstream service (database, cache, external API) is unavailable.

# Check connectivity to dependencies
pg_isready -h db.example.com
redis-cli -h cache.example.com ping
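The same connectivity checks can run from application code or a watchdog. A minimal TCP reachability probe with a timeout (hostnames above are examples; this only verifies the port accepts connections, not that the service is healthy):

```python
import socket

def tcp_reachable(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```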

Quick Fix Checklist

  • Is the application process running?
  • Are CPU/memory within acceptable limits?
  • Is the health check endpoint responding?
  • Are database connections available?
  • Are dependencies (cache, external APIs) reachable?
  • Was a deployment just rolled out?

Prevention

  • Set up auto-scaling based on CPU/request count
  • Implement circuit breakers for downstream dependencies
  • Use connection pooling with sensible limits
  • Add a /health/ endpoint that checks critical dependencies
  • Use rolling deployments with readiness probes
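The circuit-breaker item above can be sketched as a small stdlib class: after a run of consecutive failures the breaker opens and fails fast, then allows a trial call once a reset timeout passes. Thresholds and names are illustrative, not a specific library's API:

```python
import time

class CircuitBreaker:
    """Open after max_failures consecutive failures; reject calls
    while open; allow a trial call after reset_timeout seconds."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Failing fast here keeps a slow dependency from tying up all workers, which is often what turns one failing service into fleet-wide 503s.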
