Debugging & Troubleshooting

Debugging 503 Service Unavailable Errors in Production

A systematic approach to diagnosing 503 errors — server overload, deployment issues, health check failures, and auto-scaling gaps.

What Does 503 Mean?

503 Service Unavailable means the server cannot handle the request right now. Unlike 500 (a bug), 503 indicates a temporary condition — the server is overloaded, undergoing maintenance, or not yet ready to accept traffic.
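Because 503 is transient, well-behaved clients retry with backoff, honoring the Retry-After header when the server sends one. A minimal sketch of the delay calculation (`retry_delay` is an illustrative helper, not part of any library; base and cap values are assumptions to tune):

```python
def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0):
    """Seconds to wait before the next retry.

    Honor the server's Retry-After value if present; otherwise
    fall back to capped exponential backoff (base * 2**attempt).
    """
    if retry_after is not None:
        return float(retry_after)
    return min(cap, base * (2 ** attempt))
```

In a retry loop you would sleep for `retry_delay(attempt, response.headers.get("Retry-After"))` between attempts, giving up after a fixed budget.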

Common Causes

1. Server Overload

Too many requests for the available resources. All worker processes are busy.

# Check system resources
htop                          # CPU and memory usage
ss -tlnp | grep :8000         # Active connections to app server
cat /proc/loadavg             # Load average
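The same overload check can run from monitoring code: `os.getloadavg()` returns the 1-, 5-, and 15-minute load averages, and a sustained 1-minute load above the CPU count suggests saturation. A sketch (the per-CPU threshold is an assumption; tune it for your workload):

```python
import os

def is_overloaded(threshold_per_cpu=1.0):
    """True if the 1-minute load average exceeds the CPU count
    scaled by threshold_per_cpu (Unix only)."""
    load1, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    return load1 > cpus * threshold_per_cpu
```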

2. Application Not Started

The service is configured but the process hasn't started, or it crashed during startup.

# Check service status
sudo systemctl status gunicorn
sudo systemctl status your-app

# Check logs for startup failures
sudo journalctl -u gunicorn --no-pager -n 50

3. Health Check Failures

Load balancers take servers out of rotation when health checks fail, returning 503 to clients.

# Test the health endpoint directly
curl -v http://localhost:8000/health/
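A health endpoint should be cheap and should fail when a critical dependency fails, so the load balancer pulls the instance before users see errors. A minimal stdlib sketch (`check_database` is a placeholder for your real dependency check; the handler class and route are illustrative):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database():
    # Placeholder: replace with a real, fast dependency check
    # (e.g. SELECT 1 against the primary connection pool).
    return True

def health_status(path):
    """Return (status_code, body) for a health probe."""
    if path.rstrip("/") == "/health" and check_database():
        return 200, b"ok"
    return 503, b"unavailable"

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        code, body = health_status(self.path)
        self.send_response(code)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep probe traffic out of the access log
```

Serve it with `HTTPServer(("", 8000), HealthHandler).serve_forever()`; keeping the probe logic in a plain function makes it easy to test without a running server.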

4. Deployment In Progress

During rolling deployments, old instances may be stopped before new ones are ready to serve traffic, leaving a brief window with reduced or zero healthy capacity.

  • Add a readiness check endpoint
  • Use graceful shutdown with drain timeout
  • Configure connection draining in your load balancer
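The "graceful shutdown with drain timeout" item can be sketched as a small tracker: on SIGTERM the process stops accepting new requests (so readiness probes fail and the load balancer routes elsewhere), then waits for in-flight requests to finish before exiting. Class and method names here are illustrative:

```python
import signal
import threading
import time

class GracefulShutdown:
    """Stop accepting new requests on shutdown, then wait up to
    drain_timeout seconds for in-flight requests to finish."""

    def __init__(self, drain_timeout=30.0):
        self.drain_timeout = drain_timeout
        self.accepting = True
        self.in_flight = 0
        self._lock = threading.Lock()

    def request_started(self):
        with self._lock:
            if not self.accepting:
                return False  # reject; the balancer retries elsewhere
            self.in_flight += 1
            return True

    def request_finished(self):
        with self._lock:
            self.in_flight -= 1

    def drain(self):
        self.accepting = False  # readiness probe should now fail
        deadline = time.monotonic() + self.drain_timeout
        while self.in_flight > 0 and time.monotonic() < deadline:
            time.sleep(0.05)

def install(shutdown):
    """Wire drain() to SIGTERM (call from the main thread)."""
    signal.signal(signal.SIGTERM, lambda *_: shutdown.drain())
```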

5. Database Connection Exhaustion

All database connections in the pool are in use, so new requests block or fail while waiting for a free connection.

# PostgreSQL: check active connections
psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
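Pool exhaustion surfaces as requests hanging on connection checkout. A bounded pool with a checkout timeout lets the application fail fast and return 503 instead of queueing indefinitely. A stdlib sketch (`connect_fn` stands in for your real connection factory; sizes and timeouts are assumptions):

```python
import queue

class PoolExhausted(Exception):
    pass

class ConnectionPool:
    """Fixed-size pool: acquire() blocks up to `timeout` seconds,
    then raises so the caller can shed load (e.g. return 503)."""

    def __init__(self, connect_fn, size=10, timeout=2.0):
        self.timeout = timeout
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect_fn())

    def acquire(self):
        try:
            return self._pool.get(timeout=self.timeout)
        except queue.Empty:
            raise PoolExhausted("no free connection within timeout")

    def release(self, conn):
        self._pool.put(conn)
```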

6. Dependency Failure

A critical downstream service (database, cache, external API) is unavailable.

# Check connectivity to dependencies
pg_isready -h db.example.com
redis-cli -h cache.example.com ping
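The same connectivity checks can run from application code or a watchdog. A minimal TCP reachability probe with a timeout (hostnames above are examples; this only verifies the port accepts connections, not that the service is healthy):

```python
import socket

def tcp_reachable(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```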

Quick Fix Checklist

  • Is the application process running?
  • Are CPU/memory within acceptable limits?
  • Is the health check endpoint responding?
  • Are database connections available?
  • Are dependencies (cache, external APIs) reachable?
  • Was a deployment just rolled out?

Prevention

  • Set up auto-scaling based on CPU/request count
  • Implement circuit breakers for downstream dependencies
  • Use connection pooling with sensible limits
  • Add a /health/ endpoint that checks critical dependencies
  • Use rolling deployments with readiness probes
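The circuit-breaker item above can be sketched as a small stdlib class: after a run of consecutive failures the breaker opens and fails fast, then allows a trial call once a reset timeout passes. Thresholds and names are illustrative, not a specific library's API:

```python
import time

class CircuitBreaker:
    """Open after max_failures consecutive failures; reject calls
    while open; allow a trial call after reset_timeout seconds."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Failing fast here keeps a slow dependency from tying up all workers, which is often what turns one failing service into fleet-wide 503s.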
