Debugging & Troubleshooting

Debugging 500 Internal Server Error: A Systematic Approach

How to diagnose the most generic and frustrating HTTP error — log analysis, stack traces, common root causes (unhandled exceptions, database connection failures, OOM), and prevention strategies.

Why 500 Is the Hardest Error to Debug

A 500 Internal Server Error is a catch-all. It means the server encountered an unexpected condition that prevented it from fulfilling the request. Unlike a 404 (which tells you exactly what's missing) or a 400 (which points to the client payload), a 500 tells you almost nothing beyond 'something went wrong on our side.'

Three properties make 500 especially frustrating:

  • No client-facing detail — by design. Exposing stack traces in responses is a security vulnerability. Frameworks hide the real error behind a generic message.
  • It can be transient — a momentary database blip returns 500 for a fraction of requests. By the time you investigate, the error is gone.
  • Every framework handles it differently — with DEBUG=False, Django returns a bare 'Server Error (500)' page. Express's default handler returns an HTML 500, but an unhandled rejection in an async route can crash the process unless the error is forwarded to error-handling middleware. Spring returns a JSON error object. You need to know your stack.

The systematic approach below works regardless of framework.

Step 1: Check Application Logs Immediately

The single most important rule: the response body is useless; the server log is everything. Open your log aggregator before doing anything else.

What to Look For

A well-instrumented application produces a log entry that includes:

  • The request URL, method, and timestamp
  • A correlation ID (also called request ID or trace ID) — a UUID that links all log lines for a single request
  • The exception class and message
  • The full stack trace

# Example structured log entry (JSON)
{
  "level": "ERROR",
  "event": "Unhandled exception",
  "request_id": "04af84e7-1307-4f68-b5b3-6b58f6ab11f9",
  "method": "POST",
  "path": "/api/orders",
  "exc_info": "django.db.OperationalError: FATAL: too many connections",
  "timestamp": "2026-02-01T14:32:11.445Z"
}

If you see a correlation ID in the response header (X-Request-ID), filter your log aggregator by that value to see the complete request lifecycle — all database queries, cache hits, and external calls that preceded the crash.
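If your aggregator doesn't support field filtering, the same lookup is a few lines of Python. A minimal sketch over JSON-lines logs — the field name request_id matches the example entry above; adjust it to your logging setup:

```python
import json

def lines_for_request(log_lines, request_id):
    """Yield parsed log entries belonging to a single request."""
    for line in log_lines:
        try:
            entry = json.loads(line)
        except ValueError:
            continue  # skip non-JSON lines (startup banners, tracebacks)
        if entry.get("request_id") == request_id:
            yield entry

# Usage:
# with open("/var/log/app/app.log") as f:
#     for entry in lines_for_request(f, "04af84e7-1307-4f68-b5b3-6b58f6ab11f9"):
#         print(entry["timestamp"], entry["event"])
```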

Using Sentry for Error Grouping

Sentry groups similar errors into issues and tracks their frequency. A 500 that fires once might be user error. A 500 that fires 2,000 times in five minutes is an incident.

In Sentry, each issue shows:

  • The exception type and message
  • The exact line of code that raised the exception
  • Breadcrumbs — a timeline of events (SQL queries, HTTP calls) leading to the crash
  • Affected users and request contexts

If you don't have Sentry, grep your logs for ERROR and Exception within the time window of the incident:

# Grep structured JSON logs for errors in the last hour
journalctl -u gunicorn --since '1 hour ago' | grep '"level": "ERROR"'

# Or with a log file
grep -A 20 'ERROR' /var/log/app/error.log | tail -200

Step 2: Reproduce Locally

Once you have a stack trace, try to reproduce the error in your local environment. This sounds obvious, but there are systematic differences between production and development that require deliberate effort to bridge.

Environment Variables

The most common cause of 'works locally, breaks in prod' is a missing or wrong environment variable. Compare your local .env against the production .env.prod:

# List all env vars your app reads (Python example)
grep -r 'os.environ\|env(' apps/ config/ | grep -v '.pyc' | sort

# Check production env
ssh your-server 'cat /var/www/yourapp/.env.prod | grep -v "^#" | grep -v "^$"'
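Eyeballing two env files is error-prone; diffing the variable names (not the values) catches missing keys directly. A sketch, assuming simple KEY=VALUE lines:

```python
def env_keys(text):
    """Extract variable names from KEY=VALUE lines, skipping comments and blanks."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        keys.add(line.split("=", 1)[0].strip())
    return keys

def missing_in_local(local_text, prod_text):
    """Variables set in production but absent locally -- prime 500 suspects."""
    return sorted(env_keys(prod_text) - env_keys(local_text))
```

Comparing names rather than values also avoids copying production secrets onto your machine.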

Database State

Some 500 errors only occur with specific data patterns. If your stack trace points to a query, run that query against a production database snapshot:

# Django: reproduce a query failure
from django.db import connection
with connection.cursor() as cursor:
    cursor.execute('EXPLAIN ANALYZE SELECT ...')
    print(cursor.fetchall())

Feature Flags and A/B Tests

If the 500 affects only a subset of users, check whether a feature flag or A/B test is active in production but not locally. Many 500s are introduced when a new code path is activated for a percentage of traffic.
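Percentage rollouts typically bucket users by a stable hash of their ID, which is why the same users hit the broken code path on every request. A hypothetical sketch of the mechanism (your flag system's exact hashing will differ):

```python
import hashlib

def in_rollout(user_id, flag_name, percent):
    """Deterministically place a user into one of 100 buckets; the first
    `percent` buckets get the new code path."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# A 10% rollout breaks the *same* 10% of users on every request --
# which is exactly the pattern to look for in your error logs.
```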

Common Root Causes

Unhandled Exceptions

The most common cause. A code path raises an exception that no try/except block catches. The framework's global exception handler returns 500.

# Bug: KeyError if 'user_id' missing from session
def get_cart(request):
    user_id = request.session['user_id']  # Raises KeyError -> 500
    return Cart.objects.get(user_id=user_id)

# Fix: use .get() with a default and fail with an explicit 404
from django.http import Http404

def get_cart(request):
    user_id = request.session.get('user_id')
    if not user_id:
        raise Http404
    # Note: Cart.DoesNotExist here would still 500;
    # consider get_object_or_404 if a missing cart is expected
    return Cart.objects.get(user_id=user_id)

Database Connection Exhaustion

Under load, all database connections in the pool are in use. New requests queue up, then time out, producing 500. Signs:

  • Error message contains too many connections or connection pool exhausted
  • Errors spike during traffic peaks
  • Database monitoring shows max connections reached

# Django: check connection settings
# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'CONN_MAX_AGE': 60,  # Persistent connections between requests
        # Alternatively, Django 5.1+ built-in pooling (psycopg 3 only).
        # Pooling requires CONN_MAX_AGE = 0, so use one or the other:
        # 'OPTIONS': {'pool': True},
    }
}

For high-traffic applications, put PgBouncer in front of PostgreSQL to multiplex thousands of application connections into a smaller pool of real database connections.
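A minimal PgBouncer configuration for this setup might look like the following sketch — values are illustrative, and pool sizes should be tuned against your database's max_connections:

```
; pgbouncer.ini (illustrative values)
[databases]
yourapp = host=127.0.0.1 port=5432 dbname=yourapp

[pgbouncer]
listen_port = 6432
; return server connections to the pool after each transaction
pool_mode = transaction
; client-side connections PgBouncer will accept from the app
max_client_conn = 2000
; real PostgreSQL connections held open per database
default_pool_size = 20
```

The app then connects to port 6432 instead of 5432; thousands of client connections share a pool of 20 real ones.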

Out-of-Memory (OOM) Kills

The Linux kernel's OOM killer terminates processes when memory is exhausted. If your gunicorn workers are killed mid-request, clients receive 502 (from Nginx) or 500. Check for OOM events:

# Check kernel OOM log
dmesg | grep -i 'oom\|killed process'

# Check systemd for service restarts
journalctl -u gunicorn --since '1 day ago' | grep -i 'kill\|restart\|exit'

Common OOM culprits: loading large files into memory, unbounded query results (SELECT * FROM large_table), memory leaks in long-running workers.
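For the file-loading case, stream in fixed-size chunks instead of reading the whole file into memory. A generic sketch:

```python
import hashlib

def stream_checksum(path, chunk_size=1 << 20):
    """Hash a file of any size using a bounded (1 MiB by default) buffer."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# The same pattern applies to query results: Django's QuerySet.iterator()
# streams rows in chunks instead of materializing the whole result set.
```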

Permission Errors

File system permission errors surface as 500 when the application tries to read or write files it doesn't own:

# Check ownership of app directories
ls -la /var/www/yourapp/
ls -la /var/www/yourapp/media/
ls -la /var/www/yourapp/logs/

# Fix ownership
sudo chown -R www-data:www-data /var/www/yourapp/media/

Prevention Strategies

Global Exception Handlers

Install a catch-all exception handler that logs the full context before returning 500. In Django, create a custom 500 view:

# config/urls.py
handler500 = 'apps.core.views.server_error'

# apps/core/views.py
import logging

from django.http import HttpResponse

logger = logging.getLogger(__name__)

def server_error(request):
    logger.exception('Unhandled 500 error', extra={
        'path': request.path,
        'method': request.method,
        'user': str(getattr(request, 'user', 'anonymous')),
    })
    return HttpResponse(status=500)

Health Checks That Actually Probe

A health check that only returns {"status": "ok"} without touching the database will pass even when your database is down. Write health checks that exercise real dependencies:

from django.http import JsonResponse

def health_check(request):
    checks = {}
    # Database
    try:
        from django.db import connection
        connection.ensure_connection()
        checks['database'] = 'ok'
    except Exception as e:
        checks['database'] = str(e)
    # Cache
    try:
        from django.core.cache import cache
        cache.set('health_check', '1', 5)
        checks['cache'] = 'ok'
    except Exception as e:
        checks['cache'] = str(e)

    status = 200 if all(v == 'ok' for v in checks.values()) else 500
    return JsonResponse(checks, status=status)

Resource Limits and Chaos Testing

Gunicorn has no per-worker memory cap of its own; the standard mitigation is to recycle workers periodically so slow leaks never accumulate, and to bound how long any single request can run:

# gunicorn.conf.py
max_requests = 1000        # Restart workers after N requests (bounds slow memory leaks)
max_requests_jitter = 100  # Randomize restarts to avoid a thundering herd
timeout = 30               # Kill workers that take too long on a single request

Test your error handling deliberately. Use chaos engineering tools (or a simple script) to kill database connections, exhaust file descriptors, or trigger OOM in staging. Verify that your monitoring alerts fire and your error pages render correctly before these failures happen in production.
