Debugging Connection Reset Errors

A systematic guide to diagnosing connection reset errors: TCP RST packets, firewall drops, keep-alive mismatches, and proxy timeout misconfigurations.

What Is a Connection Reset?

A connection reset (TCP RST) occurs when one side of a TCP connection abruptly closes it by sending a RST packet instead of the normal four-way FIN handshake. The receiving side sees this as an error:

  • Connection reset by peer (Linux / curl)
  • ERR_CONNECTION_RESET (Chrome)
  • System.Net.Sockets.SocketException: Connection reset by peer (.NET)

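Unlike a timeout, where nothing arrives, a reset is an explicit signal that the remote end (or a device in between) actively tore the connection down. The error surfaces as soon as the RST is received rather than after a waiting period.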
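
To see the difference between a reset and a normal close, the sketch below reproduces a RST on the loopback interface. It assumes Linux or macOS (the SO_LINGER struct layout differs on Windows), and the function name reset_demo is illustrative only. The server side enables SO_LINGER with a zero timeout, which makes close() abort the connection with a RST instead of a FIN.

```python
import socket
import struct
import threading


def reset_demo() -> str:
    """Abort a loopback connection with RST and report what the client sees."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    def serve() -> None:
        conn, _ = listener.accept()
        # l_onoff=1, l_linger=0: close() discards the connection and sends RST
        conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                        struct.pack("ii", 1, 0))
        conn.close()

    server = threading.Thread(target=serve)
    server.start()
    client = socket.create_connection(("127.0.0.1", port))
    server.join()  # the server has aborted the connection by now
    try:
        client.recv(1024)  # a FIN would return b""; a RST raises instead
        return "no error"
    except ConnectionResetError as exc:
        return type(exc).__name__
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print(reset_demo())
```

Running tcpdump on the loopback interface while this executes should show the corresponding R flag, which makes it a convenient way to rehearse the capture steps in this guide.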

Common Causes

Firewall Dropping Established Connections

Stateful firewalls track connection state. If a packet arrives for a connection the firewall no longer tracks (e.g., after a restart or an idle timeout), the firewall either drops it silently, which looks like a timeout, or rejects it with a RST, depending on how it is configured. AWS Security Groups, iptables rules, and cloud load balancer idle timeouts are frequent culprits.

Server-Side Crash or Restart

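When Gunicorn or uWSGI restarts mid-request, the OS closes every socket the process held. Connections with unread data are aborted with a RST, and any packet that later arrives for an already closed socket is also answered with a RST. Rolling deploys minimize this window but cannot eliminate it entirely.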

Keep-Alive Misconfiguration

A mismatch between client and server keep-alive timeouts causes resets. If Nginx has keepalive_timeout 65s for its upstream connections but the upstream Gunicorn has a 30s idle limit, Nginx may send a request on a connection Gunicorn has already closed and receive a RST in response.

Half-Open Connections

A half-open connection is one where one side thinks it is open but the other has closed it (e.g., after a server restart). The first new request on the half-open connection triggers a RST.
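
A client that pools connections can sometimes detect a stale one before reusing it. The sketch below is one possible check, not a complete solution: it only notices a close that has already reached the local kernel (a FIN or RST), so a peer that vanished without sending anything still looks healthy. The helper name peer_has_closed is hypothetical.

```python
import select
import socket


def peer_has_closed(sock: socket.socket) -> bool:
    """Return True if the kernel already knows the peer closed or reset."""
    # Zero timeout: poll without blocking
    readable, _, _ = select.select([sock], [], [], 0)
    if not readable:
        return False  # nothing pending, so no FIN or RST has been seen yet
    try:
        # MSG_PEEK leaves real data in the buffer for the next reader;
        # an empty result means the peer sent FIN
        return sock.recv(1, socket.MSG_PEEK) == b""
    except ConnectionResetError:
        return True  # the peer sent RST


if __name__ == "__main__":
    a, b = socket.socketpair()
    print(peer_has_closed(a))  # peer still open
    b.close()
    print(peer_has_closed(a))  # peer closed
    a.close()
```

Because a close can still arrive between the check and the next write, this reduces resets on reused connections rather than eliminating them.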

Reading TCP RST Packets

Use tcpdump to capture the reset:

# Capture TCP RST packets on port 443
sudo tcpdump -i eth0 'tcp[tcpflags] & tcp-rst != 0 and port 443'

# Save to file for Wireshark analysis
sudo tcpdump -i eth0 -w /tmp/capture.pcap 'port 443'

In the output, look for R flags:

14:32:01.123456 IP 10.0.1.5.42310 > 10.0.1.10.443: Flags [R.], ...

The address to the left of > is the sender, so in this line 10.0.1.5 sent the RST. The flag [R.] means RST with ACK set; a bare [R] is RST alone.
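
When a capture contains many resets, counting them per sender shows quickly which side is responsible. The following sketch parses tcpdump's default text output; it assumes tcpdump was run with -n so that addresses and ports are numeric, and the name count_rst_senders is illustrative.

```python
import re
from collections import Counter
from typing import Iterable

# Matches e.g. "IP 10.0.1.5.42310 > 10.0.1.10.443: Flags [R.], ..."
PACKET = re.compile(
    r"IP6? (?P<src>\S+)\.(?P<sport>\d+) > "
    r"(?P<dst>\S+)\.(?P<dport>\d+): Flags \[(?P<flags>[^\]]*)\]"
)


def count_rst_senders(lines: Iterable[str]) -> Counter:
    """Count RST packets per source address in tcpdump text output."""
    senders: Counter = Counter()
    for line in lines:
        match = PACKET.search(line)
        # "R" appears in the flags field only when the RST bit is set
        if match and "R" in match.group("flags"):
            senders[match.group("src")] += 1
    return senders


if __name__ == "__main__":
    sample = [
        "14:32:01.123456 IP 10.0.1.5.42310 > 10.0.1.10.443: Flags [R.], seq 1, ack 1, win 0, length 0",
        "14:32:01.200000 IP 10.0.1.10.443 > 10.0.1.5.42310: Flags [P.], seq 1:100, ack 1, win 502, length 99",
    ]
    print(count_rst_senders(sample))
```

Keep in mind that a middlebox can send a RST on behalf of either endpoint, so a capture taken on both hosts is more conclusive than one taken on a single side.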

Debugging with curl

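# Verbose output with timestamps for connection, TLS, and HTTP events
curl -v --trace-time https://api.example.com/endpoint

# Send two requests in one invocation; curl tries to reuse the connection
# for URLs on the same command line. Note that --keepalive-time sets the idle
# time before TCP keep-alive probes, not the HTTP keep-alive timeout.
curl -v --keepalive-time 30 https://api.example.com/endpoint \
  https://api.example.com/endpoint2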

# Measure connection time vs TTFB
curl -w 'connect=%{time_connect} ttfb=%{time_starttransfer}\n' \
  -o /dev/null -s https://api.example.com/

Keep-Alive Misconfigurations

The safest configuration is to set the client keep-alive timeout lower than the server's:

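# Nginx → Gunicorn: close idle upstream connections before Gunicorn does.
# Upstream keep-alive also needs proxy_http_version 1.1 and an empty
# Connection header (proxy_set_header Connection "") in the proxying location.
upstream gunicorn {
    server 127.0.0.1:8000;
    keepalive 32;          # max idle connections cached per worker process
    keepalive_timeout 30s; # must be < Gunicorn keepalive (65s below)
}

# gunicorn.conf.py
# Gunicorn keep-alive should be > the Nginx upstream keepalive_timeout
keepalive = 65   # seconds to wait for the next request on an idle connection
timeout = 120    # seconds before a silent worker is killed; not a keep-alive setting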
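
The ordering rule generalizes to any chain of proxies: each hop should give up on an idle connection before the next hop behind it does. A minimal sketch of that check follows; the function name and the (name, seconds) tuple format are assumptions made for illustration, and the timeout values must be read from your own configuration.

```python
from typing import List, Tuple


def find_keepalive_mismatches(hops: List[Tuple[str, float]]) -> List[str]:
    """Report hops whose idle timeout is not lower than the next hop's.

    hops is ordered from the client side toward the origin server; each
    entry is (name, idle timeout in seconds).
    """
    problems = []
    for (near, near_t), (far, far_t) in zip(hops, hops[1:]):
        # The nearer hop reuses connections to the farther hop, so it must
        # drop an idle connection before the farther hop closes it
        if near_t >= far_t:
            problems.append(
                f"{near} ({near_t}s) should be lower than {far} ({far_t}s)"
            )
    return problems


if __name__ == "__main__":
    chain = [("nginx upstream", 30), ("gunicorn", 65)]
    print(find_keepalive_mismatches(chain))
```

An empty result means every hop closes idle connections before the hop behind it, which is the arrangement the configuration above aims for.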

Proxy and Load Balancer Issues

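AWS ALB has a default idle timeout of 60 seconds. If a connection stays open longer than that without any data being sent or received, the ALB closes it, and a peer that keeps using the connection typically sees a reset. Fix: increase the ALB idle timeout, or send periodic application-level keep-alive pings so that data crosses the connection before the timeout elapses.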

WebSocket connections deserve a separate check: proxies often apply different timeouts to WebSocket traffic than to ordinary HTTP requests, so consult your proxy's documentation for both settings.

Resolution Checklist

  • [ ] Capture RST packets with tcpdump to identify which side resets
  • [ ] Check firewall idle timeout and ensure it exceeds the keep-alive interval
  • [ ] Check load balancer idle timeout (AWS ALB: EC2 > Load Balancers > Attributes)
  • [ ] Verify Nginx upstream keepalive_timeout < Gunicorn/uWSGI keep-alive timeout
  • [ ] Check for process restarts in application logs during the incident
  • [ ] Enable TCP keep-alive at the socket level for long-lived connections
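
For the last checklist item, socket-level keep-alive can be enabled as sketched below. The function name and default values are illustrative; the idle value should be shorter than the shortest idle timeout on the path. The TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT options are platform-dependent (all three are available on Linux), so each is guarded. Some layer 7 proxies may not treat TCP keep-alive probes as traffic, in which case application-level pings are still needed.

```python
import socket


def enable_tcp_keepalive(sock: socket.socket, idle: int = 45,
                         interval: int = 10, probes: int = 5) -> None:
    """Turn on TCP keep-alive and tune it where the platform allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idleness before the first probe is sent
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between unanswered probes
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Unanswered probes before the connection is declared dead
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_tcp_keepalive(s)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```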
