Error Handling Patterns

Dead Letter Queues and Async Error Handling

How dead letter queues capture failed async messages for later inspection and reprocessing — a safety net for webhook delivery and event-driven systems.

Async Communication Failures

Synchronous HTTP calls fail visibly — you get an error immediately and can handle it in the same request context. Asynchronous communication (message queues, event streams, webhooks) fails silently: the message is gone, and the producer may never know.

Dead letter queues (DLQs) provide a safety net: messages that cannot be processed are moved to a separate queue for inspection and reprocessing, rather than being dropped.

What Is a DLQ?

A dead letter queue is a secondary queue that receives messages that failed processing after a configured number of attempts. It is separate from the main queue so it doesn't block live traffic.

Most message brokers support DLQs natively:

Broker           DLQ Mechanism
AWS SQS          `RedrivePolicy` with `maxReceiveCount`
RabbitMQ         `x-dead-letter-exchange` argument
Apache Kafka     Custom consumer with separate DLQ topic
Google Pub/Sub   Dead letter topic with `maxDeliveryAttempts`

AWS SQS DLQ Configuration

{
  "RedrivePolicy": {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123:orders-dlq",
    "maxReceiveCount": 3
  }
}

After 3 failed processing attempts, the message moves to `orders-dlq` automatically.

When Messages Fail

Messages are sent to the DLQ when:

  • Poison messages: malformed payload that can never be processed
  • Logic errors: bugs in the consumer that throw exceptions
  • Dependency failures: downstream service unavailable during processing
  • Schema mismatches: producer and consumer schema versions don't align
  • Timeout exceeded: processing takes too long
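Which bucket a failure falls into determines whether retrying helps: a poison message or schema mismatch never recovers, while a dependency failure or timeout usually does. A minimal classifier sketch, assuming JSON payloads with a required `order_id` field (both illustrative):

```python
import json

def classify_failure(body: str) -> str:
    """Decide whether a failed message is worth retrying."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Malformed payload: retrying will never help.
        return 'poison'
    if 'order_id' not in payload:
        # Schema mismatch: also unrecoverable without a code/data fix.
        return 'poison'
    # Anything else (dependency failure, timeout) may succeed on retry.
    return 'transient'
```

A consumer can route 'poison' results straight to the DLQ without burning retry attempts, reserving retries for transient failures.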

DLQ Processing Strategies

1. Inspect and Fix

The simplest approach: treat the DLQ as an alert that requires human intervention:

  • Monitor DLQ depth — alert when depth > 0
  • Inspect failed messages with their error context
  • Fix the bug or data issue
  • Replay messages back to the main queue
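The replay step can be sketched broker-agnostically with in-memory queues; against a real broker the same loop would receive from the DLQ, send to the main queue, and delete on success:

```python
from collections import deque

def replay(dlq: deque, main_queue: deque, limit: int = 10) -> int:
    """Move up to `limit` messages from the DLQ back to the main queue."""
    moved = 0
    while dlq and moved < limit:
        # popleft/append preserves the original FIFO order.
        main_queue.append(dlq.popleft())
        moved += 1
    return moved
```

The `limit` parameter matters in practice: replaying an entire large DLQ at once can overwhelm the consumer that just recovered.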

2. Automated Reprocessing

A DLQ processor periodically retries DLQ messages with exponential backoff. Be careful: if the root cause is a poison message (bad data), reprocessing loops forever.

def process_dlq(max_retries: int = 3) -> None:
    # Request ApproximateReceiveCount so we can count delivery attempts.
    messages = dlq.receive_messages(
        MaxNumberOfMessages=10,
        AttributeNames=['ApproximateReceiveCount'],
    )
    for msg in messages:
        attempt = int(msg.attributes.get('ApproximateReceiveCount', '1'))
        if attempt > max_retries:
            archive_and_alert(msg)  # Give up, alert humans
            msg.delete()
            continue
        try:
            process_message(msg)
            msg.delete()
        except Exception:
            # Leave the message in the DLQ; it becomes visible again
            # once the visibility timeout expires.
            pass

3. Archive for Audit

Write DLQ messages to cold storage (S3, database) before deleting. This provides an audit trail and enables batch reprocessing after fixes.
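A file-based sketch of the archive step; in production the destination would be S3 or a database table as described above, and the `archive_message` helper and its record fields are illustrative:

```python
import json
import time

def archive_message(msg_body: str, error: str, path: str) -> None:
    """Append a failed message plus its error context as one JSON line."""
    record = {
        'archived_at': time.time(),  # when we gave up on the message
        'error': error,              # why processing failed
        'body': msg_body,            # original payload, for replay later
    }
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record) + '\n')
```

Append-only JSON lines keep each record independently parseable, so batch reprocessing after a fix is a simple line-by-line read.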

Monitoring and Alerting

  • Alert immediately when DLQ depth goes above zero — do not let messages accumulate silently
  • Track message age in the DLQ — stale messages indicate a persistent failure
  • Measure DLQ throughput — a spike indicates a systemic consumer bug
  • Use message attributes to carry error context (exception type, stack trace snippet, original timestamp)
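Message age can be derived from the broker's own metadata. With SQS, for example, the `SentTimestamp` attribute is epoch milliseconds (returned as a string); the staleness threshold below is an assumed value:

```python
import time
from typing import Optional

def message_age_seconds(sent_timestamp_ms: str,
                        now: Optional[float] = None) -> float:
    """Age of a message given its SentTimestamp (epoch milliseconds)."""
    now = time.time() if now is None else now
    return now - int(sent_timestamp_ms) / 1000.0

def is_stale(sent_timestamp_ms: str, threshold_s: float = 3600.0) -> bool:
    # Assumed threshold: anything older than an hour signals a
    # persistent failure rather than a transient blip.
    return message_age_seconds(sent_timestamp_ms) > threshold_s
```

Emitting the maximum age across sampled DLQ messages as a metric makes "stale messages indicate a persistent failure" directly alertable.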

Webhook Delivery Failures

Webhooks are an HTTP-based async pattern with similar failure modes. When a webhook delivery fails (non-2xx response, timeout, connection error), treat it like a message that went to a DLQ:

  • Retry with exponential backoff (1s, 2s, 4s, 8s, ... up to 24h)
  • After N retries, move to a 'failed deliveries' table (your DLQ)
  • Alert the webhook owner via email or dashboard
  • Provide a UI to inspect and manually replay failed deliveries
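The retry schedule above can be generated directly; the base delay, cap, and retry count here are illustrative parameters:

```python
def backoff_schedule(base_s: float = 1.0,
                     cap_s: float = 86400.0,
                     max_retries: int = 20) -> list:
    """Exponential delays: 1s, 2s, 4s, ... capped at 24h (86400s)."""
    return [min(base_s * 2 ** i, cap_s) for i in range(max_retries)]
```

After the schedule is exhausted, the delivery record moves to the 'failed deliveries' table rather than being retried forever. Many systems also add random jitter to each delay so that retries from many endpoints don't synchronize.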

Summary

Dead letter queues are an essential safety net for async systems. Configure DLQs on every queue, alert when DLQ depth > 0, store error context with each failed message, and have a clear strategy for reprocessing — whether manual or automated.
