Error Handling Patterns

Dead Letter Queues and Async Error Handling

How dead letter queues capture failed async messages for later inspection and reprocessing — a safety net for webhook delivery and event-driven systems.

Async Communication Failures

Synchronous HTTP calls fail visibly — you get an error immediately and can handle it in the same request context. Asynchronous communication (message queues, event streams, webhooks) fails silently: the message is gone, and the producer may never know.

Dead letter queues (DLQs) provide a safety net: messages that cannot be processed are moved to a separate queue for inspection and reprocessing, rather than being dropped.

What Is a DLQ?

A dead letter queue is a secondary queue that receives messages that failed processing after a configured number of attempts. It is separate from the main queue so it doesn't block live traffic.

Most message brokers support DLQs natively:

Broker           DLQ Mechanism
AWS SQS          `RedrivePolicy` with `maxReceiveCount`
RabbitMQ         `x-dead-letter-exchange` argument
Apache Kafka     Custom consumer with separate DLQ topic
Google Pub/Sub   Dead letter topic with `maxDeliveryAttempts`

AWS SQS DLQ Configuration

{
  "RedrivePolicy": {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123:orders-dlq",
    "maxReceiveCount": 3
  }
}

After 3 failed processing attempts, the message moves to `orders-dlq` automatically.

When Messages Fail

Messages are sent to the DLQ when:

  • Poison messages: malformed payload that can never be processed
  • Logic errors: bugs in the consumer that throw exceptions
  • Dependency failures: downstream service unavailable during processing
  • Schema mismatches: producer and consumer schema versions don't align
  • Timeout exceeded: processing takes too long
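Which bucket a failure falls into determines whether retrying helps: a poison message or schema mismatch never recovers, while a dependency failure or timeout usually does. A minimal classifier sketch, assuming JSON payloads with a required `order_id` field (both illustrative):

```python
import json

def classify_failure(body: str) -> str:
    """Decide whether a failed message is worth retrying."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Malformed payload: retrying will never help.
        return 'poison'
    if 'order_id' not in payload:
        # Schema mismatch: also unrecoverable without a code/data fix.
        return 'poison'
    # Anything else (dependency failure, timeout) may succeed on retry.
    return 'transient'
```

A consumer can route 'poison' results straight to the DLQ without burning retry attempts, reserving retries for transient failures.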

DLQ Processing Strategies

1. Inspect and Fix

The simplest approach: treat the DLQ as an alert that requires human intervention:

  • Monitor DLQ depth — alert when depth > 0
  • Inspect failed messages with their error context
  • Fix the bug or data issue
  • Replay messages back to the main queue
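The replay step can be sketched broker-agnostically with in-memory queues; against a real broker the same loop would receive from the DLQ, send to the main queue, and delete on success:

```python
from collections import deque

def replay(dlq: deque, main_queue: deque, limit: int = 10) -> int:
    """Move up to `limit` messages from the DLQ back to the main queue."""
    moved = 0
    while dlq and moved < limit:
        # popleft/append preserves the original FIFO order.
        main_queue.append(dlq.popleft())
        moved += 1
    return moved
```

The `limit` parameter matters in practice: replaying an entire large DLQ at once can overwhelm the consumer that just recovered.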

2. Automated Reprocessing

A DLQ processor periodically retries DLQ messages with exponential backoff. Be careful: if the root cause is a poison message (bad data), reprocessing loops forever.

def process_dlq(max_retries: int = 3) -> None:
    # Request ApproximateReceiveCount so we can count delivery attempts.
    messages = dlq.receive_messages(
        MaxNumberOfMessages=10,
        AttributeNames=['ApproximateReceiveCount'],
    )
    for msg in messages:
        attempt = int(msg.attributes.get('ApproximateReceiveCount', '1'))
        if attempt > max_retries:
            archive_and_alert(msg)  # Give up, alert humans
            msg.delete()
            continue
        try:
            process_message(msg)
            msg.delete()
        except Exception:
            # Leave the message in the DLQ; it becomes visible again
            # once the visibility timeout expires.
            pass

3. Archive for Audit

Write DLQ messages to cold storage (S3, database) before deleting. This provides an audit trail and enables batch reprocessing after fixes.
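A file-based sketch of the archive step; in production the destination would be S3 or a database table as described above, and the `archive_message` helper and its record fields are illustrative:

```python
import json
import time

def archive_message(msg_body: str, error: str, path: str) -> None:
    """Append a failed message plus its error context as one JSON line."""
    record = {
        'archived_at': time.time(),  # when we gave up on the message
        'error': error,              # why processing failed
        'body': msg_body,            # original payload, for replay later
    }
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record) + '\n')
```

Append-only JSON lines keep each record independently parseable, so batch reprocessing after a fix is a simple line-by-line read.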

Monitoring and Alerting

  • Alert immediately when DLQ depth goes above zero — do not let messages accumulate silently
  • Track message age in the DLQ — stale messages indicate a persistent failure
  • Measure DLQ throughput — a spike indicates a systemic consumer bug
  • Use message attributes to carry error context (exception type, stack trace snippet, original timestamp)
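Message age can be derived from the broker's own metadata. With SQS, for example, the `SentTimestamp` attribute is epoch milliseconds (returned as a string); the staleness threshold below is an assumed value:

```python
import time
from typing import Optional

def message_age_seconds(sent_timestamp_ms: str,
                        now: Optional[float] = None) -> float:
    """Age of a message given its SentTimestamp (epoch milliseconds)."""
    now = time.time() if now is None else now
    return now - int(sent_timestamp_ms) / 1000.0

def is_stale(sent_timestamp_ms: str, threshold_s: float = 3600.0) -> bool:
    # Assumed threshold: anything older than an hour signals a
    # persistent failure rather than a transient blip.
    return message_age_seconds(sent_timestamp_ms) > threshold_s
```

Emitting the maximum age across sampled DLQ messages as a metric makes "stale messages indicate a persistent failure" directly alertable.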

Webhook Delivery Failures

Webhooks are an HTTP-based async pattern with similar failure modes. When a webhook delivery fails (non-2xx response, timeout, connection error), treat it like a message that went to a DLQ:

  • Retry with exponential backoff (1s, 2s, 4s, 8s, ... up to 24h)
  • After N retries, move to a 'failed deliveries' table (your DLQ)
  • Alert the webhook owner via email or dashboard
  • Provide a UI to inspect and manually replay failed deliveries
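The retry schedule above can be generated directly; the base delay, cap, and retry count here are illustrative parameters:

```python
def backoff_schedule(base_s: float = 1.0,
                     cap_s: float = 86400.0,
                     max_retries: int = 20) -> list:
    """Exponential delays: 1s, 2s, 4s, ... capped at 24h (86400s)."""
    return [min(base_s * 2 ** i, cap_s) for i in range(max_retries)]
```

After the schedule is exhausted, the delivery record moves to the 'failed deliveries' table rather than being retried forever. Many systems also add random jitter to each delay so that retries from many endpoints don't synchronize.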

Summary

Dead letter queues are an essential safety net for async systems. Configure DLQs on every queue, alert when DLQ depth > 0, store error context with each failed message, and have a clear strategy for reprocessing — whether manual or automated.
