Webhooks are deceptively simple in the happy path. An event fires, your endpoint receives a POST request, you process it, you return a 200. Done. But production systems are not happy paths. Your endpoint times out. Your database is momentarily overloaded when the request arrives. A deploy introduces a regression that rejects valid payloads. A network partition between the sender and your server drops a batch of events with no warning.
The developers who get burned by webhooks are almost always the ones who built for the happy path and discovered, sometimes months later and sometimes only by noticing missing data in production, that their system had been silently losing events the whole time. This guide is about building webhook infrastructure that doesn't do that.
We'll cover the full reliability stack: signature verification, idempotency, retry logic with exponential backoff, dead letter queues for events that exhaust retries, and the monitoring layer that tells you when something is going wrong before users notice.
Why Webhooks Fail: A Taxonomy of Failure Modes
Before building solutions, it's worth being precise about the problem space. Webhook delivery failures fall into five categories, and each one requires a different response.
Transient infrastructure failures. Your server returns a 500 or 503 because a dependency (database, cache, downstream API) is temporarily unavailable. The event is valid. The processing logic is correct. The system just wasn't ready. These are the easiest to handle: a retry with a short delay almost always succeeds.
Timeout failures. Your endpoint takes longer than the sender's timeout threshold to respond. Many webhook providers time out after 5-30 seconds. If your processing is slow — say, you're doing synchronous database writes and downstream API calls before returning 200 — you will fail this way constantly under load. The fix is to respond 200 immediately and process asynchronously.
Logic failures. Your code throws an unhandled exception because the payload structure doesn't match what you expected, a new event type appears that you haven't handled, or a recent code change introduced a bug. These events will fail on every retry until you fix the code. This is why dead letter queues exist — you need somewhere to hold these events so you can reprocess them after the fix is deployed.
Network failures. The TCP connection is never established, or is dropped mid-request. From the sender's perspective, this looks like a timeout or a connection refused. From your perspective, you may have received nothing, or you may have received a partial request. Good webhook providers retry these automatically; not all do.
Duplicate delivery. This is not a failure in the traditional sense, but it is a failure mode you must handle. Any reliable webhook system will deliver at-least-once, not exactly-once. When a sender retries a delivery because it didn't receive your 200 (even if your code processed the event and the response was lost in transit), you will receive the same event twice. Your system must be idempotent.
Foundation: Verify Signatures on Every Request
Before any error handling or retry logic, you need to establish that the request is actually coming from the sender you expect and hasn't been tampered with in transit. This is signature verification, and it belongs at the very top of your webhook handler — before you parse the body, before you touch a database.
Most webhook providers sign their requests using HMAC-SHA256. They compute a signature over the raw request body using a shared secret, and they send the signature in a request header. Your job is to recompute the same signature and compare it to the header value.
```python
import hmac
import hashlib
import time

from fastapi import Request, HTTPException

WEBHOOK_SECRET = b"your-webhook-secret-here"
TIMESTAMP_TOLERANCE_SECONDS = 300  # 5 minutes


async def verify_webhook_signature(request: Request) -> bytes:
    """
    Verifies the HMAC-SHA256 signature of an incoming webhook request.
    Returns the raw body bytes if valid, raises HTTPException if not.
    """
    body = await request.body()

    # Reject requests without a signature header
    signature_header = request.headers.get("X-Webhook-Signature")
    if not signature_header:
        raise HTTPException(status_code=401, detail="Missing signature")

    # Many providers also send a timestamp to prevent replay attacks
    timestamp_header = request.headers.get("X-Webhook-Timestamp")
    if timestamp_header:
        try:
            request_time = int(timestamp_header)
            current_time = int(time.time())
            if abs(current_time - request_time) > TIMESTAMP_TOLERANCE_SECONDS:
                raise HTTPException(
                    status_code=401,
                    detail="Request timestamp too far from current time"
                )
        except ValueError:
            raise HTTPException(status_code=401, detail="Invalid timestamp")

    # Recompute expected signature
    expected_sig = hmac.new(
        WEBHOOK_SECRET,
        body,
        hashlib.sha256
    ).hexdigest()

    # Constant-time comparison to prevent timing attacks
    if not hmac.compare_digest(
        signature_header.removeprefix("sha256="),
        expected_sig
    ):
        raise HTTPException(status_code=401, detail="Invalid signature")

    return body
```
Two things worth noting here. First, hmac.compare_digest is not stylistic — comparing strings with == short-circuits on the first non-matching character, which leaks timing information that a sophisticated attacker can use to reconstruct the secret. Always use the constant-time comparison. Second, the timestamp tolerance window (5 minutes here) prevents replay attacks, where an attacker captures a valid signed request and replays it minutes or hours later.
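For reference, the sending side of this scheme is the mirror image: compute the HMAC over the exact raw bytes of the body and attach it as a header. A minimal sketch, assuming the same `X-Webhook-Signature` / `X-Webhook-Timestamp` header names and `sha256=` prefix used by the verifier above (these names vary by provider). Note that this sketch signs only the body, matching the verifier; providers that want the timestamp check to be tamper-proof include the timestamp in the signed message as well.

```python
import hmac
import hashlib
import time


def sign_webhook(secret: bytes, body: bytes) -> dict[str, str]:
    """Produce the headers a sender would attach to an outgoing webhook."""
    timestamp = str(int(time.time()))
    # Sign the raw body bytes, never a re-serialized version of the payload
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return {
        "X-Webhook-Signature": f"sha256={signature}",
        "X-Webhook-Timestamp": timestamp,
    }
```

Having a sender-side helper like this is also useful for writing integration tests against your own endpoint: you can generate validly signed requests without involving the real provider.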
Respond Immediately, Process Asynchronously
The most common cause of timeout failures is doing synchronous processing inside the webhook handler. If your handler does a database write, calls a downstream API, sends an email, and then returns 200 — you are coupling your response time to the slowest downstream operation in that chain. Under any load, you will start timing out.
The correct pattern is: receive the request, verify the signature, enqueue the payload for async processing, return 200. Everything else happens in a background worker.
```python
import json

from collections import deque
from dataclasses import dataclass, field
from datetime import datetime
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()


@dataclass
class WebhookEvent:
    event_id: str
    event_type: str
    payload: dict
    received_at: datetime = field(default_factory=datetime.utcnow)
    attempt_count: int = 0
    last_error: str | None = None


# In production, replace this with Redis, SQS, or a proper message queue
event_queue: deque[WebhookEvent] = deque()


@app.post("/webhook")
async def receive_webhook(request: Request):
    body = await verify_webhook_signature(request)
    payload = json.loads(body)

    # Without a stable event ID, deduplication downstream is impossible
    event_id = payload.get("id") or request.headers.get("X-Event-Id")
    if not event_id:
        raise HTTPException(status_code=400, detail="Missing event ID")

    event = WebhookEvent(
        event_id=event_id,
        event_type=payload.get("type", "unknown"),
        payload=payload
    )
    event_queue.append(event)

    # Return 200 immediately — processing happens in the background
    return {"status": "accepted", "event_id": event.event_id}
```
For production systems, the in-memory deque should be replaced with a proper message queue — Redis with a list or streams, AWS SQS, Google Pub/Sub, RabbitMQ. The choice depends on your infrastructure, but the principle is the same: decouple receipt from processing.
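Even before reaching for external infrastructure, the stdlib `asyncio.Queue` makes the decoupling concrete inside a single process: the handler awaits `put`, a worker coroutine drains the queue, and `join` gives you a completion signal. A sketch of the pattern (the names `run_worker` and `demo`, and the trivial handler, are illustrative, not part of the article's code):

```python
import asyncio


async def run_worker(queue: asyncio.Queue, handler) -> None:
    """Drain the queue forever, handing each event to the handler."""
    while True:
        event = await queue.get()
        try:
            await handler(event)
        finally:
            # Mark the item done even on failure so queue.join() can return
            queue.task_done()


async def demo() -> list:
    processed = []

    async def handler(event):
        processed.append(event)

    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(run_worker(queue, handler))
    for i in range(3):
        await queue.put({"id": f"evt_{i}"})
    await queue.join()   # wait until every enqueued event has been handled
    worker.cancel()
    return processed
```

The same shape carries over to Redis or SQS: only the `put`/`get` calls change, while receipt and processing stay decoupled.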
Idempotency: Handling Duplicate Deliveries
Before building the retry and processing logic, the idempotency layer needs to be in place. Without it, a retry from the sender — or a network hiccup that caused your 200 response to be lost in transit — results in the same event being processed twice, which in most business contexts means double-charges, duplicate records, or double-sent emails.
The standard approach is an idempotency key store: before processing any event, check whether you've seen this event ID before. If yes, return success without reprocessing. If no, process it and record the ID.
```python
import redis

from typing import Callable, Any

redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)
IDEMPOTENCY_TTL_SECONDS = 86400 * 7  # 7 days


def is_duplicate(event_id: str) -> bool:
    """Returns True if we've already successfully processed this event."""
    return redis_client.exists(f"processed:{event_id}") == 1


def mark_processed(event_id: str) -> None:
    """Mark an event as successfully processed."""
    redis_client.setex(
        f"processed:{event_id}",
        IDEMPOTENCY_TTL_SECONDS,
        "1"
    )


async def process_with_idempotency(
    event: WebhookEvent,
    handler: Callable[[WebhookEvent], Any]
) -> None:
    if is_duplicate(event.event_id):
        print(f"Skipping duplicate event: {event.event_id}")
        return
    await handler(event)
    mark_processed(event.event_id)
```
The TTL on the idempotency key is important. Set it to cover the maximum retry window your sender will use, plus a comfortable margin. If a sender retries for up to 72 hours, your idempotency keys should live for at least a week. Setting it too short means old-but-valid retries bypass your deduplication.
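One refinement worth knowing: the check in `is_duplicate` followed by the later `mark_processed` is a check-then-act sequence, so two workers handling the same duplicate concurrently can both pass the check. In Redis the fix is to claim the key atomically with a single `SET key value NX EX ttl` (redis-py: `redis_client.set(key, "1", nx=True, ex=ttl)`, which returns a truthy value only for the first caller). A thread-safe in-memory stand-in that illustrates the claim pattern:

```python
import threading


class IdempotencyStore:
    """In-memory stand-in for an atomic idempotency claim.

    In production this is one Redis call:
        redis_client.set(f"processed:{event_id}", "1", nx=True, ex=ttl)
    """

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self._lock = threading.Lock()

    def try_claim(self, event_id: str) -> bool:
        """Atomically claim an event ID. Returns True only for the first caller."""
        with self._lock:
            if event_id in self._seen:
                return False
            self._seen.add(event_id)
            return True
```

If you claim before processing rather than after, remember to release the claim when processing fails, so a later retry isn't wrongly skipped as a duplicate.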
Retry Logic with Exponential Backoff
When a processing attempt fails — database unavailable, downstream API error, any transient fault — the event should be retried. But retries need to be implemented carefully. Retrying immediately after a failure rarely helps (the underlying condition usually hasn't resolved in the milliseconds since the first attempt), and retrying at a fixed interval under sustained failure creates thundering-herd problems where every queued event retries at the same moment.
Exponential backoff with jitter solves both problems: each retry waits progressively longer, and the jitter (random variance) prevents synchronized retry storms.
```python
import asyncio
import random
import logging

from typing import Callable

logger = logging.getLogger(__name__)

MAX_RETRY_ATTEMPTS = 5
BASE_DELAY_SECONDS = 1.0
MAX_DELAY_SECONDS = 300.0  # 5 minutes cap


def calculate_backoff(attempt: int) -> float:
    """
    Exponential backoff with full jitter.
        attempt=0 -> uniform(0, 1s)
        attempt=1 -> uniform(0, 2s)
        attempt=2 -> uniform(0, 4s)
        attempt=3 -> uniform(0, 8s)
        attempt=4 -> uniform(0, 16s)
    """
    exponential = BASE_DELAY_SECONDS * (2 ** attempt)
    capped = min(exponential, MAX_DELAY_SECONDS)
    # Full jitter: uniform between 0 and the capped delay
    return random.uniform(0, capped)


async def process_with_retry(
    event: WebhookEvent,
    handler: Callable,
    dead_letter_queue: list
) -> None:
    while event.attempt_count < MAX_RETRY_ATTEMPTS:
        try:
            await process_with_idempotency(event, handler)
            logger.info(
                f"Event {event.event_id} processed successfully "
                f"on attempt {event.attempt_count + 1}"
            )
            return
        except Exception as exc:
            event.attempt_count += 1
            event.last_error = str(exc)
            if event.attempt_count >= MAX_RETRY_ATTEMPTS:
                logger.error(
                    f"Event {event.event_id} exhausted retries after "
                    f"{MAX_RETRY_ATTEMPTS} attempts. "
                    f"Last error: {exc}. Moving to DLQ."
                )
                dead_letter_queue.append(event)
                return
            delay = calculate_backoff(event.attempt_count)
            logger.warning(
                f"Event {event.event_id} failed on attempt {event.attempt_count}. "
                f"Retrying in {delay:.1f}s. Error: {exc}"
            )
            await asyncio.sleep(delay)
```
The MAX_RETRY_ATTEMPTS ceiling is a business decision as much as a technical one. With the constants above, five attempts mean a worst-case waiting window of about 30 seconds (delays of up to 2, 4, 8, and 16 seconds between attempts, before jitter). For many business events (user signups, subscription updates, payment notifications), a short window like that is acceptable because the underlying fault is usually transient. For time-sensitive events, you may want fewer retries with shorter delays. For low-urgency batch processing events, raise BASE_DELAY_SECONDS and the attempt count to stretch the window to minutes or hours.
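It's worth sanity-checking the window your constants actually produce before shipping them. A quick helper that mirrors `calculate_backoff` without the jitter (jitter can only shorten these delays):

```python
BASE_DELAY_SECONDS = 1.0
MAX_DELAY_SECONDS = 300.0


def max_total_wait(max_attempts: int) -> float:
    """Worst-case total sleep time before an event reaches the DLQ.

    Delays occur after failed attempts 1 through max_attempts - 1
    (the final failure goes straight to the DLQ), each capped at
    MAX_DELAY_SECONDS.
    """
    return sum(
        min(BASE_DELAY_SECONDS * 2 ** attempt, MAX_DELAY_SECONDS)
        for attempt in range(1, max_attempts)
    )
```

With the defaults, `max_total_wait(5)` is 30.0 seconds (2 + 4 + 8 + 16); the cap only starts to matter once `BASE_DELAY_SECONDS * 2 ** attempt` exceeds 300.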
Dead Letter Queues: The Safety Net
A dead letter queue (DLQ) is where events go when they have exhausted all retry attempts without succeeding. It is not a failure state — it is a holding state. The events are preserved, along with their error history, so you can investigate the root cause, fix the underlying issue, and reprocess them.
Without a DLQ, events that fail all retries are simply lost. That is the silent data loss scenario that makes webhook debugging so painful — you don't know what you don't know.
```python
import asyncio
import json

from datetime import datetime
from pathlib import Path
from typing import Callable

DLQ_FILE = Path("/var/log/webhook-dlq.jsonl")


class DeadLetterQueue:
    def __init__(self, storage_path: Path = DLQ_FILE):
        self.storage_path = storage_path
        self._queue: list[WebhookEvent] = []
        self._load_from_disk()

    def append(self, event: WebhookEvent) -> None:
        self._queue.append(event)
        self._persist(event)
        logger.error(
            f"DLQ: stored event {event.event_id} (type={event.event_type}, "
            f"attempts={event.attempt_count}, "
            f"last_error={event.last_error})"
        )

    def _persist(self, event: WebhookEvent) -> None:
        """Append to newline-delimited JSON file for durable storage."""
        record = {
            "event_id": event.event_id,
            "event_type": event.event_type,
            "payload": event.payload,
            "received_at": event.received_at.isoformat(),
            "attempt_count": event.attempt_count,
            "last_error": event.last_error,
            "queued_at": datetime.utcnow().isoformat()
        }
        with open(self.storage_path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def _load_from_disk(self) -> None:
        """Reload unprocessed DLQ items on server restart."""
        if not self.storage_path.exists():
            return
        with open(self.storage_path) as f:
            for line in f:
                try:
                    record = json.loads(line.strip())
                    # Reconstruct WebhookEvent from stored record
                    # (simplified — add full reconstruction in production)
                    logger.info(f"DLQ loaded: {record['event_id']}")
                except json.JSONDecodeError:
                    continue

    def replay(self, handler: Callable, event_ids: list[str] | None = None) -> None:
        """
        Reprocess DLQ events. Pass event_ids to replay specific events,
        or None to replay all.
        """
        events_to_replay = [
            e for e in self._queue
            if event_ids is None or e.event_id in event_ids
        ]
        for event in events_to_replay:
            # Reset attempt count for the replay
            event.attempt_count = 0
            asyncio.create_task(
                process_with_retry(event, handler, self)
            )
        logger.info(f"DLQ replay started for {len(events_to_replay)} events")
```
In production, the file-based DLQ shown here should be replaced with a more robust storage mechanism — a dedicated database table, a cloud DLQ service (AWS SQS dead-letter queues, Google Cloud Pub/Sub dead-letter topics), or a Redis sorted set with the event timestamp as the score. The file approach works for low-volume systems and local development, but it doesn't scale and doesn't give you visibility without querying the file directly.
Monitoring: Knowing Before Users Do
The retry and DLQ infrastructure handles failures gracefully, but you need a monitoring layer that tells you when failures are happening so you can investigate before they become user-visible problems. Three metrics to track for every webhook endpoint:
Delivery success rate — the percentage of received events that are processed successfully on the first attempt. A healthy system should be above 99%. A sudden drop below 95% is a critical alert.
Retry rate — the percentage of events that required at least one retry. A small retry rate (1-3%) is normal and expected. A rising retry rate is an early warning sign of infrastructure stress or a code regression.
DLQ depth — the number of events currently in the dead letter queue. This should be zero under normal operation. Any event landing in the DLQ should trigger an alert, not just increment a metric.
```python
from dataclasses import dataclass, field
from threading import Lock


@dataclass
class WebhookMetrics:
    total_received: int = 0
    total_processed: int = 0
    total_retried: int = 0
    total_dead_lettered: int = 0
    _lock: Lock = field(default_factory=Lock, repr=False)

    def record_received(self) -> None:
        with self._lock:
            self.total_received += 1

    def record_processed(self) -> None:
        with self._lock:
            self.total_processed += 1

    def record_retry(self) -> None:
        with self._lock:
            self.total_retried += 1

    def record_dead_lettered(self) -> None:
        with self._lock:
            self.total_dead_lettered += 1
        # In production, emit this to your alerting system
        logger.critical(
            "Event moved to DLQ — immediate investigation required"
        )

    @property
    def success_rate(self) -> float:
        if self.total_received == 0:
            return 1.0
        return self.total_processed / self.total_received

    @property
    def retry_rate(self) -> float:
        if self.total_received == 0:
            return 0.0
        return self.total_retried / self.total_received


metrics = WebhookMetrics()


@app.get("/webhook/health")
def webhook_health():
    return {
        "success_rate": metrics.success_rate,
        "retry_rate": metrics.retry_rate,
        "dead_letter_count": metrics.total_dead_lettered,
        "total_processed": metrics.total_processed
    }
```
Putting It Together: The Production Checklist
A production-grade webhook system needs all five layers working together. As a summary checklist:
- Signature verification on every incoming request, before any processing.
- Timestamp validation to prevent replay attacks, with a tolerance window that matches your use case.
- Async processing — respond 200 immediately, process in the background, never do synchronous work in the handler.
- Idempotency keys with TTL covering your full retry window, stored in persistent storage (not in-memory).
- Exponential backoff with jitter on retries, with a sensible maximum attempt count for your event types.
- Dead letter queue with durable persistence and a replay mechanism for post-fix reprocessing.
- Metrics and alerting on success rate, retry rate, and DLQ depth — with an immediate alert on any DLQ event.
Webhooks are the connective tissue of modern distributed systems, and the investment in building them right is paid back immediately — the first time a transient failure auto-recovers without anyone noticing, and even more the first time you catch a logic bug in the DLQ and replay the affected events instead of discovering the data loss three weeks later in a user complaint.
Build for failure, and your webhooks will almost never fail.
Alex Mercer is a backend infrastructure engineer writing about event-driven architecture, APIs, and distributed systems at WebhookAgent.com.