Dead Letter Queues for Pipeline Reliability
The pipeline engine I work on processes work through a chain of steps, each step dispatched as a Cloud Task. When a step fails, the task system retries with backoff. That model works well for transient failures—network blips, rate limits, a worker that crashed mid-request. It breaks down for permanent failures: invalid input that will never parse, a dependency that was deleted, a configuration mistake that no amount of waiting will fix.
Without explicit dead-letter handling, those runs do not disappear. They linger. A task exhausts all retries and the manifest still reads IN_PROGRESS because no worker is attached anymore. Another run sits in ERROR past any reasonable recovery window while operators wonder whether to retry or abandon. Dashboards fill with ambiguous rows. Every investigation starts with the same question: is this still recoverable, or is it dead?
This post is about implementing dead letter queue escalation for that engine—two detection paths, a reason-code registry, a service that writes terminal state to the GCS manifest, and UI surfacing so permanently failed runs are flagged instead of silently stalling.
The problem: silent stalls vs. explicit terminal state
A multi-step pipeline is only as trustworthy as its worst failure mode. Retries are correct for uncertainty; they are wrong for certainty. When the worker knows retries are exhausted, or when a cron sweep finds a run that has been stuck longer than the dispatch deadline with no active worker, the system should stop pretending the run might complete on its own.
Dead-lettering is not “delete and forget.” It is a permanent, explicit signal: this run will not progress without human intervention. The record stays in storage for visibility and audit. Operators can filter for it, see why it was escalated, and recreate work if the underlying issue is fixed. The alternative—a pipeline that hides failures behind ambiguous statuses—is a pipeline that trains people to distrust the dashboard.
Two detection paths
Reliability here comes from redundancy. No single code path should be the only way a permanently failed run gets flagged.
Inline detection
When a worker handles a step failure, it already knows the retry count and the configured maximum. If the next attempt would exceed the limit, the worker dead-letters the run before returning success or scheduling another retry. The run never enters another ambiguous cycle of “maybe something will pick it up.”
Sweep detection
Inline detection cannot catch every edge case. A worker can crash after writing IN_PROGRESS but before finishing. Cloud Tasks can stop redispatching while the manifest still looks active. A separate cron job—call it recover-stale-workers—periodically lists active manifests and applies two thresholds:
- Stuck in progress: status is
IN_PROGRESS, last heartbeat or step start is older than the stuck timeout (aligned with the dispatch deadline plus margin). - Unrecovered error: status is
ERRORand the error timestamp is older than a configurable recovery window—long enough for legitimate auto-recovery, short enough that abandoned errors do not linger forever.
Either path calls the same dead-letter service with a reason code from the registry. Symmetry matters: inline and sweep produce the same manifest fields and the same analytics event, so operations and the UI see one concept, not two.
Dead letter registry
The codebase already uses a pattern for extensible enums: a frozen array of entries, each with a stable code and metadata, plus a Map-backed index for lookup. Dead-letter reasons follow that pattern. Adding a new reason is one registry entry—not a scatter of string literals across workers and cron jobs.
Thresholds live beside the reasons and are overridable via environment variables so staging can be aggressive and production conservative without forking code.
const DEAD_LETTER_REASONS = [
{ code: "RETRIES_EXHAUSTED", label: "Retries exhausted", severity: "terminal" },
{ code: "STUCK_IN_PROGRESS", label: "Stuck in progress", severity: "terminal" },
{ code: "UNRECOVERED_ERROR", label: "Error past recovery window", severity: "terminal" },
] as const;
type DeadLetterReasonCode = (typeof DEAD_LETTER_REASONS)[number]["code"];
const reasonByCode = new Map(
DEAD_LETTER_REASONS.map((r) => [r.code, r])
);
export const DLQ_THRESHOLDS = {
maxRetries: Number(process.env.PIPELINE_MAX_RETRIES ?? "5"),
stuckInProgressMs: Number(process.env.PIPELINE_STUCK_TIMEOUT_MS ?? "900000"),
unrecoveredErrorMs: Number(process.env.PIPELINE_RECOVERY_WINDOW_MS ?? "3600000"),
};
export function isKnownDeadLetterReason(code: string): code is DeadLetterReasonCode {
return reasonByCode.has(code as DeadLetterReasonCode);
}
Dead letter service
A small service class owns the side effects. Two public methods cover the two detection paths.
recordDeadLetter(taskId, reason, metadata) loads the run manifest from object storage, short-circuits if deadLettered is already true (idempotent), then writes deadLettered: true, deadLetteredAt as an ISO timestamp, and deadLetterReason from the registry. It emits a STEP_DEAD_LETTERED analytics event with the reason and optional metadata (last step, retry count, sweep job id). The operation is deliberately non-reversible: once dead-lettered, the run can only be recreated, not resumed. That constraint prevents “helpful” retries from undoing a terminal decision.
export class DeadLetterService {
constructor(
private manifestStore: ManifestStore,
private analytics: AnalyticsClient
) {}
async recordDeadLetter(
taskId: string,
reason: DeadLetterReasonCode,
metadata?: Record<string, unknown>
): Promise<RunManifest> {
const manifest = await this.manifestStore.read(taskId);
if (manifest.deadLettered) return manifest;
const updated: RunManifest = {
...manifest,
deadLettered: true,
deadLetteredAt: new Date().toISOString(),
deadLetterReason: reason,
};
await this.manifestStore.write(taskId, updated);
await this.analytics.track("STEP_DEAD_LETTERED", {
taskId,
reason,
...metadata,
});
return updated;
}
}
sweepStaleRuns() is invoked by the cron job. It lists manifests for active runs, evaluates each against DLQ_THRESHOLDS, and calls recordDeadLetter for any that qualify. Logging includes counts by reason so on-call can see whether spikes are retry exhaustion or stuck workers.
async sweepStaleRuns(): Promise<SweepResult> {
const now = Date.now();
const manifests = await this.manifestStore.listActive();
const results: SweepResult = { deadLettered: 0, byReason: {} };
for (const manifest of manifests) {
const reason = this.classifyStaleRun(manifest, now);
if (!reason) continue;
await this.recordDeadLetter(manifest.taskId, reason, {
sweep: true,
lastStep: manifest.currentStep,
});
results.deadLettered += 1;
results.byReason[reason] = (results.byReason[reason] ?? 0) + 1;
}
return results;
}
private classifyStaleRun(
manifest: RunManifest,
now: number
): DeadLetterReasonCode | null {
if (manifest.deadLettered) return null;
if (manifest.status === "IN_PROGRESS") {
const startedAt = Date.parse(manifest.stepStartedAt ?? manifest.updatedAt);
if (now - startedAt > DLQ_THRESHOLDS.stuckInProgressMs) {
return "STUCK_IN_PROGRESS";
}
}
if (manifest.status === "ERROR" && manifest.errorAt) {
const errorAt = Date.parse(manifest.errorAt);
if (now - errorAt > DLQ_THRESHOLDS.unrecoveredErrorMs) {
return "UNRECOVERED_ERROR";
}
}
return null;
}
Manifest schema extension
The run manifest in GCS is the source of truth the workers and the API agree on. Three fields extend that schema and flow through domain types into the task summary the UI consumes.
type RunManifest = {
taskId: string;
status: PipelineStatus;
currentStep: string;
stepStartedAt?: string;
errorAt?: string;
updatedAt: string;
// Dead letter fields — terminal, written once
deadLettered?: boolean;
deadLetteredAt?: string; // ISO-8601
deadLetterReason?: DeadLetterReasonCode;
};
API handlers map these fields onto list and detail responses without special casing in every endpoint: if deadLettered is true, the summary carries the reason for overlays and filters. Listing supports ?deadLettered=true so support can pull the terminal queue without exporting raw storage paths.
UI surfacing
Backend terminal state only helps if operators see it at a glance. Dead-lettered runs get a dedicated overlay in the status registry—distinct icon, color, and tooltip from error, paused, or cancelled overlays. In the task list and on the detail page, a red skull badge sits beside the status chip so terminal runs are visually unmistakable.
const STATUS_OVERLAYS = {
deadLettered: {
label: "Dead lettered",
icon: "skull",
color: "var(--danger)",
tooltip: (summary: TaskSummary) =>
summary.deadLetterReason
? `Permanently failed: ${summary.deadLetterReason}`
: "Permanently failed — requires human review",
},
// error, paused, cancelled ...
};
function TaskStatusBadge({ summary }: { summary: TaskSummary }) {
const base = <StatusChip status={summary.status} />;
if (!summary.deadLettered) return base;
const overlay = STATUS_OVERLAYS.deadLettered;
return (
<span className="status-with-overlay">
{base}
<span
className="badge badge-dead-letter"
title={overlay.tooltip(summary)}
aria-label={overlay.label}
>
<Icon name={overlay.icon} color={overlay.color} />
</span>
</span>
);
}
The skull is intentional affordance: this is not “failed again, try retry.” It is “stop waiting for automation.” Filters in the listing UI wire through to the API parameter so triage does not depend on memorizing reason codes.
Design philosophy: escalate, do not hide
Dead-lettering does not try to fix the problem at the system level. It acknowledges that the problem is unfixable without a human—wrong data, missing dependency, policy block—and escalates. The run remains in the system for visibility; it is not deleted to make metrics look better. That is the difference between a pipeline that hides failures and one that surfaces them.
Inline detection catches the cases workers already understand. Sweep detection catches the cases only time reveals. A single registry and service keep behavior consistent. Manifest fields and UI badges make terminal state legible. Together, they turn “is this still running?” from a manual investigation into a answered question on the dashboard.
If I were shipping this again from scratch, I would add the manifest fields and the registry before the first production incident—schema migration is cheap early and expensive when thousands of ambiguous rows already exist. I would also alert on STEP_DEAD_LETTERED rate by reason so spikes in STUCK_IN_PROGRESS surface worker or dispatch issues before operators open the task list. Reliability is not only recovering from failure; it is making failure impossible to ignore.