All Articles

Pipeline Hardening: Timeouts, Traceability & Shutdown

· 9 min read · Humza Tareen
Pipeline Cloud Tasks TypeScript Graceful Shutdown Observability

The pipeline engine I maintain orchestrates multi-step AI task processing. Each step in a run is dispatched as a Cloud Tasks message to a pool of workers. A worker picks up the message, acquires a lease on the step so no other worker touches it, runs the agent logic, writes outputs, and releases the lease. Steps vary wildly in expected duration: a lightweight extraction step finishes in ten minutes; a multi-trial synthesis step can run for twenty-five. Over the past two weeks, three production issues surfaced that all traced back to the same root cause: the engine treated every step as if it were identical. This post covers how a dispatch registry, a SIGTERM handler, and dynamic lease TTL fixed that.

Problem 1: uniform timeouts for non-uniform steps

Every step shared a single hardcoded dispatch deadline of 1800 seconds—thirty minutes. Cloud Tasks uses this value to decide how long to wait before considering a dispatched message failed. In theory, thirty minutes is generous enough for any step. In practice, it masked failures. A step that should finish in ten minutes would fail silently at minute three, but Cloud Tasks would not retry it for another twenty-seven minutes because the deadline had not expired yet. The worker process was already gone; the queue slot sat occupied by a dead message.

Worse, these timeouts were compile-time constants. Tuning a single step's deadline required a code change, a review, a merge, and a full deployment. During an incident, that cycle time is unacceptable. You want to pull a lever in the environment and see the effect on the next dispatch, not the next release.

Solution: the dispatch registry

The codebase already used a registry pattern—frozen arrays with a Map-backed index—for other per-step configuration. The dispatch registry follows the same shape: a single frozen array where each entry defines the step's agent timeout, dispatch deadline, and environment variable override keys for both values.

type StepDispatchEntry = {
  stepType: string;
  agentTimeoutMs: number;
  dispatchDeadlineS: number;
  envOverrides: {
    agentTimeout: string;
    dispatchDeadline: string;
  };
};

const DISPATCH_REGISTRY: readonly StepDispatchEntry[] = Object.freeze([
  {
    stepType: "extraction",
    agentTimeoutMs: 600_000,
    dispatchDeadlineS: 900,
    envOverrides: {
      agentTimeout: "PIPELINE_EXTRACTION_TIMEOUT_MS",
      dispatchDeadline: "PIPELINE_EXTRACTION_DEADLINE_S",
    },
  },
  {
    stepType: "synthesis",
    agentTimeoutMs: 1_500_000,
    dispatchDeadlineS: 1_800,
    envOverrides: {
      agentTimeout: "PIPELINE_SYNTHESIS_TIMEOUT_MS",
      dispatchDeadline: "PIPELINE_SYNTHESIS_DEADLINE_S",
    },
  },
]);

const registryIndex = new Map(
  DISPATCH_REGISTRY.map((e) => [e.stepType, e])
);

function getDispatchConfig(stepType: string): StepDispatchEntry {
  const entry = registryIndex.get(stepType);
  if (!entry) throw new Error(`Unknown step type: ${stepType}`);

  return {
    ...entry,
    agentTimeoutMs:
      Number(process.env[entry.envOverrides.agentTimeout]) ||
      entry.agentTimeoutMs,
    dispatchDeadlineS:
      Number(process.env[entry.envOverrides.dispatchDeadline]) ||
      entry.dispatchDeadlineS,
  };
}

The extraction step gets a 600-second (ten-minute) agent timeout and a 900-second (fifteen-minute) dispatch deadline. The synthesis step gets 1500 seconds and 1800 seconds respectively. If an incident demands a quick adjustment, an operator sets PIPELINE_EXTRACTION_DEADLINE_S=1200 in the environment and the next dispatch picks it up—no code change, no deploy.

There was also a deployment problem hiding underneath. The CI script that provisioned Cloud Tasks queues was using a create-or-update call, meaning every deploy overwrote queue settings—including retry configuration. The original setup had no backoff configured, which meant a failed task would retry ten times in under a second before exhausting its attempts. The fix was twofold: switch queue provisioning to create-only so deploys do not stomp existing settings, and set proper retry backoff defaults (minimum backoff of 10 seconds, maximum of 300 seconds, doubling ratio of 2) when creating queues for the first time.

Problem 2: silent dispatch deadline kills

When Cloud Tasks hits the dispatch deadline, it terminates the container running the worker. In the normal request-response model, that is fine—the unacknowledged message goes back to the queue. But the pipeline uses async acknowledgment: the worker acknowledges the message immediately on receipt and manages progress through the lease system. So when the deadline kills the container, Cloud Tasks considers the message delivered. The lease is still held. The step's database row says in_progress. No error is recorded. No trace event is written. From the outside, the step looks like it is still running. You only discover it is dead when someone notices the step has been "in progress" for three hours and starts investigating.

This was the most operationally painful of the three issues. Without a trace event explaining what happened, debugging required correlating Cloud Tasks logs, container lifecycle events, and database state manually. Every incident ate an hour of engineering time before anyone understood the cause.

Solution: SIGTERM handler with error traceability

When Cloud Tasks terminates a container, it sends SIGTERM before the hard kill. The window is short—typically ten seconds—but that is enough to clean up if the handler is lean. The new handler iterates over the in-flight lease registry (a simple Map of step ID to lease metadata maintained by the worker runtime), releases each lease, writes a trace event with error details, and exits.

const inFlightLeases = new Map<string, { stepId: string; leaseId: string }>();

process.on("SIGTERM", async () => {
  console.warn(`SIGTERM received — releasing ${inFlightLeases.size} leases`);

  const cleanup = [...inFlightLeases.values()].map(async ({ stepId, leaseId }) => {
    try {
      await releaseLease(leaseId);
      await writeTraceEvent(stepId, {
        type: "step_terminated",
        reason: "dispatch_deadline_exceeded",
        detail: "Container received SIGTERM; lease released during shutdown.",
      });
    } catch (err) {
      console.error(`Cleanup failed for step ${stepId}:`, err);
    }
  });

  await Promise.allSettled(cleanup);
  process.exit(0);
});

Now, when a dispatch deadline kills a worker, the step's trace log contains a step_terminated event with the reason dispatch_deadline_exceeded. Operators can see it in the dashboard. Automated monitors can alert on it. The lease is released, so another worker can reclaim the step immediately instead of waiting for a static TTL to expire.

The SIGTERM fix also exposed two gaps in how errors were recorded during normal (non-shutdown) execution.

Gap 1: swallowed validation errors

The agent client threw a structured error—call it AgentClientError—that carried a code field (e.g., VALIDATION_FAILED) and a details array of Zod validation issues. But the catch block only logged err.message and wrote that single string to the trace record. The structured details were lost.

// Before: only message propagated
catch (err) {
  const message = err instanceof Error ? err.message : String(err);
  await writeTraceEvent(stepId, {
    type: "step_error",
    reason: message,
  });
}

// After: full error context preserved
catch (err) {
  const tracePayload: StepTraceError = {
    type: "step_error",
    reason: err instanceof Error ? err.message : String(err),
  };

  if (err instanceof AgentClientError) {
    tracePayload.errorCode = err.code;
    tracePayload.validationIssues = err.details;
  }

  await writeTraceEvent(stepId, tracePayload);
}

With the full error context in the trace record, a Zod validation failure now shows exactly which field failed and why, without anyone needing to reproduce the input locally.

Gap 2: opaque quality gate failures

The pipeline includes quality gate checks that can block a step from advancing. When a check failed, the blocking log only captured the checkId—a string like format_compliance—with no information about what specifically failed or how to fix it. The new GatePolicyFailure structure includes a remediation field with human-readable guidance, and both the blocking log and the trace event carry the full check detail: which check, what it found, and what the operator should do.

Problem 3: static lease TTL

Leases had a fixed one-hour TTL regardless of step type. The intent was conservative: give every step plenty of time so leases do not expire prematurely during legitimate long runs. The side effect was that a crashed worker held its lease for the full hour. During that hour, no other worker could pick up the step. For a ten-minute extraction step, that is fifty minutes of dead time before recovery can begin.

Combined with the silent deadline kill from Problem 2, this created a worst case: a step fails at the dispatch deadline, no error is recorded, and the lease blocks retries for an hour. The effective recovery time was the dispatch deadline plus the lease TTL—potentially ninety minutes for a step that should have finished in ten.

Solution: dynamic lease TTL from the dispatch registry

Since the dispatch registry already knows how long each step should take, lease TTL can be derived from it. The formula is straightforward: take the dispatch deadline in seconds, convert to milliseconds, and add a five-minute buffer. The buffer accounts for the SIGTERM cleanup window and minor clock skew.

const LEASE_BUFFER_MS = 5 * 60 * 1000;

function computeLeaseTtlMs(stepType: string): number {
  const config = getDispatchConfig(stepType);
  return config.dispatchDeadlineS * 1000 + LEASE_BUFFER_MS;
}

// Usage at lease acquisition
const ttlMs = computeLeaseTtlMs(step.type);
const lease = await acquireLease(step.id, { ttlMs });

An extraction step with a 900-second dispatch deadline gets a lease TTL of 900 * 1000 + 300000 = 1,200,000ms—twenty minutes. If the worker crashes after the dispatch deadline fires, the SIGTERM handler releases the lease immediately. If the handler itself fails (hard kill, network partition), the lease expires five minutes after the deadline and another worker can reclaim the step. Worst-case recovery dropped from sixty minutes to five.

Tests for this are straightforward: mock the dispatch registry to return known deadline values, call computeLeaseTtlMs, and assert the result equals deadline * 1000 + LEASE_BUFFER_MS. A second test verifies that environment variable overrides flow through correctly—set the env var, call the function, confirm the TTL reflects the overridden deadline rather than the default.

Long-horizon handoff fixes

While working on the dispatch registry, three additional P1 issues surfaced in the long-running multi-trial steps—steps that can run multiple attempts before settling on a result.

Duplicate handoff replays. When a step completes and hands off to the next step, it writes a manifest entry marking the transition. If the handoff message is delivered twice (at-least-once delivery is the norm with Cloud Tasks), the second delivery would reset the manifest state, effectively erasing progress. The fix was to move the state update after the idempotency guard: check if the manifest entry already exists, and if it does, skip the entire handler. The state mutation only runs on the first delivery.

Premature author-slot opening. A downstream step checks whether the previous step produced output and, if so, opens an author slot for the next trial. The check was too loose—it triggered on any output, including partial or errored output. Now the gate requires fully successful output, verified by checking the output status field rather than just the existence of the output blob.

Crash recovery for config rewrites. One step rewrites a configuration file (task.toml) into the destination blob store before processing. If the worker crashed after the rewrite but before completing the step, the recovery path would skip the rewrite because the destination blob already existed. But the blob might contain stale config from a previous attempt. The fix is to always reapply the rewrite on step entry, regardless of whether the destination exists, so crash recovery starts from a known-good configuration state.

What changed operationally

Before these fixes, a dispatch deadline kill was invisible. The step sat in limbo, the lease blocked retries, and someone had to manually investigate. Now, the SIGTERM handler releases the lease and writes a trace event. The dynamic TTL means even if the handler fails, recovery starts in minutes, not an hour. The dispatch registry means timeout tuning is an environment variable change, not a deployment. And the error traceability improvements mean that when something does fail, the trace log tells you what failed and why, not just that it failed.

Pipeline reliability is not one feature. It is the accumulation of dozens of small correctness fixes—each one closing a gap between what the system assumes and what production actually does. A dispatch registry, a SIGTERM handler, and a dynamic lease TTL are not glamorous. They are the difference between a pipeline that silently loses work and one that recovers on its own.