Production Hardening an AI Video Pipeline: Retries, Fallbacks, and Crash Guards

By · 7 min read

The video generation platform I work on orchestrates a long chain of AI calls — a video generation provider for clip rendering, Claude for scene planning, ElevenLabs for voiceover — into finished ad creatives. When every provider was healthy, the pipeline worked. When they were not, it failed quietly: a ten-minute backend outage burned through all retries, an unhandled promise rejection took down the Next.js server with a 502 and no stack trace, and a null optional field from Claude crashed plan validation before a single clip was generated.

These were reliability gaps, not creative logic bugs. I addressed them across five focused pull requests totaling roughly 1,200 lines. This post walks through each one.

1. Transient outage hardening for the video provider

The first incident was blunt. During a live run, the video generation provider's backend went down for roughly ten minutes. All sixteen clips in the job failed. The retry logic gave up long before the outage ended.

The old configuration was simple and wrong for provider-scale outages: three retries with a fixed fifteen-second delay between attempts. That is a forty-five-second horizon, while a transient backend incident routinely lasts five to fifteen minutes. Retrying three times and declaring failure is not resilience; it gives up long before a typical outage ends.

I built an error classifier that separates transient failures from permanent ones. Transient errors — internal server errors, "try again later" messages, rate limits, capacity or overload signals, timeouts, HTTP 5xx, and 429 responses — get retried with exponential backoff. Non-transient errors — bad input, authentication failures, invalid parameters — fail immediately. There is no point burning retry budget on a request that will never succeed.

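// Message fragments that indicate a retryable, provider-side failure.
const TRANSIENT_PATTERNS = [
  /internal (server )?error/i, // matches "internal error" and "internal server error"
  /try again later/i,
  /rate.?limit/i,
  /capacity|overload/i,
  /time(d)? ?out/i, // matches "timeout", "timed out", "time out"
];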

function isTransientProviderError(msg: string, status?: number): boolean {
  if (status === 429 || (status !== undefined && status >= 500)) return true;
  return TRANSIENT_PATTERNS.some((re) => re.test(msg));
}

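const BACKOFF_MS = [15_000, 30_000, 60_000, 120_000];

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function generateClipWithRetry(
  request: ClipRequest,
  // One initial attempt plus one retry per backoff delay, so every delay is used.
  maxAttempts = BACKOFF_MS.length + 1,
): Promise<ClipResult> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await videoProvider.generate(request);
    } catch (err) {
      const msg = err instanceof Error ? err.message : String(err);
      const status = (err as { status?: number } | null)?.status;

      // Permanent errors fail fast; the final attempt rethrows the last error.
      if (!isTransientProviderError(msg, status)) throw err;
      if (attempt === maxAttempts - 1) throw err;

      // Reuse the last delay if a caller passes a larger maxAttempts.
      await sleep(BACKOFF_MS[Math.min(attempt, BACKOFF_MS.length - 1)]);
    }
  }
  throw new Error('unreachable');
}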

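The delays in the backoff schedule add up to 225 seconds, just under four minutes per clip, which is enough runway to ride out a short provider outage. On its own it would still be exhausted by an incident at the longer end of the five-to-fifteen-minute range. Permanent failures still surface immediately, which keeps debugging fast when the problem is on our side.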
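To check how far a given schedule reaches, the arithmetic can be scripted. The sketch below is illustrative only: `buildBackoffSchedule`, `totalWaitMs`, and `stepsToCover` are hypothetical helpers that are not part of the pipeline, and the two-minute cap is an assumed value.

```typescript
// Hypothetical helpers for reasoning about a retry schedule.
// All of them assume baseMs > 0 and factor >= 1.

/** Build an exponential schedule: base, base*factor, ... with each delay capped at capMs. */
function buildBackoffSchedule(
  baseMs: number,
  factor: number,
  steps: number,
  capMs: number = Number.POSITIVE_INFINITY,
): number[] {
  const schedule: number[] = [];
  for (let i = 0; i < steps; i++) {
    schedule.push(Math.min(baseMs * factor ** i, capMs));
  }
  return schedule;
}

/** Total time spent waiting if every delay in the schedule is used. */
function totalWaitMs(schedule: readonly number[]): number {
  return schedule.reduce((sum, ms) => sum + ms, 0);
}

/** Smallest number of delays whose combined wait reaches targetMs. */
function stepsToCover(
  targetMs: number,
  baseMs: number,
  factor: number,
  capMs: number,
): number {
  let steps = 0;
  let waited = 0;
  while (waited < targetMs) {
    waited += Math.min(baseMs * factor ** steps, capMs);
    steps++;
  }
  return steps;
}

// The schedule from the post: 15 s doubling over four steps.
const postSchedule = buildBackoffSchedule(15_000, 2, 4);
console.log(postSchedule, totalWaitMs(postSchedule)); // [15000, 30000, 60000, 120000] 225000

// Delays needed to wait out a ten-minute outage with an assumed two-minute cap.
console.log(stepsToCover(10 * 60_000, 15_000, 2, 120_000)); // 8
```

Under these assumptions, riding out a full ten-minute outage would take eight delays rather than four. Whether that is worth it depends on how long a job can reasonably stay pending.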

2. Claude model fallback and per-call observability

The planning stage calls Claude to produce structured scene blueprints — shot descriptions, timing markers, voiceover cues. That call was wired to a single hardcoded model with no fallback. When the primary model hit a rate limit or returned a 503, the entire job failed. There was also no visibility into cost or latency per call. I could see that planning failed; I could not see whether it failed because of tokens, latency, or a provider-side outage.

I added a two-tier model strategy. The primary model comes from ANTHROPIC_MODEL (defaulting to claude-opus-4-7). When the primary fails with a transient error — HTTP 5xx, 429 — or is unavailable (404), the client automatically retries with ANTHROPIC_FALLBACK_MODEL. Client-side 4xx bad-request errors do not trigger fallback. If the input is malformed, switching models will not fix it.

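// Statuses that justify trying the fallback model: rate limits (429),
// server-side failures (5xx), and a primary model that is unavailable (404).
function isTransientAnthropicError(status: number): boolean {
  return status === 429 || status === 404 || status >= 500;
}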

// Assumed output cap; max_tokens is a required Messages API parameter.
const MAX_TOKENS = 4096;

async function callClaude(messages: Message[]): Promise<AnthropicResponse> {
  const primary = process.env.ANTHROPIC_MODEL ?? 'claude-opus-4-7';
  const fallback = process.env.ANTHROPIC_FALLBACK_MODEL;

  const start = Date.now();
  try {
    const res = await anthropic.messages.create({ model: primary, max_tokens: MAX_TOKENS, messages });
    logUsage({ model: primary, ...res.usage, latencyMs: Date.now() - start });
    return res;
  } catch (err) {
    // Errors without an HTTP status (for example network failures) map to 0 and are rethrown.
    const status = (err as { status?: number } | null)?.status ?? 0;

    if (!fallback || !isTransientAnthropicError(status)) throw err;

    console.warn(`[anthropic] primary ${primary} failed (${status}), falling back to ${fallback}`);
    const res = await anthropic.messages.create({ model: fallback, max_tokens: MAX_TOKENS, messages });
    // latencyMs here covers the failed primary attempt plus the fallback call.
    logUsage({ model: fallback, ...res.usage, latencyMs: Date.now() - start, fallback: true });
    return res;
  }
}

Every call now logs model name, token counts, and latency. Fallback switches are logged explicitly, which made it straightforward to correlate job failures with model outages and estimate per-job inference cost.
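As an illustration of what those per-call logs make possible, here is a minimal sketch that rolls usage records up per job. The `UsageRecord` shape and `summarizeUsage` are assumptions for this example; only the `input_tokens` and `output_tokens` field names follow the `usage` object returned by the Anthropic Messages API, and the model names are placeholders.

```typescript
// Hypothetical record shape for one logged call.
interface UsageRecord {
  model: string;
  input_tokens: number;
  output_tokens: number;
  latencyMs: number;
  fallback?: boolean;
}

interface UsageSummary {
  calls: number;
  fallbackCalls: number;
  inputTokens: number;
  outputTokens: number;
  maxLatencyMs: number;
}

/** Aggregate the usage records collected for a single job. */
function summarizeUsage(records: readonly UsageRecord[]): UsageSummary {
  const summary: UsageSummary = {
    calls: 0,
    fallbackCalls: 0,
    inputTokens: 0,
    outputTokens: 0,
    maxLatencyMs: 0,
  };
  for (const record of records) {
    summary.calls += 1;
    if (record.fallback) summary.fallbackCalls += 1;
    summary.inputTokens += record.input_tokens;
    summary.outputTokens += record.output_tokens;
    summary.maxLatencyMs = Math.max(summary.maxLatencyMs, record.latencyMs);
  }
  return summary;
}

// Placeholder model names and made-up numbers, purely for illustration.
const jobSummary = summarizeUsage([
  { model: 'primary-model', input_tokens: 1200, output_tokens: 800, latencyMs: 4200 },
  { model: 'fallback-model', input_tokens: 1200, output_tokens: 750, latencyMs: 6100, fallback: true },
]);
console.log(jobSummary);
```

Multiplying the token totals by the current per-token prices for each model gives a rough per-job cost estimate; the prices are deliberately left out here because they change.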

3. Process crash guards

The most frustrating production issue was random 502 responses with nothing in the application logs. The Next.js server was dying silently. Tracing it back, a stray unhandled promise rejection in background pipeline work (fire-and-forget job updates, async callbacks without catch handlers) was terminating the Node process. Older Node versions only printed a warning for an unhandled rejection; since Node 15 the default is to raise it as an uncaught exception, which terminates the process unless an unhandledRejection handler is registered.

I added explicit process-level handlers at startup. Unhandled rejections are logged with the full stack trace and current memory usage, but the process keeps running — these are bugs worth fixing, not worth taking down every in-flight job. Uncaught exceptions are logged and followed by a clean exit(1), which lets the hosting platform restart the process. Jobs auto-restore from Google Drive checkpoints, so a controlled restart is preferable to a corrupted in-memory state.

process.on('unhandledRejection', (reason) => {
  console.error('[process] unhandledRejection', {
    reason,
    stack: reason instanceof Error ? reason.stack : undefined,
    memory: process.memoryUsage(),
  });
  // Keep running — log and alert, do not crash
});

process.on('uncaughtException', (err) => {
  console.error('[process] uncaughtException — exiting for clean restart', {
    message: err.message,
    stack: err.stack,
    memory: process.memoryUsage(),
  });
  process.exit(1);
});

I also removed seven dead void updateJob; no-ops in the regenerate route and added a /api/health endpoint exempted from auth middleware so Railway health checks stop getting false 401s.
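The process-level handler is a safety net; the fire-and-forget calls that triggered it can also be made safe at the call site. A minimal sketch, assuming a hypothetical `fireAndForget` helper that is not taken from the actual codebase:

```typescript
type ErrorLogger = (label: string, error: unknown) => void;

const defaultLogger: ErrorLogger = (label, error) => {
  console.error(`[background] ${label} failed`, error);
};

/**
 * Attach a catch handler to a background promise so a rejection is logged
 * instead of surfacing as an unhandled rejection.
 */
function fireAndForget(
  task: Promise<unknown>,
  label: string,
  log: ErrorLogger = defaultLogger,
): void {
  // `void` marks the promise returned by catch() as intentionally not awaited.
  void task.catch((error: unknown) => log(label, error));
}
```

A call site would then read `fireAndForget(updateJob(jobId, patch), 'updateJob')`, where `updateJob`, `jobId`, and `patch` stand in for whatever background work is being started. The label makes it possible to tell from the logs which background task failed.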

4. Planner null-tolerance in Zod schemas

After the crash guards shipped, a different class of failure surfaced: plan validation errors. Jobs crashed with messages like plan validation failed: scenes.3.timeMarker: Expected string, received null. Claude was returning null for fields it interpreted as "no value." Our Zod schemas used z.string().optional(), which accepts string or undefined but rejects null. This is a well-known footgun when parsing LLM JSON output.

I built a reusable helper and applied it to every LLM-supplied optional string in the plan and blueprint schemas:

import { z } from 'zod';

/** Accept string | null | undefined from LLM JSON; normalize null → undefined */
export const llmOptionalString = z
  .union([z.string(), z.null()])
  .optional()
  .transform((val) => val ?? undefined);

const sceneSchema = z.object({
  description: z.string(),
  timeMarker: llmOptionalString,
  voiceoverCue: llmOptionalString,
  cameraNote: llmOptionalString,
});
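A different technique from the schema-level helper above is to normalize the parsed JSON before validation by dropping every null-valued property. The sketch below is dependency-free, and `stripNullProps` is a hypothetical name for illustration, not something from the codebase.

```typescript
/**
 * Recursively drop object properties whose value is null.
 * Null array elements are left in place so array indices do not shift.
 */
function stripNullProps(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(stripNullProps);
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, v] of Object.entries(value as Record<string, unknown>)) {
      if (v !== null) out[key] = stripNullProps(v);
    }
    return out;
  }
  return value;
}

const rawPlan: unknown = JSON.parse(
  '{"scenes":[{"description":"Open on product","timeMarker":null}]}',
);
console.log(JSON.stringify(stripNullProps(rawPlan)));
// {"scenes":[{"description":"Open on product"}]}
```

The trade-off is that this treats every null as "absent", including fields where null might carry meaning. A per-field schema helper is narrower and keeps that decision visible in the schema itself.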

The same PR added a config-driven own-brand allow-list so the planner stops stripping the customer's own product references as "competitor content."

5. Local E2E frame hosting

The last issue only appeared in local end-to-end testing. Steps one through four of the pipeline passed cleanly, but every step-five clip generation failed with image_load_error or fetch timeouts. The video generation provider could not reach input frame URLs served through an ngrok tunnel from my laptop. Production serves frames from a stable CDN; local dev does not.

The fix is env-gated and off in production. When UPLOAD_FRAMES_TO_KIE=1 is set, the pipeline uploads input frames to the video provider's own CDN via their File Upload API before submitting the generation request. The provider fetches from its own infrastructure — no tunnel, no timeout.

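import { basename } from 'node:path';

async function resolveFrameUrl(localPath: string): Promise<string> {
  if (process.env.UPLOAD_FRAMES_TO_KIE !== '1') {
    // Flag off (the default): build the URL from the configured base URL.
    const baseUrl = process.env.NGROK_BASE_URL;
    if (!baseUrl) throw new Error('NGROK_BASE_URL is not set');
    return `${baseUrl}/frames/${basename(localPath)}`;
  }

  // Flag on: upload to the provider's CDN so it fetches from its own infrastructure.
  const uploaded = await videoProvider.uploadFile(localPath);
  return uploaded.cdnUrl;
}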

Zero production behavior change — the flag defaults off. Local developers flip one env var and run the full pipeline end-to-end without fighting tunnel networking.

What I learned

Multi-provider AI pipelines fail in layers, but the fix pattern is consistent: classify the error, retry or fall back only when it makes sense, log enough to diagnose the next incident, and gate dev-only workarounds behind explicit env flags.

The five PRs were deliberately small, 125 to 381 lines each, so each could be reviewed and shipped independently. Reliability work pays off more as targeted fixes for failure modes you actually hit in production than as a monolithic hardening sprint.
