Building an AI Video Generation Pipeline from Scratch

By Humza Tareen · June 28, 2026 · 9 min read

Video Generation TypeScript FFmpeg Claude AI Pipeline Next.js

I joined a new project recently — an AI-powered video ad generation platform. The product promise is deceptively simple: give it a viral reference video and a product or brand, and it produces a new ad that feels like the same viral, just for your brand. Same pacing, same emotional arc, same camera language — but with your protagonist, your product, your voiceover.

That simplicity hides a brutal engineering problem. A viral short-form video is not a template — it is scene cuts, emotional beats, caption rhythms, and narrative causality that humans parse subconsciously. Replicating that feel with generative models requires a multi-stage pipeline where each stage has different failure modes, latency profiles, and vendor APIs.

I had three days. I shipped 20 pull requests across 100+ files and roughly 7,000 lines of TypeScript. This post walks through the five-stage pipeline, the domain-driven module layout, and the checkpointing and QA layers that make the whole thing recoverable when a Kling render fails at scene seven of twelve.

What the pipeline does

The input is a reference video URL plus brand assets: product images, an avatar, a voice profile, and optional brand guidelines. The output is a finished MP4 ready for ad placement — captioned, color-graded, with mixed audio and smooth transitions between scenes.

The generation flow has five stages, each producing structured artifacts that the next stage consumes:

Analyzer — decomposes the reference video into scenes and narrative metadata
Planner — generates a scene-by-scene script and generation plan via Claude
Clip Generation — renders each scene independently through Kling
Stitch — assembles clips with FFmpeg: captions, audio mix, color grade, transitions
QA / Resolve — cross-scene narrative validation and automated fixes

The hard part is not generating a single good clip. It is maintaining narrative continuity, emotional pacing, and visual consistency across twelve independently generated scenes — then stitching them into something that feels like one continuous piece of content.

Each stage is idempotent and checkpointed. If clip generation fails halfway through, the wizard restores from the last good checkpoint on Google Drive and resumes — it does not restart from the analyzer.

Stage 1: The Analyzer

The analyzer's job is to reverse-engineer why a reference video works. It runs scene detection to split the video into discrete segments, then enriches each segment with structured metadata:

Story structure — continuous narrative vs. montage (this changes how the planner writes scene transitions)
Protagonist arc — who is on screen, how they change scene to scene
Time progression — morning-to-evening, before-and-after, flashback patterns
Emotion per scene — curiosity, frustration, relief, triumph — mapped to timestamps
Caption style — font placement, word timing, emphasis patterns extracted from the reference
Camera moves — static, push-in, pan, handheld — per scene

Scene detection uses FFmpeg frame analysis combined with visual change scoring. The analyzer also extracts keyframes from each scene — these become input references for clip generation later, anchoring the visual composition of each generated scene to the reference.
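One common way to get cut points out of FFmpeg is the `select` filter's `scene` score combined with `showinfo`, which logs a `pts_time` for every selected frame. The sketch below assumes that approach; the post does not say which filter the analyzer uses. The helper names and the 0.3 threshold are illustrative.

```typescript
// Hypothetical helpers; the analyzer's real implementation is not shown in the post.
// Assumed command (showinfo writes to stderr, which is what gets parsed here):
//   ffmpeg -i ref.mp4 -filter:v "select='gt(scene,0.3)',showinfo" -f null -

interface Segment {
  index: number;
  startMs: number;
  endMs: number;
}

// Collect the pts_time of every frame that showinfo logged, in milliseconds.
function parseSceneCuts(ffmpegStderr: string): number[] {
  const cuts: number[] = [];
  const pattern = /pts_time:\s*([0-9]+(?:\.[0-9]+)?)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(ffmpegStderr)) !== null) {
    cuts.push(Math.round(parseFloat(match[1]) * 1000));
  }
  return cuts;
}

// Turn cut timestamps into contiguous segments covering the whole video.
function cutsToSegments(cutsMs: number[], totalMs: number): Segment[] {
  const inner = cutsMs.filter((ms) => ms > 0 && ms < totalMs);
  const unique = Array.from(new Set(inner)).sort((a, b) => a - b);
  const bounds = [0, ...unique, totalMs];
  return bounds.slice(0, -1).map((startMs, index) => ({
    index,
    startMs,
    endMs: bounds[index + 1],
  }));
}
```

Each resulting segment would then be enriched with the emotion, camera, and caption metadata described above.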

interface SceneAnalysis {
  index: number;
  startMs: number;
  endMs: number;
  keyframeUrl: string;
  emotion: EmotionTag;
  cameraMove: CameraMove;
  captionStyle: CaptionStyle;
  spokenDurationMs: number;
}

interface VideoAnalysis {
  storyStructure: 'continuous' | 'montage';
  protagonistArc: ProtagonistArc;
  timeProgression: TimeProgression;
  scenes: SceneAnalysis[];
  totalSpokenMs: number;
  referenceCaptionStyle: CaptionStyle;
}

The totalSpokenMs field is critical downstream. The planner budgets voiceover script length to match the reference's total spoken time — a 47-second viral stays a 47-second ad, not a 90-second explainer that kills retention.
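To make the budgeting concrete, here is a rough sketch that converts a spoken-duration budget into a word count and back. The 150 words-per-minute rate is an assumption for illustration, not a number from the pipeline; a real voice profile would supply its own rate.

```typescript
// Illustrative only: the speaking rate is an assumed default.
const ASSUMED_WORDS_PER_MINUTE = 150;

function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Estimated time to speak a line at the given rate.
function estimateSpokenMs(text: string, wpm: number = ASSUMED_WORDS_PER_MINUTE): number {
  return Math.round((countWords(text) * 60_000) / wpm);
}

// Largest word count that still fits inside a scene's spoken duration.
function wordBudget(spokenDurationMs: number, wpm: number = ASSUMED_WORDS_PER_MINUTE): number {
  return Math.floor((spokenDurationMs * wpm) / 60_000);
}
```

At the assumed rate, a scene with 4,000 ms of speech has room for 10 words, which is the kind of hard limit a planner prompt can state explicitly.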

Stage 2: The Planner

The planner is where Claude does the creative heavy lifting. It receives the full VideoAnalysis, the brand's product catalog entry, avatar profile, and voice characteristics. It outputs a scene-by-scene plan where every scene must satisfy two constraints simultaneously:

Story spine — each scene advances the protagonist and sets up the next scene (no orphaned beats)
Product placement — the product appears naturally within the narrative, not as a bolted-on end card

Each planned scene includes a VO script line (budgeted to that scene's spokenDurationMs), an emotion tag, time markers, a generation prompt for Kling, and explicit references to which product images and avatar frames to use as conditioning inputs.

interface PlannedScene {
  sceneIndex: number;
  voScript: string;
  voDurationMs: number;
  emotion: EmotionTag;
  storyBeat: string;          // what this scene accomplishes narratively
  setupForNext: string;       // causality link to scene N+1
  klingPrompt: string;
  inputFrames: {
    referenceKeyframe: string;
    avatarFrame?: string;
    productImages: string[];
  };
  productPlacement: ProductPlacementSpec;
}

interface GenerationPlan {
  scenes: PlannedScene[];
  totalVoMs: number;
  storyStructure: 'continuous' | 'montage';
  warnings: PlanWarning[];
}

Plan QA runs immediately after planning. It checks for protagonist drift (scene five introduces a character who was never set up), missing causality (scene three's payoff has no setup), and time jumps that break the reference's pacing rhythm. Warnings surface in the UI before generation starts — the operator can resolve them manually or hit "Resolve with AI."
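Some of these checks are mechanical and need no model call. The sketch below covers two of them under simplified, hypothetical types: a non-final scene with an empty `setupForNext`, and a VO line longer than the reference scene's spoken budget. Checks such as protagonist drift would presumably need an LLM pass and are not shown.

```typescript
// Minimal local shapes; the real PlannedScene carries more fields.
interface ScenePlanLite {
  sceneIndex: number;
  voDurationMs: number;
  setupForNext: string;
}

type PlanWarningLite =
  | { kind: 'missing-causality'; sceneIndex: number }
  | { kind: 'vo-over-budget'; sceneIndex: number; overByMs: number };

// budgetsMs[i] is assumed to be the reference scene's spokenDurationMs for scene i.
function checkPlan(scenes: ScenePlanLite[], budgetsMs: number[]): PlanWarningLite[] {
  const warnings: PlanWarningLite[] = [];
  scenes.forEach((scene, i) => {
    const isLast = i === scenes.length - 1;
    // Every scene except the last should hand something to the next one.
    if (!isLast && scene.setupForNext.trim() === '') {
      warnings.push({ kind: 'missing-causality', sceneIndex: scene.sceneIndex });
    }
    const budget = budgetsMs[i];
    if (budget !== undefined && scene.voDurationMs > budget) {
      warnings.push({
        kind: 'vo-over-budget',
        sceneIndex: scene.sceneIndex,
        overByMs: scene.voDurationMs - budget,
      });
    }
  });
  return warnings;
}
```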

Stage 3: Clip Generation

Each planned scene renders independently through Kling, accessed via the kie provider API. This is the slowest and most expensive stage — a twelve-scene ad means twelve sequential (or partially parallelized) video generation calls, each taking 30–90 seconds.
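Partial parallelization can be as simple as capping how many render calls are in flight at once. The helper below is a generic sketch, not the pipeline's actual scheduler; a sensible limit would depend on the provider's rate limits, which the post does not state.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight.
// Results keep the input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be an integer >= 1');
  }
  const results = new Array<R>(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until none remain.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++; // no await between read and increment, so no double claims
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

With a limit of 1 this degrades to the fully sequential case. If one worker rejects, the returned promise rejects, though lanes already running will finish their current item.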

Every clip generation request includes:

Input frames from the reference keyframe (composition anchor)
Avatar consistency frames (same face across scenes)
Product images locked via the product-fidelity module
The scene-specific Kling prompt from the planner

Avatar consistency and product fidelity are non-negotiable for ad use. A generated scene where the product label is wrong or the protagonist's face morphs between cuts is unusable. The catalog module runs asset preflight checks before generation starts — verifying image resolution, aspect ratio, and that product-lock metadata matches the catalog entry.
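A preflight of this kind can be a pure function over image metadata. The sketch below checks only resolution and aspect ratio; the shapes and thresholds are hypothetical, and the product-lock comparison is left out because the post does not describe its format.

```typescript
// Hypothetical shapes and thresholds; the real preflight rules are not listed in the post.
interface ImageMeta {
  url: string;
  width: number;
  height: number;
}

interface PreflightRules {
  minWidth: number;
  minHeight: number;
  aspectRatio: number;     // width / height, e.g. 9 / 16 for vertical video
  aspectTolerance: number; // allowed absolute deviation from aspectRatio
}

interface PreflightReport {
  ok: boolean;
  failures: string[];
}

function preflightImages(images: ImageMeta[], rules: PreflightRules): PreflightReport {
  const failures: string[] = [];
  for (const img of images) {
    if (img.width < rules.minWidth || img.height < rules.minHeight) {
      failures.push(
        `${img.url}: ${img.width}x${img.height} is below ${rules.minWidth}x${rules.minHeight}`,
      );
    }
    const ratio = img.width / img.height;
    if (Math.abs(ratio - rules.aspectRatio) > rules.aspectTolerance) {
      failures.push(`${img.url}: aspect ratio ${ratio.toFixed(3)} is outside tolerance`);
    }
  }
  return { ok: failures.length === 0, failures };
}
```

Collecting every failure, rather than stopping at the first, lets an operator fix all bad assets in one pass before any credits are spent.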

async function generateSceneClip(
  scene: PlannedScene,
  job: GenerationJob,
): Promise<SceneClipResult> {
  const preflight = await assetPreflight(scene.inputFrames, job.catalog);
  if (!preflight.ok) {
    throw new PreflightError(preflight.failures);
  }

  const taskId = await kieKling.submit({
    prompt: scene.klingPrompt,
    imageUrl: scene.inputFrames.referenceKeyframe,
    avatarUrl: scene.inputFrames.avatarFrame,
    productUrls: scene.inputFrames.productImages,
    durationSec: Math.ceil(scene.voDurationMs / 1000),
  });

  const clip = await kieKling.pollUntilComplete(taskId, {
    timeoutMs: 180_000,
    onProgress: (pct) => job.checkpoint.updateSceneProgress(scene.sceneIndex, pct),
  });

  await job.checkpoint.saveSceneClip(scene.sceneIndex, clip);
  return { sceneIndex: scene.sceneIndex, clipPath: clip.localPath, durationMs: clip.durationMs };
}

Checkpoints write each completed clip to Google Drive immediately. If the job crashes at scene eight, scenes one through seven are already persisted. The wizard-runner restores state and resumes from scene eight on the next run.
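The resume behavior boils down to "skip anything already persisted, and persist each new clip before moving on". A minimal sketch, assuming a hypothetical `ClipStore` interface in place of the Google Drive checkpoint module:

```typescript
// Hypothetical storage interface; the real checkpoint module writes to Google Drive.
interface ClipStore {
  has(sceneIndex: number): Promise<boolean>;
  save(sceneIndex: number, clipPath: string): Promise<void>;
}

// Render only the scenes that have no persisted clip yet.
// Returns the indexes rendered during this run.
async function renderMissingScenes(
  sceneIndexes: readonly number[],
  store: ClipStore,
  render: (sceneIndex: number) => Promise<string>,
): Promise<number[]> {
  const renderedNow: number[] = [];
  for (const sceneIndex of sceneIndexes) {
    if (await store.has(sceneIndex)) continue; // persisted by an earlier run
    const clipPath = await render(sceneIndex);
    await store.save(sceneIndex, clipPath); // persist before starting the next scene
    renderedNow.push(sceneIndex);
  }
  return renderedNow;
}
```

Because the function checks the store rather than trusting in-memory state, running it twice on the same job renders nothing the second time, which is what makes the stage safe to retry.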

Stage 4: Stitch

Individual Kling clips are raw generative output — no captions, no mixed audio, no color consistency across scenes. The stitch module is FFmpeg-based and handles the full post-production stack:

Concat — join scene clips in plan order
Transitions — xfade for soft joins between scenes (cross-dissolve duration tuned per story structure)
Captions — overlay rendered from the planner's VO script and reference caption style
Audio mix — ElevenLabs TTS for VO, mixed with a background music bed using sidechain ducking (VO ducking the bed, not the other way around)
Color grade — LUT application for cross-scene tonal consistency

Sidechain ducking lowers the music bed while VO is present and releases it when the VO pauses, so the result sounds professionally mixed rather than like a slideshow with a track pasted underneath.
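In FFmpeg terms, this kind of ducking can be expressed with the `sidechaincompress` filter, where the bed is the main input and a copy of the VO is the sidechain. The sketch below builds such a filter graph. It assumes the threshold is given in dB and converts it, since `sidechaincompress` takes a linear threshold. The input ordering and label names are illustrative, and the audio-bed module may do this differently.

```typescript
interface DuckSettings {
  thresholdDb: number; // assumed to be dBFS
  ratio: number;
  attackMs: number;
  releaseMs: number;
}

// sidechaincompress expects a linear amplitude threshold, not dB.
function dbToLinear(db: number): number {
  return Math.pow(10, db / 20);
}

// Assumes ffmpeg input 0 is the music bed and input 1 is the voiceover.
function buildDuckingFilter(settings: DuckSettings): string {
  const threshold = dbToLinear(settings.thresholdDb).toFixed(4);
  return [
    // The VO is needed twice: once as the sidechain key, once in the final mix.
    '[1:a]asplit=2[vo][key]',
    `[0:a][key]sidechaincompress=threshold=${threshold}:ratio=${settings.ratio}` +
      `:attack=${settings.attackMs}:release=${settings.releaseMs}[ducked]`,
    '[ducked][vo]amix=inputs=2:duration=longest[mix]',
  ].join(';');
}
```

The resulting string would be passed to `-filter_complex`, with `[mix]` mapped as the output audio. Note that `amix` rescales its inputs by default, so final levels may need a separate gain stage.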

interface StitchConfig {
  clips: string[];
  voTrackPath: string;
  musicBedPath: string;
  captions: CaptionOverlay[];
  transitions: TransitionSpec[];   // xfade duration per join
  colorGradeLut: string;
  outputPath: string;
}

async function stitchFinalVideo(config: StitchConfig): Promise<string> {
  // Video path: join clips, grade, then burn in captions.
  const concatenated = await ffmpegConcat(config.clips, config.transitions);
  // Grade before captions so the LUT does not shift caption colors.
  const graded = await ffmpegApplyLut(concatenated, config.colorGradeLut);
  const withCaptions = await overlayRender(graded, config.captions);

  // Audio path: render the VO, then duck the music bed under it.
  const voTrack = await voiceEmotion.render(config.voTrackPath);
  const mixed = await audioBed.mix({
    voTrack,
    bedTrack: config.musicBedPath,
    sidechainDuck: { threshold: -18, ratio: 4, attackMs: 5, releaseMs: 200 },
  });

  return ffmpegMuxVideoAudio(withCaptions, mixed, config.outputPath);
}
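One detail worth spelling out for the transition step: FFmpeg's `xfade` filter needs an explicit `offset` for each join, measured on the output timeline, and every transition shortens that timeline by its own duration. The sketch below computes those offsets and chains the filters. The function names are hypothetical and the `fade` transition is just a default choice.

```typescript
// Offset for join k = (length of the output so far) - (duration of transition k).
function xfadeOffsets(clipDurationsSec: number[], transitionsSec: number[]): number[] {
  if (clipDurationsSec.length === 0 || transitionsSec.length !== clipDurationsSec.length - 1) {
    throw new RangeError('need exactly one transition per join');
  }
  const offsets: number[] = [];
  let timelineSec = clipDurationsSec[0];
  for (let k = 0; k < transitionsSec.length; k++) {
    const offset = timelineSec - transitionsSec[k];
    if (offset < 0) throw new RangeError(`transition ${k} is longer than the footage before it`);
    offsets.push(offset);
    // After the join, the output ends where the incoming clip ends.
    timelineSec = offset + clipDurationsSec[k + 1];
  }
  return offsets;
}

// Chain xfade filters: [0:v][1:v] -> [v1], [v1][2:v] -> [v2], and so on.
function buildXfadeGraph(clipDurationsSec: number[], transitionsSec: number[]): string {
  const offsets = xfadeOffsets(clipDurationsSec, transitionsSec);
  let previous = '[0:v]';
  const parts = offsets.map((offset, k) => {
    const out = `[v${k + 1}]`;
    const part =
      `${previous}[${k + 1}:v]xfade=transition=fade` +
      `:duration=${transitionsSec[k].toFixed(3)}:offset=${offset.toFixed(3)}${out}`;
    previous = out;
    return part;
  });
  return parts.join(';');
}
```

For three 5-second clips joined with 0.5-second dissolves, the offsets are 4.5 and 9, and the finished video runs 14 seconds rather than 15. That shrinkage matters when captions and VO are timed against the plan. `xfade` also expects its inputs to share frame size and frame rate, so clips may need normalizing first.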

Stage 5: QA and Resolve

Per-scene generation QA catches visual defects — wrong product, face drift, truncated clips. Cross-scene narrative QA catches problems that only emerge after assembly:

Protagonist drift — the character's appearance or role shifts mid-video
Missing causality — a scene references an event that was never shown
Time jumps — the narrative timeline breaks (morning scene followed by "later that morning" with no visual transition)
Pacing mismatch — VO lines exceed their scene's spoken duration budget

Pacing auto-fit handles over-budget VO lines. When a script line runs longer than its scene's spokenDurationMs, the resolver rewrites it across up to three passes — tightening language while preserving the story beat. If three passes still exceed budget, it flags the scene for manual review rather than silently truncating.
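The retry-then-flag behavior can be sketched as a bounded loop. The `rewrite` and `estimateMs` callbacks here are hypothetical stand-ins for the Claude rewrite call and the duration estimator, neither of which the post shows.

```typescript
type FitResult =
  | { status: 'fits'; line: string; passes: number }
  | { status: 'needs-review'; line: string; passes: number };

async function autoFitLine(
  line: string,
  budgetMs: number,
  estimateMs: (text: string) => number,
  rewrite: (text: string, budgetMs: number) => Promise<string>,
  maxPasses: number = 3,
): Promise<FitResult> {
  let current = line;
  for (let pass = 0; pass <= maxPasses; pass++) {
    if (estimateMs(current) <= budgetMs) {
      return { status: 'fits', line: current, passes: pass };
    }
    if (pass === maxPasses) break; // out of rewrites, fall through to manual review
    current = await rewrite(current, budgetMs);
  }
  // Never truncate silently: hand the best attempt to a human instead.
  return { status: 'needs-review', line: current, passes: maxPasses };
}
```

Returning the last attempt along with the `needs-review` status gives the operator a starting point instead of the original over-budget line.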

"Resolve with AI" is the operator escape hatch. Each warning type maps to a targeted fix prompt — regenerate a single scene with adjusted conditioning, rewrite a VO line, or reorder scene transitions. The warning resolver module dispatches the fix, re-runs the affected pipeline stages, and updates the checkpoint.

Domain-driven module layout

When I joined, the codebase had 72 flat lib/*.ts files with no clear ownership boundaries. Finding where clip generation interacted with checkpointing meant grep archaeology. I reorganized everything into five domain folders:

Domain	Responsibility	Key modules
`core/`	Job lifecycle, persistence, validation	jobs, checkpoint, drive, wizard-restore, logger, process-guards, validators
`swipe/`	Pipeline orchestration and planning	wizard-runner, swipe-analyzer, swipe-planner, swipe-vo, plan QA, warning resolver
`vendors/`	External API adapters	anthropic, openai-image, kie/Kling, fal, elevenlabs
`media/`	FFmpeg and audio processing	stitch, audio-bed, music-bed, overlay-render, voice-emotion, scene-detector
`catalog/`	Brand asset management	products, avatars, voices, product-lock, product-fidelity, asset-preflight

The wizard-runner orchestrates analyzer → planner → clip gen → stitch → QA, saving checkpoints after each stage boundary and consulting wizard-restore on retry.

type WizardStage = 'analyze' | 'plan' | 'generate' | 'stitch' | 'qa';
// 'done' is a loop sentinel rather than a real stage, so it gets its own type.
type WizardCursor = WizardStage | 'done';

async function runWizard(job: GenerationJob): Promise<WizardResult> {
  // Resume after the last completed stage, or start from the top on a fresh job.
  const restored = await wizardRestore(job.id);
  let stage: WizardCursor = restored?.lastCompletedStage
    ? nextStage(restored.lastCompletedStage)
    : 'analyze';

  while (stage !== 'done') {
    await processGuards.assertNotCancelled(job.id);
    const artifact = await runStage(stage, job, restored?.artifacts);
    await job.checkpoint.saveStage(stage, artifact);
    stage = nextStage(stage); // returns 'done' after 'qa'
  }

  return job.checkpoint.loadFinalResult();
}
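The loop relies on a `nextStage` helper that is not shown. One plausible shape, assuming stages always run in a fixed order and that `'done'` marks the end:

```typescript
// Assumed stage order, taken from the five stages described in the post.
const STAGE_ORDER = ['analyze', 'plan', 'generate', 'stitch', 'qa'] as const;

type WizardStage = (typeof STAGE_ORDER)[number];
type WizardCursor = WizardStage | 'done';

// Returns the stage that follows `stage`, or 'done' after the last one.
function nextStage(stage: WizardStage): WizardCursor {
  const i = STAGE_ORDER.indexOf(stage);
  return i === STAGE_ORDER.length - 1 ? 'done' : STAGE_ORDER[i + 1];
}
```

Deriving the type from the array keeps the stage list and the stage order in a single place.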

Vendor adapters live in isolation under vendors/. Swapping Kling for a different video generation model means changing the kie adapter — the planner, stitch, and QA modules never import vendor-specific types.
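A sketch of what that boundary might look like: a vendor-neutral interface that pipeline code depends on, plus a stand-in adapter. All names here are hypothetical. The method names mirror the `submit` and `pollUntilComplete` calls used in the clip generation snippet, but the real adapter's types are not shown in the post.

```typescript
// Vendor-neutral shapes (hypothetical). Only the adapter knows about kie/Kling.
interface ClipRequest {
  prompt: string;
  imageUrl: string;
  durationSec: number;
}

interface GeneratedClip {
  localPath: string;
  durationMs: number;
}

interface VideoGenerator {
  submit(request: ClipRequest): Promise<string>; // resolves to a task id
  pollUntilComplete(taskId: string, options: { timeoutMs: number }): Promise<GeneratedClip>;
}

// Pipeline code depends on the interface, never on a concrete vendor client.
async function renderClip(generator: VideoGenerator, request: ClipRequest): Promise<GeneratedClip> {
  const taskId = await generator.submit(request);
  return generator.pollUntilComplete(taskId, { timeoutMs: 180_000 });
}

// Stand-in adapter for tests; a real one would wrap the vendor's HTTP API.
const fakeGenerator: VideoGenerator = {
  async submit(request) {
    return `task-${request.durationSec}`;
  },
  async pollUntilComplete(taskId) {
    return { localPath: `/tmp/${taskId}.mp4`, durationMs: 5000 };
  },
};
```

A fake adapter like this also makes the orchestration testable without spending generation credits.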

Tech stack and deployment

The platform runs on Next.js with TypeScript throughout. Long-running generation jobs execute as background processes on Railway, not in serverless request handlers — a twelve-scene Kling run can take 15+ minutes and cannot fit in a Vercel function timeout.

Claude (Anthropic) — scene planning, narrative QA, VO rewrite, warning resolution
Kling via kie — per-scene video generation with image conditioning
ElevenLabs — TTS voiceover and sound effects
FFmpeg — scene detection, concat, xfade transitions, LUT color grade, audio mux
Google Drive — checkpoint persistence (clips, plans, analysis artifacts)
fal — lipsync and queue management for alternate generation paths

Process guards wrap every stage with cancellation checks and timeout enforcement. If an operator cancels a job mid-generation, the guard raises before the next Kling call fires — preventing orphaned API charges on a job nobody wants.
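A guard of this kind might look like the sketch below: check for cancellation first, then race the work against a timer. The function names and the `isCancelled` callback are assumptions rather than the actual process-guards API.

```typescript
// Reject if `work` has not settled within timeoutMs. This only stops waiting;
// it does not abort the underlying request.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`stage timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Check cancellation before starting any paid work, then enforce the timeout.
async function runGuarded<T>(
  isCancelled: () => Promise<boolean>,
  timeoutMs: number,
  run: () => Promise<T>,
): Promise<T> {
  if (await isCancelled()) throw new Error('job cancelled');
  return withTimeout(run(), timeoutMs);
}
```

Because `run` is a function rather than an already-started promise, a cancelled job never fires the vendor call at all.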

What 20 PRs in 3 days looked like

The work broke down roughly like this:

Days 1–2: Analyzer and planner modules, Claude integration, plan QA, domain folder restructure
Days 2–3: Kling clip generation with checkpointing, FFmpeg stitch pipeline, audio mix with sidechain ducking
Day 3: Cross-scene narrative QA, pacing auto-fit, warning resolver, wizard-restore, process guards

Several PRs were refactors: vendor adapters, validator consolidation, FFmpeg helper dedup. The domain restructure happened early because spending half a day on folder boundaries cost less than navigating 72 flat files while shipping in parallel.

What I learned

Generative video ads are a pipeline problem, not a model problem. Kling produces impressive individual clips, but the product value lives in orchestration — narrative continuity across independently generated scenes, VO pacing matched to the reference, and graceful recovery when scene seven times out.

Checkpointing at every stage boundary was the highest-leverage decision. A 15-minute job that fails at minute fourteen resumes from the last completed scene instead of restarting from the analyzer.

Plan QA catches structural narrative problems before Kling credits are spent. Pacing auto-fit rewrites over-budget lines before TTS renders them. Every expensive operation has a cheap validation gate upstream.

Three days, 20 PRs, 7,000 lines. The pipeline generates ads that feel like the reference went viral for a different brand — which is exactly the point.
