I joined a new project recently — an AI-powered video ad generation platform. The product promise is deceptively simple: give it a viral reference video and a product or brand, and it produces a new ad that feels like the same viral, just for your brand. Same pacing, same emotional arc, same camera language — but with your protagonist, your product, your voiceover.
That simplicity hides a brutal engineering problem. A viral short-form video is not a template — it is scene cuts, emotional beats, caption rhythms, and narrative causality that humans parse subconsciously. Replicating that feel with generative models requires a multi-stage pipeline where each stage has different failure modes, latency profiles, and vendor APIs.
I had three days. I shipped 20 pull requests across 100+ files and roughly 7,000 lines of TypeScript. This post walks through the five-stage pipeline, the domain-driven module layout, and the checkpointing and QA layers that make the whole thing recoverable when a Kling render fails at scene seven of twelve.
What the pipeline does
The input is a reference video URL plus brand assets: product images, an avatar, a voice profile, and optional brand guidelines. The output is a finished MP4 ready for ad placement — captioned, color-graded, with mixed audio and smooth transitions between scenes.
The generation flow has five stages, each producing structured artifacts that the next stage consumes:
- Analyzer — decomposes the reference video into scenes and narrative metadata
- Planner — generates a scene-by-scene script and generation plan via Claude
- Clip Generation — renders each scene independently through Kling
- Stitch — assembles clips with FFmpeg: captions, audio mix, color grade, transitions
- QA / Resolve — cross-scene narrative validation and automated fixes
The hard part is not generating a single good clip. It is maintaining narrative continuity, emotional pacing, and visual consistency across twelve independently generated scenes — then stitching them into something that feels like one continuous piece of content.
Each stage is idempotent and checkpointed. If clip generation fails halfway through, the wizard restores from the last good checkpoint on Google Drive and resumes — it does not restart from the analyzer.
Stage 1: The Analyzer
The analyzer's job is to reverse-engineer why a reference video works. It runs scene detection to split the video into discrete segments, then enriches each segment with structured metadata:
- Story structure — continuous narrative vs. montage (this changes how the planner writes scene transitions)
- Protagonist arc — who is on screen, how they change scene to scene
- Time progression — morning-to-evening, before-and-after, flashback patterns
- Emotion per scene — curiosity, frustration, relief, triumph — mapped to timestamps
- Caption style — font placement, word timing, emphasis patterns extracted from the reference
- Camera moves — static, push-in, pan, handheld — per scene
Scene detection uses FFmpeg frame analysis combined with visual change scoring. The analyzer also extracts keyframes from each scene — these become input references for clip generation later, anchoring the visual composition of each generated scene to the reference.
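To make the boundary step concrete, here is a minimal sketch of turning per-frame change scores into scene spans. It assumes the frame analysis has already produced one score per sampled frame in the range 0 to 1. The names `FrameScore` and `detectScenes`, the 0.4 threshold, and the 500 ms minimum scene length are illustrative assumptions, not the project's actual values.

```typescript
// Hypothetical input: one visual-change score per sampled frame, in [0, 1].
interface FrameScore {
  timeMs: number;
  score: number;
}

interface SceneSpan {
  startMs: number;
  endMs: number;
}

// Start a new scene wherever the change score reaches the threshold,
// skipping cuts that would leave a scene shorter than minSceneMs.
function detectScenes(
  frames: FrameScore[],
  videoEndMs: number,
  threshold = 0.4, // assumed value, tune per content
  minSceneMs = 500, // assumed value, filters flash frames
): SceneSpan[] {
  const scenes: SceneSpan[] = [];
  let startMs = 0;
  for (const frame of frames) {
    const longEnough = frame.timeMs - startMs >= minSceneMs;
    if (frame.score >= threshold && longEnough) {
      scenes.push({ startMs, endMs: frame.timeMs });
      startMs = frame.timeMs;
    }
  }
  // Close the final scene at the end of the video.
  if (videoEndMs > startMs) {
    scenes.push({ startMs, endMs: videoEndMs });
  }
  return scenes;
}
```

The minimum-length guard matters for short-form video, where a flash or whip pan can spike the score on consecutive frames without being a real cut.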
    interface SceneAnalysis {
      index: number;
      startMs: number;
      endMs: number;
      keyframeUrl: string;
      emotion: EmotionTag;
      cameraMove: CameraMove;
      captionStyle: CaptionStyle;
      spokenDurationMs: number;
    }

    interface VideoAnalysis {
      storyStructure: 'continuous' | 'montage';
      protagonistArc: ProtagonistArc;
      timeProgression: TimeProgression;
      scenes: SceneAnalysis[];
      totalSpokenMs: number;
      referenceCaptionStyle: CaptionStyle;
    }
The totalSpokenMs field is critical downstream. The planner budgets voiceover script length to match the reference's total spoken time — a 47-second viral stays a 47-second ad, not a 90-second explainer that kills retention.
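As a rough illustration of that budgeting, a spoken duration can be converted into a word cap before the script is written. This is a sketch under one loud assumption: a flat speaking rate of 2.5 words per second (about 150 words per minute). Real TTS voices vary, and `maxWordsForScene` and `buildWordBudgets` are hypothetical helper names.

```typescript
// Assumed average speaking rate; treat as a tunable, not a measured value.
const WORDS_PER_SECOND = 2.5;

interface SceneBudgetInput {
  index: number;
  spokenDurationMs: number;
}

interface SceneWordBudget {
  index: number;
  maxWords: number;
}

// Convert a spoken duration into the maximum word count for a VO line.
function maxWordsForScene(
  spokenDurationMs: number,
  wordsPerSecond = WORDS_PER_SECOND,
): number {
  return Math.floor((spokenDurationMs / 1000) * wordsPerSecond);
}

// Produce one word budget per scene, keyed by scene index.
function buildWordBudgets(scenes: SceneBudgetInput[]): SceneWordBudget[] {
  return scenes.map((scene) => ({
    index: scene.index,
    maxWords: maxWordsForScene(scene.spokenDurationMs),
  }));
}
```

Under this assumed rate, a 47-second reference leaves room for roughly 117 spoken words in total, which is a useful sanity bound to hand to the planner prompt.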
Stage 2: The Planner
The planner is where Claude does the creative heavy lifting. It receives the full VideoAnalysis, the brand's product catalog entry, avatar profile, and voice characteristics. It outputs a scene-by-scene plan where every scene must satisfy two constraints simultaneously:
- Story spine — each scene advances the protagonist and sets up the next scene (no orphaned beats)
- Product placement — the product appears naturally within the narrative, not as a bolted-on end card
Each planned scene includes a VO script line (budgeted to that scene's spokenDurationMs), an emotion tag, time markers, a generation prompt for Kling, and explicit references to which product images and avatar frames to use as conditioning inputs.
    interface PlannedScene {
      sceneIndex: number;
      voScript: string;
      voDurationMs: number;
      emotion: EmotionTag;
      storyBeat: string;    // what this scene accomplishes narratively
      setupForNext: string; // causality link to scene N+1
      klingPrompt: string;
      inputFrames: {
        referenceKeyframe: string;
        avatarFrame?: string;
        productImages: string[];
      };
      productPlacement: ProductPlacementSpec;
    }

    interface GenerationPlan {
      scenes: PlannedScene[];
      totalVoMs: number;
      storyStructure: 'continuous' | 'montage';
      warnings: PlanWarning[];
    }
Plan QA runs immediately after planning. It checks for protagonist drift (scene five introduces a character who was never set up), missing causality (scene three's payoff has no setup), and time jumps that break the reference's pacing rhythm. Warnings surface in the UI before generation starts — the operator can resolve them manually or hit "Resolve with AI."
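Some of these checks are purely structural and need no model call. The sketch below covers two of them: a missing causality link and a VO line that exceeds its budget. The `PlanWarning` shape, the `SceneToCheck` input, and `checkPlan` are assumptions made for illustration (the post does not show the real warning type), and checks like protagonist drift presumably need model judgment and are left out.

```typescript
// Hypothetical, trimmed-down view of a planned scene for structural checks.
interface SceneToCheck {
  sceneIndex: number;
  voDurationMs: number;
  budgetMs: number; // the matching reference scene's spokenDurationMs
  setupForNext: string;
}

// Hypothetical warning shape; the real PlanWarning is not shown in the post.
type PlanWarning =
  | { kind: 'missing-causality'; sceneIndex: number }
  | { kind: 'pacing-mismatch'; sceneIndex: number; overByMs: number };

function checkPlan(scenes: SceneToCheck[]): PlanWarning[] {
  const warnings: PlanWarning[] = [];
  scenes.forEach((scene, i) => {
    const isLast = i === scenes.length - 1;
    // Every scene except the last should set up the one that follows.
    if (!isLast && scene.setupForNext.trim() === '') {
      warnings.push({ kind: 'missing-causality', sceneIndex: scene.sceneIndex });
    }
    // A VO line longer than its budget will break the reference pacing.
    if (scene.voDurationMs > scene.budgetMs) {
      warnings.push({
        kind: 'pacing-mismatch',
        sceneIndex: scene.sceneIndex,
        overByMs: scene.voDurationMs - scene.budgetMs,
      });
    }
  });
  return warnings;
}
```

Running cheap checks like these first means the model-backed checks only see plans that are already structurally sound.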
Stage 3: Clip Generation
Each planned scene renders independently through Kling, accessed via the kie provider API. This is the slowest and most expensive stage — a twelve-scene ad means twelve sequential (or partially parallelized) video generation calls, each taking 30–90 seconds.
Every clip generation request includes:
- Input frames from the reference keyframe (composition anchor)
- Avatar consistency frames (same face across scenes)
- Product images locked via the product-fidelity module
- The scene-specific Kling prompt from the planner
Avatar consistency and product fidelity are non-negotiable for ad use. A generated scene where the product label is wrong or the protagonist's face morphs between cuts is unusable. The catalog module runs asset preflight checks before generation starts — verifying image resolution, aspect ratio, and that product-lock metadata matches the catalog entry.
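A preflight of this kind can be sketched as a pure function over image metadata. Everything here is an assumption for illustration: the `ImageMeta` input, the `preflightImages` name, and the limits (720 px short side, aspect ratio between 0.5 and 2). Only the `ok` and `failures` result fields mirror what the generation code in this post consumes, and the product-lock comparison is omitted.

```typescript
// Hypothetical metadata, assumed to be read from the catalog or image headers.
interface ImageMeta {
  url: string;
  width: number;
  height: number;
}

interface PreflightLimits {
  minShortSidePx: number;
  minAspect: number; // width / height
  maxAspect: number;
}

interface PreflightResult {
  ok: boolean;
  failures: string[];
}

// Assumed limits, not the project's real thresholds.
const DEFAULT_LIMITS: PreflightLimits = { minShortSidePx: 720, minAspect: 0.5, maxAspect: 2 };

function preflightImages(
  images: ImageMeta[],
  limits: PreflightLimits = DEFAULT_LIMITS,
): PreflightResult {
  const failures: string[] = [];
  for (const img of images) {
    const shortSide = Math.min(img.width, img.height);
    if (shortSide < limits.minShortSidePx) {
      failures.push(`${img.url}: short side ${shortSide}px is below ${limits.minShortSidePx}px`);
    }
    const aspect = img.width / img.height;
    if (aspect < limits.minAspect || aspect > limits.maxAspect) {
      failures.push(`${img.url}: aspect ratio ${aspect.toFixed(2)} is outside the allowed range`);
    }
  }
  return { ok: failures.length === 0, failures };
}
```

Collecting every failure instead of stopping at the first lets the operator fix all bad assets in one pass before any generation credits are spent.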
    async function generateSceneClip(
      scene: PlannedScene,
      job: GenerationJob,
    ): Promise<SceneClipResult> {
      const preflight = await assetPreflight(scene.inputFrames, job.catalog);
      if (!preflight.ok) {
        throw new PreflightError(preflight.failures);
      }

      const taskId = await kieKling.submit({
        prompt: scene.klingPrompt,
        imageUrl: scene.inputFrames.referenceKeyframe,
        avatarUrl: scene.inputFrames.avatarFrame,
        productUrls: scene.inputFrames.productImages,
        durationSec: Math.ceil(scene.voDurationMs / 1000),
      });

      const clip = await kieKling.pollUntilComplete(taskId, {
        timeoutMs: 180_000,
        onProgress: (pct) => job.checkpoint.updateSceneProgress(scene.sceneIndex, pct),
      });

      await job.checkpoint.saveSceneClip(scene.sceneIndex, clip);
      return { sceneIndex: scene.sceneIndex, clipPath: clip.localPath, durationMs: clip.durationMs };
    }
Checkpoints write each completed clip to Google Drive immediately. If the job crashes at scene eight, scenes one through seven are already persisted. The wizard-runner restores state and resumes from scene eight on the next run.
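The resume decision itself is simple once clips are persisted per scene. A minimal sketch, assuming the checkpoint can list which scene indices already have a saved clip; `SavedClip` and `pendingScenes` are hypothetical names, and scene indices are assumed to be zero-based.

```typescript
// Hypothetical record of a clip that has already been persisted.
interface SavedClip {
  sceneIndex: number; // zero-based
  clipPath: string;
}

// Return the scene indices that still need to be generated, in plan order.
function pendingScenes(totalScenes: number, saved: SavedClip[]): number[] {
  const pending: number[] = [];
  for (let i = 0; i < totalScenes; i++) {
    const alreadySaved = saved.some((clip) => clip.sceneIndex === i);
    if (!alreadySaved) {
      pending.push(i);
    }
  }
  return pending;
}
```

Computing the pending set from what is actually saved, rather than from a "last scene" counter, also covers partially parallel runs where clips can finish out of order.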
Stage 4: Stitch
Individual Kling clips are raw generative output — no captions, no mixed audio, no color consistency across scenes. The stitch module is FFmpeg-based and handles the full post-production stack:
- Concat — join scene clips in plan order
- Transitions — xfade for soft joins between scenes (cross-dissolve duration tuned per story structure)
- Captions — overlay rendered from the planner's VO script and reference caption style
- Audio mix — ElevenLabs TTS for VO, mixed with a background music bed using sidechain ducking (VO ducking the bed, not the other way around)
- Color grade — LUT application for cross-scene tonal consistency
Sidechain ducking lowers the music bed while VO is present and releases it when the VO pauses, so the result sounds professionally mixed rather than like a slideshow with a track pasted underneath.
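For a sense of what this looks like at the FFmpeg level, here is a sketch that builds a filter graph from duck settings. It uses FFmpeg's `asplit`, `sidechaincompress`, and `amix` filters. The assumptions: input 0 is the music bed, input 1 is the VO, and the threshold is given in dB and converted to the linear value that `sidechaincompress` expects. `DuckSettings` and `buildDuckingFilter` are hypothetical names, not the project's audio-bed module.

```typescript
interface DuckSettings {
  thresholdDb: number; // assumed to be in dB; sidechaincompress wants a linear value
  ratio: number;
  attackMs: number;
  releaseMs: number;
}

function dbToLinear(db: number): number {
  return Math.pow(10, db / 20);
}

// Build an FFmpeg -filter_complex string. Assumes input 0 = music bed, input 1 = VO.
function buildDuckingFilter(settings: DuckSettings): string {
  const threshold = dbToLinear(settings.thresholdDb).toFixed(4);
  return [
    // The VO is needed twice: once as the sidechain key, once in the final mix.
    '[1:a]asplit=2[vo][key]',
    // Compress the bed whenever the key (VO) exceeds the threshold.
    `[0:a][key]sidechaincompress=threshold=${threshold}:ratio=${settings.ratio}` +
      `:attack=${settings.attackMs}:release=${settings.releaseMs}[ducked]`,
    // Mix the ducked bed with the VO. Output levels may need adjusting afterwards.
    '[ducked][vo]amix=inputs=2:duration=longest[mix]',
  ].join(';');
}
```

The order of inputs to `sidechaincompress` is what makes the VO duck the bed: the first input is the signal being compressed, the second is the signal that triggers the compression.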
    interface StitchConfig {
      clips: string[];
      voTrackPath: string;
      musicBedPath: string;
      captions: CaptionOverlay[];
      transitions: TransitionSpec[]; // xfade duration per join
      colorGradeLut: string;
      outputPath: string;
    }

    async function stitchFinalVideo(config: StitchConfig): Promise<string> {
      const concatenated = await ffmpegConcat(config.clips, config.transitions);
      const withCaptions = await overlayRender(concatenated, config.captions);
      const voTrack = await voiceEmotion.render(config.voTrackPath);
      const mixed = await audioBed.mix({
        voTrack,
        bedTrack: config.musicBedPath,
        sidechainDuck: { threshold: -18, ratio: 4, attackMs: 5, releaseMs: 200 },
      });
      const graded = await ffmpegApplyLut(withCaptions, config.colorGradeLut);
      return ffmpegMuxVideoAudio(graded, mixed, config.outputPath);
    }
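One detail the xfade joins force you to get right is the offset of each transition. FFmpeg's `xfade` filter takes an `offset` measured on the accumulated output, and every crossfade shortens that output by its own duration. The sketch below computes the offsets and chains the filters. The function names are hypothetical and the transition type is fixed to `fade` for simplicity.

```typescript
// Compute the xfade offset for each join, in seconds on the accumulated output.
function xfadeOffsets(clipDurationsSec: number[], transitionSec: number[]): number[] {
  if (transitionSec.length !== clipDurationsSec.length - 1) {
    throw new Error('expected exactly one transition per join');
  }
  const offsets: number[] = [];
  let runningSec = clipDurationsSec[0];
  for (let i = 0; i < transitionSec.length; i++) {
    // The crossfade starts transitionSec before the accumulated output ends.
    const offset = runningSec - transitionSec[i];
    offsets.push(offset);
    // The next clip begins at the offset, so the overlap is not counted twice.
    runningSec = offset + clipDurationsSec[i + 1];
  }
  return offsets;
}

// Chain one xfade per join: [0:v][1:v] -> [v1], then [v1][2:v] -> [v2], and so on.
function buildXfadeGraph(clipDurationsSec: number[], transitionSec: number[]): string {
  const offsets = xfadeOffsets(clipDurationsSec, transitionSec);
  let previousLabel = '[0:v]';
  const chains = offsets.map((offset, i) => {
    const outLabel = `[v${i + 1}]`;
    const chain =
      `${previousLabel}[${i + 1}:v]xfade=transition=fade` +
      `:duration=${transitionSec[i]}:offset=${offset}${outLabel}`;
    previousLabel = outLabel;
    return chain;
  });
  return chains.join(';');
}
```

For clips of 5, 4, and 6 seconds with two half-second fades, the offsets come out as 4.5 and 8, and the finished video runs 14 seconds instead of 15. That lost second is why VO timing has to be planned against the stitched duration, not the sum of the clips.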
Stage 5: QA and Resolve
Per-scene generation QA catches visual defects — wrong product, face drift, truncated clips. Cross-scene narrative QA catches problems that only emerge after assembly:
- Protagonist drift — the character's appearance or role shifts mid-video
- Missing causality — a scene references an event that was never shown
- Time jumps — the narrative timeline breaks (morning scene followed by "later that morning" with no visual transition)
- Pacing mismatch — VO lines exceed their scene's spoken duration budget
Pacing auto-fit handles over-budget VO lines. When a script line runs longer than its scene's spokenDurationMs, the resolver rewrites it across up to three passes — tightening language while preserving the story beat. If three passes still exceed budget, it flags the scene for manual review rather than silently truncating.
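The control flow of that loop is worth seeing on its own. This sketch keeps the duration estimator and the rewriter as injected functions, so it says nothing about how the real resolver prompts Claude. It is synchronous for brevity, whereas a model-backed rewriter would be async. `autoFitLine` and `FitResult` are hypothetical names.

```typescript
interface FitResult {
  status: 'fits' | 'needs-review';
  script: string;
  passes: number;
}

// Rewrite an over-budget VO line up to maxPasses times, then flag it for
// manual review instead of truncating it.
function autoFitLine(
  script: string,
  budgetMs: number,
  estimateMs: (script: string) => number,
  rewrite: (script: string, budgetMs: number) => string,
  maxPasses = 3,
): FitResult {
  let current = script;
  let passes = 0;
  while (estimateMs(current) > budgetMs && passes < maxPasses) {
    current = rewrite(current, budgetMs);
    passes++;
  }
  const fits = estimateMs(current) <= budgetMs;
  return { status: fits ? 'fits' : 'needs-review', script: current, passes };
}
```

A line that already fits returns with zero passes, so the loop costs nothing for scenes that were planned within budget.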
"Resolve with AI" is the operator escape hatch. Each warning type maps to a targeted fix prompt — regenerate a single scene with adjusted conditioning, rewrite a VO line, or reorder scene transitions. The warning resolver module dispatches the fix, re-runs the affected pipeline stages, and updates the checkpoint.
Domain-driven module layout
When I joined, the codebase had 72 flat lib/*.ts files with no clear ownership boundaries. Finding where clip generation interacted with checkpointing meant grep archaeology. I reorganized everything into five domain folders:
| Domain | Responsibility | Key modules |
|---|---|---|
| core/ | Job lifecycle, persistence, validation | jobs, checkpoint, drive, wizard-restore, logger, process-guards, validators |
| swipe/ | Pipeline orchestration and planning | wizard-runner, swipe-analyzer, swipe-planner, swipe-vo, plan QA, warning resolver |
| vendors/ | External API adapters | anthropic, openai-image, kie/Kling, fal, elevenlabs |
| media/ | FFmpeg and audio processing | stitch, audio-bed, music-bed, overlay-render, voice-emotion, scene-detector |
| catalog/ | Brand asset management | products, avatars, voices, product-lock, product-fidelity, asset-preflight |
The wizard-runner orchestrates analyzer → planner → clip gen → stitch → QA, saving a checkpoint at each stage boundary and consulting wizard-restore on retry.
    type WizardStage = 'analyze' | 'plan' | 'generate' | 'stitch' | 'qa';
    // 'done' is the loop sentinel that nextStage returns after 'qa'; it is not a runnable stage.
    type StageCursor = WizardStage | 'done';

    async function runWizard(job: GenerationJob): Promise<WizardResult> {
      const restored = await wizardRestore(job.id);
      let stage: StageCursor = restored?.lastCompletedStage
        ? nextStage(restored.lastCompletedStage)
        : 'analyze';

      while (stage !== 'done') {
        await processGuards.assertNotCancelled(job.id);
        const artifact = await runStage(stage, job, restored?.artifacts);
        await job.checkpoint.saveStage(stage, artifact);
        stage = nextStage(stage);
      }

      return job.checkpoint.loadFinalResult();
    }
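The `nextStage` helper is not shown above. One plausible shape, sketched here as an assumption, derives the stage type from a single ordered list and returns a `'done'` sentinel after the last stage, which is what the loop condition relies on. `STAGE_ORDER` and `StageCursor` are hypothetical names.

```typescript
// Single source of truth for stage order; the union type is derived from it.
const STAGE_ORDER = ['analyze', 'plan', 'generate', 'stitch', 'qa'] as const;

type WizardStage = (typeof STAGE_ORDER)[number];
// 'done' marks the end of the pipeline and is not itself a runnable stage.
type StageCursor = WizardStage | 'done';

function nextStage(stage: WizardStage): StageCursor {
  const index = STAGE_ORDER.indexOf(stage);
  return index + 1 < STAGE_ORDER.length ? STAGE_ORDER[index + 1] : 'done';
}
```

Keeping `'done'` out of `WizardStage` means a stage runner can never be asked to execute it, while the loop can still compare against it.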
Vendor adapters live in isolation under vendors/. Swapping Kling for a different video generation model means changing the kie adapter — the planner, stitch, and QA modules never import vendor-specific types.
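The core of such an adapter is a mapping from a vendor-neutral request to the vendor's payload. In the sketch below, `ClipRequest` and `toKieKlingPayload` are hypothetical names. The payload fields mirror the `kieKling.submit` call shown in the clip generation code, and nothing here should be read as the documented kie API.

```typescript
// Vendor-neutral request used by the rest of the pipeline (hypothetical shape).
interface ClipRequest {
  prompt: string;
  compositionFrameUrl: string;
  avatarFrameUrl?: string;
  productImageUrls: string[];
  durationMs: number;
}

// Payload fields as they appear in the kieKling.submit call in this post.
interface KieKlingPayload {
  prompt: string;
  imageUrl: string;
  avatarUrl?: string;
  productUrls: string[];
  durationSec: number;
}

// The only place that knows both shapes; swapping vendors means replacing this.
function toKieKlingPayload(request: ClipRequest): KieKlingPayload {
  return {
    prompt: request.prompt,
    imageUrl: request.compositionFrameUrl,
    avatarUrl: request.avatarFrameUrl,
    productUrls: request.productImageUrls.slice(), // copy so callers cannot mutate the payload
    durationSec: Math.ceil(request.durationMs / 1000),
  };
}
```

With the mapping isolated like this, a second video vendor only needs its own payload type and its own `to...Payload` function; `ClipRequest` stays the same for every caller.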
Tech stack and deployment
The platform runs on Next.js with TypeScript throughout. Long-running generation jobs execute as background processes on Railway, not in serverless request handlers — a twelve-scene Kling run can take 15+ minutes and cannot fit in a Vercel function timeout.
- Claude (Anthropic) — scene planning, narrative QA, VO rewrite, warning resolution
- Kling via kie — per-scene video generation with image conditioning
- ElevenLabs — TTS voiceover and sound effects
- FFmpeg — scene detection, concat, xfade transitions, LUT color grade, audio mux
- Google Drive — checkpoint persistence (clips, plans, analysis artifacts)
- fal — lipsync and queue management for alternate generation paths
Process guards wrap every stage with cancellation checks and timeout enforcement. If an operator cancels a job mid-generation, the guard raises before the next Kling call fires — preventing orphaned API charges on a job nobody wants.
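The cancellation half of that guard can be sketched in a few lines. This version keeps cancelled job IDs in memory and runs steps synchronously. Both are simplifications for illustration, since a real guard would read cancellation state from the job store and wrap async stages, and timeout enforcement is left out. All names here are hypothetical.

```typescript
// In-memory cancellation registry; a real system would persist this per job.
const cancelledJobIds: string[] = [];

function cancelJob(jobId: string): void {
  if (cancelledJobIds.indexOf(jobId) === -1) {
    cancelledJobIds.push(jobId);
  }
}

function assertNotCancelled(jobId: string): void {
  if (cancelledJobIds.indexOf(jobId) !== -1) {
    throw new Error(`Job ${jobId} was cancelled`);
  }
}

// Run steps in order, checking for cancellation before each one so that
// no further (billable) step starts after the operator cancels.
function runGuarded<T>(jobId: string, steps: Array<() => T>): T[] {
  const results: T[] = [];
  for (const step of steps) {
    assertNotCancelled(jobId);
    results.push(step());
  }
  return results;
}
```

The check sits before each step rather than after it, so a cancellation that arrives mid-render lets the current call finish but stops the next one from being submitted.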
What 20 PRs in 3 days looked like
The work broke down roughly like this:
- Days 1–2: Analyzer and planner modules, Claude integration, plan QA, domain folder restructure
- Days 2–3: Kling clip generation with checkpointing, FFmpeg stitch pipeline, audio mix with sidechain ducking
- Day 3: Cross-scene narrative QA, pacing auto-fit, warning resolver, wizard-restore, process guards
Several PRs were refactors: vendor adapters, validator consolidation, FFmpeg helper dedup. The domain restructure happened early, because navigating 72 flat files while shipping in parallel would have cost more time than the half day spent drawing folder boundaries.
What I learned
Generative video ads are a pipeline problem, not a model problem. Kling produces impressive individual clips, but the product value lives in orchestration — narrative continuity across independently generated scenes, VO pacing matched to the reference, and graceful recovery when scene seven times out.
Checkpointing at every stage boundary was the highest-leverage decision. A 15-minute job that fails at minute fourteen resumes from the last saved scene instead of restarting from the analyzer.
Plan QA catches structural narrative problems before Kling credits are spent. Pacing auto-fit rewrites over-budget lines before TTS renders them. Every expensive operation has a cheap validation gate upstream.
Three days, 20 PRs, 7,000 lines. The pipeline generates ads that feel like the reference went viral for a different brand — which is exactly the point.