Narrative Continuity in AI Video Generation: Story Spine and L/J-Cut Editing

By Humza Tareen · June 28, 2026 · 6 min read

Video Generation Narrative AI FFmpeg Claude TypeScript Content Quality

The reference videos that go viral on short-form platforms often tell one continuous story: a person waking up, getting ready, commuting, arriving at work. The cuts are fast, but the narrative thread is unmistakable. When I first built the platform's video generation pipeline, the output looked nothing like that. Each scene was generated independently from a single script line, so scene three had no idea what happened in scene two. The result felt like a montage of unrelated clips stitched together, not a day-in-the-life story.

The fix took roughly 850 lines across two PRs: a story spine in the planner, cross-scene narrative QA, and a continuous voiceover model inspired by L-cut and J-cut editing. I then revised the audio approach again after live testing revealed sync issues. This post walks through the problem, the architecture, and what I learned shipping narrative continuity into an automated pipeline.

The problem: independent scenes, fragmented voiceover

The pipeline takes a reference viral video, analyzes its cut rhythm, and regenerates each segment with AI-generated visuals and a new script. The original design treated each cut as an isolated unit: one script line in, one scene out. That worked fine for montage-style content — product showcases, before-and-after reels — but it broke down for continuous narratives.

Three symptoms showed up consistently in QA:

No causal connection between scenes. A character might be leaving the house in scene two and suddenly be back in bed in scene three, because each scene prompt was built from its own line with no memory of prior context.
Visual discontinuity. Lighting, wardrobe, and setting drifted between cuts even when the reference video maintained a coherent time-of-day progression.
Staccato voiceover. Each narration line was force-fit to its ~2-second reference cut, so sentences broke mid-thought. The voiceover (VO) sounded like a series of fragments rather than someone telling a story at a natural ~150 words per minute.

The reference video told one story. Our pipeline was generating twelve unrelated micro-stories and calling it a video.

Story spine: teaching the planner about narrative

The fix started upstream, in the analyzer and planner. The analyzer now captures a storyStructure object that distinguishes continuous-story content from montage, identifies the protagonist, maps the emotional arc, and tracks time progression across the reference video.

The planner enforces a story spine: every scene must advance the protagonist, and scene N must set up scene N+1. Instead of writing one line per cut in isolation, the script generator works against a whole-script budget — the total word count across all narration — producing flowing, coherent text that reads as a single piece of writing.
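
A whole-script budget can be derived from the reference duration and a target speaking rate. The sketch below is only an illustration of that arithmetic, not the planner's actual code: `scriptWordBudget` and its parameter names are hypothetical, and the 150 words-per-minute default simply reuses the natural speaking pace mentioned earlier.

```typescript
// Hypothetical helper: the name, parameters, and default rate are assumptions.
function scriptWordBudget(
  referenceCutsSec: number[],
  wordsPerMinute = 150,
): number {
  const totalSec = referenceCutsSec.reduce((sum, sec) => sum + sec, 0);
  // words = seconds * (words per minute) / 60, rounded down so the
  // script never overruns the reference video.
  return Math.floor((totalSec * wordsPerMinute) / 60);
}

// Twelve 2-second cuts give 24 seconds of video, which allows a 60-word script.
const budget = scriptWordBudget(Array.from({ length: 12 }, () => 2));
```

Budgeting the whole script at once, rather than roughly five words per 2-second cut, is what lets a sentence run across several cuts.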

Each scene also gets a timeMarker — morning, midday, evening — that drives consistent lighting and setting prompts. This prevents the jarring time jumps that made generated videos feel like they were filmed across three different days.

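// Time-of-day labels assigned to each scene.
type TimeMarker = 'morning' | 'midday' | 'evening';

interface StoryStructure {
  mode: 'continuous-story' | 'montage';
  protagonist: string;
  emotionalArc: string[];
  timeProgression: TimeMarker[]; // indexed by scene
}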

interface PlannedScene {
  index: number;
  scriptLine: string;
  timeMarker: TimeMarker;
  setupForNext: string;   // what this scene establishes for N+1
  followsFromPrev: string; // causal link from N-1
}

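// inferSetup is a pipeline helper that is not shown in this post.
function enforceStorySpine(
  scenes: PlannedScene[],
  structure: StoryStructure,
): PlannedScene[] {
  // Montage content has no spine to enforce.
  if (structure.mode === 'montage') return scenes;

  // First pass: fill in every missing setup, so the links built in the
  // second pass never point at an empty setupForNext.
  const withSetup = scenes.map((scene) => ({
    ...scene,
    setupForNext: scene.setupForNext || inferSetup(scene, structure),
  }));

  // Second pass: align time markers and link each scene to the one before it.
  return withSetup.map((scene, i) => ({
    ...scene,
    timeMarker: structure.timeProgression[i] ?? scene.timeMarker,
    followsFromPrev: i > 0
      ? `Continues from: ${withSetup[i - 1].setupForNext}`
      : `Opens with protagonist: ${structure.protagonist}`,
  }));
}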

The spine does not guarantee perfect continuity — the video model still hallucinates — but it gives every downstream step a shared narrative frame instead of twelve independent prompts.
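
To make the "shared narrative frame" concrete, here is a minimal sketch of how a scene prompt might be assembled from spine fields. Everything in it is an assumption for illustration: the post does not show the real prompt templates, so the lighting phrases, the `SpineScene` shape, and the `buildScenePrompt` name are hypothetical.

```typescript
type TimeMarker = 'morning' | 'midday' | 'evening';

// Reduced, hypothetical view of a planned scene.
interface SpineScene {
  scriptLine: string;
  timeMarker: TimeMarker;
  followsFromPrev: string;
}

// Assumed lighting vocabulary; a real pipeline would tune this per video model.
const LIGHTING: Record<TimeMarker, string> = {
  morning: 'soft warm sunrise light',
  midday: 'bright neutral daylight',
  evening: 'low golden-hour light',
};

function buildScenePrompt(scene: SpineScene, protagonist: string): string {
  // Every prompt carries the same protagonist and a lighting phrase derived
  // from the time marker, plus the causal link to the previous scene.
  return [
    `Protagonist: ${protagonist}`,
    `Lighting: ${LIGHTING[scene.timeMarker]}`,
    `Context: ${scene.followsFromPrev}`,
    `Action: ${scene.scriptLine}`,
  ].join('\n');
}

const prompt = buildScenePrompt(
  {
    scriptLine: 'She locks the front door and heads for the train.',
    timeMarker: 'morning',
    followsFromPrev: 'Continues from: she finishes her coffee',
  },
  'a young designer',
);
```

Because the lighting phrase is looked up from the marker rather than written per scene, two scenes with the same marker cannot disagree about time of day at the prompt level.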

Cross-scene narrative QA

Planning improvements are not enough on their own. I added a narrativePlanQa pass that inspects the full scene plan before generation starts. It flags three categories of problems:

protagonist_drift — the main character changes identity or role mid-story
missing_causality — scene N does not logically follow from scene N-1
time_jump — time-of-day markers regress or skip inconsistently

These are advisory warnings, not hard blocks. Operators see them in the preflight UI alongside existing timing and content checks. When a warning is actionable, "Resolve with AI" sends the flagged scenes to Claude along with the narrative context; the resolver maps the new QA codes and rewrites only the affected scenes while preserving the story spine.
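
One way to scope a targeted rewrite is to separate the scenes the resolver may change from the neighbouring scenes it only reads for context. The sketch below assumes that only auto-fixable warnings are sent and that one neighbour on each side is enough context; `selectScenesForResolve` and the `ResolveSelection` shape are hypothetical, not the pipeline's real resolver API.

```typescript
interface QaWarning {
  code: string;
  sceneIndices: number[];
  autoFixable: boolean;
}

interface ResolveSelection {
  rewrite: number[]; // scenes the resolver is allowed to change
  context: number[]; // read-only neighbours sent for narrative context
}

function selectScenesForResolve(
  warnings: QaWarning[],
  sceneCount: number,
): ResolveSelection {
  const rewrite = new Set<number>();
  warnings
    .filter((w) => w.autoFixable)
    .forEach((w) => w.sceneIndices.forEach((i) => rewrite.add(i)));

  // Add the scene before and after each flagged scene, unless it is
  // out of range or already marked for rewriting.
  const context = new Set<number>();
  rewrite.forEach((i) => {
    [i - 1, i + 1].forEach((n) => {
      if (n >= 0 && n < sceneCount && !rewrite.has(n)) context.add(n);
    });
  });

  const ascending = (a: number, b: number) => a - b;
  return {
    rewrite: Array.from(rewrite).sort(ascending),
    context: Array.from(context).sort(ascending),
  };
}

const selection = selectScenesForResolve(
  [{ code: 'missing_causality', sceneIndices: [3, 4], autoFixable: true }],
  12,
);
// selection.rewrite is [3, 4]; selection.context is [2, 5]
```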

type NarrativeQaCode =
  | 'protagonist_drift'
  | 'missing_causality'
  | 'time_jump';

interface NarrativeQaWarning {
  code: NarrativeQaCode;
  sceneIndices: number[];
  message: string;
  autoFixable: boolean;
}

function narrativePlanQa(scenes: PlannedScene[]): NarrativeQaWarning[] {
  const warnings: NarrativeQaWarning[] = [];

  const protagonists = new Set(scenes.map(s => extractProtagonist(s)));
  if (protagonists.size > 1) {
    warnings.push({
      code: 'protagonist_drift',
      sceneIndices: scenes.map((_, i) => i),
      message: `Multiple protagonists detected: ${[...protagonists].join(', ')}`,
      autoFixable: true,
    });
  }

  for (let i = 1; i < scenes.length; i++) {
    if (!scenesHaveCausalLink(scenes[i - 1], scenes[i])) {
      warnings.push({
        code: 'missing_causality',
        sceneIndices: [i - 1, i],
        message: `Scene ${i + 1} does not follow from scene ${i}`,
        autoFixable: true,
      });
    }
  }

  return warnings;
}
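
The excerpt above covers protagonist_drift and missing_causality. A time_jump check could look like the following sketch, which is an assumption about how such a check might work rather than the pipeline's code: it orders the three markers within a single day, then flags any pair of adjacent scenes where time moves backwards or skips a marker. `findTimeJumps`, `TIME_ORDER`, and the skip rule are hypothetical.

```typescript
type TimeMarker = 'morning' | 'midday' | 'evening';

// Assumed ordering of markers within a single day.
const TIME_ORDER: Record<TimeMarker, number> = { morning: 0, midday: 1, evening: 2 };

interface TimeJumpWarning {
  code: 'time_jump';
  sceneIndices: number[];
  message: string;
}

function findTimeJumps(markers: TimeMarker[]): TimeJumpWarning[] {
  const warnings: TimeJumpWarning[] = [];
  for (let i = 1; i < markers.length; i++) {
    const prev = markers[i - 1];
    const curr = markers[i];
    if (prev === undefined || curr === undefined) continue;

    const step = TIME_ORDER[curr] - TIME_ORDER[prev];
    if (step < 0) {
      warnings.push({
        code: 'time_jump',
        sceneIndices: [i - 1, i],
        message: `Scene ${i + 1} regresses from ${prev} to ${curr}`,
      });
    } else if (step > 1) {
      warnings.push({
        code: 'time_jump',
        sceneIndices: [i - 1, i],
        message: `Scene ${i + 1} skips from ${prev} to ${curr}`,
      });
    }
  }
  return warnings;
}

const jumps = findTimeJumps(['morning', 'midday', 'morning', 'evening']);
// Two warnings: a regression at scenes 2-3 and a skip at scenes 3-4.
```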

Continuous VO over cuts: L-cut and J-cut editing

Fixing the script was half the battle. The other half was audio timing. Real viral videos do not map one sentence to one cut. A narrator might begin a thought over a wide shot, continue through two close-ups, and finish as the scene transitions. This is classic L-cut and J-cut editing: in an L-cut the audio from the outgoing shot carries on over the next shot, and in a J-cut the audio of the incoming shot starts before the picture cuts to it.

I rebuilt the timing model around this idea. Instead of generating per-scene VO segments aligned to individual cuts, the stitch step builds one continuous narration track. Visual scenes maintain the reference cut rhythm underneath; a single sentence spans several cuts at natural speaking pace.

Two key changes made this work:

effectiveSceneDuration stays locked to the reference cut — scenes do not grow per narration line
computeTimingPreflight runs a single total-VO-versus-reference check instead of per-scene fragment flags

interface TimingPreflight {
  totalReferenceDurationSec: number;
  totalVoDurationSec: number;
  withinBudget: boolean;
  wordsPerMinute: number;
}

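function computeTimingPreflight(
  scenes: PlannedScene[],
  voLines: string[],
  referenceCuts: number[], // duration of each reference cut, in seconds
): TimingPreflight {
  const totalReferenceDurationSec = referenceCuts.reduce((a, b) => a + b, 0);
  const script = voLines.join(' ').trim();
  const totalVoDurationSec = estimateSpeechDuration(script);
  // Drop empty tokens so an empty script counts as zero words, not one.
  const wordCount = script.split(/\s+/).filter(Boolean).length;
  // Avoid dividing by zero when there is no narration.
  const wordsPerMinute =
    totalVoDurationSec > 0 ? (wordCount / totalVoDurationSec) * 60 : 0;

  return {
    totalReferenceDurationSec,
    totalVoDurationSec,
    // The VO may overrun the reference by up to 5%.
    withinBudget: totalVoDurationSec <= totalReferenceDurationSec * 1.05,
    wordsPerMinute,
  };
}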

// Stitch: one continuous VO track, padded to video length
async function buildContinuousVoTrack(
  voLines: string[],
  videoDurationSec: number,
): Promise<AudioBuffer> {
  const raw = await synthesizeSpeech(voLines.join(' '));
  return padAudioToDuration(raw, videoDurationSec);
}

The FFmpeg stitch concatenates raw narration lines into a single track and pads silence at the tail to match total video length. Visually, cuts still land on the reference rhythm. Aurally, the story flows.
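
The tail padding can be expressed with FFmpeg's `apad` audio filter, whose `whole_dur` option sets a minimum total duration and appends silence until it is reached. The sketch below only builds the argument list; the file names and the `buildPadSilenceArgs` helper are placeholders, it assumes an FFmpeg build whose `apad` supports `whole_dur`, and the post does not show the exact filter graph the pipeline uses.

```typescript
// Hypothetical helper: builds ffmpeg arguments that pad narration with
// trailing silence up to the video length.
function buildPadSilenceArgs(
  inputPath: string,
  outputPath: string,
  videoDurationSec: number,
): string[] {
  return [
    '-y', // overwrite the output file if it already exists
    '-i', inputPath,
    // apad appends silence; whole_dur is the minimum total output duration.
    '-af', `apad=whole_dur=${videoDurationSec.toFixed(3)}`,
    outputPath,
  ];
}

// Equivalent shell command:
//   ffmpeg -y -i vo.wav -af apad=whole_dur=24.000 vo_padded.wav
const padArgs = buildPadSilenceArgs('vo.wav', 'vo_padded.wav', 24);
```

Returning an argument array rather than a command string keeps the paths safe to pass to a process spawner without shell quoting.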

Revision: back to per-scene audio

The continuous VO approach sounded better on paper than in production. In practice, it felt like one overlay track sitting on top of unrelated visuals rather than narration matched to each scene. The L-cut illusion broke when viewers could see a kitchen scene while hearing about a commute.

I reverted to per-scene voice-driven segments while keeping the story spine intact. The timing model changed:

effectiveSceneDuration = max(referenceCut, voDuration + tailPadding)
Stitch builds per-scene audioSegment objects, then concatenates via concatAudio
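
The duration rule above is easy to check with numbers. This is a minimal sketch of the formula as stated; the 0.25-second default for `tailPaddingSec` is an assumed value chosen for the example, not one taken from the pipeline.

```typescript
// tailPaddingSec default is an assumption for illustration.
function effectiveSceneDuration(
  referenceCutSec: number,
  voDurationSec: number,
  tailPaddingSec = 0.25,
): number {
  return Math.max(referenceCutSec, voDurationSec + tailPaddingSec);
}

// VO fits inside the cut: the reference rhythm is kept (2 seconds).
const shortLine = effectiveSceneDuration(2.0, 1.5);
// VO overruns the cut: the scene stretches to 2.75 seconds.
const longLine = effectiveSceneDuration(2.0, 2.5);
```

The trade-off is visible in the second case: the scene no longer lands exactly on the reference cut, which is the price of keeping each line audible over its own visuals.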

This preserved narrative continuity in the script and planning layers while restoring natural audio-visual sync. The story spine ensures the words still connect across scenes; the per-scene segments ensure each cut sounds like it belongs to that cut.

Pacing auto-fit and live-testing bugs

Per-scene audio reintroduced a pacing constraint: narration must fit each scene's reference cut. The "Resolve with AI" flow handles this by rewriting over-budget VO lines to fit the available seconds, then re-measuring across up to three passes until every scene clears preflight.
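
The rewrite-and-re-measure loop can be sketched as follows. This is an illustration under stated assumptions, not the resolver's code: `autoFitLine` and `estimateSec` are hypothetical names, duration is estimated from word count at an assumed 150 words per minute instead of being measured from synthesized audio, and the rewriter is a synchronous stand-in for what would be an asynchronous model call.

```typescript
// Stand-in for the model call; a real rewriter would be asynchronous.
type Rewriter = (line: string, budgetSec: number) => string;

// Rough duration estimate at an assumed speaking rate.
function estimateSec(line: string, wordsPerMinute = 150): number {
  const words = line.trim().split(/\s+/).filter(Boolean).length;
  return (words * 60) / wordsPerMinute;
}

interface FitResult {
  line: string;
  passes: number;
  fits: boolean;
}

function autoFitLine(
  line: string,
  budgetSec: number,
  rewrite: Rewriter,
  maxPasses = 3,
): FitResult {
  let current = line;
  let passes = 0;
  // Rewrite, then re-measure, until the line fits or the passes run out.
  while (estimateSec(current) > budgetSec && passes < maxPasses) {
    current = rewrite(current, budgetSec);
    passes++;
  }
  return { line: current, passes, fits: estimateSec(current) <= budgetSec };
}

// Toy rewriter that drops the last word, used only to exercise the loop.
const dropLastWord: Rewriter = (line) =>
  line.trim().split(/\s+/).slice(0, -1).join(' ');

const fitResult = autoFitLine(
  'I grab my keys and head out the door fast',
  3.3,
  dropLastWord,
);
// fitResult.passes is 2 and fitResult.line is 'I grab my keys and head out the'
```

Capping the passes matters because a rewriter is not guaranteed to converge; when the cap is hit, `fits` stays false and the scene can be surfaced to the operator instead of looping.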

Live testing surfaced two bugs in the resolver itself. First, the critic model echoed QA warnings back as if they were source code — every proposed action was dropped because the parser treated warning text as invalid JSON fields. Second, long resolver responses hit the token limit and returned truncated JSON, causing 500 errors mid-fix. I increased the token budget and added salvage logic that extracts complete action objects from truncated responses rather than failing the entire resolve request.
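
Salvaging a truncated response amounts to scanning for objects that closed before the cut-off. The sketch below is one possible approach, not the pipeline's implementation: it assumes the actions arrive as the first JSON array in the response, and `salvageActions` is a hypothetical name. It tracks string and escape state so that braces inside string values are not mistaken for structure.

```typescript
// Hypothetical salvage helper: returns every complete object found directly
// inside the first JSON array of a possibly truncated response.
function salvageActions(text: string): unknown[] {
  const start = text.indexOf('[');
  if (start === -1) return [];

  const actions: unknown[] = [];
  let depth = 0; // brace depth inside the array
  let objStart = -1;
  let inString = false;
  let escaped = false;

  for (let i = start + 1; i < text.length; i++) {
    const ch = text.charAt(i);
    if (inString) {
      // Inside a string, only track escapes and the closing quote.
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') {
      inString = true;
    } else if (ch === '{') {
      if (depth === 0) objStart = i;
      depth++;
    } else if (ch === '}' && depth > 0) {
      depth--;
      if (depth === 0 && objStart !== -1) {
        try {
          actions.push(JSON.parse(text.slice(objStart, i + 1)));
        } catch {
          // Skip an object that is balanced but still not valid JSON.
        }
        objStart = -1;
      }
    } else if (ch === ']' && depth === 0) {
      break; // the array closed normally
    }
  }
  return actions;
}

const truncated =
  '{"actions":[{"scene":4,"text":"a } b"},{"scene":5,"text":"ok"},{"scene":6,"te';
const salvaged = salvageActions(truncated);
// Two complete actions are recovered; the cut-off third one is discarded.
```

A limitation of this sketch: a `[` inside a string that appears before the actions array would be picked as the start, so a production version would want to locate the array by key first.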

What I learned

Narrative continuity in AI video generation is a planning problem before it is a generation problem. Independent scene prompts will always produce montages, no matter how good the video model is. A story spine — protagonist, causality, time progression — gives the entire pipeline a shared context that survives from script through QA through stitch.

Audio timing is its own discipline. Continuous VO mimics how viral editors work in post-production, but automated pipelines need per-scene sync to feel natural. The right model depends on content type: montage can tolerate overlay narration; continuous story needs scene-matched segments with a coherent script underneath.

Advisory QA with auto-fix is more useful than hard blocks for narrative issues. Operators want to know when causality breaks, but they also want a one-click path to fix it without restarting the entire generation. The story spine makes those targeted rewrites possible — the resolver rewrites scene five knowing what scenes four and six establish.
