
Viral Feel Parity: Making AI-Generated Ads Feel Like the Original

By · 7 min read
Video Generation FFmpeg Audio Engineering AI TypeScript ElevenLabs

The video generation platform I work on takes a reference viral ad and produces a new version with swapped product copy, fresh voice-over, and AI-generated scenes. Technically the output was correct — scenes rendered, VO synthesized, timeline stitched. But the editor's feedback was blunt: "the avatar keeps changing, random boxes in the video, no bed sound, doesn't feel the same."

That last phrase — doesn't feel the same — is the whole problem. Viral ads work because of accumulated micro-decisions: one consistent face, ambient room tone under the narration, captions that sit naturally on screen, pacing that breathes between beats. Our pipeline was optimizing for structural fidelity to the blueprint while ignoring perceptual fidelity to the reference. Over roughly 2,800 lines across several PRs, I closed that gap. This post walks through eight concrete failure modes and the fixes that made generated ads feel like they belonged to the same family as the original.

1. Avatar consistency

The protagonist changed appearance between scenes. Scene one showed a woman with short dark hair; scene three had a different face entirely. Root cause: avatar selection was random per scene. Each scene generation call picked from the avatar pool independently, the same way we might pick a background variant — except viewers experience the protagonist as a continuous character, not a per-shot casting decision.

The fix mirrored how we already handled voice and product: pick once, propagate everywhere. The start route and every runner resolve site now pass a single avatarSlug through the job context. Scene generators read that slug instead of rolling dice.

interface JobContext {
  avatarSlug: string;
  voiceId: string;
  productSlug: string;
}

function resolveAvatarForJob(
  blueprint: VideoBlueprint,
  overrides?: Partial<JobContext>,
): string {
  if (overrides?.avatarSlug) return overrides.avatarSlug;
  if (blueprint.avatar?.defaultSlug) return blueprint.avatar.defaultSlug;
  return pickDefaultAvatar(blueprint.demographics);
}

// Start route + all runner resolve sites
const ctx: JobContext = {
  avatarSlug: resolveAvatarForJob(blueprint, req.body),
  voiceId: resolveVoice(blueprint, req.body),
  productSlug: resolveProduct(blueprint, req.body),
};

One slug, one face, every scene. Simple invariant, large perceptual payoff.
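
An invariant like this is cheap to check before stitching. The sketch below is illustrative only: `GeneratedScene` and `assertSingleAvatar` are hypothetical names, not part of the pipeline described above, and the check assumes each rendered scene records the avatar slug it was generated with.

```typescript
// Hypothetical shape: only the fields this check needs.
interface GeneratedScene {
  index: number;
  avatarSlug: string;
}

// Throws if any scene was rendered with a different avatar than the job's.
function assertSingleAvatar(
  expectedSlug: string,
  scenes: GeneratedScene[],
): void {
  const drifted = scenes.filter((s) => s.avatarSlug !== expectedSlug);
  if (drifted.length > 0) {
    const where = drifted
      .map((s) => `scene ${s.index} (${s.avatarSlug})`)
      .join(', ');
    throw new Error(`Avatar drift: expected "${expectedSlug}", got ${where}`);
  }
}
```

Failing the job at this point is one possible policy; logging and regenerating only the drifted scenes would work just as well.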

2. Caption rendering

Editors reported "random white boxes" floating over the video. The caption overlay tried to reproduce the reference's on-screen text by drawing rectangles wherever the analyzer detected text regions. When the analyzer flagged an overlay it could not describe — a stylized graphic, a motion-blurred lower-third — the renderer still drew the pill background with no text inside. Blank boxes.

I rewrote caption rendering to draw real text on a pill background, driven by blueprint.captions.style: caps vs sentence case, accent color, light-vs-dark pill, vertical position. Overlays the analyzer cannot describe now draw nothing, not even a placeholder rectangle. Emoji are stripped before render because our bundled font cannot render them reliably. And because our Railway containers ship with no system fonts, I bundled DejaVuSans-Bold.ttf into the asset pipeline so caption typography is deterministic in production.

interface CaptionStyle {
  uppercase: boolean;
  accentColor: string;
  pillVariant: 'light' | 'dark';
  position: 'top' | 'center' | 'bottom';
}

function renderCaptionOverlay(
  scene: SceneBlueprint,
  style: CaptionStyle,
): OverlayCommand[] {
  const text = scene.caption?.text;
  if (!text || scene.caption?.undescribable) return [];

  const sanitized = stripEmoji(text);
  const display = style.uppercase ? sanitized.toUpperCase() : sanitized;

  return [{
    type: 'text-pill',
    text: display,
    fontPath: bundledFont('DejaVuSans-Bold.ttf'),
    accentColor: style.accentColor,
    pillVariant: style.pillVariant,
    position: style.position,
  }];
}

Captions went from broken rectangles to readable, styled text that matched the reference's visual language.
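
The `stripEmoji` helper is referenced above but not shown. A minimal sketch of one way to write it, assuming a runtime with Unicode property escapes (Node 10+ or any modern browser), could look like this. It is not necessarily the implementation the pipeline uses.

```typescript
// Built with the RegExp constructor so the pattern is evaluated at runtime,
// independent of the TypeScript compile target.
// Note: \p{Emoji} is deliberately avoided because it also matches the
// digits 0-9, '#', and '*', which captions need to keep.
const EMOJI_PATTERN = new RegExp(
  '[\\p{Extended_Pictographic}\\p{Emoji_Modifier}\\p{Regional_Indicator}' +
    '\\u{FE0F}\\u{200D}\\u{20E3}]', // variation selector, ZWJ, keycap mark
  'gu',
);

function stripEmoji(text: string): string {
  return text
    .replace(EMOJI_PATTERN, '')
    .replace(/\s{2,}/g, ' ') // collapse gaps left behind by removed emoji
    .trim();
}
```

Collapsing whitespace afterwards matters for captions: a removed emoji in the middle of a line would otherwise leave a visible double space inside the pill.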

3. Audio bed and background sound

Output audio was clean VO only: technically pristine, perceptually sterile. The reference viral had ambient room tone, a subtle music bed, the sense of a real environment. Our mix sounded like someone had recorded the voice-over in a vacuum.

I added an ambience bed generated via the ElevenLabs Sound Effects API, themed from the blueprint's mood and setting descriptors. The bed mixes under the VO with sidechain ducking: the voice-over keys a compressor on the bed track, so speech dips the ambience and gaps between lines let it swell back. Tunable constants evolved through editor feedback — DEFAULT_BED_VOLUME went from 0.18 to 0.30 to 0.40, duck ratio softened from 8:1 to 4:1 to 3:1, threshold adjusted so ducking felt natural rather than pumping.

interface AudioBedProvider {
  generateBed(prompt: string, durationSec: number): Promise<Buffer>;
}

const DEFAULT_BED_VOLUME = 0.40;
const DUCK_RATIO = 3;
const DUCK_THRESHOLD_DB = -24;

async function mixVoWithBed(
  voTrack: AudioBuffer,
  bedProvider: AudioBedProvider,
  blueprint: VideoBlueprint,
  bedVolume = DEFAULT_BED_VOLUME,
): Promise<Buffer> {
  const bed = await bedProvider.generateBed(
    buildBedPrompt(blueprint.style, blueprint.setting),
    voTrack.durationSec,
  );

  return ffmpegMix([
    { input: voTrack, filter: 'anull' },
    {
      input: bed,
      // sidechaincompress (not acompressor) is the FFmpeg filter that accepts a key input
      filter: `volume=${bedVolume},sidechaincompress=threshold=${DUCK_THRESHOLD_DB}dB:ratio=${DUCK_RATIO}`,
      sidechainFrom: voTrack,
    },
  ]);
}

The AudioBedProvider interface keeps the ElevenLabs implementation swappable. Editors can also tune audioBedVolume (0..1) per job through the edit API without redeploying.
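
For readers who want to see the ducking as a raw FFmpeg graph, here is a rough sketch of a `-filter_complex` string built from the same constants. It assumes input 0 is the VO and input 1 is the bed; `buildDuckingFilterGraph`, `dbToLinear`, and `DuckOptions` are hypothetical names, and the real `ffmpegMix` helper may assemble its graph differently.

```typescript
// sidechaincompress takes its threshold as a linear amplitude,
// so convert from dB explicitly.
function dbToLinear(db: number): number {
  return Math.pow(10, db / 20);
}

interface DuckOptions {
  bedVolume: number; // per-job audioBedVolume, expected in 0..1
  thresholdDb: number;
  ratio: number;
}

function buildDuckingFilterGraph(opts: DuckOptions): string {
  // Clamp so an out-of-range per-job value cannot blow out the mix.
  const volume = Math.min(1, Math.max(0, opts.bedVolume));
  const threshold = dbToLinear(opts.thresholdDb).toFixed(4);
  return [
    // Split the VO: one copy goes to the final mix, one keys the compressor.
    '[0:a]asplit=2[vo][key]',
    `[1:a]volume=${volume}[bed]`,
    // The bed is compressed whenever the key (VO) exceeds the threshold.
    `[bed][key]sidechaincompress=threshold=${threshold}:ratio=${opts.ratio}[ducked]`,
    // duration=first ends the mix when the VO ends.
    '[vo][ducked]amix=inputs=2:duration=first[out]',
  ].join(';');
}
```

With the current defaults (0.40, -24 dB, 3:1) the threshold comes out to roughly 0.0631 linear. Keep in mind that `amix` rescales its inputs by default, so the audible bed level may not match the raw `volume` value one-to-one.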

4. Voice-over pacing

Early versions sped up the VO to fit the reference's scene timing. When synthesized speech ran longer than the reference clip, the pipeline applied atempo to compress it. The result was rushed, unnatural narration: technically on-beat, obviously wrong.

The rule is now absolute: never speed the VO. Instead, the stitcher sizes the timeline to max(reference total duration, VO length) and stretches the picture to fill. Audio is only ever padded — silence at the tail or scene boundaries — never time-compressed. I also reverted from a single continuous VO track back to per-scene audio segments. Continuous VO sounded like one flat overlay and killed the scene-to-scene rhythm the reference relied on.

function computeSceneDuration(
  referenceDurationSec: number,
  voDurationSec: number,
): number {
  // Picture stretches; audio never speeds up
  return Math.max(referenceDurationSec, voDurationSec);
}

function buildSceneAudio(segment: VoSegment): AudioFilterGraph {
  // targetSec is always >= the VO length, so padding is the only operation needed.
  // When the VO already fills the scene, padSilence has nothing to add. Never atempo.
  const targetSec = computeSceneDuration(segment.refDurationSec, segment.voDurationSec);
  return padSilence(segment.voBuffer, targetSec);
}
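
On the FFmpeg side, "stretch the picture, pad the audio" can be expressed with `setpts` and `apad`. The sketch below is an assumption about how that might look, not the stitcher's actual code; `buildStretchFilters` is a hypothetical helper, and it only ever slows the picture down (a clip longer than the target would need trimming, which is out of scope here).

```typescript
interface StretchFilters {
  video: string; // slows the picture so the clip lasts targetSec
  audio: string; // pads the VO with trailing silence up to targetSec
}

function buildStretchFilters(
  clipDurationSec: number,
  targetSec: number,
): StretchFilters {
  if (clipDurationSec <= 0) {
    throw new Error('clipDurationSec must be positive');
  }
  // A factor above 1 multiplies presentation timestamps, i.e. slow motion.
  // Clamped at 1 so the picture is never sped up either.
  const factor = Math.max(1, targetSec / clipDurationSec);
  return {
    video: `setpts=${factor.toFixed(4)}*PTS`,
    audio: `apad=whole_dur=${targetSec.toFixed(3)}`,
  };
}
```

For a 4-second generated clip that has to cover a 5-second VO, this yields `setpts=1.2500*PTS` for the picture and `apad=whole_dur=5.000` for the audio. Large factors will look like visible slow motion, so a cap on the stretch may be worth adding.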

5. Emotion-driven audio

Flat delivery was another subtle tell. The reference shifted energy scene by scene: urgent hook, calm product explanation, excited CTA. Our analyzer already extracted per-scene emotion into the blueprint; we just were not using it downstream.

Each scene's emotion now drives ElevenLabs audio tags and voice settings. An "excited" scene gets a lower stability value, which allows more variation in delivery, plus expressive tags; "calm" scenes get steadier settings. The mapping is declarative in the synthesis config rather than hard-coded per scene type.

const EMOTION_VOICE_SETTINGS: Record<SceneEmotion, VoiceSettings> = {
  excited: { stability: 0.35, similarityBoost: 0.75, style: 0.6 },
  calm:    { stability: 0.70, similarityBoost: 0.85, style: 0.2 },
  urgent:  { stability: 0.45, similarityBoost: 0.80, style: 0.5 },
};

function synthesizeSceneVo(
  scene: SceneBlueprint,
  voiceId: string,
): Promise<Buffer> {
  const settings = EMOTION_VOICE_SETTINGS[scene.emotion] ?? EMOTION_VOICE_SETTINGS.calm;
  return elevenLabsTts({
    text: scene.script,
    voiceId,
    voiceSettings: settings,
    audioTags: [`[${scene.emotion}]`],
  });
}

6. Transitions

Every scene joined with a hard cut. For references that used soft dissolves or morphs between beats — common in lifestyle and beauty virals — our output felt jarring. The blueprint already flagged joins as soft where the analyzer detected them; the stitcher ignored that flag and concatenated with concat.

Soft transitions now route through FFmpeg's xfade filter when the blueprint marks a join as soft. The feature is env-gated with SOFT_TRANSITIONS=1 — default off preserves the validated hard-concat path in production until a job explicitly opts in.

function buildTransitionFilter(
  leftClip: string,
  rightClip: string,
  join: SceneJoin,
): string {
  if (join.type === 'soft' && process.env.SOFT_TRANSITIONS === '1') {
    return `xfade=transition=fade:duration=${join.crossfadeSec}:offset=${join.offsetSec}`;
  }
  return 'concat';
}
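
The `offset` passed to `xfade` is measured from the start of the first input, so when several clips are chained each offset depends on the running length of everything already joined. A small sketch of that arithmetic, assuming every join is soft and uses the same crossfade length (`computeXfadeOffsets` is a hypothetical helper):

```typescript
// Returns one xfade offset per join for a chain of clips.
function computeXfadeOffsets(
  durationsSec: number[],
  crossfadeSec: number,
): number[] {
  const [first, ...rest] = durationsSec;
  if (first === undefined) return [];

  const offsets: number[] = [];
  let running = first; // length of the joined output so far
  for (const next of rest) {
    const offset = running - crossfadeSec;
    if (offset < 0) {
      throw new Error('crossfade is longer than the material before it');
    }
    offsets.push(offset);
    // The overlap is shared by both clips, so the output grows by
    // (next - crossfadeSec) rather than by the full clip length.
    running = offset + next;
  }
  return offsets;
}
```

For clips of 4, 5, and 3 seconds with a 0.5-second crossfade this gives offsets of 3.5 and 8, and a total of 11 seconds instead of 12. Each soft join shortens the timeline by the crossfade length, so the scene durations from the pacing step need that much extra picture if the VO is to stay untouched.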

7. Color grading

Scene-to-scene color drift was subtle but real — AI-generated clips from different prompts carried slightly different white balance and contrast. I added a light color-grade pass via FFmpeg's eq filter, targeting the tone described in blueprint.style (warm/cool, lifted shadows, muted saturation). It is not cinema-grade grading; it is enough to pull disparate clips toward a unified look so the final ad reads as one piece rather than a montage of unrelated generations.
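
As a rough illustration, a tone descriptor can be mapped to `eq` parameters like this. The `GradeStyle` shape and the numeric values are assumptions chosen for the example, not the values used in production; `brightness`, `saturation`, `gamma_r`, and `gamma_b` are standard options of FFmpeg's `eq` filter.

```typescript
type Tone = 'warm' | 'cool' | 'neutral';

// Hypothetical subset of blueprint.style relevant to grading.
interface GradeStyle {
  tone: Tone;
  liftedShadows: boolean;
  mutedSaturation: boolean;
}

function buildEqFilter(style: GradeStyle): string {
  // Warm pushes red up and blue down slightly; cool does the opposite.
  const gammaR = style.tone === 'warm' ? 1.05 : style.tone === 'cool' ? 0.97 : 1;
  const gammaB = style.tone === 'warm' ? 0.97 : style.tone === 'cool' ? 1.05 : 1;
  // A small brightness offset approximates lifted shadows.
  const brightness = style.liftedShadows ? 0.03 : 0;
  const saturation = style.mutedSaturation ? 0.85 : 1;
  return `eq=brightness=${brightness}:saturation=${saturation}:gamma_r=${gammaR}:gamma_b=${gammaB}`;
}
```

Applying the same filter string to every scene is what produces the unifying effect; the exact numbers matter less than their consistency across clips.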

8. Viral hook audio capture

The opening hook is where virals earn their name — a specific sound, a verbal tic, ambient texture that stops the scroll. We were regenerating everything from scratch and losing the reference's own hook audio entirely. The generated ad had new VO from frame zero; the reference had that opening gasp, music sting, or ambient clip.

The fix extracts and preserves the viral's hook audio segment from the reference file and splices it into the generated output for the hook window defined in the blueprint. New VO picks up after the preserved hook. The opening "feel" — the thing editors recognize instantly — survives the remix.

async function buildFinalAudio(
  referenceVideo: Buffer,
  generatedSegments: VoSegment[],
  blueprint: VideoBlueprint,
): Promise<Buffer> {
  const hookAudio = await extractAudioSlice(
    referenceVideo,
    blueprint.hook.startSec,
    blueprint.hook.endSec,
  );

  const bodyVo = stitchSegments(generatedSegments.filter(s => !s.isHook));
  return concatAudio([hookAudio, bodyVo]);
}
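
The `extractAudioSlice` call above hides the FFmpeg invocation. One plausible shape for it is sketched below as an argument builder; it assumes the reference is available as a file path rather than a Buffer, and `buildExtractAudioArgs` is a hypothetical name.

```typescript
// Builds ffmpeg CLI arguments that cut [startSec, endSec) of the
// reference's audio into an uncompressed WAV file.
function buildExtractAudioArgs(
  inputPath: string,
  startSec: number,
  endSec: number,
  outputPath: string,
): string[] {
  if (startSec < 0 || endSec <= startSec) {
    throw new Error('hook window must satisfy 0 <= startSec < endSec');
  }
  const durationSec = endSec - startSec;
  return [
    '-ss', startSec.toFixed(3), // seek on the input, before decoding
    '-t', durationSec.toFixed(3), // read only the hook window
    '-i', inputPath,
    '-vn', // drop the video stream
    '-c:a', 'pcm_s16le', // lossless intermediate for later mixing
    outputPath,
  ];
}
```

Writing PCM here avoids an extra lossy encode before the hook is concatenated with the generated VO. The sample rate and channel layout of the slice still need to match the VO segments before `concatAudio` runs.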

What changed in the edit review

After these changes shipped, the same editor who flagged avatar drift and blank boxes signed off without requesting a revision pass on feel. The fixes are not glamorous — font bundling, sidechain ratios, refusing to call atempo — but they address what viewers actually notice. Structural parity (correct scenes, correct script) is necessary. Viral feel parity (one face, real captions, ambient bed, natural pacing, preserved hook) is what makes someone believe the ad belongs in the same feed as the original.

"Doesn't feel the same" is a systems problem. Every perceptual mismatch — avatar drift, blank overlays, sterile mix, sped-up VO — is a bug as real as a crash.

If you are building AI video pipelines, measure success by editor reaction and scroll-stop rate, not just render success. The last 10% of feel is where the other 90% of the engineering lives.
