Five feet in three weeks: experiment five in the AI video series

This is the fifth experiment in an open series documenting how the Rover Planet videos get made. The first four were polite about it, laying the pipeline down piece by piece, each one adding a single new capability. This is the one where the pipeline finally has my actual voice in it.

The vertical Short version is at https://www.youtube.com/watch?v=oyad4XkgFts. The content is the same shape: a wet, windy garden update three weeks after the last corn check, with beans climbing five feet up the trellis in that window, and a tour through what’s happening in the cold frame and along the fence.

What was missing until now

The previous four experiments all relied entirely on an ElevenLabs cloned voice for narration. That voice is a good clone, better than I expected when I first set it up, but it isn’t me, in the specific sense that it can’t carry the surprise in my voice when I lift a cover off a corner of the garden and one of the beans is suddenly two feet higher than the last time I looked. The line in this video is “Look at this guy. Wow. So it’s… he’s like, it’s like almost four or five feet tall.” That’s my actual voice in the cut. The AI narration talks around it, but it doesn’t say that.

The reason it took five experiments to get there isn’t technical: I had user footage in experiment four too. The reason is that I muted all the source clips on that first user-footage round and only realised afterwards how much the absence flattened the result. So this round, the pipeline does it properly: the AI narration carries the structure, and at the moments where my voice carries something the narration can’t (the rainy-windy intro, the bean-corner surprise), we cut to the source audio, full volume, and let me be the one talking.

How the pipeline actually decides what to keep

I shot twelve clips on the iPhone this morning, totalling about five and a half minutes of footage, and dropped them in a shared Google Drive folder. The pipeline picked them up, fed each one to Google’s Gemini 2.5 Pro, and asked the same set of questions of each clip: what’s on screen, what’s being said, what are the visually interesting moments with timestamps, what’s the strongest three-to-seven-second segment.

Gemini’s analysis is what drives the cut. For the bean corner clip — sixty-eight seconds of vine inspection — it picked out timestamp seventeen-to-twenty-four as the strongest segment because that’s where I’m clearly reacting to the height of the plant, and the camera is clearly showing the plant at the same time. The pipeline used that. For the half-second selfie shot of just my face against a grey sky, Gemini flagged it: “a brief, static shot, too short to be cut down further.” The pipeline excluded it.
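The keep-or-drop step described above can be sketched roughly like this. It is a minimal sketch under stated assumptions, not the pipeline's actual code: the ClipAnalysis fields, the pick_segments helper and the one-second minimum are hypothetical names and thresholds chosen for illustration, and it assumes Gemini's answer has already been parsed into a start and end time per clip.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClipAnalysis:
    """Hypothetical shape for one clip's parsed analysis."""
    name: str
    duration_s: float
    best_start_s: Optional[float]  # start of the strongest segment, if any
    best_end_s: Optional[float]    # end of the strongest segment, if any
    note: str = ""


def pick_segments(clips, min_len_s=1.0, max_len_s=7.0):
    """Return (kept, skipped); kept holds (name, start, end) cut windows."""
    kept, skipped = [], []
    for clip in clips:
        if clip.best_start_s is None or clip.best_end_s is None:
            skipped.append(clip.name)
            continue
        # Keep the window inside the clip and no longer than max_len_s.
        start = max(0.0, clip.best_start_s)
        end = min(clip.duration_s, clip.best_end_s, start + max_len_s)
        if end - start < min_len_s:
            # Too short to be usable on its own (assumed threshold).
            skipped.append(clip.name)
            continue
        kept.append((clip.name, start, end))
    return kept, skipped


clips = [
    ClipAnalysis("bean_corner", 68.0, 17.0, 24.0, "reaction to plant height"),
    ClipAnalysis("selfie", 0.5, 0.0, 0.5, "brief, static shot"),
]
kept, skipped = pick_segments(clips)
```

With the two example clips, the bean-corner window from seventeen to twenty-four seconds is kept and the half-second selfie is skipped, which mirrors the outcome described above.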

That kind of judgement was the missing layer between “here’s some footage” and “here’s a finished cut.” In a traditional edit, an editor sits at a timeline and reviews every clip, makes those decisions, types the script around the chosen moments. The AI didn’t replace that work. It just made it possible to do it without a human at the timeline.

The cumulative ledger

One thing I started this round that I want to keep explicit: every experiment in this series carries forward every previously-proven capability and adds at least one new one. The ledger lives in the repo and is reviewed before each new video. By experiment five, the inherited stack includes:

ElevenLabs cloned narration in eleven_multilingual_v2 with tuned voice settings
Pillow-rendered brand cards with the Rover Planet palette
ffmpeg ken-burns motion and crossfade assembly
Whisper auto-captions, burned into the Short and (as of this experiment) soft-attached as a real caption track on the landscape
Custom Pillow-rendered YouTube thumbnails
Final Cut Pro .fcpxml exports for producer handover
fal.ai Kling AI B-roll, drawn from a small persistent library
fal.ai Stable Audio music bed, ducked under the narration with a sidechain compressor
Vertical screen-recording framing for the landscape (blurred fill behind the sharp foreground)
Google Gemini 2.5 Pro analysis of every source clip
User footage from Google Drive via per-video subfolders
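For the music-bed item in that list, ducking with a sidechain compressor can be sketched with ffmpeg's sidechaincompress filter. This is an illustration only: the ducking_command helper, the file names and every parameter value are assumptions, not the settings the pipeline uses.

```python
def ducking_command(narration, music, output,
                    threshold=0.05, ratio=8, attack_ms=20, release_ms=300):
    """Build an ffmpeg argv that turns `music` down while `narration` is loud."""
    graph = (
        # Split the narration: one copy keys the compressor, one goes to the mix.
        "[0:a]asplit=2[key][voice];"
        # Compress the music, using the narration copy as the sidechain signal.
        f"[1:a][key]sidechaincompress=threshold={threshold}:ratio={ratio}"
        f":attack={attack_ms}:release={release_ms}[ducked];"
        # Mix the untouched narration with the ducked music.
        "[voice][ducked]amix=inputs=2:duration=first[mix]"
    )
    return ["ffmpeg", "-y", "-i", narration, "-i", music,
            "-filter_complex", graph, "-map", "[mix]", output]


# Hypothetical file names; pass the list to subprocess.run to execute it.
cmd = ducking_command("narration.wav", "music_bed.wav", "mixed.wav")
```

Building the command as a list keeps it easy to log and to hand to subprocess.run without shell quoting problems.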

And the new layers this round:

Source-clip audio mixed alongside the synthetic narration — the missing voice problem, solved.
Soft-attached SRT caption tracks on the landscape — properly uploaded via the YouTube Data API, toggleable in the CC menu rather than burned in. The Shorts still get the burned-in word-grouped captions because that’s the format norm there.
A longer landscape runtime — about two and a quarter minutes vs the seventy-to-eighty seconds of the earlier four. Content earned it.
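The soft-attached caption track needs an SRT file before anything can be uploaded. A minimal sketch of that conversion, assuming segments shaped like the dictionaries the open-source Whisper package returns (with start, end and text keys); the function names are hypothetical and the upload call itself is left out:

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp that SRT uses."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Turn segments with 'start', 'end' and 'text' keys into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)


example = segments_to_srt([
    {"start": 0.0, "end": 1.25, "text": " Look at this guy. "},
    {"start": 1.25, "end": 2.0, "text": "Wow."},
])
```

The same segments can feed both outputs: this file for the landscape caption track, and the burned-in word-grouped captions for the Short.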

An open finding — where the AI tooling falls short

It’s worth being honest about a piece of this experiment that didn’t work. The Short’s bean-corner audio doesn’t sound clean. The pipeline made three attempts to fix it — first a different cut window (tightening to the seven seconds where the speech is densest), then a gentler ffmpeg filter chain (high-pass, low-pass, denoise, light dynamic normalisation), then a switch to a sibling source clip that Gemini’s analysis specifically described as “clear narration.” Each pass cleared all the automated checks: Whisper transcribed the audio cleanly each time, the ffmpeg filters all ran without errors, the burned-in captions matched what Whisper said had been said. By every automated quality criterion in the pipeline, the audio was fine.

Each pass also failed on first listen by a human. The audio still sounded muddy. Three different attempts, three identical human-ear failures, all on content that passed every automated check the pipeline ran.
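For reference, a filter chain of the kind tried in the second attempt (high-pass, low-pass, denoise, light dynamic normalisation) might look like the sketch below. The cutoff frequencies, the choice of afftdn for denoising and dynaudnorm for normalisation, and the helper names are all assumptions for illustration, not the pipeline's actual settings.

```python
def cleanup_filter(highpass_hz=80, lowpass_hz=8000):
    """Build an ffmpeg -af chain: rumble cut, hiss cut, denoise, normalise."""
    filters = [
        f"highpass=f={highpass_hz}",  # remove low-frequency wind rumble
        f"lowpass=f={lowpass_hz}",    # remove high-frequency hiss
        "afftdn",                     # FFT-based denoiser, default settings
        "dynaudnorm",                 # dynamic audio normaliser, default settings
    ]
    return ",".join(filters)


def cleanup_command(src, dst):
    """Return an ffmpeg argv that applies the chain to `src` and writes `dst`."""
    return ["ffmpeg", "-y", "-i", src, "-af", cleanup_filter(), dst]


# Hypothetical file names.
cmd = cleanup_command("bean_corner_raw.wav", "bean_corner_clean.wav")
```

A command like this exiting without errors only shows that the filters ran, which is exactly the kind of check that passed while the audio still sounded muddy.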

That gap, between “AI tools say this is fine” and “the listener can actually parse this”, is the most interesting thing this experiment surfaced. It maps to a category of failure I think other people building AI-assisted production are going to hit: the upstream tooling all reports success while the downstream artefact is broken in a way that only a human listening pass would catch. We shipped the Short anyway, with this note in the description, because hiding the failure would have been worse than naming it. The next experiment will add an explicit human-ear validation step before publishing: either a higher-confidence-threshold Whisper pass that flags low-quality segments for manual review, or simply a thirty-second “please actually listen to this” gate in the build script.
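The first of those two options could be sketched as below. It assumes segments shaped like the open-source Whisper package's output, which carries avg_logprob and no_speech_prob per segment; the flag_for_listening name and both thresholds are hypothetical. Since Whisper transcribed the muddy audio cleanly in all three attempts, a check like this may well miss the same failure, so it is a complement to a human listen rather than a substitute for one.

```python
def flag_for_listening(segments, min_avg_logprob=-0.5, max_no_speech_prob=0.3):
    """Return (start, end, text) for segments a human should listen to."""
    flagged = []
    for seg in segments:
        # Low average log-probability: the model was unsure of the words.
        low_confidence = seg.get("avg_logprob", 0.0) < min_avg_logprob
        # High no-speech probability: the model doubts this is speech at all.
        maybe_not_speech = seg.get("no_speech_prob", 0.0) > max_no_speech_prob
        if low_confidence or maybe_not_speech:
            flagged.append((seg["start"], seg["end"], seg["text"].strip()))
    return flagged


# Made-up example values, not measurements from this experiment.
segments = [
    {"start": 0.0, "end": 3.0, "text": " It's wet and windy.",
     "avg_logprob": -0.2, "no_speech_prob": 0.02},
    {"start": 3.0, "end": 7.0, "text": " Look at this guy.",
     "avg_logprob": -0.9, "no_speech_prob": 0.05},
]
to_review = flag_for_listening(segments)
```

Anything in to_review would then be played back to a person before the build is allowed to publish.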

If you watch the Short and the bean-corner section sounds garbled to you, you’re hearing exactly what I’m hearing. The point of this series is to document what works and what doesn’t. This one didn’t work yet.

The cost

API spend this round: well under £0.10. That covers ElevenLabs for the narration parts, Gemini analysis on the free tier (twelve clips totalling nine hundred megabytes, which Gemini handled in under three minutes), reuse of an existing music bed from the library, and no new fal.ai Kling clips.
Time from “files uploaded” to “both videos live”: about ninety minutes, most of which was an OAuth re-auth detour to add the youtube.force-ssl scope needed for the caption-track API. Once that was sorted, the build and upload took twenty minutes.

What this is becoming

The first experiment was a proof-of-concept: can you ship a competent landscape video from a Claude Code terminal at all? Yes. The next three added the things that turn “competent” into “you’d actually click on this” — captions, custom thumbnails, AI B-roll, a music bed, a longer-form structure. This one started using the substrate I’d been ignoring — my own footage, my own voice — and the resulting piece is the first one in the series where I think the AI is genuinely helping me make something I couldn’t have made alone in a comparable time, rather than just making something I could have made better with a real editor and a longer afternoon.

The series continues. The next round will use the same growing stack and add at least one more thing. If anything in this experiment is useful to your own work, the pipeline scripts, the cumulative ledger, the analysis JSON, and the cost numbers are all in the repo. The substance is still on each of us.