Captions, thumbnails, and a Final Cut Pro export: week one of the Rover Planet AI video experiment

Two new videos went up on the Rover Planet YouTube channel today — a longer landscape piece and a vertical Short — both built around the same idea: the difference between preventing carbon emissions and actually pulling carbon back out of the atmosphere. The source material is one of our earlier posts, Tracking Your Carbon Impact: How Small Actions Add Up.

Vertical Short version (0:32) · The Kitchen Sink on the App Store

The post above explains the substance. This one is about the process — what changed in the production pipeline since the previous video, what worked, what surprised us, and what’s queued next. It’s the first instalment of a small series I’m starting on Rover Planet about how AI tooling is shifting what a single person can produce.

Where the pipeline stood a week ago

The previous Rover Planet video, the Arena Flowers piece, established a working baseline. Narration via my own cloned voice on ElevenLabs. Cards rendered with Python and the Pillow imaging library. Ken Burns motion and final assembly with ffmpeg. Upload via the YouTube Data API. Everything driven from a single terminal session. Per-video cost: under five pence.

That pipeline ships videos, but it doesn’t yet ship good videos. It was missing four obvious things: captions, a music bed, a custom thumbnail, and any AI-generated B-roll. All four are commonly credited with lifting audience retention by channels that track it. The plan for week one was to close as many of those four gaps as possible without adding tools we don’t need.

What got added this round

Captions. I’d assumed this was the easy one. Generate the narration audio, run a speech-to-text model on it to produce an SRT file, hand the SRT to ffmpeg’s subtitle filter to burn it into the video. Three commands. Done.

The transcription part went fine: faster-whisper on the "small" model transcribed each narration file in a few seconds on CPU. With word-level timestamps it was straightforward to regroup the text into three- and four-word phrases that read well on a vertical Short. So far so good.
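The regrouping step can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the `Word` class and the function names are hypothetical, and the grouping is a plain fixed-size chunk of up to four words, which can leave a short final phrase.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word with its timing (hypothetical container)."""
    text: str
    start: float  # seconds
    end: float    # seconds


def group_words(words, max_words=4):
    """Chunk timed words into phrases of at most `max_words` words each."""
    phrases = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w.text.strip() for w in chunk)
        # A phrase spans from its first word's start to its last word's end
        phrases.append((chunk[0].start, chunk[-1].end, text))
    return phrases


def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(phrases):
    """Render (start, end, text) phrases as the text of an SRT file."""
    blocks = []
    for n, (start, end, text) in enumerate(phrases, start=1):
        blocks.append(
            f"{n}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

With faster-whisper, the `Word` objects would be filled from the `word`, `start` and `end` fields of each segment's `words` list when `transcribe()` is called with `word_timestamps=True`.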

Then ffmpeg’s subtitles filter refused to load. So did drawtext. It turns out the ffmpeg build I had installed via Homebrew was compiled without libass (used by subtitles) and without libfreetype (used by drawtext); neither was enabled in the formula I had. I could have rebuilt ffmpeg from source with the right flags. That isn’t hard, but it’s a rabbit hole I didn’t want to start down at 2 pm on a Friday.
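A quick way to catch this before a render fails is to ask the installed binary which filters it actually has. The sketch below parses the text printed by `ffmpeg -filters`; the function names are hypothetical, and the parser assumes the usual column layout of that listing (flags, filter name, then an input->output signature).

```python
import subprocess


def missing_filters(filters_output, wanted):
    """Return the names in `wanted` that are absent from `ffmpeg -filters` output."""
    available = set()
    for line in filters_output.splitlines():
        parts = line.split()
        # Filter rows look like: " T.C overlay  VV->V  Overlay a video source ..."
        # Legend rows such as "T.. = Timeline support" have no "->" in column 3.
        if len(parts) >= 3 and "->" in parts[2]:
            available.add(parts[1])
    return [name for name in wanted if name not in available]


def check_local_ffmpeg(wanted=("subtitles", "drawtext", "overlay")):
    """Run the local ffmpeg and report which of `wanted` it lacks."""
    result = subprocess.run(
        ["ffmpeg", "-hide_banner", "-filters"],
        capture_output=True, text=True, check=True,
    )
    return missing_filters(result.stdout, wanted)
```

An empty list from `check_local_ffmpeg()` means all three filters are present; `ffmpeg -buildconf` shows the configure flags if you want to see whether libass or libfreetype were compiled in.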

The workaround took about twenty minutes: render each caption phrase as a transparent PNG using Pillow, with a translucent black box behind white bold text, then use ffmpeg’s overlay filter (which is in the default build) to composite each PNG onto the video timed by an enable=between(t,start,end) expression. Roughly the same visual result, no rebuild required, and the Python that generates the PNGs is reusable. The Short now has burned-in captions exactly where Shorts-style content expects them.
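The compositing half of that workaround amounts to building one chained overlay per caption PNG. A minimal sketch, assuming the base video is ffmpeg input 0 and caption PNG number i is input i; the function names, the `[v1]`, `[v2]` labels and the 200-pixel bottom margin are illustrative choices, not the values used in the real pipeline.

```python
def build_overlay_graph(captions, y_expr="H-h-200"):
    """
    Build a filter_complex string that overlays one PNG per caption.

    captions: list of (start_seconds, end_seconds), one entry per PNG input.
    Returns (graph, final_label); final_label is the stream to pass to -map.
    """
    if not captions:
        raise ValueError("need at least one caption")
    steps = []
    prev = "[0:v]"
    for i, (start, end) in enumerate(captions, start=1):
        out = f"[v{i}]"
        # Centre horizontally; show the PNG only while t is inside [start, end]
        steps.append(
            f"{prev}[{i}:v]overlay=x=(W-w)/2:y={y_expr}:"
            f"enable='between(t,{start:.3f},{end:.3f})'{out}"
        )
        prev = out
    return ";".join(steps), prev


def build_command(video, pngs, captions, output):
    """Assemble the ffmpeg argument list for the caption burn-in."""
    graph, last = build_overlay_graph(captions)
    cmd = ["ffmpeg", "-y", "-i", video]
    for png in pngs:
        cmd += ["-i", png]
    # "0:a?" keeps the original audio if the video has any
    cmd += ["-filter_complex", graph, "-map", last, "-map", "0:a?",
            "-c:a", "copy", output]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. Each PNG is a single frame, which overlay keeps repeating by default, so the `enable` window alone decides when it is visible.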

Custom thumbnail. The Arena Flowers video got YouTube’s auto-thumbnail: usually a still frame from the first second, usually not great. For the Carbon Impact video I rendered a branded 1280×720 thumbnail with Pillow: a stock image of hands holding a sapling, "Sequestration vs Prevention" in large type, a small green tag showing "20 kg / tree / yr". It is set on YouTube via the thumbnails().set() API call. Two minutes of code for what should be a large impact on click-through.

Final Cut Pro export. This was the most interesting addition philosophically. One of the principles of how we want this channel to work is that the AI-driven pipeline shouldn’t be a one-way ratchet that locks out human editors. The same project should be openable in the tools our team already uses — Final Cut Pro and CapCut.

Final Cut Pro can’t be controlled headlessly, but it does consume a public XML interchange format (.fcpxml). Python emits an .fcpxml referencing the same image assets and narration audio that the automated pipeline used; you double-click the file and Final Cut opens with all seven cards laid out on the timeline, narration on a separate audio lane, captions referenced as an importable SRT. A human editor can then refine timing, add transitions, drop in stock footage, change copy — whatever the moment calls for. The code path and the human path stay in sync.
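The shape of such an export can be sketched with the standard library. Treat this strictly as a structural illustration: the element and attribute names reflect my understanding of FCPXML 1.9, the format names, event and project names, and fixed five-second card length are assumptions, and the output has not been validated against Apple's DTD or opened in Final Cut. The SRT reference is left out.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def file_uri(path):
    """Absolute file:// URI for a local path (the file need not exist yet)."""
    return Path(path).resolve().as_uri()


def build_fcpxml(cards, narration, narration_seconds, card_seconds=5):
    """Lay still-image cards end to end, with narration as a connected clip.

    Durations are whole seconds here to keep the time values simple.
    """
    root = ET.Element("fcpxml", version="1.9")
    resources = ET.SubElement(root, "resources")
    # r1: timeline format; r2: format for stills (both names are assumptions)
    ET.SubElement(resources, "format", id="r1", name="FFVideoFormat1080p25",
                  frameDuration="1/25s", width="1920", height="1080")
    ET.SubElement(resources, "format", id="r2", name="FFVideoFormatRateUndefined",
                  width="1920", height="1080")

    card_ids = []
    for n, card in enumerate(cards, start=3):
        asset = ET.SubElement(resources, "asset", id=f"r{n}", name=Path(card).stem,
                              start="0s", duration="0s", hasVideo="1", format="r2")
        ET.SubElement(asset, "media-rep", kind="original-media", src=file_uri(card))
        card_ids.append(f"r{n}")

    audio_id = f"r{len(cards) + 3}"
    audio = ET.SubElement(resources, "asset", id=audio_id, name=Path(narration).stem,
                          start="0s", duration=f"{narration_seconds}s", hasAudio="1")
    ET.SubElement(audio, "media-rep", kind="original-media", src=file_uri(narration))

    total = len(cards) * card_seconds
    library = ET.SubElement(root, "library")
    event = ET.SubElement(library, "event", name="Pipeline export")
    project = ET.SubElement(event, "project", name="Generated timeline")
    sequence = ET.SubElement(project, "sequence", format="r1",
                             duration=f"{total}s", tcStart="0s")
    spine = ET.SubElement(sequence, "spine")

    for i, ref in enumerate(card_ids):
        clip = ET.SubElement(spine, "video", ref=ref, offset=f"{i * card_seconds}s",
                             start="0s", duration=f"{card_seconds}s")
        if i == 0:
            # Narration hangs off the first card on lane -1, below the storyline
            ET.SubElement(clip, "asset-clip", ref=audio_id, lane="-1",
                          offset="0s", duration=f"{narration_seconds}s")

    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE fcpxml>\n' + body
```

Writing the returned string to a file ending in .fcpxml is the whole export step; anything Final Cut rejects on import is a cue to compare the output with a project exported from Final Cut itself.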

For CapCut, there’s no equivalent project-import format. The handover there is an asset pack: numbered MP4 clips, narration WAV, SRT — the producer rebuilds the timeline in CapCut from those pieces. Less seamless, but explicit and unambiguous.

What got deferred

Music bed. YouTube Audio Library has no API, Pixabay’s music collection requires interactive consent, and I didn’t want to commit to an auto-download path that might pick a track without listening to it. Deferring music to the next video also gives us a cleaner A/B comparison: same script, same visual style, with-music vs. without-music. We’ll measure retention.

AI B-roll clips. We set up fal.ai (an aggregator API that exposes Kling, Runway, Pika and others behind a single Python client) and queued prompts for short photorealistic clips: a slow drone shot through a forest canopy, hands placing kitchen scraps into a compost bin, a sapling being planted in soil. Then the API returned "User is locked. Exhausted balance." — fal.ai accounts ship with zero credit by default. Adding $5-10 was a sub-minute task but it happened after the video’s other steps were already done. The Kling clips will land in the next video.

The numbers

Total per-video cost: under one penny, comfortably inside the previous video’s five-pence ceiling and dominated by ElevenLabs at ~£0.005 of speech credit. The new additions (Whisper, Pillow, Final Cut export) are all free.
Time from blank page to two finished videos live on YouTube: about three and a half hours, including the libass detour.
Runtime: 1:11 landscape, 0:32 Short. Both well within the formats’ sweet spots.
Total code in this repo for the pipeline: roughly 800 lines of Python across six files. All committed.

What’s queued for the next video

The plan, ordered by likely signal:

AI B-roll clips via fal.ai (Kling first, then a Runway comparison). The hypothesis being tested: do a few short photorealistic AI clips meaningfully improve perceived production value, given that we’re already covering the visual layer with stills + Ken Burns motion?
A music bed with proper sidechain ducking under the narration. This is the addition I expect to make the biggest perceived-quality jump.
Voice realism: the current ElevenLabs setup uses an Instant Voice Clone with default settings; switching to a higher-quality model (or re-cloning with better source recordings) is a week-two priority.
A Remotion spike for code-driven animated data graphics. The Pillow stat cards work but they’re static; a count-up animation on numbers like "100…200 kg of CO₂" would land much harder visually.
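For the music-bed item, one plausible route is ffmpeg's sidechaincompress filter, with the narration acting as the key signal that pushes the music down. This is a sketch of an approach that has not been tried in this pipeline yet; the function names and every parameter value (threshold, ratio, attack, release, music gain) are guesses to be tuned by ear.

```python
def ducking_graph(threshold=0.05, ratio=8, attack_ms=20, release_ms=400,
                  music_gain=0.6):
    """
    Build a filter_complex string for narration-keyed ducking.

    Input 0 is assumed to be the narration, input 1 the music track.
    """
    return (
        # The narration is needed twice: once in the mix, once as the key
        "[0:a]asplit=2[voice][key];"
        f"[1:a]volume={music_gain}[bed];"
        # Compress the music whenever the key (narration) exceeds the threshold
        f"[bed][key]sidechaincompress=threshold={threshold}:ratio={ratio}:"
        f"attack={attack_ms}:release={release_ms}[ducked];"
        # Mix ends when the narration (first input) ends
        "[voice][ducked]amix=inputs=2:duration=first[mix]"
    )


def ducking_command(narration, music, output):
    """Assemble the ffmpeg argument list for the ducked mix."""
    return ["ffmpeg", "-y", "-i", narration, "-i", music,
            "-filter_complex", ducking_graph(), "-map", "[mix]", output]
```

Note that amix scales its inputs down by default, so the overall level may need a makeup gain afterwards.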

The bigger question

The reason for documenting this in public is that I think the cost curve for shipping useful video is collapsing faster than most of us are tracking. A one-person team with the right tooling can now produce videos that would have needed a four-person crew five years ago, at a marginal cost of pennies. That doesn’t make the result automatically good — the script still has to work, the topic still has to matter, the editing still has to respect the viewer’s time — but it does mean the constraint has moved.

The question I’m most curious about isn’t whether AI tooling can make video. That’s been answered. The question is what happens to the format itself when production cost stops being the limiting factor for small operators. We’ll see what the data says video by video. The full back-and-forth, including which experiments worked and which didn’t, will keep getting documented here.

Next one drops with the music bed and the AI clips. If anything in this post is useful to your own work, let me know — some of the things I’m figuring out cheaply might save other people a half-day of detours.