This is the third instalment of an open series on Rover Planet about how the channel’s videos actually get made. The first and second experiments tested code-driven cards, AI B-roll, captions, and a Final Cut Pro project export. This one tries something different: real source footage I shot myself, fed through an AI that watches the video, fed through a script the AI writes, fed through a voice that’s a clone of mine. It’s a chain. I wanted to see what came out the other end.
A vertical Short version is up at https://www.youtube.com/watch?v=hLWXHv2JIs0.
The premise
Too Many Tomatoes (on the App Store) is an iPhone app we built for tracking what’s happening in the garden. It’s one of a small family of Rover utility apps, alongside The Kitchen Sink, which has its own AI tracker for carbon impact. Too Many Tomatoes has an AI assistant inside it called Bubbles the Cat. You photograph a plant, it tells you what to do about it. This morning I did a corn check, photographed some seedlings, and asked Bubbles for advice.
Then I took the six iPhone clips of the corn check — five short videos plus one screen recording of the app itself — and put them in a shared Drive folder. The pipeline picked them up from there, fed each one to Google’s Gemini 2.5 Pro, and asked it to describe what it saw.
What Gemini gave back
For each clip, Gemini returned a tight JSON blob with the same fields:
- A short description of what’s on screen, specific enough to actually use ("A person lifts a gridded, transparent plastic cover from a garden bed to reveal a row of young corn seedlings…").
- A verbatim transcript of any speech in the clip. My own audible mutterings — "Alright. We’re out, we’re going to do a little bit of a corn check today" — came back clean.
- An ordered list of visually interesting timestamps with one-line labels. The pipeline used these to decide which seconds to actually cut to.
- A "best clip" recommendation — the strongest three-to-seven-second segment in each file — with a reason.
- Production notes: handheld shake, lighting issues, vertical-orientation flags, the half-second clip that was just my forehead by accident.
That last one mattered. Gemini saw the throwaway clip and flagged it as "a brief, low-angle close-up of the man’s face against the sky" with the note "the subject is very close to the lens, creating a slightly distorted perspective." It didn’t pretend the clip was useful; it just told the truth about it. The pipeline excluded it.
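For anyone who wants to picture the shape of that data, here is a minimal Python sketch of the per-clip record and the exclusion step. The key names (`description`, `transcript`, `highlights`, `best_clip`, `production_notes`) and the "drop anything whose best segment is under three seconds" rule are my assumptions for illustration, not the actual schema or logic the pipeline uses.

```python
import json
from dataclasses import dataclass, field


@dataclass
class BestClip:
    start: float  # seconds from the start of the file
    end: float
    reason: str

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class ClipAnalysis:
    filename: str
    description: str
    transcript: str
    highlights: list  # (timestamp in seconds, one-line label) pairs
    best_clip: BestClip
    production_notes: list = field(default_factory=list)


def parse_analysis(filename: str, raw_json: str) -> ClipAnalysis:
    """Turn one JSON blob into a typed record (key names are hypothetical)."""
    data = json.loads(raw_json)
    best = data["best_clip"]
    return ClipAnalysis(
        filename=filename,
        description=data["description"],
        transcript=data.get("transcript", ""),
        highlights=[(h["t"], h["label"]) for h in data.get("highlights", [])],
        best_clip=BestClip(best["start"], best["end"], best["reason"]),
        production_notes=data.get("production_notes", []),
    )


def usable_clips(analyses, min_seconds: float = 3.0):
    """Keep clips whose best segment is long enough to cut to.

    The three-second floor is an assumed rule, chosen because the
    recommended segments are three to seven seconds long.
    """
    return [a for a in analyses if a.best_clip.duration >= min_seconds]
```

Under that rule, a half-second accidental clip drops out on duration alone, before anyone has to read the production notes.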
The screen recording problem
One of the source clips was a 43-second screen recording of me using the Too Many Tomatoes app — opening it, creating a "Corn check" entry, photographing a seedling, then asking Bubbles for advice and waiting for the AI to respond with observations and gardening tips. That’s the demo moment that gives the meta-layer its substance.
The screen recording is shot in iPhone-vertical (1206×2622). The landscape video is 16:9. Showing a vertical clip in a landscape frame normally looks bad: either pillarboxed with black bars on both sides, or cropped to bits. The pipeline does what most professional editors do for the same problem: it scales the vertical content as a sharp foreground, scales a blurred copy of the same content to fill the 1920×1080 background behind it, and composites them. The result reads as "phone screen in a styled frame" without breaking the eye.
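The blurred-background composite can be sketched as a single ffmpeg filter graph built from Python. This is an illustration of the technique, not the pipeline’s actual command: the function names, the blur strength, and the output options are my own choices. The filters themselves (`split`, `scale`, `crop`, `boxblur`, `overlay`) are standard ffmpeg.

```python
def fitted_width(src_w: int, src_h: int, target_h: int) -> int:
    """Width of the foreground once scaled to target_h, rounded to an even number.

    The rounding here is my own; in the graph below ffmpeg picks an even
    width itself via scale=-2.
    """
    return round(src_w * target_h / src_h / 2) * 2


def blurred_background_cmd(src: str, dst: str, width: int = 1920,
                           height: int = 1080, blur_radius: int = 20) -> list:
    """Build an ffmpeg command that centres a vertical clip over a blurred copy of itself."""
    graph = (
        # Two copies of the video: one for the background, one for the foreground.
        "[0:v]split=2[bg][fg];"
        # Background: scale up until the frame is covered, crop to size, blur.
        f"[bg]scale={width}:{height}:force_original_aspect_ratio=increase,"
        f"crop={width}:{height},boxblur={blur_radius}:5[bgblur];"
        # Foreground: fit to the frame height, keep the aspect ratio.
        f"[fg]scale=-2:{height}[fgsharp];"
        # Centre the sharp foreground on the blurred background.
        "[bgblur][fgsharp]overlay=(W-w)/2:(H-h)/2[out]"
    )
    return [
        "ffmpeg", "-y", "-i", src,
        "-filter_complex", graph,
        "-map", "[out]", "-map", "0:a?", "-c:a", "copy",
        dst,
    ]
```

For the 1206×2622 recording, `fitted_width(1206, 2622, 1080)` comes out at 496, so the sharp foreground occupies roughly a quarter of the 1920-pixel frame width and the blurred copy fills the rest. Passing the returned list to `subprocess.run` would execute it, assuming ffmpeg is on the path.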
The music bed that didn’t ship last time, did this time
Two earlier experiments couldn’t add a music bed. The first try was ElevenLabs Music, which is gated behind a higher subscription tier than the one we’re on. The second try was fal.ai’s musicgen endpoint, which submitted fine and then sat in the queue for fifteen minutes before I killed it. The third try, on the same fal.ai account, was their Stable Audio endpoint — thirty seconds wall-clock, forty-seven seconds of ambient instrumental, about five pence. That’s the bed running underneath this video, ducked under the narration with an ffmpeg sidechain compressor so it pulls back about ten decibels whenever I’m speaking and returns to full level in the gaps.
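The ducking step looks roughly like this when the ffmpeg filter graph is built from Python. The threshold, ratio, attack, and release values are assumptions picked for illustration, not the settings used on this video. The small helper shows the textbook hard-knee compressor arithmetic behind "about ten decibels"; ffmpeg’s `sidechaincompress` has a soft knee by default, so the real figure will only be close to it.

```python
def ducking_graph(threshold: float = 0.05, ratio: float = 8,
                  attack_ms: float = 20, release_ms: float = 400) -> str:
    """Filter graph: input 0 is the narration, input 1 is the music bed."""
    return (
        # The narration is needed twice: once in the mix, once as the sidechain key.
        "[0:a]asplit=2[voice][key];"
        # Compress the music whenever the key (narration) exceeds the threshold.
        f"[1:a][key]sidechaincompress=threshold={threshold}:ratio={ratio}:"
        f"attack={attack_ms}:release={release_ms}[ducked];"
        # Mix narration and ducked music. amix rescales its inputs by default,
        # so overall levels may need adjusting afterwards.
        "[voice][ducked]amix=inputs=2:duration=longest[mix]"
    )


def duck_cmd(narration: str, music: str, dst: str) -> list:
    """Full ffmpeg command for the ducked mix."""
    return [
        "ffmpeg", "-y", "-i", narration, "-i", music,
        "-filter_complex", ducking_graph(),
        "-map", "[mix]", dst,
    ]


def gain_reduction_db(over_threshold_db: float, ratio: float) -> float:
    """Ideal hard-knee gain reduction when the key is this far above threshold."""
    return max(0.0, over_threshold_db) * (1 - 1 / ratio)
```

With a ratio of 8, narration sitting 12 dB above the threshold gives `gain_reduction_db(12, 8)`, which is 10.5 dB of ducking. When the narration falls below the threshold the reduction is zero, and the release time sets how quickly the bed comes back up.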
The cost
- API spend on this video: about £0.10. ElevenLabs for two narrations (landscape and Short), Gemini for analysis (within free tier), reuse of an existing fal.ai-generated music bed from the library, no new Kling clips needed.
- Time from "videos uploaded to Drive" to two videos live on YouTube: about ninety minutes, most of which was me debugging an ffmpeg filter graph and a duplicated text overlay on one of the Pillow cards.
- Runtime: landscape 1:10, Short 0:28.
What stayed honest
The video doesn’t pretend the AI did the corn check. I did. The voice is a clone of mine, but it’s narrating a script that describes my actual morning. The footage is unedited iPhone clips of the actual garden. Gemini doesn’t generate anything new — it just describes what I shot. Bubbles the Cat is a real feature of a real app I’m building. The only thing that’s notable about the production is the absence of a step that used to be required: someone sitting at an editing timeline and cutting.
The interesting question for me, at the end of this experiment, isn’t whether AI can stitch together a video about my corn check. It clearly can. The interesting question is whether the underlying observation from the Arena Flowers piece earlier in this series — that the brand surfacing through AI now is the brand that did the substance work earlier — applies recursively to small operations like Rover. If a one-person team can ship video on a topic at this cost, the constraint is no longer production. It’s having something worth saying.
What’s queued for the next experiment
The pieces still on the list:
- Remotion for proper code-driven motion graphics — on a video where the stat density justifies setting it up. The Pillow frame-sequence count-up from the hedgerow experiment works but doesn’t really scale to richer animations.
- A Google Veo 3.1 hero shot, once, as a quality-ceiling reference.
- A landscape video that runs three or four minutes rather than seventy seconds — on a topic that earns the longer runtime, not a padded version of a tight piece.
- Voice realism iteration two — if the tuned ElevenLabs settings still don’t sound enough like me, the next step is re-cloning with a longer, cleaner recording session.
If anything in this experiment is useful to your own work, the pipeline scripts are documented in the repo, the cost numbers should be reproducible, and the Gemini analysis JSON for the source clips is sitting on disk if you want to see what it looked like. The substance is still on each of us.