Google Veo 3.1
Availability
Google’s new video generation model Veo 3.1 is now available, marked as “Preview” in Google Vertex AI Studio. Google AI Studio still only offers Veo 2. Third parties LetzAI and RunwayML offer Veo 3.1, but don’t include the “Preview” label in their model selectors - it remains unclear whether they serve the same or a different model. In my single test on LetzAI, the result - lip-sync from text on a LetzAI-generated photorealistic image - looked inferior to what I got via Vertex AI Studio.
(Because each run gives different results, this may simply have been bad luck.)
Restrictions and Limitations
Audio
One of the special features of the Veo 3 series is that, like OpenAI’s Sora 2, it can optionally generate an audio track matching the video. Unlike Sora 2 in the OpenAI API, Veo 3.1 allows human faces as inputs. This lets users emulate RunwayML’s “lip-sync” feature, where a static input image narrates an input text. (Lip-sync is cheaper, though: 5¢ per second vs. 40¢ per second for Veo via Vertex AI.) In addition to the voice, Veo may add background sounds: a (matching) synthetic chirping to the original Aileen 2 starter image and an ambient music track to the new LetzAI-generated image. These background sounds were included without additional prompting, but appeared in only 1 of 4 generation results.
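For reference, a minimal sketch of such an image-to-video generation with the google-genai Python SDK. The model ID veo-3.1-generate-preview is an assumption based on the preview naming, and the prompt and file names are illustrative:

```python
import time
from google import genai
from google.genai import types

client = genai.Client()  # assumes GOOGLE_API_KEY is set in the environment

# Hypothetical example: animate a static portrait and have it speak a line.
with open("portrait.png", "rb") as f:
    portrait = types.Image(image_bytes=f.read(), mime_type="image/png")

operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",  # assumed preview model ID
    prompt='The person looks into the camera and says: "Welcome to the demo."',
    image=portrait,
)

# Video generation is a long-running operation; poll until it completes.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("lipsync.mp4")
```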
Single generations are limited to 8 seconds. Because there is no voice (or background) consistency across generations (even with a fixed seed parameter; unlike Runway, which lets you choose from preset voices), using Veo 3.1 for narration is currently of limited use: when stitching together several segments, the voice will be different in each segment and background sounds will cut off. This is also true for the Video Extension feature.
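To illustrate, a sketch of what a multi-segment narration attempt looks like - the seed field of GenerateVideosConfig is an assumption, and, per the above, it does not make the voices match across segments:

```python
import time
from google import genai
from google.genai import types

client = genai.Client()

# Two 8-second narration segments with a fixed seed, hoping for a consistent
# narrator voice. In practice the voice (and any background sound) still
# differs between the two results.
segments = []
for text in ["First half of the narration.", "Second half of the narration."]:
    op = client.models.generate_videos(
        model="veo-3.1-generate-preview",  # assumed preview model ID
        prompt=f'A person looks into the camera and says: "{text}"',
        config=types.GenerateVideosConfig(seed=42),  # assumed field; does not pin the voice
    )
    while not op.done:
        time.sleep(10)
        op = client.operations.get(op)
    segments.append(op.response.generated_videos[0])
```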
Video Extension
When a Veo 3.1 generation is done in Vertex AI Studio, the user may choose from several follow-up actions - including adding a soundtrack using Lyria 2 (“Use this video to inspire a new music composition”) and extending the video. However, the continuation is done not with the Veo 3 series but with Veo 2 - which doesn’t support audio generation. This means that trying to continue a lip-sync video will result in silent talking.
There is limited support in the API, however. A core restriction is that only videos generated by Veo may be extended. This is enforced by only accepting source videos that were the output of a prior generation on the Gemini API. Referencing a Veo-generated video that the Vertex AI API saved to GCS results in:
ClientError: 400 INVALID_ARGUMENT. {'error': {'code': 400, 'message': 'Expected a file URI in the form of https://generativelanguage.googleapis.com/{API_VERSION}/files/{FILE_ID}:download?alt=media, but got https://storage.cloud.google.com/cloud-ai-platform-.../.../sample_3.mp4', 'status': 'INVALID_ARGUMENT'}}
The Getting Started Colab notebook currently describes Video Extension as “WIP” (presumably: work in progress). The current happy path, based on this notebook, is to “import” an image by using it as an input to gemini-2.5-flash-image (aka “nano-banana”) and asking for a slight change, then use the nano-banana output for the initial video segment, and then use that segment for the extension call. (Length extension via the Vertex AI API is only available with Veo 3.0, according to this documentation.)
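Sketched with the google-genai SDK, this happy path could look roughly as follows - not verified end to end; the model IDs and the video= extension parameter follow the notebook, while the prompts and file names are hypothetical:

```python
import time
from google import genai
from google.genai import types

client = genai.Client()

def wait(op):
    # Poll the long-running video generation operation until it finishes.
    while not op.done:
        time.sleep(10)
        op = client.operations.get(op)
    return op

# Step 1: "import" the image by round-tripping it through nano-banana with a
# request for a slight change.
with open("source.png", "rb") as f:
    source_bytes = f.read()
edit = client.models.generate_content(
    model="gemini-2.5-flash-image",  # aka nano-banana
    contents=[
        types.Part.from_bytes(data=source_bytes, mime_type="image/png"),
        "Reproduce this image with a very slight change in lighting.",
    ],
)
edited = next(p.inline_data.data for p in edit.candidates[0].content.parts if p.inline_data)

# Step 2: generate the initial segment from the imported image.
op = wait(client.models.generate_videos(
    model="veo-3.1-generate-preview",  # assumed preview model ID
    prompt="The person starts to explain the plan.",
    image=types.Image(image_bytes=edited, mime_type="image/png"),
))
first_segment = op.response.generated_videos[0].video

# Step 3: extend that segment. Only videos from a prior Gemini API generation
# are accepted here (see the GCS error above).
op = wait(client.models.generate_videos(
    model="veo-3.1-generate-preview",
    prompt="The person continues speaking.",
    video=first_segment,
))
client.files.download(file=op.response.generated_videos[0].video)
op.response.generated_videos[0].video.save("extended.mp4")
```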
Having Lyria 2 compose a soundtrack based on the video - and thereby perhaps at least continuing the ambient background music? - doesn’t work either in Vertex AI Studio: choosing that option simply does nothing.
Conclusion
The Veo 3.1 release continues the unfortunate tradition of product launches that come with a number of limitations - like Sora 2 in the API rejecting human faces - but don’t disclose them prominently. As with open-weights LLMs, consuming models through a first-party offering may result in better quality.
[Update] Consistency techniques
Steffin Flickinger, who won the Fal VEO 3.1 competition, shares his techniques on Reddit:
“for consistent characters I used nano banana within Gemini and some patient manual photo editing. For the audio, I usually get a similar voice but if not, eleven labs has a voice changer to make them all the same” (link)
“I use nano banana and always start frames, often end frames too.” (link)
“I actually just used the voices Veo gave me. Sometimes taking the video from one take and the audio from another” (link)
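The last trick - keeping the video track from one take and the audio track from another - can be done with a plain ffmpeg remux, e.g. driven from Python (file names are placeholders):

```python
import subprocess

# Mux the video track of take A with the audio track of take B,
# stream-copying both so nothing is re-encoded.
subprocess.run([
    "ffmpeg", "-y",
    "-i", "take_a.mp4",   # supplies the video track
    "-i", "take_b.mp4",   # supplies the audio track
    "-map", "0:v:0",      # video from the first input
    "-map", "1:a:0",      # audio from the second input
    "-c", "copy",         # copy streams, no re-encode
    "combined.mp4",
], check=True)
```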

