Prompting Veo: Shot Descriptions That Actually Land
Veo rewards structure. Name the camera move first, the subject second, the light last. Here is the five-slot template that can cut rerender cost roughly in half.
Veo does not respond to vibes. It responds to structure. The prompts that land on the first take share a specific shape, and once you see it you cannot unsee it. This is a working template you can apply today against Veo 3.1, priced here at $0.40 per second at 1080p. It should carry over to Veo 4 when the endpoint ships, since the parameter names are expected to stay the same.
The template has five slots in this order: camera, subject, action, environment, light. Name them in that order in your prompt, and the model builds the shot in the same order internally. Name them out of order and you get averaged mush.
Here is the shape. Camera: what the lens is doing. Subject: who or what is in frame. Action: what they are doing. Environment: where and when. Light: quality, direction, color temperature. That is it. No adjectives hunting for a noun. Every word earns its place.
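The slot order can be enforced in code so that a prompt is always assembled the same way, no matter what order you fill the fields in. This is a minimal sketch: `ShotSlots` and `buildShotPrompt` are hypothetical names made up for illustration, not part of Veo or the fal client, and the harbor shot is an invented example.

```typescript
// Hypothetical helper for illustration; not part of any Veo or fal API.
interface ShotSlots {
  camera: string;      // what the lens is doing
  subject: string;     // who or what is in frame
  action: string;      // what they are doing
  environment: string; // where and when
  light: string;       // quality, direction, color temperature
}

// The fixed order from the template: camera, subject, action, environment, light.
const SLOT_ORDER: (keyof ShotSlots)[] = [
  "camera",
  "subject",
  "action",
  "environment",
  "light",
];

function buildShotPrompt(slots: ShotSlots): string {
  const parts = SLOT_ORDER.map((key) => {
    const value = slots[key].trim();
    if (value.length === 0) {
      // An empty slot means the model has to guess, so fail early instead.
      throw new Error(`Missing slot: ${key}`);
    }
    return value;
  });
  return parts.join(", ");
}

// Fields are deliberately written out of order; the output order is still fixed.
const harborPrompt = buildShotPrompt({
  light: "low cool sidelight from the left",
  environment: "small harbor at dawn with mist on the water",
  camera: "static wide shot",
  action: "rocking at anchor",
  subject: "a red fishing boat",
});
```

Here `harborPrompt` reads camera first and light last, even though the object literal lists the light first.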

A worked example. Bad prompt first: "beautiful cinematic shot of a woman walking through a forest, golden hour, stunning light." Veo reads that as four competing directives with no hierarchy. The woman drifts. The forest warps. The light is either too much or not enough. You rerender three times and give up.
Now the structured version: "slow tracking shot from behind, a woman in a wool coat, walking between tall pine trunks, late autumn forest with fog in the middle distance, low sun backlighting the fog into soft gold." About twice the words. Completely different result. The camera move is pinned down. The subject is specific. The action is a single verb. The environment has two details, not ten. The light has direction and color, not a feeling.
Here is the call. Swap in your own five-slot prompt and run it.
```typescript
import { fal } from "@fal-ai/client";

// or fal-ai/veo4/text-to-video once available
const result = await fal.subscribe("fal-ai/veo3.1/text-to-video", {
  input: {
    prompt: "slow tracking shot from behind, a woman in a wool coat, walking between tall pine trunks, late autumn forest with fog in the middle distance, low sun backlighting the fog into soft gold",
    aspect_ratio: "16:9",
    duration: "8s",
    resolution: "1080p",
    generate_audio: false
  },
  logs: true
});

console.log(result.data.video.url);
```
A few rules to keep the slots tight.
One verb per slot. "Walking, looking around, brushing her hair back" is three shots, not one. Pick one. If you want the others, shoot them as separate clips and cut them together. With an 8-second maximum clip length, Veo is not a scene engine. It is a shot engine.
Two adjectives per noun, tops. "A weathered red metal door" is fine. "A beautifully ornate deeply red weathered old rusted metal door with carved details" is a coin flip on what you get back. Pick the two most load-bearing adjectives and drop the rest.
Specify the lens or do not. If you say "35mm lens" or "wide angle," the model will roughly honor it. If you leave it out, the model picks one based on the other slots. What you should not do is give conflicting signals, like "close-up" and "wide shot" in the same prompt. Pick a framing and commit.
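Two of these rules are mechanical enough to check before you spend money on a render. The sketch below is a rough heuristic, not a parser: `lintShot`, `countActions`, and the framing term list are hypothetical names and assumptions made up for illustration. It does not try to count adjectives, which would need real part-of-speech tagging.

```typescript
// Framing terms treated as mutually exclusive. The list is an assumption; extend it as needed.
const FRAMINGS: { name: string; terms: string[] }[] = [
  { name: "close-up", terms: ["close up", "close-up", "closeup"] },
  { name: "medium", terms: ["medium shot"] },
  { name: "wide", terms: ["wide shot"] },
];

// Returns the name of every framing whose terms appear anywhere in the prompt.
function findFramings(prompt: string): string[] {
  const text = prompt.toLowerCase();
  return FRAMINGS
    .filter((f) => f.terms.some((term) => text.indexOf(term) !== -1))
    .map((f) => f.name);
}

// Crude action count: split the action slot on commas, "and", and "then".
function countActions(action: string): number {
  return action
    .split(/,|\band\b|\bthen\b/i)
    .map((part) => part.trim())
    .filter((part) => part.length > 0).length;
}

// Returns human-readable warnings; an empty array means nothing was flagged.
function lintShot(prompt: string, action: string): string[] {
  const warnings: string[] = [];
  const framings = findFramings(prompt);
  if (framings.length > 1) {
    warnings.push(`Conflicting framings: ${framings.join(", ")}`);
  }
  const actions = countActions(action);
  if (actions > 1) {
    warnings.push(`Action slot reads as ${actions} actions; pick one`);
  }
  return warnings;
}
```

Calling `lintShot` with a prompt that mentions both "close up" and "wide shot", and an action slot of "walking, looking around, brushing her hair back", returns two warnings. The structured example from earlier returns none.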

Now the light slot, because it is where most prompts fall apart. Bad: "good lighting." Slightly better: "cinematic lighting." Good: "low sun backlight, warm key on the face, cool fill in the shadows." You are giving the model a lighting plan, not a mood. Mood comes out of the plan. If you skip the plan you get whatever the model averages from the rest of the prompt, and that is usually flat.
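A quick way to tell a lighting plan from a mood is to check whether the light slot names a direction and a color temperature at all. The sketch below does that with keyword matching. `checkLightSlot` and both word lists are assumptions made up for illustration, so treat a failed check as a nudge to reread the slot, not as a verdict.

```typescript
// Keyword lists are assumptions for illustration; tune them to your own vocabulary.
const DIRECTION_WORDS = [
  "backlight", "backlighting", "sidelight", "key", "fill", "rim",
  "overhead", "from the left", "from the right", "from behind", "from above",
];
const TEMPERATURE_WORDS = ["warm", "cool", "gold", "amber", "blue", "tungsten", "daylight"];

interface LightCheck {
  hasDirection: boolean;
  hasTemperature: boolean;
  isPlan: boolean; // true only when both direction and temperature are present
}

// Whole-word, case-insensitive match so that "key" does not match "turkey".
function containsAny(text: string, words: string[]): boolean {
  return words.some((word) => new RegExp("\\b" + word + "\\b", "i").test(text));
}

function checkLightSlot(light: string): LightCheck {
  const hasDirection = containsAny(light, DIRECTION_WORDS);
  const hasTemperature = containsAny(light, TEMPERATURE_WORDS);
  return { hasDirection, hasTemperature, isPlan: hasDirection && hasTemperature };
}
```

With these lists, "good lighting" and "cinematic lighting" both fail the check, while "low sun backlight, warm key on the face, cool fill in the shadows" passes.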
One more thing. If you want a specific color palette, put it in the environment slot, not the light slot. "Late autumn forest with rust and ochre ground cover" will do more for your color than any "warm tones" note. Veo reads the physical world better than it reads abstract color theory.
This template is not the only way to prompt Veo. It is the way that cuts your rerender count. On a run of 20 clips, you can expect to land maybe eight on the first take with loose prompts. With the five-slot structure, that number goes up to roughly fifteen. At $0.40 per second and 8-second clips, every render costs $3.20. If those hit rates hold on the retries too, getting 20 good takes means about 50 renders ($160) with loose prompts and about 27 renders (roughly $85) with structured ones. That is close to half the spend. Worth the extra thirty seconds of writing.
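The exact dollar figures depend on how you count rerenders. One way to sanity-check them is to assume every render lands independently at the observed first-take rate, and that you keep rendering until you have the number of good takes you need. The function name `expectedSpend` and that independence assumption are mine, not something the pricing page or the model guarantees.

```typescript
// Expected spend to collect `goodTakes` usable clips, assuming each render
// independently succeeds with probability `hitRate` (an assumption, not a guarantee).
function expectedSpend(
  goodTakes: number,
  hitRate: number,
  clipSeconds: number,
  pricePerSecond: number
): number {
  if (hitRate <= 0 || hitRate > 1) {
    throw new Error("hitRate must be greater than 0 and at most 1");
  }
  // On average it takes 1 / hitRate renders to get one good take.
  const expectedRenders = goodTakes / hitRate;
  return expectedRenders * clipSeconds * pricePerSecond;
}

// 8 of 20 land with loose prompts, 15 of 20 with the five-slot structure.
const looseSpend = expectedSpend(20, 8 / 20, 8, 0.4);       // about 50 renders, $160
const structuredSpend = expectedSpend(20, 15 / 20, 8, 0.4); // about 26.7 renders, roughly $85.33
const savedFraction = 1 - structuredSpend / looseSpend;     // about 0.47
```

Under these assumptions the structured prompts save a little under half of the total spend. Swap in your own hit rates once you have a few batches of real numbers.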