Lipsync turns a face into a talking head by matching lip movements to a voiceover. Two inputs are required: a source file (video or photo) and a voiceover. The quality of each input directly affects the output.
Source video (Lipsync Labs and Lipsync 2 Pro)
Both Lipsync Labs and Lipsync 2 Pro take a short video as input. The video should feature one person with a clearly visible face. Frontal or near-frontal angles work best — heavily turned heads or profile shots reduce sync accuracy.
Video length is flexible, but shorter videos process faster. The output length matches your source video — if the voiceover is longer than the video, it gets cut at the video's end.
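If you want to confirm that your voiceover fits inside your source video before uploading, you can compare the two durations locally. The sketch below is only illustrative and not part of the product; it assumes FFmpeg's ffprobe is installed, and the file names are hypothetical.

```python
import subprocess

def media_duration_seconds(path: str) -> float:
    """Read a media file's duration in seconds using ffprobe (FFmpeg)."""
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1",
         path],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())

# Hypothetical file names, for illustration only.
video_len = media_duration_seconds("source_video.mp4")
audio_len = media_duration_seconds("voiceover.wav")

if audio_len > video_len:
    # With video input, the output stops at the video's end,
    # so the voiceover is cut at that point.
    print(f"Voiceover runs {audio_len - video_len:.1f}s past the video and will be cut.")
else:
    print("Voiceover fits within the source video.")
```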
Avoid videos with multiple faces in frame. Lipsync targets one face — if several are visible, it may pick the wrong one.
Source photo (InfiniteTalk)
InfiniteTalk accepts a still photo instead of a video. It generates a talking-head video from a single portrait. Use a clear, front-facing photo with one visible face. With image input, the voiceover length drives the output duration. See Lipsync models for a full comparison of which model fits your use case.
Voiceover
The voiceover drives the lip movement — it determines the pacing and pauses. With video input, output length is set by the source video, not the voiceover. With image input (InfiniteTalk), the voiceover length determines how long the output runs. You can generate a voiceover in Voice Studio or upload an existing audio file from the Voiceover modal.
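To keep the duration rule in one place, here is a minimal sketch of how output length follows from the input type, based only on the two cases described above; the function name and parameters are hypothetical.

```python
def expected_output_seconds(source_is_video: bool,
                            source_video_seconds: float,
                            voiceover_seconds: float) -> float:
    """Expected lipsync output length, per the rules described above."""
    if source_is_video:
        # Video source: the output matches the source video;
        # any voiceover audio past the video's end is cut.
        return source_video_seconds
    # Photo source (InfiniteTalk): the voiceover length drives the output.
    return voiceover_seconds
```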
The voiceover language doesn't need to match the source material — lipsync works with any audio, any language.
When both a source file and a voiceover are selected, the Generate button activates. If it stays gray, one input is still empty. See Create a lip-sync video for the full generation flow.