Lipsync turns a face into a talking head by matching lip movements to a voiceover. Two inputs are required: a source file (video or photo) and a voiceover. The quality of each input directly affects the output.
Source video (Lipsync Labs and Lipsync 2 Pro)
Both Lipsync Labs and Lipsync 2 Pro take a short video as input. The video should feature one person with a clearly visible face. Frontal or near-frontal angles work best — heavily turned heads or profile shots reduce sync accuracy.
Video length is flexible, but shorter videos process faster. The output length matches your source video — if the voiceover is longer than the video, it gets cut at the video's end.
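If you want to confirm that your voiceover fits inside your source video before uploading, you can compare the two durations locally. The sketch below is only illustrative and not part of the product; it assumes FFmpeg's ffprobe is installed, and the file names are hypothetical.

```python
import subprocess

def media_duration_seconds(path: str) -> float:
    """Read a media file's duration in seconds using ffprobe (FFmpeg)."""
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1",
         path],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())

# Hypothetical file names, for illustration only.
video_len = media_duration_seconds("source_video.mp4")
audio_len = media_duration_seconds("voiceover.wav")

if audio_len > video_len:
    # With video input, the output stops at the video's end,
    # so the voiceover is cut at that point.
    print(f"Voiceover runs {audio_len - video_len:.1f}s past the video and will be cut.")
else:
    print("Voiceover fits within the source video.")
```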
Avoid videos with multiple faces in frame. Lipsync targets one face — if several are visible, it may pick the wrong one.
Source photo (InfiniteTalk)
InfiniteTalk accepts a still photo instead of a video. It generates a talking-head video from a single portrait. Use a clear, front-facing photo with one visible face. With image input, the voiceover length drives the output duration. See Lipsync models for a full comparison of which model fits your use case.
Voiceover
The voiceover drives the lip movement — it determines the pacing and pauses. With video input, output length is set by the source video, not the voiceover. With image input (InfiniteTalk), the voiceover length determines how long the output runs. You can generate a voiceover in Voice Studio or upload an existing audio file from the Voiceover modal.
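To keep the duration rule in one place, here is a minimal sketch of how output length follows from the input type, based only on the two cases described above; the function name and parameters are hypothetical.

```python
def expected_output_seconds(source_is_video: bool,
                            source_video_seconds: float,
                            voiceover_seconds: float) -> float:
    """Expected lipsync output length, per the rules described above."""
    if source_is_video:
        # Video source: the output matches the source video;
        # any voiceover audio past the video's end is cut.
        return source_video_seconds
    # Photo source (InfiniteTalk): the voiceover length drives the output.
    return voiceover_seconds
```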
The voiceover language doesn't need to match the source material — lipsync works with any audio, any language.
When both a source file and a voiceover are selected, the Generate button activates. If it stays gray, one input is still empty. See Create a lip-sync video for the full generation flow.