VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

Sung-Bin, K; Choi, J; Peng, P; Chung, J S; Oh, T_H; Harwath, D

This content will become publicly available on April 3, 2026

We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. We also adapt VoiceCraft-Dub for the video-to-speech task, demonstrating its versatility for various applications.

Award ID(s): 2505865

PAR ID: 10631371

Author(s) / Creator(s): Sung-Bin, K; Choi, J; Peng, P; Chung, J S; Oh, T_H; Harwath, D

Publisher / Repository: https://doi.org/10.48550/arXiv.2504.02386

Date Published: 2025-04-03

ISSN: 2504.02386

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Conference Paper:
The DOI is not currently available.
