Audio-Driven Talking Head Video Generation with Diffusion Model

Yizhe Zhu; Chunhui Zhang; Qiong Liu; Xi Zhou

doi:10.1109/icassp49357.2023.10094937


JOURNAL ARTICLE


Year: 2023 Pages: 1-5


Abstract

Synthesizing high-fidelity talking head videos that match an input audio sequence is a sought-after capability in applications such as digital humans, virtual video conferencing, and human-computer interaction. Popular GAN-based methods aim to align speech audio with lip motions and head poses. However, these methods are prone to training instability and even mode collapse, which results in low-quality video. In this paper, we propose a novel audio-driven method that uses a denoising diffusion model to generate high-resolution, realistic talking head videos. Specifically, we propose a face attribute disentanglement module that separates eye-blinking and lip-motion features; the lip-motion features are synchronized with audio features through contrastive learning, and the disentangled motion features are aligned with the talking head. The denoising diffusion model then takes the source image and the warped motion features as input and generates a high-resolution, realistic talking head with diverse head poses. Extensive evaluations using multiple metrics show that our method outperforms current techniques both qualitatively and quantitatively.
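The abstract says that lip motion features are synchronized with audio features through contrastive learning, but it does not state which loss is used. As a rough illustration only, the sketch below assumes a symmetric InfoNCE-style objective over a batch of paired embeddings. The function name `info_nce_loss`, the `temperature` value, and the embedding shapes are hypothetical and are not taken from the paper.

```python
import numpy as np


def info_nce_loss(audio_feats, lip_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss for paired (audio, lip) embeddings.

    Assumption: row i of audio_feats and row i of lip_feats come from the
    same video frame window (a positive pair); every other row in the batch
    acts as a negative. This is an illustrative choice, not the paper's loss.
    """
    # L2-normalize so the dot product below is a cosine similarity.
    a = audio_feats / np.linalg.norm(audio_feats, axis=1, keepdims=True)
    v = lip_feats / np.linalg.norm(lip_feats, axis=1, keepdims=True)

    # (N, N) similarity matrix; the diagonal holds the positive pairs.
    logits = (a @ v.T) / temperature

    def cross_entropy_on_diagonal(z):
        # Numerically stable log-softmax over each row.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the audio-to-lip and lip-to-audio directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(8, 16))               # stand-in audio embeddings
    lip = audio + 0.01 * rng.normal(size=(8, 16))  # nearly aligned lip embeddings
    print("aligned pairs:   ", info_nce_loss(audio, lip))
    print("mismatched pairs:", info_nce_loss(audio, lip[::-1]))
```

Under these assumptions the loss is small when each lip embedding is closest to its own audio embedding and grows when the pairs are mismatched, which is the behavior a synchronization objective of this kind relies on.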

Keywords:

Computer science; High fidelity; Artificial intelligence; Computer vision; Noise reduction; Fidelity; Head (geology); Face (sociological concept); Motion (physics); Speech recognition; Diffusion; Acoustics

Metrics

FWCI (Field Weighted Citation Impact): 1.64

Citation Normalized Percentile: 0.80

Topics

Generative Adversarial Networks and Image Synthesis

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Face recognition and analysis

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing


Related Documents

Text-Driven Synchronized Diffusion Video and Audio Talking Head Generation

One-shot motion talking head generation with audio-driven model

Audio-Driven Talking Face Video Generation with Emotion

Audio-driven Talking Head Generation with Transformer and 3D Morphable Model

ATL-Diff: Audio-Driven Talking Head Generation with Early Landmarks-Guide Noise Diffusion