Abstract

Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos. Recent developments in diffusion-based generative models allow for more realistic and stable data synthesis, and their performance on image and video generation has surpassed that of other generative models. In this work, we present an autoregressive diffusion model that requires only one identity image and an audio sequence to generate a video of a realistic talking head. Our solution is capable of hallucinating head movements and facial expressions, such as blinks, while preserving a given background. We evaluate our model on two different datasets, achieving state-of-the-art results in expressiveness and smoothness on both of them.
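The abstract describes an autoregressive diffusion model conditioned on a single identity image and an audio sequence. As a rough illustration only (not the paper's actual architecture), the sketch below shows the general shape of such a pipeline: each video frame is produced by a reverse-diffusion loop whose denoiser is conditioned on the identity image, the current audio window, and the frames generated so far. The `denoise_step` function here is a toy stand-in for a learned noise predictor, and all shapes and step counts are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x_t, t, identity, audio_win, prev_frames):
    """Toy stand-in for a learned noise predictor eps_theta(x_t, t, cond).
    A real model would be a neural network conditioned on all inputs;
    here we just pretend the noise is the gap to the identity image."""
    return x_t - identity

def sample_frame(identity, audio_win, prev_frames, T=50):
    """Reverse-diffusion loop for a single frame: start from pure noise
    and iteratively denoise, conditioned on identity, audio, and history."""
    x = rng.standard_normal(identity.shape)
    for t in range(T, 0, -1):
        eps = denoise_step(x, t, identity, audio_win, prev_frames)
        x = x - (1.0 / T) * eps  # toy update rule, not a real DDPM schedule
    return x

def generate_video(identity, audio_windows):
    """Autoregressive outer loop: one frame per audio window, each frame
    conditioned on everything generated before it."""
    video = []
    for audio_win in audio_windows:
        video.append(sample_frame(identity, audio_win, video))
    return video

identity = rng.standard_normal((8, 8))               # toy "identity image"
audio = [rng.standard_normal(4) for _ in range(5)]   # 5 toy audio windows
video = generate_video(identity, audio)
print(len(video), video[0].shape)
```

The autoregressive structure (conditioning each frame on previously generated ones) is what lets such models produce temporally coherent head motion from a single still image, at the cost of sequential rather than parallel frame generation.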

Keywords:
Beat (acoustics), Face (sociological concept), Computer science, Diffusion, Speech recognition, Acoustics, Physics, Linguistics, Philosophy

Metrics

Cited By: 98
FWCI (Field-Weighted Citation Impact): 48.24
References: 63
Citation Normalized Percentile: 1.00 (in top 1%; in top 10%)

Topics

Face recognition and analysis
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Diffusion Models Beat GANs on Topology Optimization

François Mazé, Faez Ahmed

Journal: Proceedings of the AAAI Conference on Artificial Intelligence, Year: 2023, Vol: 37 (8), Pages: 9108-9116
JOURNAL ARTICLE

From Gans to Diffusion Models: Text-To-Image Generation

Yian Xiao

Journal: Highlights in Science Engineering and Technology, Year: 2025, Vol: 160, Pages: 80-87