Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation

Michał Stypułkowski; Konstantinos Vougioukas; Sen He; Maciej Zięba; Stavros Petridis; Maja Pantić

doi:10.1109/wacv57701.2024.00502

JOURNAL ARTICLE

Year: 2024; Pages: 5089-5098

Abstract

Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos. Recent developments in diffusion-based generative models allow for more realistic and stable data synthesis and their performance on image and video generation has surpassed that of other generative models. In this work, we present an autoregressive diffusion model that requires only one identity image and audio sequence to generate a video of a realistic talking head. Our solution is capable of hallucinating head movements, facial expressions, such as blinks, and preserving a given background. We evaluate our model on two different datasets, achieving state-of-the-art results in expressiveness and smoothness on both of them.

Keywords:

Beat (acoustics); Face (sociological concept); Computer science; Diffusion; Speech recognition; Acoustics; Physics; Linguistics; Philosophy

Metrics

Cited By

FWCI (Field Weighted Citation Impact): 48.24

Refs

Citation Normalized Percentile: 1.00

Is in top 1%

Is in top 10%

Topics

Face recognition and analysis

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Diffusion Models Beat GANs on Topology Optimization

Diffused Poses and Distilled Expressions for Controllable Audio-driven Talking Face Generation

From Gans to Diffusion Models: Text-To-Image Generation

EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model

EAT-Face: Emotion-Controllable Audio-Driven Talking Face Generation via Diffusion Model