MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation

Seyeon Kim; Siyoon Jin; Jihye Park; KiHong Kim; Ji Young Kim; Jisu Nam; Seungryong Kim

doi:10.1609/aaai.v39i4.32452


JOURNAL ARTICLE


Year: 2025
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Vol: 39 (4)
Pages: 4302-4310
Publisher: Association for the Advancement of Artificial Intelligence



Abstract

Conventional GAN-based models for talking head generation often suffer from limited quality and unstable training. Recent approaches based on diffusion models have attempted to address these limitations and improve fidelity. However, they still face challenges, such as intensive sampling times and difficulties in maintaining temporal consistency due to the high stochasticity of diffusion models. To overcome these challenges, we propose a novel motion-disentangled diffusion model for high-quality talking head generation, called MoDiTalker. We introduce two modules: the Audio-To-Motion (AToM) module, designed to generate synchronized lip movements from audio, and the Motion-To-Video (MToV) module, designed to produce high-quality talking head videos based on the generated motions. AToM excels in capturing subtle lip movements by leveraging an audio attention mechanism. Additionally, MToV enhances temporal consistency by utilizing an efficient tri-plane representation. Our experiments on standard benchmarks demonstrate that our model outperforms existing GAN-based and diffusion-based models. We also provide comprehensive ablation studies and user study results.
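The abstract does not say how AToM's audio attention mechanism is implemented. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention, in which per-frame motion queries attend over per-frame audio features. The function names, feature dimensions, and toy values are assumptions made for this example and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query row attends over all key/value rows.

    Returns the attended outputs and the attention weights (one row per query).
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity between this query and every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value rows.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy data: 2 motion frames (queries) and 3 audio frames (keys/values).
motion_queries = [[1.0, 0.0], [0.0, 1.0]]
audio_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_values = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

attended, attn = cross_attention(motion_queries, audio_keys, audio_values)
```

Each row of `attn` is a probability distribution over the audio frames, so a motion frame draws most heavily on the audio features it matches best. A real model would learn the query, key, and value projections rather than use fixed vectors.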

Keywords: Fidelity; Diffusion; Head (geology); Motion (physics); High fidelity; Computer science; Physics; Artificial intelligence; Geology; Thermodynamics; Acoustics

Metrics

FWCI (Field Weighted Citation Impact): 3.22

Citation Normalized Percentile: 0.82

Topics

Speech and dialogue systems

Physical Sciences → Computer Science → Artificial Intelligence

Face recognition and analysis

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence


Related Documents

Motion-disentangled Diffusion Model for High-fidelity Talking Head Generation

Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

High-Fidelity and Freely Controllable Talking Head Video Generation

DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentangled Diffusion Model

DisenEmo: Learning disentangled emotional representation from facial motion for 3D talking head generation