Xueping Wang, Yujia Huo, Yanan Liu, Xueni Guo, Feihu Yan, Guangzhe Zhao
Audio-driven emotional talking face generation aims to produce talking face videos with rich facial expressions and temporal coherence. Current diffusion model-based approaches rely predominantly on either single-label emotion annotations or external video references, and they often struggle to capture the complex relationships among modalities, resulting in less natural emotional expressions. To address these issues, we propose MF-ETalk, a multimodal feature-guided method for emotional talking face generation. Specifically, we design an emotion-aware multimodal feature disentanglement and fusion framework that leverages Action Units (AUs) to disentangle facial expressions and models the nonlinear relationships among AU features using a residual encoder. Furthermore, we introduce a hierarchical multimodal feature fusion module that enables dynamic interactions among audio, visual cues, AUs, and motion dynamics. This module is optimized through global motion modeling, lip synchronization, and expression subspace learning, enabling full-face dynamic generation. Finally, an emotion-consistency constraint module refines the generated results and ensures the naturalness of expressions. Extensive experiments on the MEAD and HDTF datasets demonstrate that MF-ETalk outperforms state-of-the-art methods in both expression naturalness and lip-sync accuracy. For example, it achieves an FID of 43.052 and an E-FID of 2.403 on MEAD, along with strong synchronization performance (LSE-C of 6.781, LSE-D of 7.962), confirming the effectiveness of our approach in producing realistic and emotionally expressive talking face videos.
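To make the two components named in the abstract more concrete, the following is a minimal, illustrative PyTorch sketch, not the authors' implementation: a residual encoder that models nonlinear relationships among AU features, and a hierarchical fusion step that lets visual tokens attend to audio, AU, and motion features in turn. All module names, feature dimensions, the number of AUs, and the fusion order are assumptions made for illustration only.

```python
# Illustrative sketch (assumed design, not the MF-ETalk code): AU residual encoder
# plus hierarchical cross-attention fusion of audio, visual, AU, and motion features.
import torch
import torch.nn as nn


class AUResidualEncoder(nn.Module):
    """Models nonlinear relationships among AU features with residual MLP blocks."""

    def __init__(self, num_aus: int = 17, dim: int = 128, num_blocks: int = 3):
        super().__init__()
        self.proj = nn.Linear(num_aus, dim)
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_blocks)
        ])

    def forward(self, au: torch.Tensor) -> torch.Tensor:  # au: (B, T, num_aus)
        h = self.proj(au)
        for block in self.blocks:
            h = h + block(h)  # residual connection keeps low-level AU information
        return h


class HierarchicalFusion(nn.Module):
    """Fuses audio, visual, AU, and motion tokens stage by stage via cross-attention (assumed order)."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attend_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attend_au = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attend_motion = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual, au, motion):  # all tensors: (B, T, dim)
        h, _ = self.attend_audio(visual, audio, audio)    # visual queries attend to audio
        h, _ = self.attend_au(h, au, au)                  # then to AU (expression) features
        h, _ = self.attend_motion(h, motion, motion)      # then to motion dynamics
        return self.norm(h)


if __name__ == "__main__":
    B, T, dim = 2, 25, 128
    au_encoder = AUResidualEncoder(num_aus=17, dim=dim)
    fusion = HierarchicalFusion(dim=dim)
    audio, visual, motion = (torch.randn(B, T, dim) for _ in range(3))
    au = torch.randn(B, T, 17)
    fused = fusion(audio, visual, au_encoder(au), motion)
    print(fused.shape)  # torch.Size([2, 25, 128])
```

In the actual method, a fused representation of this kind would condition the diffusion-based generator; the cross-attention ordering above is just one plausible way to realize "hierarchical" fusion.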