JOURNAL ARTICLE

Multimodal Feature-Guided Audio-Driven Emotional Talking Face Generation

Xueping Wang, Yujia Huo, Yanan Liu, Xueni Guo, Feihu Yan, Guangzhe Zhao

Year: 2025   Journal: Electronics   Vol: 14 (13)   Pages: 2684   Publisher: Multidisciplinary Digital Publishing Institute

Abstract

Audio-driven emotional talking face generation aims to generate talking face videos with rich facial expressions and temporal coherence. Current diffusion model-based approaches predominantly depend on either single-label emotion annotations or external video references, which often struggle to capture the complex relationships between modalities, resulting in less natural emotional expressions. To address these issues, we propose MF-ETalk, a multimodal feature-guided method for emotional talking face generation. Specifically, we design an emotion-aware multimodal feature disentanglement and fusion framework that leverages Action Units (AUs) to disentangle facial expressions and models the nonlinear relationships among AU features using a residual encoder. Furthermore, we introduce a hierarchical multimodal feature fusion module that enables dynamic interactions among audio, visual cues, AUs, and motion dynamics. This module is optimized through global motion modeling, lip synchronization, and expression subspace learning, enabling full-face dynamic generation. Finally, an emotion-consistency constraint module is employed to refine the generated results and ensure the naturalness of expressions. Extensive experiments on the MEAD and HDTF datasets demonstrate that MF-ETalk outperforms state-of-the-art methods in both expression naturalness and lip-sync accuracy. For example, it achieves an FID of 43.052 and E-FID of 2.403 on MEAD, along with strong synchronization performance (LSE-C of 6.781, LSE-D of 7.962), confirming the effectiveness of our approach in producing realistic and emotionally expressive talking face videos.
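The lip-sync figures quoted above (LSE-C, LSE-D) are SyncNet-style scores: distances between audio and video embeddings over a range of temporal offsets. A minimal sketch of how such distance/confidence metrics are typically computed, assuming hypothetical per-frame embeddings from a pretrained sync encoder (not the paper's actual evaluation code):

```python
import numpy as np

def lse_metrics(audio_emb, video_emb, max_offset=15):
    """Toy SyncNet-style lip-sync scoring (illustrative only).

    audio_emb, video_emb: (T, D) arrays of per-frame embeddings,
    assumed to come from a pretrained audio-visual sync encoder.
    For each frame, distances are taken over temporal offsets;
    LSE-D is the mean minimum distance (lower is better) and
    LSE-C the mean gap between the median and minimum distance
    (higher confidence is better).
    """
    T = min(len(audio_emb), len(video_emb))
    min_dists, confs = [], []
    for t in range(max_offset, T - max_offset):
        # Distance from the video frame to audio frames at each offset.
        d = np.array([np.linalg.norm(video_emb[t] - audio_emb[t + o])
                      for o in range(-max_offset, max_offset + 1)])
        min_dists.append(d.min())
        confs.append(np.median(d) - d.min())
    return float(np.mean(min_dists)), float(np.mean(confs))
```

With perfectly aligned identical embeddings, LSE-D collapses to zero and the confidence stays non-negative, which matches the intuition that better synchronization means lower LSE-D and higher LSE-C.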

Keywords:
Feature (linguistics), Face (sociological concept), Computer science, Speech recognition, Artificial intelligence, Multimedia, Sociology, Linguistics

Metrics

Cited By: 2
FWCI (Field-Weighted Citation Impact): 9.55
Refs: 40
Citation Normalized Percentile: 0.93 (in top 1% and top 10%)

Topics

Face recognition and analysis
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Generative Adversarial Networks and Image Synthesis
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing

Related Documents

JOURNAL ARTICLE

Emotionally Controllable Audio-driven Talking Face Generation

Yifan Xu, Sirui Zhao, Shifeng Liu, Tong Xu, Enhong Chen

Journal: ACM Transactions on Multimedia Computing, Communications, and Applications   Year: 2026

JOURNAL ARTICLE

TellMeTalk: Multimodal-driven talking face video generation

Pengfei Li, Huihuang Zhao, Qingyun Liu, Peng Tang, Lin Zhang

Journal: Computers & Electrical Engineering   Year: 2024   Vol: 114   Pages: 109049

JOURNAL ARTICLE

Talking Face Generation With Audio-Deduced Emotional Landmarks

Shuyan Zhai, Meng Liu, Yongqiang Li, Zan Gao, Lei Zhu, Liqiang Nie

Journal: IEEE Transactions on Neural Networks and Learning Systems   Year: 2023   Vol: 35 (10)   Pages: 14099-14111