Abstract

Emotion is an essential aspect of human speech that is manifested in speech prosody. Speech, visual, and textual cues are complementary in human communication. In this paper, we study a hybrid fusion method, referred to as the multi-modal attention network (MMAN), that makes use of visual and textual cues in speech emotion recognition. We propose a novel multi-modal attention mechanism, cLSTM-MMA, which attends across the three modalities and selectively fuses their information. cLSTM-MMA is combined with uni-modal sub-networks in a late-fusion stage. Experiments show that speech emotion recognition benefits significantly from visual and textual cues, and that cLSTM-MMA alone is as competitive as other fusion methods in accuracy while using a much more compact network. The proposed hybrid network, MMAN, achieves state-of-the-art performance on the IEMOCAP database for emotion recognition.
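The abstract describes attention across three modalities followed by fusion. The paper's actual cLSTM-MMA architecture is not reproduced here; the following is only a minimal illustrative sketch of the general idea, where each modality's representation attends over all three modality representations and the attended vectors are concatenated. All projection matrices and dimensions are hypothetical placeholders, not the authors' design.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_attention_fusion(speech, visual, text, rng):
    """Toy cross-modal attention over three modality vectors of size d.

    Hypothetical stand-in for the multi-modal attention idea: each
    modality forms a query, attends over all three modality
    representations, and the attended outputs are concatenated.
    """
    d = speech.shape[-1]
    modalities = np.stack([speech, visual, text])   # (3, d)
    W_q = rng.standard_normal((d, d)) * 0.1         # toy query projection
    W_k = rng.standard_normal((d, d)) * 0.1         # toy key projection
    Q = modalities @ W_q                            # (3, d)
    K = modalities @ W_k                            # (3, d)
    scores = Q @ K.T / np.sqrt(d)                   # (3, 3) cross-modal scores
    attn = softmax(scores, axis=-1)                 # attention across modalities
    fused = attn @ modalities                       # (3, d) attended features
    return fused.reshape(-1)                        # concatenate -> (3*d,)

rng = np.random.default_rng(0)
d = 8
fused = multimodal_attention_fusion(rng.standard_normal(d),
                                    rng.standard_normal(d),
                                    rng.standard_normal(d), rng)
print(fused.shape)  # (24,)
```

In a hybrid (late) fusion setup as described in the abstract, a vector like `fused` would be combined with the outputs of the uni-modal sub-networks before the final emotion classifier.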

Keywords:
Speech recognition, Computer science, Emotion recognition, Modal, Natural language processing

Metrics

Cited by: 83
Field-Weighted Citation Impact (FWCI): 7.39
References: 30
Citation Normalized Percentile: 0.98

Topics

Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Emotion and Mood Recognition
Social Sciences →  Psychology →  Experimental and Cognitive Psychology
Hand Gesture Recognition Systems
Physical Sciences →  Computer Science →  Human-Computer Interaction
