Emotion is an essential aspect of human speech that is manifested in speech prosody. Speech, visual, and textual cues are complementary in human communication. In this paper, we study a hybrid fusion method, referred to as the multi-modal attention network (MMAN), that exploits visual and textual cues in speech emotion recognition. We propose a novel multi-modal attention mechanism, cLSTM-MMA, which facilitates attention across the three modalities and selectively fuses their information. cLSTM-MMA is combined with other uni-modal sub-networks in a late fusion stage. Experiments show that speech emotion recognition benefits significantly from visual and textual cues, and that the proposed cLSTM-MMA alone is as competitive as other fusion methods in terms of accuracy, while having a much more compact network structure. The proposed hybrid network MMAN achieves state-of-the-art performance on the IEMOCAP database for emotion recognition.
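As a rough illustration of the idea, the sketch below shows one way cross-modal attention fusion over speech, visual, and text features might be wired up in PyTorch: each modality's encoded sequence queries the concatenation of all three modalities, and the attended summaries are fused for classification. This is a minimal sketch under our own assumptions, not the authors' implementation; the module name `MultiModalAttention`, the dimensions, head counts, and the four-class output are all illustrative.

```python
# Minimal sketch (assumed design, not the paper's code) of attention
# across three modalities with selective fusion for emotion recognition.
import torch
import torch.nn as nn


class MultiModalAttention(nn.Module):
    """Each modality queries the concatenated key/value sequence of all
    three modalities, so attention can flow across modalities."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.attn = nn.ModuleDict({
            m: nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for m in ("speech", "visual", "text")
        })
        # Map the concatenated attended summaries to emotion logits
        # (4 classes as in common IEMOCAP setups; an assumption here).
        self.classifier = nn.Linear(3 * d_model, 4)

    def forward(self, feats: dict) -> torch.Tensor:
        # feats[m]: (batch, seq_len_m, d_model) per-modality encodings,
        # e.g. produced by per-modality LSTM encoders upstream.
        memory = torch.cat(list(feats.values()), dim=1)  # shared keys/values
        pooled = []
        for m, x in feats.items():
            attended, _ = self.attn[m](query=x, key=memory, value=memory)
            pooled.append(attended.mean(dim=1))  # temporal average pooling
        return self.classifier(torch.cat(pooled, dim=-1))


if __name__ == "__main__":
    model = MultiModalAttention()
    batch = {
        "speech": torch.randn(2, 50, 128),
        "visual": torch.randn(2, 30, 128),
        "text": torch.randn(2, 20, 128),
    }
    print(model(batch).shape)  # torch.Size([2, 4])
```

In a hybrid setup like the one described, the logits from such a cross-modal module would be combined with those of uni-modal sub-networks at a late fusion stage.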