Speech emotion recognition (SER) is a crucial aspect of affective computing and human-computer interaction, yet reliably identifying emotions across different speakers and languages remains challenging. This paper introduces SER-Fuse, a multi-modal SER application designed to address the complexities of multiple speakers and languages. Our approach leverages diverse audio/speech embeddings and text embeddings to extract optimal features for multi-modal SER. We then employ multi-feature fusion to integrate embedding features across modalities and languages. Experimental results on the English-Chinese emotional speech (ECES) dataset reveal that SER-Fuse attains competitive performance in the multi-lingual setting compared with single-lingual approaches. Furthermore, we provide the implementation of SER-Fuse for download at https://github.com/nhattruongpham/SER-Fuse to support reproducibility and local deployment.
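To make the fusion idea concrete, the sketch below shows one common form of multi-feature fusion: concatenating normalized audio and text embeddings into a single vector that a classification head can score. The dimensions, the normalization step, and the linear head are illustrative assumptions, not details taken from the SER-Fuse paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration (not from the paper).
AUDIO_DIM, TEXT_DIM, NUM_EMOTIONS = 128, 64, 5

def fuse_features(audio_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Concatenate L2-normalized audio and text embeddings into one fused vector."""
    a = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    return np.concatenate([a, t])

# Toy inputs standing in for embeddings from pretrained speech/text encoders.
audio_emb = rng.standard_normal(AUDIO_DIM)
text_emb = rng.standard_normal(TEXT_DIM)

fused = fuse_features(audio_emb, text_emb)

# A linear head over the fused vector would yield per-emotion logits.
W = rng.standard_normal((NUM_EMOTIONS, AUDIO_DIM + TEXT_DIM))
logits = W @ fused
print(fused.shape, logits.shape)
```

Concatenation is the simplest fusion strategy; attention-based or gated fusion layers are common alternatives when one modality should be weighted per utterance.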