Zhiyu Liu, Junchen Fu, Kaiwen Zheng, Joemon M. Jose
Multimodal pre-trained models have demonstrated remarkable capabilities in processing diverse data types, including text and audio. However, their potential for Speech Emotion Recognition (SER) remains relatively underexplored. In this work, we investigate the effectiveness of such models for SER tasks and evaluate the state-of-the-art IISAN framework for efficient fine-tuning. We further improve IISAN by incorporating a dynamic gating mechanism, referred to as IISAN-MOE, to enhance adaptability and performance. Experimental results confirm that multimodal approaches consistently outperform their single-modal counterparts, with IISAN significantly boosting both effectiveness and efficiency. Moreover, the newly proposed IISAN-MOE replaces IISAN's original static fusion mechanism with a dynamic one, offering a more flexible and advanced solution for multimodal speech emotion recognition.
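The abstract contrasts static fusion with a dynamic gating mechanism but does not spell out the architecture. As a rough illustration only (the function and parameter names below are hypothetical, not from the paper), a gate network can map the concatenated modality features to per-sample softmax weights, so the fusion weights vary with the input instead of being fixed:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(feats, gate_w, gate_b):
    """Hypothetical dynamic-gating fusion sketch (not the paper's exact design).

    feats: list of M arrays, each (batch, dim), one per modality.
    gate_w, gate_b: parameters of a linear gate mapping the concatenated
    features (batch, M*dim) to M per-sample mixture weights.
    """
    concat = np.concatenate(feats, axis=-1)       # (batch, M*dim)
    weights = softmax(concat @ gate_w + gate_b)   # (batch, M), sums to 1
    stacked = np.stack(feats, axis=1)             # (batch, M, dim)
    # Weighted sum over modalities; weights differ per input sample,
    # unlike a static fusion with one fixed weight per modality.
    return (weights[..., None] * stacked).sum(axis=1)

rng = np.random.default_rng(0)
dim, num_modalities = 8, 2            # toy sizes for illustration
feats = [rng.normal(size=(4, dim)) for _ in range(num_modalities)]
gate_w = rng.normal(size=(num_modalities * dim, num_modalities)) * 0.1
gate_b = np.zeros(num_modalities)
fused = gated_fusion(feats, gate_w, gate_b)
print(fused.shape)  # (4, 8)
```

In a static-fusion baseline the mixture weights would be a single learned (or fixed) vector shared by all inputs; here each sample gets its own weighting, which is the adaptability the gating mechanism is meant to add.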