JOURNAL ARTICLE

Exploring Multimodal Pre-trained Models for Speech Emotion Recognition

Abstract

Multimodal pre-trained models have demonstrated remarkable capabilities in processing diverse data types, including text and audio. However, their potential for Speech Emotion Recognition (SER) remains relatively underexplored. In this work, we investigate the effectiveness of such models for SER tasks and evaluate the state-of-the-art IISAN framework for efficient fine-tuning. We further improve IISAN by incorporating a dynamic gating mechanism, referred to as IISAN-MOE, to enhance adaptability and performance. Experimental results confirm that multimodal approaches consistently outperform their single-modal counterparts, with IISAN significantly boosting both effectiveness and efficiency. Moreover, the newly proposed IISAN-MOE refines IISAN's original static fusion mechanism, offering a more flexible and advanced solution for multimodal speech emotion recognition.
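The dynamic gating idea mentioned above (replacing a static fusion of modality features with input-dependent mixture weights) can be sketched as follows. This is a minimal illustrative example, not the paper's actual IISAN-MOE implementation; the class name `GatedFusion`, the feature dimensions, and the simple linear gate are all assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class GatedFusion:
    """Illustrative dynamic gating over per-modality expert features.

    Instead of fusing modality features with fixed (static) weights,
    a small gate network computes input-dependent mixture weights,
    in the spirit of a mixture-of-experts fusion layer.
    (Hypothetical sketch; not the authors' IISAN-MOE code.)
    """

    def __init__(self, dim, n_experts, seed=0):
        rng = np.random.default_rng(seed)
        # Linear gate: maps concatenated expert features to one logit per expert.
        self.w_gate = rng.standard_normal((n_experts * dim, n_experts)) * 0.01

    def __call__(self, expert_feats):
        # expert_feats: list of n_experts vectors, each of shape (dim,),
        # e.g. one feature vector per modality (text, audio, ...).
        x = np.concatenate(expert_feats)       # (n_experts * dim,)
        gates = softmax(x @ self.w_gate)       # (n_experts,), sums to 1
        # Weighted sum of expert features using the dynamic gate weights.
        fused = sum(g * f for g, f in zip(gates, expert_feats))
        return fused, gates
```

Because the gate weights are a function of the current input, each utterance can lean more on whichever modality is most informative, rather than using one fixed fusion ratio for all inputs.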

Keywords:
Speech recognition; Computer science; Emotion recognition; Natural language processing; Hidden Markov model; Artificial intelligence


Topics

Emotion and Mood Recognition
Speech and Audio Processing
Speech Recognition and Synthesis