JOURNAL ARTICLE

Exploring Multimodal Pre-trained Models for Speech Emotion Recognition

Abstract

Multimodal pre-trained models have demonstrated remarkable capabilities in processing diverse data types, including text and audio. However, their potential for Speech Emotion Recognition (SER) remains relatively underexplored. In this work, we investigate the effectiveness of such models for SER tasks and evaluate the state-of-the-art IISAN framework for efficient fine-tuning. We further improve IISAN by incorporating a dynamic gating mechanism, referred to as IISAN-MOE, to enhance adaptability and performance. Experimental results confirm that multimodal approaches consistently outperform their single-modal counterparts, with IISAN significantly boosting both effectiveness and efficiency. Moreover, the newly proposed IISAN-MOE refines IISAN's original static fusion mechanism, offering a more flexible and advanced solution for multimodal speech emotion recognition.
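The dynamic gating idea mentioned above (replacing a static fusion of modality features with input-dependent mixture weights) can be sketched as follows. This is a minimal illustrative example, not the paper's actual IISAN-MOE implementation; the class name `GatedFusion`, the feature dimensions, and the simple linear gate are all assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class GatedFusion:
    """Illustrative dynamic gating over per-modality expert features.

    Instead of fusing modality features with fixed (static) weights,
    a small gate network computes input-dependent mixture weights,
    in the spirit of a mixture-of-experts fusion layer.
    (Hypothetical sketch; not the authors' IISAN-MOE code.)
    """

    def __init__(self, dim, n_experts, seed=0):
        rng = np.random.default_rng(seed)
        # Linear gate: maps concatenated expert features to one logit per expert.
        self.w_gate = rng.standard_normal((n_experts * dim, n_experts)) * 0.01

    def __call__(self, expert_feats):
        # expert_feats: list of n_experts vectors, each of shape (dim,),
        # e.g. one feature vector per modality (text, audio, ...).
        x = np.concatenate(expert_feats)       # (n_experts * dim,)
        gates = softmax(x @ self.w_gate)       # (n_experts,), sums to 1
        # Weighted sum of expert features using the dynamic gate weights.
        fused = sum(g * f for g, f in zip(gates, expert_feats))
        return fused, gates
```

Because the gate weights are a function of the current input, each utterance can lean more on whichever modality is most informative, rather than using one fixed fusion ratio for all inputs.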

Keywords:
Speech recognition; Computer science; Emotion recognition; Natural language processing; Hidden Markov model; Artificial intelligence


Topics

Emotion and Mood Recognition
Speech and Audio Processing
Speech Recognition and Synthesis