JOURNAL ARTICLE

Multimodal Speech Emotion Recognition via Transformer-Based Hybrid Fusion and Dual Cross-entropy Techniques

Zhixing Yu

Year: 2024; Journal: Highlights in Science, Engineering and Technology; Vol: 120; Pages: 114-121

Abstract

Speech emotion recognition (SER) is attracting growing academic interest as machine intelligence advances in the service industries. Previous research has validated the efficacy of multimodality in SER, yet most studies have focused on one-time fusion techniques. This paper proposes a hybrid fusion architecture that combines the advantages of multiple fusion techniques and modalities. The model is based predominantly on the Transformer architecture. The study also extends the classic cross-entropy loss into a novel loss function that differentiates between misprediction patterns. The architecture is evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) with cross-validation. It reaches 89.7% accuracy, outperforming state-of-the-art (SOTA) methods, and the proposed loss function raises this further to 91.1%. In addition, the models show computational scalability and need little hyperparameter fine-tuning. The article concludes that more comprehensive fusion techniques are worth exploring for multimodal speech emotion recognition, and that Transformers are well suited to capturing emotional characteristics and driving the classification process.
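The abstract does not give the form of the proposed loss, so the following is only an illustrative sketch of how a cross-entropy loss could be extended to treat different misprediction patterns differently. The function `pattern_weighted_cross_entropy` and the `cost` matrix are hypothetical names introduced here, and the penalty term is an assumption, not the paper's formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Classic cross-entropy for one sample: -log p[target]."""
    return -math.log(softmax(logits)[target])


def pattern_weighted_cross_entropy(logits, target, cost):
    """Hypothetical extension (an assumption, not the paper's loss).

    Adds to the classic term a penalty on the probability assigned to each
    wrong class j, scaled by cost[target][j], so that some confusions
    (for example between two acoustically similar emotions) can be
    penalized more heavily than others.
    """
    probs = softmax(logits)
    penalty = 0.0
    for j, p in enumerate(probs):
        if j == target:
            continue
        # Clamp to avoid log(0) if a wrong class saturates at probability 1.
        penalty -= cost[target][j] * math.log(max(1.0 - p, 1e-12))
    return -math.log(probs[target]) + penalty
```

With an all-zero `cost` matrix the sketch reduces to the classic cross-entropy, so the extra term can be tuned independently of the baseline.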

Keywords:
Computer science, Cross entropy, Transformer, Speech recognition, Emotion recognition, Architecture, Hyperparameter, Artificial intelligence, Modalities, Multimodality, Scalability, Computation, Machine learning, Pattern recognition (psychology), Engineering

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 25
Citation Normalized Percentile: 0.31

Topics

Emotion and Mood Recognition (Social Sciences → Psychology → Experimental and Cognitive Psychology)
Speech and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)