Multimodal Speech Emotion Recognition via Transformer-Based Hybrid Fusion and Dual Cross-entropy Techniques

Zhixing Yu

doi:10.54097/6pzrwp08

JOURNAL ARTICLE

Year: 2024 Journal: Highlights in Science Engineering and Technology Vol: 120 Pages: 114-121

Abstract

Speech emotion recognition (SER) is attracting growing academic interest as machine intelligence advances in the service industries. Previous research has validated the efficacy of multimodality in SER, yet most studies have focused on one-time fusion techniques. This paper proposes a hybrid fusion architecture that combines the advantages of multiple fusion techniques and modalities. The model is based predominantly on the Transformer architecture. The study also extends the classic cross-entropy loss and designs a novel loss function that differentiates between misprediction patterns. The architecture is evaluated with cross-validation on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset. It reaches 89.7% accuracy and outperforms state-of-the-art (SOTA) methods. The proposed loss function further raises the accuracy to 91.1%. In addition, the models show computational scalability and little need for hyperparameter fine-tuning. The article concludes that more comprehensive fusion techniques are worth exploring for multimodal speech emotion recognition, and that Transformers are well suited to modeling emotional characteristics and driving the classification process.
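The abstract does not give the form of the proposed loss, so the following is only an illustrative sketch of one way a cross-entropy loss can be extended to treat different mispredictions differently. It is not the paper's formulation. The function names, the `cost` matrix, the `beta` weight, and the three-class toy setup are all assumptions made for this example: the standard cross-entropy term is kept, and a second term penalizes probability mass placed on wrong classes, weighted by how costly each confusion is assumed to be.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Classic cross-entropy for a single example."""
    return -math.log(softmax(logits)[target])


def cost_weighted_cross_entropy(logits, target, cost, beta=1.0):
    """Cross-entropy plus a penalty on wrong-class probability mass.

    cost[t][j] is a hypothetical weight for confusing true class t with
    class j. This is an assumed design, not the loss from the paper.
    """
    probs = softmax(logits)
    ce = -math.log(probs[target])
    penalty = 0.0
    for j, p in enumerate(probs):
        if j != target:
            # -log(1 - p) grows as more mass lands on wrong class j;
            # the clamp avoids log(0) for extreme logits.
            penalty += -cost[target][j] * math.log(max(1.0 - p, 1e-12))
    return ce + beta * penalty


# Hypothetical costs: confusing class 0 with class 1 is treated as worse
# than confusing class 0 with class 2.
cost = [
    [0.0, 1.0, 0.2],
    [1.0, 0.0, 0.2],
    [0.2, 0.2, 0.0],
]

toward_class_1 = [2.0, 1.0, 0.1]  # runner-up is the costly class 1
toward_class_2 = [2.0, 0.1, 1.0]  # runner-up is the cheaper class 2

loss_a = cost_weighted_cross_entropy(toward_class_1, 0, cost)
loss_b = cost_weighted_cross_entropy(toward_class_2, 0, cost)
```

Both toy inputs give the same classic cross-entropy, because the true class receives the same probability in each, while the extended loss is larger for `toward_class_1`, where the leftover mass falls on the confusion assumed to be more costly.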

Keywords:

Computer science; Cross entropy; Transformer; Speech recognition; Emotion recognition; Architecture; Hyperparameter; Artificial intelligence; Modalities; Multimodality; Scalability; Computation; Machine learning; Pattern recognition (psychology); Engineering

Topics

Emotion and Mood Recognition

Social Sciences → Psychology → Experimental and Cognitive Psychology

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing
