Speech Emotion Recognition via Parallel Dual-Branch Fusion Model

Zhongliang Wei; Chang Ge; Lijun Zhu; Jinmin Ye

doi:10.14569/ijacsa.2025.0161115

JOURNAL ARTICLE

Year: 2025
Journal: International Journal of Advanced Computer Science and Applications
Vol: 16 (11)
Publisher: Science and Information Organization

Abstract

Speech Emotion Recognition (SER) has become a pivotal topic within affective computing and human–computer interaction, where the core challenge lies in jointly capturing both the time–frequency structure and the semantic context of speech. To overcome the shortcomings of current approaches—including single-view feature representation, the lack of emotional discriminability in self-supervised models, and suboptimal complementarity among fusion strategies—this study proposes a parallel dual-branch fusion architecture for SER. The framework consists of a wav2vec 2.0 branch and a CNN–Transformer spectrogram branch, which respectively extract contextual semantic representations from raw waveforms and explicit time–frequency features from spectrograms. A logistic regression fusion mechanism is further introduced at the decision level to achieve adaptive weighting in the probability space, thereby fully leveraging the complementary strengths of the two feature types. Experiments carried out on the RAVDESS audio subset showed that the proposed model surpassed several mainstream baselines (e.g., CNN-n-GRU and RELUEM), achieving 92.7% accuracy and 92.2% Macro-F1, with an average improvement of about 3.2 percentage points. The layer unfreezing studies confirmed the effectiveness of partial fine-tuning for transferring pretrained features, while the comparative experiments on fusion strategies validated the superiority of probability-space fusion in both performance and stability. Overall, the proposed framework achieves simultaneous gains in accuracy and robustness through feature complementarity, branch decoupling, and lightweight fusion. Future work will explore cross-lingual generalization, multimodal extensions, lightweight deployment, and dynamic emotion modeling, contributing to more efficient affective computing and intelligent interaction systems.
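The decision-level fusion described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulation, so the following is an assumption: each branch emits a class-probability vector, the two vectors are concatenated, and a multinomial (softmax) logistic regression is fitted on them by full-batch gradient descent. The names `fuse`, `fit`, `mean_nll`, and `toy_branch` are hypothetical, and the data are synthetic stand-ins for branch outputs, not RAVDESS predictions.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def fuse(W, b, p_wave, p_spec):
    """Fused class probabilities from the two branch probability vectors."""
    x = p_wave + p_spec  # concatenated features in probability space (assumed layout)
    logits = [sum(w * v for w, v in zip(row, x)) + bk for row, bk in zip(W, b)]
    return softmax(logits)

def mean_nll(W, b, data):
    """Mean negative log-likelihood of the true labels under the fused model."""
    return -sum(math.log(fuse(W, b, pw, ps)[y]) for pw, ps, y in data) / len(data)

def fit(data, n_classes, lr=0.5, epochs=100):
    """Fit the fusion weights with full-batch gradient descent, starting from zeros."""
    dim = 2 * n_classes
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    n = len(data)
    for _ in range(epochs):
        gW = [[0.0] * dim for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for pw, ps, y in data:
            x = pw + ps
            q = fuse(W, b, pw, ps)
            for k in range(n_classes):
                err = q[k] - (1.0 if k == y else 0.0)  # softmax cross-entropy gradient
                gb[k] += err
                for j in range(dim):
                    gW[k][j] += err * x[j]
        for k in range(n_classes):
            b[k] -= lr * gb[k] / n
            for j in range(dim):
                W[k][j] -= lr * gW[k][j] / n
    return W, b

def toy_branch(rng, y, n_classes, strength):
    """Synthetic branch output: noisy logits biased toward the true class y."""
    logits = [rng.gauss(0.0, 1.0) + (strength if k == y else 0.0)
              for k in range(n_classes)]
    return softmax(logits)

# Toy data only: two simulated branches with different reliabilities.
rng = random.Random(0)
K = 4
data = []
for _ in range(200):
    y = rng.randrange(K)
    data.append((toy_branch(rng, y, K, 2.0), toy_branch(rng, y, K, 1.5), y))

W, b = fit(data, K)
fused = fuse(W, b, data[0][0], data[0][1])  # fused distribution for one sample
```

In this sketch the learned weights decide how much each branch's probability for each class contributes to the fused decision, which is one plausible reading of "adaptive weighting in the probability space". In practice the fusion model would be fitted on held-out branch predictions rather than on the data the branches were trained on.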
