JOURNAL ARTICLE

Speech Emotion Recognition with a ResNet-CNN-Transformer Parallel Neural Network

Abstract

As a challenging pattern recognition task, speech emotion recognition has attracted increasing attention in recent years and is widely applied in medicine, affective computing, and other fields. In this paper, we propose a parallel ResNet-CNN-Transformer Encoder network. The ResNet branch alleviates the degradation problems caused by deepening the network. The CNN branch uses few parameters while increasing the fitting and expressive capacity of the network. Because traditional recurrent neural networks suffer from long-term dependence when extracting features from speech and text sequences, and their sequential nature prevents them from capturing long-distance features, the multi-head attention mechanism of the Transformer encoder layer is used to process the sequence in parallel, improving processing speed and extracting the emotional semantic information in the sequence. Experiments are carried out on the RAVDESS dataset. Our results demonstrate the effectiveness of the proposed method, which achieves a significant improvement over previous results.
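The abstract names three branches computed in parallel and fused for classification: a ResNet branch, a plain CNN branch, and a Transformer encoder branch. A minimal PyTorch sketch of that idea follows, assuming mel-spectrogram input and the eight emotion classes of RAVDESS; the layer counts, kernel sizes, and feature dimensions are illustrative placeholders, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic 1-D residual block: two convolutions with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

class ParallelSER(nn.Module):
    """Illustrative parallel ResNet / CNN / Transformer-encoder classifier.

    All sizes (n_mels=40, d_model=64, 2 blocks/layers) are assumptions,
    chosen only to make the three-branch structure concrete.
    """
    def __init__(self, n_mels=40, n_classes=8, d_model=64):
        super().__init__()
        # ResNet branch: skip connections ease training of deeper stacks.
        self.res_in = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)
        self.res_branch = nn.Sequential(ResidualBlock(d_model),
                                        ResidualBlock(d_model))
        # Plain CNN branch: a lightweight convolutional feature extractor.
        self.cnn_branch = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2), nn.ReLU())
        # Transformer branch: multi-head self-attention processes all
        # frames in parallel and captures long-distance dependencies.
        self.proj = nn.Linear(n_mels, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                               batch_first=True)
        self.trans_branch = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.classifier = nn.Linear(3 * d_model, n_classes)

    def forward(self, x):           # x: (batch, frames, n_mels)
        xc = x.transpose(1, 2)      # (batch, n_mels, frames) for Conv1d
        r = self.res_branch(self.res_in(xc)).mean(dim=2)  # global avg pool
        c = self.cnn_branch(xc).mean(dim=2)
        t = self.trans_branch(self.proj(x)).mean(dim=1)
        # Fuse the three branch embeddings and classify the emotion.
        return self.classifier(torch.cat([r, c, t], dim=1))
```

Running a batch of two 100-frame spectrograms through `ParallelSER()` yields logits of shape `(2, 8)`, one score per RAVDESS emotion class.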

Keywords:
Computer science, Transformer Encoder, Feature extraction, Speech recognition, Artificial neural network, Recurrent neural network, Artificial intelligence, Pattern recognition (psychology), Time delay neural network, Long short-term memory

Metrics

Cited By: 30
FWCI (Field-Weighted Citation Impact): 5.89
References: 12
Citation Normalized Percentile: 0.96 (in top 10%)

Topics

Emotion and Mood Recognition (Social Sciences → Psychology → Experimental and Cognitive Psychology)
Speech and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Face and Expression Recognition (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
© 2026 ScienceGate Book Chapters — All rights reserved.