JOURNAL ARTICLE

A Lightweight Multi-Scale Model for Speech Emotion Recognition

Hengduo Li, Daqi Zhao, Jingwen Wang, Deqiang Wang

Year: 2024; Journal: IEEE Access; Vol: 12; Pages: 130228-130240; Publisher: Institute of Electrical and Electronics Engineers

Abstract

Recognizing emotional states from speech is essential for human-computer interaction. It is challenging to realize effective speech emotion recognition (SER) on platforms with limited memory capacity and computing power. In this paper, we propose a lightweight multi-scale deep neural network architecture for SER, which takes Mel Frequency Cepstral Coefficients (MFCCs) as input. To realize effective multi-scale feature extraction, we propose a new Inception module, named A_Inception. A_Inception combines the merits of the Inception module and attention-based rectified linear units (AReLU), and can thus learn multi-scale features adaptively at low computational cost. Meanwhile, to extract the most important emotional information, we propose a new multiscale cepstral attention and temporal-cepstral attention (MCA-TCA) module. The idea of the MCA-TCA module is to focus on the key cepstral components and the key temporal-cepstral positions. Furthermore, a loss function combining Softmax loss and Center loss is adopted to supervise model training so as to enhance the model’s discriminative power. Experiments have been carried out on the IEMOCAP, EMO-DB, and SAVEE datasets to verify the performance of the proposed model and to compare it with state-of-the-art SER models. Numerical results reveal that the proposed model has a small number of parameters (0.82 M) and a much lower computational cost (81.64 MFLOPs) than the compared models, and achieves impressive accuracy on all datasets considered.
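The combined loss mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used form L = L_softmax + λ · L_center, where the Center loss is half the squared distance between a sample's feature vector and the center of its class. The weighting `lam`, the batch averaging, and all function names below are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math

def softmax_cross_entropy(logits, label):
    """Softmax loss for one sample: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def center_loss(feature, center):
    """Center loss for one sample: half the squared distance to its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))

def joint_loss(features, logits, labels, centers, lam=0.01):
    """Batch mean of softmax loss + lam * center loss.

    lam is a hypothetical weighting; the abstract does not state its value.
    """
    total = 0.0
    for x, z, y in zip(features, logits, labels):
        total += softmax_cross_entropy(z, y) + lam * center_loss(x, centers[y])
    return total / len(labels)

# Toy batch: two samples, two emotion classes, 2-D features (illustrative values).
features = [[1.0, 0.0], [0.0, 1.0]]
logits = [[2.0, 0.0], [0.0, 2.0]]
labels = [0, 1]
centers = {0: [1.0, 0.0], 1: [0.0, 1.0]}
loss = joint_loss(features, logits, labels, centers)
```

In this toy batch each feature sits exactly on its class center, so the Center term vanishes and only the Softmax term contributes. In training, the centers would typically be learned or updated alongside the network weights, which pulls same-class features together while the Softmax term keeps classes separable.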

Keywords:
Computer science; Softmax function; Discriminative model; Mel-frequency cepstrum; Speech recognition; Feature extraction; Key (lock); Artificial intelligence; Feature (linguistics); Artificial neural network; Task (project management); Deep learning; Pattern recognition (psychology)

Metrics

Cited By: 3
FWCI (Field Weighted Citation Impact): 3.29
Refs: 39
Citation Normalized Percentile: 0.85

Topics

Emotion and Mood Recognition
Social Sciences →  Psychology →  Experimental and Cognitive Psychology
Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing