A Lightweight Multi-Scale Model for Speech Emotion Recognition

Hengduo Li; Daqi Zhao; Jingwen Wang; Deqiang Wang

doi:10.1109/access.2024.3432813

JOURNAL ARTICLE

Year: 2024; Journal: IEEE Access; Vol: 12; Pages: 130228-130240; Publisher: Institute of Electrical and Electronics Engineers

Abstract

Recognizing emotional states from speech is essential for human-computer interaction, but effective speech emotion recognition (SER) is difficult to achieve on platforms with limited memory capacity and computing power. In this paper, we propose a lightweight multi-scale deep neural network architecture for SER that takes Mel-frequency cepstral coefficients (MFCCs) as input. To realize effective multi-scale feature extraction, we propose a new Inception module, named A_Inception. A_Inception combines the merits of the Inception module and attention-based rectified linear units (AReLU), and can thus learn multi-scale features adaptively at low computational cost. To extract the most important emotional information, we also propose a new multi-scale cepstral attention and temporal-cepstral attention (MCA-TCA) module, which focuses on the key cepstral components and the key temporal-cepstral positions. Furthermore, a loss function combining Softmax loss and Center loss supervises model training to enhance the model’s discriminative power. Experiments on the IEMOCAP, EMO-DB and SAVEE datasets verify the performance of the proposed model and compare it with state-of-the-art SER models. Numerical results show that the proposed model has a small number of parameters (0.82 M) and a much lower computational cost (81.64 MFLOPs) than the compared models, and achieves impressive accuracy on all datasets considered.
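The combined objective mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used form of the two terms: softmax cross-entropy plus a weighted center loss, where the center loss is half the squared distance between a feature vector and the center of its class. The function names, the weight `lam`, and its default value are hypothetical; the abstract does not state how the terms are weighted or how the class centers are updated during training.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def center_loss(feature, center):
    """Half the squared Euclidean distance between a feature and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))


def joint_loss(batch_logits, batch_features, labels, centers, lam=0.01):
    """Batch mean of softmax loss + lam * center loss.

    lam is a hypothetical weighting; the abstract does not give a value.
    centers maps each class label to its (assumed fixed) center vector.
    """
    total = 0.0
    for logits, feature, label in zip(batch_logits, batch_features, labels):
        total += softmax_cross_entropy(logits, label)
        total += lam * center_loss(feature, centers[label])
    return total / len(labels)


if __name__ == "__main__":
    # Toy batch: two utterances, four emotion classes, 3-dimensional features.
    logits = [[2.0, 0.5, 0.1, 0.1], [0.2, 0.3, 1.5, 0.0]]
    features = [[0.9, 0.1, 0.0], [0.0, 0.2, 1.1]]
    labels = [0, 2]
    centers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.5]]
    print(joint_loss(logits, features, labels, centers, lam=0.1))
```

The softmax term separates the classes, while the center term pulls features of the same class toward a shared center, which is the usual motivation for pairing the two when more discriminative features are wanted.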

Keywords:

Computer science; Softmax function; Discriminative model; Mel-frequency cepstrum; Speech recognition; Feature extraction; Key (lock); Artificial intelligence; Feature (linguistics); Artificial neural network; Task (project management); Deep learning; Pattern recognition (psychology)

Metrics

FWCI (Field Weighted Citation Impact): 3.29

Citation Normalized Percentile: 0.85

Topics

Emotion and Mood Recognition

Social Sciences → Psychology → Experimental and Cognitive Psychology

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

Lightweight Speech Emotion Recognition Model Based on Multi-Task Learning

Multi-scale Aggregation Network for Speech Emotion Recognition

Multi-Scale Temporal Transformer For Speech Emotion Recognition

TLBT-Net: A Multi-scale Cross-fusion Model for Speech Emotion Recognition

A Lightweight Speech Emotion Recognition Model with Bias-Focal Loss