Enhancing Dimensional Emotion Recognition from Speech through Modulation-Filtered Cochleagram and Parallel Attention Recurrent Network

Zhichao Peng; Zeng Hua; Yongwei Li; Yegang Du; Jianwu Dang

doi:10.3390/electronics12224620

JOURNAL ARTICLE

Year: 2023
Journal: Electronics
Vol: 12 (22)
Pages: 4620-4620
Publisher: Multidisciplinary Digital Publishing Institute

Abstract

Dimensional emotion can better describe rich and fine-grained emotional states than categorical emotion. In the realm of human–robot interaction, the ability to continuously recognize dimensional emotions from speech empowers robots to capture the temporal dynamics of a speaker’s emotional state and adjust their interaction strategies in real time. In this study, we present an approach to enhance dimensional emotion recognition through a modulation-filtered cochleagram and a parallel attention recurrent neural network (PA-net). First, the multi-resolution modulation-filtered cochleagram is derived from speech signals through auditory signal processing. Subsequently, the PA-net is employed to establish multi-temporal dependencies from diverse scales of features, enabling the tracking of the dynamic variations in dimensional emotion within auditory modulation sequences. The results obtained from experiments conducted on the RECOLA dataset demonstrate that, at the feature level, the modulation-filtered cochleagram surpasses the other assessed features in predicting valence and arousal. Particularly noteworthy is its pronounced superiority in scenarios characterized by a high signal-to-noise ratio. At the model level, the PA-net attains the highest predictive performance for both valence and arousal, clearly outperforming alternative regression models. Furthermore, the experiments carried out on the SEWA dataset demonstrate the substantial enhancements brought about by the proposed method in valence and arousal prediction. These results collectively highlight the effectiveness of our approach in advancing the field of dimensional speech emotion recognition.
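
As a rough illustration of the feature-extraction idea in the abstract, the sketch below builds a toy modulation-filtered cochleagram: the signal is split into acoustic-frequency channels, a temporal envelope is taken per channel, and each envelope is split again into modulation-frequency bands. Everything specific here is an assumption made for illustration and is not taken from the paper: the brick-wall FFT filters standing in for an auditory filterbank, the band edges, the frame lengths, and all function names. It does not attempt to reproduce the PA-net.

```python
import numpy as np

def bandpass_fft(x, fs, lo, hi):
    """Zero-phase brick-wall band-pass along the last axis (illustrative stand-in filter)."""
    spec = np.fft.rfft(x, axis=-1)
    freqs = np.fft.rfftfreq(x.shape[-1], d=1.0 / fs)
    mask = (freqs >= lo) & (freqs < hi)
    return np.fft.irfft(spec * mask, n=x.shape[-1], axis=-1)

def envelope(x):
    """Temporal envelope: magnitude of the analytic signal, computed via FFT."""
    n = x.shape[-1]
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x, axis=-1) * h, axis=-1))

def modulation_filtered_cochleagram(signal, fs, acoustic_edges, modulation_edges):
    """Return an array of shape (n_modulation_bands, n_acoustic_channels, n_samples)."""
    # Stage 1: acoustic-frequency channels (assumed band edges, not the paper's filterbank).
    channels = np.stack([bandpass_fft(signal, fs, lo, hi)
                         for lo, hi in zip(acoustic_edges[:-1], acoustic_edges[1:])])
    # Stage 2: one temporal envelope per acoustic channel.
    env = envelope(channels)
    # Stage 3: split every envelope into modulation-frequency bands.
    return np.stack([bandpass_fft(env, fs, lo, hi)
                     for lo, hi in zip(modulation_edges[:-1], modulation_edges[1:])])

def frame_average(x, frame_len):
    """Average non-overlapping frames along the last axis (one temporal resolution)."""
    n_frames = x.shape[-1] // frame_len
    trimmed = x[..., :n_frames * frame_len]
    return trimmed.reshape(*x.shape[:-1], n_frames, frame_len).mean(axis=-1)

# Toy input: a 1 kHz tone whose amplitude is modulated at 4 Hz, one second at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
signal = (1.0 + 0.8 * np.cos(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)

acoustic_edges = [100, 500, 2000, 3500]   # Hz, hypothetical
modulation_edges = [0.5, 2, 8, 32]        # Hz, hypothetical

mfc = modulation_filtered_cochleagram(signal, fs, acoustic_edges, modulation_edges)
energy = (mfc ** 2).sum(axis=-1)          # energy per (modulation band, acoustic channel)

# Two temporal resolutions of the same representation (10 ms and 40 ms frames).
multi_res = [frame_average(mfc, 80), frame_average(mfc, 320)]
```

In this toy case the energy concentrates in the acoustic channel containing 1 kHz and the modulation band containing 4 Hz, which is the kind of separation such a representation is meant to expose. The two frame-averaged arrays only hint at the multi-resolution aspect; how the paper actually derives its resolutions is not stated in this abstract.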

Keywords:

Speech recognition, Valence (chemistry), Computer science, Arousal, Categorical variable, Artificial neural network, Artificial intelligence, Pattern recognition (psychology), Machine learning, Psychology

Topics

Emotion and Mood Recognition

Social Sciences → Psychology → Experimental and Cognitive Psychology

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

EEG and Brain-Computer Interfaces

Life Sciences → Neuroscience → Cognitive Neuroscience
