Ahlam Hashem, Muhammad Arif, Manal Alghamdi, Mohammed A. Al Ghamdi, Sultan H. Almotiri
Speech emotion recognition (SER) plays a pivotal role in enabling machines to infer human subjective emotions from audio information alone. This capability is essential for effective communication and an improved user experience in human-computer interaction (HCI). Recent studies have successfully integrated temporal and spatial features to improve recognition accuracy. This study presents a novel approach that integrates parallel convolutional neural networks (CNNs) with a Transformer encoder and incorporates a collaborative attention mechanism (co-attention) to extract spatiotemporal features from audio samples. The proposed model is evaluated on multiple datasets using various fusion methods; parallel CNNs combined with a Transformer encoder and hierarchical co-attention yield the most promising performance. On version 1 of the ASVP-ESD dataset, the proposed model achieves a weighted accuracy (WA) of 70% and an unweighted accuracy (UW) of 67%; on version 2, it achieves a WA of 52% and a UW of 45%. The model was also evaluated on the ShEMO dataset to confirm its robustness and effectiveness, achieving a UW of 68%. These comprehensive evaluations across multiple datasets highlight the generalizability of the proposed approach.
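The sketch below illustrates the general architecture the abstract describes: parallel CNN branches extract spatial features from a spectrogram, a Transformer encoder models its temporal structure, and a co-attention step fuses the two streams. This is a minimal PyTorch sketch, not the authors' exact configuration: all layer sizes, the two-branch CNN design, and the single cross-attention step (the paper uses a hierarchical co-attention) are illustrative assumptions, as is the log-mel spectrogram input shape.

```python
import torch
import torch.nn as nn

class ParallelCNNTransformerSER(nn.Module):
    """Minimal sketch of parallel CNNs + Transformer encoder fused by
    co-attention. All hyperparameters are illustrative assumptions."""
    def __init__(self, n_mels=64, d_model=128, n_heads=4, n_classes=7):
        super().__init__()
        # Two parallel CNN branches with different kernel sizes capture
        # spatial (spectral-local) patterns at different scales.
        self.branch_a = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(),
            nn.Linear(16 * 8 * 8, d_model),
        )
        self.branch_b = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(),
            nn.Linear(16 * 8 * 8, d_model),
        )
        # Transformer encoder models temporal structure across frames.
        self.frame_proj = nn.Linear(n_mels, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Co-attention: CNN (spatial) features query the temporal tokens.
        self.co_attn = nn.MultiheadAttention(
            d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, mel):  # mel: (batch, 1, n_mels, frames)
        spatial = torch.stack(
            [self.branch_a(mel), self.branch_b(mel)], dim=1)   # (B, 2, d)
        temporal = self.encoder(
            self.frame_proj(mel.squeeze(1).transpose(1, 2)))   # (B, T, d)
        fused, _ = self.co_attn(spatial, temporal, temporal)   # (B, 2, d)
        return self.classifier(fused.mean(dim=1))              # (B, classes)

model = ParallelCNNTransformerSER()
logits = model(torch.randn(4, 1, 64, 300))  # 4 clips, 64 mel bins, 300 frames
print(logits.shape)  # torch.Size([4, 7])
```

Here the spatial features act as attention queries over the temporal sequence, which is one common way to realize co-attention between two feature streams; the paper's hierarchical variant would apply such fusion at multiple levels.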