Didar Ali, Muhammad Shahab, Yasir Saleem Afridi, Rehmat Ullah
Speech Emotion Recognition (SER) plays a crucial role in Human–Computer Interaction (HCI) by enabling systems to interpret and respond to human emotions through speech analysis. This paper presents a Transformer-based SER framework that leverages the Wav2Vec2 model for self-supervised representation learning. Unlike conventional approaches that rely on handcrafted acoustic features or shallow classifiers, our method employs transfer learning to extract high-level contextual embeddings from raw audio. We integrate two benchmark datasets, RAVDESS and TESS, to improve generalization across diverse speakers and emotions, and further analyze system robustness by introducing varying levels of environmental noise. The proposed model achieves an accuracy of 79.01%, with balanced precision, recall, and F1-scores, demonstrating competitive performance compared with recent state-of-the-art SER models. The main contributions of this work are threefold: (i) a novel evaluation of Wav2Vec2 embeddings on combined RAVDESS–TESS data, (ii) a systematic assessment of noise robustness in Transformer-based SER, and (iii) a comprehensive benchmark that highlights the strengths and limitations of transfer learning in practical emotion recognition scenarios. These findings suggest broad applicability in voice assistants, call-center analytics, and mental health monitoring, while future extensions may incorporate multimodal data and advanced fine-tuning strategies to further enhance performance.
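The abstract does not specify the toolkit, checkpoint, or pooling strategy used for embedding extraction, so the following is only a minimal sketch of the general technique it describes: extracting Wav2Vec2 contextual embeddings from raw audio and mixing in noise at a controlled SNR. The Hugging Face transformers library, the facebook/wav2vec2-base-960h checkpoint, mean-pooling, and the add_noise helper are all assumptions introduced here for illustration, not details from the paper.

```python
# Hypothetical sketch of Wav2Vec2 utterance embedding extraction for SER.
# Library, checkpoint, pooling, and noise-mixing details are assumptions;
# the paper does not specify them.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINT = "facebook/wav2vec2-base-960h"  # assumed pretrained checkpoint

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2Model.from_pretrained(CHECKPOINT)
model.eval()


def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into a clean signal at a requested SNR in dB (assumed setup
    for the noise-robustness experiments the abstract mentions)."""
    noise = np.resize(noise, clean.shape)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise


# Placeholder input: 3 seconds of 16 kHz audio. In practice this would be a
# clip loaded from RAVDESS/TESS (e.g., via torchaudio or librosa).
waveform = np.random.randn(3 * 16000).astype(np.float32)
noisy_waveform = add_noise(waveform, np.random.randn(3 * 16000).astype(np.float32), snr_db=10.0)

inputs = feature_extractor(noisy_waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Frame-level contextual embeddings: (batch, frames, hidden_size=768)
frame_embeddings = outputs.last_hidden_state
# Mean-pool over time to get one fixed-size vector per utterance, which a
# downstream emotion classifier can consume.
utterance_embedding = frame_embeddings.mean(dim=1)
print(utterance_embedding.shape)  # torch.Size([1, 768])
```

The frozen encoder plus a lightweight classifier head on the pooled embedding is one common transfer-learning setup; whether the authors freeze or fine-tune the encoder is not stated in the abstract.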