Didar Ali, Muhammad Shahab, Yasir Saleem Afridi, Rehmat Ullah
Speech Emotion Recognition (SER) plays a crucial role in Human–Computer Interaction (HCI) by enabling systems to interpret and respond to human emotions through speech analysis. This paper presents a Transformer-based SER framework that leverages the Wav2Vec2 model for self-supervised representation learning. Unlike conventional approaches that rely on handcrafted acoustic features or shallow classifiers, our method employs transfer learning to extract high-level contextual embeddings from raw audio. We integrate two benchmark datasets, RAVDESS and TESS, to improve generalization across diverse speakers and emotions, and further analyze system robustness by introducing varying levels of environmental noise. The proposed model achieves an accuracy of 79.01%, with balanced precision, recall, and F1-scores, demonstrating competitive performance compared with recent state-of-the-art SER models. The main contributions of this work are threefold: (i) a novel evaluation of Wav2Vec2 embeddings on combined RAVDESS–TESS data, (ii) a systematic assessment of noise robustness in Transformer-based SER, and (iii) a comprehensive benchmark that highlights the strengths and limitations of transfer learning in practical emotion recognition scenarios. These findings suggest broad applicability in voice assistants, call-center analytics, and mental health monitoring, while future extensions may incorporate multimodal data and advanced fine-tuning strategies to further enhance performance.
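The abstract does not specify the toolkit, checkpoint, or pooling strategy used for embedding extraction, so the following is only a minimal sketch of the general technique it describes: extracting Wav2Vec2 contextual embeddings from raw audio and mixing in noise at a controlled SNR. The Hugging Face transformers library, the facebook/wav2vec2-base-960h checkpoint, mean-pooling, and the add_noise helper are all assumptions introduced here for illustration, not details from the paper.

```python
# Hypothetical sketch of Wav2Vec2 utterance embedding extraction for SER.
# Library, checkpoint, pooling, and noise-mixing details are assumptions;
# the paper does not specify them.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINT = "facebook/wav2vec2-base-960h"  # assumed pretrained checkpoint

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2Model.from_pretrained(CHECKPOINT)
model.eval()


def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into a clean signal at a requested SNR in dB (assumed setup
    for the noise-robustness experiments the abstract mentions)."""
    noise = np.resize(noise, clean.shape)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise


# Placeholder input: 3 seconds of 16 kHz audio. In practice this would be a
# clip loaded from RAVDESS/TESS (e.g., via torchaudio or librosa).
waveform = np.random.randn(3 * 16000).astype(np.float32)
noisy_waveform = add_noise(waveform, np.random.randn(3 * 16000).astype(np.float32), snr_db=10.0)

inputs = feature_extractor(noisy_waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Frame-level contextual embeddings: (batch, frames, hidden_size=768)
frame_embeddings = outputs.last_hidden_state
# Mean-pool over time to get one fixed-size vector per utterance, which a
# downstream emotion classifier can consume.
utterance_embedding = frame_embeddings.mean(dim=1)
print(utterance_embedding.shape)  # torch.Size([1, 768])
```

The frozen encoder plus a lightweight classifier head on the pooled embedding is one common transfer-learning setup; whether the authors freeze or fine-tune the encoder is not stated in the abstract.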