JOURNAL ARTICLE

SwinTSER: An Improved Bilingual Speech Emotion Recognition Using Shift Window Transformer

Samson Akinpelu, Serestina Viriri, Muhammad Haroon Yousaf

Year: 2025
Journal: Cognitive Computation
Vol: 17 (4)
Publisher: Springer Science+Business Media

Abstract

Emotion recognition from human speech occupies a significant position in Human-Computer Interaction, especially with recent advances in Artificial Intelligence and robotic computing. As human-machine interactivity increases, the need for intuitive, emotionally aware responses has attracted considerable research into emotion recognition from speech signals. However, despite the many machine learning models in the literature, efficient cross-language speech emotion recognition that applies state-of-the-art deep learning techniques to features inherent in speech signals remains a serious challenge. In this paper, we propose a deep learning transformer network based on a shift window for speech emotion recognition using speech corpora from two different languages. The Shift Window Transformer (SWT) is based on a hierarchical transformer architecture designed for natural language tasks and has recently become a novel model in computer vision and image processing tasks. The input feature to the model, the Mel spectrogram, is extracted from two public speech datasets: the Toronto Emotional Speech Set (TESS) and EMOVO. After extensive experiments and parameter optimization, the proposed transformer model achieved recognition accuracies of 98.3%, 64%, and 66% on the TESS, EMOVO, and TESS_EMOVO (hybrid bilingual) datasets, respectively. Our performance evaluation shows that the proposed model improves on results reported in the literature for the recognition of six different emotions from human speech. The study explores the performance of the SWT architecture on cross-language speech emotion recognition and informs the future development of robust and adaptive models.
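
The abstract does not give implementation details, so the following is only a minimal sketch of the general shifted-window idea, not the authors' code. It assumes a Mel spectrogram stored as a 2D array (Mel bins by time frames), splits it into non-overlapping windows, and then repeats the split after a cyclic shift of half a window so that neighbouring windows can exchange information. The function names, the window size of 4, and the toy 8x8 input are all hypothetical choices made for illustration.

```python
import numpy as np


def window_partition(x, window):
    """Split a (H, W) array into non-overlapping (window, window) tiles."""
    h, w = x.shape
    assert h % window == 0 and w % window == 0, "H and W must be multiples of window"
    tiles = x.reshape(h // window, window, w // window, window)
    # Reorder to (tile_row, tile_col, row_in_tile, col_in_tile), then flatten tiles.
    return tiles.transpose(0, 2, 1, 3).reshape(-1, window, window)


def shifted_window_partition(x, window):
    """Cyclically shift by half a window along both axes, then partition."""
    shift = window // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window)


# Toy stand-in for a Mel spectrogram (assumed shape: 8 Mel bins x 8 frames).
spec = np.arange(64, dtype=np.float32).reshape(8, 8)

regular = window_partition(spec, 4)          # windows aligned to the grid
shifted = shifted_window_partition(spec, 4)  # windows straddling the old borders
```

In a full model, self-attention would be computed inside each window, with regular and shifted partitions alternating from layer to layer; that part is omitted here.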

Keywords:
Transformer, Speech recognition, Computer science, Window (computing), Natural language processing, Artificial intelligence, Voltage, Engineering

Metrics

Cited By: 4
FWCI (Field Weighted Citation Impact): 21.31
Refs: 62
Citation Normalized Percentile: 0.98

Topics

Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Emotion and Mood Recognition
Social Sciences →  Psychology →  Experimental and Cognitive Psychology
Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence