Samson Akinpelu, Serestina Viriri, Muhammad Haroon Yousaf
Abstract Emotion recognition from human speech occupies a significant position in Human-Computer Interaction, especially with recent advances in Artificial Intelligence and robotic computing. As human–machine interactivity increases, the demand for intuitive, emotion-aware responses has attracted considerable research into emotion recognition from speech signals. However, despite the many machine learning models in the literature, efficient cross-language speech emotion recognition that combines features inherent in speech signals with state-of-the-art deep learning techniques remains a serious challenge. In this paper, we propose a deep learning transformer network based on shifted windows for speech emotion recognition, using speech corpora from two different languages. The Shift Window Transformer (SWT) builds on the hierarchical transformer architecture originally developed for natural language tasks, which has recently emerged as a strong model for computer vision and image processing. The input feature to the model, the Mel spectrogram, is extracted from two public speech datasets: the Toronto Emotional Speech Set (TESS) and EMOVO. After extensive experiments and parameter optimization, the proposed transformer model achieved promising recognition accuracies of 98.3%, 64%, and 66% on the TESS, EMOVO, and TESS_EMOVO (hybrid bilingual) datasets, respectively. Our performance evaluation revealed that the proposed model improves the recognition of six different emotions from human auditory speech compared to other approaches in the literature. The study explores the performance of the SWT architecture on cross-language speech emotion recognition and informs the development of future robust and adaptive models.
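The abstract names the log-mel spectrogram as the input feature extracted from the speech corpora. The following is a minimal numpy-only sketch of that extraction step; the parameter values (16 kHz sample rate, 512-point FFT, 256-sample hop, 40 mel bands) and the synthetic sine-tone "clip" are illustrative assumptions, not the authors' actual settings, and a real pipeline would typically use a library such as librosa.

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK-style mel scale conversion.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filters mapping an FFT power spectrum to mel bands."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, center):          # rising slope
            fb[i - 1, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):         # falling slope
            fb[i - 1, j] = (right - j) / max(right - center, 1)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Log-mel spectrogram: windowed STFT power -> mel filterbank -> log."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    fb = mel_filterbank(n_mels, n_fft, sr)
    return np.log(power @ fb.T + 1e-10)        # shape: (n_frames, n_mels)

# Example: a 1-second 440 Hz tone as a stand-in for a speech clip.
sr = 16000
t = np.arange(sr) / sr
clip = np.sin(2 * np.pi * 440.0 * t)
spec = mel_spectrogram(clip, sr=sr)
print(spec.shape)  # (61, 40): 61 frames, 40 mel bands
```

The resulting time-by-mel-band matrix is the 2D "image" that a Swin-style windowed transformer can then partition into patches and attend over.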