Samson Akinpelu, Serestina Viriri, Muhammad Haroon Yousaf
Abstract Emotion recognition from human speech occupies a significant position in Human-Computer Interaction, especially with recent advances in Artificial Intelligence and robotic computing. As human–machine interactivity increases, the demand for intuitive, emotion-aware responses has attracted considerable research into emotion recognition from speech signals. However, despite the many machine learning models in the literature, efficient cross-language speech emotion recognition that combines features inherent in speech signals with state-of-the-art deep learning techniques remains a serious challenge. In this paper, we propose a deep learning transformer network based on shifted windows for speech emotion recognition, using speech corpora from two different languages. The Shift Window Transformer (SWT) builds on the hierarchical transformer architecture originally developed for natural language tasks, which has recently emerged as a strong model for computer vision and image processing. The input feature to the model, the Mel spectrogram, is extracted from two public speech datasets: the Toronto Emotional Speech Set (TESS) and EMOVO. After extensive experiments and parameter optimization, the proposed transformer model achieved promising recognition accuracies of 98.3%, 64%, and 66% on the TESS, EMOVO, and TESS_EMOVO (hybrid bilingual) datasets, respectively. Our performance evaluation revealed that the proposed model improves the recognition of six different emotions from human auditory speech compared to other approaches in the literature. The study explores the performance of the SWT architecture on cross-language speech emotion recognition and informs the development of future robust and adaptive models.
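The abstract names the log-mel spectrogram as the input feature extracted from the speech corpora. The following is a minimal numpy-only sketch of that extraction step; the parameter values (16 kHz sample rate, 512-point FFT, 256-sample hop, 40 mel bands) and the synthetic sine-tone "clip" are illustrative assumptions, not the authors' actual settings, and a real pipeline would typically use a library such as librosa.

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK-style mel scale conversion.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filters mapping an FFT power spectrum to mel bands."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, center):          # rising slope
            fb[i - 1, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):         # falling slope
            fb[i - 1, j] = (right - j) / max(right - center, 1)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Log-mel spectrogram: windowed STFT power -> mel filterbank -> log."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    fb = mel_filterbank(n_mels, n_fft, sr)
    return np.log(power @ fb.T + 1e-10)        # shape: (n_frames, n_mels)

# Example: a 1-second 440 Hz tone as a stand-in for a speech clip.
sr = 16000
t = np.arange(sr) / sr
clip = np.sin(2 * np.pi * 440.0 * t)
spec = mel_spectrogram(clip, sr=sr)
print(spec.shape)  # (61, 40): 61 frames, 40 mel bands
```

The resulting time-by-mel-band matrix is the 2D "image" that a Swin-style windowed transformer can then partition into patches and attend over.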