A Speech Emotion Recognition (SER) system classifies a speaker's emotional state into different categories based on their utterances. This study proposes a novel SER system using the cochleagram, an acoustic feature associated with human auditory perception of emotion. The proposed model employs a hybrid architecture comprising a Convolutional Neural Network (CNN) and a Gated Recurrent Unit (GRU) network, augmented with a self-attention mechanism. The model is evaluated on the BanglaSER and RAVDESS datasets; on BanglaSER it achieves a notable accuracy of 91.17% in categorizing five distinct emotions: angry, calm, happy, neutral, and sad. Furthermore, on the RAVDESS dataset, the model exhibits a solid accuracy of 78.35% in classifying eight diverse emotions. The incorporation of the cochleagram and the hybrid neural network design demonstrates the efficacy of the proposed SER system, offering a promising approach for precise and efficient emotion categorization in speech signals.
Atkia Namey, Khadija Akter, Md. Azad Hossain, M. Ali Akber Dewan
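The hybrid CNN–GRU architecture with self-attention described in the abstract can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: the layer sizes, number of convolutional blocks, attention heads, and the assumption of 64 cochleagram frequency bands are all hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

class CNNGRUAttention(nn.Module):
    """Hypothetical sketch: CNN front-end over a cochleagram, GRU over the
    time axis, self-attention pooling, and a linear classifier head."""

    def __init__(self, n_classes=5, n_channels=64, hidden=128):
        super().__init__()
        # Two conv blocks; each MaxPool2d(2) halves freq and time dimensions.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, n_channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Assumes 64 input frequency bands, so 64/4 = 16 remain after pooling.
        self.gru = nn.GRU(input_size=n_channels * 16, hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden, num_heads=4,
                                          batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                # x: (batch, 1, freq=64, time)
        f = self.cnn(x)                  # (batch, C, freq/4, time/4)
        b, c, fr, t = f.shape
        # Flatten channel x frequency into a feature vector per time step.
        seq = f.permute(0, 3, 1, 2).reshape(b, t, c * fr)
        h, _ = self.gru(seq)             # (batch, t, 2*hidden)
        a, _ = self.attn(h, h, h)        # self-attention over time steps
        return self.fc(a.mean(dim=1))    # mean-pool, then classify

model = CNNGRUAttention(n_classes=5)     # five BanglaSER emotion classes
logits = model(torch.randn(2, 1, 64, 128))  # 2 cochleagrams, 64 bands, 128 frames
print(logits.shape)                      # torch.Size([2, 5])
```

For the eight-class RAVDESS setting described in the abstract, the same sketch applies with `n_classes=8`.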