Manoara Begum, Md Akash Rahman, Tanjim Mahmud, Mohammad Shahadat Hossain, Karl Andersson
Speech Emotion Recognition (SER) is a challenging task in human-computer interaction (HCI) that relies on artificial intelligence and deep learning to classify emotional states from speech audio signals. Although Bangla is the seventh most widely spoken language in the world, it remains a low-resource language for SER owing to the scarcity of labeled datasets. This study aims to improve emotion recognition in Bengali speech using the SUBESCO and BanglaSER corpora, two audio-only Bangla emotional speech datasets. During preprocessing, noise was removed with envelope masking, and Mel-Frequency Cepstral Coefficients (MFCCs) were extracted to capture the key spectral features. The system employs machine learning models such as K-Nearest Neighbors (KNN), Random Forest, and Multi-Layer Perceptron (MLP), together with ensemble techniques such as Voting and Stacking classifiers, to optimize performance. In addition, deep learning architectures, namely Artificial Neural Networks (ANN), Convolutional Neural Networks (CNN), and Long Short-Term Memory (LSTM) networks, were implemented to process temporal and sequential speech patterns efficiently. The proposed models achieved high performance, reaching an accuracy of 95.92% on SUBESCO and 90.61% on BanglaSER. These findings confirm the effectiveness of the preprocessing techniques and the applied learning models, advancing Bangla SER and broadening research opportunities for low-resource languages.
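As a minimal sketch of the ensemble stage described in the abstract, the snippet below stacks KNN, Random Forest, and MLP base learners with scikit-learn. The feature matrix here is a synthetic stand-in for the per-utterance MFCC vectors extracted from SUBESCO/BanglaSER (the dimensions and hyperparameters are illustrative assumptions, not the paper's actual configuration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for MFCC feature vectors (e.g. 40 coefficients per
# utterance) and 5 emotion classes; the real pipeline would load these
# from the preprocessed SUBESCO/BanglaSER audio instead.
X, y = make_classification(n_samples=400, n_features=40, n_informative=20,
                           n_classes=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Base learners named in the abstract: KNN, Random Forest, MLP.
base_learners = [
    ("knn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier(random_state=0)),
    ("mlp", MLPClassifier(max_iter=1000, random_state=0)),
]

# StackingClassifier trains a meta-learner (logistic regression by
# default) on the base learners' cross-validated predictions.
stack = StackingClassifier(estimators=base_learners)
stack.fit(X_tr, y_tr)
acc = accuracy_score(y_te, stack.predict(X_te))
print(f"stacking accuracy on synthetic data: {acc:.2f}")
```

A Voting classifier would follow the same pattern with `sklearn.ensemble.VotingClassifier`, simply averaging (soft voting) or majority-voting (hard voting) the base predictions instead of learning a meta-model.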