Automatically recognizing emotion from speech with a computer is a challenging task. Speech emotion recognition (SER) has attracted sustained research interest for over three decades because of its wide range of applications across many industries, such as medical treatment, marketing, customer service, driving, internet search, and education. Researchers have explored many approaches to improving the accuracy of emotion classification. In our work, we used images of the mel frequency cepstral coefficients (MFCC), the mel-spectrogram, and a combination of both as feature inputs to a two-dimensional convolutional neural network (2D-CNN) classifier. We trained the model on each proposed feature individually and on the combined feature images. The experimental results show that the proposed combination of MFCC and mel-spectrogram features outperforms either feature alone for speech emotion recognition. To assess the efficacy of our features, we used three datasets: TESS, RAVDESS, and EMO-DB, on which we obtained emotion classification accuracies of 100%, 81.2%, and 88.89%, respectively.
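The abstract describes feeding MFCC and mel-spectrogram images, individually and fused, into a 2D-CNN. As a minimal NumPy sketch of how such features could be computed and stacked for feature-level fusion: the parameter choices below (sample rate 16 kHz, 512-point FFT, hop of 128, 40 mel bands, 13 coefficients) are illustrative assumptions, not the authors' actual settings, and a real pipeline would typically use a library such as librosa.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(y, sr, n_fft=512, hop=128, n_mels=40):
    """Windowed power spectra mapped through the mel filterbank."""
    window = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(y[s:s + n_fft] * window)) ** 2
              for s in range(0, len(y) - n_fft + 1, hop)]
    power = np.array(frames).T                    # (n_fft//2+1, n_frames)
    return mel_filterbank(sr, n_fft, n_mels) @ power

def mfcc_from_log_mel(log_mel, n_mfcc=13):
    """DCT-II of the log mel energies (the classic MFCC definition)."""
    n_mels = log_mel.shape[0]
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return basis @ log_mel

if __name__ == "__main__":
    sr = 16000
    t = np.linspace(0.0, 1.0, sr, endpoint=False)
    y = np.sin(2 * np.pi * 440.0 * t)             # synthetic stand-in for a speech clip
    log_mel = np.log(mel_spectrogram(y, sr) + 1e-10)
    coeffs = mfcc_from_log_mel(log_mel)
    combined = np.vstack([coeffs, log_mel])       # feature-level fusion of the two images
    print(log_mel.shape, coeffs.shape, combined.shape)
```

The stacked array can then be treated as a single-channel 2D image for a CNN; stacking along a channel axis instead would be an equally plausible fusion scheme.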
Fauzivy Reggiswarashari, Sari Widya Sihwi
Minh H. Pham, Farzan Majeed Noori, Jim Tørresen
Arun Kumar Dubey, Yogita Arora, Neha Gupta, Sarita Yadav, Achin Jain, Devansh Verma