Speech Emotion Recognition (SER) is a challenging task due to the complexity and variability of human emotions. In this paper, we propose an approach to improve SER performance on the EMODB dataset. Our approach employs data augmentation techniques, namely noise addition and spectrogram shifting, together with class balancing via random oversampling. We extract five features from each sample: MFCC, Chroma, Mel Spectrogram, ZCR, and RMS. We compare the performance of four classifiers (MLP, SVM, KNN, and CNN) with and without the proposed approach. Our results show that the proposed approach achieves higher accuracy and F1-score than a baseline without augmentation and balancing, with the MLP and CNN models reaching 100% accuracy. These findings highlight the effectiveness of data augmentation and balancing techniques in improving SER performance. Moreover, the approach holds potential for real-life applications such as mental health monitoring, human-robot interaction, and speech-based virtual assistants.
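The waveform-level augmentations named above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the helper names (`add_noise`, `time_shift`), the noise factor, and the shift amount are assumptions, and the circular time shift stands in for the spectrogram-shift augmentation described in the abstract.

```python
import numpy as np

def add_noise(signal, noise_factor=0.005, rng=None):
    # Additive Gaussian noise, scaled relative to the signal's
    # peak amplitude (noise_factor is a hypothetical default).
    rng = rng if rng is not None else np.random.default_rng(0)
    noise = rng.standard_normal(signal.shape)
    return signal + noise_factor * np.max(np.abs(signal)) * noise

def time_shift(signal, shift):
    # Circular shift of the waveform; a simple time-domain proxy
    # for the spectrogram-shift augmentation.
    return np.roll(signal, shift)

# Example: augment a 1-second synthetic 440 Hz tone at 16 kHz.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy = add_noise(clean)
shifted = time_shift(clean, sr // 10)
```

Each augmented copy keeps the original emotion label, so minority emotion classes can also be grown this way before random oversampling balances the remaining class counts.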
Tanisha Kapoor, Arnaja Ganguly, D Rajeswari