This research applies deep learning to speech emotion recognition (SER), the task of detecting emotion from voice recordings. Accurate vocal emotion recognition has several applications, including human-computer interaction, virtual assistants, and healthcare, where diagnosis can benefit from detailed acoustic analysis and emotion recognition. The study uses emotion-labeled spoken utterances to train deep learning models, chiefly convolutional neural networks (CNNs), which are popular for SER because they can learn the complex voice-signal patterns that indicate different emotional states. This paper presents a comprehensive framework for SER from recorded audio samples, drawing on advances in digital signal processing. The models are trained on speech features extracted from the dataset, namely spectrograms and pitch representations, which capture vocal-tract and pitch characteristics of the speech stream. After training, each model's classification accuracy, i.e., its ability to correctly recognize the emotional content of unseen speech samples, is evaluated, and the best model is compared against state-of-the-art methods. In this work, a VGG16-based CNN outperformed a CNN trained directly on mel-spectrogram features: the mel-spectrogram CNN achieved 89% accuracy on the emotional sound samples, and transfer learning (CNN-VGG16) improved on this result. Other classifiers such as SVM, logistic regression, decision tree, and random forest yielded lower accuracy (60%-75%). Further research should explore composite feature sets for improved classification.
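As a minimal illustration of the log-mel-spectrogram features that SER pipelines like this one typically feed to a CNN, the sketch below computes one from scratch with NumPy. The parameter values (16 kHz sample rate, 512-point FFT, 40 mel bands) are illustrative assumptions, not the paper's actual settings, and a synthetic tone stands in for a real utterance.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope
            if center > left:
                fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):         # falling slope
            if right > center:
                fb[i - 1, k] = (right - k) / (right - center)
    return fb

def log_mel_spectrogram(y, sr=16000, n_fft=512, hop=256, n_mels=40):
    # Frame the signal, apply a Hann window, take the power STFT.
    n_frames = 1 + (len(y) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([y[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Project onto mel bands and log-compress; the result is the
    # 2-D "image" a CNN would consume.
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-10)

# Example: one second of a 440 Hz tone as a stand-in for an utterance.
sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440.0 * t)
S = log_mel_spectrogram(y, sr)
print(S.shape)  # (time frames, mel bands)
```

In practice a library such as librosa would compute these features, and the resulting 2-D arrays would be resized and stacked into the three-channel input that a pretrained VGG16 expects for transfer learning.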
Prof. Martina Dsouza, Rohan Adhav, Shivam Dubey, Sachin Dwivedi