Abstract—Identifying emotion from speech has a wide range of applications and has drawn special interest in research aimed at improving the human-computer interaction experience. Traditional machine learning approaches usually face the challenge of selecting an optimal feature set for each application. Deep learning, on the other hand, allows end-to-end development of models with inherent feature extraction. In this study, we evaluate the performance of a convolutional neural network (CNN) on different kinds of spectral features of acoustic signals from two popular open databases: the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Berlin Database of Emotional Speech (EmoDB). The deep learning model identifies two to eight classes of emotions for RAVDESS and two to seven classes for EmoDB. In terms of unweighted average recall, the results are 0.888 (two classes) and 0.694 (eight classes) for the RAVDESS dataset; the corresponding results for the EmoDB dataset are 0.993 (two classes) and 0.764 (seven classes).
Fauzivy Reggiswarashari, Sari Widya Sihwi
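The abstract reports results as unweighted average recall (UAR), i.e., the mean of per-class recalls, so that every emotion class counts equally regardless of its sample count. A minimal sketch of the metric (the function name and the toy labels are hypothetical, not taken from the paper's data):

```python
import numpy as np

def unweighted_average_recall(y_true, y_pred):
    # UAR: average the recall of each class, weighting all classes equally
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

# toy two-class example (hypothetical labels)
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])
# class 0 recall = 3/4, class 1 recall = 1/2, so UAR = 0.625
print(unweighted_average_recall(y_true, y_pred))  # → 0.625
```

This is equivalent to macro-averaged recall (e.g., scikit-learn's `balanced_accuracy_score` for the two-class case), which is why UAR is preferred over plain accuracy on class-imbalanced emotion corpora.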