This paper describes an automatic emotion recognition (AER) system that combines prosodic features extracted at the utterance level and at the syllable level to recognize the emotional content of an utterance. The prosodic features are extracted after identifying speech/non-speech intervals, followed by syllable-level segmentation. The chosen prosodic features include parameters representing the dynamics of pitch and energy, along with duration information. Two separate classifiers are built using Deep Neural Networks (DNNs). The decision scores from both levels are fused to identify the emotion of a test utterance from the German Emotion Database (Emo-DB), which contains seven emotions: anger, boredom, disgust, fear, happiness, sadness, and neutral. The proposed system gives a Weighted Average Recall (WAR) of 58.88% for both the utterance-level and the syllable-level prosodic features. Fusing the two sets of scores by simple addition raises the overall WAR to 61.68%.
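As a minimal sketch of the score-level fusion step described above, the snippet below sums the per-class decision scores of the two DNN classifiers and picks the emotion with the highest fused score. It assumes each classifier outputs one score per Emo-DB emotion class; the class ordering and the example score vectors are hypothetical, not taken from the paper.

```python
import numpy as np

# Emo-DB emotion classes (ordering here is an assumption).
EMOTIONS = ["anger", "boredom", "disgust", "fear",
            "happiness", "sadness", "neutral"]

def fuse_scores(utterance_scores: np.ndarray,
                syllable_scores: np.ndarray) -> str:
    """Fuse decision scores by simple addition and return the
    emotion with the highest fused score."""
    fused = utterance_scores + syllable_scores
    return EMOTIONS[int(np.argmax(fused))]

# Hypothetical per-class scores from the utterance-level and
# syllable-level DNNs for one test utterance.
utt = np.array([0.10, 0.05, 0.05, 0.15, 0.40, 0.05, 0.20])
syl = np.array([0.05, 0.10, 0.05, 0.30, 0.35, 0.05, 0.10])

print(fuse_scores(utt, syl))  # -> happiness
```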
Starlet Ben Alex, Leena Mary, Ben P. Babu