The core of a Speech Emotion Recognition (SER) system is to extract features that best represent speech emotion and to construct an acoustic model with strong robustness and generalization. In this study, a heterogeneous parallel Recurrent Neural Network (RNN) model based on the attention mechanism, termed AHPCL, is constructed for SER. A Long Short-Term Memory (LSTM) network extracts the temporal features of speech emotion, while convolution operations extract spatial spectral features. Combining temporal and spatial information to jointly represent speech emotion improves the accuracy of the predictions. The attention mechanism assigns weights according to the contribution of different time-series features to speech emotion, so as to select the time steps that best represent the emotion from a large amount of feature information. Low-level descriptor features such as pitch, Zero Crossing Rate (ZCR), and Mel-Frequency Cepstral Coefficients (MFCC) are extracted from three speech emotion databases, namely CASIA, EMODB, and SAVEE, and high-level statistical functions of these low-level descriptors are computed to obtain 219-dimensional features. The experimental results show that the proposed model achieves 86.02%, 84.03%, and 64.06% Unweighted Average Recall (UAR) on the CASIA, EMODB, and SAVEE databases, respectively. Compared with the LeNet, DNN-ELM, and TSFFCNN baseline models, the AHPCL model exhibits greater robustness and generalization.
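The low-level-descriptor-plus-functionals pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's exact configuration: the frame length, hop size, and the particular statistical functionals (mean, standard deviation, minimum, maximum) are assumptions for demonstration, and only ZCR is computed here; pitch and MFCC tracks would be processed the same way and their functionals concatenated into the final feature vector.

```python
import numpy as np

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return float(np.mean(signs[:-1] != signs[1:]))

def frame_signal(x: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Split a waveform into overlapping frames (illustrative sizes: 25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def functionals(lld_track: np.ndarray) -> np.ndarray:
    """High-level statistical functionals over a per-frame LLD track."""
    return np.array([lld_track.mean(), lld_track.std(),
                     lld_track.min(), lld_track.max()])

# Toy waveform standing in for one utterance (1 s of noise at 16 kHz).
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)

frames = frame_signal(x)                                  # (num_frames, frame_len)
zcr_track = np.array([zero_crossing_rate(f) for f in frames])
feature_vector = functionals(zcr_track)                   # 4 functionals of one LLD
```

In the full system, applying a set of functionals to every low-level descriptor (pitch, ZCR, MFCCs, and so on) and concatenating the results is what yields a fixed-length utterance-level vector such as the 219-dimensional features used here, regardless of utterance duration.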