Current Speech Emotion Recognition (SER) models suffer from large numbers of trainable parameters, poor generalization, and low emotion recognition accuracy. When sample data are limited, it is therefore particularly important to build a lightweight model that improves recognition efficiency and accuracy. To this end, this paper proposes a lightweight end-to-end multi-task deep learning model named P-CNN+Gender, which is composed of three parts: a speech feature combination network, a body convolutional network responsible for emotion and gender feature extraction, and an emotion and gender classifier. The model takes the Mel-Frequency Cepstral Coefficients (MFCC) of speech as input. The feature combination network applies convolutional kernels of different sizes to the MFCC features in parallel and combines the outputs, from which the subsequent body convolutional network extracts emotion and gender features. Finally, considering the correlation between emotional expression and gender, gender classification is integrated into emotion classification as an auxiliary task to improve the model's emotion classification performance. Tested on the IEMOCAP, Emo-DB, and CASIA speech emotion datasets, the model achieves Unweighted Accuracy (UA) of 73.3%, 96.4%, and 93.9%, which are 3.0, 5.8, and 6.5 percentage points higher than the P-CNN model, respectively. The model has only 1/10 to 1/2 as many trainable parameters as other models such as 3D-ACRNN and CNNBiRNN, while achieving faster processing and higher accuracy.
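To make the auxiliary-task idea and the UA metric concrete, the following is a minimal sketch and not the paper's implementation. It assumes the multi-task objective is a weighted sum of two cross-entropy terms and that UA is computed as the mean of per-class recalls, as is common in SER work; the abstract states neither the loss formula nor any weighting. The function names and the `gender_weight` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class index."""
    return -math.log(softmax(logits)[target])


def multitask_loss(emotion_logits, emotion_label,
                   gender_logits, gender_label,
                   gender_weight=0.5):
    """Emotion loss plus a weighted auxiliary gender loss.

    gender_weight is an assumed hyperparameter for illustration only;
    the abstract does not say how the two tasks are balanced.
    """
    emotion_term = cross_entropy(emotion_logits, emotion_label)
    gender_term = cross_entropy(gender_logits, gender_label)
    return emotion_term + gender_weight * gender_term


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)
```

Under this definition, a classifier that always predicts the majority class scores poorly on UA even when its plain accuracy is high, which is why UA is a natural choice for imbalanced emotion datasets.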
Huijuan Zhao, Zhijie Han, Ruchuan Wang
Xingyu Cai, Jiahong Yuan, Renjie Zheng, Liang Huang, Kenneth Church
Ruichu Cai, Kaibin Guo, Boyan Xu, Xiaoyan Yang, Zhenjie Zhang
Pengcheng Yue, Leyuan Qu, Shukai Zheng, Taihao Li