As a major branch of speech processing, speech emotion recognition has attracted considerable research attention. Prior work has proposed a variety of models and feature sets for training such systems. In this paper, we propose using semi-supervised learning with ladder networks to generate robust feature representations for speech emotion recognition. In our method, the input to the ladder network is a set of normalized static acoustic features, which are mapped to high-level hidden representations. The model is trained by back-propagation to minimize the sum of a supervised and an unsupervised cost function simultaneously. The extracted hidden representations are then used as emotional features in an SVM model for speech emotion recognition. Experimental results on the IEMOCAP database show 2.6% higher performance than a denoising auto-encoder and 5.3% higher than the static acoustic features.
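The combined training objective can be illustrated with a minimal sketch. It assumes the usual ladder-network formulation, in which the supervised term is a cross-entropy on the classifier output and the unsupervised term is a weighted sum of layer-wise squared reconstruction errors. The abstract does not specify the exact cost functions, so the function names, the squared-error form, and the `layer_weights` parameter are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def supervised_cost(logits, labels):
    # Mean cross-entropy over the labelled examples (assumed form).
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(picked + 1e-12))

def unsupervised_cost(clean_layers, recon_layers, layer_weights):
    # Weighted sum of per-layer squared reconstruction errors (assumed form).
    # clean_layers: activations of the clean encoder, one array per layer.
    # recon_layers: decoder reconstructions of those activations.
    total = 0.0
    for clean, recon, weight in zip(clean_layers, recon_layers, layer_weights):
        total += weight * np.mean(np.sum((clean - recon) ** 2, axis=1))
    return total

def ladder_cost(logits, labels, clean_layers, recon_layers, layer_weights):
    # The quantity minimized by back-propagation: supervised + unsupervised.
    return (supervised_cost(logits, labels)
            + unsupervised_cost(clean_layers, recon_layers, layer_weights))
```

In this sketch, unlabelled utterances would contribute only to the unsupervised term, which is what allows the hidden representations to benefit from unlabelled data.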
Jianhua Tao, Jian Huang, Ya Li, Zheng Lian, Mingyue Niu