Speech Emotion Recognition (SER) is becoming necessary for interactive spoken dialogue systems as users increasingly expect empathy from computers. Recent work has shown the importance of approaching this problem from a multimodal perspective, with models that combine visual, acoustic, and lexical features outperforming models based on single modalities. However, current SER models are not robust to out-of-domain data, partly because emotion-labeled corpora are generally small. This paper outlines my PhD research plan, which aims to improve SER models by jointly training them with an Automatic Speech Recognition (ASR) model using a novel cross-task semi-supervised learning approach on unlabeled data. The ASR model would also benefit from this training approach and would serve as the provider of lexical features. This joint ASR-SER model is expected to alleviate the data-scarcity problem and to be applicable to real-life settings such as human-computer interaction and digital health.
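To make the proposed joint training concrete, the sketch below shows one possible reading of the idea, not the thesis's actual implementation: a shared acoustic encoder feeds an ASR head trained with a CTC loss on transcribed speech and an SER head trained with cross-entropy on emotion-labeled speech, while unlabeled speech contributes a semi-supervised term via confidence-filtered pseudo-labels. All module names, dimensions, and the pseudo-labeling scheme are illustrative assumptions.

```python
# Hypothetical sketch of joint ASR-SER training with a shared encoder.
# The cross-task semi-supervised term here uses simple pseudo-labeling;
# the actual method proposed in the thesis may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAsrSer(nn.Module):
    def __init__(self, n_mels=80, hidden=256, vocab_size=32, n_emotions=4):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.asr_head = nn.Linear(2 * hidden, vocab_size)   # per-frame token logits
        self.ser_head = nn.Linear(2 * hidden, n_emotions)   # utterance-level logits

    def forward(self, feats):
        enc, _ = self.encoder(feats)                 # (B, T, 2*hidden)
        asr_logits = self.asr_head(enc)              # (B, T, vocab)
        ser_logits = self.ser_head(enc.mean(dim=1))  # mean-pooled over time
        return asr_logits, ser_logits

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def training_step(model, asr_batch, ser_batch, unlabeled_feats, tau=0.9):
    # Supervised ASR loss on transcribed speech.
    asr_logits, _ = model(asr_batch["feats"])
    log_probs = asr_logits.log_softmax(-1).transpose(0, 1)   # (T, B, vocab)
    loss_asr = ctc(log_probs, asr_batch["tokens"],
                   asr_batch["feat_lens"], asr_batch["token_lens"])

    # Supervised SER loss on emotion-labeled speech.
    _, ser_logits = model(ser_batch["feats"])
    loss_ser = F.cross_entropy(ser_logits, ser_batch["emotion"])

    # Semi-supervised term on unlabeled speech: keep only confident
    # emotion predictions as pseudo-labels (assumed scheme).
    with torch.no_grad():
        _, pseudo_logits = model(unlabeled_feats)
        conf, pseudo = pseudo_logits.softmax(-1).max(-1)
        mask = conf > tau
    _, ser_logits_u = model(unlabeled_feats)
    per_utt = F.cross_entropy(ser_logits_u, pseudo, reduction="none")
    loss_unsup = (per_utt * mask).sum() / mask.sum().clamp(min=1)

    return loss_asr + loss_ser + loss_unsup
```

Sharing the encoder is what lets the ASR branch supply lexical information to the emotion branch, while the unlabeled term is where the large untranscribed, unlabeled corpora enter the objective.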