Automatic emotion recognition is an active research topic with a wide range of applications. Due to the high cost of manual annotation and the inevitable ambiguity of emotion labels, existing emotion recognition datasets are limited in both scale and quality. A key challenge is therefore how to build effective models with limited data resources. Previous works have explored different approaches to tackle this challenge, including data augmentation, transfer learning, and semi-supervised learning. However, these approaches suffer from weaknesses such as training instability, large performance loss during transfer, or only marginal improvement.

In this work, we propose a novel semi-supervised multi-modal emotion recognition model based on cross-modality distribution matching, which leverages abundant unlabeled data to enhance model training under the assumption that the underlying emotional state is consistent across modalities at the utterance level.

We conduct extensive experiments to evaluate the proposed model on two benchmark datasets, IEMOCAP and MELD. The results show that the proposed semi-supervised model effectively utilizes unlabeled data and combines multiple modalities to boost emotion recognition performance, outperforming other state-of-the-art approaches under the same conditions. The proposed model also achieves competitive performance compared with existing approaches that exploit additional auxiliary information such as speaker identity and interaction context.
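To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of semi-supervised cross-modality distribution matching: each modality has its own utterance-level classifier, labeled data contributes a standard cross-entropy term, and unlabeled data contributes a term that pulls the two modalities' predicted emotion distributions together. All module names, dimensions, and the weighting factor `lam` are hypothetical.

```python
# Sketch of cross-modality distribution matching for semi-supervised
# emotion recognition. Assumes two per-modality encoders (e.g. audio and
# text) that each output utterance-level logits over the same emotion
# classes; unlabeled utterances are used only to align the two modalities'
# predicted distributions, reflecting the assumption that the underlying
# emotion is consistent across modalities at the utterance level.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityClassifier(nn.Module):
    """Toy utterance-level classifier for one modality (hypothetical sizes)."""

    def __init__(self, in_dim: int, n_classes: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, n_classes)
        )

    def forward(self, x):
        return self.net(x)  # logits over emotion classes


def distribution_matching_loss(logits_a, logits_b):
    """Symmetric KL divergence between the two modalities' predictions."""
    log_pa = F.log_softmax(logits_a, dim=-1)
    log_pb = F.log_softmax(logits_b, dim=-1)
    kl_ab = F.kl_div(log_pa, log_pb.exp(), reduction="batchmean")
    kl_ba = F.kl_div(log_pb, log_pa.exp(), reduction="batchmean")
    return 0.5 * (kl_ab + kl_ba)


def training_step(audio_model, text_model, labeled, unlabeled, lam=0.1):
    """One combined step: supervised cross-entropy on labeled utterances
    plus unsupervised cross-modality matching on unlabeled utterances."""
    (xa_l, xt_l, y), (xa_u, xt_u) = labeled, unlabeled
    supervised = (
        F.cross_entropy(audio_model(xa_l), y)
        + F.cross_entropy(text_model(xt_l), y)
    )
    unsupervised = distribution_matching_loss(audio_model(xa_u), text_model(xt_u))
    return supervised + lam * unsupervised
```

In this sketch the unlabeled term never needs ground-truth labels; it only penalizes disagreement between modalities, which is one plausible way to realize the consistency assumption stated above.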