Automatic emotion recognition is a challenging task that can have a great impact on improving natural human-computer interaction. In this paper, we present our automatic prediction of dimensional emotional states for the Cross-cultural Emotion Sub-Challenge of AVEC 2018, which uses multiple features and fusion across the visual, audio, and text modalities. Single-feature predictions are first modeled with support vector regression (SVR); the multimodal fusion of these predictions is then performed with a multiple linear regression model. Besides the baseline features, we extract unigram and bigram features from text and several types of convolutional neural network (CNN) features from video. Our multimodal fusion reaches CCC = 0.599 on the development set for arousal, 0.617 for valence, and 0.289 for likability.
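The two-stage pipeline described above (per-modality SVR, then a multiple linear regression over the single-modality predictions, evaluated with CCC) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the modality names, feature dimensions, and noise levels are invented for the example.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression

def ccc(x, y):
    # Concordance correlation coefficient, the AVEC evaluation metric.
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = ((x - x.mean()) * (y - y.mean())).mean()
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

rng = np.random.default_rng(0)
n = 200
labels = rng.normal(size=n)  # synthetic continuous labels (e.g. arousal)

# Hypothetical per-modality features (audio, visual, text): label + noise.
modalities = {m: labels[:, None] + rng.normal(scale=s, size=(n, 4))
              for m, s in [("audio", 1.0), ("visual", 0.8), ("text", 1.2)]}

# Stage 1: one SVR per modality, giving single-feature predictions.
preds = np.column_stack([
    SVR(kernel="rbf").fit(X, labels).predict(X)
    for X in modalities.values()
])

# Stage 2: multiple linear regression fuses the per-modality predictions.
fused = LinearRegression().fit(preds, labels).predict(preds)
print(round(ccc(labels, fused), 3))
```

In practice the SVRs would be trained on a training partition and the fusion weights fit on held-out predictions to avoid the optimistic bias of fitting both stages on the same data, as done here for brevity.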
Jingjun Liang, Shizhe Chen, Jinming Zhao, Qin Jin, Haibo Liu, Lu Li