Human emotion recognition is an important component of human-computer interaction, has broad application scenarios, and has received increasing attention in recent years. This paper proposes a lightweight multimodal emotion recognition network that keeps the model as small as possible while preserving accuracy, so that emotion recognition can be deployed on mobile devices. Specifically, three modalities are used as input: audio, video, and text. Audio signals are converted to MFCC features, and video features are extracted with MobileNet, reducing the number of network parameters. Text features are extracted with BERT, and the features from the three modalities are fused through an attention mechanism. Finally, a multi-task structure is introduced to improve the recognition rate and the generalization ability of the network. Experimental results show that the lightweight model effectively reduces the number of network parameters and greatly lowers hardware requirements, making emotion recognition on mobile devices practical.
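The abstract describes fusing audio (MFCC), video (MobileNet), and text (BERT) features through an attention mechanism. A minimal sketch of such attention-based fusion is shown below; it assumes each modality encoder has already produced a feature vector of equal dimension, and the query vector `w`, the dimension `d`, and the function names are hypothetical illustrations, not the paper's actual architecture.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_fusion(audio_feat, video_feat, text_feat, w):
    """Fuse three per-modality feature vectors with scalar attention weights.

    Each modality vector is scored against a (hypothetical) learned query
    vector w, scores are softmax-normalized into attention weights, and the
    fused representation is the weighted sum of the modality vectors.
    """
    feats = np.stack([audio_feat, video_feat, text_feat])  # shape (3, d)
    scores = feats @ w                                     # shape (3,)
    alpha = softmax(scores)                                # attention weights
    fused = (alpha[:, None] * feats).sum(axis=0)           # shape (d,)
    return alpha, fused

# Toy usage with random "encoder outputs" standing in for MFCC/MobileNet/BERT features.
rng = np.random.default_rng(0)
d = 8
alpha, fused = attention_fusion(
    rng.normal(size=d), rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
)
print(fused.shape)
print(round(float(alpha.sum()), 6))
```

In a full model, the fused vector would feed the multi-task heads mentioned in the abstract; here the weights simply show how the attention mechanism lets the network emphasize whichever modality is most informative.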
Aparna Khare, Srinivas Parthasarathy, Shiva Sundaram
Guoliang Xiang, Song Yao, Xianhui Wu, Hanwen Deng, Guojie Wang, Yü Liu, Fan Li, Yong Peng
Peisong Liu, Manqiang Che, Jiangchuan Luo