Facial expressions are an important non-verbal channel through which humans convey emotional information and reflect their inner states. Because the attention mechanism in Transformers has strong feature representation capabilities, it is attracting growing interest in the computer vision field, and Swin Transformer is one successful adaptation of the Transformer to vision tasks. This paper proposes a Swin Transformer-based emotion recognition algorithm called Swin Emotion. The algorithm uses Swin Transformer as the backbone network and fuses the feature vectors of varying sizes extracted from multiple stages, capturing both the fine details of the image and its macroscopic characteristics while preserving the integrity of the modeling process. Because the multi-scale feature fusion network uses low-level and high-level image features simultaneously, it improves the model's ability to perceive objects at different scales and to capture both detail and contextual information. Experimental results show that the model demonstrates a better understanding of emotions and performs strongly on all evaluated tasks. With an accuracy of up to 99.38% on peak expressions in the CK+ dataset, Swin Emotion holds broad application prospects in the field of emotion recognition.
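The multi-stage fusion idea can be illustrated with a minimal, dependency-free sketch. It is not the Swin Emotion implementation: the fusion rule (global average pooling of each stage followed by concatenation), the toy stage shapes, the seven-class output, and all function names below are assumptions chosen for illustration. The only property borrowed from Swin Transformer is that successive stages have fewer tokens and wider channels.

```python
import math
import random


def global_average_pool(tokens):
    """Average a stage's token vectors (num_tokens x channels) into one vector."""
    num_tokens = len(tokens)
    channels = len(tokens[0])
    return [sum(tok[c] for tok in tokens) / num_tokens for c in range(channels)]


def fuse_stages(stage_outputs):
    """Pool every stage and concatenate the results (assumed fusion rule)."""
    fused = []
    for tokens in stage_outputs:
        fused.extend(global_average_pool(tokens))
    return fused


def linear(vector, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_stage(num_tokens, channels, rng):
    """Stand-in for a backbone stage output; real features would come from the network."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(channels)]
            for _ in range(num_tokens)]


rng = random.Random(0)

# Toy (tokens, channels) shapes: tokens shrink 4x and channels double per stage.
stage_shapes = [(64, 8), (16, 16), (4, 32), (1, 64)]
stage_outputs = [random_stage(n, c, rng) for n, c in stage_shapes]

fused = fuse_stages(stage_outputs)  # low-level and high-level features side by side

num_classes = 7  # assumed number of expression classes
weights = [[rng.uniform(-0.1, 0.1) for _ in fused] for _ in range(num_classes)]
bias = [0.0] * num_classes
probs = softmax(linear(fused, weights, bias))

print(len(fused))  # 120 = 8 + 16 + 32 + 64
```

Concatenating the pooled vectors keeps each stage's contribution separate, so the classifier can weight early, detail-oriented features and late, context-oriented features independently. Other fusion rules, such as projecting every stage to a common width and summing, would fit the same description.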
Mei Bie, Huan Xu, Yan Gao, Kai Song, Xiangjiu Che
Xinhua Zhao, Yongjia Lv, Zheng Huang
Longteng Duan, Wei Shao, Linqi Song
Peng Qi, Zhengguang Liu, Mohammad Asadpourhafshejani, Dezheng Hua, Xinhua Liu