With the advancement of technology, the field of human-machine interaction is in growing need of robust automatic emotion recognition systems. Building machines that comprehend emotions when interacting with humans paves the way for systems equipped with human-like intelligence. Previous architectures in this field often rely on RNN models; however, these models struggle to learn in-depth contextual features. This paper proposes a transformer-based model that combines the speech features used in previous works with text and motion-capture (mocap) data to improve the performance of our emotion recognition system. Our experimental results show that the proposed model outperforms the previous state of the art. All experiments were conducted on the IEMOCAP dataset.
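The abstract contrasts transformers with RNNs on contextual modeling: self-attention lets every timestep attend to every other, across modalities, in one step. Below is a minimal NumPy sketch of that core operation under an assumed early-fusion setup (concatenating the three modality sequences along the time axis); the feature dimensions and fusion strategy are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Core transformer operation: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Numerically stable softmax over each row of the score matrix.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
# Hypothetical per-modality feature sequences (timesteps x feature_dim);
# shapes are made up for illustration.
speech = rng.standard_normal((10, 16))
text = rng.standard_normal((8, 16))
mocap = rng.standard_normal((12, 16))

# Assumed early fusion: concatenate along the time axis, then let
# self-attention relate any speech frame to any text token or mocap
# frame directly, instead of an RNN's strictly sequential pass.
fused = np.concatenate([speech, text, mocap], axis=0)  # (30, 16)
context = scaled_dot_product_attention(fused, fused, fused)
print(context.shape)  # (30, 16)
```

In a full model this attention block would be stacked with feed-forward layers and a classification head over the emotion labels; the sketch only shows why attention captures cross-modal context that a recurrent pass cannot reach in one step.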