Peng He, Jun Yu, Chengjie Ge, Wei Jia, W. L. Xu, Lei Wang, Tianyu Liu, Zhen Kan
Human body language understanding has long been a focal point of research across many fields. Within this realm, emotion recognition based on facial expressions, voice patterns, and physiological signals holds significant practical value. Compared with unimodal approaches, multimodal emotion recognition models leverage complementary information from the visual, acoustic, and language modalities to robustly perceive human sentiment. However, the heterogeneity among modality signals leads to significant domain shifts, posing challenges for achieving balanced fusion. In this article, we propose a Domain-Separated Bottleneck Attention (DBA) Fusion Framework for human multimodal emotion recognition with lower computational complexity. Specifically, we partition each modality into two distinct domains: an invariant domain and a private domain. The invariant domain contains crucial shared information, while the private domain captures modality-specific representations. For the decomposed features, we introduce two sets of bottleneck cross-attention modules that exploit the complementarity between domains while reducing redundant information. In each module, we interweave two Fusion Adapter blocks into the Self-Attention Transformer backbone. Each Fusion Adapter block integrates a small group of latent tokens as bridges for inter-modal and inter-domain interactions, mitigating the adverse effects of modality distribution differences and lowering computational costs. Extensive experimental results demonstrate that our method outperforms state-of-the-art (SOTA) approaches across three widely used benchmark datasets.
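To make the bottleneck-attention idea concrete, the following is a minimal PyTorch sketch of latent-token fusion: a small set of shared latent tokens first attends to each modality's token sequence and then is read back by each modality, so the modalities exchange information only through this low-capacity bridge rather than through full pairwise cross-attention. The class name, hyperparameters, and shapes are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class BottleneckFusion(nn.Module):
    """Illustrative latent-token bottleneck fusion (hypothetical sketch)."""

    def __init__(self, dim=256, num_latents=8, num_heads=4):
        super().__init__()
        # A small set of learnable latent tokens serves as the fusion bottleneck.
        self.latents = nn.Parameter(torch.randn(1, num_latents, dim) * 0.02)
        # Latents gather information from each modality (latents act as queries).
        self.collect = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Each modality reads the fused latents back (modality tokens act as queries).
        self.distribute = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, modality_feats):
        # modality_feats: list of (B, T_m, dim) token sequences, one per modality.
        B = modality_feats[0].size(0)
        z = self.latents.expand(B, -1, -1)
        # Step 1: latents attend to every modality, accumulating shared context.
        for x in modality_feats:
            upd, _ = self.collect(z, x, x)
            z = self.norm(z + upd)
        # Step 2: each modality attends to the latents to receive fused context.
        fused = []
        for x in modality_feats:
            upd, _ = self.distribute(x, z, z)
            fused.append(x + upd)
        return fused, z


if __name__ == "__main__":
    # Toy shapes: vision (50 tokens), audio (120 tokens), text (30 tokens).
    v, a, t = (torch.randn(2, n, 256) for n in (50, 120, 30))
    fusion = BottleneckFusion()
    (v_f, a_f, t_f), latents = fusion([v, a, t])
    print(v_f.shape, a_f.shape, t_f.shape, latents.shape)
```

Because all cross-modal traffic passes through only a handful of latent tokens, the attention cost grows with the number of latents rather than with the product of modality sequence lengths, which is the source of the lower computational complexity claimed above.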