Chengtao Lv, Xiaofei Zhou, Bin Wan, Shuai Wang, Yaoqi Sun, Jiyong Zhang, Chenggang Yan
Salient object detection (SOD) can be applied in the consumer electronics field, where it helps identify and locate objects of interest. RGB and RGB-D (depth) salient object detection have made great progress in recent years. However, there is still considerable room for improvement in exploiting the complementarity of two-modal information for RGB-T (thermal) SOD. Therefore, this paper proposes a Transformer-based Cross-modal Integration Network (TCINet) to detect salient objects in RGB-T images, which properly fuses two-modal features and interactively aggregates two-level features. Our method consists of siamese Swin Transformer-based encoders, a cross-modal feature fusion (CFF) module, and an interaction-based feature decoding (IFD) block. The CFF module is designed to fuse the complementary information of the two modalities, where collaborative spatial attention emphasizes salient regions and suppresses background regions in the two-modal features. Furthermore, we deploy the IFD block to aggregate two-level features, namely the previous-level fused feature and the current-level encoder feature, bridging their large semantic gap and reducing noise. Extensive experiments on three RGB-T datasets clearly demonstrate the superiority and effectiveness of our method compared with cutting-edge saliency methods. The results and code will be available at https://github.com/lvchengtao/TCINet.
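The collaborative spatial attention idea behind the CFF module can be illustrated with a minimal sketch. This is a hypothetical NumPy illustration of the general mechanism, not the authors' implementation: a single spatial attention map is derived jointly from both modalities and used to re-weight each modality's feature map, so that regions both modalities agree are salient get emphasized while background responses are suppressed. The function name, pooling choice, and additive fusion are all assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def collaborative_spatial_attention(f_rgb, f_t):
    """Hypothetical sketch of collaborative spatial attention fusion.

    f_rgb, f_t: (C, H, W) feature maps from the RGB and thermal branches.
    Returns the fused (C, H, W) feature and the shared (H, W) attention map.
    """
    # Pool each modality over channels, then combine into one joint map,
    # so the attention reflects evidence from both modalities.
    joint = f_rgb.mean(axis=0) + f_t.mean(axis=0)   # (H, W)
    attn = sigmoid(joint)                            # values in (0, 1)
    # Re-weight both modalities with the shared map and fuse by addition:
    # high-attention (salient) locations are kept, background is damped.
    fused = f_rgb * attn + f_t * attn                # (C, H, W), broadcast over C
    return fused, attn
```

In a real network the attention map would typically be produced by learned convolutions rather than plain channel pooling, but the re-weight-then-fuse structure is the same.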