Shuo Zhang, Jiaming Huang, Wenbing Tang, Yan Wu, Tengjiang Hu, Xiaogang Xu, Jing Liu
Multi-modal salient object detection (SOD), which integrates additional data such as depth or thermal information, has become a significant task in computer vision in recent years. Traditionally, the challenges of identifying salient objects in RGB, RGB-D (depth), and RGB-T (thermal) images are tackled separately. However, without carefully designed cross-modal fusion strategies, such approaches struggle to integrate multi-modal information effectively, often producing poorly defined object edges or overconfident, inaccurate predictions. Recent studies have shown that designing a unified end-to-end framework to handle all three SOD tasks simultaneously is both necessary and difficult. To address this need, we propose a novel approach that treats multi-modal SOD as a conditional mask generation task built on diffusion models. We introduce DiMSOD, which combines local controls (depth or thermal maps) and global controls (RGB images) within a unified model for progressive denoising and refined prediction. DiMSOD is efficient: it requires fine-tuning only our newly introduced modules on top of a pre-trained Stable Diffusion model, which reduces the fine-tuning cost, makes the approach more viable in practice, and strengthens the integration of multi-modal conditional controls. Specifically, we develop three modules, SOD-ControlNet, a Feature Adaptive Network (FAN), and a Feature Injection Attention Network (FIAN), to enhance the model's performance. Extensive experiments demonstrate that DiMSOD efficiently detects salient objects across RGB, RGB-D, and RGB-T datasets, outperforming previous well-established methods.
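The fine-tuning recipe the abstract describes, keeping Stable Diffusion frozen and training only newly added control modules, follows the ControlNet pattern of injecting conditions through zero-initialized projections, so that at initialization the augmented model behaves exactly like the frozen backbone. Below is a minimal NumPy sketch of that zero-initialization idea only; all names and shapes are hypothetical stand-ins, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_backbone(x):
    # Stand-in for one frozen Stable Diffusion denoising block (weights fixed).
    W = np.array([[0.5, -0.2], [0.1, 0.9]])
    return x @ W

class ZeroInitControl:
    """Hypothetical trainable control branch. Its output is injected through a
    zero-initialized projection, so at initialization the combined model
    reproduces the frozen backbone exactly; gradients then gradually turn the
    condition (e.g. a depth or thermal map) into a useful control signal."""
    def __init__(self, dim):
        self.W_cond = rng.normal(size=(dim, dim))  # trainable condition encoder
        self.W_zero = np.zeros((dim, dim))         # zero-initialized injection

    def __call__(self, cond):
        return (cond @ self.W_cond) @ self.W_zero

def controlled_step(x, cond, control):
    # Frozen backbone output plus the (initially silent) control residual.
    return frozen_backbone(x) + control(cond)

x = rng.normal(size=(4, 2))      # toy noisy mask latents
depth = rng.normal(size=(4, 2))  # toy local condition (depth features)
ctrl = ZeroInitControl(2)

# At init the control contributes nothing, so fine-tuning starts from
# the pre-trained model's behavior rather than from random outputs.
assert np.allclose(controlled_step(x, depth, ctrl), frozen_backbone(x))
```

The zero projection is the key design choice: it lets the new modules be trained without perturbing the pre-trained denoiser at step zero, which is what makes fine-tuning only the added modules cheap and stable.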