Xu Yi, Ruichao Hou, Ziheng Qi, Tongwei Ren
ABSTRACT

RGB and thermal salient object detection (RGB-T SOD) aims to accurately locate and segment salient objects in aligned visible and thermal image pairs. However, existing methods often struggle to produce complete masks and sharp boundaries in challenging scenarios because they insufficiently exploit the complementary features of the two modalities. In this paper, we propose a novel Mamba-based fusion network for the RGB-T SOD task, named Mamba4SOD, which integrates the strengths of the Swin Transformer and Mamba to construct robust multi-modal representations, effectively reducing pixel misclassification. Specifically, we leverage Swin Transformer V2 to establish long-range contextual dependencies and thoroughly analyse the impact of features at various levels on detection performance. Additionally, we develop a novel Mamba-based fusion module with linear complexity that boosts multi-modal enhancement and fusion. Experimental results on the VT5000, VT1000 and VT821 datasets demonstrate that our method outperforms state-of-the-art RGB-T SOD methods.
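The abstract describes fusing complementary RGB and thermal features with linear complexity. The paper's actual Mamba-based fusion module is not specified here; as a generic illustration only, the sketch below shows a simple cross-modal gated fusion in NumPy, where each modality is re-weighted by a sigmoid gate computed from the other modality before summation. The function name and gating scheme are assumptions for illustration, not the authors' method.

```python
import numpy as np

def gated_fusion(f_rgb, f_t):
    """Toy cross-modal gated fusion (illustrative, not Mamba4SOD's module).

    Each modality's features are re-weighted by a gate derived from the
    other modality, then summed. The cost is linear in the number of
    feature elements, loosely echoing the linear-complexity goal stated
    in the abstract.
    """
    gate_rgb = 1.0 / (1.0 + np.exp(-f_t))    # gate RGB features using thermal cues
    gate_t = 1.0 / (1.0 + np.exp(-f_rgb))    # gate thermal features using RGB cues
    return gate_rgb * f_rgb + gate_t * f_t   # fused multi-modal representation

# usage: fuse two 4x4 single-channel feature maps
f_rgb = np.ones((4, 4))
f_t = np.zeros((4, 4))
fused = gated_fusion(f_rgb, f_t)
print(fused.shape)  # (4, 4)
```

In real RGB-T SOD networks this per-pixel fusion would operate on multi-channel feature maps from the two encoder branches at each level; the gating here stands in for whatever learned enhancement the fusion module performs.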