Zhengyi Liu, Yacheng Tan, Qian He, Yun Xiao
Convolutional neural networks (CNNs) are good at extracting contextual features within certain receptive fields, while transformers can model global long-range dependencies. By absorbing the advantages of the transformer and the merits of the CNN, Swin Transformer shows strong feature representation ability. Based on it, we propose SwinNet, a cross-modality fusion model for RGB-D and RGB-T salient object detection. It is driven by Swin Transformer to extract hierarchical features, boosted by an attention mechanism to bridge the gap between the two modalities, and guided by edge information to sharpen the contour of the salient object. Specifically, a two-stream Swin Transformer encoder first extracts multi-modality features, and then a spatial alignment and channel re-calibration module is presented to optimize intra-level cross-modality features. To clarify fuzzy boundaries, an edge-guided decoder achieves inter-level cross-modality fusion under the guidance of edge features. The proposed model outperforms state-of-the-art models on RGB-D and RGB-T datasets, showing that it provides more insight into the cross-modality complementarity task.
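The abstract's spatial alignment and channel re-calibration idea can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the authors' module: the learned fully-connected and convolutional layers are replaced by simple channel-wise and spatial averages, and all function names (`channel_recalibrate`, `spatial_align`, `fuse`) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_recalibrate(feat, guide):
    # Squeeze-and-excitation-style gating: global-average-pool the guiding
    # modality into a per-channel descriptor, then re-weight the channels
    # of `feat`. (The paper's learned layers are omitted in this sketch.)
    gate = sigmoid(guide.mean(axis=(1, 2)))        # shape (C,)
    return feat * gate[:, None, None]

def spatial_align(feat, guide):
    # Spatial attention: averaging the guiding modality over channels gives
    # an (H, W) map that re-weights `feat` location by location.
    attn = sigmoid(guide.mean(axis=0))             # shape (H, W)
    return feat * attn[None, :, :]

def fuse(rgb, depth):
    # Cross-modality fusion: each stream is re-calibrated and aligned by
    # the other modality, then the two streams are summed.
    r = spatial_align(channel_recalibrate(rgb, depth), depth)
    d = spatial_align(channel_recalibrate(depth, rgb), rgb)
    return r + d

# Toy intra-level features from the two encoder streams (C, H, W).
rgb = np.random.rand(64, 16, 16).astype(np.float32)
depth = np.random.rand(64, 16, 16).astype(np.float32)
fused = fuse(rgb, depth)
print(fused.shape)  # (64, 16, 16)
```

The symmetric design (each modality gates the other) mirrors the abstract's goal of bridging the gap between modalities before inter-level decoding.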
Shuaihui Wang, Fengyi Jiang, Boqian Xu
Geng Chen, Qingyue Wang, Bo Dong, Ruitao Ma, Nian Liu, Huazhu Fu, Yong Xia
Xu Liu, Chenhua Liu, Xiaoming Zhou, Guodong Fan
Mingfeng Jiang, Jianhua Ma, Jiatong Chen, Yaming Wang, Xian Fang
Chao Zeng, Sam Kwong, Horace H. S. Ip