In recent years, many salient object detection (SOD) methods have introduced depth cues to boost detection performance in challenging scenes, a task known as RGB-D SOD. However, effectively fusing cross-modal features with different properties (i.e., RGB and depth) remains an unavoidable key issue. Most existing methods employ simple operations, such as concatenation or summation, for cross-modal fusion, ignoring the negative effects of low-quality depth maps and thus yielding poor performance. In this paper, we design a simple yet effective fusion method that utilizes 3D convolution to extract modality-specific and modality-shared information for sufficient cross-modal fusion, and combines modality weights to mitigate the interference of invalid information. In addition, we propose a novel multi-level feature integration strategy in the decoder, which explicitly incorporates low-level detail information and high-level semantic information into the mid-level features to generate accurate saliency maps. Extensive experiments on six public datasets show that our method achieves competitive results against 17 state-of-the-art methods.
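The core idea of the fusion can be illustrated with a minimal numpy sketch: stack the RGB and depth feature maps along a new "modality" axis so that a 3D convolution can mix the two modalities jointly, after re-weighting each modality to suppress unreliable depth. All shapes, the pooling-based weighting, and the random kernel below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy backbone features for one image: (channels, H, W).
# Sizes are illustrative assumptions only.
C, H, W = 8, 16, 16
f_rgb = rng.standard_normal((C, H, W))
f_depth = rng.standard_normal((C, H, W))

# Modality weights from globally pooled responses (softmax),
# intended to down-weight a low-quality depth map.
g = np.array([f_rgb.mean(), f_depth.mean()])
w = np.exp(g) / np.exp(g).sum()
f_rgb, f_depth = w[0] * f_rgb, w[1] * f_depth

# Stack along a new "modality" axis -> a 4D volume (C, D=2, H, W),
# so a 3D convolution sees both modalities in its receptive field.
vol = np.stack([f_rgb, f_depth], axis=1)  # (C, 2, H, W)

# A depth-2, 1x1-spatial 3D kernel: collapses the modality axis while
# mixing channels; weights are random stand-ins for learned parameters.
C_out = 8
kernel = rng.standard_normal((C_out, C, 2)) / np.sqrt(C * 2)
fused = np.einsum('cdhw,ocd->ohw', vol, kernel)  # (C_out, H, W)

print(fused.shape)
```

In a real network the 3D kernel would also span a spatial neighborhood and the modality weights would be predicted by a learned sub-module; the sketch only shows why stacking into a volume lets a single 3D convolution capture both modality-shared and modality-specific structure.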
Zhiqiang Cui, Zhengyong Feng, Feng Wang, Qiang Liu
Qinsheng Du, Yingxu Bian, Jianyu Wu, Shiyan Zhang, Jian Zhao
Zhengyi Liu, Yuan Wang, Yacheng Tan, Wei Li, Yun Xiao