Scene depth information can complement visual information for more accurate semantic segmentation. However, how to effectively integrate multi-modality information into representative features remains an open problem. Most existing work uses DCNNs to fuse multi-modality information implicitly, but as the network deepens, some critical distinguishing features may be lost, degrading segmentation performance. This work proposes a unified and efficient feature selection-and-fusion network (FSFNet), which contains a symmetric cross-modality residual fusion module for explicit fusion of multi-modality information. The network also includes a detailed feature propagation module, which preserves low-level detailed information during the forward pass. Experimental evaluations demonstrate that the proposed model achieves competitive performance compared with state-of-the-art methods on two public datasets.
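To make the fusion idea concrete, below is a minimal PyTorch sketch of a symmetric cross-modality residual fusion block. The module name, the channel-attention gating, and all hyperparameters here are illustrative assumptions, not the paper's exact design: each modality receives gated features from the other stream as a residual, so RGB and depth refine each other symmetrically.

```python
import torch
import torch.nn as nn

class CrossModalityResidualFusion(nn.Module):
    """Illustrative sketch of symmetric cross-modality residual fusion.

    Each stream selects complementary features from the other modality
    via a learned channel gate and adds them back as a residual, so the
    RGB and depth branches are refined symmetrically. The gating design
    is an assumption for illustration, not the paper's formulation.
    """
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate_rgb = self._make_gate(channels, reduction)
        self.gate_depth = self._make_gate(channels, reduction)

    @staticmethod
    def _make_gate(channels: int, reduction: int) -> nn.Sequential:
        # Squeeze-and-excitation-style channel attention gate.
        return nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        # Symmetric residual exchange: each stream receives gated
        # features from the other modality.
        rgb_out = rgb + self.gate_depth(depth) * depth
        depth_out = depth + self.gate_rgb(rgb) * rgb
        return rgb_out, depth_out

# Usage: fuse 64-channel feature maps from the two encoder streams.
fusion = CrossModalityResidualFusion(channels=64)
rgb_feat = torch.randn(2, 64, 120, 160)
depth_feat = torch.randn(2, 64, 120, 160)
rgb_fused, depth_fused = fusion(rgb_feat, depth_feat)
```

Because the exchange is residual, each stream retains its own features even when the gate suppresses the other modality, which matches the abstract's goal of not losing critical distinguishing features as the network deepens.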
Pengcheng Xiang, Baochen Yao, Zefeng Jiang, Chengbin Peng