Wujie Zhou, Shaohua Dong, Meixin Fang, Lu Yu
Color–thermal (RGB-T) urban scene parsing has recently attracted widespread interest. However, most existing approaches to RGB-T urban scene parsing do not deeply explore the complementarity between RGB and thermal features. In this study, we propose a cross-modal attention-cascaded fusion network (CACFNet) that fully exploits cross-modal complementary information. In our design, a cross-modal attention fusion module mines complementary information from the two modalities. Subsequently, a cascaded fusion module decodes the multi-level features in a top-down manner. Noting that each pixel is labeled with the category of the region to which it belongs, we present a region-based module that explores the relationship between pixels and regions. Moreover, in contrast to previous methods that employ only the cross-entropy loss to penalize pixel-wise predictions, we propose an additional loss to learn pixel–pixel relationships. Extensive experiments on two datasets demonstrate that the proposed CACFNet achieves state-of-the-art performance in RGB-T urban scene parsing.
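For illustration, below is a minimal PyTorch sketch of a cross-modal attention fusion step of the kind the abstract describes. The paper's exact module design is not given here, so the layer choices (cross-modal channel-attention gates and a 1x1 merge convolution) and the name CrossModalAttentionFusion are assumptions for exposition, not CACFNet's actual implementation.

```python
# Hedged sketch: fusing RGB and thermal feature maps so that each modality
# recalibrates the other via channel attention. All design details here are
# assumptions, not the published CACFNet module.
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global channel statistics
        # Attention gate driven by RGB statistics (applied to thermal features).
        self.rgb_gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Attention gate driven by thermal statistics (applied to RGB features).
        self.thermal_gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # 1x1 convolution merging the two recalibrated streams.
        self.merge = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        # Cross gating: attention weights computed from one modality
        # modulate the features of the other, mining complementary cues.
        rgb_out = rgb * self.thermal_gate(self.pool(thermal))
        thermal_out = thermal * self.rgb_gate(self.pool(rgb))
        return self.merge(torch.cat([rgb_out, thermal_out], dim=1))

# Usage: fuse same-shape encoder stage features, e.g. (B, 256, 32, 32).
fuse = CrossModalAttentionFusion(256)
fused = fuse(torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32))
```

In a cascaded decoder, one such fused map per encoder stage would then be combined top-down with the next-shallower stage; that composition is likewise only sketched here, not the paper's specification.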