JOURNAL ARTICLE

Disentangled Cross-Modal Transformer for RGB-D Salient Object Detection and Beyond

Hao Chen, Feihong Shen, Ding Ding, Yongjian Deng, Chao Li

Journal: IEEE Transactions on Image Processing Year: 2024 Vol: 33 Pages: 1699-1709 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Previous multi-modal transformers for RGB-D salient object detection (SOD) generally connect all patches from the two modalities directly to model cross-modal correlation, and perform multi-modal combination without differentiation, which can lead to confusing and inefficient fusion. Instead, we disentangle the cross-modal complementarity from two views to reduce cross-modal fusion ambiguity: 1) Context disentanglement. We argue that modeling long-range dependencies across modalities, as done before, is uninformative due to the severe modality gap. In contrast, we disentangle the cross-modal complementary contexts into intra-modal self-attention, which explores a global complementary understanding, and spatially aligned inter-modal attention, which captures local cross-modal correlations. 2) Representation disentanglement. Unlike the previous undifferentiated combination of cross-modal representations, we find that cross-modal cues complement each other by enhancing common discriminative regions and by mutually supplementing modality-specific highlights. Accordingly, we divide the tokens into consistent and private ones along the channel dimension to disentangle the multi-modal integration paths and explicitly boost the two complementary ways. By progressively propagating this strategy across layers, the proposed Disentangled Feature Pyramid (DFP) module enables informative cross-modal, cross-level integration and better fusion adaptivity. Comprehensive experiments on a large variety of public datasets verify the efficacy of our context and representation disentanglement and show consistent improvements over state-of-the-art models. Additionally, our cross-modal attention hierarchy is plug-and-play for different backbone architectures (both transformer and CNN) and downstream tasks, and experiments on a CNN-based model and on RGB-D semantic segmentation verify this generalization ability.
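The two disentanglement ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the identity Q/K/V projections, the sigmoid gate standing in for the spatially aligned inter-modal attention, the 50/50 channel split, and the product/sum fusion operators are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def intra_modal_self_attention(tokens):
    """Global context within one modality: (N, C) -> (N, C).
    Q/K/V projections are omitted (identity) to keep the sketch short."""
    scale = np.sqrt(tokens.shape[-1])
    return softmax(tokens @ tokens.T / scale) @ tokens

def aligned_inter_modal_attention(src, ref):
    """Local cross-modal cue: each token of `src` looks only at the
    spatially aligned token of `ref`. A similarity-driven sigmoid gate
    stands in for the paper's attention formulation (assumption)."""
    scale = np.sqrt(src.shape[-1])
    sim = (src * ref).sum(-1, keepdims=True) / scale   # (N, 1) per-position similarity
    gate = 1.0 / (1.0 + np.exp(-sim))                  # per-position gate in (0, 1)
    return src + gate * ref

def disentangled_fusion(rgb, dep):
    """Representation disentanglement: a `consistent` channel half that
    enhances common discriminative responses (elementwise product here)
    and a `private` half that supplements modality-specific highlights
    (sum here). Split ratio and operators are assumptions."""
    C = rgb.shape[-1] // 2
    consistent = rgb[:, :C] * dep[:, :C]
    private = rgb[:, C:] + dep[:, C:]
    return np.concatenate([consistent, private], axis=-1)

# Toy tokens: 16 spatial positions, 8 channels per modality.
rng = np.random.default_rng(0)
rgb = rng.standard_normal((16, 8))
dep = rng.standard_normal((16, 8))

rgb_ctx = aligned_inter_modal_attention(intra_modal_self_attention(rgb), dep)
dep_ctx = aligned_inter_modal_attention(intra_modal_self_attention(dep), rgb)
fused = disentangled_fusion(rgb_ctx, dep_ctx)
print(fused.shape)  # (16, 8)
```

The key point the sketch preserves is the routing: global attention stays within each modality, cross-modal interaction is restricted to spatially aligned positions, and the fused representation keeps separate channel groups for consistent and private cues.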

Keywords:
Modal, Computer science, Artificial intelligence, Transformer, RGB color model, Pattern recognition (psychology), Voltage, Engineering

Metrics

Cited By: 33
FWCI (Field Weighted Citation Impact): 17.50
Refs: 70
Citation Normalized Percentile: 0.99 (in top 1%, in top 10%)

Citation History

Topics

Visual Attention and Saliency Detection
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Face Recognition and Perception
Life Sciences →  Neuroscience →  Cognitive Neuroscience
Advanced Neural Network Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Lightweight cross-modal transformer for RGB-D salient object detection

Nianchang Huang, Yang Yang, Qiang Zhang, Jungong Han, Jin H. Huang

Journal: Computer Vision and Image Understanding Year: 2024 Vol: 249 Pages: 104194
JOURNAL ARTICLE

Multi-Modal Transformer for RGB-D Salient Object Detection

Peipei Song, Jing Zhang, Piotr Koniusz, Nick Barnes

Conference: 2022 IEEE International Conference on Image Processing (ICIP) Year: 2022 Pages: 2466-2470
BOOK-CHAPTER

Cross-Modal Weighting Network for RGB-D Salient Object Detection

Gongyang Li, Zhi Liu, Linwei Ye, Yang Wang, Haibin Ling

Book series: Lecture Notes in Computer Science Year: 2020 Pages: 665-681
JOURNAL ARTICLE

Transformer-Based Cross-Modal Integration Network for RGB-T Salient Object Detection

Chengtao Lv, Xiaofei Zhou, Bin Wan, Shuai Wang, Yaoqi Sun, Jiyong Zhang, Chenggang Yan

Journal: IEEE Transactions on Consumer Electronics Year: 2024 Vol: 70 (2) Pages: 4741-4755