JOURNAL ARTICLE

Disentangled Cross-Modal Transformer for RGB-D Salient Object Detection and Beyond

Hao Chen, Feihong Shen, Ding Ding, Yongjian Deng, Chao Li

Journal: IEEE Transactions on Image Processing Year: 2024 Vol: 33 Pages: 1699-1709 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Previous multi-modal transformers for RGB-D salient object detection (SOD) generally connect all patches from the two modalities directly to model cross-modal correlation, and perform multi-modal combination without differentiation, which can lead to confusing and inefficient fusion. Instead, we disentangle the cross-modal complementarity from two views to reduce cross-modal fusion ambiguity: 1) Context disentanglement. We argue that modeling long-range dependencies across modalities, as done before, is uninformative due to the severe modality gap. In contrast, we disentangle the cross-modal complementary contexts into intra-modal self-attention, which explores a global complementary understanding, and spatially aligned inter-modal attention, which captures local cross-modal correlations. 2) Representation disentanglement. Unlike the previous undifferentiated combination of cross-modal representations, we find that cross-modal cues complement each other by enhancing common discriminative regions and by mutually supplementing modality-specific highlights. Accordingly, we divide the tokens into consistent and private ones along the channel dimension to disentangle the multi-modal integration paths and explicitly boost the two complementary ways. By progressively propagating this strategy across layers, the proposed Disentangled Feature Pyramid (DFP) module enables informative cross-modal, cross-level integration and better fusion adaptivity. Comprehensive experiments on a large variety of public datasets verify the efficacy of our context and representation disentanglement and show consistent improvements over state-of-the-art models. Additionally, our cross-modal attention hierarchy is plug-and-play for different backbone architectures (both transformer and CNN) and downstream tasks, and experiments on a CNN-based model and on RGB-D semantic segmentation verify this generalization ability.
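The two disentanglement ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the identity Q/K/V projections, the sigmoid gate standing in for the spatially aligned inter-modal attention, the 50/50 channel split, and the product/sum fusion operators are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def intra_modal_self_attention(tokens):
    """Global context within one modality: (N, C) -> (N, C).
    Q/K/V projections are omitted (identity) to keep the sketch short."""
    scale = np.sqrt(tokens.shape[-1])
    return softmax(tokens @ tokens.T / scale) @ tokens

def aligned_inter_modal_attention(src, ref):
    """Local cross-modal cue: each token of `src` looks only at the
    spatially aligned token of `ref`. A similarity-driven sigmoid gate
    stands in for the paper's attention formulation (assumption)."""
    scale = np.sqrt(src.shape[-1])
    sim = (src * ref).sum(-1, keepdims=True) / scale   # (N, 1) per-position similarity
    gate = 1.0 / (1.0 + np.exp(-sim))                  # per-position gate in (0, 1)
    return src + gate * ref

def disentangled_fusion(rgb, dep):
    """Representation disentanglement: a `consistent` channel half that
    enhances common discriminative responses (elementwise product here)
    and a `private` half that supplements modality-specific highlights
    (sum here). Split ratio and operators are assumptions."""
    C = rgb.shape[-1] // 2
    consistent = rgb[:, :C] * dep[:, :C]
    private = rgb[:, C:] + dep[:, C:]
    return np.concatenate([consistent, private], axis=-1)

# Toy tokens: 16 spatial positions, 8 channels per modality.
rng = np.random.default_rng(0)
rgb = rng.standard_normal((16, 8))
dep = rng.standard_normal((16, 8))

rgb_ctx = aligned_inter_modal_attention(intra_modal_self_attention(rgb), dep)
dep_ctx = aligned_inter_modal_attention(intra_modal_self_attention(dep), rgb)
fused = disentangled_fusion(rgb_ctx, dep_ctx)
print(fused.shape)  # (16, 8)
```

The key point the sketch preserves is the routing: global attention stays within each modality, cross-modal interaction is restricted to spatially aligned positions, and the fused representation keeps separate channel groups for consistent and private cues.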

Keywords:
Modal, Computer science, Artificial intelligence, Transformer, RGB color model, Pattern recognition (psychology), Voltage, Engineering

Metrics

Cited By: 33
FWCI (Field Weighted Citation Impact): 17.50
Refs: 70
Citation Normalized Percentile: 0.99 (in top 1%, in top 10%)

Citation History

Topics

Visual Attention and Saliency Detection
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Face Recognition and Perception
Life Sciences →  Neuroscience →  Cognitive Neuroscience
Advanced Neural Network Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Lightweight cross-modal transformer for RGB-D salient object detection

Nianchang Huang, Yang Yang, Qiang Zhang, Jungong Han, Jin H. Huang

Journal: Computer Vision and Image Understanding Year: 2024 Vol: 249 Pages: 104194
JOURNAL ARTICLE

Multi-Modal Transformer for RGB-D Salient Object Detection

Peipei Song, Jing Zhang, Piotr Koniusz, Nick Barnes

Conference: 2022 IEEE International Conference on Image Processing (ICIP) Year: 2022 Pages: 2466-2470
BOOK-CHAPTER

Cross-Modal Weighting Network for RGB-D Salient Object Detection

Gongyang Li, Zhi Liu, Linwei Ye, Yang Wang, Haibin Ling

Book series: Lecture Notes in Computer Science Year: 2020 Pages: 665-681
JOURNAL ARTICLE

Transformer-Based Cross-Modal Integration Network for RGB-T Salient Object Detection

Chengtao Lv, Xiaofei Zhou, Bin Wan, Shuai Wang, Yaoqi Sun, Jiyong Zhang, Chenggang Yan

Journal: IEEE Transactions on Consumer Electronics Year: 2024 Vol: 70 (2) Pages: 4741-4755