Shuo Zhang, Jiaming Huang, Wenbing Tang, Yan Wu, Tengjiang Hu, Xiaogang Xu, Jing Liu
Multi-modal salient object detection (SOD), which integrates additional data such as depth or thermal information, has become a significant task in computer vision in recent years. Traditionally, the challenges of identifying salient objects in RGB, RGB-D (depth), and RGB-T (thermal) images are tackled separately. However, without carefully designed cross-modal fusion strategies, such approaches struggle to integrate multi-modal information effectively, often producing poorly defined object edges or overconfident, inaccurate predictions. Recent studies have shown that designing a unified end-to-end framework to handle all three SOD tasks simultaneously is both necessary and difficult. To address this need, we propose a novel approach that treats multi-modal SOD as a conditional mask generation task built on diffusion models. We introduce DiMSOD, which combines local controls (depth or thermal maps) and global controls (RGB images) within a unified model for progressive denoising and refined prediction. DiMSOD is efficient: it requires fine-tuning only our newly introduced modules on top of a pre-trained Stable Diffusion model, which reduces the fine-tuning cost, makes the approach more viable in practice, and strengthens the integration of multi-modal conditional controls. Specifically, we develop three modules, SOD-ControlNet, a Feature Adaptive Network (FAN), and a Feature Injection Attention Network (FIAN), to enhance the model's performance. Extensive experiments demonstrate that DiMSOD efficiently detects salient objects across RGB, RGB-D, and RGB-T datasets, outperforming previous well-established methods.
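The fine-tuning recipe the abstract describes, keeping Stable Diffusion frozen and training only newly added control modules, follows the ControlNet pattern of injecting conditions through zero-initialized projections, so that at initialization the augmented model behaves exactly like the frozen backbone. Below is a minimal NumPy sketch of that zero-initialization idea only; all names and shapes are hypothetical stand-ins, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_backbone(x):
    # Stand-in for one frozen Stable Diffusion denoising block (weights fixed).
    W = np.array([[0.5, -0.2], [0.1, 0.9]])
    return x @ W

class ZeroInitControl:
    """Hypothetical trainable control branch. Its output is injected through a
    zero-initialized projection, so at initialization the combined model
    reproduces the frozen backbone exactly; gradients then gradually turn the
    condition (e.g. a depth or thermal map) into a useful control signal."""
    def __init__(self, dim):
        self.W_cond = rng.normal(size=(dim, dim))  # trainable condition encoder
        self.W_zero = np.zeros((dim, dim))         # zero-initialized injection

    def __call__(self, cond):
        return (cond @ self.W_cond) @ self.W_zero

def controlled_step(x, cond, control):
    # Frozen backbone output plus the (initially silent) control residual.
    return frozen_backbone(x) + control(cond)

x = rng.normal(size=(4, 2))      # toy noisy mask latents
depth = rng.normal(size=(4, 2))  # toy local condition (depth features)
ctrl = ZeroInitControl(2)

# At init the control contributes nothing, so fine-tuning starts from
# the pre-trained model's behavior rather than from random outputs.
assert np.allclose(controlled_step(x, depth, ctrl), frozen_backbone(x))
```

The zero projection is the key design choice: it lets the new modules be trained without perturbing the pre-trained denoiser at step zero, which is what makes fine-tuning only the added modules cheap and stable.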