JOURNAL ARTICLE

Structured Multimodal Fusion Network for Referring Image Segmentation

Abstract

Referring image segmentation aims to segment the particular object in an image that a natural language expression refers to. One major challenge of this task is understanding and aligning vision and language well enough to distinguish the referent. Another is refining the segmentation mask of the referent. In this paper, we focus on dissecting and enhancing the interaction between modalities to address these challenges. Specifically, we propose a Structured Multimodal Fusion Network (SMFN), which consists of a multimodal tree, a cross-modal transformer, and a mask refinement module. SMFN first exploits multimodal fusion structures to deeply integrate visual and linguistic features so that the referent can be accurately distinguished, and then uses the mask refinement module to aggregate multi-scale visual features to clarify boundaries. We conduct extensive experiments on four benchmark datasets and achieve new state-of-the-art performance under different evaluation metrics.
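As a rough illustration of what integrating visual and linguistic features can look like, the sketch below applies generic scaled dot-product cross-modal attention: each visual location attends over the word features and adds the resulting language context back to itself. This is not the SMFN architecture, whose multimodal tree, transformer, and refinement details are not given in the abstract. The function names `softmax` and `cross_modal_attention`, the residual fusion, the absence of learned projections, and the toy feature values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(visual, words):
    """Fuse each visual feature with an attention-weighted summary of the words.

    visual: list of N visual feature vectors, each of length d
    words:  list of T word feature vectors, each of length d
    Returns (fused, weights): fused has N vectors of length d,
    weights is an N x T matrix whose rows sum to 1.

    Hypothetical simplification: no learned query/key/value projections.
    """
    d = len(words[0])
    scale = math.sqrt(d)
    fused, weights = [], []
    for v in visual:
        # Similarity between this visual location and every word.
        scores = [sum(a * b for a, b in zip(v, w)) / scale for w in words]
        attn = softmax(scores)
        # Language context: attention-weighted sum of the word features.
        context = [
            sum(attn[t] * words[t][k] for t in range(len(words)))
            for k in range(d)
        ]
        # Residual fusion (an assumed choice): visual feature plus context.
        fused.append([v[k] + context[k] for k in range(d)])
        weights.append(attn)
    return fused, weights


if __name__ == "__main__":
    # Toy features: two visual locations, three words, dimension 2.
    visual = [[1.0, 0.0], [0.0, 1.0]]
    words = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
    fused, weights = cross_modal_attention(visual, words)
    for row in weights:
        print([round(w, 3) for w in row])
```

In this toy setup, each visual location puts the most weight on the word whose feature vector points in the same direction, which is the kind of alignment a referring-segmentation model needs in order to single out the referent.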

Keywords:
Computer science, Referent, Artificial intelligence, Segmentation, Exploit, Focus (optics), Image segmentation, Computer vision, Image fusion, Pattern recognition (psychology), Natural language processing, Image (mathematics)

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 42
Citation Normalized Percentile: 0.15

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Neural Network Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
