Referring Image Segmentation via Language-Driven Attention

Ding-Jie Chen; He‐Yen Hsieh; Tyng-Luh Liu

doi:10.1109/icra48506.2021.9561797


JOURNAL ARTICLE


Year: 2021 Pages: 13997-14003



Abstract

This paper tackles the problem of referring image segmentation, which aims to identify the region of interest referred to by a natural language query sentence. A key issue in referring image segmentation is how to establish a cross-modal representation that encodes the two modalities, namely the query sentence and the input image. Most existing methods either concatenate the features from each modality or gradually encode the cross-modal representation according to each word's effect. In contrast, our approach leverages the correlation between the two modalities to construct the cross-modal representation. To make the resulting cross-modal representation more discriminative for the segmentation task, we propose a novel language-driven attention mechanism that encodes the cross-modal representation to reflect the attention between every single visual element and the entire query sentence. The proposed mechanism, named Language-Driven Attention (LDA), first decouples the cross-modal correlation into channel attention and spatial attention and then integrates the two attentions to obtain the cross-modal representation. The channel attention and the spatial attention respectively reveal how sensitive each channel or each pixel of a particular feature map is with respect to the query sentence. With a proper fusion of the two kinds of feature attention, the proposed LDA model can effectively guide the generation of the final cross-modal representation. The resulting representation is further strengthened to capture multi-receptive-field and multi-level-semantic information for the intended segmentation. We assess our referring image segmentation model on four public benchmark datasets, and the experimental results show that our model achieves state-of-the-art performance.
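
The decoupling described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so the element-wise correlation, the sigmoid gating, the mean and scaled-sum pooling, and the multiplicative fusion below are all assumptions chosen for illustration, and the names `language_driven_attention`, `visual`, and `sentence` are hypothetical. The sketch also assumes the sentence embedding has already been projected to the same channel dimension as the visual feature map.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def language_driven_attention(visual, sentence):
    """Illustrative sketch only; every formula here is an assumption.

    visual:   (H, W, C) visual feature map
    sentence: (C,) sentence embedding, assumed already projected to C dims
    """
    c = visual.shape[-1]

    # Element-wise correlation between every visual element and the sentence.
    corr = visual * sentence  # (H, W, C), sentence broadcast over H and W

    # Channel attention: one weight per channel, pooled over all pixels.
    channel_att = sigmoid(corr.mean(axis=(0, 1)))  # (C,)

    # Spatial attention: one weight per pixel, a scaled dot product over channels.
    spatial_att = sigmoid(corr.sum(axis=2) / np.sqrt(c))  # (H, W)

    # Fusion (assumed multiplicative): reweight the map by both attentions.
    fused = visual * channel_att[None, None, :] * spatial_att[:, :, None]
    return fused, channel_att, spatial_att


rng = np.random.default_rng(0)
visual = rng.standard_normal((4, 5, 8))
sentence = rng.standard_normal(8)
fused, channel_att, spatial_att = language_driven_attention(visual, sentence)
print(fused.shape, channel_att.shape, spatial_att.shape)
```

In this sketch the channel weights depend only on the channel index and the spatial weights only on the pixel location, which mirrors the abstract's statement that the two attentions reveal how sensitive each channel or each pixel is to the query sentence.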

Keywords:

Computer science; Image segmentation; Artificial intelligence; Segmentation; Computer vision; Natural language processing; Image (mathematics)

Metrics

FWCI (Field Weighted Citation Impact): 0.20

Citation Normalized Percentile: 0.50

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Neural Network Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition


Related Documents

CRIS: CLIP-Driven Referring Image Segmentation

Structured Attention Network for Referring Image Segmentation

Instance-aware context with mutually guided vision-language attention for referring image segmentation

Prompt-Driven Referring Image Segmentation with Instance Contrasting

CLIP-driven hierarchical fusion for referring image segmentation