Extending CLIP’s Image-Text Alignment to Referring Image Segmentation

Seoyeon Kim; Minguk Kang; Dong-Won Kim; Jaesik Park; Suha Kwak

doi:10.18653/v1/2024.naacl-long.258

JOURNAL ARTICLE

Year: 2024; Pages: 4611-4628

Abstract

Referring Image Segmentation (RIS) is a cross-modal task that aims to segment an instance described by a natural language expression. Recent methods leverage large-scale pretrained unimodal models as backbones along with fusion techniques for joint reasoning across modalities. However, the inherent cross-modal nature of RIS raises questions about the effectiveness of unimodal backbones. We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS. Observing CLIP’s inherent alignment between image and text features, we capitalize on this starting point and introduce simple but strong modules that enhance unimodal feature extraction and leverage rich alignment knowledge in CLIP’s image-text shared-embedding space. RISCLIP exhibits outstanding results on all three major RIS benchmarks and also outperforms previous CLIP-based methods, demonstrating the efficacy of our strategy in extending CLIP’s image-text alignment to RIS.

Keywords:

Computer science; Artificial intelligence; Computer vision; Image segmentation; Image (mathematics); Segmentation; Scale-space segmentation; Natural language processing

Metrics

FWCI (Field Weighted Citation Impact): 3.19

Citation Normalized Percentile: 0.88

Topics

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Topic Modeling

Physical Sciences → Computer Science → Artificial Intelligence

Semantic Web and Ontologies

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Text-Vision Relationship Alignment for Referring Image Segmentation

Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation

CausalCLIPSeg: Unlocking CLIP’s Potential in Referring Medical Image Segmentation with Causal Intervention

Referring Image Segmentation Without Text Annotations

Referring Image Segmentation Using Text Supervision