Fine-Grained Visual Text Prompting

Lingfeng Yang; Xiang Li; Yueze Wang; Xinlong Wang; Jian Yang

doi:10.1109/tpami.2024.3504568

JOURNAL ARTICLE

Year: 2024
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Vol: 47 (3)
Pages: 1594-1609
Publisher: IEEE Computer Society

Abstract

Vision-Language Models (VLMs), such as CLIP, excel at zero-shot image-level visual understanding but struggle with object-based tasks that require precise localization and recognition. Visual prompts, such as colorful boxes or circles, have been proposed to enhance local perception. However, these prompts often include irrelevant and noisy pixels, leading to suboptimal performance. The design of better visual prompts and their collaboration with text prompting remains underexplored. This paper introduces Fine-Grained Visual Text Prompting (FGVTP), a new zero-shot framework for object-based tasks that uses precise semantic masks and reinforced image-text alignment. FGVTP comprises Fine-Grained Visual Prompting (FGVP) and Consistency-Enhanced Text Prompting (CETP). Specifically, we carefully study visual prompting designs by exploring a wider range of visual markings that vary in shape and form. FGVP uses semantic masks from a segmenter such as the Segment Anything Model (SAM) and employs background blurring (Blur Reverse Mask) to highlight targets while maintaining spatial coherence. Further, CETP enhances image-text alignment by prompting captions based on FGVP-processed images. As a result, FGVTP achieves superior zero-shot referring expression comprehension on the RefCOCO/+/g benchmarks, outperforming previous SOTA methods by 5.8% on average. Part detection experiments conducted on the PACO dataset further validate the advantage of FGVTP over existing works. Code is available at https://github.com/ylingfeng/FGVP.
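
The Blur Reverse Mask idea described in the abstract (keep the target sharp, blur everything outside its mask) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `box_blur` and `blur_reverse_mask` are hypothetical, a simple box (mean) filter stands in for whatever blur the paper uses, and the binary mask is assumed to have been produced beforehand by a segmenter such as SAM.

```python
import numpy as np


def box_blur(image: np.ndarray, radius: int) -> np.ndarray:
    """Mean filter over a (2*radius+1)^2 window with edge padding.

    Assumption: a box filter is used here only as a dependency-free
    stand-in for the blur applied in the paper.
    """
    if radius <= 0:
        return image.astype(np.float64)
    k = 2 * radius + 1
    padded = np.pad(
        image.astype(np.float64),
        ((radius, radius), (radius, radius), (0, 0)),
        mode="edge",
    )
    h, w = image.shape[:2]
    out = np.zeros(image.shape, dtype=np.float64)
    # Sum every shifted copy of the image that falls inside the window.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)


def blur_reverse_mask(image: np.ndarray, mask: np.ndarray, radius: int = 5) -> np.ndarray:
    """Keep pixels inside `mask` unchanged and blur all pixels outside it.

    image: H x W x C array (e.g. uint8 RGB).
    mask:  H x W boolean array, True on the target (hypothetically from SAM).
    """
    blurred = box_blur(image, radius)
    keep = mask.astype(bool)[..., None]  # broadcast over channels
    result = np.where(keep, image.astype(np.float64), blurred)
    return np.rint(result).astype(image.dtype)


# Usage on synthetic data: a random image with a square "object" mask.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True
out = blur_reverse_mask(img, mask, radius=3)
```

Because the whole image is retained and only the background is softened, the target stays in its original spatial context, which is the property the abstract refers to as maintaining spatial coherence.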

Keywords:

Computer science; Artificial intelligence; Computer vision; Visualization; Natural language processing; Pattern recognition (psychology); Computer graphics (images)

Metrics

FWCI (Field Weighted Citation Impact): 3.71

Refs: 118

Citation Normalized Percentile: 0.89

Topics

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Video Analysis and Summarization

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Fine-Grained Controllable Text Generation Using Non-Residual Prompting

Delving into Multimodal Prompting for Fine-Grained Visual Classification

Crop-and-Prompt: Multi-Grained Prompting for Fine-Grained Visual-Language Understanding

Medical Image Synthesis via Fine-Grained Image-Text Alignment and Anatomy-Pathology Prompting

Using Text and Visual Cues for Fine-Grained Classification