JOURNAL ARTICLE

Fine-Grained Visual Text Prompting

Lingfeng Yang, Xiang Li, Yueze Wang, Xinlong Wang, Jian Yang

Year: 2024   Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence   Vol: 47 (3)   Pages: 1594-1609   Publisher: IEEE Computer Society

Abstract

Vision-Language Models (VLMs), such as CLIP, excel at zero-shot image-level visual understanding but struggle with object-based tasks that require precise localization and recognition. Visual prompts, such as colorful boxes or circles, have been proposed to enhance local perception. However, these prompts often include irrelevant and noisy pixels, leading to suboptimal performance. The design of better visual prompts and their combination with text prompting remain underexplored. This paper introduces Fine-Grained Visual Text Prompting (FGVTP), a new zero-shot framework for object-based tasks that uses precise semantic masks and reinforced image-text alignment. FGVTP comprises Fine-Grained Visual Prompting (FGVP) and Consistency-Enhanced Text Prompting (CETP). Specifically, we carefully study visual prompting designs by exploring a wider range of visual markings that vary in shape and form. FGVP uses semantic masks from a segmenter such as the Segment Anything Model (SAM) and employs background blurring (Blur Reverse Mask) to highlight targets while maintaining spatial coherence. Further, CETP enhances image-text alignment by prompting captions based on FGVP-processed images. As a result, FGVTP achieves superior zero-shot referring expression comprehension on the RefCOCO/+/g benchmarks, outperforming previous SOTA methods by 5.8% on average. Part detection experiments conducted on the PACO dataset further validate the advantage of FGVTP over existing works. Code is available at https://github.com/ylingfeng/FGVP.
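
The Blur Reverse Mask idea described in the abstract (keep the masked target sharp, blur everything outside it) can be illustrated with a minimal NumPy sketch. This is not taken from the authors' repository: the function names `box_blur`, `blur_reverse_mask`, and `pick_candidate` are hypothetical, a simple box filter stands in for whatever blur the paper uses, the mask is assumed to come from a segmenter such as SAM, and the final selection by cosine similarity is an assumption about how a CLIP-style model might score the prompted images against a text query.

```python
import numpy as np


def box_blur(image: np.ndarray, radius: int) -> np.ndarray:
    """Blur an H x W x C image with a (2r+1) x (2r+1) mean filter.

    Assumption: a box filter is used here only to keep the sketch
    dependency-free; a real pipeline would likely use a Gaussian blur.
    """
    k = 2 * radius + 1
    h, w = image.shape[:2]
    padded = np.pad(
        image, ((radius, radius), (radius, radius), (0, 0)), mode="edge"
    )
    out = np.zeros(image.shape, dtype=np.float64)
    # Sum all k*k shifted copies of the image, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)


def blur_reverse_mask(image: np.ndarray, mask: np.ndarray, radius: int = 2) -> np.ndarray:
    """Keep pixels inside `mask` unchanged and blur the pixels outside it."""
    image = image.astype(np.float64)
    blurred = box_blur(image, radius)
    keep = mask.astype(bool)[..., None]  # H x W x 1, broadcasts over channels
    return np.where(keep, image, blurred)


def pick_candidate(image_embeddings: np.ndarray, text_embedding: np.ndarray) -> int:
    """Return the index of the candidate most similar to the text query.

    Assumption: each row of `image_embeddings` is the embedding of one
    prompted image (one candidate mask), scored by cosine similarity.
    """
    img = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    txt = text_embedding / np.linalg.norm(text_embedding)
    return int(np.argmax(img @ txt))


# Toy usage: a random 8 x 8 RGB image with a 3 x 3 target region.
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 2:5] = True
prompted = blur_reverse_mask(image, mask)
```

In this sketch the target pixels pass through untouched while the background loses detail, so the whole image, including its spatial layout, is still available to the vision encoder. Box or circle prompts, by contrast, mark a region that can include background pixels.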

Keywords:
Computer science; Artificial intelligence; Computer vision; Visualization; Natural language processing; Pattern recognition (psychology); Computer graphics (images)

Metrics

Cited By: 7
FWCI (Field Weighted Citation Impact): 3.71
Refs: 118
Citation Normalized Percentile: 0.89

Topics

Advanced Image and Video Retrieval Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Multimodal Machine Learning Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Video Analysis and Summarization
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Fine-Grained Controllable Text Generation Using Non-Residual Prompting

Fredrik Carlsson, Joey Öhman, Fangyu Liu, Severine Verlinden, Joakim Nivre, Magnus Sahlgren

Journal: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)   Year: 2022
JOURNAL ARTICLE

Delving into Multimodal Prompting for Fine-Grained Visual Classification

Xin Jiang, Hao Tang, Junyao Gao, Xiaoyu Du, Shengfeng He, Zechao Li

Journal: Proceedings of the AAAI Conference on Artificial Intelligence   Year: 2024   Vol: 38 (3)   Pages: 2570-2578
JOURNAL ARTICLE

Using Text and Visual Cues for Fine-Grained Classification

Zaryab Shaker, Feng Xiao, Muhammad Tahir

Journal: International Journal of Advanced Network Monitoring and Controls   Year: 2021   Vol: 6 (3)   Pages: 42-49