JOURNAL ARTICLE

Multi-modal Contextual Prompt Learning for Multi-label Classification with Partial Labels

Abstract

Multi-label classification has diverse applications, but current algorithms rely heavily on accurately labeled data, making data collection time-consuming and labor-intensive. Multi-label classification with partial labels therefore presents significant challenges. In this study, we propose Multi-modal Contextual Prompt Learning (MCPL), a novel approach that leverages large-scale vision-language models and exploits the strong image-text alignment in CLIP to address the scarcity of label annotations. We pre-train the vision-language model's encoder on a large number of image-text pairs, and we introduce multi-modal contextual prompt learning on both images and label text to better exploit the image-label correspondence within CLIP, improving multi-label classification performance even when only partial labels are available. We further use a coupling function to couple the two modalities, realizing an interactive connection between the two modal prompts. Extensive experiments on the MS-COCO and VOC2007 datasets demonstrate the superiority of MCPL, which achieves competitive performance.
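The coupling function mentioned above can be illustrated with a minimal sketch: learnable text-side prompt vectors are mapped through a shared projection to produce the visual-side prompts, so the two modal prompts stay interactively connected. All names (`couple_prompts`, `coupling_matrix`) and dimensions here are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a prompt-coupling function (assumed design,
# not the paper's exact method): visual prompts are derived from
# learnable text prompts via a linear projection, so updating the
# text prompts also updates the visual prompts.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def couple_prompts(text_prompts, coupling_matrix):
    """Coupling function: project each text-prompt vector into the
    visual-prompt space with a shared (in practice, learned) matrix."""
    return [matvec(coupling_matrix, p) for p in text_prompts]

# Two toy text-prompt vectors of dimension 3.
text_prompts = [[1.0, 0.0, 2.0],
                [0.5, 1.0, 0.0]]

# Toy coupling matrix projecting 3-d text-prompt space to 2-d
# visual-prompt space; in a real model this would be trained jointly.
coupling_matrix = [[1.0, 0.0, 0.5],
                   [0.0, 1.0, 1.0]]

visual_prompts = couple_prompts(text_prompts, coupling_matrix)
print(visual_prompts)  # [[2.0, 2.0], [0.5, 1.0]]
```

In a full model the coupling matrix would be optimized together with the prompts and CLIP's frozen encoders, so gradients from the classification loss flow through both modalities.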

Keywords:
Computer science, Artificial intelligence, Encoder, Multi-label classification, Machine learning, Contextual image classification, Pattern recognition, Natural language processing

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 35
Citation Normalized Percentile: 0.07

Topics

Text and Document Classification Technologies
Physical Sciences →  Computer Science →  Artificial Intelligence
Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence
© 2026 ScienceGate Book Chapters — All rights reserved.