JOURNAL ARTICLE

Contrastive Learning with Expectation-Maximization for Weakly Supervised Phrase Grounding

Abstract

Weakly supervised phrase grounding aims to learn an alignment between phrases in a caption and objects in the corresponding image using only caption-image annotations, i.e., without phrase-object annotations. Previous methods typically use a caption-image contrastive loss, which supervises the alignment between phrases and objects only indirectly; this prevents them from fully exploiting the intrinsic structure of the multimodal data and leads to unsatisfactory performance. In this work, we apply a phrase-object contrastive loss directly, even though no positive phrase-object annotation is available at the outset. Specifically, we propose a novel contrastive learning framework based on the expectation-maximization algorithm that adaptively refines the target prediction. Experiments on two widely used benchmarks, Flickr30K Entities and RefCOCO+, demonstrate the effectiveness of our framework. We obtain 63.05% top-1 accuracy on Flickr30K Entities and 59.51%/43.46% on RefCOCO+ TestA/TestB, outperforming previous methods by a large margin and even surpassing a previous SoTA method that uses a pre-trained vision-language model. Furthermore, we provide a theoretical analysis of the effectiveness of our method from the perspective of maximum likelihood estimation with latent variables.
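
The idea of combining a phrase-object contrastive loss with expectation-maximization can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`normalize`, `e_step`, `m_step_loss`), the cosine-similarity scoring, the temperature value, and the soft-target cross-entropy form are all assumptions chosen for illustration. The sketch treats the unknown matching object as a latent variable, estimates a soft target over the objects in the paired image (E-step), and then scores those objects against objects from other images with a contrastive loss weighted by that target (M-step).

```python
import numpy as np

def normalize(x):
    """L2-normalize feature vectors along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def e_step(phrase, pos_objects, temperature=0.1):
    """Soft target over objects in the paired image (hypothetical form).

    Returns a probability vector: the current belief about which object
    the phrase refers to, given the present embeddings.
    """
    sims = pos_objects @ phrase / temperature
    sims = sims - sims.max()              # numerical stability
    weights = np.exp(sims)
    return weights / weights.sum()

def m_step_loss(phrase, pos_objects, neg_objects, targets, temperature=0.1):
    """Contrastive loss with soft targets (hypothetical form).

    Objects from the paired image compete with objects drawn from other
    images; the E-step targets weight the log-probabilities of the
    paired-image objects.
    """
    all_objects = np.concatenate([pos_objects, neg_objects], axis=0)
    logits = all_objects @ phrase / temperature
    logits = logits - logits.max()        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -(targets * log_probs[: len(pos_objects)]).sum()

# Toy usage with random features (assumed shapes: 4 candidate objects in
# the paired image, 6 objects from other images, 8-dimensional embeddings).
rng = np.random.default_rng(0)
pos = normalize(rng.normal(size=(4, 8)))
neg = normalize(rng.normal(size=(6, 8)))
phrase = pos[0]                           # phrase embedding aligned with object 0

q = e_step(phrase, pos)                   # E-step: refine the soft target
loss = m_step_loss(phrase, pos, neg, q)   # M-step objective to minimize
print(q, loss)
```

In a training loop, the two steps would alternate: the targets are recomputed from the current embeddings, then the embeddings are updated by gradient descent on the loss with the targets held fixed. How the published method defines its targets and negatives may differ from this sketch.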

Keywords:
Phrase; Computer science; Margin (machine learning); Artificial intelligence; Annotation; Object (grammar); Natural language processing; Perspective (graphical); Expectation–maximization algorithm; Maximization; Pattern recognition (psychology); Speech recognition; Machine learning; Maximum likelihood; Mathematics

Metrics

Cited By: 3
FWCI (Field Weighted Citation Impact): 0.37
Refs: 33
Citation Normalized Percentile: 0.57

Topics

Multimodal Machine Learning Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Natural Language Processing Techniques
Physical Sciences → Computer Science → Artificial Intelligence
Topic Modeling
Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Counterfactual contrastive learning for weakly supervised temporal sentence grounding

Yenan Xu, Wanru Xu, Zhenjiang Miao

Journal: Neurocomputing Year: 2025 Vol: 624 Pages: 129508-129508

JOURNAL ARTICLE

Weakly Supervised Temporal Sentence Grounding with Gaussian-based Contrastive Proposal Learning

Minghang Zheng, Yanjie Huang, Qing-Chao Chen, Yuxin Peng, Yang Liu

Journal: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Year: 2022 Pages: 15534-15543

JOURNAL ARTICLE

A Dual Reinforcement Learning Framework for Weakly Supervised Phrase Grounding

Zhiyu Wang, Chao Yang, Bin Jiang, Junsong Yuan

Journal: IEEE Transactions on Multimedia Year: 2023 Vol: 26 Pages: 394-405