JOURNAL ARTICLE

A Dual Reinforcement Learning Framework for Weakly Supervised Phrase Grounding

Zhiyu Wang, Chao Yang, Bin Jiang, Junsong Yuan

Year: 2023
Journal: IEEE Transactions on Multimedia
Vol: 26
Pages: 394-405
Publisher: Institute of Electrical and Electronics Engineers

Abstract

Weakly-supervised phrase grounding aims to localize the region in an image that corresponds to a given textual phrase, where the mapping between noun phrases and image regions is not available during training. Because region-level annotations are unavailable in the weakly-supervised setting, previous methods typically exploit an additional proxy task (e.g., phrase reconstruction or image-phrase alignment) to provide supervision for training. However, there is a significant gap between the optimization objectives of the proxy tasks and that of the target grounding task, which may result in inefficient optimization of the target model. Therefore, in this paper, we propose a novel dual reinforcement learning framework to directly optimize the phrase grounding model. Specifically, we consider the duality of the phrase grounding and phrase generation tasks. These two tasks form a closed loop that can provide quality feedback signals to measure the performance of each other. In this way, we can measure the correctness of the localized regions and thus optimize the grounding model directly. We design two reward functions to quantify the feedback signals and train the models via reinforcement learning. In addition, to ease the training difficulty of our framework, we present a heuristic algorithm that generates pseudo region-phrase pairs to warm-start our models. We perform experiments on two popular phrase grounding datasets, ReferItGame and Flickr30K Entities, and the results demonstrate that our method outperforms previous methods by a large margin.
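
The closed-loop idea in the abstract can be illustrated with a toy sketch. This is not the paper's model or its reward functions: it assumes a single image with three candidate regions, replaces the phrase generation model with a fixed, made-up table (`phrase_given_region`, a hypothetical name) giving the probability that the phrase is regenerated from each region, and trains only the grounding side with a plain REINFORCE update and a moving-average baseline.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_grounder(phrase_given_region, steps=2000, lr=0.5, seed=0):
    """Train a toy grounding policy with REINFORCE.

    phrase_given_region[i] is an assumed stand-in for the generation model:
    the probability that it regenerates the query phrase from region i.
    That probability is used directly as the reward for picking region i.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(phrase_given_region)  # grounder parameters
    baseline = 0.0  # moving average of rewards, reduces gradient variance
    for _ in range(steps):
        probs = softmax(logits)
        # Grounding step: sample a region for the phrase.
        region = rng.choices(range(len(probs)), weights=probs)[0]
        # Feedback step: how well can the phrase be recovered from it?
        reward = phrase_given_region[region]
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Policy-gradient step: d log pi(region) / d logit_i = 1[i == region] - p_i
        for i in range(len(logits)):
            grad = (1.0 if i == region else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)


# Made-up feedback values: region 1 explains the phrase best.
probs = train_grounder([0.1, 0.9, 0.2])
best_region = max(range(len(probs)), key=probs.__getitem__)
```

In this sketch the grounder never sees a region-level label; it is pushed toward whichever region lets the (assumed) generator recover the phrase. The framework described in the abstract additionally trains the generation model with a second reward and warm-starts both models on pseudo region-phrase pairs, neither of which is modeled here.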

Keywords:
Computer science, Phrase, Reinforcement learning, Artificial intelligence, Noun phrase, Correctness, Machine learning, Supervised learning, Ground, Natural language processing, Algorithm, Noun

Metrics

Cited By: 2
FWCI (Field Weighted Citation Impact): 0.36
Refs: 72
Citation Normalized Percentile: 0.50

Topics

Multimodal Machine Learning Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Natural Language Processing Techniques
Physical Sciences → Computer Science → Artificial Intelligence
Topic Modeling
Physical Sciences → Computer Science → Artificial Intelligence