Current methods for image captioning tend to generate sentences that are overly rigid and composed of the most frequent words/phrases, leading to inaccurate and indistinguishable descriptions. This is primarily due to the uneven word distribution of the ground-truth captions, which encourages the model to generate high-frequency words/phrases while suppressing the less frequent but more concrete ones. In this work, we propose a new Content Sensitive and Global Discriminative objective, formulated as two constraints on top of a reference model, to facilitate generating concrete and discriminative image captions. More specifically, the content sensitive constraint places greater weight on the less frequent and more concrete words/phrases, thus facilitating the generation of sentences that better describe the visual details of the given images. To further improve discriminability, the global discriminative constraint pulls the generated sentence toward better distinguishing the corresponding image from others. We evaluate the proposed method on the widely used MS-COCO dataset, where it achieves superior performance over existing competing methods. We also conduct self-retrieval experiments to demonstrate the discriminability of the proposed method.
Jie Wu, Tianshui Chen, Hefeng Wu, Zhi Yang, Guangchun Luo, Liang Lin
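
The two constraints described in the abstract can be read as two loss terms added to a standard captioning objective: an inverse-frequency reweighting of the word-level loss (content sensitive) and a batch-wise retrieval loss that favors the generated caption matching its own image over others (global discriminative). Below is a minimal sketch of that reading in PyTorch; the inverse-frequency weighting, the dot-product similarity, and all names (content_sensitive_loss, global_discriminative_loss, word_freq, alpha, margin) are illustrative assumptions, not the paper's actual formulation.

import torch
import torch.nn.functional as F

def content_sensitive_loss(log_probs, targets, word_freq, alpha=1.0):
    # Cross-entropy reweighted so that rarer (more concrete) words
    # contribute more. `word_freq` is a (vocab,) tensor of corpus
    # frequencies; inverse-frequency weighting is an assumption here.
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    weights = (1.0 / word_freq[targets]).pow(alpha)
    weights = weights / weights.mean()   # keep the overall loss scale stable
    return -(weights * token_ll).mean()

def global_discriminative_loss(sent_embs, img_embs, margin=0.2):
    # Hinge-style retrieval loss: the i-th generated caption should be
    # more similar to the i-th image (diagonal) than to the hardest
    # other image in the batch.
    sims = sent_embs @ img_embs.t()                        # (batch, batch)
    pos = sims.diag().unsqueeze(1)                         # (batch, 1)
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    hardest = sims.masked_fill(mask, float('-inf')).max(dim=1, keepdim=True).values
    return F.relu(margin - pos + hardest).mean()

In this sketch the full training objective would combine both terms with the reference model's loss, e.g. loss = xent + lambda_cs * content_sensitive_loss(...) + lambda_gd * global_discriminative_loss(...), where the weighting coefficients are likewise assumed for illustration.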