Image captioning, as a representative cross-modal task, faces significant challenges, including high annotation costs and difficulties in modality alignment. To address these issues, this paper proposes CMSA, an image captioning framework that requires no paired image-text data. The framework integrates a generator, a discriminator, and a reward module, and employs a collaborative multi-module optimization strategy to enhance caption quality. The generator builds multi-level joint feature representations on top of a contrastive language-image pretraining (CLIP) model, effectively mitigating the modality alignment problem and guiding the language model to generate text that is highly consistent with image semantics. The discriminator learns linguistic style from external corpora and evaluates textual naturalness, providing critical reward signals to the generator. The reward module combines image-text relevance with textual quality metrics and optimizes the generator parameters through reinforcement learning, further improving semantic accuracy and language expressiveness. CMSA adopts a progressive multi-stage training strategy that, together with joint feature modeling and reinforcement learning, substantially reduces reliance on costly annotated data. Experimental results on the MSCOCO and Flickr30k datasets show that CMSA outperforms existing methods across multiple evaluation metrics and exhibits strong cross-dataset generalization.
Kangda Cheng, Jinlong Liu, Rui Mao, Zhilu Wu, Erik Cambria
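To make the abstract's reward mechanism concrete, the following is a minimal sketch (in PyTorch) of how an image-text relevance score from a CLIP-style encoder and a naturalness score from a discriminator could be blended into a single reward that drives a REINFORCE-style update of the generator. The function names, the weighting parameter `alpha`, and the batch-mean baseline are illustrative assumptions, not the authors' implementation; CMSA's actual reward weighting and policy-gradient variant are not specified here.

```python
import torch
import torch.nn.functional as F


def clip_relevance(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Image-text relevance as cosine similarity of CLIP-style embeddings (range [-1, 1])."""
    return (F.normalize(image_emb, dim=-1) * F.normalize(text_emb, dim=-1)).sum(dim=-1)


def combined_reward(image_emb, caption_emb, naturalness, alpha=0.5):
    """Blend relevance with a discriminator naturalness score; `alpha` is a hypothetical weight."""
    return alpha * clip_relevance(image_emb, caption_emb) + (1.0 - alpha) * naturalness


def reinforce_loss(token_log_probs, reward, baseline):
    """Policy-gradient loss: raise log-probs of sampled captions with positive advantage."""
    advantage = (reward - baseline).detach()            # reward is treated as a constant
    return -(advantage.unsqueeze(-1) * token_log_probs).sum(dim=-1).mean()


# Toy usage with random tensors standing in for encoder / generator / discriminator outputs.
batch, seq_len, vocab, dim = 4, 12, 1000, 512
image_emb = torch.randn(batch, dim)                     # CLIP image embeddings
caption_emb = torch.randn(batch, dim)                   # CLIP text embeddings of sampled captions
naturalness = torch.rand(batch)                         # discriminator scores in [0, 1]

logits = torch.randn(batch, seq_len, vocab, requires_grad=True)   # generator outputs
sampled = torch.randint(0, vocab, (batch, seq_len))                # sampled caption tokens
token_log_probs = torch.log_softmax(logits, dim=-1).gather(-1, sampled.unsqueeze(-1)).squeeze(-1)

reward = combined_reward(image_emb, caption_emb, naturalness)
baseline = reward.mean()                                # simple batch-mean baseline
loss = reinforce_loss(token_log_probs, reward, baseline)
loss.backward()
```

In this sketch the baseline is simply the batch-mean reward; a greedy-decoding baseline in the style of self-critical sequence training would be a common alternative, but which variant CMSA uses is not stated in the abstract.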