JOURNAL ARTICLE

Unpaired Image Captioning via Cross-Modal Semantic Alignment

Yong Yang, Kai Zhou, Ge Ren

Year: 2025 | Journal: Applied Sciences | Vol. 15 (21) | Pages: 11588 | Publisher: Multidisciplinary Digital Publishing Institute

Abstract

Image captioning, as a representative cross-modal task, faces significant challenges, including high annotation costs and modality alignment difficulties. To address these issues, this paper proposes CMSA, an image captioning framework that does not require paired image-text data. The framework integrates a generator, a discriminator, and a reward module, employing a collaborative multi-module optimization strategy to enhance caption quality. The generator builds multi-level joint feature representations based on a contrastive language-image pretraining (CLIP) model, mitigating the modality alignment problem and guiding the language model to generate text that is highly consistent with image semantics. The discriminator learns linguistic style from external corpora and evaluates textual naturalness, providing critical reward signals to the generator. The reward module combines image-text relevance with textual quality metrics, optimizing the generator parameters through reinforcement learning to further improve semantic accuracy and language expressiveness. CMSA adopts a progressive multi-stage training strategy that, together with joint feature modeling and reinforcement learning, substantially reduces reliance on costly annotated data. Experimental results demonstrate that CMSA outperforms existing methods across multiple evaluation metrics on the MSCOCO and Flickr30k datasets, exhibiting superior performance and strong cross-dataset generalization ability.
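The abstract describes a reward module that mixes an image-text relevance signal with a textual-quality signal from the discriminator to drive reinforcement learning. A minimal sketch of that idea follows; the function names, the linear mixing weight `alpha`, and the self-critical (greedy-baseline) advantage are illustrative assumptions, not the authors' exact formulation:

```python
# Hedged sketch: combine an image-text relevance score (e.g. a CLIP
# cosine similarity) with a discriminator naturalness score into one
# scalar reward, then form a self-critical REINFORCE-style advantage.
# All names and the weight alpha are assumptions for illustration.

def combined_reward(clip_score: float, disc_score: float, alpha: float = 0.7) -> float:
    """Weighted mix of image-text relevance and text naturalness.

    Both inputs are assumed to be normalized to [0, 1]; alpha is a
    hypothetical trade-off weight between the two signals.
    """
    return alpha * clip_score + (1.0 - alpha) * disc_score


def self_critical_advantage(sample_reward: float, baseline_reward: float) -> float:
    """Policy-gradient advantage: reward of a sampled caption minus the
    reward of the greedily decoded (baseline) caption."""
    return sample_reward - baseline_reward


# Example: a sampled caption scoring better than the greedy baseline
# gets a positive advantage, pushing the generator toward it.
r_sample = combined_reward(clip_score=0.82, disc_score=0.60)
r_greedy = combined_reward(clip_score=0.75, disc_score=0.55)
adv = self_critical_advantage(r_sample, r_greedy)
```

In a full system the advantage would scale the log-likelihood gradient of the sampled caption; here only the scalar reward shaping is shown.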

