JOURNAL ARTICLE

Image Captioning Model Based on Multi-Step Cross-Attention Cross-Modal Alignment and External Commonsense Knowledge Augmentation

Liang WangMeiqing JiaoZhihai LiMengxue ZhangHaiyan WeiYuru MaHongyu AnJiaqi LinJun Wang

Year: 2025 Journal:   Electronics Vol: 14 (16)Pages: 3325-3325   Publisher: Multidisciplinary Digital Publishing Institute

Abstract

To address the semantic mismatch between limited textual descriptions in image captioning training datasets and the multi-semantic nature of images, as well as the underutilized external commonsense knowledge, this article proposes a novel image captioning model based on multi-step cross-attention cross-modal alignment and external commonsense knowledge enhancement. The model employs a backbone architecture comprising CLIP’s ViT visual encoder, Faster R-CNN, BERT text encoder, and GPT-2 text decoder. It incorporates two core mechanisms: a multi-step cross-attention mechanism that iteratively aligns image and text features across multiple rounds, progressively enhancing inter-modal semantic consistency for more accurate cross-modal representation fusion. Moreover, the model employs Faster R-CNN to extract region-based object features. These features are mapped to corresponding entities within the dataset through entity probability calculation and entity linking. External commonsense knowledge associated with these entities is then retrieved from the ConceptNet knowledge graph, followed by knowledge embedding via TransE and multi-hop reasoning. Finally, the fused multimodal features are fed into the GPT-2 decoder to steer caption generation, enhancing the lexical richness, factual accuracy, and cognitive plausibility of the generated descriptions. In the experiments, the model achieves CIDEr scores of 142.6 on MSCOCO and 78.4 on Flickr30k. Ablations confirm both modules enhance caption quality.

Keywords:
Closed captioning Modal Computer science Commonsense knowledge Commonsense reasoning Computer vision Artificial intelligence Image (mathematics) Natural language processing Knowledge-based systems Materials science

Metrics

1
Cited By
4.77
FWCI (Field Weighted Citation Impact)
46
Refs
0.88
Citation Normalized Percentile
Is in top 1%
Is in top 10%

Citation History

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence
© 2026 ScienceGate Book Chapters — All rights reserved.