Liang Wang, Meiqing Jiao, Zhihai Li, Mengxue Zhang, Haiyan Wei, Yuru Ma, Hongyu An, Jiaqi Lin, Jun Wang
To address the semantic mismatch between the limited textual descriptions in image captioning training datasets and the multi-semantic nature of images, as well as the underuse of external commonsense knowledge, this article proposes an image captioning model based on multi-step cross-attention cross-modal alignment and external commonsense knowledge enhancement. The backbone comprises CLIP’s ViT visual encoder, Faster R-CNN, a BERT text encoder, and a GPT-2 text decoder, and the model incorporates two core mechanisms. The first is a multi-step cross-attention mechanism that iteratively aligns image and text features over multiple rounds, progressively strengthening inter-modal semantic consistency for more accurate cross-modal representation fusion. The second is an external commonsense knowledge enhancement module: Faster R-CNN extracts region-based object features, which are mapped to corresponding entities through entity probability calculation and entity linking; commonsense knowledge associated with these entities is then retrieved from the ConceptNet knowledge graph and embedded with TransE, followed by multi-hop reasoning. Finally, the fused multimodal features are fed into the GPT-2 decoder to steer caption generation, improving the lexical richness, factual accuracy, and cognitive plausibility of the generated descriptions. In experiments, the model achieves CIDEr scores of 142.6 on MSCOCO and 78.4 on Flickr30k, and ablation studies confirm that both modules improve caption quality.
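To make the alignment step more concrete, the following is a minimal PyTorch sketch (not the authors' released code) of iterative cross-attention in which text features repeatedly attend to image-region features over several rounds before fusion. The module name MultiStepCrossAttention, the number of rounds num_steps, and the feature dimensions are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of multi-step cross-modal alignment via iterative cross-attention.
# Assumed, illustrative names: MultiStepCrossAttention, num_steps, d_model.
import torch
import torch.nn as nn


class MultiStepCrossAttention(nn.Module):
    def __init__(self, d_model: int = 768, num_heads: int = 8, num_steps: int = 3):
        super().__init__()
        self.num_steps = num_steps
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (batch, text_len,  d_model), e.g. BERT token features
        # image_feats: (batch, n_regions, d_model), e.g. CLIP-ViT / Faster R-CNN region features
        aligned = text_feats
        for _ in range(self.num_steps):
            # Each round: text queries attend to image keys/values, and the
            # refined representation becomes the query for the next round.
            attended, _ = self.cross_attn(aligned, image_feats, image_feats)
            aligned = self.norm(aligned + attended)  # residual update per step
        return aligned


# Usage example: three rounds of alignment over 36 region features and 20 text tokens.
if __name__ == "__main__":
    module = MultiStepCrossAttention()
    text = torch.randn(2, 20, 768)
    regions = torch.randn(2, 36, 768)
    fused = module(text, regions)
    print(fused.shape)  # torch.Size([2, 20, 768])
```

In this sketch the aligned text features would then be concatenated or fused with the TransE knowledge embeddings before being passed to the GPT-2 decoder; the exact fusion operator is not specified in the abstract.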