Consensus Knowledge-Guided Semantic Enhanced Interaction for Image-Text Retrieval

Hongbin Wang; Hui Wang; Fan Li

doi:10.20965/jaciii.2025.p0956

JOURNAL ARTICLE

Year: 2025
Journal: Journal of Advanced Computational Intelligence and Intelligent Informatics
Vol: 29 (4)
Pages: 956-967
Publisher: Fuji Technology Press Ltd.

Abstract

Image–text retrieval, a fundamental cross-modal task, centers on exploring semantic consistency and achieving precise alignment between related image–text pairs. Existing approaches that introduce commonsense knowledge rely primarily on co-occurrence frequency to construct coherent representations, thereby facilitating high-quality semantic alignment across the two modalities. However, these methods often overlook the conceptual and syntactic correspondences between cross-modal fragments. To overcome this limitation, this work proposes a consensus knowledge-guided semantic enhanced interaction method, referred to as CSEI, for image–text retrieval. The method correlates intra-modal and inter-modal semantics between image regions or objects and sentence words, aiming to minimize cross-modal discrepancies. First, visual and textual corpus sets are constructed that encapsulate rich concepts and relationships derived from commonsense knowledge. Next, to enhance intra-modal relationships, a semantic relation-aware graph convolutional network is employed to capture more comprehensive feature representations. For inter-modal similarity reasoning, local and global similarity features are extracted through two cross-modal semantic enhancement mechanisms. Finally, the approach integrates commonsense knowledge with internal semantic correlations to enrich concept representation and further optimize semantic consistency by regularizing the importance disparities among association-enhanced concepts. Experiments on MS-COCO and Flickr30K validate the effectiveness of the proposed method.
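The abstract names three generic building blocks: a graph convolution over intra-modal fragments, a local (fragment-level) similarity, and a global (pooled) similarity. The sketch below illustrates those ideas in plain Python. It is not the CSEI implementation, whose details the abstract does not give. The function names, the cosine-based adjacency, the attention temperature, and the assumption that regions and words already live in a shared embedding space of equal dimension are all illustrative choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def graph_conv(features, weight):
    """One graph-convolution step, H' = ReLU(A @ H @ W).

    Assumption: the adjacency A is a row-wise softmax of pairwise cosine
    similarities between fragments (regions or words) of one modality.
    """
    n, d_out = len(features), len(weight[0])
    projected = [
        [sum(f[k] * weight[k][j] for k in range(len(f))) for j in range(d_out)]
        for f in features
    ]
    out = []
    for i in range(n):
        attn = softmax([cosine(features[i], features[j]) for j in range(n)])
        agg = [sum(attn[j] * projected[j][d] for j in range(n)) for d in range(d_out)]
        out.append([max(0.0, x) for x in agg])  # ReLU
    return out


def local_similarity(regions, words, temperature=0.1):
    """Fragment-level score: each word attends over regions, then is compared
    with its attended region context; the per-word scores are averaged."""
    scores = []
    for w in words:
        attn = softmax([cosine(w, r) / temperature for r in regions])
        context = [sum(a * r[d] for a, r in zip(attn, regions)) for d in range(len(w))]
        scores.append(cosine(w, context))
    return sum(scores) / len(scores)


def global_similarity(regions, words):
    """Holistic score: cosine between the mean-pooled region and word vectors."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    return cosine(mean(regions), mean(words))


# Toy 2-dimensional features; real systems would use learned embeddings.
regions = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
refined_regions = graph_conv(regions, identity)
matched = local_similarity(refined_regions, [[1.0, 0.0]])
mismatched = local_similarity(refined_regions, [[-1.0, 0.0]])
```

In this toy setup a word vector that points toward one of the regions receives a higher local score than one that points away from both, which is the behavior a retrieval model would rank on. How CSEI combines local and global scores, and how commonsense knowledge enters the graph, is not specified in the abstract and is not modeled here.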

Keywords:

Computer science; Natural language processing; Artificial intelligence; Consistency (knowledge bases); Modal; Semantics (computer science); Information retrieval; Sentence; Similarity (geometry); Representation (politics); Semantic similarity; Image (mathematics)

Metrics

FWCI (Field Weighted Citation Impact): 0.00

Citation Normalized Percentile: 0.27

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence
