Visual semantic embedding networks and cross-modal cross-attention networks are the two architectures usually adopted for image-text retrieval. Existing works have confirmed that the two can achieve comparable accuracy, but the former has lower computational complexity, so its retrieval speed is faster and its engineering application value is higher. In this paper, we propose a Super Visual Semantic Embedding Network (SVSEN) for cross-modal image-text retrieval, which consists of two independent branches: an image embedding network and a text embedding network. In the image embedding network, a feature extraction network first extracts fine-grained image features; a graph attention module with a residual link then enhances their semantics; finally, a Softmax pooling strategy maps the fine-grained image features into a common embedding space. In the text embedding network, the pre-trained BERT-base-uncased model extracts context-dependent word vectors, which are fine-tuned during training, and max pooling maps the fine-grained word vectors into the common embedding space. In this common space, a soft-label triplet loss is adopted for cross-modal semantic alignment learning. Experiments on two widely used datasets, MS-COCO and Flickr-30K, show that our proposed SVSEN achieves the best performance. For instance, on Flickr-30K, SVSEN improves image retrieval by a relative 3.91% and text retrieval by a relative 1.96% (R@1).
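The semantic-enhancement step can be illustrated as self-attention over region features on a fully connected graph, followed by a residual link. This is only a minimal single-head sketch in NumPy; the function name `graph_attention_residual`, the learned projections `wq`/`wk`/`wv`, and the scaled dot-product scoring are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    m = x.max(axis=axis, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention_residual(x, wq, wk, wv):
    """Single-head attention over region features on a fully connected
    graph, with a residual link so the enhanced features retain the
    original fine-grained information (illustrative sketch).

    x: (n_regions, d) region features; wq, wk, wv: (d, d) projections.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    # (n, n) edge weights between all region pairs
    att = softmax(q @ k.T / np.sqrt(x.shape[1]), axis=-1)
    return x + att @ v  # residual link
```

With the value projection set to zero, the residual path alone passes the input through unchanged, which makes the residual behaviour easy to check.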
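Softmax pooling can be read as a soft compromise between mean and max pooling: each region's activations are weighted by a softmax over the region axis, so salient regions dominate the pooled embedding. The sketch below is a plausible NumPy rendering under that reading; the exact pooling formula of SVSEN may differ.

```python
import numpy as np

def softmax_pool(features, axis=0):
    """Softmax pooling over a set of fine-grained feature vectors.

    features: (n_regions, d) array. Each dimension is pooled as a
    weighted average whose weights are a softmax of that dimension's
    activations across regions (illustrative sketch).
    """
    m = features.max(axis=axis, keepdims=True)   # numerical stability
    w = np.exp(features - m)
    w = w / w.sum(axis=axis, keepdims=True)
    return (w * features).sum(axis=axis)

regions = np.array([[0.1, 2.0], [0.9, 0.0], [3.0, 1.0]])  # 3 regions, d=2
pooled = softmax_pool(regions)  # shape (2,), between mean and max pooling
```

Because the softmax weights increase monotonically with the activation value, the pooled result always lies between the mean-pooled and max-pooled values in each dimension.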
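The alignment objective can be sketched as a hinge triplet loss over an image-text similarity matrix, where negatives are combined with soft (softmax) weights rather than labeled with a single hard target. The soft-weighting scheme, the `temperature` parameter, and the function name below are illustrative assumptions; the paper's soft-label formulation is not reproduced here.

```python
import numpy as np

def soft_triplet_loss(sim, margin=0.2, temperature=10.0):
    """Bidirectional hinge triplet loss with a soft weighting over
    negatives (illustrative sketch, not the paper's exact loss).

    sim: (N, N) matrix with sim[i, j] = similarity(image_i, text_j);
    diagonal entries are the matched (positive) pairs.
    """
    n = sim.shape[0]
    pos = np.diag(sim)
    # hinge cost of each negative against its positive, per direction
    cost_t = np.maximum(0.0, margin + sim - pos[:, None])  # image -> text
    cost_i = np.maximum(0.0, margin + sim - pos[None, :])  # text -> image
    mask = 1.0 - np.eye(n)  # exclude the positives themselves

    def weighted(cost):
        # softmax weights emphasize harder negatives (soft labels)
        w = np.exp(temperature * cost) * mask
        w = w / np.maximum(w.sum(axis=1, keepdims=True), 1e-12)
        return (w * cost).sum(axis=1).mean()

    return weighted(cost_t) + weighted(cost_i.T)
```

When every positive pair outscores all negatives by at least the margin, all hinge costs vanish and the loss is zero; otherwise harder negatives receive larger soft weights.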
Jinghao Huang, Yaxiong Chen, Shengwu Xiong, Xiaoqiang Lu