ViT2CMH: Vision Transformer Cross-Modal Hashing for Fine-Grained Vision-Text Retrieval

Mingyong Li; Qiqi Li; Zheng Jiang; Yan Ma

doi:10.32604/csse.2023.034757

JOURNAL ARTICLE

Year: 2023; Journal: Computer Systems Science and Engineering; Vol: 46 (2); Pages: 1401-1414

Abstract

In recent years, advances in deep learning have further improved hashing-based retrieval. Most existing hashing methods use Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to process images and text, respectively. These networks restrict image and text features to local context, and matching on inherent labels cannot capture fine-grained information, which often leads to suboptimal results. Motivated by progress in transformer models, we propose ViT2CMH, a framework based mainly on the Vision Transformer, to handle deep cross-modal hashing tasks in place of CNNs or RNNs. Specifically, we use a BERT network to extract text features and a Vision Transformer as the image network of the model. The features are then transformed into hash codes for fast, efficient retrieval. We conduct extensive experiments on Microsoft COCO (MS-COCO) and Flickr30K, comparing against hashing baselines and image-text matching methods; the results show that our method performs better.
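The last step of the pipeline described in the abstract, turning features into hash codes and retrieving by code similarity, can be illustrated with a minimal sketch. This is not the authors' method: the learned hash layers are replaced here by a fixed random projection followed by a sign threshold, and the BERT and Vision Transformer encoders are replaced by hand-written feature vectors assumed to already lie in a shared space. All function names and values below are hypothetical.

```python
import random


def make_projection(dim, n_bits, seed=0):
    """Random projection matrix (n_bits x dim); a stand-in for a learned hash layer."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def to_hash_code(feature, projection):
    """Binarize a feature vector: one bit per projection row, set by the sign of the dot product."""
    return [
        1 if sum(w * x for w, x in zip(row, feature)) >= 0 else 0
        for row in projection
    ]


def hamming(a, b):
    """Number of bit positions where two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def retrieve(query_code, database_codes, top_k=1):
    """Indices of the top_k database codes closest to the query in Hamming distance."""
    order = sorted(
        range(len(database_codes)),
        key=lambda i: hamming(query_code, database_codes[i]),
    )
    return order[:top_k]


# Hypothetical image features (in practice these would come from an image encoder).
image_features = [
    [-0.9, 0.1, -0.4, 0.3],
    [0.9, -0.1, 0.4, -0.3],
    [0.2, 0.8, -0.5, 0.1],
]
# Hypothetical text feature; assumed (idealized) to coincide with its paired image, index 1.
text_feature = [0.9, -0.1, 0.4, -0.3]

projection = make_projection(dim=4, n_bits=16)
image_codes = [to_hash_code(f, projection) for f in image_features]
text_code = to_hash_code(text_feature, projection)

best = retrieve(text_code, image_codes, top_k=1)
print(best)
```

Comparing short binary codes by Hamming distance is what makes hashing-based retrieval cheap: the database stores only the bits, and each comparison is a count of differing positions rather than a floating-point similarity over full feature vectors.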

Keywords: Computer science; Convolutional neural network; Hash function; Transformer; Artificial intelligence; Recurrent neural network; Deep learning; Image retrieval; Modal; Pattern recognition (psychology); Artificial neural network; Machine learning; Image (mathematics)

Metrics

FWCI (Field Weighted Citation Impact): 0.55

Citation Normalized Percentile: 0.58

Topics

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Video Analysis and Summarization

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

TECMH: Transformer-Based Cross-Modal Hashing For Fine-Grained Image-Text Retrieval

Deep Cross-modal Hashing Retrieval Based on Semantics Preserving and Vision Transformer

CrossHash: Cross-scale Vision Transformer Hashing for Image Retrieval

Vision Transformer Hashing for Image Retrieval

Similarity Preserving Transformer Cross-Modal Hashing for Video-Text Retrieval