Cross-modal image-recipe retrieval aims to capture the correlation between food images and their corresponding recipes. While existing methods perform well on retrieval benchmarks, they often overlook two crucial aspects: (1) capturing fine-grained recipe information and (2) modeling correlations between embeddings from different modalities. We introduce the Multimodal Fusion Retrieval Framework (MFRF) to address these issues. The framework uses deep encoders to process recipe and image data, a fusion network to learn cross-modal semantic alignment, and a retrieval objective to match images with recipes. MFRF comprises three modules. The recipe preprocessing module applies Transformers at multiple levels to extract key components of a recipe, such as its title and ingredients, and runs an LSTM over BERT sentence representations to model contextual relationships and dependencies among the sentences of the cooking instructions. The multimodal fusion module uses visual-linguistic contrastive losses to align image and recipe representations and leverages cross-modal attention to enable effective interaction between the two modalities. Finally, the cross-modal retrieval module employs a triplet loss to perform cross-modal retrieval of image-recipe pairs. Experiments on the widely used Recipe1M benchmark demonstrate the effectiveness of MFRF, which improves R@1 by +9.9% (to 64.8) on the 1k test set and by +8.4% (to 33.7) on the 10k test set.
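To make the two training objectives named in the abstract concrete, the sketch below gives a generic PyTorch implementation of an in-batch bidirectional triplet loss and a symmetric contrastive alignment loss of the kind MFRF employs. This is a minimal illustration, not the authors' released code: the function names, margin, and temperature values are hypothetical, and the two (B, D) embedding matrices are assumed to be row-aligned image-recipe pairs produced by the image and recipe encoders.

```python
import torch
import torch.nn.functional as F


def bidirectional_triplet_loss(img_emb: torch.Tensor,
                               rec_emb: torch.Tensor,
                               margin: float = 0.3) -> torch.Tensor:
    """In-batch bidirectional triplet (ranking) loss.

    img_emb, rec_emb: (B, D) embeddings whose i-th rows form a matched
    image-recipe pair; every other row in the batch acts as a negative.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    rec_emb = F.normalize(rec_emb, dim=-1)
    sim = img_emb @ rec_emb.t()              # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)            # similarity of matched pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # image -> recipe: column j != i is a negative recipe for image i
    cost_i2r = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    # recipe -> image: row i != j is a negative image for recipe j
    cost_r2i = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_i2r.mean() + cost_r2i.mean()


def contrastive_alignment_loss(img_emb: torch.Tensor,
                               rec_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style contrastive loss that pulls matched
    image-recipe pairs together and pushes in-batch mismatches apart."""
    img_emb = F.normalize(img_emb, dim=-1)
    rec_emb = F.normalize(rec_emb, dim=-1)
    logits = img_emb @ rec_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy usage with random embeddings standing in for encoder outputs.
    img = torch.randn(8, 512)
    rec = torch.randn(8, 512)
    print(bidirectional_triplet_loss(img, rec).item())
    print(contrastive_alignment_loss(img, rec).item())
```

In practice the contrastive term would be applied during fusion to align the two representation spaces, while the triplet term drives the final retrieval ranking; the sketch treats the batch's off-diagonal pairs as negatives, a common choice when no explicit hard-negative mining is described.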