Abstract

Cross-modal image-recipe retrieval aims to capture the correlation between food images and recipes. Although existing methods perform well on retrieval tasks, they often overlook two aspects: (1) capturing fine-grained recipe information and (2) modeling correlations between embeddings from different modalities. We introduce the Multimodal Fusion Retrieval Framework (MFRF) to address these issues. The framework uses deep learning-based encoders to process recipe and image data, incorporates a fusion network to learn cross-modal semantic alignment, and performs image-recipe retrieval on the resulting representations. MFRF comprises three modules. The recipe preprocessing module applies Transformers at several levels to extract features such as the title and ingredients from the recipe, and uses a BERT-based LSTM to model contextual relationships and dependencies among the sentences of the recipe instructions. The multimodal fusion module uses visual-linguistic contrastive losses to align image and recipe representations, and cross-modal attention to let the two modalities interact. Finally, the cross-modal retrieval module uses a triplet loss function to enable retrieval of image-recipe pairs across modalities. Experiments on the widely used Recipe1M benchmark dataset demonstrate the effectiveness of MFRF, with substantial improvements on both the 1k and 10k test sets: an R@1 gain of +9.9% (64.8 R@1) on the 1k test set and +8.4% (33.7 R@1) on the 10k test set.
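The retrieval objective and the R@1 metric mentioned in the abstract can be illustrated with a minimal sketch. It is not the MFRF implementation: the abstract does not give the loss formulation, margin, or negative-sampling strategy, so the function names, the margin of 0.3, and the use of all in-batch negatives are assumptions made for illustration. The sketch assumes that image i and recipe i form the matching pair, scores pairs by cosine similarity, and reports recall at K as the fraction of image queries whose true recipe ranks in the top K.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_sim_matrix(image_embs, recipe_embs):
    """Return sims where sims[i][j] is the cosine similarity of image i and recipe j."""
    images = [l2_normalize(v) for v in image_embs]
    recipes = [l2_normalize(v) for v in recipe_embs]
    return [[sum(a * b for a, b in zip(img, rec)) for rec in recipes] for img in images]


def bidirectional_triplet_loss(sims, margin=0.3):
    """Mean hinge loss over all in-batch negatives, in both retrieval directions.

    Matching pairs are assumed to lie on the diagonal of sims.
    The margin value is a hypothetical choice, not taken from the paper.
    """
    n = len(sims)
    total, count = 0.0, 0
    for i in range(n):
        positive = sims[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i as anchor, recipe j as negative.
            total += max(0.0, margin + sims[i][j] - positive)
            # Recipe i as anchor, image j as negative.
            total += max(0.0, margin + sims[j][i] - positive)
            count += 2
    return total / count if count else 0.0


def recall_at_k(sims, k=1):
    """Fraction of image queries whose matching recipe ranks within the top k."""
    n = len(sims)
    hits = 0
    for i in range(n):
        # Count how many recipes score strictly higher than the true match.
        better = sum(1 for j in range(n) if sims[i][j] > sims[i][i])
        if better < k:
            hits += 1
    return hits / n


# Toy embeddings (made up): the third image is closer to the wrong recipes.
toy_images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_recipes = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.2]]
toy_sims = cosine_sim_matrix(toy_images, toy_recipes)
print("triplet loss:", bidirectional_triplet_loss(toy_sims))
print("R@1:", recall_at_k(toy_sims, k=1))
```

On benchmarks such as the 1k and 10k test sets, R@K is computed the same way over a candidate pool of that size, so a larger pool makes the same K harder to reach.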

Keywords:
Recipe; Computer science; Artificial intelligence; Preprocessor; Modal; Benchmark (surveying); Encoder; Natural language processing; Modalities; Deep learning; Information retrieval; Pattern recognition (psychology); Machine learning

Metrics

Cited By: 5
FWCI (Field Weighted Citation Impact): 0.91
Refs: 19
Citation Normalized Percentile: 0.72

Topics

Multimodal Machine Learning Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences → Computer Science → Artificial Intelligence