JOURNAL ARTICLE

Revamping Image-Recipe Cross-Modal Retrieval with Dual Cross Attention Encoders

Wenhao Liu, S. C. Yuan, Zhen Wang, Xinyi Chang, Limeng Gao, Huajin Zhang

Year: 2024   Journal: Mathematics   Vol: 12 (20)   Pages: 3181-3181   Publisher: Multidisciplinary Digital Publishing Institute

Abstract

The image-recipe cross-modal retrieval task, which retrieves relevant recipes from food images and vice versa, is attracting widespread attention. The task poses two main challenges. First, a recipe’s components (words in a sentence, sentences in an entity, and entities in a recipe) differ in importance. If all components are weighted equally, the recipe embedding cannot emphasize the important ones, so they contribute less to retrieval. Second, food images are strongly local: only the food regions matter, yet enhancing the discriminative local region features in food images remains difficult. To address these two problems, we propose a novel framework named Dual Cross Attention Encoders for Cross-modal Food Retrieval (DCA-Food). The framework consists of a hierarchical cross attention recipe encoder (HCARE) and a cross attention image encoder (CAIE). HCARE comprises three types of cross attention modules that capture the important words in a sentence, the important sentences in an entity, and the important entities in a recipe, respectively. CAIE extracts global and local region features, then computes cross attention between them to enhance the discriminative local features in food images. Ablation studies validate our design choices. Our approach outperforms existing approaches by a large margin on the Recipe1M dataset: it improves R@1 by +2.7 and +1.9 on the 1k and 10k test sets, respectively.
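
The idea of attending from a global image feature to local region features can be illustrated with a minimal sketch. This is not the paper's CAIE: the abstract does not specify the projections, the number of heads, or which side acts as the query, so the sketch assumes single-head scaled dot-product attention with identity projections and the global feature as the query. The names `cross_attention`, `global_feat`, and `local_feats`, and all feature values, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product attention of one query vector over key/value vectors.

    Assumption: no learned projections (identity W_q, W_k, W_v), single head.
    Returns (weights, attended), where attended is the weighted sum of values.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    attended = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, attended


# Hypothetical toy features: one global image vector, three local regions.
global_feat = [1.0, 0.0, 1.0, 0.0]
local_feats = [
    [0.9, 0.1, 0.8, 0.0],  # region that resembles the global content
    [0.0, 1.0, 0.0, 1.0],  # background-like region
    [0.5, 0.5, 0.5, 0.5],  # mixed region
]

# The global feature queries the local regions; regions aligned with it
# receive larger weights and dominate the enhanced feature.
weights, enhanced = cross_attention(global_feat, local_feats, local_feats)
print(weights)
print(enhanced)
```

In this toy setting the region most aligned with the global feature gets the largest weight and the background-like region the smallest, which is the sense in which cross attention can emphasize discriminative local regions. The same pattern, applied at the word, sentence, and entity levels, is one plausible reading of the hierarchical recipe encoder described above.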

Keywords:
Recipe; Dual (grammatical number); Modal; Encoder; Computer science; Artificial intelligence; Computer vision; Algorithm; Geography; Materials science; Art

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 50
Citation Normalized Percentile: 0.19

Topics

Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Image Retrieval and Classification Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
