JOURNAL ARTICLE

Doubly-Attentive Decoder for Multi-modal Neural Machine Translation

Abstract

We introduce a Multi-modal Neural Machine Translation model in which a doubly-attentive decoder naturally incorporates spatial visual features obtained using pre-trained convolutional neural networks, bridging the gap between image description and translation. Our decoder learns to attend to source-language words and parts of an image independently by means of two separate attention mechanisms as it generates words in the target language. We find that our model can efficiently exploit not just back-translated in-domain multi-modal data but also large general-domain text-only MT corpora. We also report state-of-the-art results on the Multi30k data set.
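
The decoding step described in the abstract, with one attention mechanism over source-language words and a second, independent one over image regions, can be illustrated with a minimal NumPy sketch. The abstract does not specify the attention form, so the additive (tanh-based) scoring used here, all dimensions, the random weights, and the names `additive_attention`, `src_params`, and `img_params` are assumptions for illustration only, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(query, keys, W_q, W_k, v):
    """Return (context, weights) for one decoder state over a set of key vectors."""
    scores = np.tanh(keys @ W_k + query @ W_q) @ v  # shape: (n_keys,)
    weights = softmax(scores)                        # attention distribution
    context = weights @ keys                         # weighted sum of the keys
    return context, weights

rng = np.random.default_rng(0)

# Hypothetical sizes: decoder state, source annotation, image feature, attention space.
d_dec, d_src, d_img, d_att = 8, 6, 10, 5
n_words, n_regions = 4, 9  # e.g. a 3x3 grid of CNN feature-map locations

src_annotations = rng.normal(size=(n_words, d_src))  # stand-in for encoder outputs
img_features = rng.normal(size=(n_regions, d_img))   # stand-in for spatial CNN features
dec_state = rng.normal(size=d_dec)                   # current decoder hidden state

# Two separate parameter sets, so the two attention mechanisms are independent.
src_params = (rng.normal(size=(d_dec, d_att)),
              rng.normal(size=(d_src, d_att)),
              rng.normal(size=d_att))
img_params = (rng.normal(size=(d_dec, d_att)),
              rng.normal(size=(d_img, d_att)),
              rng.normal(size=d_att))

src_context, src_weights = additive_attention(dec_state, src_annotations, *src_params)
img_context, img_weights = additive_attention(dec_state, img_features, *img_params)

# Both context vectors would feed the next decoder update; here they are just concatenated.
decoder_input = np.concatenate([src_context, img_context])
```

Each weight vector is a probability distribution over its own modality (words or image regions), which is what lets the decoder attend to the two inputs independently at every target-language step.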

Keywords:
Computer science; Machine translation; Convolutional neural network; Modal; Bridging (networking); Artificial intelligence; Translation (biology); Exploit; Natural language processing; Image translation; Domain (mathematical analysis); Set (abstract data type); Speech recognition; Image (mathematics); Pattern recognition (psychology); Programming language

Metrics

Cited By: 185
FWCI (Field Weighted Citation Impact): 19.71
Refs: 55
Citation Normalized Percentile: 0.99
Is in top 1%
Is in top 10%

Topics

Natural Language Processing Techniques
Physical Sciences → Computer Science → Artificial Intelligence
Topic Modeling
Physical Sciences → Computer Science → Artificial Intelligence
Multimodal Machine Learning Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition