JOURNAL ARTICLE

Delving Deeper into Cross-lingual Visual Question Answering

Abstract

Visual question answering (VQA) is one of the crucial vision-and-language tasks. Yet existing VQA research has mostly focused on English, owing to a lack of suitable evaluation resources for other languages. Previous work on cross-lingual VQA has reported poor zero-shot transfer performance for current multilingual multimodal Transformers, with large gaps to monolingual performance, but has not analyzed these gaps in depth. In this work, we delve deeper into the different aspects of cross-lingual VQA, aiming to understand the impact of 1) modeling methods and choices, including architecture, inductive bias, and fine-tuning; and 2) learning biases, including question types and modality biases in cross-lingual setups. The key results of our analysis are: 1) We show that simple modifications to the standard training setup can substantially reduce the transfer gap to monolingual English performance, yielding +10 accuracy points over existing methods. 2) We analyze cross-lingual VQA across question types of varying complexity for different multilingual multimodal Transformers, and identify the question types that are the most difficult to improve on. 3) We provide an analysis of modality biases present in training data and models, revealing why zero-shot performance gaps remain for certain question types and languages.

Keywords:
Computer science, Question answering, Transformer, Natural language processing, Artificial intelligence, Transfer of learning, Modality (human–computer interaction), Architecture, Machine learning

Metrics

Cited By: 10
FWCI (Field Weighted Citation Impact): 1.46
Refs: 1
Citation Normalized Percentile: 0.79

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

xGQA: Cross-Lingual Visual Question Answering

Jonas Pfeiffer, Gregor Geigle, Aishwarya Kamath, Jan-Martin O. Steitz, Stefan Roth, Ivan Vulić, Iryna Gurevych

Journal: Findings of the Association for Computational Linguistics: ACL 2022 Year: 2022 Pages: 2497-2511

DISSERTATION

Cross-lingual question answering

Bogdan Sacaleanu

University: SciDok (Saarland University and State Library) Year: 2012

JOURNAL ARTICLE

Improving the Cross-Lingual Generalisation in Visual Question Answering

Farhad Nooralahzadeh, Rico Sennrich

Journal: Proceedings of the AAAI Conference on Artificial Intelligence Year: 2023 Vol: 37 (11) Pages: 13419-13427

JOURNAL ARTICLE

Hindi-English cross-lingual question-answering system

Satoshi Sekine, Ralph Grishman

Journal: ACM Transactions on Asian Language Information Processing Year: 2003 Vol: 2 (3) Pages: 181-192