JOURNAL ARTICLE

An Empirical Study of Multilingual Scene-Text Visual Question Answering

Abstract

In recent years, interest in multilingual modeling has intensified, driven by the need for cross-lingual Text-based Visual Question Answering (TextVQA), which requires understanding questions and answers across diverse languages. Existing research predominantly focuses on fusing multimodal information and processing OCR data. This paper presents an empirical study of multilingual scene-text visual question answering, covering both cross-lingual (English ↔ Chinese) and monolingual (English → English and Chinese → Chinese) settings, with a primary emphasis on accuracy-based metrics. Our study examines how different OCR feature extractors and visual feature extractors affect a selection of state-of-the-art models, and it also surveys the broader landscape of multilingual TextVQA. The experimental results show that multilingual pretrained models can handle text-based questions effectively, and they highlight the importance of leveraging visual features from both OCR data and images to improve answering performance.
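The abstract's point about combining visual features from OCR regions with the text itself can be illustrated with a minimal sketch. This is not the authors' model: the dimensions, random projections, and additive fusion below are illustrative assumptions standing in for the learned layers of a typical TextVQA architecture, where each OCR token is represented by both its word embedding and the appearance feature of its bounding-box region.

```python
import numpy as np

# Hypothetical dimensions; real TextVQA models learn these
# projections end to end rather than sampling them randomly.
TEXT_DIM, VISUAL_DIM, FUSED_DIM = 300, 2048, 768

rng = np.random.default_rng(0)

# Random matrices standing in for learned linear projections.
W_text = rng.standard_normal((TEXT_DIM, FUSED_DIM)) * 0.01
W_visual = rng.standard_normal((VISUAL_DIM, FUSED_DIM)) * 0.01

def fuse_ocr_token(text_emb: np.ndarray, visual_feat: np.ndarray) -> np.ndarray:
    """Project an OCR token's text embedding and the visual feature
    of its detected image region into a shared space and sum them,
    a common fusion scheme in TextVQA-style models."""
    return text_emb @ W_text + visual_feat @ W_visual

# One OCR token: a word embedding plus the appearance feature of
# the image region in which the word was detected.
token_text = rng.standard_normal(TEXT_DIM)
token_visual = rng.standard_normal(VISUAL_DIM)

fused = fuse_ocr_token(token_text, token_visual)
print(fused.shape)  # (768,)
```

The additive fusion keeps the token representation the same size regardless of how many modalities contribute, which is one reason this pattern is common; concatenation followed by a projection is an equally standard alternative.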

Keywords:
Computer science, Question answering, Natural language processing, Artificial intelligence, Empirical research, Information retrieval, Linguistics

Metrics

Cited By: 1
FWCI (Field-Weighted Citation Impact): 0.18
References: 21
Citation Normalized Percentile: 0.44

Topics

Multimodal Machine Learning Applications
Advanced Image and Video Retrieval Techniques
Human Pose and Action Recognition
(all under Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)