In recent years, interest in multilingual modeling has intensified, driven by the need to support cross-lingual Text-based Visual Question Answering (TextVQA), which requires understanding questions and answers across diverse languages. Existing research predominantly focuses on fusing multimodal information and processing OCR data. This paper presents an empirical study of multilingual scene-text visual question answering, covering both cross-lingual (English <-> Chinese) and monolingual (English <-> English and Chinese <-> Chinese) settings, evaluated primarily with accuracy-based metrics. Our study examines how different OCR feature extractors and visual feature extractors affect a selection of state-of-the-art models, and also surveys the broader landscape of multilingual TextVQA. The experimental results show that multilingual pretrained models can effectively handle text-based questions, and they highlight the importance of leveraging visual features from OCR data and images to improve answering performance.
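To make the evaluation setup concrete, the sketch below illustrates one simple baseline in the spirit of the study (not the paper's actual method): a multilingual pretrained encoder scores OCR tokens against the question, the best-matching token is copied as the answer, and predictions are scored with exact-match accuracy, one common accuracy-based metric for TextVQA. The model name, pooling scheme, and helper functions are assumptions for illustration.

    # Illustrative sketch only: multilingual OCR-copy baseline + exact-match accuracy.
    import torch
    from transformers import AutoTokenizer, AutoModel

    # Assumed multilingual encoder; any multilingual pretrained model would do.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

    def embed(texts):
        """Mean-pooled hidden states as a crude sentence embedding."""
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**batch).last_hidden_state   # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
        return (hidden * mask).sum(1) / mask.sum(1)       # (B, H)

    def answer_by_ocr_copy(question, ocr_tokens):
        """Copy the OCR token whose embedding is closest to the question's."""
        q = embed([question])                             # (1, H)
        t = embed(ocr_tokens)                             # (N, H)
        sims = torch.nn.functional.cosine_similarity(q, t)  # (N,)
        return ocr_tokens[int(sims.argmax())]

    def exact_match_accuracy(predictions, references):
        """Fraction of predictions that exactly match the reference answers."""
        hits = sum(p.strip().lower() == r.strip().lower()
                   for p, r in zip(predictions, references))
        return hits / len(references)

    # Toy cross-lingual example: English question, Chinese scene text.
    pred = answer_by_ocr_copy("What is the shop's name?", ["出口", "书店", "2023"])
    print(pred, exact_match_accuracy([pred], ["书店"]))

Because the encoder maps English questions and Chinese scene text into a shared embedding space, the same scoring code serves both the cross-lingual and monolingual settings described above.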