JOURNAL ARTICLE

Open-Domain Factoid Question-Answering in Urdu: Data and Methods

Muhammad Shakeel, Rao Muhammad Adeel Nawab

Year: 2025, Journal: IEEE Access, Vol: 13, Pages: 30167-30185, Publisher: Institute of Electrical and Electronics Engineers

Abstract

Open-domain factoid question-answering (ODFQA) aims to answer questions posed in natural language by retrieving and extracting relevant information from large, unstructured text sources. A range of applications have benefited from ODFQA, including improved search relevance in search engines and information retrieval, semantic search, interactive and personalized learning systems, and the creation of large language models. Although significant research on ODFQA has been done for English and other languages, Urdu-specific research in this area remains limited. This is due to the lack of high-quality datasets and the challenges associated with processing Urdu text. To address the unavailability of Urdu-specific resources for the ODFQA task, we developed a benchmark corpus comprising 3,985 Urdu questions and corresponding Urdu Wikipedia articles, with 1,006 answerable and 2,979 unanswerable questions. Each question in our proposed corpus was manually annotated by three independent annotators. As a secondary contribution, we carried out extensive experimentation on our proposed corpus using a range of state-of-the-art models, including retrievers (BM25 and Sentence-BERT), multilingual transformers (mBERT, XLM-RoBERTa-Large, XLM-RoBERTa-Large-Squad2), and large language models (GPT-3.5-Turbo-0125, GPT-4o-mini-2024-07-18). Among the multilingual transformers, the best results were obtained using the XLM-RoBERTa-Large-Squad2 model, with $F_{1} = 0.61$ and $EM = 0.41$ at $k = 20$, while the fine-tuned GPT-4o-mini model was the best model overall, with $F_{1} = 0.81$ and $EM = 0.81$ at $k = 1$. To foster research on Urdu ODFQA, our proposed corpus is freely available under the Creative Commons license.
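
The $F_{1}$ and $EM$ scores reported above are commonly computed per question as token-overlap F1 and exact match between the predicted and gold answer strings. The following is a minimal sketch under that assumption; the abstract does not state the exact normalization or tokenization used, so the whitespace tokenization and the function names `exact_match` and `token_f1` here are illustrative only. An unanswerable question is represented as an empty gold string.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the whitespace-tokenized strings are identical, else 0.0."""
    return float(prediction.split() == gold.split())


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    # Unanswerable case (assumption): score 1.0 only if both sides are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Corpus-level scores would then be the mean of these per-question values over all questions, with the prediction taken from the top $k$ retrieved passages.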

Keywords:
Computer science, Question answering, Urdu, Natural language processing, Open domain, Artificial intelligence, Information retrieval, Linguistics

Metrics

Cited By: 2
FWCI (Field Weighted Citation Impact): 9.64
Refs: 51
Citation Normalized Percentile: 0.96

Topics

Topic Modeling (Physical Sciences → Computer Science → Artificial Intelligence)
Natural Language Processing Techniques (Physical Sciences → Computer Science → Artificial Intelligence)
Speech and dialogue systems (Physical Sciences → Computer Science → Artificial Intelligence)