Open-Domain Factoid Question-Answering in Urdu: Data and Methods

Muhammad Shakeel; Rao Muhammad Adeel Nawab

doi:10.1109/access.2025.3540939

JOURNAL ARTICLE

Year: 2025; Journal: IEEE Access; Vol: 13; Pages: 30167-30185; Publisher: Institute of Electrical and Electronics Engineers

Abstract

Open-domain factoid question-answering (ODFQA) aims to answer questions posed in natural language by retrieving and extracting relevant information from large, unstructured text sources. A range of applications have benefited from ODFQA, including improved relevance in search engines and information retrieval, semantic search, interactive and personalized learning systems, and the creation of large language models. Although significant research on ODFQA has been done for English and other languages, Urdu-specific research in this area remains limited. This is due to the lack of high-quality datasets and the challenges associated with processing Urdu text. To address the unavailability of Urdu-specific resources for the ODFQA task, we developed a benchmark corpus comprising 3,985 Urdu questions and corresponding Urdu Wikipedia articles, of which 1,006 questions are answerable and 2,979 are unanswerable. Each question in our proposed corpus was manually annotated by three independent annotators. As a secondary contribution, we carried out extensive experimentation on our proposed corpus using a range of state-of-the-art models, including retrievers (BM25 and Sentence-BERT), multilingual transformers (mBERT, XLM-RoBERTa-Large, XLM-RoBERTa-Large-Squad2), and large language models (GPT-3.5-Turbo-0125, GPT-4o-mini-2024-07-18). Among the multilingual transformers, the best results were obtained with the XLM-RoBERTa-Large-Squad2 model, with $F_{1} = 0.61$ and $EM = 0.41$ at $k = 20$, while the fine-tuned GPT-4o-mini model was the best model overall, with $F_{1} = 0.81$ and $EM = 0.81$ at $k = 1$. To foster research on Urdu ODFQA, our proposed corpus is freely available under the Creative Commons license.
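The abstract reports results as $F_{1}$ and exact match (EM). As a rough illustration of how these two scores are commonly computed for extractive question answering, here is a minimal sketch in the style of SQuAD evaluation. It is an assumption, not the authors' evaluation code: the function names are hypothetical, tokens are split on whitespace only, and no Urdu-specific normalization is applied. It also does not model the retrieval depth $k$ mentioned in the abstract.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> int:
    """Return 1 if both strings have identical whitespace tokens, else 0."""
    return int(prediction.split() == gold.split())


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 over whitespace tokens (SQuAD-style overlap)."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    # Assumption: an unanswerable question has an empty gold answer,
    # which is matched only by an empty prediction.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def corpus_scores(pairs: list[tuple[str, str]]) -> tuple[float, float]:
    """Average EM and F1 over (prediction, gold) pairs."""
    if not pairs:
        return 0.0, 0.0
    em = sum(exact_match(p, g) for p, g in pairs) / len(pairs)
    f1 = sum(token_f1(p, g) for p, g in pairs) / len(pairs)
    return em, f1


if __name__ == "__main__":
    # Illustrative pairs only; these are not taken from the corpus.
    examples = [("لاہور", "لاہور"), ("علامہ اقبال", "اقبال"), ("", "")]
    print(corpus_scores(examples))
```

EM rewards only a fully correct answer span, while $F_{1}$ gives partial credit for overlapping tokens, which is why the two scores can diverge for the same model.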

Keywords:

Computer science; Question answering; Urdu; Natural language processing; Open domain; Artificial intelligence; Information retrieval; Linguistics

Metrics

FWCI (Field Weighted Citation Impact): 9.64

Citation Normalized Percentile: 0.96

Topics

Topic Modeling

Physical Sciences → Computer Science → Artificial Intelligence

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Speech and dialogue systems

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Open-Domain Non-factoid Question Answering

Open-domain Factoid Question Answering via Knowledge Graph Search

Syntactic open domain Arabic question/answering system for factoid questions

Joint question clustering and relevance prediction for open domain non-factoid question answering

Non-factoid Question Answering in the Legal Domain