Muhammad Shakeel, Rao Muhammad Adeel Nawab
Open-domain factoid question-answering (ODFQA) aims to answer questions posed in natural language by retrieving and extracting relevant information from large, unstructured text sources. A range of applications have benefited from ODFQA, including improved search relevance in search engines and information retrieval, semantic search, interactive and personalized learning systems, and the creation of large language models. Although significant research on ODFQA has been done for English and other languages, Urdu-specific research in this area remains limited. This is due to the lack of high-quality datasets and the challenges associated with processing Urdu text. To address the unavailability of Urdu-specific resources for the ODFQA task, we developed a benchmark corpus comprising $3,985$ Urdu questions and corresponding Urdu Wikipedia articles, with $1,006$ answerable and $2,979$ unanswerable questions. Each question in our proposed corpus was manually annotated by three independent annotators. As a secondary contribution, we carried out extensive experiments on our proposed corpus using a range of state-of-the-art models, including retrievers (BM25 and Sentence-BERT), multilingual transformers (mBERT, XLM-RoBERTa-Large, XLM-RoBERTa-Large-Squad2), and large language models (GPT-3.5-Turbo-0125, GPT-4o-mini-2024-07-18). Among the multilingual transformers, the best results were obtained with the XLM-RoBERTa-Large-Squad2 model, with $F_{1} = 0.61$ and $EM = 0.41$ at $k = 20$. Overall, the fine-tuned GPT-4o-mini model performed best, with $F_{1} = 0.81$ and $EM = 0.81$ at $k = 1$. To foster research on Urdu ODFQA, our proposed corpus is freely available under a Creative Commons license.
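The $F_{1}$ and $EM$ scores reported above are standard extractive question-answering metrics. The following is a minimal sketch of how they are commonly computed in the SQuAD style, not the authors' evaluation script. It assumes whitespace tokenization and that an unanswerable question is represented by an empty answer string; the function names `normalize`, `exact_match`, and `token_f1` are hypothetical.

```python
from collections import Counter


def normalize(text):
    # Assumption: whitespace tokenization; no case folding or
    # punctuation stripping is applied in this sketch.
    return text.strip().split()


def exact_match(prediction, gold):
    # 1.0 when the token sequences are identical, else 0.0.
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    pred, ref = normalize(prediction), normalize(gold)
    # Assumption: an empty string marks an unanswerable question,
    # so two empty answers count as a perfect match.
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts the tokens shared by both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Corpus-level scores are then the averages of these per-question values. When several retrieved passages are considered (the $k$ in the reported results), one common convention is to score the reader's single best answer over the top $k$ passages, though the exact protocol used here is not stated in the abstract.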
Maria Khvalchik, Anagha Kulkarni
Ahmad Aghaebrahimian, Filip Jurčíček
Noha S. Fareed, Hamdy M. Mousa, Ashraf B. El-Sisi
Gayle McElvain, George I. Sánchez, Don Teo, Tonya Custis