JOURNAL ARTICLE

Developing a Large Benchmark Corpus for Urdu Semantic Word Similarity

Iqra Muneer, Ghazeefa Fatima, Muhammad Salman Khan, Rao Muhammad Adeel Nawab, Ali Saeed

Year: 2022 Journal: ACM Transactions on Asian and Low-Resource Language Information Processing Vol: 22 (3) Pages: 1-19 Publisher: Association for Computing Machinery

Abstract

The semantic word similarity task aims to quantify the degree of similarity between a pair of words. In the literature, efforts have been made to create standard evaluation resources to develop, evaluate, and compare various methods for semantic word similarity. The majority of these efforts have focused on English and some other languages. However, the problem of semantic word similarity has not been thoroughly explored for South Asian languages, particularly Urdu. To fill this gap, this study presents a large benchmark corpus of 518 word pairs for the Urdu semantic word similarity task, manually annotated by 12 annotators. To demonstrate how the proposed corpus can be used for the development and evaluation of Urdu semantic word similarity systems, we applied two state-of-the-art methods: (1) a word embedding-based method and (2) a Sentence Transformer-based method. As another major contribution, we proposed a feature fusion method that combines Sentence Transformer and word embedding methods. The best results were obtained using the proposed feature fusion method (the combination of the best features of both methods), with a Pearson correlation score of 0.67. To foster research in Urdu (an under-resourced language), the proposed corpus will be free and publicly available for research purposes.
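As a rough illustration of the evaluation setup the abstract describes, the sketch below scores word pairs by the cosine similarity of their vectors and then measures agreement with human ratings using the Pearson correlation. It is only a minimal sketch under stated assumptions: the vectors, word labels (`w1` to `w4`), and human scores are made-up toy values rather than data from the corpus, and the abstract does not say which embedding models, similarity function, or rating scale the authors used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy embeddings (assumption: NOT taken from the paper).
toy_vectors = {
    "w1": [1.0, 0.0, 0.0],
    "w2": [0.9, 0.1, 0.0],
    "w3": [0.0, 1.0, 0.0],
    "w4": [0.0, 0.0, 1.0],
}

# Hypothetical (word, word, human score) triples; the scores are invented.
toy_pairs = [
    ("w1", "w2", 3.8),
    ("w1", "w3", 0.5),
    ("w2", "w3", 1.0),
    ("w3", "w4", 0.2),
]

system_scores = [
    cosine_similarity(toy_vectors[a], toy_vectors[b]) for a, b, _ in toy_pairs
]
human_scores = [score for _, _, score in toy_pairs]

r = pearson(system_scores, human_scores)
print(f"Pearson r = {r:.3f}")
```

In a real experiment the toy vectors would be replaced by embeddings from a trained model and the toy scores by the averaged annotator ratings for each word pair.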

Keywords:
Computer science; Natural language processing; Artificial intelligence; Word embedding; Semantic similarity; Sentence; Word (group theory); Similarity (geometry); Transformer; Benchmark (surveying); Urdu; Feature (linguistics); Embedding; Linguistics

Metrics

Cited By: 8
FWCI (Field Weighted Citation Impact): 1.57
Refs: 53
Citation Normalized Percentile: 0.81

Topics

Topic Modeling: Physical Sciences → Computer Science → Artificial Intelligence
Natural Language Processing Techniques: Physical Sciences → Computer Science → Artificial Intelligence
Sentiment Analysis and Opinion Mining: Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Developing a Cross-lingual Semantic Word Similarity Corpus for English–Urdu Language Pair

Ghazeefa Fatima, Rao Muhammad Adeel Nawab, Muhammad Salman Khan, Ali Saeed

Journal: ACM Transactions on Asian and Low-Resource Language Information Processing Year: 2021 Vol: 21 (2) Pages: 1-16
JOURNAL ARTICLE

Cross-Lingual English–Urdu Semantic Word Similarity Using Sentence Transformers

Iqra Muneer, Ali Saeed, Rao Muhammad Adeel Nawab

Journal: The European Journal on Artificial Intelligence Year: 2025 Vol: 38 (1) Pages: 21-34
JOURNAL ARTICLE

Semantic text similarity using corpus-based word similarity and string similarity

Aminul Islam, Diana Inkpen

Journal: ACM Transactions on Knowledge Discovery from Data Year: 2008 Vol: 2 (2) Pages: 1-25
JOURNAL ARTICLE

Using TREC for developing semantic information retrieval benchmark for Urdu

Saba Shaukat, Asma Shaukat, Khurram Shahzad, Ali Daud

Journal: Information Processing & Management Year: 2022 Vol: 59 (3) Pages: 102939-102939