Developing a Cross-lingual Semantic Word Similarity Corpus for English–Urdu Language Pair

Ghazeefa Fatima; Rao Muhammad Adeel Nawab; Muhammad Salman Khan; Ali Saeed

doi:10.1145/3472618

JOURNAL ARTICLE

Year: 2021; Journal: ACM Transactions on Asian and Low-Resource Language Information Processing; Vol: 21 (2); Pages: 1-16; Publisher: Association for Computing Machinery

Abstract

Semantic word similarity is a quantitative measure of how similar two words are in context. Evaluating semantic word similarity models requires a benchmark corpus. However, despite the millions of Urdu speakers and the large volume of digital Urdu text on the Internet, there is no benchmark corpus for the Cross-lingual Semantic Word Similarity task for the Urdu language. This article reports our efforts in developing such a corpus. The newly developed corpus is based on the SemEval-2017 Task 2 English dataset, and it contains 1,945 cross-lingual English–Urdu word pairs. For each of these word pairs, semantic similarity scores were assigned by 11 native Urdu speakers. In addition to corpus generation, this article also reports the evaluation results of a baseline approach, namely “Translation Plus Monolingual Analysis”, for automated identification of semantic similarity between English–Urdu word pairs. The results showed that the path length similarity measure performs better for the Google- and Bing-translated words. The newly created corpus and evaluation results are freely available online for further research and development.
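To make the baseline idea concrete, the following is a minimal sketch of a "translate, then measure monolingually" pipeline using a path length similarity. It is not the authors' implementation. The toy taxonomy, the lookup table standing in for a machine translation service, the choice to translate the Urdu word into English, and all function names are assumptions introduced for illustration. The similarity is computed as 1 / (1 + shortest path length), the common definition of path similarity over a taxonomy.

```python
from collections import deque

# Hypothetical toy is-a taxonomy (treated as undirected edges).
# A real system would use a lexical resource such as WordNet.
TAXONOMY_EDGES = [
    ("entity", "animal"),
    ("entity", "vehicle"),
    ("animal", "dog"),
    ("animal", "cat"),
    ("vehicle", "car"),
]

# Hypothetical Urdu-to-English lookup standing in for a translation service.
TOY_TRANSLATIONS = {
    "کتا": "dog",
    "بلی": "cat",
    "گاڑی": "car",
}


def build_graph(edges):
    """Build an undirected adjacency map from a list of edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


GRAPH = build_graph(TAXONOMY_EDGES)


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the edge count or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_similarity(graph, word_a, word_b):
    """Path length similarity: 1 / (1 + shortest path length)."""
    distance = shortest_path_length(graph, word_a, word_b)
    return None if distance is None else 1.0 / (1.0 + distance)


def cross_lingual_similarity(english_word, urdu_word,
                             graph=GRAPH, translations=TOY_TRANSLATIONS):
    """Translate the Urdu word, then score the pair monolingually."""
    translated = translations.get(urdu_word)
    if translated is None:
        return None  # no translation available for this word
    return path_similarity(graph, english_word, translated)
```

With this toy data, `cross_lingual_similarity("dog", "بلی")` translates the Urdu word to "cat", finds a two-edge path (dog, animal, cat), and returns 1/3. Scores produced this way could then be compared against the human-assigned scores in a corpus such as the one described above.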

Keywords: Computer science; Natural language processing; Artificial intelligence; Urdu; SemEval; Semantic similarity; Similarity (geometry); Word (group theory); Task (project management); Benchmark (surveying); Linguistics

Metrics

FWCI (Field Weighted Citation Impact): 0.42

Citation Normalized Percentile: 0.70

Topics

Topic Modeling

Physical Sciences → Computer Science → Artificial Intelligence

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Advanced Text Analysis Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Cross-Lingual English–Urdu Semantic Word Similarity Using Sentence Transformers

Developing a Large Benchmark Corpus for Urdu Semantic Word Similarity

Cross-Lingual Short-Text Semantic Similarity for Kannada–English Language Pair

Design and Development of a Large Cross-Lingual Plagiarism Corpus for Urdu-English Language Pair

Sentential Cross-lingual Paraphrase Detection for English-Urdu Language Pair