Iqra Muneer, Ali Saeed, Rao Muhammad Adeel Nawab
Semantic word similarity is a quantitative measure of how similar two terms are in meaning, and computing it remains a considerable challenge for computational linguistics. The research community has examined a range of approaches to address this problem; however, most of them target a comparatively small set of languages, especially English. Research on semantic word similarity for South Asian languages, particularly Urdu, is still immature. In recent years, transformer-based approaches have proved highly successful across a range of language processing tasks. The primary aim of this study is to develop and compare a variety of transformer-based approaches to the cross-lingual English–Urdu semantic word similarity task. The study used the publicly available benchmark USWS-19 corpus, which comprises 518 word pairs. Four types of transformer-based approaches were explored: (a) cross-lingual sentence transformer-based approaches using the original dataset, (b) cross-lingual sentence transformer-based approaches using the translated dataset (the translation plus monolingual analysis [T+MA] approach), (c) a feature fusion approach (mixture of features), and (d) large language models. In addition, the study explores a word embedding-based approach using the translated dataset (T+MA approach). In total, 29 transformer-based models were developed, with the best result (Pearson correlation = 0.788) achieved by a feature fusion approach, namely Best-Two-SBERT (where SBERT stands for sentence-bidirectional encoder representations from transformers; using T+MA) + Best Baseline (with Bing translator) + Best cross-lingual SBERT. This approach improved on previously reported results for the same corpus by 7%.
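The evaluation pipeline the abstract describes can be sketched in miniature: embed each English–Urdu word pair (here with hypothetical toy vectors standing in for the output of a cross-lingual SBERT model, not the paper's actual models), score each pair by cosine similarity, and report the Pearson correlation against human-annotated gold scores. The vectors and gold scores below are illustrative assumptions, not data from the USWS-19 corpus.

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson_of_pairs(emb_en, emb_ur, gold_scores):
    # Predict a similarity score for each cross-lingual word pair,
    # then correlate the predictions with the gold annotations.
    preds = [cosine_similarity(e, u) for e, u in zip(emb_en, emb_ur)]
    return float(np.corrcoef(preds, gold_scores)[0, 1])

# Toy 2-D embeddings standing in for cross-lingual sentence-transformer output.
emb_en = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
emb_ur = np.array([[1.0, 0.0], [0.6, 0.8], [-0.8, 0.6]])
# Gold scores chosen here to vary linearly with the cosine predictions,
# so this toy example yields a perfect correlation of 1.0.
gold = np.array([1.0, 0.8, 0.1])

print(round(pearson_of_pairs(emb_en, emb_ur, gold), 3))
```

In the study itself, predictions would come from real model embeddings (and, in the T+MA setting, from monolingual embeddings of machine-translated pairs), so the correlation reflects how well the model's similarity scores track human judgments.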