Iqra Muneer, Nida Waheed, Adnan Ashraf, Rao Muhammad Adeel Nawab
Due to vast digital data collections and paraphrasing tools, researchers have shown growing interest in Cross-lingual Paraphrase Detection (CLPD). Open-access data and tools make paraphrasing easier and detection more challenging. Translation tools further exacerbate the issue by enabling effortless text translation across languages, leading to increased cross-lingual paraphrasing. Most existing CLPD studies focus on European languages, particularly English, while the English-Urdu language pair remains underexplored due to limited standard approaches and benchmark corpora. This study addresses this gap by developing the CLPD Corpus for English-Urdu (CLPD-EU), a gold-standard benchmark corpus at the sentence level. The corpus includes 5,801 sentence pairs, comprising 3,900 paraphrased and 1,901 non-paraphrased instances. Additionally, the study implements classical machine learning methods based on bilingual dictionaries, cross-lingual word embeddings, and transfer learning using sentence transformers. The research further incorporates state-of-the-art Large Language Models (LLMs) such as Mistral and LLaMA, significantly improving detection accuracy. Our proposed Feature Fusion Approach, ‘Comb-ST+BD,’ demonstrates strong performance with an F1 score of 0.739 for the CLPD task. The CLPD-EU corpus will be publicly available to encourage further research in CLPD, especially for under-resourced languages like Urdu.
Iqra Muneer, Rao Muhammad Adeel Nawab
Muhammad Sharjeel, Iqra Muneer, Sumaira Nosheen, Rao Muhammad Adeel Nawab, Paul Rayson
Iqra Muneer, Rao Muhammad Adeel Nawab
Ghazeefa Fatima, Rao Muhammad Adeel Nawab, Muhammad Salman Khan, Ali Saeed
Hafiz Rizwan Iqbal, Muhammad Sharjeel, Jawad Shafi, Usama Mehmood, Agha Ali Raza