JOURNAL ARTICLE

Develop corpora and methods for cross-lingual text reuse detection for English Urdu language pair at lexical, syntactical, and phrasal levels

Iqra Muneer, Rao Muhammad Adeel Nawab

Year: 2022; Journal: Language Resources and Evaluation; Vol: 56 (4); Pages: 1103-1130; Publisher: Springer Science+Business Media

Abstract

In recent years, Cross-Lingual Text Reuse Detection (CLTRD) has attracted the attention of the research community because large digital repositories and efficient Machine Translation systems are readily and freely available, which makes it easier to reuse text across languages and very difficult to detect such reuse. In previous studies, the problem of CLTRD for the English-Urdu language pair has been explored at the sentence/passage and document levels, and benchmark corpora and methods have been developed. However, there is a lack of benchmark corpora and methods for CLTRD for the English-Urdu language pair at the lexical, syntactical, and phrasal levels. To fill this research gap, this study presents three large benchmark corpora for detecting Cross-Lingual Text Reuse (CLTR) at three levels of rewrite: Wholly Derived (WD), Partially Derived (PD), and Non Derived (ND). The CLEU-Lex, CLEU-Syn, and CLEU-Phr corpora contain 66,485 (WD = 22,236, PD = 20,315, and ND = 23,934), 60,267 (WD = 20,007, PD = 16,979, and ND = 23,281), and 60,106 (WD = 23,862, PD = 15,878, and ND = 20,366) CLTR pairs, respectively. As a second major contribution, we have applied Cross-Lingual Word Embedding (CLWE), Cross-Lingual Semantic Tagger (CLST), and Cross-Lingual Sentence Transformer (CLSTR) based methods to our three proposed corpora for CLTRD. Our extensive experimentation showed that for the binary classification task, the best results on the CLEU-Lex corpus were obtained using the cross-lingual sentence transformer ($F_1$ = 0.80). For the CLEU-Syn and CLEU-Phr corpora, the best results were obtained using the cross-lingual sentence transformer and a combination of the CLWE, CLST, and CLSTR methods ($F_1$ = 0.92 on CLEU-Syn and $F_1$ = 0.94 on CLEU-Phr). For the ternary classification task, the best results on the CLEU-Lex corpus were obtained using the cross-lingual sentence transformer method ($F_1$ = 0.69). For the CLEU-Syn corpus, the best results were obtained using a combination of the CLWE, CLST, and CLSTR methods ($F_1$ = 0.82). For the CLEU-Phr corpus, the best results were obtained using the cross-lingual sentence transformer and a combination of the CLWE, CLST, and CLSTR methods ($F_1$ = 0.78). To foster and promote research in Urdu (a low-resource language), all three proposed corpora are freely and publicly available for research purposes.
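The per-class counts reported in the abstract can be checked against the stated corpus sizes with a little arithmetic. The following is a minimal sketch and not part of the paper: the dictionary layout and the helper name `class_shares` are illustrative assumptions, and only the numbers are taken from the abstract.

```python
# Class counts (WD, PD, ND) and totals exactly as reported in the abstract.
# The data layout and helper below are illustrative, not the authors' code.
corpora = {
    "CLEU-Lex": {"WD": 22236, "PD": 20315, "ND": 23934, "total": 66485},
    "CLEU-Syn": {"WD": 20007, "PD": 16979, "ND": 23281, "total": 60267},
    "CLEU-Phr": {"WD": 23862, "PD": 15878, "ND": 20366, "total": 60106},
}


def class_shares(counts):
    """Return each class's share of the corpus, ignoring the 'total' key."""
    n = sum(v for k, v in counts.items() if k != "total")
    return {k: v / n for k, v in counts.items() if k != "total"}


for name, counts in corpora.items():
    computed = sum(v for k, v in counts.items() if k != "total")
    # The per-class counts should add up to the reported corpus size.
    assert computed == counts["total"], name
    shares = {k: round(s, 3) for k, s in class_shares(counts).items()}
    print(name, computed, shares)
```

The assertions pass because the WD, PD, and ND counts for each corpus sum to the reported totals; the printed shares give a quick view of how balanced the three classes are in each corpus.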

Keywords:
Computer science, Natural language processing, Artificial intelligence, Transformer, Sentence, Benchmark (surveying), Urdu, Machine translation, Linguistics

Metrics

Cited By: 5
FWCI (Field Weighted Citation Impact): 0.98
Refs: 77
Citation Normalized Percentile: 0.74

Topics

Topic Modeling: Physical Sciences → Computer Science → Artificial Intelligence
Natural Language Processing Techniques: Physical Sciences → Computer Science → Artificial Intelligence
Text and Document Classification Technologies: Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Cross-Lingual Text Reuse Detection at sentence level for English–Urdu language pair

Iqra Muneer, Rao Muhammad Adeel Nawab

Journal: Computer Speech & Language; Year: 2022; Vol: 75; Pages: 101381-101381

JOURNAL ARTICLE

Cross-lingual Text Reuse Detection at Document Level for English-Urdu Language Pair

Muhammad Sharjeel, Iqra Muneer, Sumaira Nosheen, Rao Muhammad Adeel Nawab, Paul Rayson

Journal: ACM Transactions on Asian and Low-Resource Language Information Processing; Year: 2023; Vol: 22 (6); Pages: 1-22

JOURNAL ARTICLE

Cross-lingual Text Reuse Detection Using Translation Plus Monolingual Analysis for English-Urdu Language Pair

Iqra Muneer, Rao Muhammad Adeel Nawab

Journal: ACM Transactions on Asian and Low-Resource Language Information Processing; Year: 2021; Vol: 21 (2); Pages: 1-18

JOURNAL ARTICLE

Mono-lingual text reuse detection for the Urdu language at lexical level

Ayesha Noreen, Iqra Muneer, Rao Muhammad Adeel Nawab

Journal: Engineering Applications of Artificial Intelligence; Year: 2024; Vol: 136; Pages: 109003-109003

JOURNAL ARTICLE

Sentential Cross-lingual Paraphrase Detection for English-Urdu Language Pair

Iqra Muneer, Nida Waheed, Adnan Ashraf, Rao Muhammad Adeel Nawab

Journal: The European Journal on Artificial Intelligence; Year: 2025; Vol: 38 (3); Pages: 309-329