JOURNAL ARTICLE

Cross-lingual Text Reuse Detection Using Translation Plus Monolingual Analysis for English-Urdu Language Pair

Iqra Muneer, Rao Muhammad Adeel Nawab

Year: 2021 Journal: ACM Transactions on Asian and Low-Resource Language Information Processing Vol: 21 (2) Pages: 1-18 Publisher: Association for Computing Machinery

Abstract

Cross-Lingual Text Reuse Detection (CLTRD) has recently attracted the attention of the research community because large amounts of digital text are readily available for reuse in multiple languages through online digital repositories. In addition, efficient machine translation systems are freely and readily available to translate text from one language into another, which makes it easy to reuse text across languages and, consequently, difficult to detect such reuse. In the literature, the most prominent and widely used approach for CLTRD is Translation plus Monolingual Analysis (T+MA). To detect cross-lingual text reuse for the English-Urdu language pair, T+MA has been used with lexical approaches, namely N-gram Overlap, Longest Common Subsequence, and Greedy String Tiling. This shows that T+MA has not been thoroughly explored for the English-Urdu language pair. To fill this gap, this study presents an in-depth comparison of 26 approaches based on T+MA. These approaches include semantic similarity approaches (semantic tagger-based and WordNet-based approaches), a probabilistic approach (the Kullback-Leibler distance approach), monolingual word embedding-based approaches, a Siamese recurrent architecture, and monolingual sentence transformer-based approaches for the English-Urdu language pair. The evaluation was carried out using the CLEU benchmark corpus for both the binary and the ternary classification tasks. Our extensive experimentation shows that our proposed approach, a combination of the 26 approaches, obtained F1 scores of 0.77 and 0.61 for the binary and ternary classification tasks, respectively, and outperformed the previously reported approaches [41] (F1 = 0.73 for the binary and F1 = 0.55 for the ternary classification task) on the CLEU corpus.
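
As a rough illustration of the monolingual-analysis half of T+MA, the sketch below scores two English token sequences with two of the lexical measures named in the abstract, N-gram Overlap and Longest Common Subsequence. It is not the authors' implementation. It assumes the Urdu text has already been machine-translated into English and split on whitespace, and the function names, the containment-style normalisation, and the example sentences are all hypothetical choices made for this sketch.

```python
from typing import List, Set, Tuple


def word_ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(source: List[str], derived: List[str], n: int = 1) -> float:
    """Share of the derived text's distinct n-grams that also occur in the source.

    Containment is one common normalisation (an assumption of this sketch).
    """
    src = word_ngrams(source, n)
    der = word_ngrams(derived, n)
    if not der:
        return 0.0
    return len(src & der) / len(der)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: List[str], b: List[str]) -> float:
    """LCS length divided by the length of the shorter sequence (assumed normalisation)."""
    shorter = min(len(a), len(b))
    return lcs_length(a, b) / shorter if shorter else 0.0


# Hypothetical example: `derived` stands in for an Urdu sentence after translation.
source = "the cat sat on the mat".split()
derived = "the cat lay on the mat".split()

unigram_score = ngram_containment(source, derived, n=1)
bigram_score = ngram_containment(source, derived, n=2)
lcs_score = lcs_similarity(source, derived)
```

In a setup like the one the abstract describes, scores such as these would typically serve as features for a classifier that labels a text pair for the binary or ternary task; that step is outside the scope of this sketch.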

Keywords:
Computer science, Natural language processing, Artificial intelligence, Urdu, Machine translation, WordNet, Reuse, Linguistics

Metrics

Cited By: 11
FWCI (Field Weighted Citation Impact): 1.13
Refs: 62
Citation Normalized Percentile: 0.82

Topics

Topic Modeling
Physical Sciences →  Computer Science →  Artificial Intelligence
Natural Language Processing Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Text and Document Classification Technologies
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Cross-Lingual Text Reuse Detection at sentence level for English–Urdu language pair

Iqra Muneer, Rao Muhammad Adeel Nawab

Journal: Computer Speech & Language Year: 2022 Vol: 75 Pages: 101381-101381

JOURNAL ARTICLE

Cross-lingual Text Reuse Detection at Document Level for English-Urdu Language Pair

Muhammad Sharjeel, Iqra Muneer, Sumaira Nosheen, Rao Muhammad Adeel Nawab, Paul Rayson

Journal: ACM Transactions on Asian and Low-Resource Language Information Processing Year: 2023 Vol: 22 (6) Pages: 1-22

JOURNAL ARTICLE

Sentential Cross-lingual Paraphrase Detection for English-Urdu Language Pair

Iqra Muneer, Nida Waheed, Adnan Ashraf, Rao Muhammad Adeel Nawab

Journal: The European Journal on Artificial Intelligence Year: 2025 Vol: 38 (3) Pages: 309-329

JOURNAL ARTICLE

Mono-lingual text reuse detection for the Urdu language at lexical level

Ayesha Noreen, Iqra Muneer, Rao Muhammad Adeel Nawab

Journal: Engineering Applications of Artificial Intelligence Year: 2024 Vol: 136 Pages: 109003-109003