JOURNAL ARTICLE

Cross-lingual Text Reuse Detection at Document Level for English-Urdu Language Pair

Muhammad Sharjeel, Iqra Muneer, Sumaira Nosheen, Rao Muhammad Adeel Nawab, Paul Rayson

Year: 2023 Journal: ACM Transactions on Asian and Low-Resource Language Information Processing Vol: 22 (6) Pages: 1-22 Publisher: Association for Computing Machinery

Abstract

In recent years, Cross-Lingual Text Reuse Detection (CLTRD) has attracted the interest of the research community due to the availability of large digital repositories and automatic Machine Translation (MT) systems. Because these systems are readily available and openly accessible, reusing text across languages has become easier, while detecting such reuse remains hard. Previous studies have developed corpora and methods for CLTRD at the sentence/passage level for the English-Urdu language pair. However, large standard corpora and methods for CLTRD at the document level are lacking for this language pair. To overcome this limitation, the main contribution of this study is a large benchmark cross-lingual (English-Urdu) text reuse corpus, called the TREU (Text Reuse for English-Urdu) corpus. It contains real cases of English-to-Urdu text reuse at the document level. The corpus is manually labelled into three categories (Wholly Derived = 672, Partially Derived = 888, and Non Derived = 697), with the source text in English and the derived text in Urdu. A second contribution is the evaluation of the TREU corpus using a diverse range of methods, showing its usefulness and how it can be used to develop automatic methods for measuring cross-lingual (English-Urdu) text reuse at the document level. The best evaluation results, for both the binary (F1 = 0.78) and ternary (F1 = 0.66) classification tasks, are obtained using a combination of all Translation plus Mono-lingual Analysis (T+MA) based methods. The TREU corpus is publicly available to promote CLTRD research in an under-resourced language, i.e., Urdu.
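
As a rough illustration of the Translation plus Mono-lingual Analysis (T+MA) idea named in the abstract, the sketch below translates the Urdu document into English and then scores it against the English source with a monolingual measure. It is a minimal sketch under stated assumptions, not the authors' method: the translation direction, the word n-gram containment measure, the two thresholds, and the names `word_ngrams`, `containment`, `classify_reuse`, and `toy_translate` are all hypothetical, and the toy translator is a lookup table standing in for a real MT system.

```python
from typing import Callable, Set, Tuple


def word_ngrams(text: str, n: int = 1) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in a whitespace-tokenised text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(source_en: str, derived_en: str, n: int = 1) -> float:
    """Fraction of the derived text's n-grams that also occur in the source."""
    derived = word_ngrams(derived_en, n)
    if not derived:
        return 0.0
    return len(derived & word_ngrams(source_en, n)) / len(derived)


def classify_reuse(
    source_en: str,
    derived_ur: str,
    translate: Callable[[str], str],
    wd_threshold: float = 0.8,  # hypothetical cut-off, not from the paper
    pd_threshold: float = 0.4,  # hypothetical cut-off, not from the paper
) -> str:
    """Translate the Urdu text to English, then label it by monolingual overlap."""
    score = containment(source_en, translate(derived_ur))
    if score >= wd_threshold:
        return "Wholly Derived"
    if score >= pd_threshold:
        return "Partially Derived"
    return "Non Derived"


# Toy stand-in for an MT system: maps placeholder document ids to
# pre-translated English text (assumption, for illustration only).
TOY_TRANSLATIONS = {
    "doc_a": "the minister opened the new school today",
    "doc_b": "the minister visited a hospital today",
    "doc_c": "cricket match postponed due to rain",
}


def toy_translate(derived_ur: str) -> str:
    return TOY_TRANSLATIONS[derived_ur]


SOURCE_EN = "the minister opened the new school in lahore today"
labels = [classify_reuse(SOURCE_EN, doc, toy_translate) for doc in ("doc_a", "doc_b", "doc_c")]
```

In practice the thresholds would be replaced by a classifier trained on the labelled corpus, and several similarity measures could be combined as features; the fixed cut-offs here only keep the example short.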

Keywords:
Urdu; Computer science; Artificial intelligence; Natural language processing; Reuse; Machine translation; Sentence; Linguistics; Engineering

Metrics

Cited By: 4
FWCI (Field Weighted Citation Impact): 1.02
Refs: 35
Citation Normalized Percentile: 0.75

Topics

Topic Modeling: Physical Sciences → Computer Science → Artificial Intelligence
Natural Language Processing Techniques: Physical Sciences → Computer Science → Artificial Intelligence
Data Quality and Management: Social Sciences → Decision Sciences → Management Science and Operations Research

Related Documents

JOURNAL ARTICLE

Cross-Lingual Text Reuse Detection at sentence level for English–Urdu language pair

Iqra Muneer, Rao Muhammad Adeel Nawab

Journal: Computer Speech & Language Year: 2022 Vol: 75 Pages: 101381

JOURNAL ARTICLE

Cross-lingual Text Reuse Detection Using Translation Plus Monolingual Analysis for English-Urdu Language Pair

Iqra Muneer, Rao Muhammad Adeel Nawab

Journal: ACM Transactions on Asian and Low-Resource Language Information Processing Year: 2021 Vol: 21 (2) Pages: 1-18

JOURNAL ARTICLE

Mono-lingual text reuse detection for the Urdu language at lexical level

Ayesha Noreen, Iqra Muneer, Rao Muhammad Adeel Nawab

Journal: Engineering Applications of Artificial Intelligence Year: 2024 Vol: 136 Pages: 109003

JOURNAL ARTICLE

Sentential Cross-lingual Paraphrase Detection for English-Urdu Language Pair

Iqra Muneer, Nida Waheed, Adnan Ashraf, Rao Muhammad Adeel Nawab

Journal: The European Journal on Artificial Intelligence Year: 2025 Vol: 38 (3) Pages: 309-329