Develop corpora and methods for cross-lingual text reuse detection for English Urdu language pair at lexical, syntactical, and phrasal levels

Iqra Muneer; Rao Muhammad Adeel Nawab

doi:10.1007/s10579-022-09613-4

JOURNAL ARTICLE

Year: 2022; Journal: Language Resources and Evaluation; Vol: 56 (4); Pages: 1103-1130; Publisher: Springer Science+Business Media

Abstract

In recent years, Cross-Lingual Text Reuse Detection (CLTRD) has attracted the attention of the research community because large digital repositories and efficient Machine Translation systems are readily and freely available, which makes it easy to reuse text across languages and very difficult to detect such reuse. Previous studies have explored CLTRD for the English-Urdu language pair at the sentence/passage and document levels, and have developed benchmark corpora and methods for those levels. However, benchmark corpora and methods for English-Urdu CLTRD at the lexical, syntactical, and phrasal levels are lacking. To fill this research gap, this study presents three large benchmark corpora for detecting Cross-Lingual Text Reuse (CLTR) at three levels of rewrite: Wholly Derived (WD), Partially Derived (PD), and Non Derived (ND). The CLEU-Lex, CLEU-Syn, and CLEU-Phr corpora contain 66,485 (WD = 22,236, PD = 20,315, and ND = 23,934), 60,267 (WD = 20,007, PD = 16,979, and ND = 23,281), and 60,106 (WD = 23,862, PD = 15,878, and ND = 20,366) CLTR pairs, respectively. As a second major contribution, we applied Cross-Lingual Word Embedding (CLWE), Cross-Lingual Semantic Tagger (CLST), and Cross-Lingual Sentence Transformer (CLSTR) based methods to the three proposed corpora for CLTRD. Our extensive experimentation showed that, for the binary classification task, the best results on the CLEU-Lex corpus were obtained using the cross-lingual sentence transformer ($F_{1}$ = 0.80). For the CLEU-Syn and CLEU-Phr corpora, the best results were obtained using the cross-lingual sentence transformer and a combination of the CLWE, CLST, and CLSTR methods ($F_{1}$ = 0.92 on CLEU-Syn and $F_{1}$ = 0.94 on CLEU-Phr). For the ternary classification task, the best results on the CLEU-Lex corpus were obtained using the cross-lingual sentence transformer method ($F_{1}$ = 0.69). For the CLEU-Syn corpus, the best results were obtained using a combination of the CLWE, CLST, and CLSTR methods ($F_{1}$ = 0.82). For the CLEU-Phr corpus, the best results were obtained using the cross-lingual sentence transformer and a combination of the CLWE, CLST, and CLSTR methods ($F_{1}$ = 0.78). To foster and promote research in Urdu (a low-resourced language), all three proposed corpora are free and publicly available for research purposes.
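The corpus sizes reported in the abstract can be tallied with a short script. The following is a minimal sketch, not code from the paper: the pair counts are copied from the abstract, while the names `CORPORA` and `summarize` are hypothetical, and the merging of WD and PD into a single "derived" class for the binary task is an assumption that the abstract does not confirm.

```python
# Pair counts per corpus and rewrite level, copied from the abstract.
CORPORA = {
    "CLEU-Lex": {"WD": 22_236, "PD": 20_315, "ND": 23_934},
    "CLEU-Syn": {"WD": 20_007, "PD": 16_979, "ND": 23_281},
    "CLEU-Phr": {"WD": 23_862, "PD": 15_878, "ND": 20_366},
}


def summarize(counts):
    """Return the total pair count, per-class shares, and a binary split."""
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    # Assumption (not stated in the abstract): the binary task treats
    # WD and PD together as "derived" and ND as "non-derived".
    derived = counts["WD"] + counts["PD"]
    return {
        "total": total,
        "shares": shares,
        "derived": derived,
        "non_derived": counts["ND"],
    }


summaries = {name: summarize(counts) for name, counts in CORPORA.items()}

for name, s in summaries.items():
    share_text = ", ".join(f"{k} {v:.1%}" for k, v in s["shares"].items())
    print(f"{name}: {s['total']} pairs ({share_text})")
```

The per-corpus totals computed this way agree with the 66,485, 60,267, and 60,106 pairs stated in the abstract, and the class shares show that none of the three rewrite levels dominates any corpus.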

Keywords:

Computer science; Natural language processing; Artificial intelligence; Transformer; Sentence; Benchmark (surveying); Urdu; Machine translation; Linguistics

Metrics

FWCI (Field Weighted Citation Impact): 0.98

Citation Normalized Percentile: 0.74

Topics

Topic Modeling (Physical Sciences → Computer Science → Artificial Intelligence)

Natural Language Processing Techniques (Physical Sciences → Computer Science → Artificial Intelligence)

Text and Document Classification Technologies (Physical Sciences → Computer Science → Artificial Intelligence)

Related Documents

Cross-Lingual Text Reuse Detection at sentence level for English–Urdu language pair

Cross-lingual Text Reuse Detection at Document Level for English-Urdu Language Pair

Cross-lingual Text Reuse Detection Using Translation Plus Monolingual Analysis for English-Urdu Language Pair

Mono-lingual text reuse detection for the Urdu language at lexical level

Sentential Cross-lingual Paraphrase Detection for English-Urdu Language Pair