ExaPPC: a Large-Scale Persian Paraphrase Detection Corpus

Reyhaneh Sadeghi; Hamed Karbasi; Ahmad Akbari

doi:10.1109/icwr54782.2022.9786243

JOURNAL ARTICLE

Year: 2022 Vol: 32 Pages: 168-175

Abstract

This paper describes the creation of the Exa Persian Paraphrase Corpus (ExaPPC), a large corpus of monolingual sentence-level paraphrases drawn from several sources. To the best of our knowledge, ExaPPC is the first large-scale dataset for Persian paraphrase detection. The corpus contains 2.3M labeled sentence pairs: 1M labeled as paraphrases and 1.3M labeled as non-paraphrases. It was constructed manually and semi-automatically using techniques such as subtitle alignment and translating an existing English-Persian parallel corpus and a similarity corpus of English tweets. In addition, to enrich the corpus, candidate sentence pairs were extracted from tweets with NLP tools and labeled by two native Persian speakers. Compared with existing corpora, ExaPPC offers more sentence pairs, greater variation in sentence length, and greater textual diversity, covering both formal and dialogue sentences. The results on the provided test corpus show that ExaPPC achieves 94% accuracy on the paraphrase detection task. The corpus is publicly available at https://github.com/exaco/exappc.
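
As a rough reading aid for the figures in the abstract, the sketch below works out the class balance implied by the stated counts and the accuracy that a trivial majority-class predictor would reach on data with that balance. The counts are the rounded ones from the abstract; the assumption that the test corpus shares this balance is not stated in the source, and the function names are illustrative only.

```python
# Pair counts as stated (and rounded) in the abstract.
PARAPHRASE_PAIRS = 1_000_000
NON_PARAPHRASE_PAIRS = 1_300_000


def class_shares(positive: int, negative: int) -> tuple[float, float]:
    """Return the fractions of positive and negative pairs."""
    total = positive + negative
    return positive / total, negative / total


def majority_baseline_accuracy(positive: int, negative: int) -> float:
    """Accuracy of always predicting the more frequent label."""
    return max(class_shares(positive, negative))


if __name__ == "__main__":
    pos_share, neg_share = class_shares(PARAPHRASE_PAIRS, NON_PARAPHRASE_PAIRS)
    baseline = majority_baseline_accuracy(PARAPHRASE_PAIRS, NON_PARAPHRASE_PAIRS)
    print(f"paraphrase share:     {pos_share:.1%}")
    print(f"non-paraphrase share: {neg_share:.1%}")
    print(f"majority baseline:    {baseline:.1%}")
```

If the test corpus had the same balance, always predicting "non-paraphrase" would score roughly 56.5%, which gives some context for the reported 94% accuracy.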

Keywords:

Paraphrase; Natural language processing; Artificial intelligence; Computer science; Sentence; Persian; Linguistics; Philosophy

Metrics

FWCI (Field Weighted Citation Impact): 0.20

Citation Normalized Percentile: 0.51

Topics

Topic Modeling

Physical Sciences → Computer Science → Artificial Intelligence

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Advanced Text Analysis Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Constructing a Large-Scale English-Persian Parallel Corpus

Matina: A Large-Scale 73B Token Persian Text Corpus

Corpus-Based Paraphrase Detection Experiments and Review

Verb Detection in Persian Corpus

Turkish Paraphrase Corpus