JOURNAL ARTICLE

ExaPPC: a Large-Scale Persian Paraphrase Detection Corpus

Abstract

This paper describes the creation of the Exa Persian Paraphrase Corpus (ExaPPC), a large corpus of monolingual sentence-level paraphrases drawn from several sources. To the best of our knowledge, ExaPPC is the first large-scale dataset for Persian paraphrase detection. The corpus contains 2.3M labeled sentence pairs: 1M labeled as paraphrases and 1.3M labeled as non-paraphrases. It was constructed manually and semi-automatically using techniques such as subtitle alignment, translation of an existing parallel English-Persian corpus, and a similarity corpus of English tweets. To further enrich the corpus, candidate sentence pairs among tweets were extracted with NLP tools and labeled by two native Persian speakers. Compared with existing corpora, ExaPPC offers more sentence pairs, greater variation in sentence length, and greater textual diversity, covering both formal and dialogue sentences. Results on the provided test corpus show that 94% accuracy is achieved with ExaPPC on the paraphrase detection task. The corpus is publicly available at https://github.com/exaco/exappc
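To put the reported label counts and accuracy in context, the following minimal sketch computes the class balance implied by the abstract and the accuracy of a trivial majority-class predictor. It assumes the rounded counts quoted above (1M and 1.3M); the split of the provided test corpus is not stated here and may differ. The names `LabeledPair`, `accuracy`, and `majority_baseline` are illustrative and are not taken from the ExaPPC repository.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LabeledPair:
    """Hypothetical in-memory form of one corpus entry (not the repository's format)."""
    first: str
    second: str
    is_paraphrase: bool


def accuracy(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Fraction of sentence pairs whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def majority_baseline(n_paraphrase: int, n_non_paraphrase: int) -> float:
    """Accuracy obtained by always predicting the more frequent label."""
    total = n_paraphrase + n_non_paraphrase
    return max(n_paraphrase, n_non_paraphrase) / total


# Rounded counts as quoted in the abstract (assumed, not exact).
N_PARAPHRASE = 1_000_000
N_NON_PARAPHRASE = 1_300_000

if __name__ == "__main__":
    share = N_PARAPHRASE / (N_PARAPHRASE + N_NON_PARAPHRASE)
    print(f"paraphrase share: {share:.3f}")
    print(f"majority baseline: {majority_baseline(N_PARAPHRASE, N_NON_PARAPHRASE):.3f}")
```

Under these assumed counts, always predicting "non-paraphrase" would be correct for roughly 57% of the full corpus, which gives a rough reference point for reading the 94% figure.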

Keywords:
Paraphrase; Natural language processing; Artificial intelligence; Computer science; Sentence; Persian; Linguistics; Philosophy

Metrics

Cited By: 1
FWCI (Field Weighted Citation Impact): 0.20
Refs: 40
Citation Normalized Percentile: 0.51

Topics

Topic Modeling (Physical Sciences → Computer Science → Artificial Intelligence)
Natural Language Processing Techniques (Physical Sciences → Computer Science → Artificial Intelligence)
Advanced Text Analysis Techniques (Physical Sciences → Computer Science → Artificial Intelligence)
