Reyhaneh Sadeghi, Hamed Karbasi, Ahmad Akbari
This paper describes the creation of the Exa Persian Paraphrase Corpus (ExaPPC), a large corpus of monolingual sentence-level paraphrases drawn from different sources. To the best of our knowledge, ExaPPC is the first large-scale paraphrase dataset for Persian paraphrase detection. The corpus contains 2.3M labeled sentence pairs, of which 1M are labeled as paraphrases and 1.3M as non-paraphrases. It was constructed manually and semi-automatically, using techniques such as subtitle alignment and translation of an existing parallel English-Persian corpus and a similarity corpus of English tweets. In addition, to enrich the corpus, candidate sentence pairs among tweets were extracted with NLP tools and labeled by two native Persian speakers. The advantages of this corpus over existing ones are the number of sentence pairs, the variation in sentence length, and its textual diversity, which covers both formal and dialogue sentences. Results on the provided test corpus show that ExaPPC achieves 94% accuracy on the paraphrase detection task. The corpus is publicly available at https://github.com/exaco/exappc
Sara Bourbour Hosseinbeigi, Fatemeh Taherinezhad, Heshaam Faili, Hamed Baghbani, Fatemeh Nadi, Mohsen Amiri
Şeniz Demir, İlknur Durgar El-Kahlout, Erdem Ünal, Hamza Kaya