Unsupervised construction of large paraphrase corpora

Bill Dolan; Chris Quirk; Chris Brockett

doi:10.3115/1220355.1220406


JOURNAL ARTICLE


Year: 2004 Pages: 350-es


Abstract

We investigate unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources. Two techniques are employed: (1) simple string edit distance, and (2) a heuristic strategy that pairs initial (presumably summary) sentences from different news stories in the same cluster. We evaluate both datasets using a word alignment algorithm and a metric borrowed from machine translation. Results show that edit distance data is cleaner and more easily aligned than the heuristic data, with an overall alignment error rate (AER) of 11.58% on a similarly extracted test set. On test data extracted by the heuristic strategy, however, performance of the two training sets is similar, with AERs of 13.2% and 14.7% respectively. Analysis of 100 pairs of sentences from each set reveals that the edit distance data lacks many of the complex lexical and syntactic alternations that characterize monolingual paraphrase. The summary sentences, while less readily alignable, retain more of the non-trivial alternations that are of greatest interest for learning paraphrase relationships.
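As a rough illustration of the two quantities the abstract names, the sketch below computes a word-level string edit distance between sentences in one cluster and an alignment error rate using the standard machine translation definition, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). This record gives no thresholds, tokenization, or filtering details, so the function names, the `max_distance` parameter, and the whitespace tokenizer are all assumptions made for illustration, not the authors' implementation.

```python
from itertools import combinations


def word_edit_distance(a, b):
    """Levenshtein distance between two token lists (unit cost per edit)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def candidate_pairs(sentences, max_distance):
    """Pair sentences whose word-level distance is between 1 and max_distance.

    max_distance is a hypothetical threshold; the source gives no value.
    """
    tokenized = [s.lower().split() for s in sentences]  # assumed tokenizer
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(tokenized), 2):
        d = word_edit_distance(a, b)
        if 0 < d <= max_distance:  # skip exact duplicates (distance 0)
            pairs.append((sentences[i], sentences[j], d))
    return pairs


def alignment_error_rate(predicted, sure, possible):
    """AER over sets of (source_index, target_index) alignment links."""
    possible = possible | sure  # sure links count as possible links too
    if not predicted and not sure:
        return 0.0
    hits = len(predicted & sure) + len(predicted & possible)
    return 1.0 - hits / (len(predicted) + len(sure))
```

Calling `candidate_pairs` on the sentences of a single news cluster returns near-duplicate pairs only, which is consistent with the abstract's observation that edit distance data aligns easily but misses more complex alternations.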

Keywords:

Paraphrase; Computer science; Natural language processing; Artificial intelligence; Sentence; Set (abstract data type); Metric (unit); Heuristic; Edit distance; Machine translation; Test set; Word (group theory); Similarity (geometry); Data set; Linguistics

Metrics

Cited By: 734

FWCI (Field Weighted Citation Impact): 15.83

Refs

Citation Normalized Percentile: 0.99

Is in top 1%

Is in top 10%


Topics

Topic Modeling

Physical Sciences → Computer Science → Artificial Intelligence

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Advanced Text Analysis Techniques

Physical Sciences → Computer Science → Artificial Intelligence


Related Documents

Construction of a Russian Paraphrase Corpus: Unsupervised Paraphrase Extraction

High-Performance Unsupervised Relation Extraction from Large Corpora

An Unsupervised Approach of Paraphrase Discovery from Large Crime Corpus

Automatic Construction of Fine-Grained Paraphrase Corpora System Using Language Inference Model

Collecting paraphrase corpora from volunteer contributors