Parallel corpora are crucial for training statistical machine translation (SMT) systems. However, for many language pairs they are available only in very limited quantities. For these language pairs, a large proportion of the phrases encountered at run-time will be unknown. We show how techniques from paraphrasing can be used to deal with these otherwise unknown source-language phrases. Our results show that augmenting a state-of-the-art SMT system with paraphrases leads to significantly improved coverage and translation quality. For a training corpus of 10,000 sentence pairs, we increase the coverage of unique test set unigrams from 48% to 90%, with more than half of the newly covered items accurately translated, as opposed to none in current approaches.
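To make the coverage figure concrete, the following is a minimal sketch of how coverage of unique test set unigrams could be measured, with and without paraphrases. It is an illustration under stated assumptions, not the authors' implementation: the function name `unique_unigram_coverage`, the whitespace tokenization, the toy sentences, and the representation of paraphrases as a plain dictionary are all hypothetical. How the paraphrases themselves are obtained is outside the scope of this sketch.

```python
def unique_unigram_coverage(train_sentences, test_sentences, paraphrases=None):
    """Fraction of unique test unigrams that the training data can cover.

    A test unigram counts as covered if it occurs in the training data, or
    (when a paraphrase table is given) if at least one of its paraphrases
    occurs there. The paraphrase table is a hypothetical dict mapping a
    word to a list of candidate paraphrases.
    """
    # Whitespace tokenization is an assumption made for brevity.
    train_vocab = {tok for sent in train_sentences for tok in sent.split()}
    test_vocab = {tok for sent in test_sentences for tok in sent.split()}
    if not test_vocab:
        return 0.0

    paraphrases = paraphrases or {}
    covered = 0
    for tok in test_vocab:
        if tok in train_vocab:
            covered += 1
        elif any(p in train_vocab for p in paraphrases.get(tok, ())):
            # Unknown word, but a paraphrase of it was seen in training.
            covered += 1
    return covered / len(test_vocab)


# Toy data, invented for illustration only.
train = ["the house is small", "the car is red"]
test = ["the home is tiny"]
toy_paraphrases = {"home": ["house"], "tiny": ["small", "minuscule"]}

baseline = unique_unigram_coverage(train, test)
augmented = unique_unigram_coverage(train, test, toy_paraphrases)
```

Note that coverage only says a source item can be handled at all; whether the resulting translation is accurate has to be judged separately, which is why the abstract reports both figures.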
Yuval Marton, Chris Callison-Burch, Philip Resnik
Nitin Madnani, Necip Fazıl Ayan, Philip Resnik, Bonnie J. Dorr
Francisco Guzmán Herrera, Leonardo Garrido
Francisco Guzmán, Leonardo Garrido