Copied Monolingual Data Improves Low-Resource Neural Machine Translation

Anna Currey; Antonio Valerio Miceli Barone; Kenneth Heafield

doi:10.18653/v1/w17-4715

JOURNAL ARTICLE

Year: 2017

Abstract

We train a neural machine translation (NMT) system to both translate source-language text and copy target-language text, thereby exploiting monolingual corpora in the target language. Specifically, we create a bitext from the monolingual text in the target language so that each source sentence is identical to the target sentence. This copied data is then mixed with the parallel corpus and the NMT system is trained like normal, with no metadata to distinguish the two input languages. Our proposed method proves to be an effective way of incorporating monolingual data into low-resource NMT. On Turkish↔English and Romanian↔English translation tasks, we see gains of up to 1.2 BLEU over a strong baseline with back-translation. Further analysis shows that the linguistic phenomena behind these gains are different from and largely orthogonal to back-translation, with our copied corpus method improving accuracy on named entities and other words that should remain identical between the source and target languages.
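
The data preparation described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `make_copied_corpus` and `mix_corpora`, the toy sentences, and the seeded shuffle are assumptions made for illustration only. The sketch shows the two steps the abstract names: pairing each target-language sentence with itself, and mixing the result into the parallel corpus without any tag that marks which pairs were copied.

```python
import random
from typing import List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def make_copied_corpus(target_monolingual: List[str]) -> List[SentencePair]:
    """Build a bitext in which each source sentence is identical to its target."""
    return [(sentence, sentence) for sentence in target_monolingual]


def mix_corpora(parallel: List[SentencePair],
                copied: List[SentencePair],
                seed: int = 0) -> List[SentencePair]:
    """Concatenate the parallel and copied pairs and shuffle them.

    No language tag or other metadata is added to the copied pairs.
    The shuffle (and its seed) is an assumption of this sketch.
    """
    mixed = list(parallel) + list(copied)
    random.Random(seed).shuffle(mixed)
    return mixed


# Hypothetical toy data for a Turkish-to-English system.
parallel_data = [
    ("Merhaba dünya .", "Hello world ."),
    ("Bu bir kitap .", "This is a book ."),
]
english_monolingual = [
    "Ankara is the capital of Turkey .",
    "The meeting starts at noon .",
]

copied_data = make_copied_corpus(english_monolingual)
training_data = mix_corpora(parallel_data, copied_data)
```

The resulting `training_data` would then be passed to whatever NMT toolkit is in use, exactly as an ordinary parallel corpus would be. How much copied data to mix in relative to the parallel data is a separate choice that this sketch does not address.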

Keywords: Machine translation; Computer science; Translation (biology); Artificial intelligence; Natural language processing; Speech recognition; Chemistry

Metrics

Cited By: 220

FWCI (Field Weighted Citation Impact): 30.71

Refs

Citation Normalized Percentile: 1.00

Is in top 1%

Is in top 10%

Topics

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Topic Modeling

Physical Sciences → Computer Science → Artificial Intelligence

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Low-Resource Neural Machine Translation Improvement Using Source-Side Monolingual Data

Rethinking the Exploitation of Monolingual Data for Low-Resource Neural Machine Translation

Zero-Resource Neural Machine Translation with Monolingual Pivot Data

Data Cartography for Low-Resource Neural Machine Translation

Filtered Pseudo-parallel Corpus Improves Low-resource Neural Machine Translation