Neural networks represent a major advance in modeling for statistical machine translation. These data-driven systems consist of an encoder that computes a representation of the source sentence and a decoder that accesses the encoder output and generates a probability distribution over target sentences. The components are connected via a cross-attention layer and trained jointly to minimize the cross-entropy loss on a corpus of bilingual training data, i.e., a set of sentence pairs in which one sentence is the translation of the other. In this dissertation, we focus on two important aspects of neural machine translation systems, namely the training data and the attention layer.

Since sentence-aligned bilingual data is a scarce resource whose availability depends on the language pair, we investigate the use of monolingual data to improve the performance of the machine translation system. We verify the reported results for the use of synthetic data (back-translation), extend language model fusion, and introduce pre-training to neural machine translation. Using a language model trained on monolingual target data is an established method in count-based machine translation approaches. We adapt this method to neural machine translation and extend it by training the parameters of the translation model as part of a larger fusion model. Furthermore, we use monolingual source and target data to find a better initialization for training. This pre-training also enables the use of monolingual source data, which is commonly ignored in machine translation systems. We evaluate these methods empirically on four language pairs with different data conditions and report improvements for all described methods over a purely bilingual baseline. Overall, back-translation provides the best results with respect to translation performance and data efficiency.

Inspired by existing work on alignment models, we also incorporate a first-order dependency into the attention layer.
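The language model fusion mentioned above can be illustrated with a minimal sketch. The snippet below shows shallow fusion, the simplest variant, in which the log-probabilities of a translation model and a target-side language model are combined at each decoding step; the function name, toy vocabulary, and interpolation weight are illustrative assumptions, not the dissertation's actual implementation.

```python
import numpy as np

def shallow_fusion_step(tm_logprobs, lm_logprobs, lam=0.3):
    """Combine translation-model and language-model log-probabilities
    for one decoding step (shallow fusion). `lam` weights the LM score."""
    return tm_logprobs + lam * lm_logprobs

# Toy vocabulary of 4 tokens: the TM prefers token 0, the LM prefers token 1.
tm = np.log(np.array([0.5, 0.2, 0.2, 0.1]))
lm = np.log(np.array([0.1, 0.6, 0.2, 0.1]))
combined = shallow_fusion_step(tm, lm)
next_token = int(np.argmax(combined))  # token chosen by the fused score
```

The extension studied in the dissertation goes further by training the translation model parameters jointly as part of the fusion model, rather than only combining scores at decoding time.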
In contrast to previous machine translation models, the transformer is a purely feed-forward model without any recurrent layers. This means that no information about previous attention decisions enters the computation of the attention layer. Modeling attention with a first-order dependency allows the attention layer to access previous attention decisions, which is an important prerequisite for expressing, e.g., source coverage. We propose and adapt several extensions that include this time-dependent information. Interpreting attention as a soft lookup of a query in a list of key-value pairs, we introduce the previous attention information in different ways and with different encodings. All methods are verified on several machine translation tasks, and we conclude that a zero-order attention model is sufficiently strong for the task of machine translation.
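The soft-lookup view of attention can be sketched as follows: the query is scored against every key, the scores are normalized with a softmax, and the values are averaged under that distribution. This is a minimal zero-order (scaled dot-product) attention sketch in NumPy; the shapes and random data are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    """Soft lookup: score the query against all keys, then return the
    values weighted by the resulting attention distribution."""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)   # one score per key
    weights = softmax(scores)            # attention distribution (sums to 1)
    return weights @ values, weights     # context vector and weights

# Toy example: 3 key-value pairs, key dimension 4, value dimension 2.
rng = np.random.default_rng(0)
keys = rng.normal(size=(3, 4))
values = rng.normal(size=(3, 2))
query = keys[1]                          # a query close to key 1
context, weights = attention(query, keys, values)
```

A first-order extension, as investigated in the dissertation, would additionally feed the previous step's `weights` into the computation of the next attention distribution, e.g. to track which source positions have already been covered.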