Bilingual Subword Segmentation for Neural Machine Translation

Hiroyuki Deguchi; Masao Utiyama; Akihiro Tamura; Takashi Ninomiya; Eiichiro Sumita

doi:10.18653/v1/2020.coling-main.378

JOURNAL ARTICLE

Year: 2020

Abstract

This paper proposes a new subword segmentation method for neural machine translation, "Bilingual Subword Segmentation," which tokenizes sentences so as to minimize the difference between the number of subword units in a sentence and the number in its translation. Existing subword segmentation methods tokenize a sentence without considering its translation. The proposed method instead tokenizes a sentence using subword units induced from bilingual sentences, which could be more favorable to machine translation. Evaluations on the WAT Asian Scientific Paper Excerpt Corpus (ASPEC) English-to-Japanese and Japanese-to-English translation tasks and the WMT14 English-to-German and German-to-English translation tasks show that our bilingual subword segmentation improves the performance of Transformer neural machine translation (by up to +0.81 BLEU).
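The selection criterion stated in the abstract, choosing segmentations so that a sentence and its translation have similar numbers of subword units, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pick_length_matched_pair`, the hard-coded candidate lists, and the tie-breaking rule (earlier candidates win) are all assumptions made for illustration. In practice the candidates might come from an n-best subword tokenizer.

```python
from itertools import product


def pick_length_matched_pair(src_candidates, tgt_candidates):
    """Return the (source, target) segmentation pair whose token counts differ least.

    Each candidate is a list of subword strings. Ties are broken by candidate
    order (earlier candidates win), which is an assumption of this sketch.
    """
    if not src_candidates or not tgt_candidates:
        raise ValueError("both candidate lists must be non-empty")
    # min() returns the first pair with the smallest absolute length difference.
    return min(
        product(src_candidates, tgt_candidates),
        key=lambda pair: abs(len(pair[0]) - len(pair[1])),
    )


# Hypothetical candidate segmentations of one English word and its German
# translation; a real system would segment whole sentences.
src_candidates = [
    ["seg", "ment", "ation"],
    ["segment", "ation"],
    ["segmentation"],
]
tgt_candidates = [
    ["Segment", "ierung"],
    ["Segmentierung"],
]

src, tgt = pick_length_matched_pair(src_candidates, tgt_candidates)
print(src, tgt)
```

Here the two-unit source candidate is paired with the two-unit target candidate, because that is the first pair whose length difference is zero. Exhaustive search over all candidate pairs is only practical for short candidate lists.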

Keywords:

Machine translation; Computer science; Artificial intelligence; Natural language processing; Segmentation; Sentence; Transformer; German; Translation (biology); Machine translation system; Speech recognition; Linguistics; Engineering; Voltage

Metrics

FWCI (Field Weighted Citation Impact): 0.73

Citation Normalized Percentile: 0.77

Topics

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Finding Better Subword Segmentation for Neural Machine Translation

Exploring Subword Segmentation Methods in English-Vietnamese Neural Machine Translation

BERTSeg: BERT Based Unsupervised Subword Segmentation for Neural Machine Translation

A Compression-Based Multiple Subword Segmentation for Neural Machine Translation