Abstract

This paper proposes a new subword segmentation method for neural machine translation, "Bilingual Subword Segmentation," which tokenizes sentences so as to minimize the difference between the number of subword units in a sentence and the number in its translation. Existing subword segmentation methods tokenize a sentence without considering its translation; the proposed method instead tokenizes a sentence using subword units induced from bilingual sentence pairs, which may be better suited to machine translation. Evaluations on the WAT Asian Scientific Paper Excerpt Corpus (ASPEC) English-to-Japanese and Japanese-to-English translation tasks and on the WMT14 English-to-German and German-to-English translation tasks show that bilingual subword segmentation improves the performance of Transformer neural machine translation (up to +0.81 BLEU).
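
The selection criterion described above can be illustrated with a small Python sketch. The abstract does not say how candidate segmentations are generated or searched, so everything here is an assumption made for illustration: the toy vocabularies, the brute-force enumeration, and the function names (`segmentations`, `sentence_candidates`, `pick_balanced_pair`) are hypothetical and are not the authors' implementation. The sketch only shows the idea of choosing, among candidate segmentations of a sentence pair, the one whose subword counts differ least.

```python
from itertools import product

# Toy vocabularies (hypothetical; a real system would learn these from data).
SRC_VOCAB = {"un", "believ", "able", "unbelievable", "result", "s", "results"}
TGT_VOCAB = {"un", "glaub", "liche", "ergebnis", "se", "ergebnisse"}


def segmentations(word, vocab):
    """Return every way to split `word` into pieces that all belong to `vocab`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in segmentations(word[end:], vocab):
                results.append([piece] + rest)
    return results


def sentence_candidates(sentence, vocab):
    """Combine per-word segmentations into candidate sentence segmentations."""
    per_word = [segmentations(word, vocab) for word in sentence.split()]
    return [
        [piece for word_seg in combo for piece in word_seg]
        for combo in product(*per_word)
    ]


def pick_balanced_pair(src, tgt, src_vocab, tgt_vocab):
    """Pick the (source, target) candidate pair with the smallest length gap.

    Brute force over all candidate pairs: fine for a toy example,
    not a practical search strategy.
    """
    pairs = product(
        sentence_candidates(src, src_vocab),
        sentence_candidates(tgt, tgt_vocab),
    )
    return min(pairs, key=lambda pair: abs(len(pair[0]) - len(pair[1])))


src_seg, tgt_seg = pick_balanced_pair(
    "unbelievable results", "unglaubliche ergebnisse", SRC_VOCAB, TGT_VOCAB
)
print(src_seg, tgt_seg)
```

A monolingual segmenter would choose each side independently, so the two sides could end up with quite different numbers of units; the sketch instead scores candidates jointly by the length gap, which is the criterion the abstract names.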

Keywords:
Machine translation; Computer science; Artificial intelligence; Natural language processing; Segmentation; Sentence; Transformer; German; Translation (biology); Machine translation system; Speech recognition; Linguistics; Engineering; Voltage

Metrics

Cited by: 7
FWCI (Field-Weighted Citation Impact): 0.73
Refs: 15
Citation Normalized Percentile: 0.77

Topics

Natural Language Processing Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence