JOURNAL ARTICLE

A Compression-Based Multiple Subword Segmentation for Neural Machine Translation

Keita Nonaka, Kazutaka Yamanouchi, I Tomohiro, Tsuyoshi Okita, Kazutaka Shimada, Hiroshi Sakamoto

Year: 2022   Journal: Electronics   Vol: 11 (7)   Pages: 1014-1014   Publisher: Multidisciplinary Digital Publishing Institute

Abstract

In this study, we propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm. Compression-based subword segmentation has recently attracted significant attention as a preprocessing method for training data in neural machine translation. Among these methods, BPE/BPE-dropout is one of the fastest and most effective compared to conventional approaches; however, compression-based approaches have a drawback: because they are deterministic, generating multiple segmentations of the same input is difficult. To overcome this difficulty, we focus on a stochastic string algorithm, called locally consistent parsing (LCP), that has been applied to achieve optimum compression. Employing the stochastic parsing mechanism of LCP, we propose LCP-dropout, a multiple subword segmentation method that improves on BPE/BPE-dropout, and we show that it outperforms various baselines, especially when learning from small training data.
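To illustrate why determinism limits a compression-based segmenter to a single segmentation, and how dropout relaxes that limit, the following is a minimal sketch in the style of BPE-dropout. It is not the LCP-dropout method proposed in the article. The merge table `MERGES`, the function name `segment`, and the parameter `p` are hypothetical and chosen only for this example.

```python
import random

# Hypothetical toy merge table, ordered by priority (earlier = applied first).
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]


def segment(word, merges, p=0.0, rng=None):
    """Segment `word` with ranked merges; each candidate merge is skipped with probability p."""
    rng = rng or random.Random()
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)  # start from single characters
    while True:
        # Adjacent pairs that are in the merge table and survive dropout this round.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in rank and rng.random() >= p
        ]
        if not candidates:
            return tokens
        # Apply the highest-priority surviving merge (leftmost on ties).
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


# p = 0.0 is fully deterministic: every call yields the same segmentation.
deterministic = segment("lower", MERGES, p=0.0)

# p > 0 yields different segmentations of the same word across calls.
rng = random.Random(0)
samples = {tuple(segment("lower", MERGES, p=0.5, rng=rng)) for _ in range(200)}
```

With `p = 0.0` the procedure always returns the same token sequence, which is the determinism the abstract refers to. With `p > 0` the same word can be split in several ways, so a translation model sees multiple segmentations of the same training sentence.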

Keywords:
Computer science; Segmentation; Parsing; Dropout (neural networks); Artificial intelligence; Compression (physics); Preprocessor; Machine translation; Lossless compression; String (physics); Data compression; Machine learning; Pattern recognition (psychology); Speech recognition; Mathematics

Metrics

Cited By: 19
FWCI (Field Weighted Citation Impact): 3.72
Refs: 34
Citation Normalized Percentile: 0.91

Topics

Natural Language Processing Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Topic Modeling
Physical Sciences →  Computer Science →  Artificial Intelligence
Algorithms and Data Compression
Physical Sciences →  Computer Science →  Artificial Intelligence