A Compression-Based Multiple Subword Segmentation for Neural Machine Translation

Keita Nonaka; Kazutaka Yamanouchi; I Tomohiro; Tsuyoshi Okita; Kazutaka Shimada; Hiroshi Sakamoto

doi:10.3390/electronics11071014


Year: 2022
Journal: Electronics
Vol: 11 (7)
Pages: 1014-1014
Publisher: Multidisciplinary Digital Publishing Institute


Abstract

In this study, we propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm. Compression-based subword segmentation has recently attracted significant attention as a preprocessing method for training data in neural machine translation. Among these methods, BPE/BPE-dropout is one of the fastest and most effective compared to conventional approaches; however, compression-based approaches have the drawback that generating multiple segmentations is difficult because the algorithms are deterministic. To overcome this difficulty, we focus on a stochastic string algorithm called locally consistent parsing (LCP), which has been applied to achieve optimum compression. Employing the stochastic parsing mechanism of LCP, we propose LCP-dropout, a multiple subword segmentation method that improves on BPE/BPE-dropout, and we show that it outperforms various baselines, especially when learning from small training data.
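To illustrate why plain BPE yields a single segmentation while a dropout variant yields several, here is a minimal sketch. It is not the LCP-dropout method proposed in the article, and it is only a simplified approximation of BPE-dropout: each matching occurrence of a learned merge is skipped independently with probability `dropout`. The function names (`learn_bpe`, `apply_merge`, `segment`) and the toy corpus are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter


def apply_merge(symbols, pair, dropout=0.0, rng=None):
    """Merge adjacent occurrences of `pair`, skipping each with prob. `dropout`."""
    out, i = [], 0
    while i < len(symbols):
        is_match = i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair
        # Assumption: dropout is applied per occurrence (a simplification).
        dropped = is_match and dropout > 0.0 and rng.random() < dropout
        if is_match and not dropped:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn an ordered list of merges by repeatedly fusing the most frequent pair."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Ties are broken lexicographically so that training is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [apply_merge(s, best) for s in corpus]
    return merges


def segment(word, merges, dropout=0.0, rng=None):
    """Segment `word` with the learned merges; dropout > 0 makes it stochastic."""
    rng = rng or random.Random()
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair, dropout, rng)
    return symbols


if __name__ == "__main__":
    toy_corpus = ["low", "lower", "lowest", "low", "newest", "widest"]
    merges = learn_bpe(toy_corpus, num_merges=6)
    print(segment("lowest", merges))  # same output on every call
    for seed in range(3):
        print(segment("lowest", merges, dropout=0.5, rng=random.Random(seed)))
```

With `dropout=0.0` the segmentation of a word is fixed, which is the determinism the abstract refers to; with `dropout > 0` repeated calls can return different segmentations of the same word, all of which concatenate back to the original string.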

Keywords:

Computer science, Segmentation, Parsing, Dropout (neural networks), Artificial intelligence, Compression (physics), Preprocessor, Machine translation, Lossless compression, String (physics), Data compression, Machine learning, Pattern recognition (psychology), Speech recognition, Mathematics

Metrics

FWCI (Field Weighted Citation Impact): 3.72

Citation Normalized Percentile: 0.91

Topics

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Topic Modeling

Physical Sciences → Computer Science → Artificial Intelligence

Algorithms and Data Compression

Physical Sciences → Computer Science → Artificial Intelligence


Related Documents

Bilingual Subword Segmentation for Neural Machine Translation

BERTSeg: BERT Based Unsupervised Subword Segmentation for Neural Machine Translation

Finding Better Subword Segmentation for Neural Machine Translation

Exploring Subword Segmentation Methods in English-Vietnamese Neural Machine Translation