Keita Nonaka, Kazutaka Yamanouchi, I Tomohiro, Tsuyoshi Okita, Kazutaka Shimada, Hiroshi Sakamoto
In this study, we propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm. Compression-based subword segmentation has recently attracted significant attention as a preprocessing method for training data in neural machine translation. Among compression-based methods, BPE and its variant BPE-dropout are among the fastest and most effective compared with conventional approaches. However, compression-based approaches share a drawback: because the underlying algorithm is deterministic, generating multiple segmentations of the same input is difficult. To overcome this difficulty, we focus on locally consistent parsing (LCP), a stochastic string algorithm that has been applied to achieve optimum compression. Employing the stochastic parsing mechanism of LCP, we propose LCP-dropout, a multiple subword segmentation method that improves on BPE/BPE-dropout, and we show that it outperforms various baselines, especially when learning from small training data.
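To make the determinism issue concrete, the following is a minimal sketch of the BPE/BPE-dropout baseline only, not of the proposed LCP-dropout. It assumes a merge table is already learned; the names `bpe_segment` and `MERGES` and the toy merge rules are hypothetical, and stopping as soon as no merge survives a round is a simplification chosen for brevity.

```python
import random

# Hypothetical toy merge table, highest priority first (an assumption for illustration).
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]

def bpe_segment(word, merges, dropout=0.0, rng=None):
    """Greedy BPE segmentation; each candidate merge is skipped with probability `dropout`."""
    rng = rng or random.Random()
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while len(tokens) > 1:
        # Adjacent pairs that are in the merge table and survive dropout this round.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in rank and rng.random() >= dropout
        ]
        if not candidates:
            break  # simplification: stop when nothing survives this round
        _, i = min(candidates)  # apply the highest-priority surviving merge
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

# dropout=0.0 is plain BPE: the same input always yields the same segmentation.
print(bpe_segment("lower", MERGES))               # ['low', 'er']
# dropout=1.0 drops every merge, leaving single characters.
print(bpe_segment("lower", MERGES, dropout=1.0))  # ['l', 'o', 'w', 'e', 'r']
# Intermediate values can yield different segmentations of the same word across calls.
print(bpe_segment("lower", MERGES, dropout=0.5, rng=random.Random(0)))
```

With `dropout=0.0` the output is fixed by the merge table, which is the determinism the abstract refers to; randomness enters only through the dropped merges.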
Hiroyuki Deguchi, Masao Utiyama, Akihiro Tamura, Takashi Ninomiya, Eiichiro Sumita
Haiyue Song, Raj Dabre, Zhuoyuan Mao, Chenhui Chu, Sadao Kurohashi
Thang H. Nguyen-Vo, Anh Duc Truong, Long Nguyen, Điền Đinh