JOURNAL ARTICLE

Analysis of Subword Tokenization for Transformer Model in Neural Machine Translation between Myanmar and English Languages

Abstract

Machine translation between Myanmar and English presents significant challenges, yet it is an important area of research for fostering connectivity and facilitating information access for Myanmar language speakers. Sustained research and continual innovation are needed to improve the quality and accessibility of machine translation for this language pair. Neural Machine Translation (NMT) models, especially those based on attention mechanisms and the Transformer architecture, show strong promise in this field. Integrating subword approaches into machine translation is crucial for managing linguistic complexity and diversity: it improves adaptability and overall translation performance, which is particularly important for morphologically rich and low-resource languages. In this study, we evaluate and compare the translation performance of Transformer and Recurrent Neural Network (RNN) models optimized with subword tokenization on the Myanmar-English WAT2019 corpus. Importantly, we find that the correct selection of the subword model is the single most significant factor influencing translation performance. A Transformer model optimized with 32k Byte Pair Encoding (BPE) subwording showed a significant improvement in BLEU scores: 16.92 points for the English-Myanmar direction and 17.01 points for the Myanmar-English direction, compared to a baseline RNN model likewise optimized with 32k BPE subwording. We also assessed SentencePiece models using both the unigram and BPE algorithms. The best results were BLEU scores of 50.76 for the English-Myanmar direction and 48.91 for the Myanmar-English direction, achieved with Transformer models optimized with 32k BPE subword models.
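To make the BPE subword approach discussed above concrete, here is a minimal pure-Python sketch of the merge-learning step at the heart of BPE. The paper itself uses SentencePiece with a 32k vocabulary on the WAT2019 corpus; the tiny word-frequency table below is purely illustrative, and the helper name `learn_bpe` is our own.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a {word: frequency} dict.

    Each word starts as a sequence of characters plus an end-of-word
    marker; at every step the most frequent adjacent symbol pair is
    merged into a single new symbol.
    """
    # Represent each word as a tuple of symbols, ending in "</w>".
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge throughout the vocabulary.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Toy corpus (illustrative only, not from WAT2019):
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10)
print(merges[:3])  # → [('e', 's'), ('es', 't'), ('est', '</w>')]
```

In practice, the study's pipeline would instead train a SentencePiece model (BPE or unigram, 32k vocabulary) over the raw training corpus and apply it to both Myanmar and English sides before feeding the Transformer or RNN model; the merge loop above is only the conceptual core of the BPE variant.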

Keywords:
Machine translation; Computer science; Transformer; Artificial intelligence; Natural language processing; Language model; Evaluation of machine translation; Recurrent neural network; Example-based machine translation; Machine learning; Artificial neural network; Machine translation software usability; Voltage Engineering

Metrics

Cited by: 2
FWCI (Field-Weighted Citation Impact): 1.28
References: 23
Citation Normalized Percentile: 0.76

Citation History

Topics

Natural Language Processing Techniques (Physical Sciences → Computer Science → Artificial Intelligence)
Topic Modeling (Physical Sciences → Computer Science → Artificial Intelligence)
Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
© 2026 ScienceGate Book Chapters — All rights reserved.