JOURNAL ARTICLE

Data Augmentation for Low Resource Neural Machine Translation for Sotho-Tswana Languages

Mojapelo, Maxwell; Buys, Jan

Year: 2023
Journal: Zenodo (CERN European Organization for Nuclear Research)
Publisher: European Organization for Nuclear Research

Abstract

Neural Machine Translation (NMT) models have achieved remarkable performance when translating between high-resource languages. However, translation quality for languages with limited data is much worse. This research focuses on the low-resource language Sepedi and considers two data augmentation techniques to increase the size and diversity of English-Sepedi corpora for training an NMT model. First, we consider backtranslation, which makes use of the larger amount of available monolingual Sepedi text. We train a reverse (Sepedi-to-English) model and generate synthetic English sentences from the monolingual Sepedi sentences. These synthetic translation examples are added to the parallel English-Sepedi sentences. We carry out various experiments to investigate translation quality improvements. The second technique we consider is to generate synthetic data from parallel sentences between English and a closely related language, Setswana. Setswana words are replaced with Sepedi words through an induced bilingual dictionary, which is created by using a supervised Generative Adversarial Network to align the embeddings of Sepedi and Setswana words. We evaluate our models on the JW300, FLoRes and Autshumato evaluation test sets, finding improvements over the current benchmark BLEU scores across all three datasets.
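The two augmentation steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `reverse_translate` callable (standing in for a trained Sepedi-to-English model), and the `tn_to_nso` dictionary (standing in for the induced Setswana-to-Sepedi dictionary) are all hypothetical, and whitespace tokenisation is assumed for simplicity.

```python
from typing import Callable, Dict, List, Tuple

# An (English, Sepedi) training pair. Hypothetical type alias for illustration.
Pair = Tuple[str, str]


def backtranslate(
    mono_sepedi: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Pair]:
    """Pair each monolingual Sepedi sentence with a synthetic English source.

    `reverse_translate` is an assumed stand-in for a trained
    Sepedi-to-English model.
    """
    return [(reverse_translate(sentence), sentence) for sentence in mono_sepedi]


def replace_with_dictionary(
    en_tn_pairs: List[Pair],
    tn_to_nso: Dict[str, str],
) -> List[Pair]:
    """Turn English-Setswana pairs into synthetic English-Sepedi pairs.

    Each Setswana token with a dictionary entry is swapped for its Sepedi
    counterpart; tokens without an entry are left unchanged.
    """
    synthetic = []
    for english, setswana in en_tn_pairs:
        tokens = [tn_to_nso.get(token, token) for token in setswana.split()]
        synthetic.append((english, " ".join(tokens)))
    return synthetic


def augment(
    parallel: List[Pair],
    mono_sepedi: List[str],
    reverse_translate: Callable[[str], str],
    en_tn_pairs: List[Pair],
    tn_to_nso: Dict[str, str],
) -> List[Pair]:
    """Concatenate genuine parallel data with both kinds of synthetic data."""
    return (
        parallel
        + backtranslate(mono_sepedi, reverse_translate)
        + replace_with_dictionary(en_tn_pairs, tn_to_nso)
    )
```

In this sketch the synthetic pairs are simply appended to the genuine parallel data. How the synthetic and genuine examples are mixed or weighted in the actual experiments is not stated in the abstract.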

Keywords:
Machine translation; Benchmark (surveying); Word (group theory); Generative grammar; Translation (biology); Bilingual dictionary; Resource (disambiguation)

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 0
Citation Normalized Percentile: 0.37

Topics

Natural Language Processing Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Topic Modeling
Physical Sciences →  Computer Science →  Artificial Intelligence
Translation Studies and Practices
Social Sciences →  Arts and Humanities →  Language and Linguistics