JOURNAL ARTICLE

Data augmentation English-Indonesia-Madurese parallel corpus dataset using neural machine translation

Abstract

INMAD is a parallel corpus dataset of English-Indonesian-Madurese translated sentences. The corpus contains 23,086 sentence lines, together with their translations in Indonesian and English. The Madurese translations cover one language level, namely the 'engghi-enten' level. The framework for creating the dataset consists of two stages. First, parallel corpus sources are combined to create the sentence corpus and improve its quality. Second, the data is augmented by back-translation using MarianMT, and the resulting parallel data is combined with the original parallel corpus. INMAD was validated by a Madurese language specialist, who also served as the translator for the source of this dataset. Consequently, this dataset can serve as a primary resource for Natural Language Processing (NLP) research, particularly for Madurese at the 'engghi-enten' level.
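
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`combine_sources`, `back_translate`), the deduplication rule, and the toy translators are assumptions made for illustration. In a real pipeline the two translator callables would presumably wrap MarianMT models; here they are stand-in functions so that the example runs without downloading anything.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]                         # (source sentence, target sentence)
Translator = Callable[[List[str]], List[str]]  # batch of sentences in, batch out


def combine_sources(*sources: List[Pair]) -> List[Pair]:
    """Stage 1 (assumed form): merge parallel sources, dropping empty and duplicate pairs."""
    seen = set()
    merged = []
    for source in sources:
        for src, tgt in source:
            pair = (src.strip(), tgt.strip())
            if not pair[0] or not pair[1] or pair in seen:
                continue
            seen.add(pair)
            merged.append(pair)
    return merged


def back_translate(corpus: List[Pair],
                   to_pivot: Translator,
                   from_pivot: Translator) -> List[Pair]:
    """Stage 2 (assumed form): round-trip the source side through a pivot
    language, pair each paraphrase with the original target sentence, and
    append the new pairs to the original corpus."""
    sources = [src for src, _ in corpus]
    round_trip = from_pivot(to_pivot(sources))
    existing = set(corpus)
    augmented = list(corpus)
    for (_, tgt), new_src in zip(corpus, round_trip):
        candidate = (new_src.strip(), tgt)
        if candidate[0] and candidate not in existing:
            existing.add(candidate)
            augmented.append(candidate)
    return augmented


# Hypothetical stand-ins for real translation models (illustration only).
def toy_to_pivot(sentences: List[str]) -> List[str]:
    return [s.upper() for s in sentences]


def toy_from_pivot(sentences: List[str]) -> List[str]:
    return [s.lower() for s in sentences]


if __name__ == "__main__":
    source_a = [("Good morning", "Selamat pagi"), ("thank you", "terima kasih")]
    source_b = [("thank you", "terima kasih")]  # duplicate of a pair in source_a
    merged = combine_sources(source_a, source_b)
    for pair in back_translate(merged, toy_to_pivot, toy_from_pivot):
        print(pair)
```

Passing the translators in as plain callables keeps the augmentation logic independent of any particular translation library, so the same sketch would apply whichever model performs the round trip.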

Metrics

Cited By: 1
FWCI (Field Weighted Citation Impact): 4.82
Refs: 12
Citation Normalized Percentile: 0.94

Topics

Natural Language Processing Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Topic Modeling
Physical Sciences →  Computer Science →  Artificial Intelligence
Handwritten Text Recognition Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

© 2026 ScienceGate Book Chapters — All rights reserved.