Data Cartography for Low-Resource Neural Machine Translation

Aquia Richburg; Marine Carpuat

doi:10.18653/v1/2022.findings-emnlp.410

Year: 2022 Pages: 5594-5607

Abstract

While collecting or generating more parallel data is necessary to improve machine translation (MT) in low-resource settings, we lack an understanding of how the limited amounts of existing data are actually used to help guide the collection of further resources. In this paper, we apply data cartography techniques (Swayamdipta et al., 2020) to characterize the contribution of training samples in two low-resource MT tasks (Swahili-English and Turkish-English) throughout the training of standard neural MT models. Our empirical study shows that, unlike in prior work for classification tasks, most samples contribute to model training in low-resource MT, albeit not uniformly throughout the training process. Furthermore, uni-dimensional characterizations of samples – e.g., based on dual cross-entropy or word frequency – do not suffice to characterize to what degree they are hard or easy to learn. Taken together, our results suggest that data augmentation strategies for low-resource MT would benefit from model-in-the-loop strategies to maximize improvements.
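As a rough illustration of the data cartography statistics of Swayamdipta et al. (2020) that the abstract refers to, the sketch below computes a per-sample confidence (the mean probability assigned to the reference across epochs) and variability (the standard deviation of that probability across epochs). It is a minimal sketch, not the authors' implementation: it assumes one gold-reference probability per sample per epoch has already been logged (for MT this might be a length-normalized sequence probability, which is an assumption here), and the names `cartography_stats`, `region`, `conf_cut`, and `var_cut`, along with the threshold values, are hypothetical.

```python
from statistics import mean, pstdev


def cartography_stats(probs_per_epoch):
    """Compute training-dynamics statistics per sample.

    probs_per_epoch maps a sample id to a list with one entry per epoch:
    the model's probability of the gold reference at that epoch
    (how this probability is obtained for MT is an assumption of this sketch).
    """
    stats = {}
    for sample_id, probs in probs_per_epoch.items():
        stats[sample_id] = {
            # Confidence: mean gold probability across epochs.
            "confidence": mean(probs),
            # Variability: population standard deviation across epochs.
            "variability": pstdev(probs),
        }
    return stats


def region(confidence, variability, conf_cut=0.5, var_cut=0.2):
    """Assign a coarse data-map region; the cut-offs are illustrative only."""
    if variability >= var_cut:
        return "ambiguous"
    return "easy-to-learn" if confidence >= conf_cut else "hard-to-learn"


if __name__ == "__main__":
    # Toy probabilities, not results from the paper.
    logged = {"s1": [0.9, 0.9, 0.9], "s2": [0.1, 0.5, 0.9], "s3": [0.1, 0.1, 0.1]}
    for sample_id, s in cartography_stats(logged).items():
        print(sample_id, region(s["confidence"], s["variability"]))
```

Plotting confidence against variability for every training sample gives the two-dimensional "data map" that separates samples by how reliably and how consistently the model learns them, in contrast to a single score per sample.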

Keywords: Computer science; Machine translation; Resource (disambiguation); Swahili; Artificial intelligence; Natural language processing; Machine learning; Empirical research; Linguistics

Metrics

FWCI (Field Weighted Citation Impact): 0.39

Citation Normalized Percentile: 0.64

Topics (all classified under Physical Sciences → Computer Science → Artificial Intelligence)

Natural Language Processing Techniques

Topic Modeling

Semantic Web and Ontologies
