JOURNAL ARTICLE

Exploring Paracrawl for Document-level Neural Machine Translation

Abstract

Document-level neural machine translation (NMT) has outperformed sentence-level NMT on a number of datasets. However, document-level NMT is still not widely adopted in real-world translation systems, mainly because large-scale, general-domain training data for document-level NMT is lacking. We examine the effectiveness of using Paracrawl for learning document-level translation. Paracrawl is a large-scale parallel corpus crawled from the Internet and contains data from various domains. The official Paracrawl corpus was released as parallel sentences (extracted from parallel webpages), and therefore previous work used Paracrawl only for learning sentence-level translation. In this work, we extract parallel paragraphs from Paracrawl parallel webpages using automatic sentence alignments, and we use the extracted parallel paragraphs as parallel documents for training document-level translation models. We show that document-level NMT models trained with only parallel paragraphs from Paracrawl can be used to translate real documents from TED, News and Europarl, outperforming sentence-level NMT models. We also perform a targeted pronoun evaluation and show that document-level models trained with Paracrawl data can help context-aware pronoun translation.
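
The paragraph-extraction step described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the authors' exact procedure, so everything here is an assumption made for illustration: the function name `extract_parallel_paragraphs`, the input layout (each page given as a list of paragraphs, each paragraph a list of sentences), the representation of sentence alignments as page-level index pairs, and the rule that a paragraph pair is kept only when its aligned sentences point exclusively at each other.

```python
from collections import defaultdict


def extract_parallel_paragraphs(src_paras, tgt_paras, alignments):
    """Pair paragraphs whose aligned sentences link only to each other.

    Hypothetical helper, not the paper's implementation.
    src_paras / tgt_paras: list of paragraphs, each a list of sentences.
    alignments: iterable of (src_sentence_index, tgt_sentence_index),
        with sentence indices counted across the whole page.
    Returns a list of (source_paragraph_text, target_paragraph_text).
    """

    def sentence_to_paragraph(paras):
        # Map each page-level sentence index to its paragraph index.
        mapping = []
        for para_id, para in enumerate(paras):
            mapping.extend([para_id] * len(para))
        return mapping

    src_map = sentence_to_paragraph(src_paras)
    tgt_map = sentence_to_paragraph(tgt_paras)

    # For every paragraph, record which paragraphs on the other side
    # its sentences are aligned to.
    src_links = defaultdict(set)
    tgt_links = defaultdict(set)
    for s, t in alignments:
        src_links[src_map[s]].add(tgt_map[t])
        tgt_links[tgt_map[t]].add(src_map[s])

    pairs = []
    for src_id in sorted(src_links):
        targets = src_links[src_id]
        if len(targets) != 1:
            continue  # sentences spread over several target paragraphs
        (tgt_id,) = targets
        if tgt_links[tgt_id] == {src_id}:
            # Assumed filter: keep only one-to-one paragraph links.
            pairs.append(
                (" ".join(src_paras[src_id]), " ".join(tgt_paras[tgt_id]))
            )
    return pairs
```

Under these assumptions, each returned pair would serve as one short parallel document for training, and paragraphs whose alignments cross paragraph boundaries are simply dropped rather than repaired.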

Keywords:
Computer science; Machine translation; Natural language processing; Artificial intelligence; Sentence; Context (archaeology); Parallel corpora; Information retrieval; Translation (biology); Pronoun; The Internet; World Wide Web; Linguistics

Metrics

Cited By: 4
FWCI (Field Weighted Citation Impact): 1.02
Refs: 27
Citation Normalized Percentile: 0.77

Topics

Natural Language Processing Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Topic Modeling
Physical Sciences →  Computer Science →  Artificial Intelligence
Text Readability and Simplification
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Rethinking Document-level Neural Machine Translation

Zewei Sun, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Shujian Huang, Jiajun Chen, Lei Li

Journal: Findings of the Association for Computational Linguistics: ACL 2022 Year: 2022 Pages: 3537-3548

JOURNAL ARTICLE

Context-Adaptive Document-Level Neural Machine Translation

Zhang Li, Zhirui Zhang, Boxing Chen, Weihua Luo, Luo Si

Journal: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Year: 2022 Pages: 6232-6236