JOURNAL ARTICLE

Sinhala and English Document Alignment using Statistical Machine Translation

Abstract

Document alignment is the process of identifying documents that have the same content in different languages. It is a useful prerequisite for creating parallel corpora for Machine Translation (MT). Hybrid document alignment techniques, which combine a set of heuristics with an existing MT system, are commonly used. In these systems, all target documents are first translated into the source language using an existing MT system. Candidate pairs are then identified using heuristics such as web domain or publication date, and the similarity of each candidate pair is calculated using a similarity calculation algorithm. The heuristics serve either to reduce the number of candidate pairs or to improve alignment accuracy. However, these heuristics depend on the selected document sources. In this paper, we present a hybrid document alignment system for Sinhala and English in which a set of source-independent heuristics is applied to the output of an MT system. In addition, we demonstrate how transliteration between Sinhala and English is exploited to improve the performance of the document alignment process.
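The similarity step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the similarity algorithm or the heuristics used, so everything here is an assumption for illustration: the target documents are taken to be already translated into the source language, similarity is plain bag-of-words cosine similarity, and the names `bag_of_words`, `cosine_similarity`, `align_documents`, and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Term counts from a deliberately naive lowercase, whitespace tokeniser."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def align_documents(source_docs, translated_target_docs, threshold=0.5):
    """Pair each source document with its most similar translated target.

    Both arguments map a document id to its text. The target texts are
    assumed to be MT output in the source language already. The threshold
    is an illustrative cut-off, not a value taken from the paper.
    """
    target_vectors = {
        tid: bag_of_words(text) for tid, text in translated_target_docs.items()
    }
    pairs = []
    for sid, text in source_docs.items():
        source_vector = bag_of_words(text)
        best_id, best_score = None, 0.0
        for tid, target_vector in target_vectors.items():
            score = cosine_similarity(source_vector, target_vector)
            if score > best_score:
                best_id, best_score = tid, score
        if best_id is not None and best_score >= threshold:
            pairs.append((sid, best_id, best_score))
    return pairs


# Toy data: English strings stand in for source-language text and MT output.
source_docs = {
    "s1": "the minister opened the new school",
    "s2": "heavy rain flooded the city roads",
}
translated_target_docs = {
    "t1": "heavy rain has flooded roads in the city",
    "t2": "minister opened a new school today",
}
pairs = align_documents(source_docs, translated_target_docs)
for source_id, target_id, score in pairs:
    print(source_id, target_id, round(score, 3))
```

This sketch compares every source document against every target document. A heuristic filter, such as the web domain or publication date mentioned in the abstract, would shrink the inner loop to a smaller candidate set before scoring.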

Keywords:
Heuristics; Computer science; Similarity (geometry); Artificial intelligence; Natural language processing; Set (abstract data type); Transliteration; Machine translation; Process (computing); Information retrieval; Programming language; Image (mathematics)

Metrics

Cited By: 5
FWCI (Field Weighted Citation Impact): 0.29
Refs: 24
Citation Normalized Percentile: 0.66

Topics

Natural Language Processing Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Topic Modeling
Physical Sciences →  Computer Science →  Artificial Intelligence
Semantic Web and Ontologies
Physical Sciences →  Computer Science →  Artificial Intelligence