Sinhala and English Document Alignment using Statistical Machine Translation

Rajitha M. D. C; Piyarathna L.L. C; Nayanajith M. M.D. S; S Surangika

doi:10.1109/icter51097.2020.9325462


JOURNAL ARTICLE


Year: 2020 Pages: 29-34



Abstract

Document alignment is the process of identifying documents that have the same content in different languages. It is a very useful prerequisite for creating parallel corpora to be used in Machine Translation (MT). Hybrid document alignment techniques are commonly used, where a set of heuristics is applied along with an existing MT system. In these systems, all the target documents are first translated into the source language using an existing MT system. Candidate pairs are then identified using heuristics such as web domain or published date, and the similarity of these candidate pairs is calculated using a similarity calculation algorithm. Heuristics are used either to reduce the candidate pair count or to improve the accuracy of alignment. However, the heuristics considered are dependent on the selected document sources. In this paper, we present a hybrid document alignment system for Sinhala and English, where a set of source-independent heuristics is used on the output of an MT system. In addition, we demonstrate how transliteration between Sinhala and English is exploited to improve the performance of the document alignment process.
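
The generic hybrid pipeline described in the abstract (translate, prune candidate pairs with heuristics, score similarity) can be illustrated with a minimal sketch. The abstract does not name the similarity algorithm or the heuristics the authors use, so everything here is an assumption: bag-of-words cosine similarity stands in for the similarity calculation, the `keep` predicate stands in for any candidate-pruning heuristic, greedy one-to-one matching stands in for the final alignment step, and `align`, `cosine`, and `threshold` are hypothetical names. The target documents are assumed to have already been translated into the source language by some MT system.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def align(source_docs, translated_docs, keep, threshold=0.3):
    """Greedy one-to-one document alignment (illustrative only).

    source_docs:     dict of doc id -> text in the source language
    translated_docs: dict of doc id -> MT output for a target document
                     (assumed to be produced beforehand)
    keep:            hypothetical heuristic, (src_id, tgt_id) -> bool,
                     used to prune candidate pairs before scoring
    threshold:       assumed minimum similarity for accepting a pair
    """
    src_vecs = {i: Counter(t.lower().split()) for i, t in source_docs.items()}
    tgt_vecs = {j: Counter(t.lower().split()) for j, t in translated_docs.items()}

    # Score only the candidate pairs that survive the heuristic filter.
    scored = [
        (cosine(src_vecs[i], tgt_vecs[j]), i, j)
        for i in src_vecs
        for j in tgt_vecs
        if keep(i, j)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)

    # Accept pairs from best to worst, using each document at most once.
    pairs, used_src, used_tgt = [], set(), set()
    for score, i, j in scored:
        if score < threshold:
            break
        if i not in used_src and j not in used_tgt:
            pairs.append((i, j, score))
            used_src.add(i)
            used_tgt.add(j)
    return pairs


source = {
    "en1": "the president visited the school today",
    "en2": "heavy rain flooded the city roads",
}
translated = {
    "si1": "heavy rain has flooded city roads",
    "si2": "president visited a school today",
}
pairs = align(source, translated, keep=lambda i, j: True)
```

In this toy run `keep` accepts every pair. A source-independent heuristic in the spirit of the abstract might instead compare document lengths, although that particular choice is a guess and is not taken from the paper.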

Keywords:

Heuristics; Computer science; Similarity (geometry); Artificial intelligence; Natural language processing; Set (abstract data type); Transliteration; Machine translation; Process (computing); Information retrieval; Programming language; Image (mathematics)

Metrics

FWCI (Field Weighted Citation Impact): 0.29

Citation Normalized Percentile: 0.66

Topics

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Topic Modeling

Physical Sciences → Computer Science → Artificial Intelligence

Semantic Web and Ontologies

Physical Sciences → Computer Science → Artificial Intelligence


Related Documents

Statistical machine translation using hierarchical phrase alignment

Statistical machine translation of systems for Sinhala - Tamil

Statistical Machine Translation and Word Alignment

English to Sinhala Neural Machine Translation

Discriminative Spoken Language Understanding Using Statistical Machine Translation Alignment Models