Document alignment is the process of identifying documents that have the same content in different languages. It is a useful prerequisite for creating parallel corpora for Machine Translation (MT). Hybrid document alignment techniques, which combine a set of heuristics with an existing MT system, are commonly used. In these systems, all target documents are first translated into the source language using an existing MT system. Candidate pairs are then identified using heuristics such as web domain or publication date, and the similarity of each candidate pair is calculated using a similarity calculation algorithm. Heuristics are used either to reduce the number of candidate pairs or to improve alignment accuracy. However, the heuristics considered depend on the selected document sources. In this paper, we present a hybrid document alignment system for Sinhala and English in which a set of source-independent heuristics is applied to the output of an MT system. In addition, we demonstrate how transliteration between Sinhala and English can be exploited to improve the performance of the document alignment process.
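The generic hybrid pipeline described above (translate, filter candidates with a heuristic, score similarity) can be illustrated with a minimal sketch. It is not the system presented in this paper: the publication-date filter, the bag-of-words cosine similarity, the names `cosine_similarity` and `align`, the parameters `max_days` and `threshold`, and the sample documents are all assumptions chosen for illustration, and the target documents are assumed to have already been translated by an MT system.

```python
from collections import Counter
from datetime import date
from math import sqrt


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def align(source_docs, translated_docs, max_days=1, threshold=0.5):
    """Pair each source document with its best-scoring candidate.

    Each document is a (doc_id, published_date, text) tuple. translated_docs
    holds target documents already translated into the source language.
    max_days and threshold are illustrative values, not tuned settings.
    """
    pairs = []
    for s_id, s_date, s_text in source_docs:
        # Heuristic filter: only compare documents published close together.
        candidates = [
            (cosine_similarity(s_text, t_text), t_id)
            for t_id, t_date, t_text in translated_docs
            if abs((s_date - t_date).days) <= max_days
        ]
        if candidates:
            score, t_id = max(candidates)
            if score >= threshold:
                pairs.append((s_id, t_id, score))
    return pairs


# Hypothetical sample data: "si*" texts stand in for MT output.
source_docs = [
    ("en1", date(2020, 1, 1), "the president opened the new bridge"),
    ("en2", date(2020, 1, 5), "heavy rain floods the city"),
]
translated_docs = [
    ("si1", date(2020, 1, 1), "president opened new bridge today"),
    ("si2", date(2020, 1, 5), "city flooded after heavy rain"),
    ("si3", date(2020, 3, 1), "the president opened the new bridge"),
]
pairs = align(source_docs, translated_docs)
```

In this sketch the date filter excludes `si3` even though its text matches `en1` exactly, which shows how a source-dependent heuristic shrinks the candidate set but can also discard valid pairs when the metadata is unreliable.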
Taro Watanabe, Kenji Imamura, Eiichiro Sumita, Hiroshi G. Okuno
Thilakshi Fonseka, Rashmini Naranpanawa, Ravinga Perera, Uthayasanker Thayasivam
Mohammad Aliannejadi, Shahram Khadivi, Saeed Shiry Ghidary, Mohammad Hadi Bokaei