JOURNAL ARTICLE

Statistical transliteration for English-Arabic cross language information retrieval

Abstract

Out-of-vocabulary (OOV) words are problematic for cross language information retrieval. One way to deal with OOV words when the two languages have different alphabets is to transliterate the unknown words, that is, to render them in the orthography of the second language. In this study, we present a simple statistical technique for training an English-to-Arabic transliteration model from pairs of names. We call this a selected n-gram model because it uses a two-stage training procedure: the first stage learns which n-gram segments should be added to the unigram inventory for the source language, and the second stage learns the translation model over this inventory. This technique requires no heuristics or linguistic knowledge of either language. We evaluate the statistically trained model and a simpler hand-crafted model on a test set of named entities from the Arabic AFP corpus and demonstrate that both perform better than two online translation sources. We also explore the effectiveness of these systems on the TREC 2002 cross language IR task. We find that transliteration, either of OOV named entities or of all OOV words, is an effective approach for cross language IR.
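
The two-stage idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how name pairs are aligned, so the sketch assumes the pairs are already aligned at the character level. The toy data, the ASCII stand-in symbols for Arabic letters, the `min_count` threshold, and the function names `select_ngrams`, `train_model`, and `transliterate` are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy data (assumption): each name is a list of
# (source_run, target_symbol) pairs, where source_run is the run of English
# letters aligned to one target symbol. Target symbols are ASCII stand-ins
# for Arabic letters; "" means the source letter has no written counterpart.
ALIGNED_NAMES = [
    [("sh", "$"), ("a", ""), ("r", "r"), ("i", "y"), ("f", "f")],
    [("sh", "$"), ("a", "A"), ("d", "d"), ("i", "y")],
    [("s", "s"), ("a", "A"), ("m", "m"), ("i", "y")],
    [("r", "r"), ("a", ""), ("sh", "$"), ("i", "y"), ("d", "d")],
    [("h", "H"), ("a", "A"), ("m", "m"), ("i", "y"), ("d", "d")],
]


def select_ngrams(aligned, min_count=2):
    """Stage 1: pick multi-letter source runs that occur often enough."""
    counts = Counter(
        run for name in aligned for run, _ in name if len(run) > 1
    )
    return {run for run, c in counts.items() if c >= min_count}


def train_model(aligned, selected):
    """Stage 2: estimate P(target | source segment) by relative frequency."""
    counts = defaultdict(Counter)
    for name in aligned:
        for run, tgt in name:
            # Keep a run whole only if it is a single letter or was selected;
            # otherwise fall back (crudely) to its individual letters.
            if len(run) == 1 or run in selected:
                segments = [run]
            else:
                segments = list(run)
            for seg in segments:
                counts[seg][tgt] += 1
    return {
        seg: {t: c / sum(tc.values()) for t, c in tc.items()}
        for seg, tc in counts.items()
    }


def transliterate(word, model):
    """Greedy longest-match segmentation, then most probable target symbol."""
    max_len = max(len(seg) for seg in model)
    out, i = [], 0
    while i < len(word):
        for n in range(min(max_len, len(word) - i), 0, -1):
            seg = word[i:i + n]
            if seg in model:
                out.append(max(model[seg], key=model[seg].get))
                i += n
                break
        else:
            i += 1  # letter never seen in training: skip it
    return "".join(out)


selected = select_ngrams(ALIGNED_NAMES)
model = train_model(ALIGNED_NAMES, selected)
```

With this toy data, stage 1 selects only the segment `sh`, so `transliterate("shamid", model)` treats `sh` as a single unit instead of translating `s` and `h` separately. A real system would need a proper alignment step and a far larger set of name pairs.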

Keywords:
Transliteration; Computer science; Natural language processing; Artificial intelligence; Language model; Cross-language information retrieval; Task (project management); Set (abstract data type); Heuristics; Machine translation; Vocabulary; Test set; Word (group theory); Speech recognition; Linguistics; Programming language

Metrics

Cited By: 136
FWCI (Field Weighted Citation Impact): 6.52
Refs: 14
Citation Normalized Percentile: 0.97

Topics

Natural Language Processing Techniques
Physical Sciences → Computer Science → Artificial Intelligence
Topic Modeling
Physical Sciences → Computer Science → Artificial Intelligence
Biomedical Text Mining and Ontologies
Life Sciences → Biochemistry, Genetics and Molecular Biology → Molecular Biology

Related Documents

JOURNAL ARTICLE

Arabic Cross-Language Information Retrieval

Bilel Elayeb, Ibrahim Bounhas

Journal: ACM Transactions on Asian and Low-Resource Language Information Processing, Year: 2016, Vol: 15 (3), Pages: 1-44

BOOK-CHAPTER

Information Retrieval Based on Telugu Cross-Language Transliteration

N. Swapna, Vijaya Kumar Koppula, G. Suryanarayana

Advances in Intelligent Systems and Computing, Year: 2021, Pages: 343-350

JOURNAL ARTICLE

Proper nouns in English–Arabic cross language information retrieval

Abdelghani Bellaachia, Ghita Amor‐Tijani

Journal: Journal of the American Society for Information Science and Technology, Year: 2008, Vol: 59 (12), Pages: 1925-1932