In-domain texts for estimating statistical language models are not easily found for most languages of the world. We present two techniques to take advantage of in-domain text resources in other languages. First, we extend the notion of <i>lexical triggers</i>, which have been used monolingually for language model adaptation, to the cross-lingual problem, permitting the construction of sharper language models for a target-language document by drawing statistics from related documents in a resource-rich language. Next, we show that <i>cross-lingual latent semantic analysis</i> is similarly capable of extracting useful statistics for language modeling. Neither technique requires explicit translation capabilities between the two languages! We demonstrate significant reductions in both perplexity and word error rate on a Mandarin speech recognition task by using these techniques.
Yu WengWenbin HeJun DongChaomurilige ChaomuriligeXuan LiuZheng Liu