SMS and Twitter messages contain many abbreviations and non-standard words, which pose problems for text-to-speech (TTS) and other language processing techniques applied to these data. A character-based machine translation (MT) approach was previously used to normalize non-standard words. In this paper, we propose a two-stage translation method that leverages phonetic information: non-standard words are first translated to possible pronunciations, which are then translated to standard words. We further combine this method with the single-step character-based translation module. Our experiments show that the proposed method significantly outperforms previous results in both n-best coverage and 1-best accuracy.
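The data flow of the two-stage method and its combination with the character-based module can be illustrated with a minimal sketch. This is not the paper's implementation: the lookup tables below are hypothetical toy stand-ins for trained translation models, the phone strings and scores are invented for illustration, and the linear interpolation in `combine` (with its `weight` parameter) is an assumed combination scheme, since the abstract does not specify one.

```python
from collections import defaultdict

# Hypothetical toy tables standing in for trained translation models.
# All entries and scores are invented for illustration only.
WORD_TO_PHONES = {  # stage 1: non-standard word -> candidate pronunciations
    "2moro": [("T AH M AA R OW", 0.9), ("T UW M AO R OW", 0.1)],
}
PHONES_TO_WORD = {  # stage 2: pronunciation -> candidate standard words
    "T AH M AA R OW": [("tomorrow", 0.95), ("tomorow", 0.05)],
    "T UW M AO R OW": [("tomorrow", 0.6), ("two more", 0.4)],
}
CHAR_BASED = {  # single-step character-based module: word -> standard words
    "2moro": [("tomorrow", 0.7), ("2 more", 0.3)],
}


def two_stage(word):
    """Score candidates by summing over all intermediate pronunciations."""
    scores = defaultdict(float)
    for phones, p_phones in WORD_TO_PHONES.get(word, []):
        for candidate, p_word in PHONES_TO_WORD.get(phones, []):
            scores[candidate] += p_phones * p_word
    return dict(scores)


def combine(word, weight=0.5):
    """Interpolate two-stage and character-based scores (assumed scheme)."""
    scores = defaultdict(float)
    for candidate, score in two_stage(word).items():
        scores[candidate] += weight * score
    for candidate, score in CHAR_BASED.get(word, []):
        scores[candidate] += (1.0 - weight) * score
    # Highest-scoring candidate first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def normalize(word, n=3):
    """Return the n-best candidates, or the word itself if none exist."""
    ranked = combine(word)
    return [candidate for candidate, _ in ranked[:n]] or [word]


if __name__ == "__main__":
    print(normalize("2moro"))
    print(normalize("hello"))
```

In this sketch, the first element of the list returned by `normalize` corresponds to the 1-best hypothesis and the full list to the n-best candidates, the two quantities the evaluation measures.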
Tim Schlippe, Chenfei Zhu, Daniel Lemcke, Tanja Schultz
Muhmad Amin Hakim bin Sazali, Norisma Idris
Lisa Beinborn, Torsten Zesch, Iryna Gurevych