Tone plays an important role in distinguishing lexical meaning in tonal languages such as Mandarin and Thai. Tone information has been shown to improve automatic speech recognition (ASR) for these languages. In this study, we incorporate tone features derived from the fundamental frequency (F0) and the fundamental frequency variation (FFV) into the convolutional neural network (CNN), a state-of-the-art approach to acoustic modeling in ASR systems. Because the CNN can reduce spectral variations and model the spectral correlations present in speech signals, it is expected to model tone patterns well, since these patterns manifest mainly in the frequency domain as the F0 contour. We conduct ASR experiments on Mandarin and Thai to evaluate the effectiveness of the proposed approaches. With the help of tone features, the character error rates (CERs) for Mandarin show relative reductions of 4.3-7.1%, and the word error rates (WERs) for Thai show relative reductions of 0.41-6.26%. The CNN clearly outperforms the deep neural network (DNN), with relative CER reductions of 5.4-13.1% for Mandarin and relative WER reductions of 0.5-5.6% for Thai.
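The relative reductions quoted above follow the usual definition: the drop in error rate divided by the baseline error rate. The following is a minimal sketch of that arithmetic; the function name `relative_reduction` and the example error rates are hypothetical and are not taken from the experiments reported here.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Return the relative error reduction, in percent.

    Both arguments are error rates (CER or WER) on the same scale,
    for example both in percent.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical values for illustration only (not results from this study):
# a baseline error rate of 20.0% lowered to 19.0% by adding tone features.
example = relative_reduction(20.0, 19.0)
print(f"relative reduction: {example:.1f}%")
```

A relative reduction is scale-free, which is why it can be compared across Mandarin CERs and Thai WERs even though the absolute error rates differ.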
Xiangli Wang, Kenji Hirose, Jinkai Zhang, Nobuaki Minematsu
Niyada Rukwong, Sunee Pongpinigpinyo
Ala Saleh Alluhaidan, Oumaima Saidani, Rashid Jahangir, Muhammad Asif Nauman, Omnia Saidani Neffati