Hanna Silén, Elina Helander, Jani Nurminen, Moncef Gabbouj
Appropriate phoneme durations are essential for high-quality speech synthesis. In hidden Markov model-based text-to-speech (HMM-TTS), durations are typically modeled statistically using state duration probability distributions and duration prediction for unseen contexts. The use of rich context features enables synthesis without high-level linguistic knowledge. In this paper, we compare the accuracy of state duration modeling with that of phone duration modeling using simple prediction techniques. In addition to decision tree-based techniques, regression techniques for rich context features with high collinearity are discussed and evaluated.
Takayoshi Yoshimura, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, Tadashi Kitamura
Heiga Zen, Takashi Masuko, Keiichi Tokuda, Tetsuhiko Yoshimura, Takao Kobayashi, Takashi Kitamura
Heng Lu, Yi-Jian Wu, Keiichi Tokuda, Li-Rong Dai, Ren-Hua Wang
Tuomo Raitio, Antti Suni, Martti Vainio, Paavo Alku
Yang Wang, Minghao Yang, Zhengqi Wen, Jianhua Tao