NAM-to-speech conversion, proposed by Toda and colleagues, converts Non-Audible Murmur (NAM) to audible speech through a statistical mapping trained on aligned corpora. It is a very promising technique, but its performance is still insufficient, mainly because of the difficulty of estimating the F0 of the transformed voice from unvoiced speech. In this paper, we propose a method to improve F0 estimation and voicing decision in a NAM-to-speech conversion system based on Gaussian Mixture Models (GMM) applied to whispered speech. Instead of combining voicing decision and F0 estimation in a single GMM, a simple feed-forward neural network detects voiced segments in the whisper, while a GMM estimates a continuous melodic contour from the voiced segments of the training data. The error rate of the network's voiced/unvoiced decision is 6.8%, compared to 9.2% with the original system. The proposal also reduces the F0 estimation error.
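The separation of voicing decision and F0 estimation described above can be illustrated with a minimal sketch. It is not the system evaluated in this paper: the function names, the single hidden layer, the 0.5 decision threshold, the use of log-F0 as the GMM target, and the frame-by-frame minimum mean-square error (conditional mean) mapping are all assumptions made for illustration. A small feed-forward network gates each frame as voiced or unvoiced, and a joint GMM over input features and log-F0 supplies the F0 value for voiced frames.

```python
import numpy as np

def voicing_probability(x, w1, b1, w2, b2):
    """One-hidden-layer feed-forward net (hypothetical weights) -> P(voiced)."""
    x = np.asarray(x, dtype=float)
    hidden = np.tanh(w1 @ x + b1)
    return 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))

def gmm_conditional_mean(x, weights, means, covs, dx):
    """E[y | x] under a joint GMM over z = [x, y]; x has dimension dx."""
    x = np.asarray(x, dtype=float)
    log_resp, cond_means = [], []
    for w, mu, cov in zip(weights, means, covs):
        mu_x, mu_y = mu[:dx], mu[dx:]
        s_xx, s_yx = cov[:dx, :dx], cov[dx:, :dx]
        diff = x - mu_x
        sol = np.linalg.solve(s_xx, diff)          # s_xx^{-1} (x - mu_x)
        _, logdet = np.linalg.slogdet(s_xx)
        # Log of w * N(x; mu_x, s_xx), used for the component posterior.
        log_resp.append(np.log(w) - 0.5 * (diff @ sol + logdet
                                           + dx * np.log(2.0 * np.pi)))
        # Per-component conditional mean of y given x.
        cond_means.append(mu_y + s_yx @ sol)
    log_resp = np.array(log_resp)
    resp = np.exp(log_resp - log_resp.max())       # stable normalisation
    resp /= resp.sum()
    return resp @ np.array(cond_means)

def convert_frame(x, net, gmm, dx):
    """Return F0 in Hz for a voiced frame, or 0.0 for an unvoiced frame."""
    if voicing_probability(x, *net) < 0.5:         # assumed threshold
        return 0.0
    log_f0 = gmm_conditional_mean(x, *gmm, dx)
    return float(np.exp(log_f0[0]))
```

Here `net` is a tuple `(w1, b1, w2, b2)` and `gmm` is a tuple `(weights, means, covs)`. In this sketch the GMM would be fitted on voiced training frames only, so that it models a continuous contour, while the network alone decides where that contour is actually used.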
Patrícia Cristina Ramalho de Oliveira
Yana D. Gilichinskaya, Winifred Strange
Luís M. T. Jesus, Sara Castilho, Aníbal Ferreira, Maria da Conceição Costa