Yixuan Zhang, Heming Wang, DeLiang Wang
Estimating fundamental frequency (F0) from an audio signal is a necessary step in many tasks such as speech synthesis and speech analysis. Although high estimation accuracy has been achieved for clean speech, F0 estimation remains challenging for noisy speech, mainly because noise corrupts the harmonic structure. In this paper, we view F0 estimation as a multi-class classification problem and train a frequency-domain densely-connected convolutional recurrent network (DC-CRN) to estimate F0 from noisy speech. The proposed model significantly outperforms baseline methods in terms of detection rate. We find that using the complex short-time Fourier transform (STFT) as input yields better performance than using the magnitude STFT. Furthermore, we explore improving F0 estimation with speech enhancement. Although the F0 estimation model trained on clean speech performs well on enhanced speech, the distortion introduced by the speech enhancement model limits the estimation performance. We propose a cascade model consisting of two modules that optimize the enhanced speech and the estimated F0 in turn. Experimental results show that the cascade model further improves on the DC-CRN model, especially in low signal-to-noise ratio (SNR) conditions.
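To make the multi-class formulation concrete, the following is a minimal sketch of how per-frame F0 values could be turned into class labels and back. It is not the authors' implementation: the F0 range, the number of bins, the log-frequency spacing, and the reserved unvoiced class are all assumptions chosen for illustration, and the helper names `f0_to_class` and `class_to_f0` are hypothetical.

```python
import numpy as np

# Hypothetical settings: the abstract does not state the F0 range or bin count.
F0_MIN_HZ = 60.0
F0_MAX_HZ = 400.0
NUM_PITCH_BINS = 64  # voiced classes 1..64; class 0 is assumed to mean "unvoiced"


def f0_to_class(f0_hz):
    """Map per-frame F0 values in Hz (0 = unvoiced) to integer class labels."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    labels = np.zeros(f0_hz.shape, dtype=int)
    voiced = f0_hz > 0
    # Values outside the assumed range are clipped to the nearest edge bin.
    clipped = np.clip(f0_hz[voiced], F0_MIN_HZ, F0_MAX_HZ)
    # Position on a log-frequency axis, scaled to [0, 1].
    pos = np.log(clipped / F0_MIN_HZ) / np.log(F0_MAX_HZ / F0_MIN_HZ)
    bins = np.minimum((pos * NUM_PITCH_BINS).astype(int), NUM_PITCH_BINS - 1)
    labels[voiced] = 1 + bins
    return labels


def class_to_f0(labels):
    """Map class labels back to the log-scale centre of each bin (0 for unvoiced)."""
    labels = np.asarray(labels, dtype=int)
    centre = (labels - 0.5) / NUM_PITCH_BINS
    f0 = F0_MIN_HZ * (F0_MAX_HZ / F0_MIN_HZ) ** centre
    return np.where(labels > 0, f0, 0.0)
```

With labels of this kind, a network can be trained with a standard cross-entropy loss over the classes for each frame, and the predicted class is decoded back to a frequency with `class_to_f0`. Under these assumed settings the quantization error of a voiced frame is at most about 1.5% of its F0.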
Fahimeh Fooladgar, Shohreh Kasaei
Yaxing Li, Xiaoqi Li, Yuanjie Dong, Meng Li, Shan Xu, Shengwu Xiong
Jiangyu Han, Yanhua Long, Lukáš Burget, Jaň Černocký