An alternative approach to speech denoising is proposed that uses generative diffusion models, which learn the distribution of the training data. In recent years, such models have produced promising results in generating signals of various kinds, and they are superior in many respects to earlier generative models such as variational autoencoders. However, diffusion models have not yet found wide application in speech denoising. A new diffusion model, implemented with a deep neural network, is presented for denoising real speech signals. Our own data set, comprising more than 150 h of clean speech in Russian, has been created. The results, evaluated with the scale-invariant signal-to-distortion ratio (SI-SDR) and perceptual evaluation of speech quality (PESQ) metrics, are comparable or superior to those of the best discriminative models.
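For readers unfamiliar with the first of the two metrics, the following is a minimal sketch of the scale-invariant signal-to-distortion ratio in its commonly used form. It is not taken from the paper: the function name `si_sdr`, the mean removal step, and the small stabilizing constant `eps` are assumptions made for illustration, and the authors' evaluation code may differ.

```python
import math
from typing import Sequence


def si_sdr(estimate: Sequence[float], reference: Sequence[float],
           eps: float = 1e-12) -> float:
    """Scale-invariant signal-to-distortion ratio in dB (illustrative sketch)."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)

    # Assumption: both signals are made zero-mean before comparison.
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]

    # Project the estimate onto the reference to get the optimally scaled target.
    dot = sum(a * b for a, b in zip(e, r))
    ref_energy = sum(b * b for b in r)
    alpha = dot / (ref_energy + eps)
    target = [alpha * b for b in r]

    # Everything in the estimate that is not explained by the target is distortion.
    residual = [a - t for a, t in zip(e, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)

    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))
```

Because the reference is rescaled by the projection coefficient before the ratio is taken, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is why this metric is preferred over plain SDR when the output level of a denoiser is arbitrary.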
Berné Nortier, Mostafa Sadeghi, Romain Serizel
Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Timo Gerkmann
Xiaoyu Lin, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda