Speech enhancement and source separation are related tasks that aim to extract or improve a signal of interest from a recording that may contain sounds from multiple sources, reverberation, or degraded capture quality. Since the Wave-U-Net is an end-to-end deep learning architecture that has achieved strong results on the source separation task operating in the time domain, this thesis studies its performance on the speech enhancement task in terms of denoising, dereverberation, decoloration, and bandwidth extension. The experiments were conducted on a combination of a noisy version of the Voice Bank Corpus (VCTK) and the Device and Produced Speech (DAPS) dataset. In addition to the original framework, variations inspired by relevant deep learning networks for speech enhancement were explored, of which losses with spectral components had the most favorable effects on the improvement of low-quality speech signals. Concatenating the input audio with a noise vector in the network was also shown to generate more coherent high-frequency content in the output signal.
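The noise-vector concatenation mentioned above can be illustrated with a minimal sketch: the noise is appended as an extra input channel alongside the mono waveform before the network's first layer. This is only an assumed realization for illustration; the function name, shapes, and the choice of Gaussian noise are illustrative and not taken from the thesis itself.

```python
import numpy as np

def concat_noise_channel(audio, seed=None):
    """Append a Gaussian noise channel to a batch of mono waveforms.

    audio: array of shape (batch, 1, samples).
    Returns an array of shape (batch, 2, samples): the original audio
    plus a noise channel the network can draw on to synthesize
    high-frequency content. (Illustrative sketch, not the thesis code.)
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(audio.shape).astype(audio.dtype)
    return np.concatenate([audio, noise], axis=1)

# Example: a batch of 4 one-second-ish mono excerpts.
batch = np.zeros((4, 1, 16384), dtype=np.float32)
x = concat_noise_channel(batch, seed=0)
print(x.shape)  # (4, 2, 16384)
```

The first convolutional layer of the network would then simply accept two input channels instead of one; the rest of the architecture is unchanged.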