For single-channel speech enhancement, both time-domain and time-frequency (T-F) domain methods have their respective pros and cons. In this paper, we present a cross-domain framework named TFT-Net, which takes a T-F spectrogram as input and produces a time-domain waveform as output. Such a framework takes advantage of the knowledge we have about spectrograms and avoids some of the drawbacks that T-F-domain methods have been suffering from. In TFT-Net, we design a dual-path attention block (DAB) to fully exploit correlations along the time and frequency axes. We further find that a sample-independent DAB (SDAB) achieves a good tradeoff between enhanced speech quality and complexity. Ablation studies show that both the cross-domain design and the SDAB block bring large performance gains. When logarithmic MSE is used as the training criterion, TFT-Net achieves the highest SDR and SSNR among state-of-the-art methods on two major speech enhancement benchmarks.
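As a rough illustration of the training criterion and one of the reported metrics, the sketch below computes a logarithmic MSE and a plain SDR on time-domain waveforms. The abstract does not give the exact formulas, so the function names `log_mse` and `sdr_db`, the `eps` stabilizer, and the specific form of each quantity are assumptions made for illustration only, not the definitions used by TFT-Net.

```python
import numpy as np

def log_mse(estimate, target, eps=1e-8):
    """Natural log of the mean squared error between two waveforms.

    Assumed form: the abstract only names the criterion, not its formula.
    """
    err = np.mean((estimate - target) ** 2)
    return np.log(err + eps)

def sdr_db(estimate, target, eps=1e-8):
    """Simple energy-ratio SDR in dB (no projection or scale adjustment).

    Benchmark toolkits may compute SDR differently; this is a simplification.
    """
    num = np.sum(target ** 2)
    den = np.sum((target - estimate) ** 2)
    return 10.0 * np.log10((num + eps) / (den + eps))

# Toy signals: a 440 Hz tone sampled at 16 kHz with two noise levels.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = clean + 0.1 * rng.standard_normal(clean.shape)
worse = clean + 0.5 * rng.standard_normal(clean.shape)

print("log-MSE (light noise):", log_mse(noisy, clean))
print("log-MSE (heavy noise):", log_mse(worse, clean))
print("SDR dB  (light noise):", sdr_db(noisy, clean))
print("SDR dB  (heavy noise):", sdr_db(worse, clean))
```

In this toy setting the two quantities move in opposite directions: the estimate closer to the clean waveform has the lower log-MSE and the higher SDR.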
Wenbo Zhang, Xuefeng Xie, Yanling Du, Dongmei Huang
Gongzhen Zou, Jun Du, Shutong Niu, Hang Chen, Yuling Ren, Qinglong Li, Ruibo Liu, Chin‐Hui Lee
Charturong Tantibundhit, Franz Pernkopf, Gernot Kubin
YIN Wen-bing, GAO Ge, ZENG Bang, WANG Xiao, CHEN Yi