Both spatial and tempo-spectral information are essential for multi-channel speech enhancement, a field that has attracted growing interest in recent years. While many studies improve feature extraction through specialized network architectures, they often prioritize raw feature learning without fully addressing how the extracted features can be used effectively. In this work, we focus on the features after extraction and introduce a Channel-Time-Frequency Attention (CTFA) module that allocates weights to the extracted features, aiming to enhance feature utilization and enable the model to focus more effectively on informative features. The CTFA module consists of three parallel attention branches (channel, time, and frequency) that jointly refine both spatial and tempo-spectral features. By assigning greater weight to effective features, it facilitates feature reuse and improves the model's robustness. We incorporate the CTFA module into our previously proposed model and conduct an ablation study to evaluate its contribution. Extensive experimental results confirm the efficacy of the CTFA module, with the proposed method outperforming state-of-the-art baselines.
Shiyun Xu, Yunhe Cao, Zehua Zhang, Mingjiang Wang
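As a rough illustration of the three-branch design described in the abstract, the following is a minimal PyTorch sketch of a channel-time-frequency attention block. The tensor layout (batch, channel, time, frequency), the squeeze-excitation-style gating inside each branch, and the averaging of the three branch outputs are assumptions made for illustration only; the abstract does not specify the module's internals.

```python
# A minimal sketch of a channel-time-frequency attention block, assuming a
# (batch, channel, time, frequency) feature tensor and squeeze-excitation-style
# gating per branch. The actual CTFA internals are not given in the abstract.
import torch
import torch.nn as nn


class AxisAttention(nn.Module):
    """Gates features along one axis (channel=1, time=2, or frequency=3)."""

    def __init__(self, size: int, axis: int, reduction: int = 4):
        super().__init__()
        self.axis = axis
        hidden = max(size // reduction, 1)
        self.mlp = nn.Sequential(
            nn.Linear(size, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, size),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Average over every axis except batch and the attended one.
        dims = [d for d in (1, 2, 3) if d != self.axis]
        pooled = x.mean(dim=dims)            # (B, size)
        weights = self.mlp(pooled)           # (B, size), values in (0, 1)
        shape = [1, 1, 1, 1]
        shape[0], shape[self.axis] = x.size(0), x.size(self.axis)
        return x * weights.view(*shape)      # broadcast re-weighting


class CTFA(nn.Module):
    """Three parallel attention branches whose re-weighted outputs are averaged."""

    def __init__(self, channels: int, time_steps: int, freq_bins: int):
        super().__init__()
        self.channel_branch = AxisAttention(channels, axis=1)
        self.time_branch = AxisAttention(time_steps, axis=2)
        self.freq_branch = AxisAttention(freq_bins, axis=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (self.channel_branch(x) + self.time_branch(x) + self.freq_branch(x)) / 3


# Example: refine a (batch=2, channels=16, frames=100, bins=257) feature map.
feats = torch.randn(2, 16, 100, 257)
refined = CTFA(channels=16, time_steps=100, freq_bins=257)(feats)
print(refined.shape)  # torch.Size([2, 16, 100, 257])
```

In this sketch each branch pools over the two non-attended axes, so only weights along the attended dimension are learned; the time and frequency branches as written assume fixed frame and bin counts.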