Low Complexity Speech Enhancement Network Based on Frame-Level Swin Transformer

Weiqi Jiang; Chengli Sun; Feilong Chen; Yan Leng; Guo Q; Jiayi Sun; Jiankun Peng

doi:10.3390/electronics12061330

JOURNAL ARTICLE

Year: 2023; Journal: Electronics; Vol: 12 (6); Pages: 1330; Publisher: Multidisciplinary Digital Publishing Institute

Abstract

In recent years, Transformer models have performed strongly in speech enhancement, using multi-head self-attention to capture long-term dependencies effectively. However, the computational cost of self-attention grows quadratically with the size of the input speech spectrogram, which makes it expensive for practical use. In this paper, we propose a low-complexity hierarchical frame-level Swin Transformer network (FLSTN) for speech enhancement. FLSTN treats several consecutive frames as a local window and restricts self-attention to that window, reducing the complexity to linear in the spectrogram size. A shifted window mechanism improves information exchange between adjacent windows, so that window-based local attention effectively acts as global attention. The hierarchical structure allows FLSTN to learn speech features at different scales. Moreover, we design a band merging layer and a band expanding layer to decrease and increase the spatial resolution of the feature maps, respectively. We tested FLSTN on both 16 kHz wide-band speech and 48 kHz full-band speech. Experimental results demonstrate that FLSTN handles speech with different bandwidths well. With very few multiply–accumulate operations (MACs), FLSTN not only has a significant advantage in computational complexity but also achieves objective speech quality metrics comparable to those of current state-of-the-art (SOTA) models.
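
The window idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' FLSTN implementation: it assumes each frame is one token, uses a single head with identity query/key/value projections, and omits the attention mask that Swin-style models normally apply to windows that wrap around after the cyclic shift. The function names (`frame_window_attention`, `attention_score_macs`) and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over the last axis."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)


def frame_window_attention(x, window, shift=0):
    """Self-attention restricted to windows of consecutive frames.

    x      : array of shape (frames, channels), one token per frame
    window : number of consecutive frames per local window
    shift  : cyclic shift along time (0 = regular windows)

    Assumption: identity Q/K/V projections and no wrap-around mask,
    which a real Swin-style layer would include.
    """
    frames, channels = x.shape
    assert frames % window == 0, "pad so that frames is a multiple of window"
    shifted = np.roll(x, -shift, axis=0)                    # shifted windows
    w = shifted.reshape(frames // window, window, channels)
    scores = w @ w.transpose(0, 2, 1) / np.sqrt(channels)   # (nw, window, window)
    out = softmax(scores) @ w                               # attend inside each window
    return np.roll(out.reshape(frames, channels), shift, axis=0)


def attention_score_macs(frames, channels, window=None):
    """MACs for the QK^T score matrix only (projections ignored)."""
    if window is None:
        return frames * frames * channels   # global: quadratic in frames
    return frames * window * channels       # windowed: linear in frames


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
y_regular = frame_window_attention(x, window=4)
y_shifted = frame_window_attention(x, window=4, shift=2)
```

With a fixed window length, the score computation costs `frames * window * channels` MACs instead of `frames * frames * channels`, which is the linear-versus-quadratic scaling the abstract refers to. Alternating regular and shifted windows across layers lets information cross window boundaries.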

Keywords:

Computer science; Spectrogram; Transformer; Speech enhancement; Computational complexity theory; Speech recognition; Computation; Artificial intelligence; Algorithm; Voltage; Engineering

Metrics

FWCI (Field Weighted Citation Impact): 2.68

Citation Normalized Percentile: 0.88

Topics

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Advanced Adaptive Filtering Techniques

Physical Sciences → Engineering → Computational Mechanics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Swin Transformer based Unsupervised Network for Low-Light Image Enhancement

Speech Semantic Communication Based on Swin Transformer

Speech Emotion Recognition Based on Swin-Transformer

FEA-Swin: Foreground Enhancement Attention Swin Transformer Network for Accurate UAV-Based Dense Object Detection

Swin Transformer Based on Image Enhancement Algorithm