Keyword spotting (KWS) is challenging, particularly in high-noise environments. Traditional denoising algorithms that rely solely on speech have difficulty recovering speech that has been severely corrupted by noise. In this work, we develop an adaptive text-informed denoising model to support reliable keyword spotting under heavy noise degradation. The proposed TE-KWS uses a three-branch structure: the speech branch (SB) takes noisy speech as input and provides the raw speech information; the alignment branch (AB) takes aligned text as input, which enables accurate restoration of the corresponding speech when text with alignment is available; and the text branch (TB) takes unaligned text, which prompts the model to learn the alignment between speech and text on its own. To make the denoising model more useful for KWS, after the whole model is trained, the alignment branch (AB) is frozen and the model is fine-tuned by leveraging its speech restoration and forced alignment capabilities. The input to the text branch (TB) is then replaced with the designated keywords, and a heavier denoising penalty is applied to the keyword segments, explicitly strengthening the model's ability to restore keyword speech. Finally, Combined Adversarial Domain Adaptation (CADA) is applied to make KWS robust to data both before and after speech enhancement (SE). Experimental results indicate that our approach not only markedly improves highly corrupted speech and achieves state-of-the-art (SOTA) performance on marginally corrupted speech, but also improves the performance and generalizability of prevailing mainstream KWS models.
Shubo Lv, Xiong Wang, Sining Sun, Long Ma, Lei Xie
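
The heavier denoising penalty on keyword segments described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss, so everything here is an assumption made for illustration: the frame-wise mean squared error, the function name `keyword_weighted_mse`, the half-open `(start, end)` frame spans, and the default `keyword_weight` of 2.0 are all hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def keyword_weighted_mse(
    enhanced: Sequence[Sequence[float]],
    clean: Sequence[Sequence[float]],
    keyword_spans: Sequence[Tuple[int, int]],
    keyword_weight: float = 2.0,  # hypothetical value, not from the paper
) -> float:
    """Weighted average of per-frame MSE between enhanced and clean features.

    Frames inside any keyword span (half-open [start, end), in frame indices)
    are weighted by `keyword_weight`; all other frames have weight 1.0.
    """
    if len(enhanced) != len(clean):
        raise ValueError("enhanced and clean must have the same number of frames")
    if not clean:
        raise ValueError("at least one frame is required")

    # Start from uniform weights, then raise the weight on keyword frames.
    weights = [1.0] * len(clean)
    for start, end in keyword_spans:
        for t in range(max(start, 0), min(end, len(clean))):
            weights[t] = keyword_weight

    total = 0.0
    for w, e_frame, c_frame in zip(weights, enhanced, clean):
        # Mean squared error over the feature dimension of one frame.
        frame_err = sum((e - c) ** 2 for e, c in zip(e_frame, c_frame)) / len(c_frame)
        total += w * frame_err

    # Normalise by the total weight so the loss scale stays comparable.
    return total / sum(weights)


# Toy example: three one-dimensional frames, keyword occupies frame 1.
enhanced = [[1.0], [1.0], [1.0]]
clean = [[0.0], [0.0], [1.0]]
uniform_loss = keyword_weighted_mse(enhanced, clean, keyword_spans=[])
keyword_loss = keyword_weighted_mse(enhanced, clean, keyword_spans=[(1, 2)], keyword_weight=3.0)
```

In this toy case the error on the keyword frame counts three times as much as the error on the other frames, so the keyword-weighted loss is larger than the uniform one. In an actual system the spans would come from the model's alignment between speech and keyword text rather than being supplied by hand.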