JOURNAL ARTICLE

Audio–visual keyword transformer for unconstrained sentence‐level keyword spotting

Yidi LiJiale RenYawei WangGuoquan WangXia LiHong Liu

Year: 2023 Journal:   CAAI Transactions on Intelligence Technology Vol: 9 (1)Pages: 142-152   Publisher: Institution of Engineering and Technology

Abstract

Abstract As one of the most effective methods to improve the accuracy and robustness of speech tasks, the audio–visual fusion approach has recently been introduced into the field of Keyword Spotting (KWS). However, existing audio–visual keyword spotting models are limited to detecting isolated words, while keyword spotting for unconstrained speech is still a challenging problem. To this end, an Audio–Visual Keyword Transformer (AVKT) network is proposed to spot keywords in unconstrained video clips. The authors present a transformer classifier with learnable CLS tokens to extract distinctive keyword features from the variable‐length audio and visual inputs. The outputs of audio and visual branches are combined in a decision fusion module. As humans can easily notice whether a keyword appears in a sentence or not, our AVKT network can detect whether a video clip with a spoken sentence contains a pre‐specified keyword. Moreover, the position of the keyword is localised in the attention map without additional position labels. Experimental results on the LRS2‐KWS dataset and our newly collected PKU‐KWS dataset show that the accuracy of AVKT exceeded 99% in clean scenes and 85% in extremely noisy conditions. The code is available at https://github.com/jialeren/AVKT .

Keywords:
Keyword spotting Computer science Speech recognition Transformer Sentence Artificial intelligence Robustness (evolution) Audio visual Natural language processing Multimedia

Metrics

11
Cited By
2.95
FWCI (Field Weighted Citation Impact)
54
Refs
0.89
Citation Normalized Percentile
Is in top 1%
Is in top 10%

Citation History

Topics

Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Video Analysis and Summarization
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
© 2026 ScienceGate Book Chapters — All rights reserved.