JOURNAL ARTICLE

Audio-Visual Keyword Spotting Based on Multidimensional Convolutional Neural Network

Abstract

The fusion of audio and visual information is one of the most promising solutions for reliable keyword spotting (KWS), particularly when the audio is corrupted by noise. KWS aims to detect a specific word in an audio stream, a task that remains challenging in noisy environments. In this paper, an audio-visual neural network based on a multidimensional convolutional neural network (MCNN) is proposed to perform audio-visual KWS. First, the log mel-spectrogram and the lip area sequence are extracted from the audio and visual streams, respectively, and serve as the inputs to the audio-visual neural network. Then, the MCNN-based network, consisting of a 2D CNN and a 3D CNN, models the time-frequency features of the log mel-spectrogram and the spatiotemporal features of the lip area sequence, respectively. Finally, the outputs of the audio and visual networks are combined for KWS through decision fusion. Experimental results on the PKU-AV database under complex acoustic conditions demonstrate that the proposed method achieves better performance than other state-of-the-art methods.
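
The final decision-fusion step can be illustrated with a minimal sketch. The abstract does not state which fusion rule is used, so the weighted sum of per-stream posteriors below is an assumption, chosen because it is a common form of decision fusion. The function names (`softmax`, `fuse_decisions`, `spot_keyword`) and the `audio_weight` parameter are hypothetical and do not come from the paper.

```python
import math


def softmax(logits):
    """Convert raw network scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_decisions(audio_logits, visual_logits, audio_weight=0.5):
    """Combine the two streams at the decision level.

    Assumption: a weighted sum of the per-stream posteriors. The paper's
    actual fusion rule and weighting are not given in the abstract.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if len(audio_logits) != len(visual_logits):
        raise ValueError("both streams must score the same set of classes")
    p_audio = softmax(audio_logits)
    p_visual = softmax(visual_logits)
    return [
        audio_weight * a + (1.0 - audio_weight) * v
        for a, v in zip(p_audio, p_visual)
    ]


def spot_keyword(audio_logits, visual_logits, audio_weight=0.5):
    """Return (class index, fused probability) of the most likely class."""
    fused = fuse_decisions(audio_logits, visual_logits, audio_weight)
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused[best]


if __name__ == "__main__":
    # Hypothetical scores for two classes: 0 = non-keyword, 1 = keyword.
    noisy_audio = [0.1, 0.0]   # audio stream is nearly undecided
    clear_visual = [-1.0, 2.0]  # visual stream favours the keyword
    print(spot_keyword(noisy_audio, clear_visual, audio_weight=0.3))
```

Lowering `audio_weight` shifts trust toward the visual stream, which is one plausible way to handle noisy audio. How the weight should be chosen, for example from an estimated noise level, is left open here.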

Keywords:
Computer science; Spectrogram; Audio visual; Keyword spotting; Convolutional neural network; Artificial intelligence; Speech recognition; Artificial neural network; Pattern recognition (psychology); Feature (linguistics); Visualization; Feature extraction; Spotting; Noise (video); Image (mathematics); Multimedia

Metrics

Cited By: 25
FWCI (Field Weighted Citation Impact): 1.63
Refs: 21
Citation Normalized Percentile: 0.84

Topics

Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence