Audio-Visual Keyword Spotting Based on Multidimensional Convolutional Neural Network

Runwei Ding; Cheng Pang; Hong Liu

doi:10.1109/icip.2018.8451096

JOURNAL ARTICLE

Year: 2018 Pages: 4138-4142

Abstract

Fusing audio and visual information is one of the most promising approaches to reliable keyword spotting (KWS), particularly when the audio is corrupted by noise. KWS aims to detect a specific word in an audio stream, a task that remains challenging in noisy environments. This paper proposes an audio-visual neural network based on a multidimensional convolutional neural network (MCNN) for audio-visual KWS. First, the log mel-spectrogram is extracted from the audio stream and the lip-area sequence from the visual stream, and both serve as input to the network. Then the MCNN, which consists of a 2D CNN and a 3D CNN, models the time-frequency features of the log mel-spectrogram and the spatiotemporal features of the lip-area sequence, respectively. Finally, the outputs of the audio and visual networks are combined through decision fusion to perform KWS. Experimental results on the PKU-AV database under complex acoustic conditions show that the proposed method compares favorably with other state-of-the-art methods.
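The final decision-fusion step can be illustrated with a minimal sketch. The abstract does not state which fusion rule the authors use, so the weighted average of per-stream posteriors below is an assumption, and the names `fuse_decisions`, `spot_keyword`, `audio_weight`, and the two-class label set are hypothetical. The logits stand in for the outputs of the audio (2D CNN) and visual (3D CNN) networks, which are not reproduced here.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_decisions(audio_logits, visual_logits, audio_weight=0.5):
    """Late fusion: weighted average of the two streams' posteriors.

    audio_weight is a hypothetical tuning parameter (not from the paper);
    a lower value trusts the visual stream more, e.g. under heavy noise.
    """
    if len(audio_logits) != len(visual_logits):
        raise ValueError("both streams must score the same classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    p_audio = softmax(audio_logits)
    p_visual = softmax(visual_logits)
    return [
        audio_weight * pa + (1.0 - audio_weight) * pv
        for pa, pv in zip(p_audio, p_visual)
    ]


def spot_keyword(fused, labels):
    """Return the most probable label and its fused probability."""
    best = max(range(len(fused)), key=fused.__getitem__)
    return labels[best], fused[best]


# Hypothetical two-class setup: the target keyword versus everything else.
LABELS = ["keyword", "filler"]

# Made-up scores: noisy audio leans weakly toward "filler",
# while the lip sequence clearly indicates the keyword.
audio_logits = [0.2, 0.4]
visual_logits = [2.0, -1.0]

fused = fuse_decisions(audio_logits, visual_logits, audio_weight=0.3)
label, score = spot_keyword(fused, LABELS)
print(label, round(score, 3))
```

With the audio stream down-weighted, the visual evidence decides the outcome in this example. Setting `audio_weight=1.0` reduces the sketch to audio-only KWS, which makes it easy to compare against a single-modality baseline.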

Keywords:

Computer science; Spectrogram; Audio visual; Keyword spotting; Convolutional neural network; Artificial intelligence; Speech recognition; Artificial neural network; Pattern recognition (psychology); Feature (linguistics); Visualization; Feature extraction; Spotting; Noise (video); Image (mathematics); Multimedia

Metrics

FWCI (Field Weighted Citation Impact): 1.63

Citation Normalized Percentile: 0.84

Topics

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Keyword spotting in continuous speech using convolutional neural network

Low-Latency Convolutional Recurrent Neural Network for Keyword Spotting

Keyword spotting based on recurrent neural network

Embedded Device Keyword Spotting Model with Quantized Convolutional Neural Network

Seeing wake words: Audio-visual Keyword Spotting