JOURNAL ARTICLE

Improving Voice Activity Detection for Multimodal Movie Dialogue Corpus

Abstract

Detecting speech segments in audio sequences is an important task for many applications. Although various methods have been developed for voice activity detection (VAD), their accuracy deteriorates when they are applied to movie data because of the background noise present in movies. This noise problem is addressed by using a deep neural network (DNN) model for VAD. Although the overall performance of the DNN-based model was satisfactory, performance dropped clearly when singing voices or musical sounds were present as background noise. In this study, the effectiveness of changing the VAD model from a binary classifier to a multi-class classifier was examined. The results showed that a DNN-based, multi-class VAD model can handle singing voices and musical sounds adequately. In the experiments, an equal error rate of 3.92% was obtained on movie data.
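The abstract describes replacing a binary speech/non-speech classifier with a multi-class classifier, so that singing voices and musical sounds receive their own classes instead of being forced into "non-speech". A minimal sketch of how such multi-class posteriors map back to a binary VAD decision; the class names, logit values, and 0.5 threshold are illustrative assumptions, not details from the paper:

```python
import math

# Hypothetical sketch, not the paper's implementation: a multi-class VAD
# assigns each frame posterior probabilities over several acoustic classes
# (here: speech, singing voice, music, other noise), and the binary
# speech/non-speech decision is recovered from the speech-class posterior.

CLASSES = ["speech", "singing", "music", "noise"]

def softmax(logits):
    """Convert raw classifier scores into posterior probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def vad_decision(frame_logits, threshold=0.5):
    """Label a frame as speech iff the speech-class posterior exceeds the threshold."""
    posteriors = softmax(frame_logits)
    return posteriors[CLASSES.index("speech")] >= threshold

# Per-frame logits as they might come from an upstream DNN (values made up):
frames = [
    [2.5, 0.1, 0.3, 0.2],  # speech-dominant frame
    [0.2, 2.0, 1.5, 0.1],  # singing/music frame -> non-speech for binary VAD
]
print([vad_decision(f) for f in frames])  # -> [True, False]
```

The point of the merge step is that singing and music get modeled explicitly during training but still collapse into "non-speech" at decision time, which is what lets the classifier stop confusing them with dialogue.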

Keywords:
Computer science, Speech recognition, Singing, Classifier, Word error rate, Binary classification, Artificial neural network, Voice activity detection, Background noise, Artificial intelligence, Noise, Pattern recognition, Speech processing, Acoustics, Support vector machine

Metrics

Cited by: 4
FWCI (Field-Weighted Citation Impact): 0.72
References: 14
Citation Normalized Percentile: 0.71

Topics

Speech and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Speech Recognition and Synthesis (Physical Sciences → Computer Science → Artificial Intelligence)