JOURNAL ARTICLE

Audio-visual multimodal speech recognition method fusing audio and video data

Abstract

With the widespread application of deep learning methods, multimodal techniques have also developed rapidly. Because single-modal speech recognition loses accuracy in noisy environments, multimodal fusion is gradually replacing traditional single-modal recognition methods. In this paper, we first enhance and pre-process the audio and video data, then use an LSTM recurrent neural network for deep feature extraction from the audio and video streams, which effectively mitigates the long-term forgetting problem of general recurrent networks. The audio and video feature vectors are then fused by a fully connected neural network with linear connections. Compared with speech recognition alone, this audio-visual fusion method recognizes speech more accurately under noise interference; compared with traditional audio-visual recognition methods, it simplifies the recognition pipeline. Experiments on the LRS2-BBC dataset show that the proposed method improves recognition accuracy to a certain extent over other methods in a clean environment and improves it greatly in noisy conditions.
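The fusion stage described above can be sketched in a few lines: per-stream deep features from the audio and video LSTMs are concatenated and passed through a single fully connected (linear) layer to form the joint representation. This is a minimal illustrative sketch, not the paper's implementation; the feature dimensions (128 for audio, 256 for video, 64 for the fused vector) and the random weights are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame deep features produced by the audio and
# video LSTM streams (dimensions are illustrative, not from the paper).
audio_feat = rng.standard_normal(128)
video_feat = rng.standard_normal(256)


def fuse(audio, video, weight, bias):
    """Feature-level fusion: concatenate the two modality vectors,
    then apply one linear (fully connected) layer."""
    x = np.concatenate([audio, video])  # shape (384,)
    return weight @ x + bias            # shape (out_dim,)


out_dim = 64
weight = rng.standard_normal((out_dim, audio_feat.size + video_feat.size)) * 0.01
bias = np.zeros(out_dim)

fused = fuse(audio_feat, video_feat, weight, bias)
print(fused.shape)  # (64,)
```

In practice the linear layer's weights would be learned jointly with the LSTM encoders, and the fused vector would feed a downstream classifier over the speech vocabulary.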

Keywords:
Computer science, Speech recognition, Feature extraction, Artificial intelligence, Artificial neural networks, Noise, Audio mining, Hidden Markov model, Pattern recognition, Voice activity detection, Speech processing

Metrics

Cited by: 1
FWCI (Field-Weighted Citation Impact): 0.19
References: 8
Citation Normalized Percentile: 0.39

Topics

Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Speech and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Digital Media Forensic Detection (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)