JOURNAL ARTICLE

Development of Visual and Audio Speech Recognition Systems Using Deep Neural Networks

Abstract

In this paper we design end-to-end neural networks for the low-resource lip-reading and audio speech recognition tasks using 3D CNNs, pre-trained CNN weights from several state-of-the-art models (e.g. VGG19, InceptionV3, MobileNetV2) and LSTMs. We present two phrase-level speech recognition pipelines, one for lip-reading and one for acoustic speech recognition. We evaluate different combinations of front-end and back-end modules on the RUSAVIC dataset. We compare our results with the traditional 2D CNN approach and demonstrate an increase in recognition accuracy of up to 14%. Moreover, we carefully studied existing state-of-the-art models to be used for augmentation. Based on this analysis, we chose the 5 most promising model architectures and evaluated them on our own data. We tested our systems on real-world data from two scenarios: recorded in an idling vehicle and during actual driving. Our independently trained systems demonstrated acoustic speech recognition accuracy of up to 90% and lip-reading accuracy of up to 61%. Future work will focus on the fusion of the visual and audio speech modalities and on speaker adaptation. We expect that fused multi-modal information will further improve recognition performance compared to a single modality. Another possible direction is research into different NN-based architectures to better tackle the end-to-end lip-reading task.
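
The abstract names the building blocks of the lip-reading pipeline (a 3D CNN front-end, a pre-trained 2D CNN and an LSTM back-end) but not their dimensions. The following is a minimal sketch of how tensor shapes could flow through such a pipeline, not the authors' implementation. All names (`Clip`, `conv3d_frontend`, `framewise_cnn`, `lstm_backend`) are hypothetical, and the clip size, kernel sizes, hidden size and number of phrases are assumptions chosen only for illustration. The one grounded number is the 1280-dimensional pooled feature vector of MobileNetV2; other backbones produce different sizes.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Standard convolution output length along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


@dataclass(frozen=True)
class Clip:
    """Shape of one mouth-region video clip (all sizes are illustrative)."""
    frames: int
    height: int
    width: int
    channels: int


def conv3d_frontend(clip: Clip,
                    out_channels: int = 3,
                    kernel: Tuple[int, int, int] = (5, 7, 7),
                    stride: Tuple[int, int, int] = (1, 2, 2),
                    padding: Tuple[int, int, int] = (2, 3, 3)) -> Clip:
    """Spatio-temporal convolution over (time, height, width).

    The hyperparameters are assumptions; out_channels=3 is picked so that a
    CNN pre-trained on RGB images could consume each output frame.
    """
    kt, kh, kw = kernel
    st, sh, sw = stride
    pt, ph, pw = padding
    return Clip(
        frames=conv_out(clip.frames, kt, st, pt),
        height=conv_out(clip.height, kh, sh, ph),
        width=conv_out(clip.width, kw, sw, pw),
        channels=out_channels,
    )


def framewise_cnn(clip: Clip, feature_dim: int = 1280) -> Tuple[int, int]:
    """Stand-in for a pre-trained 2D CNN with global pooling, applied per frame.

    Each frame becomes one feature vector, so the clip becomes a sequence
    of shape (frames, feature_dim).
    """
    return (clip.frames, feature_dim)


def lstm_backend(sequence: Tuple[int, int],
                 hidden: int = 256,
                 num_phrases: int = 50) -> Dict[str, Tuple[int, ...]]:
    """LSTM reads the sequence; its final hidden state feeds a phrase classifier.

    hidden and num_phrases are hypothetical values, not taken from the paper.
    """
    steps, _feature_dim = sequence
    if steps < 1:
        raise ValueError("the LSTM needs at least one time step")
    return {"hidden_state": (hidden,), "logits": (num_phrases,)}


# Trace an assumed 30-frame, 112x112 grayscale clip through the pipeline.
clip = Clip(frames=30, height=112, width=112, channels=1)
front = conv3d_frontend(clip)
features = framewise_cnn(front)
outputs = lstm_backend(features)
print(front)
print(features)
print(outputs)
```

With these assumed settings the temporal length is preserved by the front-end, so the LSTM sees one feature vector per input frame, and the phrase-level decision is a single classification over the whole clip.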

Keywords:
Computer science; Speech recognition; Task (project management); Phrase; Artificial intelligence; Artificial neural network; Focus (optics); Modality (human–computer interaction)

Metrics

Cited By: 6
FWCI (Field Weighted Citation Impact): 0.86
Refs: 32
Citation Normalized Percentile: 0.75

Topics

Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Hearing Loss and Rehabilitation
Life Sciences →  Neuroscience →  Cognitive Neuroscience
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing