Development of Visual and Audio Speech Recognition Systems Using Deep Neural Networks

Denis Ivanko; Dmitry Ryumin

doi:10.20948/graphicon-2021-3027-905-916

JOURNAL ARTICLE

Year: 2021 Pages: 905-916

Abstract

In this paper we design end-to-end neural networks for the low-resource lip-reading and audio speech recognition tasks using 3D CNNs, pre-trained CNN weights of several state-of-the-art models (e.g. VGG19, InceptionV3, MobileNetV2) and LSTMs. We present two phrase-level speech recognition pipelines: one for lip-reading and one for acoustic speech recognition. We evaluate different combinations of front-end and back-end modules on the RUSAVIC dataset. We compare our results with a traditional 2D CNN approach and demonstrate an increase in recognition accuracy of up to 14%. Moreover, we carefully studied existing state-of-the-art models to be used for augmentation. Based on this analysis, we chose the 5 most promising model architectures and evaluated them on our own data. We tested our systems on real-world data from two scenarios: recorded in an idling vehicle and during actual driving. Our independently trained systems demonstrated acoustic speech recognition accuracy of up to 90% and lip-reading accuracy of up to 61%. Future work will focus on the fusion of visual and audio speech modalities and on speaker adaptation. We expect that fused multi-modal information will further improve recognition performance compared to a single modality. Another possible direction is research into different NN-based architectures to better tackle the end-to-end lip-reading task.

Keywords:

Computer science; Speech recognition; Task (project management); Phrase; Artificial intelligence; Artificial neural network; Focus (optics); Modality (human–computer interaction)

Metrics

FWCI (Field Weighted Citation Impact): 0.86

Citation Normalized Percentile: 0.75

Topics

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Hearing Loss and Rehabilitation

Life Sciences → Neuroscience → Cognitive Neuroscience

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

Audio Visual Speech Recognition Using Deep Recurrent Neural Networks

Audio-visual speech enhancement using deep neural networks

Audio-Visual Speech Recognition using 3D Convolutional Neural Networks

Audio-to-Visual Speech Conversion Using Deep Neural Networks

Audio-Visual Person Recognition Using Deep Convolutional Neural Networks