JOURNAL ARTICLE

Audio-Visual Speech Recognition System Using Recurrent Neural Network

Abstract

An audio-visual speech recognition (AVSR) system integrates audio and visual information to perform speech recognition. AVSR has various practical applications, especially in natural language processing systems such as speech-to-text conversion, automatic translation and sentiment analysis. For decades, researchers tended to use the Hidden Markov Model (HMM) to build speech recognition systems because of its good recognition rates. However, an HMM requires an enormous training dataset to achieve sufficient linguistic coverage, and its recognition rate in noisy environments is unsatisfactory. To overcome these deficiencies, a Recurrent Neural Network (RNN) based AVSR system is proposed. The proposed AVSR model consists of three components: 1) an audio feature extraction mechanism, 2) a visual feature extraction mechanism and 3) an audio-visual feature integration mechanism. The integration mechanism combines the output features of the audio and visual extraction mechanisms to generate the final classification results. In this research, the audio feature mechanism is modelled with Mel-frequency cepstral coefficients (MFCC) and further processed by an RNN, whereas the visual feature mechanism is modelled with Haar-cascade detection in OpenCV and likewise further processed by an RNN. Both sets of extracted features are then integrated by a multimodal RNN-based feature integration mechanism. The speech recognition rate and robustness of the proposed AVSR system were evaluated using speech in a clean environment and at signal-to-noise ratio (SNR) levels ranging from -20 dB to 20 dB in 5 dB steps. On average, the final speech recognition rate is 89% across the different SNR levels.
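
The noisy-speech evaluation described in the abstract can be illustrated with a minimal sketch of how a clean signal might be mixed with noise at each SNR level from -20 dB to 20 dB in 5 dB steps. The abstract does not state the noise type or the mixing procedure, so the additive white Gaussian noise, the synthetic sine-wave "speech", and the names `mix_at_snr`, `measured_snr_db` and `SNR_LEVELS_DB` are all assumptions made for illustration, not the authors' method.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise must have non-zero power")
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the target noise power.
    target_noise_power = p_speech / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(clean, noisy):
    """Recover the SNR of `noisy` relative to `clean` (sanity check)."""
    p_clean = sum(c * c for c in clean) / len(clean)
    p_resid = sum((y - c) ** 2 for c, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(p_clean / p_resid)


# SNR grid from the abstract: -20 dB to 20 dB in 5 dB steps (9 levels).
SNR_LEVELS_DB = list(range(-20, 21, 5))

# Hypothetical stand-ins for real recordings: a 440 Hz tone sampled at
# 16 kHz (assumed rate) and white Gaussian noise from a seeded generator.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

noisy_by_snr = {snr: mix_at_snr(speech, noise, snr) for snr in SNR_LEVELS_DB}
```

Each entry of `noisy_by_snr` could then be passed through the recognizer, and averaging the per-level recognition rates would give a figure comparable in kind to the reported average.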

Keywords:
Computer science, Speech recognition, Audio mining, Artificial neural network, Time delay neural network, Recurrent neural network, Audio visual, Artificial intelligence, Speech processing, Voice activity detection, Multimedia

Metrics

Cited By: 10
FWCI (Field Weighted Citation Impact): 0.82
Refs: 30
Citation Normalized Percentile: 0.74

Topics

Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
