JOURNAL ARTICLE

Automatic speech recognition using audio visual cues

Abstract

Automatic speech recognition (ASR) systems have gained popularity because many multimedia applications require robust speech recognition algorithms. Using both audio and visual information in speaker-independent continuous speech recognition improves performance over audio-only systems: recognition rates increase markedly when visual data supplements the audio, because video is less susceptible to ambient noise than audio. This paper presents a robust audio-visual speech recognition (AVSR) system that uses a coupled hidden Markov model (CHMM) to fuse the audio and video modalities. The application records the input data and recognizes isolated words in the input file over a wide range of signal-to-noise ratios (SNR). Experimental results show an increase of about 10% in recognition rate for the AVSR system compared to audio-only ASR, and about 20% compared to video-only ASR, at an SNR of 5 dB.

Keywords:
Speech recognition, Computer science, Audio mining, Hidden Markov model, Speech coding, Speaker recognition, Acoustic model, Voice activity detection, Noise (video), Artificial intelligence, Speech processing, Image (mathematics)

Metrics

Cited by: 13
FWCI (Field-Weighted Citation Impact): 0.65
References: 9
Citation Normalized Percentile: 0.72


Topics

Speech and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Multisensory perception and integration (Social Sciences → Psychology → Experimental and Cognitive Psychology)
Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)