JOURNAL ARTICLE

Indonesian audio-visual speech corpus for multimodal automatic speech recognition

Abstract

Advances in automatic speech recognition (ASR) rely heavily on the availability of data, even more so for the deep learning ASR systems at the forefront of ASR research. Many corpora have been built to meet this need, ranging from single-modality corpora, which mostly serve acoustic speech recognition (with a few exceptions for visual speech decoding), to multimodal corpora, which serve both modalities. Multimodal corpora have been significant in the development of ASR because speech is inherently multimodal. Despite their importance, no such corpus has been built for the Indonesian language, resulting in little to no development of visual-only or multimodal ASR systems for it. This research attempts to address that problem by constructing AVID, an Indonesian audio-visual speech corpus for multimodal ASR. The corpus consists of 10 speakers each speaking 1,040 sentences with a simple structure, resulting in 10,400 videos of spoken sentences. To the best of our knowledge, AVID is the first audio-visual speech corpus for the Indonesian language designed for multimodal ASR. AVID was extensively tested and shows low overall error in tests of both modalities, which indicates the high quality of the corpus and its suitability for building multimodal ASR systems.

Keywords:
Computer science; Speech corpus; Speech recognition; Audio mining; Natural language processing; Modality (human–computer interaction); Artificial intelligence; Acoustic model; Speech processing; Speech synthesis

Metrics

Cited By: 8
FWCI (Field Weighted Citation Impact): 0.81
Refs: 13
Citation Normalized Percentile: 0.73

Topics

Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing

Related Documents

JOURNAL ARTICLE

An audio-visual corpus for multimodal automatic speech recognition

Andrzej Czyżewski, Bożena Kostek, Piotr Bratoszewski, Józef Kotus, Marcin Szykulski

Journal: Journal of Intelligent Information Systems, Year: 2017, Vol: 49 (2), Pages: 167-192
JOURNAL ARTICLE

An audio-visual corpus for speech perception and automatic speech recognition

Martin Cooke, Jon Barker, Stuart Cunningham, Xu Shao

Journal: The Journal of the Acoustical Society of America, Year: 2006, Vol: 120 (5), Pages: 2421-2424
JOURNAL ARTICLE

Advances in Automatic Speech Recognition: From Audio-Only To Audio-Visual Speech Recognition

Akriti Bahal

Journal: IOSR Journal of Computer Engineering, Year: 2012, Vol: 5 (1), Pages: 31-36