JOURNAL ARTICLE

Synthesis and automatic recognition of audio-visual speech

Abstract

Since the 1950s, several experiments have been run to evaluate the benefit of lip-reading for speech intelligibility, all presenting a natural face speaking at different levels of background noise. In this paper, we present a similar experiment run with French stimuli. Experiments run by McGrath (1985) and then by Summerfield et al. (1989) showed that the lips carry more than half the visual information provided by the whole face of an English speaker, and that vision of the teeth somewhat increases the intelligibility of a message. Similar experiments have been carried out in French at the Institut de la Communication Parlée. We compared the overall performance of normal hearers in audio-visual intelligibility tests where the visual displays consisted of a natural face (Benoit et al., 1992), natural lips alone (Le Goff et al., 1995), and a set of 3D parametric models of the main components of a speaker's face: the lips, the jaw and the skin (Guiard-Marigny et al., 1995). The same parameters as those used to animate our synthetic models of the face were measured on the same corpus to evaluate the performance of an HMM classifier in an identification task analogous to that performed by the human subjects (Adjoudani and Benoit, 1996). Overall results are also presented. (6 pages)

Keywords:
Computer science; Audio-visual; Speech recognition; Audio mining; Artificial intelligence; Speech processing; Multimedia; Acoustic model

Metrics

Cited by: 3
FWCI (Field-Weighted Citation Impact): 0.00
References: 0
Citation Normalized Percentile: 0.23

Topics

Speech and Audio Processing