JOURNAL ARTICLE

Audio-visual graphical models for speech processing

Abstract

Perceiving sounds in a noisy environment is a challenging problem. Visual lip-reading can provide relevant information, but it is also challenging because the lips are moving and a tracker must deal with a variety of conditions. Typically, audio-visual systems have been assembled from individually engineered modules. We propose instead to fuse audio and video in a probabilistic generative model that implements cross-modal self-supervised learning, enabling adaptation to audio-visual data. The video model features a Gaussian mixture model embedded in a linear subspace of a sprite that translates within the video frame. The system can learn to detect and enhance speech in noise given only a short (30-second) sequence of audio-visual data. We show results for speech detection and enhancement, and discuss extensions to the model that are under investigation.
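The video model described above, a Gaussian mixture model over sprite appearances lying in a linear subspace, with the sprite translating within the frame, can be sketched as a toy generative process. All dimensions, parameter values, and the `sample_frame` helper below are illustrative assumptions for exposition, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, chosen only for illustration)
H, W = 16, 16          # frame size in pixels
D = H * W              # pixels per sprite image
K = 3                  # number of GMM mixture components
L = 4                  # dimension of the linear appearance subspace

# Linear subspace of sprite appearances: x = mu + B @ z
B = rng.normal(size=(D, L)) * 0.1   # subspace basis
mu = rng.normal(size=D) * 0.1       # mean sprite appearance

# GMM over the subspace coefficients z
weights = np.array([0.5, 0.3, 0.2])   # mixture weights
means = rng.normal(size=(K, L))       # per-component means
stds = np.full((K, L), 0.5)           # per-component std devs (diagonal)

def sample_frame(rng):
    """Generative process: pick a mixture component, draw subspace
    coefficients, render the sprite, then translate it in the frame."""
    k = rng.choice(K, p=weights)
    z = rng.normal(means[k], stds[k])
    sprite = (mu + B @ z).reshape(H, W)
    # Discrete translation of the sprite within the frame (wrap-around
    # via np.roll keeps the sketch simple)
    dy, dx = rng.integers(-3, 4, size=2)
    frame = np.roll(np.roll(sprite, dy, axis=0), dx, axis=1)
    return frame, k, (dy, dx)

frame, k, shift = sample_frame(rng)
```

Inference in the actual model would invert this process, jointly estimating the sprite translation, subspace coefficients, and mixture component from observed frames; the cross-modal coupling to audio is omitted here.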

Keywords:
Computer science, Speech recognition, Graphical model, Speech processing, Audio visual, Natural language processing, Artificial intelligence, Multimedia

Metrics

Cited by: 15
FWCI (Field-Weighted Citation Impact): 1.11
References: 11
Citation Normalized Percentile: 0.78

Topics

Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Speech and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Speech Recognition and Synthesis (Physical Sciences → Computer Science → Artificial Intelligence)

Related Documents

BOOK-CHAPTER

Audio-visual Speech Processing

Ruth Campbell

Elsevier eBooks Year: 2006 Pages: 562-569
BOOK-CHAPTER

Audio-Visual Speech Processing

Simon Lucey

Encyclopedia of Biometrics Year: 2009 Pages: 43-43
JOURNAL ARTICLE

Somatosensory contribution to audio-visual speech processing

Takayuki Ito, Hiroki Ohashi, Vincent L. Gracco

Journal: Cortex Year: 2021 Vol: 143 Pages: 195-204