Shinichi Tamura, Koji Iwano, Sadaoki Furui
For the multi-stream HMMs widely used in audio-visual speech recognition, it is important to adjust the stream weights automatically and properly. This paper proposes a stream-weight optimization technique based on a likelihood-ratio maximization criterion. In our audio-visual speech recognition system, video signals are captured and converted into visual features using HMM-based techniques. The extracted acoustic and visual features are concatenated into an audio-visual feature vector, and a multi-stream HMM is obtained from the audio and visual HMMs. Experiments are conducted on Japanese connected-digit speech recorded in real-world environments. Applying MLLR (maximum likelihood linear regression) adaptation together with our optimization method, we achieve a 29% absolute accuracy improvement and a 76% relative error-rate reduction compared with the audio-only scheme.
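In a multi-stream HMM, the per-state emission score is typically a weighted combination of the per-stream log-likelihoods, with the stream weights the quantities being optimized. Below is a minimal illustrative sketch of that combination, assuming two streams (audio and visual) whose exponent weights sum to one; the function name and signature are illustrative, not from the paper.

```python
def combined_log_likelihood(log_p_audio: float,
                            log_p_visual: float,
                            audio_weight: float) -> float:
    """Stream-weighted emission log-likelihood for one HMM state.

    Implements log b(o) = w_a * log b_a(o_a) + w_v * log b_v(o_v),
    under the common constraint w_a + w_v = 1 (assumption here).
    """
    visual_weight = 1.0 - audio_weight
    return audio_weight * log_p_audio + visual_weight * log_p_visual


# With audio_weight = 1.0 the score reduces to the audio-only case;
# with audio_weight = 0.0 it reduces to the visual-only case.
score = combined_log_likelihood(-2.0, -4.0, 0.5)
```

Raising the audio weight toward 1 recovers the audio-only scheme the paper uses as its baseline, which is why choosing the weights well (e.g. by the proposed likelihood-ratio maximization) directly controls the accuracy gain from the visual stream.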