In this paper, we perform beamforming with a speech recognition-level criterion. A beamformer is usually designed by optimizing a signal-level criterion, e.g., by minimizing the beamformer output covariance or by maximizing the signal-to-noise ratio (SNR). Such signal-level criteria do not guarantee that the optimized beamformer is the best one for noise-robust automatic speech recognition. Recently, a few approaches have been proposed for performing beamforming with a speech recognition-level criterion. These approaches train beamformers jointly with an acoustic model, using multichannel training data and a parallel corpus of noisy and clean data. This paper proposes a novel approach that estimates the beamformer for every test utterance with a speech recognition-level criterion. We use an unsupervised acoustic model adaptation scheme to optimize our beamformer. Specifically, we first obtain decoding results with an initialized beamformer, and then optimize the beamformer by backpropagation to minimize the cross-entropy between the first-pass decoding results and the actual network outputs. With this approach, our beamformer can be trained to discriminate hidden Markov model states more clearly for every test utterance. Experimental results show that our beamformer outperforms a beamformer designed with a signal-level criterion.
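The two-pass procedure described above can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the beamformer is reduced to real-valued per-channel weights producing one scalar feature per frame, the acoustic model is a frozen toy softmax layer (`a`, `b`), a frame-wise argmax stands in for the first-pass decoder, and all function names are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def beamform(w, frames):
    # Toy weighted sum over channels: one scalar feature per frame (assumption).
    return [sum(wc * xc for wc, xc in zip(w, x)) for x in frames]

def posteriors(y, a, b):
    # Frozen toy acoustic model: state k has logit a[k] * y + b[k] (assumption).
    return softmax([ak * y + bk for ak, bk in zip(a, b)])

def cross_entropy(w, frames, labels, a, b):
    # Mean cross-entropy between first-pass labels and current model outputs.
    ys = beamform(w, frames)
    total = sum(-math.log(posteriors(y, a, b)[s]) for y, s in zip(ys, labels))
    return total / len(frames)

def adapt_beamformer(w0, frames, a, b, steps=50, lr=0.1):
    # First pass: frame-wise argmax with the initial beamformer stands in
    # for real decoding (assumption).
    labels = []
    for y in beamform(w0, frames):
        p = posteriors(y, a, b)
        labels.append(p.index(max(p)))
    # Second pass: gradient descent on the beamformer weights only;
    # the acoustic model parameters a and b stay fixed.
    w = list(w0)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y, s in zip(frames, beamform(w, frames), labels):
            p = posteriors(y, a, b)
            # d(loss)/dy = sum_k (p_k - onehot_k) * a_k
            dy = sum((p[k] - (1.0 if k == s else 0.0)) * a[k] for k in range(len(a)))
            for c in range(len(w)):
                grad[c] += dy * x[c] / len(frames)
        w = [wc - lr * g for wc, g in zip(w, grad)]
    return w, labels

# Synthetic two-channel input and a three-state toy model (made-up values).
frames = [[math.sin(0.3 * t), math.sin(0.3 * t + 0.5)] for t in range(40)]
a, b = [-1.0, 0.0, 1.0], [0.0, 0.2, 0.0]
w0 = [0.5, 0.5]
w, labels = adapt_beamformer(w0, frames, a, b)
loss_before = cross_entropy(w0, frames, labels, a, b)
loss_after = cross_entropy(w, frames, labels, a, b)
```

In this toy setting the gradient flows through the fixed model back to the beamformer weights, so the adapted weights make the model outputs agree more strongly with the first-pass labels; a real system would replace each stand-in with its full counterpart.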
Jaemin Lim, Kiyeon Kim, Sunghyun Cho, Suk-Bok Lee
Jian Huang, Jianhua Tao, Bin Liu, Zheng Lian
Takuya Higuchi, Takuya Yoshioka, Tomohiro Nakatani
Bogdan Mocanu, Ruxandra Țapu, Titus Zaharia