In this work, we present a novel training procedure for attention-based end-to-end automatic speech recognition. Our goal is to push the encoder network to output only linguistic information, improving generalization, particularly in low-resource scenarios. We accomplish this by adding a text encoder network, which the speech encoder is encouraged to mimic. Our main innovation is to compare the attention-weighted speech encoder outputs with the outputs of the text encoder. Because attention produces one weighted vector per output token, this guarantees two sequences of the same length that can be directly aligned. We show that our training procedure significantly decreases word error rates in all experiments and has the biggest absolute impact in the lowest-resource scenarios.
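The comparison described above can be illustrated with a minimal sketch. It assumes dot-product-style attention scores that are normalized with a softmax, and it assumes a mean squared distance as the mimicry loss; the abstract does not specify either choice. The names `attention_contexts` and `mimic_loss` are hypothetical and the inputs are toy values, not data from the experiments.

```python
import math


def softmax(scores):
    """Normalize raw attention scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_contexts(speech_enc, attn_scores):
    """Return one attention-weighted speech vector per output token.

    speech_enc:  T frames, each a list of D floats (speech encoder outputs).
    attn_scores: U rows of T raw scores, one row per output token.
    The result has U vectors of size D, regardless of T.
    """
    dim = len(speech_enc[0])
    contexts = []
    for token_scores in attn_scores:
        weights = softmax(token_scores)
        context = [
            sum(w * frame[d] for w, frame in zip(weights, speech_enc))
            for d in range(dim)
        ]
        contexts.append(context)
    return contexts


def mimic_loss(contexts, text_enc):
    """Mean squared distance between aligned vectors (an assumed loss)."""
    if len(contexts) != len(text_enc):
        raise ValueError("sequences must have the same length")
    total = 0.0
    for ctx, txt in zip(contexts, text_enc):
        total += sum((c - t) ** 2 for c, t in zip(ctx, txt))
    return total / len(contexts)


# Toy example: T = 4 speech frames, U = 2 output tokens, D = 2 dimensions.
speech_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
attn_scores = [[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]]
text_enc = [[0.5, 0.5], [1.0, 0.0]]  # hypothetical text encoder outputs

contexts = attention_contexts(speech_enc, attn_scores)
loss = mimic_loss(contexts, text_enc)
print(len(contexts), loss)
```

Although the speech input has four frames, the attention step yields exactly two context vectors, one per output token, so they pair one-to-one with the two text encoder outputs without any separate alignment step.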
Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong
Jumon Nozaki, Tatsuya Kawahara, Kenkichi Ishizuka, Taiichi Hashimoto
Ghayas Ahmed, Aadil Ahmad Lawaye, Tawseef Ahmad Mir, Parveen Rana
Seongmin Lim, Jahyun Goo, Hoirin Kim