Abstract

In this paper we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data. Our model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections. To improve training, we further introduce a new layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that the proposed deep architecture performs as well as or better than more complex choices. Our deepest Jasper variant uses 54 convolutional layers. With this architecture, we achieve 2.95% WER using a beam-search decoder with an external neural language model and 3.86% WER with a greedy decoder on LibriSpeech test-clean. We also report competitive results on Wall Street Journal and the Hub5'00 conversational evaluation datasets.
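The layer-wise NovoGrad optimizer mentioned in the abstract maintains, unlike Adam, a single scalar second moment per layer, computed from the squared norm of that layer's whole gradient. The following is a minimal NumPy sketch of one such update step; the function name and the Adam-style hyperparameter defaults are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def novograd_step(w, g, m, v, lr=0.01, beta1=0.95, beta2=0.98,
                  eps=1e-8, weight_decay=0.0):
    """One NovoGrad-style update for a single layer (sketch).

    w : layer weights, g : gradient, m : first-moment buffer,
    v : scalar per-layer second moment (None on the first step).
    """
    g_norm_sq = float(np.sum(g * g))
    if v is None:
        # first step: initialize the second moment with the gradient norm
        v = g_norm_sq
    else:
        v = beta2 * v + (1.0 - beta2) * g_norm_sq
    # gradient rescaled by the layer-wise norm, plus decoupled weight decay
    update = g / (np.sqrt(v) + eps) + weight_decay * w
    m = beta1 * m + update
    w = w - lr * m
    return w, m, v

# hypothetical usage on a two-parameter "layer"
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
w, m, v = novograd_step(w, np.array([0.5, 0.5]), m, None)
```

Because `v` is a scalar per layer rather than per parameter, the optimizer's memory overhead is roughly half that of Adam, which is one of the motivations the paper gives for the layer-wise design.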

Keywords:
Speech recognition; end-to-end models; convolutional neural networks; batch normalization; dropout; residual connections; language models; beam search; decoding

Metrics

- Cited by: 213
- FWCI (Field-Weighted Citation Impact): 21.51
- References: 35
- Citation Normalized Percentile: 0.99 (top 1%)

Topics

Speech Recognition and Synthesis
Physical Sciences → Computer Science → Artificial Intelligence
Speech and Audio Processing
Physical Sciences → Computer Science → Signal Processing
Music and Audio Processing
Physical Sciences → Computer Science → Signal Processing
© 2026 ScienceGate Book Chapters — All rights reserved.