Abstract

In automatic speech recognition systems, preprocessing the audio signal to generate features is an important part of achieving a good recognition rate. Previous works have shown that artificial neural networks can be used to extract good, discriminative features that yield better recognition performance than manually engineered feature extraction algorithms. One possible approach for this is to train a network with a small bottleneck layer, and then use the activations of the units in this layer to produce feature vectors for the remaining parts of the system. Deep learning is a field of machine learning that deals with efficient training algorithms for neural networks with many hidden layers, and with automatic discovery of relevant features from data. While most frequently used in computer vision, multiple recent works have demonstrated the ability of deep networks to achieve superior performance on speech recognition tasks as well. In this work, a novel approach for extracting bottleneck features from deep neural networks is proposed. A stack of denoising auto-encoders is first trained in a layer-wise and unsupervised manner. Afterwards, the stack is transformed to a feed-forward neural network, and a bottleneck layer, an additional hidden layer and the classification layer are added. The whole network is then fine-tuned to estimate phonetic target states in order to generate discriminative features in the bottleneck layer. Multiple experiments on conversational telephone speech in Cantonese show that the proposed architecture can effectively leverage the increased capacity introduced by deep neural networks by generating more useful features that result in better recognition performance. Experiments confirm that this ability heavily depends on initializing the stack of auto-encoders with pre-training. Extracting features from log mel scale filterbank coefficients results in additional gains when compared to features from cepstral coefficients.
Further, small improvements can be achieved by pre-training auto-encoders with more data, which is an interesting property for settings where only a small amount of transcribed data is available. Evaluations on larger datasets result in significant reductions of recognition error rates (8% to 10% relative) over baseline systems using standard features, and therefore demonstrate the general applicability of the proposed architecture.
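The pipeline described in the abstract can be sketched in code: denoising auto-encoders are pre-trained one layer at a time on corrupted inputs, the resulting encoders are stacked into a feed-forward network, and the activations of a small bottleneck layer serve as features. The following is a minimal NumPy sketch under stated assumptions — the layer sizes, learning rate, masking-noise level, and bottleneck dimension (42) are illustrative placeholders, not the thesis's actual configuration, and the supervised fine-tuning step against phonetic target states is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_dae(data, n_hidden, noise=0.2, lr=0.01, epochs=3):
    """Train one denoising auto-encoder layer with tied weights:
    corrupt the input with masking noise, then reconstruct the clean input."""
    n_in = data.shape[1]
    W = rng.normal(0.0, 0.01, (n_in, n_hidden))
    b = np.zeros(n_hidden)   # hidden (encoder) bias
    c = np.zeros(n_in)       # visible (decoder) bias
    for _ in range(epochs):
        corrupted = data * (rng.random(data.shape) > noise)  # masking noise
        h = sigmoid(corrupted @ W + b)
        recon = sigmoid(h @ W.T + c)
        # Backprop of squared reconstruction error through tied weights
        d_out = (recon - data) * recon * (1.0 - recon)
        d_h = (d_out @ W) * h * (1.0 - h)
        W -= lr * (corrupted.T @ d_h + (h.T @ d_out).T) / len(data)
        b -= lr * d_h.mean(axis=0)
        c -= lr * d_out.mean(axis=0)
    return W, b  # decoder bias is discarded after pre-training

def build_stack(data, layer_sizes):
    """Layer-wise unsupervised pre-training: each DAE is trained on the
    (clean) hidden activations of the layer below."""
    stack, x = [], data
    for n_hidden in layer_sizes:
        W, b = pretrain_dae(x, n_hidden)
        stack.append((W, b))
        x = sigmoid(x @ W + b)
    return stack

def bottleneck_features(x, stack, W_bn, b_bn):
    """Forward pass through the pre-trained stack; the activations of the
    small bottleneck layer are the extracted feature vectors."""
    for W, b in stack:
        x = sigmoid(x @ W + b)
    return sigmoid(x @ W_bn + b_bn)

# Illustrative usage: 64 frames of 40-dim log mel filterbank coefficients
# (in practice, several neighbouring frames would be stacked as context).
frames = rng.normal(size=(64, 40))
stack = build_stack(frames, [64, 64])
W_bn = rng.normal(0.0, 0.01, (64, 42))  # 42-unit bottleneck (illustrative)
b_bn = np.zeros(42)
feats = bottleneck_features(frames, stack, W_bn, b_bn)
```

In the full system, a further hidden layer and a softmax classification layer would be appended after the bottleneck, the whole network fine-tuned on phonetic target states, and everything above the bottleneck discarded at feature-extraction time.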

Keywords:
Bottleneck, Computer science, Artificial intelligence, Artificial neural network, Deep learning, Discriminative model, Feature extraction, Time delay neural network, Preprocessor, Machine learning, Pattern recognition (psychology), Feature (linguistics), Speech recognition

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 46
Citation Normalized Percentile: 0.21

Topics

Speech Recognition and Synthesis (Physical Sciences → Computer Science → Artificial Intelligence)
Speech and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
© 2026 ScienceGate Book Chapters — All rights reserved.