Abstract

In automatic speech recognition systems, preprocessing the audio signal to generate features is an important part of achieving a good recognition rate. Previous works have shown that artificial neural networks can be used to extract good, discriminative features that yield better recognition performance than manually engineered feature extraction algorithms. One possible approach for this is to train a network with a small bottleneck layer, and then use the activations of the units in this layer to produce feature vectors for the remaining parts of the system. Deep learning is a field of machine learning that deals with efficient training algorithms for neural networks with many hidden layers, and with automatic discovery of relevant features from data. While most frequently used in computer vision, multiple recent works have demonstrated the ability of deep networks to achieve superior performance on speech recognition tasks as well. In this work, a novel approach for extracting bottleneck features from deep neural networks is proposed. A stack of denoising auto-encoders is first trained in a layer-wise and unsupervised manner. Afterwards, the stack is transformed to a feed-forward neural network, and a bottleneck layer, an additional hidden layer and the classification layer are added. The whole network is then fine-tuned to estimate phonetic target states in order to generate discriminative features in the bottleneck layer. Multiple experiments on conversational telephone speech in Cantonese show that the proposed architecture can effectively leverage the increased capacity introduced by deep neural networks by generating more useful features that result in better recognition performance. Experiments confirm that this ability heavily depends on initializing the stack of auto-encoders with pre-training. Extracting features from log mel scale filterbank coefficients results in additional gains when compared to features from cepstral coefficients.
Further, small improvements can be achieved by pre-training auto-encoders with more data, which is an interesting property for settings where only a small amount of transcribed data is available. Evaluations on larger datasets result in significant reductions of recognition error rates (8% to 10% relative) over baseline systems using standard features, and therefore demonstrate the general applicability of the proposed architecture.
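The pipeline described in the abstract can be sketched in code: denoising auto-encoders are pre-trained one layer at a time on corrupted inputs, the resulting encoders are stacked into a feed-forward network, and the activations of a small bottleneck layer serve as features. The following is a minimal NumPy sketch under stated assumptions — the layer sizes, learning rate, masking-noise level, and bottleneck dimension (42) are illustrative placeholders, not the thesis's actual configuration, and the supervised fine-tuning step against phonetic target states is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_dae(data, n_hidden, noise=0.2, lr=0.01, epochs=3):
    """Train one denoising auto-encoder layer with tied weights:
    corrupt the input with masking noise, then reconstruct the clean input."""
    n_in = data.shape[1]
    W = rng.normal(0.0, 0.01, (n_in, n_hidden))
    b = np.zeros(n_hidden)   # hidden (encoder) bias
    c = np.zeros(n_in)       # visible (decoder) bias
    for _ in range(epochs):
        corrupted = data * (rng.random(data.shape) > noise)  # masking noise
        h = sigmoid(corrupted @ W + b)
        recon = sigmoid(h @ W.T + c)
        # Backprop of squared reconstruction error through tied weights
        d_out = (recon - data) * recon * (1.0 - recon)
        d_h = (d_out @ W) * h * (1.0 - h)
        W -= lr * (corrupted.T @ d_h + (h.T @ d_out).T) / len(data)
        b -= lr * d_h.mean(axis=0)
        c -= lr * d_out.mean(axis=0)
    return W, b  # decoder bias is discarded after pre-training

def build_stack(data, layer_sizes):
    """Layer-wise unsupervised pre-training: each DAE is trained on the
    (clean) hidden activations of the layer below."""
    stack, x = [], data
    for n_hidden in layer_sizes:
        W, b = pretrain_dae(x, n_hidden)
        stack.append((W, b))
        x = sigmoid(x @ W + b)
    return stack

def bottleneck_features(x, stack, W_bn, b_bn):
    """Forward pass through the pre-trained stack; the activations of the
    small bottleneck layer are the extracted feature vectors."""
    for W, b in stack:
        x = sigmoid(x @ W + b)
    return sigmoid(x @ W_bn + b_bn)

# Illustrative usage: 64 frames of 40-dim log mel filterbank coefficients
# (in practice, several neighbouring frames would be stacked as context).
frames = rng.normal(size=(64, 40))
stack = build_stack(frames, [64, 64])
W_bn = rng.normal(0.0, 0.01, (64, 42))  # 42-unit bottleneck (illustrative)
b_bn = np.zeros(42)
feats = bottleneck_features(frames, stack, W_bn, b_bn)
```

In the full system, a further hidden layer and a softmax classification layer would be appended after the bottleneck, the whole network fine-tuned on phonetic target states, and everything above the bottleneck discarded at feature-extraction time.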

Keywords:
Bottleneck, Computer science, Artificial intelligence, Artificial neural network, Deep learning, Discriminative model, Feature extraction, Time delay neural network, Preprocessor, Machine learning, Pattern recognition (psychology), Feature (linguistics), Speech recognition

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 46
Citation Normalized Percentile: 0.21

Topics

Speech Recognition and Synthesis (Physical Sciences → Computer Science → Artificial Intelligence)
Speech and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
© 2026 ScienceGate Book Chapters — All rights reserved.