JOURNAL ARTICLE

Speech Emotion Recognition Using Transfer Learning and Self-Supervised Speech Representation Learning

Abstract

Self-supervised speech representation learning (S3RL) models such as wav2vec 2.0, Hidden-unit BERT (HuBERT), and WavLM are pretrained on large amounts of speech data and yield general-purpose speech representations, which are then fine-tuned for downstream speech processing tasks such as automatic speech recognition (ASR). Despite their strong performance, these models have large architectures with many parameters, which makes fine-tuning them impractical for low-resource tasks such as speech emotion recognition. This paper introduces a small speech emotion recognition model based on HuBERT: the HuBERT convolutional feature encoder is transferred, and all of its transformer layers are replaced with a simple conformer block. This compact model is then trained on emotional speech signals. Experimental results indicate that the proposed model performs comparably to other well-performing S3RL models.
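To give a sense of what the transferred convolutional feature encoder hands to the conformer block, the sketch below computes how many feature frames it would produce for a waveform of a given length. The abstract does not list the layer configuration, so the kernel widths, strides, and 16 kHz sample rate are assumptions taken from the published wav2vec 2.0 / HuBERT Base feature encoder. The function names are hypothetical and this is not the authors' code.

```python
# Assumed layer configuration: the 7-layer convolutional feature encoder of
# wav2vec 2.0 / HuBERT Base. The abstract itself does not state these values.
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # Hz, assumed input sample rate


def num_frames(num_samples: int) -> int:
    """Number of feature frames produced for a waveform of num_samples samples."""
    length = num_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        if length < kernel:
            return 0  # input too short for this layer
        length = (length - kernel) // stride + 1  # unpadded 1-D convolution
    return length


def total_stride() -> int:
    """Hop between consecutive output frames, in input samples."""
    hop = 1
    for stride in STRIDES:
        hop *= stride
    return hop


def receptive_field() -> int:
    """Number of input samples that influence a single output frame."""
    field, jump = 1, 1
    for kernel, stride in zip(KERNELS, STRIDES):
        field += (kernel - 1) * jump
        jump *= stride
    return field


if __name__ == "__main__":
    print("hop (ms):", 1000 * total_stride() / SAMPLE_RATE)
    print("receptive field (ms):", 1000 * receptive_field() / SAMPLE_RATE)
    print("frames for 1 s of audio:", num_frames(SAMPLE_RATE))
```

Under these assumptions each frame covers 25 ms of audio with a 20 ms hop, so the conformer block that replaces the transformer stack would operate on roughly 50 frames per second of speech.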

Keywords:
Computer science; Speech recognition; Encoder; Feature learning; Speech processing; Transfer of learning; Artificial intelligence; Natural language processing; Language model; Speech analytics; Transformer; Emotion recognition; Representation (politics); Feature (linguistics); Convolutional neural network; Acoustic model

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 25
Citation Normalized Percentile: 0.17

Topics

Speech Recognition and Synthesis: Physical Sciences → Computer Science → Artificial Intelligence
Speech and Audio Processing: Physical Sciences → Computer Science → Signal Processing
Music and Audio Processing: Physical Sciences → Computer Science → Signal Processing