Self-supervised speech representation learning (S3RL) models such as wav2vec 2.0, Hidden-unit BERT (HuBERT), and WavLM are pretrained on large amounts of speech data and yield general-purpose speech representations that are then fine-tuned for downstream speech processing tasks such as automatic speech recognition (ASR). Despite their strong performance, these models have large architectures with many parameters, which makes fine-tuning them impractical for low-resource tasks such as speech emotion recognition. In this paper, we introduce a small model for speech emotion recognition based on HuBERT: we transfer the HuBERT convolutional feature encoder and replace all of its transformer layers with a simple Conformer block. This compact model is then trained on emotional speech signals. Experimental results indicate that the proposed model performs comparably to other well-performing S3RL models.
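To give a sense of the sequence the Conformer block would operate on, the following minimal sketch computes how many frames a HuBERT-style convolutional feature encoder emits for a given waveform length. The kernel sizes and strides are those of the published HuBERT Base configuration and are an assumption here, since the abstract does not state which encoder variant is transferred; the function and constant names are illustrative only.

```python
# Assumed layout of the 7-layer HuBERT Base convolutional feature encoder
# (kernel size, stride) per layer; not stated in the abstract itself.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def encoder_output_frames(num_samples: int) -> int:
    """Number of frames produced for a waveform of `num_samples` samples,
    assuming unpadded 1-D convolutions applied in sequence."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        if length < kernel:
            return 0  # input too short to produce any frame
        length = (length - kernel) // stride + 1
    return length


def total_stride() -> int:
    """Hop between consecutive output frames, in input samples."""
    hop = 1
    for _, stride in CONV_LAYERS:
        hop *= stride
    return hop


def receptive_field() -> int:
    """Number of input samples that influence one output frame."""
    field, hop = 1, 1
    for kernel, stride in CONV_LAYERS:
        field += (kernel - 1) * hop
        hop *= stride
    return field


if __name__ == "__main__":
    one_second = 16000  # samples at an assumed 16 kHz sampling rate
    print(encoder_output_frames(one_second), total_stride(), receptive_field())
```

Under these assumptions, one second of 16 kHz audio becomes a sequence of 49 frames (one every 320 samples, each covering 400 samples), which is the input length a downstream Conformer block would process.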