Recently, pre-trained models (PTMs) have been extensively applied in speaker verification (SV) and have greatly boosted system performance. However, mainstream PTMs currently concentrate on frame-level universal representations. In this paper, we propose a novel pre-training framework that jointly models speaker information, Speaker-Related HuBERT (SR-HuBERT). This framework aims to further exploit the speaker-related information inherent in universal speech representations. The proposed SR-HuBERT utilizes an unsupervised clustering algorithm based on graph structures to generate speaker pseudo-labels, and promotes the learning of segment-level speaker-related representations through a multi-task pre-training framework. Experimental results on the VoxCeleb1 test set demonstrate the effectiveness of the proposed SR-HuBERT. Even with limited fine-tuning data, SR-HuBERT outperforms other existing PTMs on SV tasks. Additionally, SR-HuBERT also performs well on the speaker-related tasks of the SUPERB benchmark.
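The abstract does not specify the graph-based clustering algorithm used to derive speaker pseudo-labels. As a purely illustrative sketch of the general idea, one simple variant builds a similarity graph over utterance-level speaker embeddings (nodes connected when cosine similarity exceeds a threshold) and assigns each connected component a pseudo-label; the function name and threshold below are hypothetical, not from the paper.

```python
import numpy as np

def graph_cluster_pseudo_labels(embeddings, threshold=0.75):
    """Assign speaker pseudo-labels via connected components of a
    cosine-similarity graph (illustrative sketch, not the paper's method)."""
    # L2-normalize so the dot product equals cosine similarity
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    adj = (x @ x.T) >= threshold  # adjacency: edge if similarity >= threshold
    n = len(x)
    labels = -np.ones(n, dtype=int)
    current = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        # Depth-first search to label one connected component
        stack = [i]
        labels[i] = current
        while stack:
            j = stack.pop()
            for k in np.nonzero(adj[j])[0]:
                if labels[k] < 0:
                    labels[k] = current
                    stack.append(k)
        current += 1
    return labels

# Two well-separated directions yield two pseudo-speaker clusters
emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
labels = graph_cluster_pseudo_labels(emb)
```

In a pre-training pipeline, such pseudo-labels would supply the segment-level speaker targets for the multi-task objective alongside HuBERT's frame-level targets.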
Yishuang Li, Wenhao Guan, Hukai Huang, Shiyu Miao, Qi Su, Lin Li, Qingyang Hong