Recently, segmented utterance-level modeling based on the Graph Attention Network (GAT) has proven effective in Clustering-based Speaker Diarization (CSD). However, existing methods rely only on message passing from a single neighbor per layer, ignoring the influence of sub-region and global information. In this paper, we propose the clustering-driven multi-hop Graph Attention Network (CD-MGAT), which combines a multi-hop neighbor module with a clustering-oriented prototype module to explore sub-region and global information for each segmented utterance. Specifically, the two modules interact adaptively through a clustering-consistency loss, which keeps the learned prototypes consistent with the speaker embeddings. Extensive experiments on the AMI datasets demonstrate the effectiveness of our solution.
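As a rough illustration of the multi-hop idea only, the following sketch widens the attention neighborhood of each segment node from direct neighbors to all nodes reachable within a fixed number of hops. It is not the CD-MGAT implementation: the function names `multi_hop_mask` and `masked_attention`, the toy chain graph, and the use of scaled dot-product scores in place of GAT's learned additive attention are all assumptions made for brevity.

```python
import numpy as np


def multi_hop_mask(adj, hops):
    """Boolean mask of node pairs reachable within `hops` steps (self included).

    Hypothetical helper: `adj` is a square adjacency matrix over segments.
    """
    n = adj.shape[0]
    a = (adj > 0).astype(int)
    reach = np.eye(n, dtype=bool)   # hop 0: every node reaches itself
    step = np.eye(n, dtype=int)
    for _ in range(hops):
        step = (step @ a > 0).astype(int)  # nodes exactly one more hop away
        reach |= step.astype(bool)
    return reach


def masked_attention(x, mask):
    """Scaled dot-product attention restricted to the pairs allowed by `mask`.

    Assumption: dot-product scores stand in for GAT's learned attention.
    """
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = np.where(mask, scores, -np.inf)      # block disallowed pairs
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights


# Toy chain graph 0 - 1 - 2 - 3 with random 8-dimensional segment embeddings.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
x = np.random.default_rng(0).normal(size=(4, 8))

one_hop = multi_hop_mask(adj, hops=1)
two_hop = multi_hop_mask(adj, hops=2)
updated, weights = masked_attention(x, two_hop)
```

With `hops=1`, node 0 attends only to itself and node 1; with `hops=2`, node 2 also enters its neighborhood, while node 3 stays masked out.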
Yi Wei, Haiyan Guo, Zirui Ge, Zhen Yang
Huan Song, Megan Willi, Jayaraman J. Thiagarajan, Visar Berisha, Andreas Spanias
Tong Wang, WU Jun-hua, Zhenquan Zhang, Wen Zhou, Guang Chen, Shasha Liu
Prachi Singh, Amrit Kaul, Sriram Ganapathy