This paper presents a novel attention model design for text-independent speaker verification. The model takes a pair of input utterances and generates an utterance-level embedding that represents the speaker-specific characteristics of each utterance. The two utterances are expected to have highly similar embeddings if they come from the same speaker. The proposed attention model consists of a self-attention module and a mutual-attention module, which jointly contribute to the generation of the utterance-level embedding. The self-attention weights are computed from the utterance itself, while the mutual-attention weights are computed with the involvement of the other utterance in the input pair. As a result, each utterance is represented by a self-attention weighted embedding and a mutual-attention weighted embedding. The similarity between the embeddings is measured by a cosine distance score and a binary classifier output score. The whole model, named Dual Attention Network, is trained end-to-end on the VoxCeleb database. Evaluation results on the VoxCeleb1 test set show that the Dual Attention Network significantly outperforms the baseline systems. The best result yields an equal error rate of 1.6%.
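The two attention paths described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this paper: the scoring vector `w`, the use of the other utterance's mean frame as the mutual-attention query, the concatenation of the two weighted embeddings, and the random frame features are all assumptions made purely for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of frame scores.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def self_attention_pool(frames, w):
    # frames: (T, D) frame-level features; w: (D,) hypothetical scoring vector.
    # The weights depend only on the utterance itself.
    weights = softmax(frames @ w)
    return weights @ frames  # (D,) weighted mean of the frames

def mutual_attention_pool(frames, other_summary):
    # The weights depend on the other utterance in the pair, summarised here
    # (an assumption) by a single (D,) vector.
    weights = softmax(frames @ other_summary)
    return weights @ frames

def cosine_score(a, b):
    # Cosine similarity between two embeddings.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
utt_a = rng.standard_normal((50, 8))  # 50 frames, 8-dim features (toy data)
utt_b = rng.standard_normal((60, 8))  # 60 frames, 8-dim features (toy data)
w = rng.standard_normal(8)            # stands in for a learned parameter

self_a = self_attention_pool(utt_a, w)
self_b = self_attention_pool(utt_b, w)
mutual_a = mutual_attention_pool(utt_a, utt_b.mean(axis=0))
mutual_b = mutual_attention_pool(utt_b, utt_a.mean(axis=0))

# Assumed combination: concatenate both weighted embeddings per utterance.
emb_a = np.concatenate([self_a, mutual_a])
emb_b = np.concatenate([self_b, mutual_b])
score = cosine_score(emb_a, emb_b)
print(f"cosine score: {score:.3f}")
```

In a trained system the frame features would come from a neural front end and `w` would be learned; the sketch only shows how the self-attention weights use one utterance while the mutual-attention weights also use its partner.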
Tianyan Zhou, Yong Zhao, Jinyu Li, Yifan Gong, Jian Wu
Tengyue Bian, Fangzhou Chen, Xu Li