In this work, we develop an end-to-end approach for the text-dependent speaker verification task. In this method, phonetic labels are fused with spectral features and used to train a neural network that makes a same/different speaker decision. The test data are obtained from a real call center integrated voice response system and consist of audio from calls made by people at different times, in which they utter a specific, short sentence in Turkish. We investigate the contribution of in-domain data containing the target sentence, and of free-format human-human call data, to model training. To include phonetic information in the model, three different methods are applied: phoneme boundary, utterance boundary, and phoneme boundary group. Test results show that we attain an equal error rate of 10.7% for speaker verification on the given dataset.
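For readers unfamiliar with the metric, the equal error rate is the operating point at which the false acceptance rate (different-speaker trials accepted) equals the false rejection rate (same-speaker trials rejected). The following is a minimal sketch of one common way to estimate it from trial scores, not the evaluation code used in this work. The function name `equal_error_rate`, the convention that a higher score means "same speaker", and the example scores are all assumptions made for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER and its threshold from verification trial scores.

    target_scores: scores of same-speaker trials (assumed: higher = more similar).
    nontarget_scores: scores of different-speaker trials.
    Returns (eer, threshold), where a trial is accepted if score >= threshold.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")

    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))

    best = None  # (gap between FAR and FRR, EER estimate, threshold)
    for t in thresholds:
        # False acceptance rate: different-speaker trials that pass the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False rejection rate: same-speaker trials that fall below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # With finitely many trials FAR and FRR may never be exactly equal,
            # so report their average at the closest crossing point.
            best = (gap, (far + frr) / 2.0, t)

    return best[1], best[2]


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    same_speaker = [0.9, 0.8, 0.7, 0.4]
    different_speaker = [0.6, 0.3, 0.2, 0.1]
    eer, threshold = equal_error_rate(same_speaker, different_speaker)
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

This sweep is quadratic in the number of trials, which is acceptable for small test sets; larger evaluations usually sort the scores once and accumulate the error counts instead.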
Wan Lin, Junhui Chen, Tianhao Wang, Zhenyu Zhou, Lantian Li, Dong Wang
Giacomo Valenti, Adrien Daniel, Nicholas Evans
Georg Heigold, Ignacio López Moreno, Samy Bengio, Noam Shazeer