This paper studies voice wake-up and builds a 12-layer depthwise separable convolutional neural network (DSCNN). After feature extraction, the model decides whether the wake-up word was spoken by performing binary classification on the feature spectrum. With "HelloMia" as the wake-up word, the training set contains 7,982 positive speech samples labeled $(1,0)$ and 1,315 negative speech samples labeled $(0,1)$. With a batch normalization (BN) layer introduced, the model converges at 0.3 epochs. Accuracy is 0.9994 on a test set of 10,000 positive samples and 0.9889 on a test set of 2,362 negative samples, which corresponds to a wake-up rate of 99.94% and a false wake-up rate of only 1.11%. Compared with an ordinary convolutional model, DSCNN greatly reduces the number of parameters and memory consumption, with no loss in convergence speed or training performance.
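The parameter saving claimed for depthwise separable convolutions can be illustrated with a short count. The sketch below is not taken from the paper: the kernel size and channel widths (3, 64, 64) are hypothetical values chosen only for illustration, and bias terms are ignored. It compares one standard convolution layer with its depthwise separable counterpart (a per-channel depthwise convolution followed by a 1x1 pointwise convolution).

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k conv plus a 1x1 pointwise conv (bias ignored)."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1x1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer shape, assumed for illustration only.
K, C_IN, C_OUT = 3, 64, 64

std = standard_conv_params(K, C_IN, C_OUT)
dsc = depthwise_separable_params(K, C_IN, C_OUT)
ratio = dsc / std  # equals 1/C_OUT + 1/K**2

print(f"standard: {std}, depthwise separable: {dsc}, ratio: {ratio:.4f}")
```

Under these assumed sizes the separable layer needs roughly one eighth of the weights of the standard layer; the actual saving in the DSCNN depends on its real layer shapes, which the abstract does not give.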
Chen Yi-fang, Peng Feng, Xiangui Kang, Zexin Wang
Jianzhong Yuan, Wujie Zhou, Sijia Lv, Yuzhen Chen
Kevin Maulana Afriyanto, Abas Setiawan
Yue Lu, Jianguo Jiang, Min Yu, Chao Liu, Chaochao Liu, Weiqing Huang, Zhiqiang Lv