Voice Deepfake Detection Using the Self-Supervised Pre-Training Model HuBERT

Lanting Li; Tianliang Lu; Xingbang Ma; Mengjiao Yuan; Da Wan

doi:10.3390/app13148488

JOURNAL ARTICLE

Year: 2023; Journal: Applied Sciences; Vol: 13 (14); Pages: 8488; Publisher: Multidisciplinary Digital Publishing Institute

Abstract

Voice deepfake technology has developed rapidly in recent years, but current detection methods generalize poorly and extract insufficient features when faced with unknown attacks. To address these problems, this paper presents a forged speech detection method (HuRawNet2_modified) based on the self-supervised pre-trained model HuBERT. A combination of impulsive signal-dependent additive noise and additive white Gaussian noise was adopted for data boosting and augmentation, and the HuBERT model was fine-tuned on databases in different languages. On this basis, the size of the extracted feature maps was modified independently by the α-feature map scaling (α-FMS) method, within a modified end-to-end model that uses RawNet2 as its backbone. The results showed that the HuBERT model could extract features more comprehensively and accurately. The best results were an equal error rate (EER) of 2.89% and a minimum tandem detection cost function (min t-DCF) of 0.2182 on the ASVspoof 2021 LA challenge database, which supports the effectiveness of the proposed detection method. Compared with the baseline systems on the ASVspoof 2021 LA challenge and FMFCC-A databases, both EER and min t-DCF decreased. The results also showed that the self-supervised pre-trained model with fine-tuning can extract acoustic features across languages, and that detection improves slightly when the pre-training database and the fine-tuning and test databases are in the same language.
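As a rough illustration of the EER metric reported in the abstract, the sketch below estimates it from two sets of detector scores. It is not the authors' scoring code and not the official ASVspoof evaluation tooling; the function name `compute_eer`, the convention that a higher score means "more likely bona fide", and the toy scores are all assumptions made for this example.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the equal error rate from detector scores.

    Assumed convention: higher score = more likely bona fide.
    Returns (eer, threshold) at the candidate threshold where the
    false acceptance and false rejection rates are closest.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # False rejection rate: bona fide trials scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER lies where the two error curves cross; with a finite set of
    # scores, take the closest point and average the two rates.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2.0
    return eer, thresholds[idx]


# Toy scores, invented purely for illustration.
toy_bonafide = [0.9, 0.8, 0.7, 0.3]
toy_spoof = [0.1, 0.2, 0.4, 0.6]
eer, threshold = compute_eer(toy_bonafide, toy_spoof)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

On these toy scores one bona fide trial and one spoofed trial fall on the wrong side of the threshold, so both error rates equal 25%. The min t-DCF figure is a separate, cost-weighted metric that also depends on a speaker verification system and is not covered by this sketch.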

Keywords:

Computer science; Generalization; Pattern recognition (psychology); Artificial intelligence; Mixture model; Feature (linguistics); Boosting (machine learning); Speech recognition; Machine learning; Mathematics

Metrics

FWCI (Field Weighted Citation Impact): 5.62

Citation Normalized Percentile: 0.95

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

Fall Detection Using Self-Supervised Pre-Training Model

HuBERT Ensemble Models for Singing Voice Deepfake Detection

Audio Deepfake Detection via Dual Branch Classifier with Self-Supervised Pre-Trained Model

DeepFake Videos Detection Using Self-Supervised Decoupling Network