Levine, Gabriel; Thurlow, Drew; Levitan, Sarah Ita; Arfa, Jon
Tools for singing voice synthesis and music-focused vocal conversion have greatly improved in both quality and ease of use, leading to an explosion of music with synthetic vocals. This proliferation has made it difficult for listeners to discern human vocals from deepfake and synthetic ones. While there are robust approaches for detecting synthetic speech and vocal spoofing, identifying synthetic singing voices presents a unique set of challenges. In this paper, we present a new, publicly available dataset of labeled music tracks containing human and synthetic vocals, and we evaluate existing synthetic speech detection models on it. We also introduce a novel ensemble approach that combines high-level speech representations from HuBERT embeddings with a CNN classifier trained on traditional low-level audio features. Our evaluation confirms this to be an effective approach. We share our results, trained models, and labeled dataset to encourage future research.
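The abstract describes an ensemble of a HuBERT-embedding-based detector and a low-level-feature CNN but does not specify how their outputs are combined. The sketch below illustrates one common option, score-level (late) fusion by weighted averaging of per-track probabilities; the function name, weighting scheme, and example scores are illustrative assumptions, not the paper's method.

```python
import numpy as np

def ensemble_scores(hubert_scores, cnn_scores, weight=0.5):
    """Illustrative score-level fusion of two synthetic-vocal detectors.

    hubert_scores: per-track probabilities from a classifier on HuBERT embeddings
    cnn_scores:    per-track probabilities from a CNN on low-level audio features
    weight:        weight given to the HuBERT-based scores (assumed, not from the paper)
    """
    hubert_scores = np.asarray(hubert_scores, dtype=float)
    cnn_scores = np.asarray(cnn_scores, dtype=float)
    # Weighted average keeps the fused output in [0, 1] when inputs are probabilities.
    return weight * hubert_scores + (1.0 - weight) * cnn_scores

# Example: fuse hypothetical scores for three tracks.
fused = ensemble_scores([0.9, 0.2, 0.6], [0.7, 0.4, 0.6])
print(fused)  # [0.8 0.3 0.6]
```

A weighted average is only one choice; alternatives include logistic-regression stacking over the two score streams or majority voting over binary decisions.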