Singing Voice Conversion (SVC) is the task of converting the timbre of a source singer to that of a target singer without modifying the linguistic content or intonation. A main concern when building an SVC system is the choice of content representation, which largely determines the intelligibility of the converted voices. One approach is to use the Cotatron encoder; however, it has a major drawback: it requires lyrics transcriptions as input. To remove this dependency, a recent line of work in Automatic Speech Recognition, known as Self-Supervised Speech Representations, extracts robust latent representations from large-scale unlabeled speech corpora. A popular family of such algorithms is VQ-Wav2Vec, which has already been applied to voice conversion in the speech domain, but its use for Singing Voice Conversion has not yet been explored. In this master thesis, we implement a new Singing Voice Conversion system based on VQ-Wav2Vec features and compare its performance against Cotatron. Through subjective listening tests and Word Error Rate computation, we find that self-supervised VQ-Wav2Vec content features yield higher intelligibility than the transcription-guided content features extracted with Cotatron. In addition, singer voice similarity is slightly improved when using VQ-Wav2Vec features.
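As a rough illustration of the Word Error Rate metric mentioned above (a minimal sketch, not the thesis's actual evaluation code), WER is the word-level Levenshtein edit distance between a recognized transcript and the reference lyrics, normalized by the reference length:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / #reference words,
    computed with a standard Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions to reach empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions from empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution out of four reference words -> WER = 0.25
print(word_error_rate("the sun will rise", "the son will rise"))  # 0.25
```

In practice, a converted singing clip is transcribed by an ASR system and its transcript is scored against the ground-truth lyrics; lower WER indicates more intelligible output.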