Singing Voice Conversion (SVC) is the task of converting the timbre of a source singer to that of a target singer without modifying the linguistic content or intonation. A main concern when building an SVC system is the choice of content representation, which largely determines the intelligibility of the converted voices. One approach is to use the Cotatron encoder; however, it has a major drawback: it requires lyrics transcriptions as input. To remove this dependency, a recent line of work in Automatic Speech Recognition, known as Self-Supervised Speech Representations, extracts robust latent representations from large-scale unlabeled speech corpora. A popular family of such algorithms is VQ-Wav2Vec, which has already been applied to voice conversion in the speech domain, but its use for Singing Voice Conversion has not yet been explored. In this master thesis, we implement a new Singing Voice Conversion system based on VQ-Wav2Vec features and compare its performance against Cotatron. Through subjective listening tests and Word Error Rate computation, we find that self-supervised VQ-Wav2Vec content features yield higher intelligibility than the transcription-guided content features extracted with Cotatron. In addition, singer voice similarity is slightly improved when using VQ-Wav2Vec features.
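As a rough illustration of the Word Error Rate metric mentioned above (a minimal sketch, not the thesis's actual evaluation code), WER is the word-level Levenshtein edit distance between a recognized transcript and the reference lyrics, normalized by the reference length:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / #reference words,
    computed with a standard Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions to reach empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions from empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution out of four reference words -> WER = 0.25
print(word_error_rate("the sun will rise", "the son will rise"))  # 0.25
```

In practice, a converted singing clip is transcribed by an ASR system and its transcript is scored against the ground-truth lyrics; lower WER indicates more intelligible output.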