ACE-VC: Adaptive and Controllable Voice Conversion Using Explicitly Disentangled Self-Supervised Speech Representations

Shehzeen Hussain; Paarth Neekhara; Jocelyn Huang; Jason Li; Boris Ginsburg

doi:10.1109/icassp49357.2023.10094850

CONFERENCE PAPER

Year: 2023

Abstract

In this work, we propose a zero-shot voice conversion method using speech representations trained with self-supervised learning. First, we develop a multi-task model to decompose a speech utterance into features such as linguistic content, speaker characteristics, and speaking style. To disentangle content and speaker representations, we propose a training strategy based on Siamese networks that encourages similarity between the content representations of the original and pitch-shifted audio. Next, we develop a synthesis model with pitch and duration predictors that can effectively reconstruct the speech signal from its decomposed representation. Our framework allows controllable and speaker-adaptive synthesis to perform zero-shot any-to-any voice conversion, achieving state-of-the-art results on metrics evaluating speaker similarity, intelligibility, and naturalness. Using just 10 seconds of data for a target speaker, our framework can perform voice swapping and achieves a speaker verification EER of 5.5% for seen speakers and 8.4% for unseen speakers.
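The Siamese training idea in the abstract can be illustrated with a small sketch. The abstract does not state the exact loss, so this example assumes a cosine-similarity objective computed per frame between the content vectors of the original utterance and those of its pitch-shifted copy. The function names `cosine_similarity` and `content_similarity_loss` are hypothetical, and the content encoder and the pitch shifting themselves are not modeled here; the inputs are plain lists of frame vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: treat a zero vector as dissimilar.
        return 0.0
    return dot / (norm_u * norm_v)


def content_similarity_loss(frames_original, frames_shifted):
    """Assumed Siamese objective: 1 minus the mean per-frame cosine similarity.

    frames_original: content vectors for the original audio, one per frame.
    frames_shifted:  content vectors for the pitch-shifted audio, same length.
    The loss is 0 when every frame pair points in the same direction and
    grows as the two content sequences diverge (maximum 2).
    """
    if len(frames_original) != len(frames_shifted):
        raise ValueError("both utterances must have the same number of frames")
    if not frames_original:
        raise ValueError("at least one frame is required")
    sims = [
        cosine_similarity(u, v)
        for u, v in zip(frames_original, frames_shifted)
    ]
    return 1.0 - sum(sims) / len(sims)


if __name__ == "__main__":
    original = [[1.0, 0.0], [0.0, 2.0]]
    shifted = [[2.0, 0.0], [0.0, 1.0]]  # same directions, different scale
    print(content_similarity_loss(original, shifted))
```

Minimizing a loss of this kind pushes the encoder to produce the same content representation regardless of pitch, which is one way to discourage speaker-dependent information from leaking into the content features.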

Keywords: Naturalness; Computer science; Speech recognition; Utterance; Intelligibility (philosophy); Speaker recognition; Speech synthesis; Similarity (geometry); Artificial intelligence; Speech processing; Natural language processing

Metrics

FWCI (Field Weighted Citation Impact): 3.07

Citation Normalized Percentile: 0.90

Topics

Speech Recognition and Synthesis (Physical Sciences → Computer Science → Artificial Intelligence)

Speech and Audio Processing (Physical Sciences → Computer Science → Signal Processing)

Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)

Related Documents

S3PRL-VC: Open-Source Voice Conversion Framework with Self-Supervised Speech Representations

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

DisC-VC: Disentangled and F₀-Controllable Neural Voice Conversion

CLESSR-VC: Contrastive learning enhanced self-supervised representations for one-shot voice conversion

Transcription-Guided and Self-Supervised Speech Representations for Singing Voice Conversion
