Audio demos for thesis: "Disentanglement Learning for Text-Free Voice Conversion"

Mingjie Chen, University of Sheffield

Introduction

This thesis studies disentanglement learning methods for voice conversion (VC). This demo page presents audio samples from three experiments in the thesis. The first experiment compares the VQ-WAE, IN-WAE, WAE and SVQ-WAE models on a many-to-many VC task on the VCTK dataset. The second experiment compares the robustness of WAGAN-VC and baseline models under varying amounts of training data. The third experiment compares four systems, each composed of a different linguistic encoder and decoder, on three VC tasks. The setup of each experiment is introduced in the corresponding section.

Experiment 1

In this experiment, we provide audio samples from the two proposed models (IN-WAE and SVQ-WAE) and two baseline models (WAE and VQ-WAE).

Source Target VQ-WAE WAE IN-WAE SVQ-WAE

Experiment 2

In this experiment, three models (StarGAN-VC, StarGAN-VC2 and WAGAN-VC) are compared in two sessions. Session 1 explores six training-data conditions with varying numbers of speakers (N) and training samples per speaker (M), while keeping the total number of training samples approximately fixed. Session 2 explores four training-data conditions with a fixed number of speakers and a decreasing number of training samples per speaker.

Session 1: exploring the number of speakers N and the number of training samples per speaker M

N M Source Target StarGAN-VC StarGAN-VC2 WAGAN-VC
109 35
90 40
60 60
40 90
20 180
10 360

Session 2: decreasing the number of training samples per speaker M

N M Source Target StarGAN-VC StarGAN-VC2 WAGAN-VC
109 35
109 20
109 10
109 5

Experiment 3

In this section, four encoder-decoder VC systems are compared on three VC tasks. We first present the four systems, which combine different linguistic encoders and decoders. We then present audio demos on three tasks: a many-to-many VC task on VCTK, an intra-lingual one-shot VC task on VCC2020, and a cross-lingual one-shot VC task on VCC2020.

System index Linguistic encoder Decoder
Sys-1 VQ-Wav2vec Taco-AR
Sys-2 VQ-Wav2vec FastSpeech
Sys-3 ASR-BNE Taco-AR
Sys-4 ASR-BNE FastSpeech

Session 1: many-to-many VC on VCTK

Source Target Sys-1 Sys-2 Sys-3 Sys-4

Session 2: intra-lingual one-shot VC on VCC2020

Source Target Sys-1 Sys-2 Sys-3 Sys-4

Session 3: cross-lingual one-shot VC on VCC2020

Source Target Sys-1 Sys-2 Sys-3 Sys-4