Audio demos for thesis: "Disentanglement Learning for Text-Free Voice Conversion"

Mingjie Chen, University of Sheffield

Introduction

This thesis studies disentanglement learning methods for Voice Conversion (VC). This demo page presents audio samples from three experiments in the thesis. The first experiment studies the VQ-WAE, IN-WAE, WAE and SVQ-WAE models on a many-to-many VC task on the VCTK dataset. The second experiment compares the performance robustness of WAGAN-VC against baseline models. The third experiment explores four types of systems, each combining a different linguistic encoder and decoder, on three VC tasks. Details of each experimental setup are given in the corresponding experiment section.

Experiment 1
Experiment 2
Experiment 3

Experiment 1

In this experiment, we provide audio samples from two proposed models (IN-WAE, SVQ-WAE) and two baseline models (WAE and VQ-WAE).

Source	Target	VQ-WAE	WAE	IN-WAE	SVQ-WAE

Experiment 2

In this experiment, three models (StarGAN-VC, StarGAN-VC2 and WAGAN-VC) are studied in two sessions. Session 1 explores six training data conditions with varying numbers of speakers (N) and numbers of training samples per speaker (M), while keeping the total number of training samples fixed. Session 2 explores four training data conditions with a fixed number of speakers and a decreasing number of training samples per speaker.

Session 1: varying the number of speakers N and the number of training samples per speaker M

N	M	Source	Target	StarGAN-VC	StarGAN-VC2	WAGAN-VC
109	35
90	40
60	60
40	90
20	180
10	360
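
As a quick check on the fixed-total design of Session 1, the product N × M can be computed for each row of the table above. This is a minimal sketch that assumes the total number of training samples is simply N × M; the helper name total_samples is hypothetical and does not come from the thesis.

```python
# (N speakers, M training samples per speaker) pairs copied from the Session 1 table.
session1 = [(109, 35), (90, 40), (60, 60), (40, 90), (20, 180), (10, 360)]


def total_samples(n_speakers, samples_per_speaker):
    """Assumed total training set size: speakers times samples per speaker."""
    return n_speakers * samples_per_speaker


totals = [total_samples(n, m) for n, m in session1]

for (n, m), total in zip(session1, totals):
    print(f"N={n:>3}  M={m:>3}  N*M={total}")
```

Under this assumption, every condition gives 3600 samples except N = 109, M = 35, which gives 3815.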

Session 2: decreasing the number of training samples per speaker M

N	M	Source	Target	StarGAN-VC	StarGAN-VC2	WAGAN-VC
109	35
109	20
109	10
109	5

Experiment 3

In this section, four encoder-decoder VC systems are compared on three VC tasks. We first present the four VC systems, which differ in their linguistic encoders and decoders. We then present audio demos for the three VC tasks: a many-to-many VC task on VCTK, an intra-lingual one-shot VC task on VCC2020 and a cross-lingual one-shot VC task on VCC2020.

System index	Linguistic encoder	Decoder
Sys-1	VQ-Wav2vec	Taco-AR
Sys-2	VQ-Wav2vec	FastSpeech
Sys-3	ASR-BNE	Taco-AR
Sys-4	ASR-BNE	FastSpeech

Session 1: many-to-many VC on VCTK

Source	Target	Sys-1	Sys-2	Sys-3	Sys-4

Session 2: intra-lingual one-shot VC on VCC2020

Source	Target	Sys-1	Sys-2	Sys-3	Sys-4

Session 3: cross-lingual one-shot VC on VCC2020

Source	Target	Sys-1	Sys-2	Sys-3	Sys-4