EasyVC: A Toolkit for Any-to-Any Encoder-Decoder Voice Conversion Systems

Mingjie Chen, Thomas Hain

Department of Computer Science, University of Sheffield

Abstract. Current state-of-the-art voice conversion (VC) systems are typically built on an encoder-decoder framework, in which encoders extract linguistic, speaker or prosodic features from speech, and a decoder generates speech from these features. Recently, an increasing number of advanced models have been deployed as encoders or decoders for VC. Although these models achieve good performance, their effects have not been fully studied. Meanwhile, VC technology has been applied in a variety of scenarios, which poses many challenges for VC techniques. Hence, studying and understanding encoders and decoders is becoming necessary and important. However, due to the complexity of VC systems, it is not always easy to compare and analyse these encoders and decoders. This paper introduces EasyVC, a toolkit built upon the encoder-decoder framework. EasyVC supports a number of encoders and decoders within a unified framework, which makes VC training, inference, evaluation and deployment easy and convenient. EasyVC provides step-wise recipes covering everything from dataset downloading to objective evaluation and online demo presentation. Furthermore, EasyVC focuses on challenging VC scenarios such as one-shot, emotional, singing and real-time conversion, which have not yet been fully studied. EasyVC can help researchers and developers investigate the modules of VC systems and promote the development of VC techniques.

Encoder-Decoder Voice Conversion Framework



Here we introduce the encoder-decoder framework for VC systems. As shown in the figure, this framework is typically composed of three encoders, a decoder and a vocoder. More specifically, the three encoders extract representations from speech: a linguistic encoder, a prosodic encoder and a speaker encoder. A decoder then reconstructs speech mel-spectrograms. Finally, a vocoder converts the mel-spectrograms to waveforms. Note that this repo also supports decoders that directly reconstruct waveforms (e.g. VITS); in that case, no vocoder is needed.
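The data flow above can be sketched as follows. This is a minimal illustration of the framework, not EasyVC's actual API: the function names, feature dimensions and hop size (160 samples) are assumptions chosen for clarity.

```python
# Hypothetical sketch of the encoder-decoder VC pipeline.
# All names and shapes are illustrative, not EasyVC's real modules.
import numpy as np

HOP = 160  # assumed frame hop in samples

def linguistic_encoder(wav):
    # e.g. VQ-Wav2vec or ConformerPPG: frame-level content features
    return np.zeros((len(wav) // HOP, 256))

def prosodic_encoder(wav):
    # e.g. continuous log-F0: one prosody value per frame
    return np.zeros((len(wav) // HOP, 1))

def speaker_encoder(wav):
    # e.g. d-vector: a single utterance-level speaker embedding
    return np.zeros(128)

def decoder(ling, pros, spk):
    # e.g. FastSpeech2 / TacoAR / TacoMOL: predict 80-dim mel frames,
    # conditioning every frame on the speaker embedding
    return np.zeros((ling.shape[0], 80))

def vocoder(mel):
    # e.g. HifiGAN: mel-spectrogram -> waveform
    return np.zeros(mel.shape[0] * HOP)

def convert(src_wav, tgt_wav):
    ling = linguistic_encoder(src_wav)  # content from the source speaker
    pros = prosodic_encoder(src_wav)    # prosody from the source speaker
    spk = speaker_encoder(tgt_wav)      # identity from the target speaker
    mel = decoder(ling, pros, spk)
    return vocoder(mel)

out = convert(np.zeros(16000), np.zeros(32000))
```

The key design point is that only the speaker encoder sees the target utterance; the linguistic and prosodic encoders operate on the source, which is what enables any-to-any (one-shot) conversion.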

One-shot Voice Conversion Results

In this section, we provide objective results and audio demos from a simple one-shot VC comparison experiment. We use VQ-Wav2vec and ConformerPPG as linguistic encoders, and FastSpeech2, TacoAR and TacoMOL as decoders. The training dataset is LibriTTS-460-clean and the testing dataset is VCTK. We use d-vector as the speaker encoder, ppgvc_f0 as the prosodic encoder, and ppgvc_hifigan as the vocoder.

List of VC systems:

| Index | Linguistic Encoder | Speaker Encoder | Prosodic Encoder | Decoder | Vocoder |
|---|---|---|---|---|---|
| System 0 | VQ-Wav2vec | D-Vector | Cont-Log-F0 | FastSpeech2 | HifiGAN |
| System 1 | VQ-Wav2vec | D-Vector | Cont-Log-F0 | TacoAR | HifiGAN |
| System 2 | VQ-Wav2vec | D-Vector | Cont-Log-F0 | TacoMOL | HifiGAN |
| System 3 | ConformerPPG | D-Vector | Cont-Log-F0 | FastSpeech2 | HifiGAN |
| System 4 | ConformerPPG | D-Vector | Cont-Log-F0 | TacoAR | HifiGAN |
| System 5 | ConformerPPG | D-Vector | Cont-Log-F0 | TacoMOL | HifiGAN |
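As an illustration, one of the systems above (System 3) could be described by a configuration like the following. The field names and structure here are assumptions for illustration, not EasyVC's actual configuration schema.

```python
# Hypothetical config for System 3 (field names are illustrative,
# not EasyVC's real configuration schema).
system_3 = {
    "linguistic_encoder": "ConformerPPG",
    "speaker_encoder": "d-vector",
    "prosodic_encoder": "ppgvc_f0",      # continuous log-F0
    "decoder": "FastSpeech2",
    "vocoder": "ppgvc_hifigan",
    "train_dataset": "LibriTTS-460-clean",
    "test_dataset": "VCTK",
}
print(system_3["decoder"])
```

Swapping one field (e.g. `decoder` to `TacoAR`) while keeping the rest fixed is exactly the kind of controlled module comparison the toolkit is meant to support.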
Objective Results

| VC System | Predicted MOS | ASR WER (%) | ASV EER (%) |
|---|---|---|---|
| System 0 | 3.78 | 21.4 | 15.5 |
| System 1 | 3.75 | 21.2 | 19.8 |
| System 2 | 3.33 | 22.4 | 16.6 |
| System 3 | 3.88 | 15.0 | 14.4 |
| System 4 | 3.82 | 17.1 | 23.3 |
| System 5 | 3.12 | 21.2 | 16.4 |
Audio Demos
(Samples: source audio, target audio, and the converted outputs of Systems 0-5.)