Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS
The bigvgan-mix-v2 tree has good audio quality.
The RoFormer-HiFTNet tree has fast inference speed.
No More Upgrade
- This project targets deep learning beginners; basic knowledge of Python and PyTorch is a prerequisite.
- This project aims to help deep learning beginners move beyond purely theoretical study and master the basics of deep learning through practice.
- This project does not support real-time voice conversion (whisper would need to be replaced if real-time conversion is what you are looking for).
- This project will not develop one-click packages for other purposes.
- A minimum VRAM requirement of 6GB for training
- Support for multiple speakers
- Create unique speakers through speaker mixing
- It can even convert voices with light accompaniment
- You can edit F0 using Excel
AI_Elysia_LoveStory.mp4
Powered by @ShadowVap
Feature | From | Status | Function |
---|---|---|---|
whisper | OpenAI | ✅ | strong noise immunity |
bigvgan | NVIDIA | ✅ | anti-aliasing and snake activation |
natural speech | Microsoft | ✅ | reduce mispronunciation |
neural source-filter | NII | ✅ | solve the problem of audio F0 discontinuity |
speaker encoder | | ✅ | Timbre Encoding and Clustering |
GRL for speaker | Ubisoft | ✅ | Preventing Encoder Leakage Timbre |
SNAC | Samsung | ✅ | One Shot Clone of VITS |
SCLN | Microsoft | ✅ | Improve Clone |
Diffusion | HuaWei | ✅ | Improve sound quality |
PPG perturbation | this project | ✅ | Improved noise immunity and de-timbre |
HuBERT perturbation | this project | ✅ | Improved noise immunity and de-timbre |
VAE perturbation | this project | ✅ | Improve sound quality |
MIX encoder | this project | ✅ | Improve conversion stability |
USP infer | this project | ✅ | Improve conversion stability |
HiFTNet | Columbia University | ✅ | NSF-iSTFTNet for speed up |
RoFormer | Zhuiyi Technology | ✅ | Rotary Positional Embeddings |
Due to the use of data perturbation, training takes longer than in other projects.
USP: Unvoiced and Silence with Pitch during inference
Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion
- Install PyTorch.
- Install project dependencies:
  pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt
  Note: whisper is already built in; do not install it again, otherwise it will cause conflicts and errors.
- Download the Timbre Encoder: Speaker-Encoder by @mueller91, and put best_model.pth.tar into speaker_pretrain/.
- Download the whisper model whisper-large-v2. Make sure to download large-v2.pt, and put it into whisper_pretrain/.
- (Optional) Download the whisper model whisper-large-v3. Make sure to download large-v3.pt, and put it into whisper_pretrain/.
- Download the hubert_soft model, and put hubert-soft-0d54a1f4.pt into hubert_pretrain/.
- Download the pitch extractor crepe full, and put full.pth into crepe/assets.
- (Optional) Download the rmvpe pretrain, and put rmvpe.pt into pretrain.
Note: crepe full.pth is 84.9 MB, not 6kb.
- (Choose one) Download the pretrained model optimized for female voices (LargeV3), sovits5.0.pretrain.pth, and put it into vits_pretrain/.
  python svc_inference.py --config configs/base.yaml --model ./vits_pretrain/sovits5.0.pretrain.pth --spk ./configs/singers/singer0001.npy --wave test.wav
- (Choose one) Download the pretrained model (LargeV2), sovits5.0.pretrain.pth, and put it into vits_pretrain/.
  python svc_inference.py --config configs/base.yaml --model ./vits_pretrain/sovits5.0.pretrain.pth --spk ./configs/singers/singer0001.npy --wave test.wav
If you chose LargeV2, use svc_inferencermvpev2 or svc_inferencevcrepev2.
If you chose LargeV3, use svc_inferencermvpev3 or svc_inferencevcrepev3.
Necessary pre-processing:
- Separate voice and accompaniment with UVR (skip if there is no accompaniment).
- Cut the audio into shorter clips with slicer; whisper takes inputs shorter than 30 seconds.
- Manually check the sliced audio and remove clips shorter than 2 seconds or with obvious noise.
- Adjust loudness if necessary; Adobe Audition is recommended.
- Put the dataset into the dataset_raw directory following the structure below.
dataset_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
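The duration limits in the checklist above (at least 2 seconds, under 30 seconds) can be checked before preprocessing. The following is a minimal sketch, not part of the project: the function names are hypothetical, and it uses only the standard-library `wave` module, which reads plain PCM WAV files and may reject other encodings.

```python
import wave
from pathlib import Path


def clip_seconds(path: Path) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def find_bad_clips(root: Path, min_s: float = 2.0, max_s: float = 30.0):
    """Yield (speaker, filename, seconds) for clips outside [min_s, max_s)."""
    for speaker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for wav_path in sorted(speaker_dir.glob("*.wav")):
            seconds = clip_seconds(wav_path)
            if seconds < min_s or seconds >= max_s:
                yield speaker_dir.name, wav_path.name, seconds


if __name__ == "__main__":
    root = Path("dataset_raw")
    if root.is_dir():
        for speaker, name, seconds in find_bad_clips(root):
            print(f"{speaker}/{name}: {seconds:.2f}s")
```

Clips reported by this check still need a manual listen; it says nothing about noise or loudness.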
By default, LargeV2 is used for PPG analysis.
However, you can pick the variant marked (v2) or (v3) below to use LargeV2 or LargeV3 for PPG processing.
With crepe (v2):
python svc_preprocessingV2.py -t 2
With crepe (v3):
python svc_preprocessingV3.py -t 2
With rmvpe (v2):
python svc_preprocessingrmvpeV2.py -t 2
With rmvpe (v3):
python svc_preprocessingrmvpeV3.py -t 2
WARNING: RMVPE might not work, but you can try it.
-t: number of threads; it should not exceed the CPU core count, and 2 is usually enough.
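The choice between the four preprocessing scripts and the limit on `-t` can be captured in a small helper. This is only a sketch: the function and dictionary names are made up for illustration, and the script names are copied from the commands above.

```python
import os

# Script names as listed above, keyed by (pitch extractor, whisper version).
PREPROCESS_SCRIPTS = {
    ("crepe", "v2"): "svc_preprocessingV2.py",
    ("crepe", "v3"): "svc_preprocessingV3.py",
    ("rmvpe", "v2"): "svc_preprocessingrmvpeV2.py",
    ("rmvpe", "v3"): "svc_preprocessingrmvpeV3.py",
}


def preprocess_command(pitch: str, whisper: str, threads: int = 2) -> list[str]:
    """Build the preprocessing command, capping -t at the CPU core count."""
    script = PREPROCESS_SCRIPTS[(pitch.lower(), whisper.lower())]
    cores = os.cpu_count() or 1  # cpu_count() may return None
    capped = max(1, min(threads, cores))
    return ["python", script, "-t", str(capped)]


print(preprocess_command("crepe", "v2", threads=1))
```

The returned list can be passed directly to `subprocess.run` from the project root.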
After preprocessing you will get an output with following structure.
data_svc/
└── waves-16k
│ └── speaker0
│ │ ├── 000001.wav
│ │ └── 000xxx.wav
│ └── speaker1
│ ├── 000001.wav
│ └── 000xxx.wav
└── waves-32k
│ └── speaker0
│ │ ├── 000001.wav
│ │ └── 000xxx.wav
│ └── speaker1
│ ├── 000001.wav
│ └── 000xxx.wav
└── pitch
│ └── speaker0
│ │ ├── 000001.pit.npy
│ │ └── 000xxx.pit.npy
│ └── speaker1
│ ├── 000001.pit.npy
│ └── 000xxx.pit.npy
└── hubert
│ └── speaker0
│ │ ├── 000001.vec.npy
│ │ └── 000xxx.vec.npy
│ └── speaker1
│ ├── 000001.vec.npy
│ └── 000xxx.vec.npy
└── whisper
│ └── speaker0
│ │ ├── 000001.ppg.npy
│ │ └── 000xxx.ppg.npy
│ └── speaker1
│ ├── 000001.ppg.npy
│ └── 000xxx.ppg.npy
└── speaker
│ └── speaker0
│ │ ├── 000001.spk.npy
│ │ └── 000xxx.spk.npy
│ └── speaker1
│ ├── 000001.spk.npy
│ └── 000xxx.spk.npy
└── singer
├── speaker0.spk.npy
└── speaker1.spk.npy
- Re-sampling
  - Generate audio with a sampling rate of 16000Hz in ./data_svc/waves-16k
    python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000
  - Generate audio with a sampling rate of 32000Hz in ./data_svc/waves-32k
    python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000
- Use 16K audio to extract pitch (crepe)
  python prepare/preprocess_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch
- Use 16K audio to extract pitch (RMVPE)
  python prepare/preprocess_rmvpe.py -w data_svc/waves-16k/ -p data_svc/pitch
- Use 16K audio to extract ppg (v2)
  python prepare/preprocess_ppgv2.py -w data_svc/waves-16k/ -p data_svc/whisper
- Use 16K audio to extract ppg (v3)
  python prepare/preprocess_ppgv3.py -w data_svc/waves-16k/ -p data_svc/whisper
- Use 16K audio to extract hubert
  python prepare/preprocess_hubert.py -w data_svc/waves-16k/ -v data_svc/hubert
- Use 16k audio to extract timbre code
  python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker
- Extract the average value of the timbre code for inference; it can also replace the per-clip timbre when generating the training index, so that one unified timbre per speaker is used for training
  python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer
- Use 32k audio to extract the linear spectrum
  python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs
- Use 32k audio to generate training index
  python prepare/preprocess_train.py
- Training file debugging
  python prepare/preprocess_zzz.py
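The timbre-averaging step above reduces many per-clip timbre codes to one vector per speaker. The sketch below shows the idea on plain Python lists, assuming the average is an ordinary element-wise mean; the real files are NumPy arrays (`*.spk.npy`) and the project script may differ in detail. The function name is hypothetical.

```python
from statistics import fmean


def average_embedding(embeddings: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized speaker embeddings."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same length")
    return [fmean(column) for column in zip(*embeddings)]


# Toy 3-dimensional embeddings standing in for the per-clip *.spk.npy vectors.
clips = [[0.0, 2.0, 4.0], [2.0, 4.0, 8.0]]
print(average_embedding(clips))  # [1.0, 3.0, 6.0]
```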
- If fine-tuning from the pre-trained model, download it first: (LargeV3) sovits5.0.pretrain.pth or (LargeV2) sovits5.0.pretrain.pth. Put the pretrained model under the project root, change this line in configs/base.yaml
  pretrain: "./vits_pretrain/sovits5.0.pretrain.pth"
  and adjust the learning rate appropriately, e.g. 5e-5.
  batch_size: for a GPU with 6G VRAM, 6 is the recommended value; 8 will work, but each step will be much slower.
- Start training
  python svc_trainer.py -c configs/base.yaml -n sovits5.0
- Resume training
  python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/sovits5.0_***.pt
- Log visualization
  tensorboard --logdir logs/
- Export inference model: text encoder, Flow network, Decoder network
  python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt
- Inference
  - If there is no need to adjust f0, just run the following command. (Crepe V2)
    python svc_inferencevcrepev2.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0
  - If there is no need to adjust f0, just run the following command. (Crepe V3)
    python svc_inferencevcrepev3.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0
  - If there is no need to adjust f0, just run the following command. (RMVPE V2)
    python svc_inferencermvpev2.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0
  - If there is no need to adjust f0, just run the following command. (RMVPE V3)
    python svc_inferencermvpev3.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0
  - If f0 will be adjusted manually, follow these steps:
    - Use whisper to extract the content encoding and generate test.ppg.npy.
      python whisper/inference.py -w test.wav -p test.ppg.npy
    - Use hubert to extract the content vector and generate test.vec.npy; running this separately, rather than using one-click inference, reduces GPU memory usage.
      python hubert/inference.py -w test.wav -v test.vec.npy
    - Extract the F0 parameter to CSV text format, open the CSV file in Excel, and manually correct wrong F0 values with reference to Audition or SonicVisualiser.
      python pitch/inference.py -w test.wav -p test.csv
    - Final inference
      python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --ppg test.ppg.npy --vec test.vec.npy --pit test.csv --shift 0
- Notes
  - When --ppg is specified, repeated inference on the same audio avoids re-extracting the audio content codes; if it is not specified, they are extracted automatically.
  - When --vec is specified, repeated inference on the same audio avoids re-extracting the audio content codes; if it is not specified, they are extracted automatically.
  - When --pit is specified, the manually tuned F0 parameter is loaded; if it is not specified, F0 is extracted automatically.
  - The output file svc_out.wav is generated in the current directory.
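Manual F0 correction in a spreadsheet can be tedious when a whole passage is off by an octave, so the same edit can be scripted. The sketch below is an illustration under assumptions, not the project's tooling: it assumes each CSV row ends with the F0 value in Hz and that 0 marks an unvoiced frame. Check the file written by pitch/inference.py before relying on that layout. The function name is hypothetical.

```python
import csv
import io


def shift_f0_rows(rows, start, stop, semitones):
    """Return rows with the last column (F0 in Hz) transposed for start <= i < stop.

    Unvoiced frames (F0 == 0) are left untouched.
    """
    factor = 2.0 ** (semitones / 12.0)
    shifted = []
    for i, row in enumerate(rows):
        row = list(row)
        f0 = float(row[-1])
        if start <= i < stop and f0 > 0:
            row[-1] = f"{f0 * factor:.1f}"
        shifted.append(row)
    return shifted


# Hypothetical CSV content; the real layout written by pitch/inference.py may differ.
sample = "0,220\n10,880\n20,0\n30,220\n"
rows = list(csv.reader(io.StringIO(sample)))
fixed = shift_f0_rows(rows, start=1, stop=3, semitones=-12)  # undo an octave jump
print(fixed)
```

Writing the result back with `csv.writer` gives a file that can be passed to `--pit`, provided the layout assumption holds.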
- Arguments ref

args | --config | --model | --spk | --wave | --ppg | --vec | --pit | --shift |
---|---|---|---|---|---|---|---|---|
name | config path | model path | speaker | wave input | wave ppg | wave hubert | wave pitch | pitch shift |

- Post-processing by VAD
  python svc_inference_post.py --ref test.wav --svc svc_out.wav --out svc_out_post.wav
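When the same audio is converted repeatedly, it helps to assemble the inference command in one place and pass the cached feature files only when they exist. The helper below is a sketch with a hypothetical name; it uses only the flags listed in the arguments table. Whether the version-specific scripts accept `--ppg`, `--vec` and `--pit` is not stated above, so the default is svc_inference.py.

```python
def inference_command(model, spk, wave, shift=0, script="svc_inference.py",
                      config="configs/base.yaml", ppg=None, vec=None, pit=None):
    """Assemble the inference command line from the documented arguments."""
    cmd = ["python", script, "--config", config, "--model", model,
           "--spk", spk, "--wave", wave]
    # Optional pre-extracted features; omitted ones are extracted automatically.
    for flag, value in (("--ppg", ppg), ("--vec", vec), ("--pit", pit)):
        if value is not None:
            cmd += [flag, value]
    cmd += ["--shift", str(shift)]
    return cmd


print(" ".join(inference_command(
    "sovits5.0.pth", "./data_svc/singer/your_singer.spk.npy", "test.wav",
    ppg="test.ppg.npy", vec="test.vec.npy", pit="test.csv")))
```

The list form can be handed to `subprocess.run` without shell quoting issues.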
To increase the stability of the generated timbre, you can use the method described in the Retrieval-based-Voice-Conversion repository. This method consists of 2 steps:
- Train the retrieval index on hubert and whisper features. Run training with default settings:
  python svc_train_retrieval.py
  If the number of vectors is more than 200_000, they will be compressed to 10_000 using the MiniBatchKMeans algorithm. You can change these settings using command line options:

  usage: crate faiss indexes for feature retrieval [-h] [--debug] [--prefix PREFIX]
         [--speakers SPEAKERS [SPEAKERS ...]]
         [--compress-features-after COMPRESS_FEATURES_AFTER]
         [--n-clusters N_CLUSTERS] [--n-parallel N_PARALLEL]

  options:
    -h, --help            show this help message and exit
    --debug
    --prefix PREFIX       add prefix to index filename
    --speakers SPEAKERS [SPEAKERS ...]
                          speaker names to create an index. By default all speakers are from data_svc
    --compress-features-after COMPRESS_FEATURES_AFTER
                          If the number of features is greater than the value compress feature vectors using MiniBatchKMeans.
    --n-clusters N_CLUSTERS
                          Number of centroids to which features will be compressed
    --n-parallel N_PARALLEL
                          Nuber of parallel job of MinibatchKmeans. Default is cpus-1

  Compressing the training vectors can speed up index inference, but it reduces retrieval quality. Use compression only if you really have a lot of vectors.
  The resulting indexes will be stored in the "indexes" folder as:

  data_svc
  ...
  └── indexes
      ├── speaker0
      │   ├── some_prefix_hubert.index
      │   └── some_prefix_whisper.index
      └── speaker1
          ├── hubert.index
          └── whisper.index

- At the inference stage, mix in the n closest features in a certain proportion for the VITS model. Enable feature retrieval with settings:
  python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0 \
      --enable-retrieval \
      --retrieval-ratio 0.5 \
      --n-retrieval-vectors 3
For a better retrieval effect, you can try different values of the parameters --retrieval-ratio and --n-retrieval-vectors.
If you have multiple sets of indexes, you can select a specific set via the parameter --retrieval-index-prefix.
You can explicitly specify the paths to the hubert and whisper indexes using the parameters --hubert-index-path and --whisper-index-path.
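To build intuition for --retrieval-ratio and --n-retrieval-vectors, the sketch below blends one feature vector with the mean of its nearest stored vectors. It is a brute-force illustration of the presumed idea, not the project's code: the project builds faiss indexes (per the usage text above), and its neighbour weighting may differ. The function name is hypothetical.

```python
import math


def retrieve_and_blend(query, index, n_vectors=3, ratio=0.5):
    """Blend a feature vector with the mean of its n nearest index vectors."""
    nearest = sorted(index, key=lambda v: math.dist(query, v))[:n_vectors]
    if not nearest:
        return list(query)  # empty index: nothing to retrieve
    mean = [sum(col) / len(nearest) for col in zip(*nearest)]
    # ratio = 0 keeps the original feature, ratio = 1 uses only retrieved ones.
    return [ratio * m + (1.0 - ratio) * q for q, m in zip(query, mean)]


index = [[0.0, 0.0], [2.0, 0.0], [4.0, 4.0], [10.0, 10.0]]
print(retrieve_and_blend([1.0, 1.0], index, n_vectors=2, ratio=0.5))  # [1.0, 0.5]
```

In this picture a higher ratio pulls features toward the training data of the target speaker, while more retrieval vectors smooths the retrieved estimate.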
Named by pure coincidence: average -> ave -> eva; eve (eva) represents conception and reproduction.
python svc_eva.py
eva_conf = {
'./configs/singers/singer0022.npy': 0,
'./configs/singers/singer0030.npy': 0,
'./configs/singers/singer0047.npy': 0.5,
'./configs/singers/singer0051.npy': 0.5,
}
The generated singer file will be eva.spk.npy.
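The eva_conf mapping above assigns a weight to each singer file. A plausible reading, sketched below on toy lists, is that the mixed speaker is the weighted sum of the listed embeddings; this is an assumption about svc_eva.py, and the real singer files are NumPy arrays. The function name is hypothetical.

```python
def mix_speakers(embeddings, weights):
    """Weighted sum of speaker embeddings, keyed by the same names as weights."""
    active = {name: w for name, w in weights.items() if w != 0}
    if not active:
        raise ValueError("at least one weight must be non-zero")
    dim = len(next(iter(embeddings.values())))
    mixed = [0.0] * dim
    for name, weight in active.items():
        for i, value in enumerate(embeddings[name]):
            mixed[i] += weight * value
    return mixed


# Toy 2-dimensional embeddings; real singer files are NumPy arrays on disk.
embeddings = {
    "singer0022": [9.0, 9.0],
    "singer0047": [1.0, 0.0],
    "singer0051": [0.0, 1.0],
}
weights = {"singer0022": 0, "singer0047": 0.5, "singer0051": 0.5}
print(mix_speakers(embeddings, weights))  # [0.5, 0.5]
```

Singers with weight 0 contribute nothing, which matches how the first two entries in eva_conf are switched off.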
https://github.com/facebookresearch/speech-resynthesis paper
https://github.com/jaywalnut310/vits paper
https://github.com/openai/whisper/ paper
https://github.com/NVIDIA/BigVGAN paper
https://github.com/mindslab-ai/univnet paper
https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf
https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS
https://github.com/brentspell/hifi-gan-bwe
https://github.com/mozilla/TTS
https://github.com/bshall/soft-vc
https://github.com/maxrmorrison/torchcrepe
https://github.com/MoonInTheRiver/DiffSinger
https://github.com/OlaWod/FreeVC paper
https://github.com/yl4579/HiFTNet paper
Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers
AdaSpeech: Adaptive Text to Speech for Custom Voice
AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation
Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis
Multilingual Speech Synthesis and Cross-Language Voice Cloning: GRL
RoFormer: Enhanced Transformer with rotary position embedding
https://github.com/auspicious3000/contentvec/blob/main/contentvec/data/audio/audio_utils_1.py
https://github.com/revsic/torch-nansy/blob/main/utils/augment/praat.py
https://github.com/revsic/torch-nansy/blob/main/utils/augment/peq.py
https://github.com/biggytruck/SpeechSplit2/blob/main/utils.py
https://github.com/OlaWod/FreeVC/blob/main/preprocess_sr.py
https://github.com/svc-develop-team/so-vits-svc
This project does not take part in any disputes around the original project; it is intended only for learning and use. Thank you for using it. The results may not be as good as 4.1, but the project is well worth trying, and I believe you will not regret it.
If you encounter a processing error from Intel, add the following lines at the top of the Python file (.py) where the problem occurred:
import sys,os
sys.path.append(os.path.dirname(os.path.abspath(__file__)))