Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stylemelagn model structure change for 32Khz vocoder #370

Open
wblgers opened this issue Jun 8, 2022 · 3 comments
Open

stylemelagn model structure change for 32Khz vocoder #370

wblgers opened this issue Jun 8, 2022 · 3 comments

Comments

@wblgers
Copy link

wblgers commented Jun 8, 2022

Dear professor,

I'd like to train stylemelgan vocoder of 32kHz, here is my config to train a multi-speaker model, now the speaker similarity on VC task is worse than fregan/hifigan.
Can you give me some advice to improve quality. Here are two points I want to try:
(1)improve in_channels from 128 to 256 ;
(2)improve conv kernal size;
Thanks !

###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
sampling_rate: 32000     # Sampling rate.
fft_size: 2048           # FFT size.
hop_size: 320            # Hop size.
win_length: 1600         # Window length.
                         # If set to null, it will be the same as fft_size.
window: "hann"           # Window function.
num_mels: 80             # Number of mel basis.
fmin: 80                 # Minimum freq in mel basis calculation.
fmax: 7600               # Maximum frequency in mel basis calculation.
global_gain_scale: 1.0   # Will be multiplied to all of waveform.
trim_silence: false      # Whether to trim the start and end of silence.
trim_threshold_in_db: 60 # Need to tune carefully if the recording is not good.
trim_frame_size: 2048    # Frame size in trimming.
trim_hop_size: 512       # Hop size in trimming.
format: "npy"           # Feature file format. " npy " or " hdf5 " is supported.

min_level_db: -100
ref_level_db: 20.0

preemphasis: 0.97
###########################################################
#         GENERATOR NETWORK ARCHITECTURE SETTING          #
###########################################################
generator_type: "StyleMelGANGenerator" # Generator type.
generator_params:
    in_channels: 128
    aux_channels: 80
    channels: 64
    out_channels: 1
    kernel_size: 9
    dilation: 2
    bias: True
    noise_upsample_scales: [5, 5, 2, 2]
    noise_upsample_activation: "LeakyReLU"
    noise_upsample_activation_params:
        negative_slope: 0.2
    upsample_scales: [5, 1, 4, 1, 4, 1, 2, 2, 1]
    upsample_mode: "nearest"
    gated_function: "softmax"
    use_weight_norm: True

###########################################################
#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
###########################################################
discriminator_type: "StyleMelGANDiscriminator" # Discriminator type.
discriminator_params:
    repeats: 4
    window_sizes: [512, 1024, 2048, 4096]
    pqmf_params:
        - [1, None, None, None]
        - [2, 62, 0.26700, 9.0]
        - [4, 62, 0.14200, 9.0]
        - [8, 62, 0.07949, 9.0]
    discriminator_params:
        out_channels: 1
        kernel_sizes: [5, 3]
        channels: 16
        max_downsample_channels: 512
        bias: True
        downsample_scales: [4, 4, 4, 1]
        nonlinear_activation: "LeakyReLU"
        nonlinear_activation_params:
            negative_slope: 0.2
    use_weight_norm: True

###########################################################
#                   STFT LOSS SETTING                     #
###########################################################
stft_loss_params:
    fft_sizes: [1024, 2048, 512]  # List of FFT size for STFT-based loss.
    hop_sizes: [120, 240, 50]     # List of hop size for STFT-based loss
    win_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
    window: "hann_window"         # Window function for STFT-based loss
lambda_aux: 1.0                   # Loss balancing coefficient for aux loss.

###########################################################
#               ADVERSARIAL LOSS SETTING                  #
###########################################################
lambda_adv: 1.0 # Loss balancing coefficient for adv loss.
generator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.

###########################################################
#                  DATA LOADER SETTING                    #
###########################################################
batch_size: 32              # Batch size.
batch_max_steps: 32000      # Length of each audio in batch. Make sure dividable by hop_size.
pin_memory: true            # Whether to pin memory in Pytorch DataLoader.
num_workers: 4              # Number of workers in Pytorch DataLoader.
remove_short_samples: false # Whether to remove samples the length of which are less than batch_max_steps.
allow_cache: false           # Whether to allow cache in dataset. If true, it requires cpu memory.

###########################################################
#             OPTIMIZER & SCHEDULER SETTING               #
###########################################################
generator_optimizer_type: Adam
generator_optimizer_params:
    lr: 4.0e-4
    betas: [0.5, 0.9]
    weight_decay: 0.0
generator_scheduler_type: MultiStepLR
generator_scheduler_params:
    gamma: 0.5
    milestones:
        - 100000
        - 300000
        - 500000
        - 700000
        - 900000
        - 110000
generator_grad_norm: -1
discriminator_optimizer_type: Adam
discriminator_optimizer_params:
    lr: 2.0e-4
    betas: [0.5, 0.9]
    weight_decay: 0.0
discriminator_scheduler_type: MultiStepLR
discriminator_scheduler_params:
    gamma: 0.5
    milestones:
        - 200000
        - 400000
        - 600000
        - 800000
discriminator_grad_norm: -1

###########################################################
#                    INTERVAL SETTING                     #
###########################################################
discriminator_train_start_steps: 100000 # Number of steps to start to train discriminator.
train_max_steps: 1500000                # Number of training steps.
save_interval_steps: 50000              # Interval steps to save checkpoint.
eval_interval_steps: 1000              # Interval steps to evaluate the network.
log_interval_steps: 100                # Interval steps to record the training log.

###########################################################
#                     OTHER SETTING                       #
###########################################################
num_save_intermediate_results: 4  # Number of results to be saved as intermediate results.
@MikeAleksa
Copy link

@wblgers have you seen any improvement in StyleMelGAN at higher sampling rates by changing input channels or kernel size?

@wblgers
Copy link
Author

wblgers commented Jun 29, 2022

conv kernal size

I'm working on it. Since the multi-gpu training does not speed up training as expceted, the training is slow. I'll give out some results when finished.

@MikeAleksa
Copy link

@wblgers Great - I'm testing increasing in_channels and will respond when I have results as well.

As far as multi-gpu training, I have found that increasing the LR and reducing steps between scheduled LR changes helps speed up training.

My understanding is by using the same config with multi-gpu training, you are only increasing the batch size. Since batch size is bigger, you can increase LR proportional to batch size since the step taken by optimizer should be better due to increased batch size.

I tried multiplying LR by 4 for 8 GPUs, and decreasing number of steps in each scheduler change by 1/2. I'm not sure what the best settings are, though. You may be able to do things 1:1 for fastest training (e.g. 8x LR for 8 GPUs and divide number of steps by 8) but I'm not sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants