fix variance levels in variance adaptor
keonlee9420 committed Mar 6, 2022
1 parent f52c66d commit dca252c
Showing 47 changed files with 56 additions and 26 deletions.
21 changes: 21 additions & 0 deletions README.md
@@ -160,6 +160,23 @@ The loss curves, synthesized mel-spectrograms, and audios are shown.
 ![](./img/tensorboard_spec_vctk.png)
 ![](./img/tensorboard_audio_vctk.png)
+## Ablation Study
+![](./img/tensorboard_loss_ljs_comparison.png)
+| ID | Model | Block Type | Pitch Conditioning |
+| --- | --- | --- | --- |
+| 1 | LJSpeech_transformer_fs2_cwt | `transformer_fs2` | continuous wavelet transform |
+| 2 | LJSpeech_transformer_cwt | `transformer` | continuous wavelet transform |
+| 3 | LJSpeech_transformer_frame | `transformer` | frame-level f0 |
+| 4 | LJSpeech_transformer_ph | `transformer` | phoneme-level f0 |
+Observations:
+1. Changing the building block (IDs 1 and 2): "transformer_fs2" seems to be better optimized in terms of memory usage and model size, so training time and mel losses decrease. However, the output quality does not improve dramatically, and the "transformer" block sometimes generates speech with an even more stable pitch contour than "transformer_fs2".
+2. Changing the pitch conditioning (IDs 2 to 4): there is a trade-off between audio quality (pitch stability) and expressiveness.
+    - audio quality: "ph" >= "frame" > "cwt"
+    - expressiveness: "cwt" > "frame" > "ph"
 # Notes
 - Both phoneme-level and frame-level variance are supported in both supervised and unsupervised duration modeling.
@@ -175,6 +192,10 @@ The loss curves, synthesized mel-spectrograms, and audios are shown.
 - For vocoder, **HiFi-GAN** and **MelGAN** are supported.
 ### Updates Log
+- Mar.05, 2022 (v0.2.1): Fix and update codebase & pre-trained models with demo samples
+    1. Fix variance adaptor to make it work with all combinations of building block and variance type/level
+    2. Update pre-trained models with demo samples of LJSpeech and VCTK under "transformer_fs2" building block and "cwt" pitch conditioning
+    3. Share the results of an ablation study comparing "transformer" vs. "transformer_fs2" and three types of pitch conditioning ("frame", "ph", and "cwt")
 - Feb.18, 2022 (v0.2.0): Update data preprocessor and variance adaptor & losses following [keonlee9420's DiffSinger](https://github.com/keonlee9420/DiffSinger) / Add various prosody modeling methods
     1. Prepare two different types of data pipeline in preprocessor to maximize unsupervised/supervised duration modeling
     2. Adopt wavelet for pitch modeling & loss
9 changes: 0 additions & 9 deletions dataset.py
@@ -125,12 +125,6 @@ def __getitem__(self, idx):
             )
             f0cwt_mean_std = np.load(f0cwt_mean_std_path)
             f0_mean, f0_std = float(f0cwt_mean_std[0]), float(f0cwt_mean_std[1])
-        elif self.pitch_type == "ph":
-            f0_phlevel_sum = torch.zeros(phone.shape).float().scatter_add(
-                0, torch.from_numpy(mel2ph).long() - 1, torch.from_numpy(f0).float())
-            f0_phlevel_num = torch.zeros(phone.shape).float().scatter_add(
-                0, torch.from_numpy(mel2ph).long() - 1, torch.ones(f0.shape)).clamp_min(1)
-            f0_ph = (f0_phlevel_sum / f0_phlevel_num).numpy()
 
         sample = {
             "id": basename,
@@ -140,7 +134,6 @@ def __getitem__(self, idx):
             "mel": mel,
             "pitch": pitch,
             "f0": f0,
-            "f0_ph": f0_ph,
             "uv": uv,
             "cwt_spec": cwt_spec,
             "f0_mean": f0_mean,
@@ -187,8 +180,6 @@ def reprocess(self, data, idxs):
             cwt_specs = pad_2D(cwt_specs)
             f0_means = np.array(f0_means)
             f0_stds = np.array(f0_stds)
-        elif self.pitch_type == "ph":
-            f0s = [data[idx]["f0_ph"] for idx in idxs]
         energies = [data[idx]["energy"] for idx in idxs]
         durations = [data[idx]["duration"] for idx in idxs] if not self.learn_alignment else None
         mel2phs = [data[idx]["mel2ph"] for idx in idxs] if not self.learn_alignment else None
Binary file added demo/LJSpeech_v0.2.1/900000/LJ001-0092.png
Binary file added demo/LJSpeech_v0.2.1/900000/LJ001-0092.wav
Binary file added demo/LJSpeech_v0.2.1/900000/LJ001-0133.png
Binary file added demo/LJSpeech_v0.2.1/900000/LJ001-0133.wav
Binary file added demo/LJSpeech_v0.2.1/900000/LJ001-0142.png
Binary file added demo/LJSpeech_v0.2.1/900000/LJ001-0142.wav
Binary file added demo/LJSpeech_v0.2.1/900000/LJ001-0147.png
Binary file added demo/LJSpeech_v0.2.1/900000/LJ001-0147.wav
Binary file added demo/LJSpeech_v0.2.1/900000/LJ001-0151.png
Binary file added demo/LJSpeech_v0.2.1/900000/LJ001-0151.wav
Binary file added demo/LJSpeech_v0.2.1/900000/LJ001-0159.png
Binary file added demo/LJSpeech_v0.2.1/900000/LJ001-0159.wav
Binary file added demo/VCTK_v0.2.1/900000/p225-021.png
Binary file added demo/VCTK_v0.2.1/900000/p225-021.wav
Binary file added demo/VCTK_v0.2.1/900000/p226-351.png
Binary file added demo/VCTK_v0.2.1/900000/p226-351.wav
Binary file added demo/VCTK_v0.2.1/900000/p232-197.png
Binary file added demo/VCTK_v0.2.1/900000/p232-197.wav
Binary file added demo/VCTK_v0.2.1/900000/p236-148.png
Binary file added demo/VCTK_v0.2.1/900000/p236-148.wav
Binary file added demo/VCTK_v0.2.1/900000/p269-040.png
Binary file added demo/VCTK_v0.2.1/900000/p269-040.wav
Binary file added demo/VCTK_v0.2.1/900000/p285-400.png
Binary file added demo/VCTK_v0.2.1/900000/p285-400.wav
Binary file added demo/VCTK_v0.2.1/900000/p304-027.png
Binary file added demo/VCTK_v0.2.1/900000/p304-027.wav
Binary file added demo/VCTK_v0.2.1/900000/p317-019.png
Binary file added demo/VCTK_v0.2.1/900000/p317-019.wav
Binary file added demo/VCTK_v0.2.1/900000/p334-182.png
Binary file added demo/VCTK_v0.2.1/900000/p334-182.wav
Binary file added demo/VCTK_v0.2.1/900000/p345-158.png
Binary file added demo/VCTK_v0.2.1/900000/p345-158.wav
Binary file added demo/VCTK_v0.2.1/900000/p361-227.png
Binary file added demo/VCTK_v0.2.1/900000/p361-227.wav
Binary file added demo/VCTK_v0.2.1/900000/s5-360.png
Binary file added demo/VCTK_v0.2.1/900000/s5-360.wav
Binary file modified img/tensorboard_audio_ljs.png
Binary file modified img/tensorboard_audio_vctk.png
Binary file modified img/tensorboard_loss_ljs.png
Binary file added img/tensorboard_loss_ljs_comparison.png
Binary file modified img/tensorboard_loss_vctk.png
Binary file modified img/tensorboard_spec_ljs.png
Binary file modified img/tensorboard_spec_vctk.png
43 changes: 26 additions & 17 deletions model/modules.py
@@ -12,6 +12,7 @@

 from utils.tools import (
     get_variance_level,
+    get_phoneme_level_pitch,
     get_phoneme_level_energy,
     get_mask_from_lengths,
     pad_1D,
@@ -870,6 +871,14 @@ def binarize_attention_parallel(self, attn, in_lens, out_lens):
             attn_out = b_mas(attn_cpu, in_lens.cpu().numpy(), out_lens.cpu().numpy(), width=1)
         return torch.from_numpy(attn_out).to(attn.device)

+    def get_phoneme_level_pitch(self, phone, src_len, mel2ph, mel_len, pitch_frame):
+        return torch.from_numpy(
+            pad_1D(
+                [get_phoneme_level_pitch(ph[:s_len], m2ph[:m_len], var[:m_len]) for ph, s_len, m2ph, m_len, var \
+                        in zip(phone.int().cpu().numpy(), src_len.cpu().numpy(), mel2ph.cpu().numpy(), mel_len.cpu().numpy(), pitch_frame.cpu().numpy())]
+            )
+        ).float().to(pitch_frame.device)
+
     def get_phoneme_level_energy(self, duration, src_len, energy_frame):
         return torch.from_numpy(
             pad_1D(
@@ -972,7 +981,7 @@ def forward(
     ):
         pitch_prediction = energy_prediction = prosody_info = None

-        x = text
+        x = text.clone()
         if speaker_embedding is not None:
             x = x + speaker_embedding.unsqueeze(1).expand(
                 -1, text.shape[1], -1
@@ -1032,17 +1041,8 @@ def forward(
             attn_hard_dur = attn_hard.sum(2)[:, 0, :]
         attn_out = (attn_soft, attn_hard, attn_hard_dur, attn_logprob)

-        # Note that there is no pre-extracted phoneme-level variance features in unsupervised duration modeling.
-        # Alternatively, we can use attn_hard_dur instead of duration_target for computing phoneme-level variances.
-        output_1 = x.clone()
-        if self.use_energy_embed and self.energy_feature_level == "phoneme_level":
-            if attn_prior is not None:
-                energy_target = self.get_phoneme_level_energy(attn_hard_dur, src_len, energy_target)
-            energy_prediction, energy_embedding = self.get_energy_embedding(x, energy_target, src_mask, e_control)
-            output_1 = output_1 + energy_embedding
-        x = output_1.clone()
-
         # Upsampling from src length to mel length
+        x_org = x.clone()
         if attn_prior is not None: # Training of unsupervised duration modeling
             if step < self.binarization_start_steps:
                 A_soft = attn_soft.squeeze(1)
@@ -1065,7 +1065,9 @@ def forward(
             mel_mask = get_mask_from_lengths(mel_len)
             mel2ph = dur_to_mel2ph(duration_rounded, src_mask)

-        output_2 = x.clone()
+        # Note that there is no pre-extracted phoneme-level variance features in unsupervised duration modeling.
+        # Alternatively, we can use attn_hard_dur instead of duration_target for computing phoneme-level variances.
+        x_temp = x.clone()
         if self.use_pitch_embed:
             if pitch_target is not None:
                 mel2ph = pitch_target["mel2ph"]
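For readers unfamiliar with the `mel2ph` tensor used in this hunk: it assigns each mel frame the 1-based index of the phoneme it belongs to (which is why the pitch code indexes with `mel2ph - 1`). The following is a minimal, dependency-free sketch of that mapping for a single utterance. `durations_to_mel2ph` is a hypothetical helper written for illustration; it is not the repository's batched `dur_to_mel2ph`, whose handling of masks and padding may differ.

```python
def durations_to_mel2ph(durations):
    """Expand per-phoneme frame counts into a per-frame, 1-based phoneme index."""
    mel2ph = []
    for ph_index, n_frames in enumerate(durations, start=1):
        # Phoneme `ph_index` owns the next `n_frames` mel frames.
        mel2ph.extend([ph_index] * n_frames)
    return mel2ph


durations = [2, 3, 0, 1]  # frames per phoneme; the third phoneme gets no frames
mel2ph = durations_to_mel2ph(durations)
print(mel2ph)  # [1, 1, 2, 2, 2, 4]
assert len(mel2ph) == sum(durations)
```

A phoneme with zero duration simply never appears in the result, which is the case the count clamp in the phoneme-level averaging has to guard against.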
@@ -1077,18 +1079,25 @@ def forward(
                         cwt_spec, f0_mean, f0_std, mel2ph, self.preprocess_config["preprocessing"]["pitch"],
                     )
                     pitch_target.update({"f0_cwt": pitch_target["f0"]})
+                if self.pitch_type == "ph":
+                    pitch_target["f0"] = self.get_phoneme_level_pitch(text, src_len, mel2ph, mel_len, pitch_target["f0"])
                 pitch_prediction, pitch_embedding = self.get_pitch_embedding(
-                    x, pitch_target["f0"], pitch_target["uv"], mel2ph, p_control, encoder_out=output_1
+                    x, pitch_target["f0"], pitch_target["uv"], mel2ph, p_control, encoder_out=x_org
                 )
             else:
                 pitch_prediction, pitch_embedding = self.get_pitch_embedding(
-                    x, None, None, mel2ph, p_control, encoder_out=output_1
+                    x, None, None, mel2ph, p_control, encoder_out=x_org
                 )
-            output_2 = output_2 + pitch_embedding
+            x_temp = x_temp + pitch_embedding
         if self.use_energy_embed and self.energy_feature_level == "frame_level":
             energy_prediction, energy_embedding = self.get_energy_embedding(x, energy_target, mel_mask, e_control)
-            output_2 = output_2 + energy_embedding
-        x = output_2.clone()
+            x_temp = x_temp + energy_embedding
+        elif self.use_energy_embed and self.energy_feature_level == "phoneme_level":
+            if attn_prior is not None:
+                energy_target = self.get_phoneme_level_energy(attn_hard_dur, src_len, energy_target)
+            energy_prediction, energy_embedding = self.get_energy_embedding(x_org, energy_target, src_mask, e_control)
+            x_temp = x_temp + self.length_regulator(energy_embedding, duration_rounded, max_len)[0]
+        x = x_temp.clone()

         return (
             x,
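The new `elif` branch computes the phoneme-level energy embedding from `x_org` (source length) and passes it through `self.length_regulator(...)` before adding it to `x_temp`, which by this point appears to have mel length. Below is a minimal single-utterance sketch of that expansion, assuming the regulator repeats each phoneme vector once per frame and zero-pads up to `max_len`. `length_regulate` is a hypothetical stand-in written for illustration, not the module in `model/modules.py`.

```python
def length_regulate(phoneme_vectors, durations, max_len=None):
    """Repeat each phoneme vector by its duration; optionally zero-pad to max_len.

    Returns (frames, mel_len), where mel_len is the unpadded frame count.
    Sequences longer than max_len are left untruncated in this sketch.
    """
    frames = []
    for vector, n_frames in zip(phoneme_vectors, durations):
        # One copy of the phoneme vector per mel frame it covers.
        frames.extend(list(vector) for _ in range(n_frames))
    mel_len = len(frames)
    if max_len is not None and mel_len < max_len:
        dim = len(phoneme_vectors[0]) if phoneme_vectors else 0
        frames.extend([0.0] * dim for _ in range(max_len - mel_len))
    return frames, mel_len


energy_embedding = [[0.1, 0.2], [0.3, 0.4]]  # one vector per phoneme
durations = [2, 1]                           # frames per phoneme
frames, mel_len = length_regulate(energy_embedding, durations, max_len=4)
print(frames)   # [[0.1, 0.2], [0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
print(mel_len)  # 3
```

After expansion the embedding has one row per mel frame, so it can be added element-wise to a frame-level hidden sequence.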
9 changes: 9 additions & 0 deletions utils/tools.py
@@ -44,6 +44,15 @@ def get_variance_level(preprocess_config, model_config, data_loading=True):
     return energy_level_tag, energy_feature_level


+def get_phoneme_level_pitch(phone, mel2ph, pitch):
+    pitch_phlevel_sum = torch.zeros(phone.shape[:-1]).float().scatter_add(
+        0, torch.from_numpy(mel2ph).long() - 1, torch.from_numpy(pitch).float())
+    pitch_phlevel_num = torch.zeros(phone.shape[:-1]).float().scatter_add(
+        0, torch.from_numpy(mel2ph).long() - 1, torch.ones(pitch.shape)).clamp_min(1)
+    pitch = (pitch_phlevel_sum / pitch_phlevel_num).numpy()
+    return pitch
+
+
 def get_phoneme_level_energy(duration, energy):
     # Phoneme-level average
     pos = 0
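`get_phoneme_level_pitch` above averages frame-level pitch within each phoneme by scatter-adding values and counts at index `mel2ph - 1`, then dividing. The same computation can be sketched as a plain-Python loop for one utterance. `phoneme_level_average` and its `num_phonemes` argument are hypothetical names, and skipping frames whose index is 0 (treated as padding) is an assumption of this sketch rather than something the function above does.

```python
def phoneme_level_average(mel2ph, frame_values, num_phonemes):
    """Average frame-level values over the frames assigned to each phoneme.

    mel2ph holds a 1-based phoneme index per frame; 0 is treated as padding
    (an assumption of this sketch).
    """
    sums = [0.0] * num_phonemes
    counts = [0] * num_phonemes
    for ph_index, value in zip(mel2ph, frame_values):
        if ph_index == 0:
            continue  # padding frame, belongs to no phoneme
        sums[ph_index - 1] += value
        counts[ph_index - 1] += 1
    # Clamp the count to at least 1 so phonemes without frames yield 0.0,
    # mirroring clamp_min(1) in the torch version.
    return [total / max(count, 1) for total, count in zip(sums, counts)]


mel2ph = [1, 1, 2, 2, 2, 3]
f0 = [100.0, 120.0, 200.0, 210.0, 220.0, 0.0]
print(phoneme_level_average(mel2ph, f0, num_phonemes=4))
# [110.0, 210.0, 0.0, 0.0]
```

The fourth phoneme receives no frames, so its sum is 0 and the clamped count keeps the division well defined.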