[feature] add support for gemini-dfresnet #291

Merged 7 commits on Apr 25, 2024
22 changes: 12 additions & 10 deletions docs/pretrained.md
@@ -39,14 +39,16 @@ in [the voxconverse recipe](https://github.com/wenet-e2e/wespeaker/tree/master/e

## Model List

| Datasets | Languages | Checkpoint (pt) | Runtime Model (onnx) |
|-----------------------------------------------|-----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34.zip) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34_LM.zip) | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34.onnx) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ResNet152_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet152_LM.zip) | [ResNet152_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet152_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ResNet221_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet221_LM.zip) | [ResNet221_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet221_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ResNet293_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet293_LM.zip) | [ResNet293_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet293_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [CAM++](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++.zip) / [CAM++_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++_LM.zip) | [CAM++](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++.onnx) / [CAM++_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ECAPA512](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512.zip) / [ECAPA512_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512_LM.zip) | [ECAPA512](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512.onnx) / [ECAPA512_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ECAPA1024](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024.zip) / [ECAPA1024_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024_LM.zip) | [ECAPA1024](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024.onnx) / [ECAPA1024_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024_LM.onnx) |
| [CNCeleb](../examples/cnceleb/v2/README.md) | CN | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34.zip) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34_LM.zip) | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34.onnx) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34_LM.onnx) |
| Datasets | Languages | Checkpoint (pt) | Runtime Model (onnx) |
|--- |--- |--- |--- |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34.zip) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34_LM.zip) | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34.onnx) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ResNet152_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet152_LM.zip)| [ResNet152_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet152_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ResNet221_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet221_LM.zip)| [ResNet221_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet221_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ResNet293_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet293_LM.zip)| [ResNet293_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet293_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [CAM++](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++.zip) / [CAM++_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++_LM.zip) | [CAM++](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++.onnx) / [CAM++_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ECAPA512](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512.zip) / [ECAPA512_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512_LM.zip) | [ECAPA512](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512.onnx) / [ECAPA512_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ECAPA1024](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024.zip) / [ECAPA1024_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024_LM.zip) | [ECAPA1024](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024.onnx) / [ECAPA1024_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [Gemini_DFResnet114_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_gemini_dfresnet114_LM.zip)| [Gemini_DFResnet114_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_gemini_dfresnet114_LM.onnx) |
| [CNCeleb](../examples/cnceleb/v2/README.md) | CN | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34.zip) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34_LM.zip) | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34.onnx) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34_LM.onnx) |
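
The download links in the table share one base URL and a `<dataset>/<dataset>_<name>.<ext>` layout. A minimal sketch of building such a link, assuming that layout holds for every entry; `model_url` is a hypothetical helper for illustration and is not part of WeSpeaker:

```python
# Base URL copied from the links in the table above.
BASE_URL = "https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models"


def model_url(dataset: str, name: str, runtime: bool = False) -> str:
    """Build a download link following the pattern visible in the table.

    `name` is the file stem as it appears in the links (for example
    "gemini_dfresnet114_LM"), which is not always the display name.
    `runtime=True` selects the ONNX runtime model instead of the
    zipped PyTorch checkpoint.
    """
    ext = "onnx" if runtime else "zip"
    return f"{BASE_URL}/{dataset}/{dataset}_{name}.{ext}"


if __name__ == "__main__":
    print(model_url("voxceleb", "gemini_dfresnet114_LM"))
    print(model_url("voxceleb", "gemini_dfresnet114_LM", runtime=True))
```

The file stem has to be copied from the table because casing differs between entries (`resnet34` versus `CAM++` or `ECAPA512`).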


9 changes: 6 additions & 3 deletions examples/cnceleb/v3_finetune/README.md
@@ -1,11 +1,9 @@
## Fine-tuning Results Based on DINO

* Setup: fbank80, num_frms200, epoch75 (pretrain), epoch50 (finetune), ArcMargin, aug_prob0.6, speed_perturb (no spec_aug)
* [Pre-trained ECAPA-TDNN checkpoints](https://drive.google.com/drive/folders/1XDIUjnKPrvJE5auBWT5CcE4mqcglCwzq?usp=drive_link): teacher models extracted from `model_75.pt` (please refer to `wespeaker/ssl/bin/average_dino_model.py` for information on the extraction process)
* Setup: fbank80, num_frms200, epoch50 (finetune), ArcMargin, aug_prob0.6, speed_perturb (no spec_aug)
* test_trials: CNC-Eval-Avg.lst
* These results are obtained by pretraining on different datasets and then finetuning with CNCeleb.


| Model | Params | FLOPs | Pretraining Data | LM | AS-Norm | EER (%) | minDCF (p=0.01) |
| :------------------------------ | :-----: | :-----: | :--------------------: | :-: | :-------: | :-------: | :--------------: |
| ECAPA_TDNN_GLOB_c1024-ASTP-emb192 | 14.65M | 2.65 G | CNCeleb | × | × | 8.217 | 0.439 |
@@ -20,3 +18,8 @@
* 🔥 UPDATE 2024.03: We support finetuning DINO-based self-supervised models, which are trained on the WenetSpeech dataset. Papers related to the pretraining and finetuning results:
* [WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition](https://arxiv.org/pdf/2110.03370.pdf)
* [Leveraging In-the-wild Data for Effective Self-supervised Pretraining in Speaker Recognition](https://arxiv.org/pdf/2309.11730.pdf)

## Resources
* [Pre-trained ECAPA-TDNN checkpoints](https://drive.google.com/drive/folders/1XDIUjnKPrvJE5auBWT5CcE4mqcglCwzq?usp=drive_link)
* [Filtering metadata for WenetSpeech](https://drive.google.com/file/d/1UaGuyT1wcKc5g9vRdfIBvLoDRcuOxBlX/view?usp=drive_link)

4 changes: 4 additions & 0 deletions examples/voxceleb/v2/README.md
@@ -47,6 +47,10 @@
| | | | √ | √ | 0.744 | 0.896 | 1.603 |
| Res2Net34_Base | 4.68M | 1.77G | × | × | 1.351 | 1.347 | 2.478 |
| | | | × | √ | 1.234 | 1.232 | 2.162 |
| Gemini_DFResNet114 | 6.53M | 5.42G | × | × | 0.787 | 0.963 | 1.760 |
| | | | × | √ | 0.707 | 0.889 | 1.546 |
| | | | √ | × | 0.771 | 0.906 | 1.599 |
| | | | √ | √ | 0.638 | 0.839 | 1.427 |
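
The four Gemini_DFResNet114 rows differ only in their LM and AS-Norm settings. A small sketch of the relative reduction between the first and last of those rows follows. It assumes the three result columns are error rates on three trial lists where lower is better (the column headers fall outside this hunk), and `relative_reduction` is a name chosen here for illustration:

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction from `baseline` to `improved`, in percent."""
    return (baseline - improved) / baseline * 100.0


# Values copied from the first and last Gemini_DFResNet114 rows.
first_row = [0.787, 0.963, 1.760]
last_row = [0.638, 0.839, 1.427]

if __name__ == "__main__":
    for base, best in zip(first_row, last_row):
        print(f"{base:.3f} -> {best:.3f}: {relative_reduction(base, best):.1f}% lower")
```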


## PLDA results
81 changes: 81 additions & 0 deletions examples/voxceleb/v2/conf/gemini_dfresnet_adam.yaml
@@ -0,0 +1,81 @@
### train configuration

exp_dir: exp/Gemini_DF_ResNet114-TSTP-emb256-fbank80-num_frms200-aug0.6-spTrue-saFalse-ArcMargin-AdamW-epoch165
gpus: "[0,1]"
num_avg: 2
enable_amp: False # whether to enable automatic mixed precision training

seed: 42
num_epochs: 165
save_epoch_interval: 5 # save model every 5 epochs
log_batch_interval: 100 # log every 100 batches

dataloader_args:
  batch_size: 128
  num_workers: 8
  pin_memory: False
  prefetch_factor: 8
  drop_last: True

dataset_args:
  # the number of samples traversed within one epoch; if the value is 0,
  # the utterance number in the dataset will be used as the sample_num_per_epoch.
  sample_num_per_epoch: 0
  shuffle: True
  shuffle_args:
    shuffle_size: 2500
  filter: True
  filter_args:
    min_num_frames: 100
    max_num_frames: 800
  resample_rate: 16000
  speed_perturb: True
  num_frms: 200
  aug_prob: 0.6 # prob to add reverb & noise aug per sample
  fbank_args:
    num_mel_bins: 80
    frame_shift: 10
    frame_length: 25
    dither: 1.0
  spec_aug: False
  spec_aug_args:
    num_t_mask: 1
    num_f_mask: 1
    max_t: 10
    max_f: 8
    prob: 0.6

model: Gemini_DF_ResNet114 # Gemini_DF_ResNet60 Gemini_DF_ResNet114 Gemini_DF_ResNet183 Gemini_DF_ResNet237
model_init: null
model_args:
  feat_dim: 80
  embed_dim: 256
  pooling_func: "TSTP" # TSTP, ASTP, MQMHASTP
  two_emb_layer: False
projection_args:
  project_type: "arc_margin" # add_margin, arc_margin, sphere, sphereface2, softmax, arc_margin_intertopk_subcenter
  scale: 32.0
  easy_margin: False

margin_scheduler: MarginScheduler
margin_update:
  initial_margin: 0.2
  final_margin: 0.2
  increase_start_epoch: 20
  fix_start_epoch: 40
  update_margin: False
  increase_type: "exp" # exp, linear

loss: CrossEntropyLoss
loss_args: {}

optimizer: AdamW
optimizer_args:
  weight_decay: 0.05

scheduler: ExponentialDecrease
scheduler_args:
  initial_lr: 0.000125
  final_lr: 0.000001
  warm_up_epoch: 6
  warm_from_zero: False
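
The `scheduler_args` in the training config describe a learning rate that decays from `initial_lr` to `final_lr` over training. A minimal sketch of one common exponential interpolation between those two values is given below. It is an assumption for illustration, not necessarily WeSpeaker's exact `ExponentialDecrease` formula, and it ignores warm-up and any scaling by GPU count; `exp_decay_lr` and `progress` are hypothetical names:

```python
import math


def exp_decay_lr(progress: float,
                 initial_lr: float = 0.000125,
                 final_lr: float = 0.000001) -> float:
    """Interpolate exponentially between initial_lr and final_lr.

    `progress` is the fraction of training completed, in [0, 1].
    Warm-up (warm_up_epoch, warm_from_zero) is deliberately left out.
    """
    return initial_lr * math.exp(progress * math.log(final_lr / initial_lr))


if __name__ == "__main__":
    num_epochs = 165  # value taken from the config above
    for epoch in (0, 55, 110, 165):
        print(epoch, f"{exp_decay_lr(epoch / num_epochs):.3e}")
```

With this form the rate falls by a constant factor per epoch, so most of the absolute decrease happens early in training.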
91 changes: 91 additions & 0 deletions examples/voxceleb/v2/conf/gemini_dfresnet_sgd_lm.yaml
@@ -0,0 +1,91 @@
### Large margin fine-tuning configuration
#
# Large margin fine-tuning is often used in speaker verification
# challenge systems to further improve performance. In this
# fine-tuning stage, a larger margin and longer segments are used.

exp_dir: exp/Gemini_DF_ResNet114-TSTP-emb256-fbank80-num_frms200-aug0.6-spTrue-saFalse-ArcMargin-AdamW-epoch165-LM
gpus: "[0,1]"
num_avg: 1
enable_amp: False # whether to enable automatic mixed precision training
do_lm: True

seed: 42
num_epochs: 5
save_epoch_interval: 1 # save model per epoch
log_batch_interval: 100 # log every 100 batches

dataloader_args:
  batch_size: 32
  num_workers: 8
  pin_memory: False
  prefetch_factor: 8
  drop_last: True

dataset_args:
  # the number of samples traversed within one epoch; if the value is 0,
  # the utterance number in the dataset will be used as the sample_num_per_epoch.
  sample_num_per_epoch: 0
  shuffle: True
  shuffle_args:
    shuffle_size: 2500
  filter: True
  filter_args:
    min_num_frames: 100
    max_num_frames: 800
  resample_rate: 16000
  speed_perturb: True
  num_frms: 600
  aug_prob: 0.6 # prob to add reverb & noise aug per sample
  fbank_args:
    num_mel_bins: 80
    frame_shift: 10
    frame_length: 25
    dither: 1.0
  spec_aug: False
  spec_aug_args:
    num_t_mask: 1
    num_f_mask: 1
    max_t: 10
    max_f: 8
    prob: 0.6

model: Gemini_DF_ResNet114 # Gemini_DF_ResNet60 Gemini_DF_ResNet114 Gemini_DF_ResNet183 Gemini_DF_ResNet237
model_init: null
model_args:
  feat_dim: 80
  embed_dim: 256
  pooling_func: "TSTP" # TSTP, ASTP, MQMHASTP
  two_emb_layer: False
projection_args:
  project_type: "arc_margin" # add_margin, arc_margin, sphere, softmax, arc_margin_intertopk_subcenter
  scale: 32.0
  easy_margin: False

margin_scheduler: MarginScheduler
margin_update:
  initial_margin: 0.5
  final_margin: 0.5
  increase_start_epoch: 1
  fix_start_epoch: 1
  update_margin: True
  increase_type: "exp" # exp, linear

loss: CrossEntropyLoss
loss_args: {}

optimizer: SGD
optimizer_args:
  momentum: 0.9
  nesterov: True
  weight_decay: 0.0001

scheduler: ExponentialDecrease
scheduler_args:
  initial_lr: 1.0e-4
  final_lr: 2.5e-5
  warm_up_epoch: 1
  warm_from_zero: True
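
The large margin config raises `num_frms` from 200 to 600 while keeping `frame_shift: 10` and `frame_length: 25`. A short sketch of what that means in audio duration, assuming the shift and length are in milliseconds and frames are taken without edge padding (the usual Kaldi-style framing); `segment_ms` is a hypothetical helper:

```python
def segment_ms(num_frms: int, frame_shift_ms: int = 10, frame_length_ms: int = 25) -> int:
    """Audio span covered by `num_frms` frames, in milliseconds.

    The first frame covers `frame_length_ms`; each further frame
    advances the window by `frame_shift_ms`.
    """
    return (num_frms - 1) * frame_shift_ms + frame_length_ms


if __name__ == "__main__":
    for n in (200, 600):
        print(f"num_frms={n}: {segment_ms(n)} ms")
```

Under these assumptions the training segments are roughly 2 s long and the fine-tuning segments roughly 6 s, which is consistent with the smaller `batch_size` in the fine-tuning config.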

