add the results of gemini-df-resnet

wenet-e2e · Apr 25, 2024 · 98cf3d6
1 parent f163eda
commit 98cf3d6

Showing 5 changed files with 104 additions and 6 deletions.
diff --git a/examples/cnceleb/v3_finetune/README.md b/examples/cnceleb/v3_finetune/README.md
@@ -1,11 +1,9 @@
 ## Fine-tuning Results Based on DINO
 
-* Setup: fbank80, num_frms200, epoch75 (pretrain), epoch50 (finetune), ArcMargin, aug_prob0.6, speed_perturb (no spec_aug)
-* [Pre-trained ECAPA-TDNN checkpoints](https://drive.google.com/drive/folders/1XDIUjnKPrvJE5auBWT5CcE4mqcglCwzq?usp=drive_link): teacher models extracted from `model_75.pt` (please refer to `wespeaker/ssl/bin/average_dino_model.py` for information on the extraction process)
+* Setup: fbank80, num_frms200, epoch50 (finetune), ArcMargin, aug_prob0.6, speed_perturb (no spec_aug)
 * test_trials: CNC-Eval-Avg.lst
 * These results are obtained by pretraining on different datasets and then finetuning with CNCeleb.
 
-
 | Model                             | Params  |  FLOPs  |    Pretraining Data    | LM  | AS-Norm   | EER (%)   | minDCF (p=0.01)  |
 | :------------------------------   | :-----: | :-----: | :--------------------: | :-: | :-------: | :-------: | :--------------: |
 | ECAPA_TDNN_GLOB_c1024-ASTP-emb192 | 14.65M  | 2.65 G  |        CNCeleb         | ×   | ×         | 8.217     | 0.439            |
@@ -20,3 +18,8 @@
 * 🔥 UPDATE 2024.03: We support finetuning DINO-based self-supervised models, which are trained on the WenetSpeech dataset. Papers related to the pretraining and finetuning results:
     * [WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition](https://arxiv.org/pdf/2110.03370.pdf)
     * [Leveraging In-the-wild Data for Effective Self-supervised Pretraining in Speaker Recognition](https://arxiv.org/pdf/2309.11730.pdf)
+
+## Resources
+* [Pre-trained ECAPA-TDNN checkpoints](https://drive.google.com/drive/folders/1XDIUjnKPrvJE5auBWT5CcE4mqcglCwzq?usp=drive_link)
+* [Filtering metadata for WenetSpeech](https://drive.google.com/file/d/1UaGuyT1wcKc5g9vRdfIBvLoDRcuOxBlX/view?usp=drive_link)
+
diff --git a/examples/voxceleb/v2/README.md b/examples/voxceleb/v2/README.md
@@ -47,6 +47,10 @@
 |                      |       |       | √ | √ | 0.744 | 0.896 | 1.603 |
 | Res2Net34_Base       | 4.68M | 1.77G | × | × | 1.351 | 1.347 | 2.478 |
 |                      |       |       | × | √ | 1.234 | 1.232 | 2.162 |
+| Gemini_DFResNet114   | 6.53M | 5.42G | × | × | 0.787 | 0.963 | 1.760 |
+|                      |       |       | × | √ | 0.707 | 0.889 | 1.546 |
+|                      |       |       | √ | × | 0.771 | 0.906 | 1.599 |
+|                      |       |       | √ | √ | 0.638 | 0.839 | 1.427 |
 
 
 ## PLDA results

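To put the new Gemini_DFResNet114 rows in perspective, the relative error reduction from each post-processing option can be computed directly from the table. The sketch below assumes that the two flag columns are LM and AS-Norm and that the three numeric columns are EER (%) on three trial lists; the column headers fall outside this hunk, so treat that reading as an assumption. The names `GEMINI_DFRESNET114_EER`, `relative_reduction`, and `reductions_vs_baseline` are illustrative and not part of the repository.

```python
# Values copied from the Gemini_DFResNet114 rows in the hunk above.
# Keys are (LM, AS-Norm) flags (assumed column meaning); each tuple holds
# the three numeric columns in table order (assumed to be EER in %).
GEMINI_DFRESNET114_EER = {
    (False, False): (0.787, 0.963, 1.760),
    (False, True): (0.707, 0.889, 1.546),
    (True, False): (0.771, 0.906, 1.599),
    (True, True): (0.638, 0.839, 1.427),
}


def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction from baseline to improved, in percent."""
    return 100.0 * (baseline - improved) / baseline


def reductions_vs_baseline(table):
    """Compare every non-baseline row against the (no LM, no AS-Norm) row."""
    base = table[(False, False)]
    return {
        flags: tuple(relative_reduction(b, v) for b, v in zip(base, row))
        for flags, row in table.items()
        if flags != (False, False)
    }


if __name__ == "__main__":
    for (lm, asnorm), gains in reductions_vs_baseline(GEMINI_DFRESNET114_EER).items():
        formatted = ", ".join(f"{g:.1f}%" for g in gains)
        print(f"LM={lm}, AS-Norm={asnorm}: {formatted}")
```
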
diff --git a/examples/voxceleb/v2/conf/gemini_dfresnet_adam.yaml b/examples/voxceleb/v2/conf/gemini_dfresnet_adam.yaml
@@ -1,6 +1,6 @@
 ### train configuration
 
-exp_dir: exp/Gemini_DF_ResNet60-TSTP-emb256-fbank80-num_frms200-aug0.6-spTrue-saFalse-ArcMargin-SGD-epoch150
+exp_dir: exp/Gemini_DF_ResNet114-TSTP-emb256-fbank80-num_frms200-aug0.6-spTrue-saFalse-ArcMargin-AdamW-epoch165
 gpus: "[0,1]"
 num_avg: 2
 enable_amp: False # whether enable automatic mixed precision training
@@ -45,7 +45,7 @@ dataset_args:
     max_f: 8
     prob: 0.6
 
-model: Gemini_DF_ResNet60 # Gemini_DF_ResNet60 Gemini_DF_ResNet114 GemGemini_DF_ResNet183 Gemini_DF_ResNet237
+model: Gemini_DF_ResNet114 # Gemini_DF_ResNet60 Gemini_DF_ResNet114 Gemini_DF_ResNet183 Gemini_DF_ResNet237
 model_init: null
 model_args:
   feat_dim: 80

diff --git a/examples/voxceleb/v2/conf/gemini_dfresnet_sgd_lm.yaml b/examples/voxceleb/v2/conf/gemini_dfresnet_sgd_lm.yaml
@@ -0,0 +1,91 @@
+### Large margin fine-tuning configuration
+#
+#   Large margin fine-tuning is often used in speaker verification
+#   challenge systems to further improve performance.
+#   In this fine-tuning stage, a larger margin and longer segments
+#   are used.
+
+exp_dir: exp/Gemini_DF_ResNet114-TSTP-emb256-fbank80-num_frms200-aug0.6-spTrue-saFalse-ArcMargin-AdamW-epoch165-LM
+gpus: "[0,1]"
+num_avg: 1
+enable_amp: False # whether to enable automatic mixed precision training
+do_lm: True
+
+seed: 42
+num_epochs: 5
+save_epoch_interval: 1 # save model per epoch
+log_batch_interval: 100 # log every 100 batches
+
+dataloader_args:
+  batch_size: 32
+  num_workers: 8
+  pin_memory: False
+  prefetch_factor: 8
+  drop_last: True
+
+dataset_args:
+  # the number of samples traversed within one epoch; if the value is 0,
+  # the number of utterances in the dataset is used as sample_num_per_epoch.
+  sample_num_per_epoch: 0
+  shuffle: True
+  shuffle_args:
+    shuffle_size: 2500
+  filter: True
+  filter_args:
+    min_num_frames: 100
+    max_num_frames: 800
+  resample_rate: 16000
+  speed_perturb: True
+  num_frms: 600
+  aug_prob: 0.6 # prob to add reverb & noise aug per sample
+  fbank_args:
+    num_mel_bins: 80
+    frame_shift: 10
+    frame_length: 25
+    dither: 1.0
+  spec_aug: False
+  spec_aug_args:
+    num_t_mask: 1
+    num_f_mask: 1
+    max_t: 10
+    max_f: 8
+    prob: 0.6
+
+model: Gemini_DF_ResNet114 # Gemini_DF_ResNet60 Gemini_DF_ResNet114 Gemini_DF_ResNet183 Gemini_DF_ResNet237
+model_init: null
+model_args:
+  feat_dim: 80
+  embed_dim: 256
+  pooling_func: "TSTP" # TSTP, ASTP, MQMHASTP
+  two_emb_layer: False
+projection_args:
+  project_type: "arc_margin" # add_margin, arc_margin, sphere, softmax, arc_margin_intertopk_subcenter
+  scale: 32.0
+  easy_margin: False
+
+margin_scheduler: MarginScheduler
+margin_update:
+  initial_margin: 0.5
+  final_margin: 0.5
+  increase_start_epoch: 1
+  fix_start_epoch: 1
+  update_margin: True
+  increase_type: "exp" # exp, linear
+
+loss: CrossEntropyLoss
+loss_args: {}
+
+optimizer: SGD
+optimizer_args:
+  momentum: 0.9
+  nesterov: True
+  weight_decay: 0.0001
+
+scheduler: ExponentialDecrease
+scheduler_args:
+  initial_lr: 1.0e-4
+  final_lr: 2.5e-5
+  warm_up_epoch: 1
+  warm_from_zero: True
+
+
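The large margin config above sets `num_frms: 600` with a 10 ms frame shift and a 25 ms frame length, and decays the learning rate from `initial_lr: 1.0e-4` to `final_lr: 2.5e-5`. The following is a minimal sketch of what those numbers imply, not the WeSpeaker implementation: `exponential_lr` assumes a geometric interpolation between the two rates with a linear warm-up factor, and `segment_ms` assumes standard framing without edge padding. Both function names and the per-step granularity are hypothetical.

```python
import math


def exponential_lr(step, total_steps, initial_lr=1.0e-4, final_lr=2.5e-5,
                   warmup_steps=0, warm_from_zero=True):
    """Geometric decay from initial_lr to final_lr over total_steps.

    Assumption: during warm-up the decayed rate is scaled linearly
    from 0 to 1; the real scheduler may differ in detail.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    lr = initial_lr * math.exp(progress * math.log(final_lr / initial_lr))
    if warm_from_zero and step < warmup_steps:
        lr *= step / warmup_steps
    return lr


def segment_ms(num_frms, frame_shift=10, frame_length=25):
    """Audio span in ms covered by num_frms frames (no edge padding assumed)."""
    return (num_frms - 1) * frame_shift + frame_length


if __name__ == "__main__":
    total = 1000  # hypothetical number of training steps
    for step in (0, 100, 200, 500, 1000):
        lr = exponential_lr(step, total, warmup_steps=200)
        print(f"step {step:4d}: lr = {lr:.3e}")
    print("fine-tuning segment:", segment_ms(600), "ms")
    print("base training segment:", segment_ms(200), "ms")
```

Under these assumptions the fine-tuning segments span roughly 6 s of audio, compared with roughly 2 s for the `num_frms200` setting named in the base experiment directory.
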
diff --git a/wespeaker/models/gemini_dfresnet.py b/wespeaker/models/gemini_dfresnet.py
@@ -165,7 +165,7 @@ def Gemini_DF_ResNet237(feat_dim, embed_dim, pooling_func='TSTP', two_emb_layer=
 
 if __name__ == '__main__':
     x = torch.zeros(1, 200, 80)
-    model = Gemini_DF_ResNet183(80, 256, 'TSTP')
+    model = Gemini_DF_ResNet114(80, 256, 'TSTP')
     model.eval()
     out = model(x)
     print(out[-1].size())