[docs] update README.md and add model FLOPs info #269

Merged · 6 commits · Jan 30, 2024
5 changes: 1 addition & 4 deletions README.md
@@ -50,7 +50,7 @@ Please refer to [python usage](docs/python_package.md) for more command line and
git clone https://github.com/wenet-e2e/wespeaker.git
```

-* Create conda env: pytorch version >= 1.10.0 is required !!!
+* Create conda env: pytorch version >= 1.12.1 is recommended !!!
``` sh
conda create -n wespeaker python=3.9
conda activate wespeaker
@@ -64,11 +64,8 @@ pre-commit install # for clean and tidy code
* 2023.07.18: Support the kaldi-compatible PLDA and unsupervised adaptation, see [#186](https://github.com/wenet-e2e/wespeaker/pull/186).
* 2023.07.14: Support the [NIST SRE16 recipe](https://www.nist.gov/itl/iad/mig/speaker-recognition-evaluation-2016), see [#177](https://github.com/wenet-e2e/wespeaker/pull/177).
* 2023.07.10: Support the [Self-Supervised Learning recipe](https://github.com/wenet-e2e/wespeaker/tree/master/examples/voxceleb/v3) on Voxceleb, including [DINO](https://openaccess.thecvf.com/content/ICCV2021/papers/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper.pdf), [MoCo](https://openaccess.thecvf.com/content_CVPR_2020/papers/He_Momentum_Contrast_for_Unsupervised_Visual_Representation_Learning_CVPR_2020_paper.pdf) and [SimCLR](http://proceedings.mlr.press/v119/chen20j/chen20j.pdf), see [#180](https://github.com/wenet-e2e/wespeaker/pull/180).

* 2023.06.30: Support the [SphereFace2](https://ieeexplore.ieee.org/abstract/document/10094954) loss function, with better performance and stronger noise robustness than the ArcMargin Softmax, see [#173](https://github.com/wenet-e2e/wespeaker/pull/173).

* 2023.04.27: Support the [CAM++](https://arxiv.org/abs/2303.00332) model, with better performance and a lower single-thread inference RTF than the ResNet34 model, see [#153](https://github.com/wenet-e2e/wespeaker/pull/153).

## Recipes

* [VoxCeleb](https://github.com/wenet-e2e/wespeaker/tree/master/examples/voxceleb): Speaker Verification recipe on the [VoxCeleb dataset](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/)
45 changes: 23 additions & 22 deletions examples/cnceleb/v2/README.md
@@ -2,29 +2,30 @@

* Setup: fbank80, num_frms200, epoch150, ArcMargin, aug_prob0.6, speed_perturb (no spec_aug)
* test_trials: CNC-Eval-Avg.lst
-* 🔥 UPDATE: We update this recipe according to the setups in the winning system of CNSRC 2022, and get obvious performance improvement compared with the old recipe. Check the [commit1](https://github.com/wenet-e2e/wespeaker/pull/63/commits/b08804987b3bbb26f4963cedf634058474c743dd), [commit2](https://github.com/wenet-e2e/wespeaker/pull/66/commits/6f6af29197f0aa0a5d1b1993b7feb2f41b97891f) for details.
+* 🔥 UPDATE 2022.07.12: We update this recipe according to the setups in the winning system of CNSRC 2022, and get obvious performance improvement compared with the old recipe. Check the [commit1](https://github.com/wenet-e2e/wespeaker/pull/63/commits/b08804987b3bbb26f4963cedf634058474c743dd), [commit2](https://github.com/wenet-e2e/wespeaker/pull/66/commits/6f6af29197f0aa0a5d1b1993b7feb2f41b97891f) for details.
* LR scheduler warmup from 0
* Remove one embedding layer
* Add large margin fine-tuning strategy (LM)

-| Model | Params | LM | AS-Norm | EER (%) | minDCF (p=0.01) |
-| :------------------------------ | :-------: | :-: | :-------: | :-------: | :--------------: |
-| ResNet34-TSTP-emb256 (OLD) | 6.70M | × | × | 8.426 | 0.487 |
-| ResNet34-TSTP-emb256 | 6.63M | × | × | 7.134 | 0.408 |
-| | | × | √ | 6.747 | 0.367 |
-| | | √ | × | 6.652 | 0.393 |
-| | | √ | √ | 6.492 | 0.354 |
-| ResNet221-TSTP-emb256 | 23.86M | × | × | 5.965 | 0.362 |
-| | | × | √ | 5.708 | **0.326** |
-| | | √ | × | 5.886 | 0.362 |
-| | | √ | √ | **5.655** | 0.330 |
-| ECAPA_TDNN_GLOB_c512-ASTP-emb192 | 6.19M | × | × | 8.313 | 0.432 |
-| | | × | √ | 7.644 | 0.390 |
-| | | √ | × | 8.004 | 0.422 |
-| | | √ | √ | 7.417 | 0.379 |
-| ECAPA_TDNN_GLOB_c1024-ASTP-emb192 | 14.65M | × | × | 7.879 | 0.420 |
-| | | × | √ | 7.412 | 0.379 |
-| | | √ | × | 7.986 | 0.417 |
-| | | √ | √ | 7.395 | 0.372 |
-| RepVGG_TINY_A0 | 6.26M | × | × | 6.883 | 0.399 |
-| | | × | √ | 6.550 | 0.355 |
+| Model | Params | FLOPs | LM | AS-Norm | EER (%) | minDCF (p=0.01) |
+| :------------------------------ | :-------: | :-----: | :-: | :-------: | :-------: | :--------------: |
+| ResNet34-TSTP-emb256 (OLD) | 6.70M | 4.55 G | × | × | 8.426 | 0.487 |
+| ResNet34-TSTP-emb256 | 6.63M | 4.55 G | × | × | 7.134 | 0.408 |
+| | | | × | √ | 6.747 | 0.367 |
+| | | | √ | × | 6.652 | 0.393 |
+| | | | √ | √ | 6.492 | 0.354 |
+| ResNet221-TSTP-emb256 | 23.86M | 21.29 G | × | × | 5.965 | 0.362 |
+| | | | × | √ | 5.708 | **0.326** |
+| | | | √ | × | 5.886 | 0.362 |
+| | | | √ | √ | **5.655** | 0.330 |
+| ECAPA_TDNN_GLOB_c512-ASTP-emb192 | 6.19M | 1.04 G | × | × | 8.313 | 0.432 |
+| | | | × | √ | 7.644 | 0.390 |
+| | | | √ | × | 8.004 | 0.422 |
+| | | | √ | √ | 7.417 | 0.379 |
+| ECAPA_TDNN_GLOB_c1024-ASTP-emb192 | 14.65M | 2.65 G | × | × | 7.879 | 0.420 |
+| | | | × | √ | 7.412 | 0.379 |
+| | | | √ | × | 7.986 | 0.417 |
+| | | | √ | √ | 7.395 | 0.372 |
+| RepVGG_TINY_A0 | 6.26M | 4.65 G | × | × | 6.883 | 0.399 |
+| | | | × | √ | 6.550 | 0.355 |

10 changes: 5 additions & 5 deletions examples/sre/v2/README.md
@@ -4,11 +4,11 @@
* Scoring: cosine & PLDA & PLDA Adaptation
* Metric: EER(%)

-| Model | Params | Backend | Pooled | Tagalog | Cantonese |
-|:---------------------|:------:|:----------:|:------:|:-------:|:---------:|
-| ResNet34-TSTP-emb256 | 6.63M | Cosine | 15.4 | 19.82 | 10.39 |
-| | | PLDA | 11.689 | 16.961 | 6.239 |
-| | | Adapt PLDA | 5.788 | 8.974 | 2.674 |
+| Model | Params | FLOPs | Backend | Pooled | Tagalog | Cantonese |
+|:---------------------|:------:|:------:|:----------:|:------:|:-------:|:---------:|
+| ResNet34-TSTP-emb256 | 6.63M | 4.55G | Cosine | 15.4 | 19.82 | 10.39 |
+| | | | PLDA | 11.689 | 16.961 | 6.239 |
+| | | | Adapt PLDA | 5.788 | 8.974 | 2.674 |
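The cosine backend in the table above is simply the normalized inner product between the enrollment and test embeddings. A minimal sketch of that scoring step (an illustration, not WeSpeaker's actual implementation):

```python
import numpy as np

def cosine_score(enroll: np.ndarray, test: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(enroll, test) /
                 (np.linalg.norm(enroll) * np.linalg.norm(test)))
```

A score close to 1 indicates a likely same-speaker trial; the PLDA backends replace this with a learned probabilistic model over the embedding space.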

The current PLDA implementation is fully compatible with the Kaldi version. Note that the results without adaptation can certainly be improved through parameter tuning and an extra LDA step, as shown in the Kaldi
78 changes: 34 additions & 44 deletions examples/voxceleb/v2/README.md
@@ -4,54 +4,43 @@
* Scoring: cosine (sub mean of vox2_dev)
* Metric: EER(%)

-| Model | Params | AS-Norm(300) | vox1-O-clean | vox1-E-clean | vox1-H-clean |
-|:------|:------:|:------------:|:------------:|:------------:|:------------:|
-| XVEC-TSTP-emb512 | 4.61M | × | 1.962 | 1.918 | 3.389 |
-| | | √ | 1.835 | 1.822 | 3.110 |
-| ECAPA_TDNN_GLOB_c512-ASTP-emb192 | 6.19M | × | 1.149 | 1.248 | 2.313 |
-| | | √ | 1.026 | 1.154 | 2.089 |
-| ResNet34-TSTP-emb256 | 6.63M | × | 0.941 | 1.114 | 2.026 |
-| | | √ | 0.899 | 1.064 | 1.856 |

* 🔥 UPDATE 2023.6.30: We support the SphereFace2 loss function and obtain better and more robust performance, see [#173](https://github.com/wenet-e2e/wespeaker/pull/173).

-* 🔥 UPDATE 2022.07.19: We apply the same setups as the winning system of CNSRC 2022 (see [cnceleb](https://github.com/wenet-e2e/wespeaker/tree/master/examples/cnceleb/v2) recipe for details), and obtain significant performance improvement compared with our previous implementation.
+* 🔥 UPDATE 2022.07.19: We apply the same setups as the winning system of CNSRC 2022 (see [cnceleb](https://github.com/wenet-e2e/wespeaker/tree/master/examples/cnceleb/v2) recipe for details), and obtain significant performance improvement.
* LR scheduler warmup from 0
* Remove one embedding layer in ResNet models
* Add large margin fine-tuning strategy (LM)

-| Model | Params | LM | AS-Norm | vox1-O-clean | vox1-E-clean | vox1-H-clean |
-|:------|:------:|:--:|:-------:|:------------:|:------------:|:------------:|
-| ECAPA_TDNN_GLOB_c512-ASTP-emb192 | 6.19M | × | × | 1.069 | 1.209 | 2.310 |
-| | | × | √ | 0.957 | 1.128 | 2.105 |
-| | | √ | × | 0.878 | 1.072 | 2.007 |
-| | | √ | √ | 0.782 | 1.005 | 1.824 |
-| ECAPA_TDNN_GLOB_c1024-ASTP-emb192 | 14.65M | × | × | 0.856 | 1.072 | 2.059 |
-| | | × | √ | 0.808 | 0.990 | 1.874 |
-| | | √ | × | 0.798 | 0.993 | 1.883 |
-| | | √ | √ | 0.728 | 0.929 | 1.721 |
-| ResNet34-TSTP-emb256 | 6.63M | × | × | 0.867 | 1.049 | 1.959 |
-| | | × | √ | 0.787 | 0.964 | 1.726 |
-| | | √ | × | 0.797 | 0.937 | 1.695 |
-| | | √ | √ | 0.723 | 0.867 | 1.532 |
-| ResNet221-TSTP-emb256 | 23.86M | × | × | 0.569 | 0.774 | 1.464 |
-| | | × | √ | 0.479 | 0.707 | 1.290 |
-| | | √ | × | 0.580 | 0.729 | 1.351 |
-| | | √ | √ | 0.505 | 0.676 | 1.213 |
-| ResNet293-TSTP-emb256 | 28.69M | × | × | 0.595 | 0.756 | 1.433 |
-| | | × | √ | 0.537 | 0.701 | 1.276 |
-| | | √ | × | 0.532 | 0.707 | 1.311 |
-| | | √ | √ | **0.447** | **0.657** | **1.183** |
-| RepVGG_TINY_A0 | 6.26M | × | × | 0.909 | 1.034 | 1.943 |
-| | | × | √ | 0.824 | 0.953 | 1.709 |
-| CAM++ | 7.18M | × | × | 0.803 | 0.932 | 1.860 |
-| | | × | √ | 0.718 | 0.879 | 1.735 |
-| | | √ | × | 0.707 | 0.845 | 1.664 |
-| | | √ | √ | 0.659 | 0.803 | 1.569 |


-* 🔥 UPDATE 2022.11.30: We support arc_margin_intertopk_subcenter loss function and Multi-query Multi-head Attentive Statistics Pooling, and obtain better performance especially on hard trials [VoxSRC](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/competition2021.html).
-* See [#115](https://github.com/wenet-e2e/wespeaker/pull/115).
+| Model | Params | FLOPs | LM | AS-Norm | vox1-O-clean | vox1-E-clean | vox1-H-clean |
+|:------|:------:|:-----:|:--:|:-------:|:------------:|:------------:|:------------:|
+| XVEC-TSTP-emb512 | 4.61M | 0.53G | × | × | 1.989 | 1.209 | 3.412 |
+| | | | × | √ | 1.834 | 1.846 | 3.124 |
+| | | | √ | × | 1.749 | 1.721 | 2.944 |
+| | | | √ | √ | 1.590 | 1.641 | 2.726 |
+| ECAPA_TDNN_GLOB_c512-ASTP-emb192 | 6.19M | 1.04G | × | × | 1.069 | 1.209 | 2.310 |
+| | | | × | √ | 0.957 | 1.128 | 2.105 |
+| | | | √ | × | 0.878 | 1.072 | 2.007 |
+| | | | √ | √ | 0.782 | 1.005 | 1.824 |
+| ECAPA_TDNN_GLOB_c1024-ASTP-emb192 | 14.65M | 2.65G | × | × | 0.856 | 1.072 | 2.059 |
+| | | | × | √ | 0.808 | 0.990 | 1.874 |
+| | | | √ | × | 0.798 | 0.993 | 1.883 |
+| | | | √ | √ | 0.728 | 0.929 | 1.721 |
+| ResNet34-TSTP-emb256 | 6.63M | 4.55G | × | × | 0.867 | 1.049 | 1.959 |
+| | | | × | √ | 0.787 | 0.964 | 1.726 |
+| | | | √ | × | 0.797 | 0.937 | 1.695 |
+| | | | √ | √ | 0.723 | 0.867 | 1.532 |
+| ResNet221-TSTP-emb256 | 23.79M | 21.29G | × | × | 0.569 | 0.774 | 1.464 |
+| | | | × | √ | 0.479 | 0.707 | 1.290 |
+| | | | √ | × | 0.580 | 0.729 | 1.351 |
+| | | | √ | √ | 0.505 | 0.676 | 1.213 |
+| ResNet293-TSTP-emb256 | 28.62M | 28.10G | × | × | 0.595 | 0.756 | 1.433 |
+| | | | × | √ | 0.537 | 0.701 | 1.276 |
+| | | | √ | × | 0.532 | 0.707 | 1.311 |
+| | | | √ | √ | **0.447** | **0.657** | **1.183** |
+| RepVGG_TINY_A0 | 6.26M | 4.65G | × | × | 0.909 | 1.034 | 1.943 |
+| | | | × | √ | 0.824 | 0.953 | 1.709 |
+| CAM++ | 7.18M | 1.15G | × | × | 0.803 | 0.932 | 1.860 |
+| | | | × | √ | 0.718 | 0.879 | 1.735 |
+| | | | √ | × | 0.707 | 0.845 | 1.664 |
+| | | | √ | √ | 0.659 | 0.803 | 1.569 |
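The AS-Norm column above refers to adaptive score normalization against a cohort (the "(300)" in the earlier table denotes keeping the 300 top-scoring cohort utterances). A rough sketch of the score-level computation, with hypothetical argument names and not WeSpeaker's actual implementation:

```python
import numpy as np

def as_norm(score, enroll_cohort_scores, test_cohort_scores, top_n=300):
    """Adaptive symmetric score normalization (AS-Norm) of one trial score.

    The cohort score lists hold similarities of the enrollment / test
    embedding against a fixed cohort set; only the top_n highest scores
    per side are used for the normalization statistics.
    """
    e = np.sort(np.asarray(enroll_cohort_scores))[::-1][:top_n]
    t = np.sort(np.asarray(test_cohort_scores))[::-1][:top_n]
    # z-normalize the trial score against both sides and average
    return 0.5 * ((score - e.mean()) / e.std() +
                  (score - t.mean()) / t.std())
```

Calibrating scores this way typically lowers EER and minDCF, which matches the consistent gains in the √ AS-Norm rows above.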


## PLDA results
@@ -66,3 +55,4 @@ The results on ResNet34 (large margin, no asnorm) are:
| Scoring method | vox1-O-clean | vox1-E-clean | vox1-H-clean |
|:--------------:|:------------:|:------------:|:------------:|
| PLDA | 1.207 | 1.350 | 2.528 |

3 changes: 3 additions & 0 deletions examples/voxconverse/README.md
@@ -0,0 +1,3 @@
This is a **WeSpeaker** speaker diarization recipe on the Voxconverse 2020 dataset, which focuses on an ``in the wild`` scenario: the data were collected from YouTube videos with a semi-automatic pipeline and released for the diarization track of the VoxSRC 2020 Challenge. See https://www.robots.ox.ac.uk/~vgg/data/voxconverse/ for more details.

Two recipes are provided: **v1** and **v2**. The only difference is that **v2** splits the Fbank extraction, embedding extraction, and clustering modules into separate stages. We recommend that newcomers follow the **v2** recipe and run it stage by stage.
1 change: 0 additions & 1 deletion examples/voxconverse/v2/README.md
@@ -1,6 +1,5 @@
## Overview

* Compared with the [v1](https://github.com/wenet-e2e/wespeaker/tree/master/examples/voxconverse/v1) recipe, here we split the Fbank extraction, embedding extraction, and clustering modules into separate stages.
* We suggest running this recipe on a GPU machine with onnxruntime-gpu support.
* Dataset: voxconverse_dev that consists of 216 utterances
* Speaker model: ResNet34 model pretrained by wespeaker
30 changes: 16 additions & 14 deletions runtime/onnxruntime/README.md
@@ -74,16 +74,18 @@ onnx_dir=your_model_dir
>
> CPU: Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz

-| Model | Params | RTF |
-| ------------------- | ------- | -------- |
-| ECAPA-TDNN (C=512) | 6.19 M | 0.018351 |
-| ECAPA-TDNN (C=1024) | 14.65 M | 0.041724 |
-| RepVGG-TINY-A0 | 6.26 M | 0.055117 |
-| ResNet-34 | 6.63 M | 0.060735 |
-| ResNet-152 | 19.88 M | 0.179379 |
-| ResNet-221 | 23.86 M | 0.267511 |
-| ResNet-293 | 28.69 M | 0.364011 |
-| CAM++ | 7.18 M | 0.022978 |
+| Model | Params | FLOPs | RTF |
+| :------------------ | :------ | :------- | :------- |
+| ECAPA-TDNN (C=512) | 6.19 M | 1.04 G | 0.018351 |
+| ECAPA-TDNN (C=1024) | 14.65 M | 2.65 G | 0.041724 |
+| RepVGG-TINY-A0 | 6.26 M | 4.65 G | 0.055117 |
+| ResNet-34 | 6.63 M | 4.55 G | 0.060735 |
+| ResNet-50 | 11.13 M | 5.17 G | 0.073231 |
+| ResNet-101 | 15.89 M | 9.96 G | 0.124613 |
+| ResNet-152 | 19.81 M | 14.76 G | 0.179379 |
+| ResNet-221 | 23.79 M | 21.29 G | 0.267511 |
+| ResNet-293 | 28.62 M | 28.10 G | 0.364011 |
+| CAM++ | 7.18 M | 1.15 G | 0.022978 |
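RTF (real-time factor) in these tables is inference wall-clock time divided by the duration of the input audio. A minimal way to reproduce such a measurement, assuming an 80-dim fbank input with a 10 ms frame shift; this is a sketch with a hypothetical helper name, not the benchmark script used for the numbers above:

```python
import time
import torch

def measure_rtf(model, num_frames=200, feat_dim=80, frame_shift_ms=10, runs=20):
    """Average wall-clock inference time divided by audio duration."""
    x = torch.zeros(1, num_frames, feat_dim)  # one 2-second fbank chunk
    model.eval()
    with torch.no_grad():
        model(x)  # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        elapsed = (time.perf_counter() - start) / runs
    audio_seconds = num_frames * frame_shift_ms / 1000.0
    return elapsed / audio_seconds
```

An RTF below 1.0 means the model extracts embeddings faster than real time; the CPU numbers above were additionally pinned to a single thread.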

> num_threads = 1
>
@@ -93,16 +95,16 @@ onnx_dir=your_model_dir
>
> GPU: NVIDIA 3090

-| Model | Params | RTF |
-| ------------------- | ------- | ---------- |
-| ResNet-34 | 6.63 M | 0.00857436 |
+| Model | Params | FLOPs | RTF |
+| :------------------ | :------ | :------- | :--------- |
+| ResNet-34 | 6.63 M | 4.55 G | 0.00857436 |

2. EER (%)
> onnxruntime: samples_per_chunk=-1.
>
> Mean normalization is not applied to the evaluation embeddings.

| Model | vox-O | vox-E | vox-H |
-| -------------- | ----- | ----- | ----- |
+| :------------- | ----- | ----- | ----- |
| ResNet-34-pt | 0.814 | 0.933 | 1.679 |
| ResNet-34-onnx | 0.814 | 0.933 | 1.679 |
7 changes: 6 additions & 1 deletion wespeaker/models/campplus.py
@@ -402,11 +402,16 @@ def forward(self, x):


if __name__ == '__main__':
-x = torch.zeros(10, 200, 80)
+x = torch.zeros(1, 200, 80)
model = CAMPPlus(feat_dim=80, embed_dim=512, pooling_func='TSTP')
model.eval()
out = model(x)
print(out.shape)

num_params = sum(param.numel() for param in model.parameters())
print("{} M".format(num_params / 1e6))

# from thop import profile
# x_np = torch.randn(1, 200, 80)
# flops, params = profile(model, inputs=(x_np, ))
# print("FLOPs: {} G, Params: {} M".format(flops / 1e9, params / 1e6))
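The commented-out snippet above relies on the third-party `thop` package, whose `profile` call counts multiply–accumulate operations (often quoted directly as FLOPs, and presumably the source of the FLOPs columns in this PR). As a dependency-free cross-check, a rough MAC count can be gathered with forward hooks on the convolution and linear layers that dominate these models; this is a simplified sketch, not part of the PR, and it ignores activations, normalization and pooling:

```python
import torch
import torch.nn as nn

def count_macs(model, sample):
    """Rough MAC count via forward hooks on Conv and Linear layers."""
    totals = []

    def hook(module, inputs, output):
        if isinstance(module, nn.Linear):
            # each output element costs in_features multiply-adds
            totals.append(module.in_features * output.numel())
        else:  # Conv1d / Conv2d
            k = module.weight[0][0].numel()            # kernel taps
            cin = module.in_channels // module.groups  # inputs per group
            totals.append(k * cin * output.numel())

    handles = [m.register_forward_hook(hook) for m in model.modules()
               if isinstance(m, (nn.Linear, nn.Conv1d, nn.Conv2d))]
    with torch.no_grad():
        model(sample)
    for h in handles:
        h.remove()
    return sum(totals)

# Toy example: two Conv1d layers over a (batch, feat_dim, frames) input.
net = nn.Sequential(nn.Conv1d(80, 512, kernel_size=5), nn.ReLU(),
                    nn.Conv1d(512, 192, kernel_size=1))
print("MACs: {:.3f} G".format(count_macs(net, torch.zeros(1, 80, 200)) / 1e9))
# → MACs: 0.059 G
```

For a like-for-like comparison with the tables, feed the same one-batch, 200-frame, 80-dim input used in the `__main__` blocks.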
11 changes: 8 additions & 3 deletions wespeaker/models/ecapa_tdnn.py
@@ -262,13 +262,18 @@ def ECAPA_TDNN_GLOB_c512(feat_dim,


if __name__ == '__main__':
-x = torch.zeros(10, 200, 80)
+x = torch.zeros(1, 200, 80)
model = ECAPA_TDNN_GLOB_c512(feat_dim=80,
-embed_dim=192,
-pooling_func='MQMHASTP')
+embed_dim=256,
+pooling_func='ASTP')
model.eval()
out = model(x)
print(out.shape)

num_params = sum(param.numel() for param in model.parameters())
print("{} M".format(num_params / 1e6))

# from thop import profile
# x_np = torch.randn(1, 200, 80)
# flops, params = profile(model, inputs=(x_np, ))
# print("FLOPs: {} G, Params: {} M".format(flops / 1e9, params / 1e6))