[docs] update README.md and add model FLOPs info #269

Merged · 6 commits · Jan 30, 2024
5 changes: 1 addition & 4 deletions README.md
@@ -50,7 +50,7 @@ Please refer to [python usage](docs/python_package.md) for more command line and
git clone https://github.com/wenet-e2e/wespeaker.git
```

-* Create conda env: pytorch version >= 1.10.0 is required !!!
+* Create conda env: pytorch version >= 1.12.1 is recommended !!!
``` sh
conda create -n wespeaker python=3.9
conda activate wespeaker
@@ -64,11 +64,8 @@ pre-commit install # for clean and tidy code
* 2023.07.18: Support the kaldi-compatible PLDA and unsupervised adaptation, see [#186](https://github.com/wenet-e2e/wespeaker/pull/186).
* 2023.07.14: Support the [NIST SRE16 recipe](https://www.nist.gov/itl/iad/mig/speaker-recognition-evaluation-2016), see [#177](https://github.com/wenet-e2e/wespeaker/pull/177).
* 2023.07.10: Support the [Self-Supervised Learning recipe](https://github.com/wenet-e2e/wespeaker/tree/master/examples/voxceleb/v3) on Voxceleb, including [DINO](https://openaccess.thecvf.com/content/ICCV2021/papers/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper.pdf), [MoCo](https://openaccess.thecvf.com/content_CVPR_2020/papers/He_Momentum_Contrast_for_Unsupervised_Visual_Representation_Learning_CVPR_2020_paper.pdf) and [SimCLR](http://proceedings.mlr.press/v119/chen20j/chen20j.pdf), see [#180](https://github.com/wenet-e2e/wespeaker/pull/180).

* 2023.06.30: Support the [SphereFace2](https://ieeexplore.ieee.org/abstract/document/10094954) loss function, with better performance and stronger noise robustness than the ArcMargin Softmax, see [#173](https://github.com/wenet-e2e/wespeaker/pull/173).

* 2023.04.27: Support the [CAM++](https://arxiv.org/abs/2303.00332) model, with better performance and a lower single-thread inference RTF than the ResNet34 model, see [#153](https://github.com/wenet-e2e/wespeaker/pull/153).

## Recipes

* [VoxCeleb](https://github.com/wenet-e2e/wespeaker/tree/master/examples/voxceleb): Speaker Verification recipe on the [VoxCeleb dataset](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/)
45 changes: 23 additions & 22 deletions examples/cnceleb/v2/README.md
@@ -2,29 +2,30 @@

* Setup: fbank80, num_frms200, epoch150, ArcMargin, aug_prob0.6, speed_perturb (no spec_aug)
* test_trials: CNC-Eval-Avg.lst
-* 🔥 UPDATE: We update this recipe according to the setups in the winning system of CNSRC 2022, and get obvious performance improvement compared with the old recipe. Check the [commit1](https://github.com/wenet-e2e/wespeaker/pull/63/commits/b08804987b3bbb26f4963cedf634058474c743dd), [commit2](https://github.com/wenet-e2e/wespeaker/pull/66/commits/6f6af29197f0aa0a5d1b1993b7feb2f41b97891f) for details.
+* 🔥 UPDATE 2022.07.12: We update this recipe according to the setups in the winning system of CNSRC 2022, and get obvious performance improvement compared with the old recipe. Check the [commit1](https://github.com/wenet-e2e/wespeaker/pull/63/commits/b08804987b3bbb26f4963cedf634058474c743dd), [commit2](https://github.com/wenet-e2e/wespeaker/pull/66/commits/6f6af29197f0aa0a5d1b1993b7feb2f41b97891f) for details.
* LR scheduler warmup from 0
* Remove one embedding layer
* Add large margin fine-tuning strategy (LM)

-| Model | Params | LM | AS-Norm | EER (%) | minDCF (p=0.01) |
-| :------------------------------ | :-------: | :-: | :-------: | :-------: | :--------------: |
-| ResNet34-TSTP-emb256 (OLD) | 6.70M | × | × | 8.426 | 0.487 |
-| ResNet34-TSTP-emb256 | 6.63M | × | × | 7.134 | 0.408 |
-| | | × | √ | 6.747 | 0.367 |
-| | | √ | × | 6.652 | 0.393 |
-| | | √ | √ | 6.492 | 0.354 |
-| ResNet221-TSTP-emb256 | 23.86M | × | × | 5.965 | 0.362 |
-| | | × | √ | 5.708 | **0.326** |
-| | | √ | × | 5.886 | 0.362 |
-| | | √ | √ | **5.655** | 0.330 |
-| ECAPA_TDNN_GLOB_c512-ASTP-emb192 | 6.19M | × | × | 8.313 | 0.432 |
-| | | × | √ | 7.644 | 0.390 |
-| | | √ | × | 8.004 | 0.422 |
-| | | √ | √ | 7.417 | 0.379 |
-| ECAPA_TDNN_GLOB_c1024-ASTP-emb192 | 14.65M | × | × | 7.879 | 0.420 |
-| | | × | √ | 7.412 | 0.379 |
-| | | √ | × | 7.986 | 0.417 |
-| | | √ | √ | 7.395 | 0.372 |
-| RepVGG_TINY_A0 | 6.26M | × | × | 6.883 | 0.399 |
-| | | × | √ | 6.550 | 0.355 |
+| Model | Params | FLOPs | LM | AS-Norm | EER (%) | minDCF (p=0.01) |
+| :------------------------------ | :-------: | :-----: | :-: | :-------: | :-------: | :--------------: |
+| ResNet34-TSTP-emb256 (OLD) | 6.70M | 4.55 G | × | × | 8.426 | 0.487 |
+| ResNet34-TSTP-emb256 | 6.63M | 4.55 G | × | × | 7.134 | 0.408 |
+| | | | × | √ | 6.747 | 0.367 |
+| | | | √ | × | 6.652 | 0.393 |
+| | | | √ | √ | 6.492 | 0.354 |
+| ResNet221-TSTP-emb256 | 23.86M | 21.29 G | × | × | 5.965 | 0.362 |
+| | | | × | √ | 5.708 | **0.326** |
+| | | | √ | × | 5.886 | 0.362 |
+| | | | √ | √ | **5.655** | 0.330 |
+| ECAPA_TDNN_GLOB_c512-ASTP-emb192 | 6.19M | 1.04 G | × | × | 8.313 | 0.432 |
+| | | | × | √ | 7.644 | 0.390 |
+| | | | √ | × | 8.004 | 0.422 |
+| | | | √ | √ | 7.417 | 0.379 |
+| ECAPA_TDNN_GLOB_c1024-ASTP-emb192 | 14.65M | 2.65 G | × | × | 7.879 | 0.420 |
+| | | | × | √ | 7.412 | 0.379 |
+| | | | √ | × | 7.986 | 0.417 |
+| | | | √ | √ | 7.395 | 0.372 |
+| RepVGG_TINY_A0 | 6.26M | 4.65 G | × | × | 6.883 | 0.399 |
+| | | | × | √ | 6.550 | 0.355 |

10 changes: 5 additions & 5 deletions examples/sre/v2/README.md
@@ -4,11 +4,11 @@
* Scoring: cosine & PLDA & PLDA Adaptation
* Metric: EER(%)

-| Model | Params | Backend | Pooled | Tagalog | Cantonese |
-|:---------------------|:------:|:----------:|:------:|:-------:|:---------:|
-| ResNet34-TSTP-emb256 | 6.63M | Cosine | 15.4 | 19.82 | 10.39 |
-| | | PLDA | 11.689 | 16.961 | 6.239 |
-| | | Adapt PLDA | 5.788 | 8.974 | 2.674 |
+| Model | Params | FLOPs | Backend | Pooled | Tagalog | Cantonese |
+|:---------------------|:------:|:------:|:----------:|:------:|:-------:|:---------:|
+| ResNet34-TSTP-emb256 | 6.63M | 4.55G | Cosine | 15.4 | 19.82 | 10.39 |
+| | | | PLDA | 11.689 | 16.961 | 6.239 |
+| | | | Adapt PLDA | 5.788 | 8.974 | 2.674 |
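The cosine backend in the table above is simply the normalized inner product between the enrollment and test embeddings. A minimal sketch of that scoring step (an illustration, not WeSpeaker's actual implementation):

```python
import numpy as np

def cosine_score(enroll: np.ndarray, test: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(enroll, test) /
                 (np.linalg.norm(enroll) * np.linalg.norm(test)))
```

A score close to 1 indicates a likely same-speaker trial; the PLDA backends replace this with a learned probabilistic model over the embedding space.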

The current PLDA implementation is fully compatible with the Kaldi version. Note that the results without adaptation can certainly be improved through parameter tuning and an extra LDA step, as shown in the Kaldi
78 changes: 34 additions & 44 deletions examples/voxceleb/v2/README.md
@@ -4,54 +4,43 @@
* Scoring: cosine (sub mean of vox2_dev)
* Metric: EER(%)

-| Model | Params | AS-Norm(300) | vox1-O-clean | vox1-E-clean | vox1-H-clean |
-|:------|:------:|:------------:|:------------:|:------------:|:------------:|
-| XVEC-TSTP-emb512 | 4.61M | × | 1.962 | 1.918 | 3.389 |
-| | | √ | 1.835 | 1.822 | 3.110 |
-| ECAPA_TDNN_GLOB_c512-ASTP-emb192 | 6.19M | × | 1.149 | 1.248 | 2.313 |
-| | | √ | 1.026 | 1.154 | 2.089 |
-| ResNet34-TSTP-emb256 | 6.63M | × | 0.941 | 1.114 | 2.026 |
-| | | √ | 0.899 | 1.064 | 1.856 |

* 🔥 UPDATE 2023.6.30: We support the SphereFace2 loss function and obtain better and more robust performance, see [#173](https://github.com/wenet-e2e/wespeaker/pull/173).

-* 🔥 UPDATE 2022.07.19: We apply the same setups as the winning system of CNSRC 2022 (see [cnceleb](https://github.com/wenet-e2e/wespeaker/tree/master/examples/cnceleb/v2) recipe for details), and obtain significant performance improvement compared with our previous implementation.
+* 🔥 UPDATE 2022.07.19: We apply the same setups as the winning system of CNSRC 2022 (see [cnceleb](https://github.com/wenet-e2e/wespeaker/tree/master/examples/cnceleb/v2) recipe for details), and obtain significant performance improvement.
* LR scheduler warmup from 0
* Remove one embedding layer in ResNet models
* Add large margin fine-tuning strategy (LM)

-| Model | Params | LM | AS-Norm | vox1-O-clean | vox1-E-clean | vox1-H-clean |
-|:------|:------:|:--:|:-------:|:------------:|:------------:|:------------:|
-| ECAPA_TDNN_GLOB_c512-ASTP-emb192 | 6.19M | × | × | 1.069 | 1.209 | 2.310 |
-| | | × | √ | 0.957 | 1.128 | 2.105 |
-| | | √ | × | 0.878 | 1.072 | 2.007 |
-| | | √ | √ | 0.782 | 1.005 | 1.824 |
-| ECAPA_TDNN_GLOB_c1024-ASTP-emb192 | 14.65M | × | × | 0.856 | 1.072 | 2.059 |
-| | | × | √ | 0.808 | 0.990 | 1.874 |
-| | | √ | × | 0.798 | 0.993 | 1.883 |
-| | | √ | √ | 0.728 | 0.929 | 1.721 |
-| ResNet34-TSTP-emb256 | 6.63M | × | × | 0.867 | 1.049 | 1.959 |
-| | | × | √ | 0.787 | 0.964 | 1.726 |
-| | | √ | × | 0.797 | 0.937 | 1.695 |
-| | | √ | √ | 0.723 | 0.867 | 1.532 |
-| ResNet221-TSTP-emb256 | 23.86M | × | × | 0.569 | 0.774 | 1.464 |
-| | | × | √ | 0.479 | 0.707 | 1.290 |
-| | | √ | × | 0.580 | 0.729 | 1.351 |
-| | | √ | √ | 0.505 | 0.676 | 1.213 |
-| ResNet293-TSTP-emb256 | 28.69M | × | × | 0.595 | 0.756 | 1.433 |
-| | | × | √ | 0.537 | 0.701 | 1.276 |
-| | | √ | × | 0.532 | 0.707 | 1.311 |
-| | | √ | √ | **0.447** | **0.657** | **1.183** |
-| RepVGG_TINY_A0 | 6.26M | × | × | 0.909 | 1.034 | 1.943 |
-| | | × | √ | 0.824 | 0.953 | 1.709 |
-| CAM++ | 7.18M | × | × | 0.803 | 0.932 | 1.860 |
-| | | × | √ | 0.718 | 0.879 | 1.735 |
-| | | √ | × | 0.707 | 0.845 | 1.664 |
-| | | √ | √ | 0.659 | 0.803 | 1.569 |


-* 🔥 UPDATE 2022.11.30: We support arc_margin_intertopk_subcenter loss function and Multi-query Multi-head Attentive Statistics Pooling, and obtain better performance especially on hard trials [VoxSRC](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/competition2021.html).
-* See [#115](https://github.com/wenet-e2e/wespeaker/pull/115).
+| Model | Params | FLOPs | LM | AS-Norm | vox1-O-clean | vox1-E-clean | vox1-H-clean |
+|:------|:------:|:-----:|:--:|:-------:|:------------:|:------------:|:------------:|
+| XVEC-TSTP-emb512 | 4.61M | 0.53G | × | × | 1.989 | 1.209 | 3.412 |
+| | | | × | √ | 1.834 | 1.846 | 3.124 |
+| | | | √ | × | 1.749 | 1.721 | 2.944 |
+| | | | √ | √ | 1.590 | 1.641 | 2.726 |
+| ECAPA_TDNN_GLOB_c512-ASTP-emb192 | 6.19M | 1.04G | × | × | 1.069 | 1.209 | 2.310 |
+| | | | × | √ | 0.957 | 1.128 | 2.105 |
+| | | | √ | × | 0.878 | 1.072 | 2.007 |
+| | | | √ | √ | 0.782 | 1.005 | 1.824 |
+| ECAPA_TDNN_GLOB_c1024-ASTP-emb192 | 14.65M | 2.65G | × | × | 0.856 | 1.072 | 2.059 |
+| | | | × | √ | 0.808 | 0.990 | 1.874 |
+| | | | √ | × | 0.798 | 0.993 | 1.883 |
+| | | | √ | √ | 0.728 | 0.929 | 1.721 |
+| ResNet34-TSTP-emb256 | 6.63M | 4.55G | × | × | 0.867 | 1.049 | 1.959 |
+| | | | × | √ | 0.787 | 0.964 | 1.726 |
+| | | | √ | × | 0.797 | 0.937 | 1.695 |
+| | | | √ | √ | 0.723 | 0.867 | 1.532 |
+| ResNet221-TSTP-emb256 | 23.79M | 21.29G | × | × | 0.569 | 0.774 | 1.464 |
+| | | | × | √ | 0.479 | 0.707 | 1.290 |
+| | | | √ | × | 0.580 | 0.729 | 1.351 |
+| | | | √ | √ | 0.505 | 0.676 | 1.213 |
+| ResNet293-TSTP-emb256 | 28.62M | 28.10G | × | × | 0.595 | 0.756 | 1.433 |
+| | | | × | √ | 0.537 | 0.701 | 1.276 |
+| | | | √ | × | 0.532 | 0.707 | 1.311 |
+| | | | √ | √ | **0.447** | **0.657** | **1.183** |
+| RepVGG_TINY_A0 | 6.26M | 4.65G | × | × | 0.909 | 1.034 | 1.943 |
+| | | | × | √ | 0.824 | 0.953 | 1.709 |
+| CAM++ | 7.18M | 1.15G | × | × | 0.803 | 0.932 | 1.860 |
+| | | | × | √ | 0.718 | 0.879 | 1.735 |
+| | | | √ | × | 0.707 | 0.845 | 1.664 |
+| | | | √ | √ | 0.659 | 0.803 | 1.569 |
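The AS-Norm column above refers to adaptive score normalization against a cohort (the "(300)" in the earlier table denotes keeping the 300 top-scoring cohort utterances). A rough sketch of the score-level computation, with hypothetical argument names and not WeSpeaker's actual implementation:

```python
import numpy as np

def as_norm(score, enroll_cohort_scores, test_cohort_scores, top_n=300):
    """Adaptive symmetric score normalization (AS-Norm) of one trial score.

    The cohort score lists hold similarities of the enrollment / test
    embedding against a fixed cohort set; only the top_n highest scores
    per side are used for the normalization statistics.
    """
    e = np.sort(np.asarray(enroll_cohort_scores))[::-1][:top_n]
    t = np.sort(np.asarray(test_cohort_scores))[::-1][:top_n]
    # z-normalize the trial score against both sides and average
    return 0.5 * ((score - e.mean()) / e.std() +
                  (score - t.mean()) / t.std())
```

Calibrating scores this way typically lowers EER and minDCF, which matches the consistent gains in the √ AS-Norm rows above.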


## PLDA results
@@ -66,3 +55,4 @@ The results on ResNet34 (large margin, no asnorm) are:
| Scoring method | vox1-O-clean | vox1-E-clean | vox1-H-clean |
|:--------------:|:------------:|:------------:|:------------:|
| PLDA | 1.207 | 1.350 | 2.528 |

3 changes: 3 additions & 0 deletions examples/voxconverse/README.md
@@ -0,0 +1,3 @@
This is a **WeSpeaker** speaker diarization recipe on the Voxconverse 2020 dataset, which focuses on an ``in the wild`` scenario: the data were collected from YouTube videos with a semi-automatic pipeline and released for the diarization track of the VoxSRC 2020 Challenge. See https://www.robots.ox.ac.uk/~vgg/data/voxconverse/ for more details.

Two recipes are provided: **v1** and **v2**. The only difference is that **v2** splits the Fbank extraction, embedding extraction, and clustering modules into separate stages. We recommend that newcomers follow the **v2** recipe and run it stage by stage.
1 change: 0 additions & 1 deletion examples/voxconverse/v2/README.md
@@ -1,6 +1,5 @@
## Overview

* Compared with the [v1](https://github.com/wenet-e2e/wespeaker/tree/master/examples/voxconverse/v1) recipe, here we split the Fbank extraction, embedding extraction, and clustering modules into separate stages.
* We suggest running this recipe on a GPU machine with onnxruntime-gpu support.
* Dataset: voxconverse_dev that consists of 216 utterances
* Speaker model: ResNet34 model pretrained by wespeaker
30 changes: 16 additions & 14 deletions runtime/onnxruntime/README.md
@@ -74,16 +74,18 @@ onnx_dir=your_model_dir
>
> CPU: Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz

-| Model | Params | RTF |
-| ------------------- | ------- | -------- |
-| ECAPA-TDNN (C=512) | 6.19 M | 0.018351 |
-| ECAPA-TDNN (C=1024) | 14.65 M | 0.041724 |
-| RepVGG-TINY-A0 | 6.26 M | 0.055117 |
-| ResNet-34 | 6.63 M | 0.060735 |
-| ResNet-152 | 19.88 M | 0.179379 |
-| ResNet-221 | 23.86 M | 0.267511 |
-| ResNet-293 | 28.69 M | 0.364011 |
-| CAM++ | 7.18 M | 0.022978 |
+| Model | Params | FLOPs | RTF |
+| :------------------ | :------ | :------- | :------- |
+| ECAPA-TDNN (C=512) | 6.19 M | 1.04 G | 0.018351 |
+| ECAPA-TDNN (C=1024) | 14.65 M | 2.65 G | 0.041724 |
+| RepVGG-TINY-A0 | 6.26 M | 4.65 G | 0.055117 |
+| ResNet-34 | 6.63 M | 4.55 G | 0.060735 |
+| ResNet-50 | 11.13 M | 5.17 G | 0.073231 |
+| ResNet-101 | 15.89 M | 9.96 G | 0.124613 |
+| ResNet-152 | 19.81 M | 14.76 G | 0.179379 |
+| ResNet-221 | 23.79 M | 21.29 G | 0.267511 |
+| ResNet-293 | 28.62 M | 28.10 G | 0.364011 |
+| CAM++ | 7.18 M | 1.15 G | 0.022978 |
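RTF (real-time factor) in these tables is inference wall-clock time divided by the duration of the input audio. A minimal way to reproduce such a measurement, assuming an 80-dim fbank input with a 10 ms frame shift; this is a sketch with a hypothetical helper name, not the benchmark script used for the numbers above:

```python
import time
import torch

def measure_rtf(model, num_frames=200, feat_dim=80, frame_shift_ms=10, runs=20):
    """Average wall-clock inference time divided by audio duration."""
    x = torch.zeros(1, num_frames, feat_dim)  # one 2-second fbank chunk
    model.eval()
    with torch.no_grad():
        model(x)  # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        elapsed = (time.perf_counter() - start) / runs
    audio_seconds = num_frames * frame_shift_ms / 1000.0
    return elapsed / audio_seconds
```

An RTF below 1.0 means the model extracts embeddings faster than real time; the CPU numbers above were additionally pinned to a single thread.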

> num_threads = 1
>
@@ -93,16 +95,16 @@ onnx_dir=your_model_dir
>
> GPU: NVIDIA 3090

-| Model | Params | RTF |
-| ------------------- | ------- | ---------- |
-| ResNet-34 | 6.63 M | 0.00857436 |
+| Model | Params | FLOPs | RTF |
+| :------------------ | :------ | :------- | :--------- |
+| ResNet-34 | 6.63 M | 4.55 G | 0.00857436 |

2. EER (%)
> onnxruntime: samples_per_chunk=-1.
>
> Mean normalization is not applied to the evaluation embeddings.

| Model | vox-O | vox-E | vox-H |
-| -------------- | ----- | ----- | ----- |
+| :------------- | ----- | ----- | ----- |
| ResNet-34-pt | 0.814 | 0.933 | 1.679 |
| ResNet-34-onnx | 0.814 | 0.933 | 1.679 |
7 changes: 6 additions & 1 deletion wespeaker/models/campplus.py
@@ -402,11 +402,16 @@ def forward(self, x):


if __name__ == '__main__':
-x = torch.zeros(10, 200, 80)
+x = torch.zeros(1, 200, 80)
model = CAMPPlus(feat_dim=80, embed_dim=512, pooling_func='TSTP')
model.eval()
out = model(x)
print(out.shape)

num_params = sum(param.numel() for param in model.parameters())
print("{} M".format(num_params / 1e6))

# from thop import profile
# x_np = torch.randn(1, 200, 80)
# flops, params = profile(model, inputs=(x_np, ))
# print("FLOPs: {} G, Params: {} M".format(flops / 1e9, params / 1e6))
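The commented-out snippet above relies on the third-party `thop` package, whose `profile` call counts multiply–accumulate operations (often quoted directly as FLOPs, and presumably the source of the FLOPs columns in this PR). As a dependency-free cross-check, a rough MAC count can be gathered with forward hooks on the convolution and linear layers that dominate these models; this is a simplified sketch, not part of the PR, and it ignores activations, normalization and pooling:

```python
import torch
import torch.nn as nn

def count_macs(model, sample):
    """Rough MAC count via forward hooks on Conv and Linear layers."""
    totals = []

    def hook(module, inputs, output):
        if isinstance(module, nn.Linear):
            # each output element costs in_features multiply-adds
            totals.append(module.in_features * output.numel())
        else:  # Conv1d / Conv2d
            k = module.weight[0][0].numel()            # kernel taps
            cin = module.in_channels // module.groups  # inputs per group
            totals.append(k * cin * output.numel())

    handles = [m.register_forward_hook(hook) for m in model.modules()
               if isinstance(m, (nn.Linear, nn.Conv1d, nn.Conv2d))]
    with torch.no_grad():
        model(sample)
    for h in handles:
        h.remove()
    return sum(totals)

# Toy example: two Conv1d layers over a (batch, feat_dim, frames) input.
net = nn.Sequential(nn.Conv1d(80, 512, kernel_size=5), nn.ReLU(),
                    nn.Conv1d(512, 192, kernel_size=1))
print("MACs: {:.3f} G".format(count_macs(net, torch.zeros(1, 80, 200)) / 1e9))
# → MACs: 0.059 G
```

For a like-for-like comparison with the tables, feed the same one-batch, 200-frame, 80-dim input used in the `__main__` blocks.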
11 changes: 8 additions & 3 deletions wespeaker/models/ecapa_tdnn.py
@@ -262,13 +262,18 @@ def ECAPA_TDNN_GLOB_c512(feat_dim,


if __name__ == '__main__':
-x = torch.zeros(10, 200, 80)
+x = torch.zeros(1, 200, 80)
model = ECAPA_TDNN_GLOB_c512(feat_dim=80,
-embed_dim=192,
-pooling_func='MQMHASTP')
+embed_dim=256,
+pooling_func='ASTP')
model.eval()
out = model(x)
print(out.shape)

num_params = sum(param.numel() for param in model.parameters())
print("{} M".format(num_params / 1e6))

# from thop import profile
# x_np = torch.randn(1, 200, 80)
# flops, params = profile(model, inputs=(x_np, ))
# print("FLOPs: {} G, Params: {} M".format(flops / 1e9, params / 1e6))