
espnet-style attn_output_weight scaling and extra after-norm layer #204

Merged · 3 commits into k2-fsa:master · Jun 4, 2021

Conversation

glynpu (Contributor) commented Jun 2, 2021

Conformer structure differences were identified by loading an espnet-trained model into snowfall (#201):

  1. snowfall scales only q, while espnet scales attn_output_weights (see the sketch after this list).
  2. The espnet conformer has an extra layer_norm after the encoder.
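For concreteness, a minimal single-head sketch of the two scalings (hypothetical tensor names; the real module has multiple heads and an extra positional-score term). Without any bias the two are mathematically identical; they differ once a Transformer-XL-style bias is added to q between the scaling and the dot product:

    import torch

    d_k = 64
    scale = d_k ** -0.5
    q = torch.randn(10, d_k)      # projected queries for 10 frames
    k = torch.randn(10, d_k)      # projected keys
    pos_bias = torch.randn(d_k)   # learnable bias added to q (Transformer-XL style)

    # snowfall-style: scale q first; the bias added afterwards is NOT scaled down.
    scores_scale_q = (q * scale + pos_bias) @ k.t()

    # espnet-style: form the raw scores first, then scale attn_output_weights,
    # so the bias contribution is divided by sqrt(d_k) as well.
    scores_scale_weights = ((q + pos_bias) @ k.t()) * scale

    attn_weights = torch.softmax(scores_scale_weights, dim=-1)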

With these two modifications and 30 epochs of training, the final result is a bit better (3.69 vs. 3.86, as reported in #154) than otherwise.

Could you help verify their effectiveness (maybe they are just training variance)? @zhu-han @pzelasko
BTW, is there any mathematical background that explains when to apply the scaling during the attn_output_weights computation? I read several papers but failed to find a clue about this.

Results WITH 4-gram LM lattice rescoring
With the modifications of this PR:

avg epoch 16-20
2021-06-02 19:37:12,429 INFO [common.py:380] [test-clean] %WER 3.77% [1983 / 52576, 348 ins, 105 del, 1530 sub ]
2021-06-02 21:38:48,140 INFO [common.py:382] [test-other] %WER 7.86% [4116 / 52343, 704 ins, 260 del, 3152 sub ]
avg epoch 26-30
2021-06-02 19:25:40,616 INFO [common.py:380] [test-clean] %WER 3.69% [1938 / 52576, 386 ins, 96 del, 1456 sub ]
2021-06-02 21:45:22,304 INFO [common.py:382] [test-other] %WER 7.68% [4021 / 52343, 746 ins, 251 del, 3024 sub ]

Results of 4-gram lattice rescoring from #154:

avg epoch 16-20
2021-05-21 09:46:26,814 INFO [common.py:380] [test-clean] %WER 3.87% [2036 / 52576, 334 ins, 116 del, 1586 sub ]
2021-05-21 09:53:26,347 INFO [common.py:380] [test-other] %WER 8.08% [4231 / 52343, 710 ins, 241 del, 3280 sub ]
avg epoch 26-30
2021-05-22 14:53:36,527 INFO [common.py:380] [test-clean] %WER 3.86% [2030 / 52576, 345 ins, 114 del, 1571 sub ]
2021-05-22 15:00:10,075 INFO [common.py:380] [test-other] %WER 8.07% [4223 / 52343, 708 ins, 254 del, 3261 sub ]

Results WITHOUT 4-gram LM lattice rescoring
With the modifications of this PR:

avg epoch 16-20
2021-06-02 21:54:00,942 INFO [common.py:382] [test-clean] %WER 4.26% [2241 / 52576, 278 ins, 184 del, 1779 sub ]
2021-06-02 21:55:52,071 INFO [common.py:382] [test-other] %WER 8.61% [4505 / 52343, 602 ins, 386 del, 3517 sub ]
avg epoch 26-30
2021-06-02 21:49:51,271 INFO [common.py:382] [test-clean] %WER 4.14% [2179 / 52576, 296 ins, 177 del, 1706 sub ]
2021-06-02 21:51:30,037 INFO [common.py:382] [test-other] %WER 8.41% [4402 / 52343, 626 ins, 380 del, 3396 sub ]

Results from #154:

avg epoch 16-20
2021-05-21 09:34:55,569 INFO [common.py:380] [test-clean] %WER 4.33% [2274 / 52576, 268 ins, 183 del, 1823 sub ]
2021-05-21 09:35:43,453 INFO [common.py:380] [test-other] %WER 8.96% [4690 / 52343, 584 ins, 389 del, 3717 sub ]
avg epoch 26-30
2021-05-22 14:45:39,709 INFO [common.py:380] [test-clean] %WER 4.31% [2267 / 52576, 293 ins, 182 del, 1792 sub ]
2021-05-22 14:46:36,179 INFO [common.py:380] [test-other] %WER 8.98% [4700 / 52343, 610 ins, 388 del, 3702 sub ]

danpovey (Contributor) commented Jun 2, 2021

The scaling is just so that, assuming the input variance is about 1, the variance going into the softmax is about 1.
But the difference between the two scaling methods only affects the bias parameters (effectively, this new way scales down the bias). It's surprising that it makes so much difference; perhaps this new version does not focus too much on nearby frames.
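(For reference, the standard argument behind the $1/\sqrt{d_k}$ factor, assuming the entries of $q$ and $k$ are independent with zero mean and unit variance:

$$\mathrm{Var}\Big(\sum_{i=1}^{d_k} q_i k_i\Big) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i)\,\mathrm{Var}(k_i) = d_k,$$

so dividing the scores, or equivalently q in the bias-free case, by $\sqrt{d_k}$ brings the variance entering the softmax back to about 1.)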

danpovey (Contributor) commented Jun 2, 2021

Maybe there will be more WER difference at worse WERs, e.g. before LM rescoring.

danpovey (Contributor) commented Jun 2, 2021

.. don't you have the test-other results?

glynpu (Contributor, Author) commented Jun 2, 2021

Results before rescoring and on test-other are coming soon (being re-tested).

glynpu (Contributor, Author) commented Jun 2, 2021

> Maybe there will be more WER difference at worse WERs, e.g. before LM rescoring.

There seems to be no significant difference in the relative WER decrease before and after LM rescoring.

| avg epoch 16-20 | no rescore (test-clean) | no rescore (test-other) | 4-gram lattice rescore (test-clean) | 4-gram lattice rescore (test-other) |
| --- | --- | --- | --- | --- |
| before | 4.33 | 8.96 | 3.87 | 8.08 |
| current | 4.26 | 8.61 | 3.77 | 7.86 |
| relative decrease | 1.62% | 3.91% | 2.58% | 2.72% |

| avg epoch 26-30 | no rescore (test-clean) | no rescore (test-other) | 4-gram lattice rescore (test-clean) | 4-gram lattice rescore (test-other) |
| --- | --- | --- | --- | --- |
| before | 4.31 | 8.98 | 3.86 | 8.07 |
| current | 4.14 | 8.41 | 3.69 | 7.68 |
| relative decrease | 3.94% | 6.35% | 4.40% | 4.83% |

danpovey (Contributor) commented Jun 2, 2021 via email

danpovey (Contributor) commented Jun 3, 2021

Can you make this an option passed in from the user code, like in your other branch, so that we can
more easily decode with "old" models if we need to?

danpovey (Contributor) commented Jun 3, 2021

..I'm just concerned it might be disruptive to make this change as-is.

glynpu (Contributor, Author) commented Jun 3, 2021

> ..I'm just concerned it might be disruptive to make this change as-is.

To remain compatible with previously trained models, maybe an optional config, e.g. is_espnet_structure (or a more proper name), which defaults to False, could be used.
Like this:

    def __init__(self, num_features: int, num_classes: int, subsampling_factor: int = 4,
                 ...,
                 is_espnet_structure: bool = False) -> None:
        ...
        self.is_espnet_structure = is_espnet_structure
        if self.normalize_before and self.is_espnet_structure:
            self.after_norm = nn.LayerNorm(d_model)
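
Correspondingly, at the end of the encoder forward pass the extra norm would be applied with something like (a sketch, assuming x is the encoder output):

    if self.normalize_before and self.is_espnet_structure:
        x = self.after_norm(x)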

danpovey (Contributor) commented Jun 3, 2021

Yes.
We can change it to True in our current scripts; but it would at least make it possible to revert to False so we can test old models.

@@ -285,7 +285,8 @@ def main():
         num_classes=len(phone_ids) + 1,  # +1 for the blank symbol
         subsampling_factor=4,
         num_decoder_layers=num_decoder_layers,
-        vgg_frontend=True)
+        vgg_frontend=True,
+        is_espnet_structure=True)
Contributor:
Should have this in training script too

Contributor:

.. and it's better if you change the directory name, when changing the model structure.
you can remove a couple of older components of the filename, to stop it getting too long.

glynpu (Contributor, Author):

> Should have this in training script too

Added.

> .. and it's better if you change the directory name, when changing the model structure.
> you can remove a couple of older components of the filename, to stop it getting too long.

-    -noam-mmi-att-musan-sa-vgg
+    -mmi-att-sa-vgg-normlayer

danpovey (Contributor) commented Jun 4, 2021

Thanks a lot!

danpovey merged commit f863026 into k2-fsa:master on Jun 4, 2021.