
espnet-style attn_output_weight scaling and extra after-norm layer #204

Merged · 3 commits into k2-fsa:master · Jun 4, 2021

Conversation

glynpu (Contributor) commented Jun 2, 2021

Conformer structure differences were identified by loading an espnet-trained model into snowfall (#201):

  1. snowfall scales only q, while espnet scales attn_output_weights (see the sketch after this list).
  2. The espnet conformer has an extra layer_norm after the encoder.
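For concreteness, a minimal single-head sketch of the two scalings (hypothetical tensor names; the real module has multiple heads and an extra positional-score term). Without any bias the two are mathematically identical; they differ once a Transformer-XL-style bias is added to q between the scaling and the dot product:

    import torch

    d_k = 64
    scale = d_k ** -0.5
    q = torch.randn(10, d_k)      # projected queries for 10 frames
    k = torch.randn(10, d_k)      # projected keys
    pos_bias = torch.randn(d_k)   # learnable bias added to q (Transformer-XL style)

    # snowfall-style: scale q first; the bias added afterwards is NOT scaled down.
    scores_scale_q = (q * scale + pos_bias) @ k.t()

    # espnet-style: form the raw scores first, then scale attn_output_weights,
    # so the bias contribution is divided by sqrt(d_k) as well.
    scores_scale_weights = ((q + pos_bias) @ k.t()) * scale

    attn_weights = torch.softmax(scores_scale_weights, dim=-1)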

With these two modifications and 30 epochs of training, the final result is a bit better (3.69 vs. 3.86, as reported in #154) than otherwise.

Could you help verify their effectiveness (maybe they are just training variance)? @zhu-han @pzelasko
BTW, is there any mathematical background that explains when to apply the scaling during the attn_output_weights computation? I read several papers but failed to find a clue about this.

Results WITH 4-gram LM lattice rescoring
With the modifications of this PR:

avg epoch 16-20
2021-06-02 19:37:12,429 INFO [common.py:380] [test-clean] %WER 3.77% [1983 / 52576, 348 ins, 105 del, 1530 sub ]
2021-06-02 21:38:48,140 INFO [common.py:382] [test-other] %WER 7.86% [4116 / 52343, 704 ins, 260 del, 3152 sub ]
avg epoch 26-30
2021-06-02 19:25:40,616 INFO [common.py:380] [test-clean] %WER 3.69% [1938 / 52576, 386 ins, 96 del, 1456 sub ]
2021-06-02 21:45:22,304 INFO [common.py:382] [test-other] %WER 7.68% [4021 / 52343, 746 ins, 251 del, 3024 sub ]

Results of 4-gram lattice rescoring from #154:

avg epoch 16-20
2021-05-21 09:46:26,814 INFO [common.py:380] [test-clean] %WER 3.87% [2036 / 52576, 334 ins, 116 del, 1586 sub ]
2021-05-21 09:53:26,347 INFO [common.py:380] [test-other] %WER 8.08% [4231 / 52343, 710 ins, 241 del, 3280 sub ]
avg epoch 26-30
2021-05-22 14:53:36,527 INFO [common.py:380] [test-clean] %WER 3.86% [2030 / 52576, 345 ins, 114 del, 1571 sub ]
2021-05-22 15:00:10,075 INFO [common.py:380] [test-other] %WER 8.07% [4223 / 52343, 708 ins, 254 del, 3261 sub ]

Results WITHOUT 4-gram LM lattice rescoring
With the modifications of this PR:

avg epoch 16-20
2021-06-02 21:54:00,942 INFO [common.py:382] [test-clean] %WER 4.26% [2241 / 52576, 278 ins, 184 del, 1779 sub ]
2021-06-02 21:55:52,071 INFO [common.py:382] [test-other] %WER 8.61% [4505 / 52343, 602 ins, 386 del, 3517 sub ]
avg epoch 26-30
2021-06-02 21:49:51,271 INFO [common.py:382] [test-clean] %WER 4.14% [2179 / 52576, 296 ins, 177 del, 1706 sub ]
2021-06-02 21:51:30,037 INFO [common.py:382] [test-other] %WER 8.41% [4402 / 52343, 626 ins, 380 del, 3396 sub ]

Results from #154:

avg epoch 16-20
2021-05-21 09:34:55,569 INFO [common.py:380] [test-clean] %WER 4.33% [2274 / 52576, 268 ins, 183 del, 1823 sub ]
2021-05-21 09:35:43,453 INFO [common.py:380] [test-other] %WER 8.96% [4690 / 52343, 584 ins, 389 del, 3717 sub ]
avg epoch 26-30
2021-05-22 14:45:39,709 INFO [common.py:380] [test-clean] %WER 4.31% [2267 / 52576, 293 ins, 182 del, 1792 sub ]
2021-05-22 14:46:36,179 INFO [common.py:380] [test-other] %WER 8.98% [4700 / 52343, 610 ins, 388 del, 3702 sub ]

danpovey (Contributor) commented Jun 2, 2021

The scaling is just so that, assuming the input variance is about 1, the variance going into the softmax is about 1.
But the difference between the two scaling methods only affects the bias parameters (effectively, this new way scales down the bias). It's surprising that it makes so much difference; perhaps this new version does not focus too much on nearby frames.
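(For reference, the standard argument behind the $1/\sqrt{d_k}$ factor, assuming the entries of $q$ and $k$ are independent with zero mean and unit variance:

$$\mathrm{Var}\Big(\sum_{i=1}^{d_k} q_i k_i\Big) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i)\,\mathrm{Var}(k_i) = d_k,$$

so dividing the scores, or equivalently q in the bias-free case, by $\sqrt{d_k}$ brings the variance entering the softmax back to about 1.)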

danpovey (Contributor) commented Jun 2, 2021

Maybe there will be more WER difference at worse WERs, e.g. before LM rescoring.

danpovey (Contributor) commented Jun 2, 2021

.. don't you have the test-other results?

glynpu (Contributor, Author) commented Jun 2, 2021

Results before rescoring and on test-other are coming soon (being re-tested).

glynpu (Contributor, Author) commented Jun 2, 2021

> Maybe there will be more WER difference at worse WERs, e.g. before LM rescoring.

There seems to be no significant difference in the relative WER decrease before and after LM rescoring.

| avg epoch 16-20 | no rescore (test-clean) | no rescore (test-other) | 4-gram lattice rescore (test-clean) | 4-gram lattice rescore (test-other) |
| --- | --- | --- | --- | --- |
| before | 4.33 | 8.96 | 3.87 | 8.08 |
| current | 4.26 | 8.61 | 3.77 | 7.86 |
| relative decrease | 1.62% | 3.91% | 2.58% | 2.72% |

| avg epoch 26-30 | no rescore (test-clean) | no rescore (test-other) | 4-gram lattice rescore (test-clean) | 4-gram lattice rescore (test-other) |
| --- | --- | --- | --- | --- |
| before | 4.31 | 8.98 | 3.86 | 8.07 |
| current | 4.14 | 8.41 | 3.69 | 7.68 |
| relative decrease | 3.94% | 6.35% | 4.40% | 4.83% |

danpovey (Contributor) commented Jun 2, 2021 via email

danpovey (Contributor) commented Jun 3, 2021

Can you make this an option passed in from the user code, like in your other branch, so that we can
more easily decode with "old" models if we need to?

danpovey (Contributor) commented Jun 3, 2021

..I'm just concerned it might be disruptive to make this change as-is.

glynpu (Contributor, Author) commented Jun 3, 2021

> ..I'm just concerned it might be disruptive to make this change as-is.

To remain compatible with previously trained models, maybe an optional config, e.g. is_espnet_structure (or a more proper name), which defaults to False, could be used.
Like this:

    def __init__(self, num_features: int, num_classes: int, subsampling_factor: int = 4,
                 ...,
                 is_espnet_structure: bool = False) -> None:
        ...
        self.is_espnet_structure = is_espnet_structure
        if self.normalize_before and self.is_espnet_structure:
            self.after_norm = nn.LayerNorm(d_model)
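
Correspondingly, at the end of the encoder forward pass the extra norm would be applied with something like (a sketch, assuming x is the encoder output):

    if self.normalize_before and self.is_espnet_structure:
        x = self.after_norm(x)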

danpovey (Contributor) commented Jun 3, 2021

Yes.
We can change it to True in our current scripts; but it would at least make it possible to revert to False so we can test old models.

@@ -285,7 +285,8 @@ def main():
         num_classes=len(phone_ids) + 1,  # +1 for the blank symbol
         subsampling_factor=4,
         num_decoder_layers=num_decoder_layers,
-        vgg_frontend=True)
+        vgg_frontend=True,
+        is_espnet_structure=True)
Contributor:
Should have this in training script too

Contributor:

.. and it's better if you change the directory name, when changing the model structure.
you can remove a couple of older components of the filename, to stop it getting too long.

glynpu (Contributor, Author):

> Should have this in training script too

Added.

> .. and it's better if you change the directory name, when changing the model structure.
> you can remove a couple of older components of the filename, to stop it getting too long.

-    -noam-mmi-att-musan-sa-vgg
+    -mmi-att-sa-vgg-normlayer

danpovey (Contributor) commented Jun 4, 2021

Thanks a lot!

danpovey merged commit f863026 into k2-fsa:master on Jun 4, 2021.