This repository has been archived by the owner on Oct 13, 2022. It is now read-only.

Use a pre-trained bigram P for LF-MMI training #222

Merged
csukuangfj merged 4 commits into k2-fsa:master from pretrained-P on Jul 2, 2021

Conversation

csukuangfj
Collaborator

Will post the result (the WER) when it is available, probably tomorrow.


Here is the information about the size of P when the number of phones is 86.

|            | current P | pre-trained P | pre-trained P after epsilon removal |
|------------|-----------|---------------|--------------------------------------|
| num_states | 88        | 74            | 74                                   |
| num_arcs   | 7568      | 3634          | 7209                                 |

If we are going to use word pieces with vocab_size 5000, I hope that P does not grow quadratically in size.
Will show the size of P for word pieces soon.
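
For reference, here is a minimal sketch (assuming k2's Python API; the file path and the way P was saved are assumptions) of how the numbers in the table above can be reproduced, i.e. the number of states and arcs of P before and after epsilon removal:

```python
# A minimal sketch, assuming k2's Python API; the path to P.pt is hypothetical
# and P is assumed to have been saved with torch.save(P.as_dict(), ...).
import torch
import k2

P = k2.Fsa.from_dict(torch.load('data/lang_nosp/P.pt'))
print('before epsilon removal:', P.shape[0], 'states,', P.num_arcs, 'arcs')

P_no_eps = k2.connect(k2.remove_epsilon(P))
print('after  epsilon removal:', P_no_eps.shape[0], 'states,', P_no_eps.num_arcs, 'arcs')
```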

@csukuangfj
Collaborator Author

csukuangfj commented Jun 30, 2021

Here are the WERs (100 hours of data, 10 epochs):

[Screenshots: WER tables for test-clean and test-other]

Results from #212 (comment)


Results with this pull request are a little worse than those of #212.


Results from #218 (comment)

#218 is the latest run that I have. The experimental setups of #218 and this pull request are similar and comparable. Compared with #218, the pre-trained P has a lower WER on both test-clean (5.74 vs 5.83) and test-other (15.00 vs 15.64).

@danpovey
Contributor

That's interesting!
To make the graphs smaller we can consider using count cutoffs (min-counts) to take away low-count n-grams.
I'm kind of confused why there is so much difference; I would have expected the two to be very similar, because how we train the graphs is very similar to ML. But it could be due to the ARPA having smoothing, or to differences regarding silence.
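
To make the min-count idea concrete, here is a small sketch (not the code used in this PR): when collecting bigram counts for P, bigrams seen fewer than min_count times are simply dropped before the FSA is built, which shrinks the number of arcs.

```python
# A hedged sketch of count cutoffs (min-counts), not the code used in this PR:
# bigrams seen fewer than `min_count` times are dropped before building P.
from collections import Counter

def prune_bigrams(sentences, min_count=2):
    """sentences: list of token-id lists. Returns the surviving bigram counts."""
    counts = Counter()
    for sent in sentences:
        for a, b in zip(sent[:-1], sent[1:]):
            counts[(a, b)] += 1
    return {bg: c for bg, c in counts.items() if c >= min_count}

# toy usage
data = [[1, 2, 3, 2, 3], [1, 2, 4]]
print(prune_bigrams(data, min_count=2))  # {(1, 2): 2, (2, 3): 2}
```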

@@ -0,0 +1,377 @@
#!/usr/bin/env python3

# Copyright 2016 Johns Hopkins University (Author: Daniel Povey)

@danpovey
Contributor

danpovey commented Jun 30, 2021 via email

@csukuangfj
Collaborator Author

I'm OK to merge this as-is.

I have removed the code supporting training P on the fly in this pull request.
Shall I add an option to let the user choose which kind of P to use? (This would make the code more complicated.)

@danpovey
Contributor

danpovey commented Jun 30, 2021 via email

@@ -88,9 +88,11 @@ def load_checkpoint(
src_key = '{}.{}'.format('module', key)
dst_state_dict[key] = src_state_dict.pop(src_key)
assert len(src_state_dict) == 0
-model.load_state_dict(dst_state_dict)
+model.load_state_dict(dst_state_dict, strict=False)
Collaborator Author

@danpovey
Adding strict=False should prevent PyTorch from complaining about the extra key P_scores in the checkpoints.
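
For reference, a tiny standalone example (not the repository's code) of what strict=False does here: the stale key is reported in unexpected_keys instead of raising a RuntimeError.

```python
# A small standalone illustration (not snowfall's code) of strict=False:
# an extra key such as P_scores is reported in unexpected_keys instead of
# aborting the load with a RuntimeError.
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
state_dict = model.state_dict()
state_dict['P_scores'] = torch.zeros(3)  # stale key from an old checkpoint

result = model.load_state_dict(state_dict, strict=False)
print(result.unexpected_keys)  # ['P_scores']
print(result.missing_keys)     # []
```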

@csukuangfj
Collaborator Author

Will merge.

@danpovey
Contributor

danpovey commented Jul 1, 2021

You might want to check the strict=False option. IIRC last time I tried it the torch code was broken and was not correctly respecting that option, and I had to make changes to torch itself locally.

@csukuangfj
Collaborator Author

@danpovey

> You might want to check the strict=False option. IIRC last time I tried it the torch code was broken and was not correctly respecting that option, and I had to make changes to torch itself locally.

I suspect that you forgot to add it to the function average_checkpoint() and added it only to load_checkpoint().

I just verified that it works perfectly with strict=False to load an old model checkpoint that has P_scores.
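
For completeness, a hedged sketch of checkpoint averaging followed by a strict=False load (this is not snowfall's actual average_checkpoint; the 'state_dict' key and the file names are assumptions):

```python
# A hedged sketch, not snowfall's average_checkpoint: the 'state_dict' key and
# the file paths are assumptions. It averages parameters across checkpoints and
# loads the result with strict=False so stale keys like P_scores are ignored.
import torch

def average_state_dicts(paths):
    avg = None
    for path in paths:
        state = torch.load(path, map_location='cpu')['state_dict']
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# usage (file names are hypothetical):
# model.load_state_dict(average_state_dicts(['epoch-8.pt', 'epoch-9.pt']),
#                       strict=False)
```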

@csukuangfj csukuangfj merged commit 25051ea into k2-fsa:master Jul 2, 2021
@csukuangfj csukuangfj deleted the pretrained-P branch July 2, 2021 02:59
@pzelasko
Collaborator

pzelasko commented Jul 2, 2021

After this PR, the “pretrained” P is always going to be used, right?

Also, did you try a 3- or 4-gram (Kaldi style)? I guess it should help further.

@csukuangfj
Collaborator Author

> After this PR, the “pretrained” P is always going to be used, right?

Yes, that's right. Supporting both pre-trained P and on-the-fly trained P makes the code complicated.
Pre-trained P gives a slightly better WER according to the above experiments.


> Also, did you try a 3- or 4-gram (Kaldi style)? I guess it should help further.

Thanks. I will try that.

@danpovey
Contributor

danpovey commented Jul 3, 2021

To use a 3- or 4-gram LM, we would definitely need to do some kind of pruning, or the LM will be way too large.
Ruizhe Huang @huangruizhe is working on a self-contained Python script for Kaldi that can do that. Let's try this after he finishes it.
