WIP: Add timit recipe #96
Conversation
#
# - $dl_dir/lm
# This directory contains the language model(LM) downloaded from
# https://huggingface.co/luomingshuang/timit_lm, and the LM is based
Could you please describe how lm_tgmed.arpa is obtained?
Is it possible to train it inside icefall?
Em....the lm_tgmed.arpa is obtained with this train_lms.sh script, which follows kaldi. Training the LM inside icefall is a good idea; I have been wondering whether we can train the LM in Python. There are some methods for this using KenLM. Maybe I can have a look.
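For illustration, a minimal sketch of training such an ARPA LM with KenLM (assumes the lmplz binary from a compiled KenLM is on PATH; the file names are examples, not the recipe's actual paths):

# Train a 3-gram ARPA LM with KenLM's lmplz binary.
# --discount_fallback helps on small corpora such as TIMIT transcripts.
import subprocess

with open("data/lm/transcript_phones.txt") as corpus, \
        open("data/lm/lm_3gram_kenlm.arpa", "w") as arpa:
    subprocess.run(
        ["lmplz", "-o", "3", "--discount_fallback"],
        stdin=corpus,   # lmplz reads the training text from stdin
        stdout=arpa,    # and writes the ARPA model to stdout
        check=True,
    )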
OK, train_lms.sh uses https://github.com/danpovey/kaldi_lm.git
I will wrap it in Python with pybind11 when I have time.
2021-10-28 13:20:42,952 INFO [decode.py:360] Wrote detailed error stats to tdnn_lstm_ctc/exp/errs-TEST-lm_scale_2.0.txt
2021-10-28 13:20:42,986 INFO [decode.py:374]
For TEST, PER of different settings are:
lm_scale_0.1 20.82 best for TEST
Could you try smaller lm scale values, since the best one lies at the edge of the searched range?
Doing.
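For illustration, one way to widen the grid (a sketch; the actual list lives in decode.py and may be structured differently):

# Extend the lm scale search below 0.1, since the best PER currently
# sits at the lower edge of the searched range (0.1 .. 2.0).
lm_scale_list = [0.01, 0.02, 0.05, 0.08]
lm_scale_list += [round(0.1 * i, 1) for i in range(1, 21)]  # 0.1 .. 2.0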
    recordings=m["recordings"],
    supervisions=m["supervisions"],
)
if "train" in partition:
Please note that in librispeech, the names of the training datasets begin with train (lowercase). In TIMIT, I find that it is TRAIN (uppercase), see line 52 in this file, so this if statement is never executed. Please change train to TRAIN and re-run your experiments.
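A minimal illustration of why the check never fires (the membership test is case-sensitive):

partition = "TRAIN"  # example TIMIT partition name
assert "train" not in partition  # the current check: always False here
assert "TRAIN" in partition      # the suggested fix matches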
Oh....will do it....
load_dicts = json.load(load_f)
for load_dict in load_dicts:
    text = load_dict["text"]
    phones_list = list(filter(None, text.split(" ")))
Could it be changed to phones_list = text.split()? It's simpler and easier to understand.
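A quick check that the two forms agree, since split() with no argument already drops empty strings:

text = " sil  dh ax sil "  # example transcript with stray spaces
assert list(filter(None, text.split(" "))) == text.split()
# split() without an argument also handles tabs and newlines.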
phones_list = list(filter(None, text.split(" ")))

for phone in phones_list:
    if phone not in phones:
Could you use a set to represent phones, not a list? A set is more efficient for lookups.
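A sketch of what that could look like (names follow the snippet above; the surrounding loop is assumed):

# With a set, membership tests are O(1) and duplicates are dropped
# automatically, so the explicit "not in" check becomes unnecessary.
phones = set()
texts = ["sil dh ax sil", "sil ae t sil"]  # example transcripts
for text in texts:
    phones.update(text.split())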
with open(lexicon, "w") as f:
    for phone in sorted(phones):
        f.write(str(phone) + " " + str(phone))
phone is already of type str, can we remove str here?
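Something like this, as a minimal sketch (the trailing newline is an assumption, so each entry lands on its own line):

phones = {"sil", "dh", "ax"}  # example phone set
with open("lexicon.txt", "w") as f:
    for phone in sorted(phones):
        f.write(phone + " " + phone + "\n")  # phone is already a str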
OK.
# We assume that you have installed the git-lfs, if not, you could install it
# using: `sudo apt-get install git-lfs && git-lfs install`
[ ! -e $dl_dir/lm ] && mkdir -p $dl_dir/lm
git clone https://huggingface.co/luomingshuang/timit_lm $dl_dir/lm
Please add a check that lm_tgmed.arpa is downloaded correctly. Some users may forget to run git lfs install. You can add an extra statement:
( cd $dl_dir/lm && git lfs pull )
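One possible check, sketched in Python (the header below is what git-lfs writes into pointer stubs; the path assumes $dl_dir is download):

# If `git lfs install` was skipped, the clone leaves a small text pointer
# in place of the real file; pointer files start with this header.
from pathlib import Path
import sys

arpa = Path("download/lm/lm_tgmed.arpa")
with arpa.open("rb") as f:
    head = f.read(64)
if head.startswith(b"version https://git-lfs.github.com/spec/v1"):
    sys.exit("lm_tgmed.arpa is a git-lfs pointer; run: git lfs pull")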
if [ $stage -le 6 ] && [ $stop_stage -ge 6 ]; then
  log "Stage 6: Prepare G"
  # We assume you have install kaldilm, if not, please install
typo: install -> installed
    --read-symbol-table="data/lang_phone/words.txt" \
    --disambig-symbol='#0' \
    --max-order=4 \
    $dl_dir/lm/lm_tgmed.arpa > data/lm/G_4_gram.fst.txt
tgmed means this arpa is a tri-gram, of medium size, I think. Please use a 4-gram arpa to generate G_4_gram.fst.txt, if you need it for decoding/rescoring.
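For illustration, a sketch of the corrected invocation once a real 4-gram ARPA is available (the 4-gram file name here is an assumption):

import subprocess

with open("data/lm/G_4_gram.fst.txt", "w") as out:
    subprocess.run(
        [
            "python3", "-m", "kaldilm",
            "--read-symbol-table=data/lang_phone/words.txt",
            "--disambig-symbol=#0",
            "--max-order=4",
            "download/lm/lm_4gram.arpa",  # assumed path to a 4-gram LM
        ],
        stdout=out,
        check=True,
    )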
Will do it.
@@ -0,0 +1,97 @@
#!/usr/bin/env bash |
This file is shared across various recipes.
Could you make it a symlink, like what we are doing in the librispeech recipe?
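A hedged sketch of creating such a symlink from Python (both paths are assumptions; adjust them to the actual shared file, and run from the recipe directory):

from pathlib import Path

link = Path("shared/parse_options.sh")  # assumed name of the shared script
target = Path("../../librispeech/ASR/shared/parse_options.sh")  # assumed
if link.is_symlink() or link.exists():
    link.unlink()
link.symlink_to(target)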
@@ -0,0 +1,400 @@
FADG0_SI1279 TEST/DR4/FADG0/SI1279.WAV |
Can this file be generated by some scripts? If so, we don't need to check it in.
Em....about the {train, dev, test} split files, I haven't found any scripts that generate them. In kaldi, they are kept as list files. In speechbrain, they are placed in timit_prepare.py, which lists the speakers. One option for us is to follow speechbrain and keep a list of speaker names in the data preparation process. I will add it to Lhotse.
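A hedged sketch of generating such a list by walking the corpus tree (the corpus path and output name are assumptions; kaldi additionally filters the SA sentences and fixes the speaker subsets, which is omitted here):

from pathlib import Path

timit_root = Path("/path/to/TIMIT")  # assumed corpus location
with open("test_wav.lst", "w") as out:  # example output name
    for wav in sorted(timit_root.glob("TEST/DR*/*/*.WAV")):
        spk = wav.parent.name  # e.g. FADG0
        utt = wav.stem         # e.g. SI1279
        out.write(f"{spk}_{utt} {wav.relative_to(timit_root)}\n")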
Add a timit recipe for icefall. This recipe uses phones as modeling units and aims to compute the PER; the target output is a sequence of phones. The {dev, test} split follows kaldi ({kaldi-timit-dev, kaldi-timit-test}). At present, the recipe contains tdnn_lstm_ctc. I will add other models and methods (such as conformer, crdnn, and MMI) later.
In fact, I have done some experiments for timit based on snowfall. k2-fsa/snowfall#247
The current result is not the best. I will continue to improve it.
log-train-2021-10-28-15-24-21.txt
https://tensorboard.dev/experiment/twUbZTxoTAK32bPCJsYF7Q/#scalars
TODOs: