
WIP: Add timit recipe #96

Closed
wants to merge 10 commits into from

Conversation

luomingshuang (Collaborator)

Add a TIMIT recipe for icefall. This recipe uses phones as the modeling units and aims to compute the PER; the target output is a sequence of phones. The {dev, test} split follows Kaldi ({kaldi-timit-dev, kaldi-timit-test}). At present, the recipe contains tdnn_lstm_ctc. I will add other models and methods (such as conformer, CRDNN, and MMI) later.

In fact, I have already done some TIMIT experiments based on snowfall: k2-fsa/snowfall#247

The current result is not yet the best; I will continue to improve it.
log-train-2021-10-28-15-24-21.txt
https://tensorboard.dev/experiment/twUbZTxoTAK32bPCJsYF7Q/#scalars

TODOs:

  • Add and check the scripts
  • Add documents for timit
  • Improve the performance

#
# - $dl_dir/lm
# This directory contains the language model(LM) downloaded from
# https://huggingface.co/luomingshuang/timit_lm, and the LM is based
Collaborator:

Could you please describe how lm_tgmed.arpa is obtained?
Is it possible to train it inside icefall?

Collaborator (Author):

Em... lm_tgmed.arpa is obtained with this train_lms.sh, following Kaldi. As for training the LM inside icefall, I think it is a good idea; I have wondered before whether we could train the LM in Python. There are some ways to do it with KenLM. Maybe I can have a look.

Collaborator:

ok, train_lms.sh uses https://github.com/danpovey/kaldi_lm.git

I will wrap it in Python with pybind11 when I have time.

2021-10-28 13:20:42,952 INFO [decode.py:360] Wrote detailed error stats to tdnn_lstm_ctc/exp/errs-TEST-lm_scale_2.0.txt
2021-10-28 13:20:42,986 INFO [decode.py:374]
For TEST, PER of different settings are:
lm_scale_0.1 20.82 best for TEST
Collaborator:

Could you try smaller lm_scale values? The best one (0.1) sits at the edge of the searched range.
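For illustration, a minimal sketch of extending the lm_scale grid in decode.py below 0.1 so the best value is no longer at the boundary; the exact list here is hypothetical, only borrowed in spirit from the librispeech recipe:

# Hypothetical sketch: widen the lm_scale grid searched during decoding
# so that 0.1 becomes an interior point rather than the smallest value tried.
lm_scale_list = [0.01, 0.02, 0.05, 0.08]
lm_scale_list += [0.1 * i for i in range(1, 21)]  # 0.1, 0.2, ..., 2.0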

Collaborator (Author):

Doing.

recordings=m["recordings"],
supervisions=m["supervisions"],
)
if "train" in partition:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that in librispeech, the names of the training datasets begin with train (lowercase).

In TIMIT, I find that it is TRAIN (uppercase); see line 52 in this file, so this if statement is never executed.

Please change train to TRAIN and re-run your experiments.
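A hedged alternative to hard-coding TRAIN is to compare case-insensitively, so the same check works for both corpora. A self-contained sketch (the helper name is illustrative, not from this PR):

def is_train_partition(partition: str) -> bool:
    # TIMIT uses uppercase partition names ("TRAIN"), librispeech
    # lowercase ("train-clean-100"); lowercasing handles both.
    return "train" in partition.lower()

assert is_train_partition("TRAIN") and not is_train_partition("TEST")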

Collaborator (Author):

Oh....will do it....

load_dicts = json.load(load_f)
for load_dict in load_dicts:
    text = load_dict["text"]
    phones_list = list(filter(None, text.split(" ")))
Collaborator:

Could it be changed to `phones_list = text.split()`? It's simpler and easier to understand.
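For space-separated phone strings the two forms are equivalent; a quick self-contained check (the sample text is made up):

# text.split() with no argument splits on any whitespace run and
# already discards empty strings, so the filter(None, ...) is redundant.
text = "sil  dh ax   sil"
assert list(filter(None, text.split(" "))) == text.split()
print(text.split())  # ['sil', 'dh', 'ax', 'sil']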

phones_list = list(filter(None, text.split(" ")))

for phone in phones_list:
    if phone not in phones:
Collaborator:

Could you use a set to represent phones, not a list? A set is more efficient for lookups.
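A minimal sketch of that change (variable names follow the snippet above; the input list is illustrative). With a set, the membership test can even be dropped entirely:

# A set gives O(1) membership tests, versus O(n) scans for a list.
phones = set()
phones_list = ["sil", "dh", "ax", "sil"]  # illustrative input
for phone in phones_list:
    if phone not in phones:  # cheap with a set
        phones.add(phone)
# Equivalently, and simpler: phones.update(phones_list)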


with open(lexicon, "w") as f:
    for phone in sorted(phones):
        f.write(str(phone) + " " + str(phone))
Collaborator:

phone is already of type str; can we remove str() here?
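The simplified write would look like this; the illustrative values and the trailing newline are assumptions (the diff excerpt above may simply be truncated):

phones = {"sil", "dh", "ax"}  # illustrative
lexicon = "lexicon.txt"       # illustrative path
with open(lexicon, "w") as f:
    for phone in sorted(phones):
        # phone is already a str, so no conversion is needed
        f.write(phone + " " + phone + "\n")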

Collaborator (Author):

OK.

# We assume that you have installed git-lfs; if not, you can install it
# using: `sudo apt-get install git-lfs && git-lfs install`
[ ! -e $dl_dir/lm ] && mkdir -p $dl_dir/lm
git clone https://huggingface.co/luomingshuang/timit_lm $dl_dir/lm
Collaborator:

Please add a check that lm_tgmed.arpa is downloaded correctly.
Some users may forget to run git lfs install.

You can add an extra statement

( cd $dl_dir/lm && git lfs pull )
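Beyond pulling, the script could also verify that the file is real and not a stale LFS pointer. A hypothetical Python sketch of such a check (the path is an assumption; LFS pointer files are tiny text stubs that begin with a version line):

from pathlib import Path

def is_lfs_pointer(path: Path) -> bool:
    # Git LFS pointer stubs are ~130 bytes of text; a real ARPA LM is
    # far larger and does not start with the LFS version line.
    head = path.read_text(errors="ignore").lstrip()
    return head.startswith("version https://git-lfs")

lm = Path("download/lm/lm_tgmed.arpa")  # i.e. $dl_dir/lm in prepare.sh
if not lm.is_file() or is_lfs_pointer(lm):
    raise SystemExit(
        "lm_tgmed.arpa is missing or still an LFS pointer; "
        "run `git lfs install && git lfs pull` in download/lm"
    )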


if [ $stage -le 6 ] && [ $stop_stage -ge 6 ]; then
log "Stage 6: Prepare G"
# We assume you have install kaldilm, if not, please install
Collaborator:

typo: install -> installed

--read-symbol-table="data/lang_phone/words.txt" \
--disambig-symbol='#0' \
--max-order=4 \
$dl_dir/lm/lm_tgmed.arpa > data/lm/G_4_gram.fst.txt
Collaborator:

tgmed means this arpa is a trigram of medium size, I think, so it does not match --max-order=4 above. Please use a real 4-gram arpa to generate G_4_gram.fst.txt, if you need it for decoding/rescoring.

Collaborator (Author):

Will do it.

@@ -0,0 +1,97 @@
#!/usr/bin/env bash
Collaborator:

This file is shared across various recipes.

Could you make it a symlink, like what we are doing in the librispeech recipe?

@@ -0,0 +1,400 @@
FADG0_SI1279 TEST/DR4/FADG0/SI1279.WAV
Collaborator:

Can this file be generated by some scripts? If so, we don't need to check it in.

Collaborator (Author):

Em... about the {train, dev, test} split files, I haven't found scripts that generate them. In Kaldi they are kept as list files. In speechbrain they are produced by timit_prepare.py, which lists the speakers in a Python list. One option for us is to follow speechbrain: keep the speaker names in a list during data preparation. I will add it to Lhotse.
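A hypothetical sketch of generating a split file in the format shown above (FADG0_SI1279 TEST/DR4/FADG0/SI1279.WAV) from a speaker list, in the spirit of speechbrain's timit_prepare.py; the speaker set and paths here are illustrative, not the official core-test list:

from pathlib import Path

test_speakers = {"FADG0", "MDAB0"}  # placeholder; use the full list

timit_root = Path("/path/to/TIMIT")  # assumed corpus location
with open("test_split.txt", "w") as f:
    # TIMIT layout: <SPLIT>/<DIALECT>/<SPEAKER>/<UTT>.WAV
    for wav in sorted(timit_root.glob("TEST/*/*/*.WAV")):
        speaker = wav.parent.name
        if speaker in test_speakers:
            utt_id = f"{speaker}_{wav.stem}"
            f.write(f"{utt_id} {wav.relative_to(timit_root)}\n")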
