Generate batch of LMs #2249
Closed
Conversation
Better!

The output is a little messy since we do runs simultaneously, so we need to report everything nicely at the end.
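A minimal sketch of one way that end-of-run reporting could look (hypothetical helper names, not the PR's actual code): record one result per run while the runs execute, then print a single consolidated summary once everything has finished.

```python
import time
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RunResult:
    # Parameters identifying one LM run; field names mirror the CLI flags.
    arpa_order: int
    top_k: int
    arpa_prune: str
    seconds: float

results: List[RunResult] = []

def timed_run(arpa_order: int, top_k: int, arpa_prune: str,
              build_fn: Callable[[int, int, str], None]) -> None:
    """Run one LM build and record its timing instead of printing inline."""
    start = time.perf_counter()
    build_fn(arpa_order, top_k, arpa_prune)
    results.append(RunResult(arpa_order, top_k, arpa_prune,
                             time.perf_counter() - start))

def report() -> None:
    """Print one consolidated summary after all runs have finished."""
    for i, r in enumerate(results, 1):
        print(f"{i}/{len(results)}: arpa_order={r.arpa_order} top_k={r.top_k} "
              f"arpa_prune='{r.arpa_prune}' took {r.seconds:.2f}s")
```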
```
root@c53e06a85b12:/code# ./bin/run-ci-lm-gen-batch.sh
sources_lm_filepath=./data/smoke_test/vocab.txt
+ python data/lm/generate_lm_batch.py --input_txt ./data/smoke_test/vocab.txt --output_dir ./data/lm --top_k_list 30000 --arpa_order_list 4 --max_arpa_memory 85% --arpa_prune_list 0|0|2 --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --kenlm_bins /code/kenlm/build/bin/ -j 1
Converting to lowercase and counting word occurrences ...
| |# | 500 Elapsed Time: 0:00:00
Saving top 30000 words ...
Calculating word statistics ...
Your text file has 13343 words in total
It has 2559 unique words
Your top-30000 words are 100.0000 percent of all words
Your most common word "the" occurred 687 times
The least common word in your top-k is "ultraconservative" with 1 times
The first word with 2 occurrences is "mens" at place 1146
Creating ARPA file ...
=== 1/5 Counting and sorting n-grams ===
Reading /code/data/lm/4-30000-0|0|2/lower.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 13343 types 2562
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:30744 2:14627018752 3:27425658880 4:43881058304
Statistics:
1 2562 D1=0.651407 D2=1.09117 D3+=1.64993
2 9399 D1=0.831861 D2=1.21647 D3+=1.44108
3 148/12347 D1=0.937292 D2=1.53845 D3+=1.55801
4 21/12584 D1=0.967272 D2=1.7362 D3+=3
Memory estimate for binary LM:
type kB
probing 289 assuming -p 1.5
probing 355 assuming -r models -p 1.5
trie 156 without quantization
trie 107 assuming -q 8 -b 8 quantization
trie 148 assuming -a 22 array pointer compression
trie 99 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:30744 2:150384 3:2960 4:504
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:30744 2:150384 3:2960 4:504
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz VmPeak:84649108 kB VmRSS:6756 kB RSSMax:16794516 kB user:0.940238 sys:4.20439 CPU:5.14465 real:5.14232
Filtering ARPA file using vocabulary of top-k words ...
Reading ./data/lm/4-30000-0|0|2/lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Building lm.binary ...
Reading ./data/lm/4-30000-0|0|2/lm_filtered.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Quantizing
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
----------------------------------------------------------------
2022-07-04 13:32 RUNNING 1/1 FOR arpa_order=4 top_k=30000 arpa_prune='0|0|2'
LM generation 1 took: 5.443297207000796 seconds
----------------------------------------------------------------
INFO:root:Took 5.445083366999825 seconds to generate 1 language model.
```
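The `RUNNING 1/1 FOR ...` line above comes from the batch driver enumerating every combination of the three list flags, and the `4-30000-0|0|2` directory in the log matches that combination. A minimal sketch of that enumeration (hypothetical example values; the real ones are parsed from `--arpa_order_list`, `--top_k_list`, and `--arpa_prune_list`):

```python
import itertools

# Hypothetical example values for illustration only.
arpa_order_list = [3, 4]
top_k_list = [30000]
arpa_prune_list = ["0|0|1", "0|0|2"]

# Every LM configuration is one element of the Cartesian product.
combos = list(itertools.product(arpa_order_list, top_k_list, arpa_prune_list))
for i, (arpa_order, top_k, arpa_prune) in enumerate(combos, 1):
    print(f"RUNNING {i}/{len(combos)} FOR arpa_order={arpa_order} "
          f"top_k={top_k} arpa_prune='{arpa_prune}'")
```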
@HarikalarKutusu had made a double of data/lm/generate_lm.py to create multiple LMs with only one command. Unfortunately his implementation was rather lacking, so I made the following changes:

- generate_lm_batch.py
- run-ci-lm-gen.sh, added to the workflows/build-and-test.yml pipeline

So much so that you can now do the following. This will test all possible combinations of the arpa_order_list, top_k_list, and arpa_prune_list values. The created scorers will be stored in {--output_path}/{arpa_order}-{top_k}-{arpa_prune}/.
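As an illustration of that layout, a minimal sketch (hypothetical helper name, assuming only the naming scheme described above):

```python
from pathlib import Path

def scorer_dir(output_path: str, arpa_order: int, top_k: int,
               arpa_prune: str) -> Path:
    """Per-combination output directory, e.g. data/lm/4-30000-0|0|2/."""
    return Path(output_path) / f"{arpa_order}-{top_k}-{arpa_prune}"

print(scorer_dir("./data/lm", 4, 30000, "0|0|2"))  # data/lm/4-30000-0|0|2
```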
Needs libboost-program-options-dev and libboost-thread-dev installed (on Debian/Ubuntu, e.g. `sudo apt-get install libboost-program-options-dev libboost-thread-dev`), or lmplz crashes with: