A text pre-processing toolkit for character-based languages: dictionary-based tokenization (Jieba, MeCab, KyTea), subword tokenization (SentencePiece unigram/BPE), vocabulary generation, text normalization, reverse tokenization, character decomposition (based on the cjkvi-ids project and manual data), etc.
It builds on the following tools:
- Jieba: "Jieba" (Chinese for "to stutter") Chinese text segmentation, built to be the best Python Chinese word segmentation module.
- MeCab: Yet Another Part-of-Speech and Morphological Analyzer (documentation in Japanese).
- KyTea: a general toolkit developed for analyzing text, with a focus on Japanese, Chinese, and other languages requiring word or morpheme segmentation. Extra models for specific languages can be found on the KyTea site.
- Moses: the statistical machine translation system; mainly its preprocessing scripts are used (pay attention to the script paths when using them).
- SentencePiece: unsupervised text tokenizer for neural-network-based text generation (a short Python sketch after this list exercises both Jieba and SentencePiece).
- tqdm: handles progress bars.
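
For orientation, the two tokenization styles above can be driven directly from Python. The sketch below is illustrative only, not part of this repo: the corpus path input.cn and the model prefix spm_cn are placeholders. Jieba performs dictionary-based segmentation; SentencePiece trains a unigram (or BPE) subword model, encodes text into pieces, and can also reverse its own tokenization.

    # Illustrative only: driving the Jieba and SentencePiece back-ends from
    # Python. "input.cn" and "spm_cn" are placeholder names, not repo files.
    import jieba
    import sentencepiece as spm

    # Dictionary-based segmentation (Jieba).
    print(" ".join(jieba.cut("我来到北京清华大学")))

    # Subword segmentation (SentencePiece): train, encode, then decode back.
    spm.SentencePieceTrainer.train(
        input="input.cn",        # training corpus, one sentence per line
        model_prefix="spm_cn",   # writes spm_cn.model / spm_cn.vocab
        vocab_size=8000,
        model_type="unigram",    # or "bpe"
    )
    sp = spm.SentencePieceProcessor(model_file="spm_cn.model")
    pieces = sp.encode("我来到北京清华大学", out_type=str)
    print(pieces)
    print(sp.decode_pieces(pieces))  # restores the original sentence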
There are four sub-commands: tok, vocab, decomp, and reverse. Use

    python3 textprep.py -h

to get detailed usage information for each sub-command.
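
If the parser follows the usual argparse convention (an assumption; only the top-level flag is documented here), each sub-command should also print its own help:

    python3 textprep.py tok -h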
- tok: tokenize text, e.g. use 'jieba' to tokenize Chinese text. If you choose spm/bpe, the relevant subword models will first be trained by SentencePiece.

    python3 textprep.py tok -m jieba -i input.cn -o output.cn
- vocab: generate a vocabulary up to a maximum vocabulary size (a counting sketch appears after this list).

    python3 textprep.py vocab -m mecab -i input.jp -m 30000
- decomp: decompose Chinese text into ideograph sequences (see the decomposition sketch after this list). The IDS file ids.txt can be found in the cjkvi-ids sub-module; the circle/single char files can be found in the data folder.

    python3 textprep.py decomp -d ./cjkvi-ids/ids.txt -c ./data/circle_char.txt -s ./data/single_char.txt -i tok/input.cn
- reverse: reverse-transform decomposed/tokenized files back to the original text. When reversing decomposed data, the decomp file (the decomp dict) must be specified.

    python3 textprep.py reverse -i ./tok/input.cn -m bpe
- pipeline: the sub-commands read and write plain text, so they can be chained one after another; for example:
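
The intermediate path below (./tok/input.cn) is taken from the reverse example above; where tok actually writes depends on how -o is set, so treat this two-step chain, tokenize into subwords and then restore the original text, as a hedged illustration that only reuses flags documented here.

    python3 textprep.py tok -m bpe -i input.cn -o ./tok/input.cn
    python3 textprep.py reverse -i ./tok/input.cn -m bpe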
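
To make the vocab step concrete, here is a minimal sketch of frequency-based vocabulary generation, assuming whitespace-tokenized input; the file name and cut-off mirror the example above, and this illustrates the idea rather than textprep's implementation.

    # Count token frequencies and keep the most frequent max_size entries.
    from collections import Counter

    def build_vocab(path, max_size=30000):
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
        return [tok for tok, _ in counts.most_common(max_size)]

    vocab = build_vocab("input.jp", max_size=30000)
    print(len(vocab), vocab[:10])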
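
And the decomposition sketch promised above: in cjkvi-ids, ids.txt stores tab-separated lines of the form codepoint, character, one or more IDS variants (some tagged, e.g. [GTKV]), with ;;-prefixed comment lines. A minimal reading of it, not textprep's own code, looks like this; note that the same table (the "decomp dict") is what reverse needs to map IDS sequences back to characters.

    # Illustrative IDS-based decomposition; assumes the cjkvi-ids ids.txt
    # format: U+XXXX<TAB>char<TAB>IDS (extra IDS variants may follow).
    def load_ids(path):
        table = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.startswith(";;"):            # skip comment lines
                    continue
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 3:
                    table[fields[1]] = fields[2]     # first IDS variant only
        return table

    def decompose(text, table):
        # Replace each character by its IDS sequence when one is known.
        return "".join(table.get(ch, ch) for ch in text)

    table = load_ids("./cjkvi-ids/ids.txt")
    print(decompose("你好", table))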