This repository has been archived by the owner on May 29, 2020. It is now read-only.

Tutorial

jasonbaldridge edited this page Dec 31, 2012 · 10 revisions

This page covers what you need to do to start training models. To follow the instructions on this page, you must have successfully compiled and installed Chalk as described in the README.

MASC data

The Open American National Corpus has provided a set of open, unencumbered annotations for multiple domains (yay!) in the Manually Annotated Sub-Corpus (MASC), which we'll use here. MASC 3.0 is still being finalized, but MASC 3.0 RC2 is available.

Converting the data

The MASC annotations are provided in multiple XML files. Chalk provides a conversion utility that transforms the XML into the input formats needed for training sentence detection, tokenization, and named-entity recognition models (for both Chalk and OpenNLP).

$ cd /tmp/
$ mkdir masc
$ cd masc
$ wget www.anc.org/downloads/MASC-3.0.0-RC2.tgz
$ tar xzf MASC-3.0.0-RC2.tgz
$ chalk run chalk.corpora.MascTransform data /tmp/chalk-masc-data
Creating train
Success: data/spoken/court-transcript,Day3PMSession
Success: data/spoken/court-transcript,Lessig-court-transcript
Success: data/spoken/telephone,sw2025-ms98-a-trans
Success: data/spoken/telephone,sw2014-ms98-a-trans
<...more status output...>
$ cd /tmp/chalk-masc-data
$ ls
dev  test  train

The three directories contain the data splits: one for training models (train), one for evaluating their performance while tweaking them (dev), and a held-out set for evaluating them blindly (test). Each directory contains files for sentence detection, tokenization, and named-entity recognition.

$ ls train/
train-ner.txt  train-sent.txt  train-tok.txt

Check that you've got the right output by running the following command and comparing your output to the lines shown here.

$ tail -3 train/train-tok.txt 
I<SPLIT>, too checked out a few listings for under $1,000 and was absolutely shocked at the property taxes<SPLIT>.
A $1,000 house<SPLIT>, which could be fixed up into maybe a $<SPLIT>30<SPLIT>-<SPLIT>40,000 house comes with a tax bill of $<SPLIT>4<SPLIT>-<SPLIT>6<SPLIT>K per year<SPLIT>!
The taxes on my $140K house in an urban area of Mississippi are only $<SPLIT>1500<SPLIT>/<SPLIT>year<SPLIT>.
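As the lines above show, the training format marks token boundaries that are not already separated by whitespace with the literal string `<SPLIT>`. A quick way to see what the tokenized text looks like is to replace each marker with a space (a sketch using the first line above):

```shell
# <SPLIT> marks a token boundary with no surrounding whitespace in the
# original text; replacing it with a space yields space-separated tokens.
echo 'I<SPLIT>, too checked out a few listings for under $1,000 and was absolutely shocked at the property taxes<SPLIT>.' \
  | sed 's/<SPLIT>/ /g'
# I , too checked out a few listings for under $1,000 and was absolutely shocked at the property taxes .
```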

Assuming things went smoothly, you are ready to train models.

Training a sentence detector and evaluating it

TBA

Training a tokenizer and evaluating it

In the chalk-masc-data directory, do the following to train a tokenizer.

$ chalk cli TokenizerTrainer -encoding UTF-8 -lang en -data train/train-tok.txt -model eng-masc-token-tmp.bin
Indexing events using cutoff of 5

	Computing event counts...  done. 1134714 events
	Indexing...  done.
Sorting and merging events... done. Reduced 1134714 events to 197480.
Done indexing.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 197480
	    Number of Outcomes: 2
	  Number of Predicates: 43226
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-786523.8098431849	0.9463609332395652
  2:  ... loglikelihood=-151888.98955917868	0.9595113834851777
  3:  ... loglikelihood=-83761.58961438235	0.9800205161829324
<more iterations>
 99:  ... loglikelihood=-8550.341262204405	0.9977633130462831
100:  ... loglikelihood=-8526.913068011585	0.9977641943256186
Writing tokenizer model ... done (0.648s)

Wrote tokenizer model to
path: /tmp/chalk-masc-data/eng-masc-token-tmp.bin

You can evaluate the performance of the trained tokenizer against the development data as follows.

$ chalk cli TokenizerMEEvaluator -model eng-masc-token-tmp.bin -data dev/dev-tok.txt -lang en
Loading Tokenizer model ... done (0.274s)
Evaluating ... done

Precision: 0.9867372561824133
Recall: 0.9743674384327212
F-Measure: 0.9805133355221228
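The reported F-measure is the harmonic mean of precision and recall, F = 2PR/(P+R). You can sanity-check the numbers above with a one-liner:

```shell
# Recompute F-measure from the precision and recall reported by the
# evaluator; it should agree with the F-Measure line (to 4 decimal places).
awk 'BEGIN { p = 0.9867372561824133; r = 0.9743674384327212;
             printf "%.4f\n", 2 * p * r / (p + r) }'
# 0.9805
```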

Training named entity recognizers and evaluating them

TBA

Training on all data

Once you are satisfied with the development cycle, you'll probably want to train models on all the available data for use in applications. I'll make this easier in the future, but here's a straightforward way to do it, e.g. for the tokenizer (run from the chalk-masc-data directory):

$ cat */*-tok.txt > all-tok.txt
$ chalk cli TokenizerTrainer -encoding UTF-8 -lang en -data all-tok.txt -model eng-masc-token.bin
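Since the glob silently matches whatever is in those directories, it's worth confirming that the concatenated file really contains all three splits. A sketch of that check, using stand-in files in a scratch directory (the paths and contents here are made up for illustration):

```shell
# Build a throwaway directory layout mirroring train/dev/test, then verify
# that the concatenated file's line count equals the sum of the parts.
mkdir -p /tmp/split-demo/train /tmp/split-demo/dev /tmp/split-demo/test
printf 'a\nb\n' > /tmp/split-demo/train/train-tok.txt
printf 'c\n'    > /tmp/split-demo/dev/dev-tok.txt
printf 'd\ne\n' > /tmp/split-demo/test/test-tok.txt
cd /tmp/split-demo
cat */*-tok.txt > all-tok.txt
wc -l < all-tok.txt   # should print 5 (2 + 1 + 2 lines)
```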