Tutorial

This page covers some of what you need to do to start training models. To follow the instructions on this page, you must have successfully compiled and installed Chalk as described in the README.

MASC data

The Open American National Corpus has provided a set of open, unencumbered annotations for multiple domains (yay!) in the Manually Annotated Sub-Corpus (MASC), which we'll use here. They are still finalizing MASC 3.0, but MASC 3.0 RC2 is available.

Note: You may find there are things you wish were different about the MASC annotations (choices about tokenization, etc). They love to get feedback, so be sure to let them know by writing to [email protected].

Converting the data

The MASC annotations are provided in multiple XML files. Chalk provides a conversion utility that transforms the XML into the input formats needed for training sentence detection, tokenizer, and named-entity recognition models (for both Chalk and OpenNLP).

$ cd /tmp/
$ mkdir masc
$ cd masc
$ wget www.anc.org/downloads/MASC-3.0.0-RC2.tgz
$ tar xzf MASC-3.0.0-RC2.tgz
$ chalk run chalk.corpora.MascTransform data /tmp/chalk-masc-data
Creating train
Success: data/written/ficlets,1401
Success: data/written/ficlets,1403
Success: data/written/ficlets,1402
Failure: data/written/non-fiction,CUP1
Success: data/written/non-fiction,rybczynski-ch3
<...more status output...>
$ cd /tmp/chalk-masc-data
$ ls
dev  test  train

The three directories contain data splits for training models (train), evaluating their performance while tweaking them (dev), and a held-out test set for evaluating them blindly (test). Each directory contains files for sentence detection, tokenization, and named entity recognition.
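
As a quick sanity check on the conversion, the following Python sketch lists which of the expected files are missing. It is not part of Chalk: the helper names are made up for illustration, and the file names for dev and test are assumed to follow the same split-task naming pattern as the train files.

```python
from pathlib import Path

SPLITS = ("train", "dev", "test")
TASKS = ("sent", "tok", "ner")

def expected_files(root):
    """Build the paths <root>/<split>/<split>-<task>.txt (assumed naming)."""
    root = Path(root)
    return [root / split / f"{split}-{task}.txt"
            for split in SPLITS for task in TASKS]

def missing_files(root):
    """Return the expected files that do not exist under root."""
    return [path for path in expected_files(root) if not path.is_file()]

if __name__ == "__main__":
    # Prints nothing if all nine files are present.
    for path in missing_files("/tmp/chalk-masc-data"):
        print("missing:", path)
```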

$ ls train/
train-ner.txt  train-sent.txt  train-tok.txt

Check that you've got the right output by running the following command and comparing your output to the lines shown here.

$ tail -3 train/train-tok.txt 
I<SPLIT>, too checked out a few listings for under $1,000 and was absolutely shocked at the property taxes<SPLIT>.
A $1,000 house<SPLIT>, which could be fixed up into maybe a $<SPLIT>30<SPLIT>-<SPLIT>40,000 house comes with a tax bill of $<SPLIT>4<SPLIT>-<SPLIT>6<SPLIT>K per year<SPLIT>!
The taxes on my $140K house in an urban area of Mississippi are only $<SPLIT>1500<SPLIT>/<SPLIT>year<SPLIT>.
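
In this format each line is one sentence, and tokens are separated either by whitespace or, where the text has no whitespace between them, by the <SPLIT> marker. The Python sketch below (a hypothetical helper, not part of Chalk) recovers the token list from one such line, which can be handy for eyeballing the annotations.

```python
def split_annotated_line(line, marker="<SPLIT>"):
    """Return the tokens encoded by one line of tokenizer training data.

    Token boundaries are marked by whitespace or by the marker string,
    so replacing the marker with a space and splitting recovers them.
    """
    return line.replace(marker, " ").split()

example = ("The taxes on my $140K house in an urban area of Mississippi "
           "are only $<SPLIT>1500<SPLIT>/<SPLIT>year<SPLIT>.")
tokens = split_annotated_line(example)
print(tokens[-6:])  # ['only', '$', '1500', '/', 'year', '.']
```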

Assuming things went smoothly, you are ready to train models. All of the following instructions assume you are in the chalk-masc-data directory.

Example text

We need an example text, so let's use one about Aravind Joshi's ACL lifetime achievement award. (Note: I've made a few modifications and edits to make it a better example.)

The Association for Computational Linguistics is proud to present its first Lifetime Achievement Award to Prof. Aravind Joshi of the University of Pennsylvania. Aravind Joshi was born in 1929 in Pune, India, where he completed his secondary education as well as his first degree in Mechanical and Electrical Engineering, the latter in 1950. He worked as a research assistant in Linguistics at Penn from 1958-60, while completing his Ph.D. in Electrical Engineering, in 1960. Joshi's work and the work of his Penn colleagues at the frontiers of Cognitive Science was rewarded in 1991 by the establishment of a National Science Foundation Science and Technology Center for Research in Cognitive Science, which Aravind Joshi co-directed until 2001. Dr. Joshi has supervised thirty-six Ph.D. theses to-date, on topics including information and coding theory, and also pure linguistics.

Joshi rocks.

Run the following commands to get things set up with this text.

$ cd /tmp/chalk-masc-data
$ echo "The Association for Computational Linguistics is proud to present its first Lifetime Achievement Award to Prof. Aravind Joshi of the University of Pennsylvania. Aravind Joshi was born in 1929 in Pune, India, where he completed his secondary education as well as his first degree in Mechanical and Electrical Engineering, the latter in 1950. He worked as a research assistant in Linguistics at Penn from 1958-60, while completing his Ph.D. in Electrical Engineering, in 1960. Joshi's work and the work of his Penn colleagues at the frontiers of Cognitive Science was rewarded in 1991 by the establishment of a National Science Foundation Science and Technology Center for Research in Cognitive Science, which Aravind Joshi co-directed until 2001. Dr. Joshi has supervised thirty-six Ph.D. theses to-date, on topics including information and coding theory, and also pure linguistics." > joshi.txt

Training a sentence detector and evaluating it

Do the following to train a sentence detector.

$ chalk cli SentenceDetectorTrainer -encoding UTF-8 -lang en -data train/train-sent.txt -model eng-masc-sent-tmp.bin
Indexing events using cutoff of 5

	Computing event counts...  done. 24097 events
	Indexing...  done.
Sorting and merging events... done. Reduced 24097 events to 18744.
Done indexing.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 18744
	    Number of Outcomes: 2
	  Number of Predicates: 2141
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-16702.767609955576	0.24567373531974934
  2:  ... loglikelihood=-11107.7261789502	0.7935012657177242
  3:  ... loglikelihood=-9800.58887752625	0.8331327551147446
  4:  ... loglikelihood=-9119.485892487737	0.8484458646304519
  5:  ... loglikelihood=-8668.888020054215	0.859111092667137
<more iterations>
 96:  ... loglikelihood=-5931.843414354149	0.9025604847076399
 97:  ... loglikelihood=-5926.309056645842	0.9036809561356185
 98:  ... loglikelihood=-5920.848201519377	0.9037639540191725
 99:  ... loglikelihood=-5915.459295879804	0.9038054529609495
100:  ... loglikelihood=-5910.1408344018955	0.9038469519027265
Writing sentence detector model ... done (0.094s)

Wrote sentence detector model to
path: /tmp/chalk-masc-data/eng-masc-sent-tmp.bin

Now run it on the example text.

$ chalk cli SentenceDetector eng-masc-sent-tmp.bin < joshi.txt 
Loading Sentence Detector model ... done (0.052s)
The Association for Computational Linguistics is proud to present its first Lifetime Achievement Award to Prof.
Aravind Joshi of the University of Pennsylvania.
Aravind Joshi was born in 1929 in Pune, India, where he completed his secondary education as well as his first degree in Mechanical and Electrical Engineering, the latter in 1950.
He worked as a research assistant in Linguistics at Penn from 1958-60, while completing his Ph.D. in Electrical Engineering, in 1960.
Joshi's work and the work of his Penn colleagues at the frontiers of Cognitive Science was rewarded in 1991 by the establishment of a National Science Foundation Science and Technology Center for Research in Cognitive Science, which Aravind Joshi co-directed until 2001.
Dr. Joshi has supervised thirty-six Ph.D. theses to-date, on topics including information and coding theory, and also pure linguistics.



Average: 2000.0 sent/s 
Total: 6 sent
Runtime: 0.0030s

Overall, things look fine except for the split on 'Prof.'. There is only one training example of 'Prof.' in train-sent.txt, so the model has little to go on and treats the period as a sentence-ending period rather than as part of an abbreviation.
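
To see how much evidence the model has for a given abbreviation, you can count how often it appears in the middle of a training sentence versus at the end of one. The Python sketch below is a hypothetical helper, not part of Chalk, and it assumes one sentence per line, as in the OpenNLP sentence detector training format.

```python
def count_abbreviation(lines, abbrev="Prof."):
    """Count occurrences of abbrev inside a line versus as its last word."""
    inside = at_end = 0
    for line in lines:
        words = line.split()
        for i, word in enumerate(words):
            if word == abbrev:
                if i == len(words) - 1:
                    at_end += 1   # looks like a sentence-final period
                else:
                    inside += 1   # period belongs to the abbreviation
    return inside, at_end

sample = [
    "The award went to Prof. Joshi.",
    "He spoke with the Prof.",
    "Nothing relevant here.",
]
print(count_abbreviation(sample))  # (1, 1)
```

To run it on the real data, pass an open file instead of the sample list, e.g. open("train/train-sent.txt", encoding="utf-8").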

Evaluate the model.

$ chalk cli SentenceDetectorEvaluator -model eng-masc-sent-tmp.bin -data dev/dev-sent.txt -lang en
Loading Sentence Detector model ... done (0.050s)
Evaluating ... done

Precision: 0.7094477998274374
Recall: 0.7389350707706134
F-Measure: 0.7238912732474964

This performance is pretty bad. Looking at the data, there are probably some changes that need to be made to the MASC conversion. For example, dev/dev-sent.txt includes lines like this:

Vocal Impulses g2 (la pianista irlandesa) 2008-11-09T21:46:33Z ID: 45759 Prequels: 45750  Sequels: 458
73

and

"Not now. It's too confusing to get into now."
?
I hated how, no matter how rude and angry I could be, Karon would always stay so calm.
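
One rough way to find lines like the lone "?" above is to flag every line that contains no letters at all. The Python sketch below is only a heuristic written for illustration (it is not part of Chalk), and it will not catch metadata lines that do contain words.

```python
def suspicious_lines(lines):
    """Yield (line_number, text) for non-empty lines with no letters."""
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if text and not any(ch.isalpha() for ch in text):
            yield number, text

sample = [
    '"Not now. It\'s too confusing to get into now."',
    "?",
    "I hated how, no matter how rude and angry I could be, "
    "Karon would always stay so calm.",
]
for number, text in suspicious_lines(sample):
    print(number, repr(text))  # 2 '?'
```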

Training a tokenizer and evaluating it

Do the following to train a tokenizer. (I'll suppress the output from here on.)

$ chalk cli TokenizerTrainer -encoding UTF-8 -lang en -data train/train-tok.txt -model eng-masc-token-tmp.bin

To test the tokenizer on the example text, we need to pass it through the sentence detector first and then on to the tokenizer.

$ chalk cli SentenceDetector eng-masc-sent-tmp.bin < joshi.txt  | chalk cli TokenizerME eng-masc-token-tmp.bin 
Loading Sentence Detector model ... Loading Tokenizer model ... done (0.062s)


Average: 2000.0 sent/s 
Total: 6 sent
Runtime: 0.0030s
done (0.294s)
The Association for Computational Linguistics is proud to present its first Lifetime Achievement Award to Prof .
Aravind Joshi of the University of Pennsylvania .
Aravind Joshi was born in 1929 in Pune , India , where he completed his secondary education as well as his first degree in Mechanical and Electrical Engineering , the latter in 1950 .
He worked as a research assistant in Linguistics at Penn from 1958- 60 , while completing his Ph .D. in Electrical Engineering , in 1960 .
Joshi 's work and the work of his Penn colleagues at the frontiers of Cognitive Science was rewarded in 1991 by the establishment of a National Science Foundation Science and Technology Center for Research in Cognitive Science , which Aravind Joshi co-directed until 2001 .
Dr . Joshi has supervised thirty-six Ph .D. theses to-date , on topics including information and coding theory , and also pure linguistics .



Average: 67.3 sent/s 
Total: 7 sent
Runtime: 0.104s

There are definitely some odd tokenizations there -- but the model is doing what it is supposed to do given the annotation. For example, MASC has two tokens "Dr" and "." for "Dr.". (I've contacted the MASC creators about this, since, e.g., the Penn Treebank tends to have tokenization of "Dr." and "Ph." "D.", etc.)

You can evaluate the performance of the trained tokenizer against the development data as follows.

$ chalk cli TokenizerMEEvaluator -model eng-masc-token-tmp.bin -data dev/dev-tok.txt -lang en
Loading Tokenizer model ... done (0.280s)
Evaluating ... done

Precision: 0.9870173026240403
Recall: 0.9801661356395084
F-Measure: 0.9835797887524979

Training named entity recognizers and evaluating them

The MASC conversion utility in Chalk produces CONLL 2003 formatted annotations, e.g.:

Isabella NNP NNP B-PER
Shae NNP NNP I-PER
, , , O
a DT DT O
girl NN NN O
from IN IN O
Mebane NNP NNP B-LOC
North NNP NNP B-LOC
Carolina NNP NNP I-LOC
. . . O

However, Chalk (currently) needs NER training data in OpenNLP format, so we first need to convert it. The OpenNLP format looks like this:

<START:person> Isabella Shae <END> , a girl from <START:location> Mebane <END> <START:location> North Carolina <END> .

To convert the data, use the format converter.

$ chalk cli TokenNameFinderConverter conll03 -lang en -encoding UTF-8 -types person,location,organization -data train/train-ner.txt > train/train-ner-opennlp.txt
$ chalk cli TokenNameFinderConverter conll03 -lang en -encoding UTF-8 -types person,location,organization -data dev/dev-ner.txt > dev/dev-ner-opennlp.txt
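
To make the relationship between the two formats concrete, here is a simplified Python sketch of the mapping. It is only an illustration, not the converter's actual implementation: the function name and the tag-to-type table are assumptions based on the example above, and it ignores details such as the -types filter and document boundaries.

```python
# Assumed mapping from CoNLL tag suffixes to OpenNLP type names.
TYPE_NAMES = {"PER": "person", "LOC": "location", "ORG": "organization"}

def conll_to_opennlp(rows):
    """Turn the CoNLL-style rows of one sentence into one OpenNLP line."""
    out = []
    open_span = False
    for row in rows:
        fields = row.split()
        word, tag = fields[0], fields[-1]
        # A B- tag always starts a span; an I- tag does so only if none is open.
        starts_span = tag.startswith("B-") or (tag.startswith("I-") and not open_span)
        if open_span and (tag == "O" or starts_span):
            out.append("<END>")
            open_span = False
        if starts_span:
            kind = tag[2:]
            out.append("<START:%s>" % TYPE_NAMES.get(kind, kind.lower()))
            open_span = True
        out.append(word)
    if open_span:
        out.append("<END>")
    return " ".join(out)

rows = [
    "Isabella NNP NNP B-PER", "Shae NNP NNP I-PER", ", , , O",
    "a DT DT O", "girl NN NN O", "from IN IN O",
    "Mebane NNP NNP B-LOC", "North NNP NNP B-LOC",
    "Carolina NNP NNP I-LOC", ". . . O",
]
print(conll_to_opennlp(rows))
```

On the sample rows this prints the same line as the OpenNLP example above.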

Now train the model.

$ chalk cli TokenNameFinderTrainer -lang en -encoding UTF-8 -data train/train-ner-opennlp.txt -model eng-masc-ner-tmp.bin

Run it on the example text. (I have removed the timing information from the output given here.)

$ chalk cli SentenceDetector eng-masc-sent-tmp.bin < joshi.txt | chalk cli TokenizerME eng-masc-token-tmp.bin | chalk cli TokenNameFinder eng-masc-ner-tmp.bin 
Loading Sentence Detector model ... Loading Tokenizer model ... Loading Token Name Finder model ... done (0.066s)

The Association for Computational Linguistics is proud to present its first Lifetime Achievement Award to Prof .
Aravind Joshi of the <START:organization> University of Pennsylvania <END> .
Aravind Joshi was born in 1929 in <START:location> Pune <END> , <START:location> India <END> , where he completed his secondary education as well as his first degree in Mechanical and Electrical Engineering , the latter in 1950 .
He worked as a research assistant in Linguistics at Penn from 1958- 60 , while completing his Ph .D. in Electrical Engineering , in 1960 .
Joshi 's work and the work of his Penn colleagues at the frontiers of Cognitive Science was rewarded in 1991 by the establishment of a <START:organization> National Science Foundation Science <END> and <START:organization> Technology Center <END> for Research in Cognitive Science , which Aravind Joshi co-directed until 2001 .
Dr . Joshi has supervised thirty-six Ph .D. theses to-date , on topics including information and coding theory , and also pure linguistics .

Clearly some things can be improved! This will require some changes to the annotations and perhaps some modifications to the features, etc.

Evaluate the model.

$ chalk cli TokenNameFinderEvaluator -lang en -encoding UTF-8 -model eng-masc-ner-tmp.bin -data dev/dev-ner-opennlp.txt 
Precision: 0.6956187548039969
Recall: 0.3271872740419378
F-Measure: 0.44504548807474803

More confirmation that there is more work to do. (Which could include checking/debugging the MASC transformation code to make sure it isn't messing up.)
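
The reported F-Measure values are consistent with the usual balanced F-score, the harmonic mean of precision and recall, F = 2PR / (P + R). The short Python check below (an illustration, not Chalk code) recomputes the named-entity score from the precision and recall printed above, and shows how the low recall drags the combined score down.

```python
def f_measure(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision = 0.6956187548039969
recall = 0.3271872740419378
print(round(f_measure(precision, recall), 4))  # 0.445
```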

Training on all data

Once you are satisfied with the development cycle, you probably want to train models on all the available data for use in applications. I'll make this easier in the future, but here's a straightforward way to do it, e.g. for the tokenizer:

$ cat */*-tok.txt > all-tok.txt
$ chalk cli TokenizerTrainer -encoding UTF-8 -lang en -data all-tok.txt -model eng-masc-token.bin
