This module includes NLP Corpus readers that reads into datastructures
provided by the cogcomp-core-utilities
package.
- PennTreebank Reader
- Propbank Reader
- Treebank Chunk Reader
- Ontonotes Reader
- PennTreebank POS Reader
- Nombank Reader
- ACE Reader
- Ontonotes 5.0 Named Entity Reader
- Ontonotes 5.0 POS Reader
- TAC KBP reader
- MASC Lemma, POS, Shallow Parse, NER Reader
- PARC 3.0 Reader
If you use this code in your research, please provide the URL for this github repository in the relevant publications. Thank you for citing us if you use us in your work!
The set of available readers for the Onotonotes 5.0 corpus is still evolving, but there currently exists two utilities. One converts the named entities found in the Ontonotes 5.0 corpus to CoNLL column format, the other converts the tree bank parse data to JSON format. See edu.illinois.cs.cogcomp.nlp.corpusreaders.ontonotes.OntonotesNamedEntityReader for details on the named entity reader class (that can generate column format), or see edu.illinois.cs.cogcomp.nlp.corpusreaders.ontonotes.OntonotesTreebankReader for details on generating JSON treebank data.
Reader for TAC 2016 KBP corpus. This data was originally created for the TAC KBP EDL track and can be obtained with a license here: http://nlp.cs.rpi.edu/kbp/2017/index.html
This reader assumes that the data is structured like PATH-TO-DATA/LANG/TYPE where LANG is one of {cmn,eng,spa} and TYPE is either nw or df, for newswire and discussion forum data, respectively.
mascReader.MascXCESReader is a reader for the entire MASC dataset from Open American National Corpus. It processes Lemma, POS, Sentence, Shallow Parse, NER CoNLL (LOC, ORG, PER), and NER Ontonotes (DATE, LOCATION, ORGANIZATION, PERSON) annotations.
The reader takes XCES XML format as input. Please check MascXCESReader.java for details on how to generate the input files.
IMPORTANT: Some documents in the corpus can contain irregular sentence annotations, such as nested sentences (as in written/fiction/cable_spool_fort.xml), and tokens not covered by any sentence (as in written/blog/Acephalous-Cant-believe.xml). Please check MascXCESReader.java for details on how they are being handled.
corpusutils.CreateTrainDevTestSplit is a utility that divides a corpus in to training, development, and test subsets. In addition to respecting fractions specified by the user, it stratifies the subsets to balance characteristics specified by the user, preferring to balance the least frequent. This was developed to support use cases where the corpus should be divided in a way that prevents certain kinds of overlap between training and test data, to avoid "memorization" by a learner applied to the task (so, for example, making sure all examples from a given document are assigned to the same subset). For cases where the annotation can be read into TextAnnotation objects, use a corpusutils.TextAnnotationLabelCounter; otherwise, use a corpusUtils.ListExampleLabelCounter and generate a flat boolean-featured representation using the characteristics you care about balancing across dev/train/test, and where each ListExample represents the non-divisible unit of assignment (e.g. a document, or a complete game, etc.).