This software bundles some basic preprocessing steps that a lot of NLP
applications need, with the goal of making them run locally. Some
components have significant memory requirements, but given a machine
with sufficient memory, you can instantiate an AnnotatorService
object that provides plain text tokenization, Part-of-Speech tagging,
chunking, Named Entity tagging, lemmatization, dependency and
constituency parsing, and (verb) semantic role labeling. You can also
use just a subset of these by changing the configuration file
that the pipeline uses.
By default, the cogcomp-nlp-pipeline will cache its outputs in a local directory, so that if you process overlapping data sets (or process the same data set multiple times), it will use a cached copy of the system outputs and run much faster.
The pipeline has the following annotators. To understand the annotations,
please refer to the descriptions of the individual packages at the URLs
provided. These annotations are stored as Views in a single TextAnnotation
data structure.
The CogComp NLP Pipeline provides a suite of state-of-the-art Natural Language Processing tools of varying complexity. Some have specific prerequisites that must be present if you want to run them. The memory figure listed for each component is the expected MAXIMUM run-time memory that the component requires by itself. Note that the pipeline runs only one copy of each active component, so that, for example, a single Chunker component fulfils the needs of all the other components that depend on it.
- Lemmatizer: <1G memory, no dependencies.
- Part-of-Speech tagger: <1G, no dependencies.
- Chunker: <1G, requires Part-of-Speech tagger.
- Named Entity Recognizer (CoNLL): 4G, no dependencies.
- Named Entity Recognizer (OntoNotes): 6G, no dependencies.
- Constituency Parser (Stanford): 1G, no dependencies.
- Dependency Parser (Stanford): shares resources of Constituency parser so no individual footprint; no dependencies.
- Dependency Parser (CogComp): <1G, requires Part-of-Speech tagger, Chunker.
- Verb Semantic Role Labeler: ~40G (see issue656), requires Lemmatizer, Part-of-Speech, Shallow Parsing, Named Entity Recognizer (CoNLL), Constituency Parser.
- Noun Semantic Role Labeler: 1G, requires Lemmatizer, Part-of-Speech, Named Entity Recognizer (CoNLL), Constituency Parser.
- Quantifier: <2G, requires Part-of-Speech.
- Preposition SRL: <2G.
- Comma-SRL: <1G, requires Lemmatizer, Part-of-Speech, Named Entity Recognizer (CoNLL), Constituency Parser.
Note that individual CogComp NLP tools may depend on other tools for inputs, and will not work unless those components are also active. If you try to run the system with an invalid configuration, it will print a warning about the missing components.
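The prerequisite rule described above can be sketched as a small lookup. This is only an illustration of the idea, not the pipeline's own validation code: the class name, the method name, and the component labels are hypothetical, and the table covers only a subset of the components listed above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative prerequisite check; NOT the pipeline's own validation logic. */
final class PrerequisiteCheck {
    // Direct prerequisites taken from the component list above (subset).
    // The labels are hypothetical and are not real configuration keys.
    static final Map<String, List<String>> REQUIRES = new LinkedHashMap<>();
    static {
        REQUIRES.put("LEMMA", List.of());
        REQUIRES.put("POS", List.of());
        REQUIRES.put("CHUNKER", List.of("POS"));
        REQUIRES.put("NER_CONLL", List.of());
        REQUIRES.put("CONSTITUENCY_PARSER", List.of());
        REQUIRES.put("DEPENDENCY_COGCOMP", List.of("POS", "CHUNKER"));
        REQUIRES.put("SRL_NOUN", List.of("LEMMA", "POS", "NER_CONLL", "CONSTITUENCY_PARSER"));
        REQUIRES.put("QUANTIFIER", List.of("POS"));
    }

    /** Returns the direct prerequisites of the active components that are not active themselves. */
    static Set<String> missing(Set<String> active) {
        Set<String> missing = new TreeSet<>();
        for (String component : active) {
            for (String needed : REQUIRES.getOrDefault(component, List.of())) {
                if (!active.contains(needed)) {
                    missing.add(needed);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // CHUNKER needs POS, which is not active here, so this prints [POS].
        System.out.println(missing(Set.of("CHUNKER", "LEMMA")));
    }
}
```

Running a check like this before choosing which components to activate makes it easier to see why a given configuration triggers the missing-component warning.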
The pipeline module is organized thus:
config/ : configuration files
scripts/ : scripts to allow command-line test of the CogComp NLP Pipeline
src/ : source code for the CogComp NLP Pipeline
test/ : test files used for the command line test of the CogComp NLP Pipeline
See the section "Running the CogComp NLP Pipeline" for details on running the pipeline.
This distribution contains all the dependencies needed to run the CogComp NLP Pipeline. This includes configuration files for some individual components; scripts to process plain text files from the command line; and .jar files for the libraries used by the pipeline and its components.
This software has been developed to allow some of our more complex tools to be run completely within a single JVM, either programmatically or from the command line, instead of in tandem with the CCG NLP Curator.
The cogcomp-nlp-pipeline package was designed to be used either
programmatically -- inline in your Java code -- or from the command line,
using only those components you need for a given task.
Currently, the pipeline works only for English plain text. You will need to remove XML/HTML mark-up, as well as formatting like bulleted lists if you want well-formed output. (The pipeline may generate output for such texts, but it is not guaranteed that the different tools will succeed in producing mutually consistent output.)
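As a rough illustration of the clean-up step mentioned above, the sketch below strips tag-like spans and leading bullet markers with regular expressions before the text is handed to the pipeline. The class and method names are hypothetical, and the approach is deliberately naive: it does not decode entities, and it would also remove a span such as "< y and y >" in ordinary prose. A real HTML/XML parser is a safer choice for anything beyond simple input.

```java
import java.util.regex.Pattern;

/** Naive markup clean-up sketch; a real HTML/XML parser is more robust. */
final class MarkupStripper {
    // Anything that looks like a tag: '<', then one or more non-'>' characters, then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    // A bullet marker ('-', '*' or U+2022) at the start of a line, followed by whitespace.
    private static final Pattern BULLET = Pattern.compile("(?m)^\\s*[-*\\u2022]\\s+");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    static String strip(String raw) {
        // Replace tags with a space so that words on either side of a tag do not run together.
        String noTags = TAG.matcher(raw).replaceAll(" ");
        String noBullets = BULLET.matcher(noTags).replaceAll("");
        // Collapse all runs of whitespace (including newlines) into single spaces.
        return SPACES.matcher(noBullets).replaceAll(" ").trim();
    }
}
```

Note that this normalizes whitespace, so character offsets in the cleaned text no longer match the original file.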
One important note: if you wish to use your own tokenization, you should
implement a class that follows the Tokenizer interface from
cogcomp-core-utilities, and use it as an argument to a
TokenizerTextAnnotationBuilder (also from cogcomp-core-utilities).
NOTE: These commands assume you ran mvn install and mvn dependency:copy-dependencies,
which create the pipeline binary in target/ and copy all dependency jars into
target/dependency.
Two sample scripts are provided to test that the pipeline works after you have downloaded
it. scripts/runPipelineOnDataset.sh takes as arguments a configuration file
and a text file; it processes the text file according to the
properties set in the config file, and writes output to STDOUT.
scripts/testPreprocessor.sh is a self-contained script that calls
runPipelineOnDataset.sh with fixed arguments and compares the output to
some reference output. If the new output and reference output are
different, the script prints an error message and indicates the
differences.
To process a set of plain text files in one directory and generate a corresponding set of annotated files in JSON format in a second directory, run the command:
Running the test:
scripts/testPipeline.sh
Running your own text to get a visual sense of what IllinoisPreprocessor is doing:
scripts/runPipelineOnDataset.sh config/pipelineConfig.txt [yourInputFile] [yourOutputFile]
First, add the pipeline as a dependency of your project. If you use Maven, add the following dependency and repository definitions.
<dependencies>
    <dependency>
        <groupId>edu.illinois.cs.cogcomp</groupId>
        <artifactId>illinois-nlp-pipeline</artifactId>
        <version>#VERSION</version>
    </dependency>
</dependencies>
<repositories>
    <repository>
        <id>CogcompSoftware</id>
        <name>CogcompSoftware</name>
        <url>http://cogcomp.org/m2repo/</url>
    </repository>
</repositories>
where #VERSION is the version included in the pom.xml file.
The main class is PipelineFactory, in the package
edu.illinois.cs.cogcomp.pipeline.main under src/main/java. For an
example showing how the PipelineFactory and BasicAnnotatorService (the
class it instantiates, which is the pipeline itself) can be used, look at the
CachingPipelineTest class under src/test/resources/, in
edu.illinois.cs.cogcomp.pipeline.main.
To process text input, use the 'createAnnotatedTextAnnotation()' method:
import edu.illinois.cs.cogcomp.annotation.AnnotatorService;
import edu.illinois.cs.cogcomp.core.datastructures.textannotation.TextAnnotation;
import edu.illinois.cs.cogcomp.pipeline.main.PipelineFactory;
String docId = "APW-20140101.3018"; // arbitrary string identifier
String textId = "body"; // arbitrary string identifier
String text = ...; // contains plain text to be annotated
AnnotatorService pipeline = PipelineFactory.buildPipeline();
TextAnnotation ta = pipeline.createAnnotatedTextAnnotation( docId, textId, text );
The output of this will be a tokenized document. If you want to add more views,
you have to specify them either as arguments of buildPipeline or
in a configurator (both are explained in the next subsections).
This method takes as its argument a String variable containing the
text you want to process. This String should not be too long --
depending on the annotators you plan to use, a reasonable upper limit
is 1,000 words (fewer if you use resource-intensive annotators like
Verb or Noun SRL).
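One way to respect the length limit above is to split a long document into pieces before annotating each piece separately. The sketch below is a minimal illustration under simplifying assumptions: the class and method names are hypothetical, words are counted by splitting on whitespace, and sentence boundaries are ignored. In practice you would probably prefer to cut at paragraph or sentence boundaries so that no sentence is split across two pieces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Splits text into pieces of bounded word count; illustrative only. */
final class TextChunker {
    /** Returns pieces of at most maxWords whitespace-delimited words each. */
    static List<String> split(String text, int maxWords) {
        if (maxWords <= 0) {
            throw new IllegalArgumentException("maxWords must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks; // nothing to annotate
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start += maxWords) {
            int end = Math.min(start + maxWords, words.length);
            // Re-join with single spaces; the original whitespace is not preserved.
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
        }
        return chunks;
    }
}
```

Each returned piece can then be passed to createAnnotatedTextAnnotation() with its own text identifier, for example split(text, 1000) for the limit suggested above.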
The method returns a TextAnnotation data structure (see the
cogcomp-core-utilities package for details), which contains
a View corresponding to each annotation source. Each View contains
a set of Constituents and Relations representing the annotator output.
Access views and constituents via:
String viewName = ViewNames.POS; // example using ViewNames class constants
View view = ta.getView(viewName);
List<Constituent> constituents = view.getConstituents();
See the documentation for individual components (links in section 1 above) for more information about the annotations and their representation as Constituents and Relations.
The previous usage adds only the basic annotations (e.g. tokenization and sentence boundaries).
To add higher-level annotations, pass their view names as arguments to the buildPipeline
function:
import edu.illinois.cs.cogcomp.annotation.AnnotatorService;
import edu.illinois.cs.cogcomp.core.datastructures.textannotation.TextAnnotation;
import edu.illinois.cs.cogcomp.core.datastructures.ViewNames;
import edu.illinois.cs.cogcomp.pipeline.main.PipelineFactory;
String docId = "APW-20140101.3018"; // arbitrary string identifier
String textId = "body"; // arbitrary string identifier
String text = ...; // contains plain text to be annotated
AnnotatorService pipeline = PipelineFactory.buildPipeline(ViewNames.POS, ViewNames.SRL_VERB);
TextAnnotation ta = pipeline.createAnnotatedTextAnnotation( docId, textId, text );
This will include the two views ViewNames.POS and ViewNames.SRL_VERB
in the output, in addition to the default views.
If you want to change specific behaviors, such as activating or deactivating specific components, you need to write a custom config file and use it as in the example below.
This is mostly useful when you have limited resources. For example, SRL and the parsers tend to take more time and memory, so you can turn them off if you don't need them.
import edu.illinois.cs.cogcomp.annotation.AnnotatorService;
import edu.illinois.cs.cogcomp.core.utilities.configuration.ResourceManager;
import edu.illinois.cs.cogcomp.pipeline.main.PipelineFactory;
// An example of "[PATH_TO_YOUR_CONFIG_FILE]" is "config/pipeline-config.properties"
ResourceManager userConfig = new ResourceManager("[PATH_TO_YOUR_CONFIG_FILE]");
AnnotatorService pipeline = PipelineFactory.buildPipeline(userConfig);
The config file is composed of lines of [KEY]\t[VAL] pairs,
where each pair specifies a property name and its value.
The [KEY]s are property names specified in PipelineConfigurator.
The config file is parsed by our ResourceManager; please see the
documentation of core-utilities to learn more.
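To make the [KEY]\t[VAL] format concrete, the sketch below parses two such lines with java.util.Properties, which accepts a tab as a key/value separator. This only illustrates the file format and makes no claim about how ResourceManager parses it internally. The class name is hypothetical, and the keys usePos and useSrlVerb are placeholders chosen for illustration; look up the real property names in PipelineConfigurator.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrates the KEY<TAB>VALUE config format using java.util.Properties. */
final class ConfigSketch {
    /** Parses config text where each line is a key, a tab, and a value. */
    static Properties parse(String contents) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(contents));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked if it does.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical keys, for illustration only.
        String contents = "usePos\ttrue\n" + "useSrlVerb\tfalse\n";
        Properties props = parse(contents);
        System.out.println(props.getProperty("usePos"));     // prints true
        System.out.println(props.getProperty("useSrlVerb")); // prints false
    }
}
```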
You can refer to the default config file if you need an example. Most property names are self-explanatory; for those that are not, please see their specific usages.
Note that individual annotators have their own configuration options -- see the documentation for individual components for details.
This project uses slf4j's log4j libraries. You can change the settings by creating a log4j.properties file and adding the directory containing that file to the classpath.
A convenient way to use the pipeline is to run the server (which includes all the annotators) on a machine with a large amount of memory and to send requests to it from clients. This section first describes the server and then the clients.
The server supports POST and GET requests to obtain annotations for a given text with the desired views. To run the web server with default settings (port = 8080), run:
pipeline/scripts/runWebserver.sh
The following arguments are supported:
usage: pipeline/scripts/runWebserver.sh [-h] [--port PORT] [--rate HOURS]
optional arguments:
-h, --help show this help message and exit
--port PORT, -P PORT Port to run the webserver.
--rate HOUR, -L HOUR Max number of queries per day. If empty, there won't be any limit.
Here are the available APIs:
API | Address | Supported request type | Parameters | Example |
---|---|---|---|---|
Annotating text | /annotate | POST/GET | text: the target raw text; views: views to be added, separated by commas | /annotate?text="This is sample text"&views=POS,NER_CONLL |
Getting existing views | /viewNames | POST/GET | N/A | /viewNames |
Note that the current web server is very basic. It does not support parallel processing within a single request, nor across multiple requests.
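A client needs to URL-encode the text before placing it in a GET request to /annotate. The sketch below only builds the request URL and does not send it. The class and method names are hypothetical, the base URL assumes a server on localhost at the default port 8080, and it assumes the server decodes standard form encoding (a space sent as "+").

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Builds a GET URL for the /annotate endpoint; illustrative only. */
final class AnnotateUrl {
    static String build(String baseUrl, String text, List<String> views) {
        List<String> encodedViews = new ArrayList<>();
        for (String view : views) {
            encodedViews.add(URLEncoder.encode(view, StandardCharsets.UTF_8));
        }
        // View names are joined with literal commas, as in the example in the table above.
        return baseUrl + "/annotate?text=" + URLEncoder.encode(text, StandardCharsets.UTF_8)
                + "&views=" + String.join(",", encodedViews);
    }

    public static void main(String[] args) {
        // Assumed host and port; adjust them to wherever the web server is running.
        System.out.println(build("http://localhost:8080", "This is sample text",
                List.of("POS", "NER_CONLL")));
    }
}
```

For long texts a POST request is usually the better fit, since URLs have practical length limits.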
- While running the Pipeline, if you see an error regarding insufficient Java heap space, you will need to set JAVA_OPTIONS or MAVEN_OPTS to raise the maximum heap size (for example, "-Xmx20g"):
export MAVEN_OPTS="-Xmx20g"
- Between different runs of the Pipeline, if you see the following exception, you should remove the temporary cache folders created by MapDB.
Caused by: org.mapdb.DBException$DataCorruption: Header checksum broken. Store was not closed correctly, or is corrupted
- Initializing multiple instances of PipelineFactory in a single run will lead to an exception in MapDB. For example, the code below:
public class TestPipeline {
    public static TextAnnotation getTA(String id, String text) throws Exception {
        ResourceManager rm = new PipelineConfigurator().getConfig(new ResourceManager("pipeline-config.properties"));
        // The pipeline is instantiated every time this function is called.
        AnnotatorService prep = PipelineFactory.buildPipeline(rm);
        TextAnnotation rec = prep.createAnnotatedTextAnnotation(id, "", text);
        return rec;
    }

    public static void main(String[] args_) throws Exception {
        String text = "Houston, Monday, July 21 -- Men have landed and walked on the moon.";
        TextAnnotation rec1 = TestPipeline.getTA("1", text);
        text = "Here's another sentence to process.";
        TextAnnotation rec2 = TestPipeline.getTA("2", text);
    }
}
would lead to the following exception:
Exception in thread "main" org.mapdb.DBException$FileLocked: File is already opened and is locked: annotation-cache
at org.mapdb.volume.Volume.lockFile(Volume.java:446)
at org.mapdb.volume.RandomAccessFileVol.<init>(RandomAccessFileVol.java:52)
at org.mapdb.volume.RandomAccessFileVol$1.makeVolume(RandomAccessFileVol.java:26)
...
Caused by: java.nio.channels.OverlappingFileLockException
at sun.nio.ch.SharedFileLockTable.checkList(FileLockTable.java:255)
at sun.nio.ch.SharedFileLockTable.add(FileLockTable.java:152)
at sun.nio.ch.FileChannelImpl.lock(FileChannelImpl.java:1030)
...
To fix this problem, consider changing it to:
public class TestPipeline {
    public static TextAnnotation getTA(String id, String text, AnnotatorService prep) throws Exception {
        TextAnnotation rec = prep.createAnnotatedTextAnnotation(id, "", text);
        return rec;
    }

    public static void main(String[] args_) throws Exception {
        ResourceManager rm = new PipelineConfigurator().getConfig(new ResourceManager("pipeline-config.properties"));
        // The pipeline is instantiated only once and passed to getTA.
        AnnotatorService prep = PipelineFactory.buildPipeline(rm);
        String text = "Houston, Monday, July 21 -- Men have landed and walked on the moon.";
        TextAnnotation rec1 = TestPipeline.getTA("1", text, prep);
        text = "Here's another sentence to process.";
        TextAnnotation rec2 = TestPipeline.getTA("2", text, prep);
    }
}
To see the full license for this software, see LICENSE or visit the download page for this software and press 'Download'. The next screen displays the license.