This repository has been archived by the owner on May 29, 2020. It is now read-only.

History

jasonbaldridge edited this page Jan 16, 2013 · 3 revisions

Background

Chalk is based on OpenNLP. OpenNLP was started in 2000 by Jason Baldridge and Gann Bierner while they were graduate students in the Division of Informatics at the University of Edinburgh. OpenNLP, broadly speaking, was meant to be a high-level organizational unit for various open source software packages for natural language processing; more practically, it provided a high-level package name for various Java packages of the form opennlp.*. The first OpenNLP software package was the Grok natural language parsing toolkit, which was also the genesis of what is now called the OpenNLP Toolkit. The software released on the OpenNLP SourceForge site (started in 2000, along with Grok) was simply a set of interfaces defined in the package opennlp.common and referred to as the OpenNLP Java API. The actual implementations of natural language processing components were provided in Grok, along with code for sentence parsing with Combinatory Categorial Grammar. This code was used heavily in both Baldridge's and Bierner's dissertations. The first paper that used Grok, and especially the components that would become the OpenNLP Toolkit, is Hockenmaier, Bierner, and Baldridge (2000), which later appeared as the journal article Hockenmaier, Bierner, and Baldridge (2004).

In 2003, it was decided to remove the NLP infrastructure from Grok, as there was a clear separation between the basic text processing components and the syntactic and semantic analysis components. At the same time, Grok was rebranded as OpenCCG. The final release of the OpenNLP Java API was made in March 2003; the new OpenNLP Toolkit was created from the API and the Grok text processing components, with version 1.0 being released in April 2004. The OpenNLP Toolkit and OpenCCG have evolved independently since then and have mostly independent, active developer and user communities. OpenCCG is primarily used in the academic community, while OpenNLP has considerable use in both academia and industry. As an indication of the academic impact of OpenNLP, a search on Google Scholar (done in March 2010) returned about 650 publications citing the package. These results included the OpenNLP website, a few non-publications, and some self-citations. Based on a scan of the results, perhaps about 500 actual publications had used OpenNLP in their work as of March 2010, and there were an additional 50 or so quasi-publications such as surveys and instruction manuals.

Tom Morton saw OpenNLP through the mid-2000s, including co-authoring Taming Text. OpenNLP was accepted for incubation at the Apache Software Foundation in December 2010, and it graduated to being a full Apache project in March 2012.

Chalk's purpose is to allow development of the OpenNLP library in a Scala-centric way, incorporating elements of Scalabha and interfacing with other libraries such as Junto and Breeze.

Releases

  • 1.1.0: Pulled out opennlp.learn to create the Nak Machine Learning library. Support for training models on the MASC corpus.
  • 1.0: The first release, which is essentially the same as OpenNLP 1.5.3, but with opennlp.* renamed to chalk.* and using SBT as the build system.