Merge branch 'master' of github.com:LanguageMachines/PICCL
proycon committed Apr 5, 2018
2 parents 7f2c41e + 21001b5 commit 89d29f1
Showing 5 changed files with 92 additions and 142 deletions.
82 changes: 29 additions & 53 deletions README.md
@@ -34,27 +34,31 @@ annotation.

## Installation

PICCL is already shipped as a part of [LaMachine](https://proycon.github.io/LaMachine). Inside LaMachine, the command line interface is invoked as follows:
PICCL is already shipped as a part of [LaMachine](https://proycon.github.io/LaMachine), although you may need to explicitly install it using ``lamachine-update --edit``. Once inside LaMachine, the command line interface can be invoked by directly specifying one of the workflows:

$ nextflow run LanguageMachines/PICCL
$ ocr.nf

Alternatively, and for the command line interface only; you can install [Nextflow](https://www.nextflow.io) and [Docker](https://docker.io) manually and then run the
following to obtain PICCL:
Or

$ ticcl.nf

If you are not in LaMachine already, you can install [Nextflow](https://www.nextflow.io) and [Docker](https://docker.io) manually and then run the
following to obtain the latest development release of PICCL:

$ nextflow pull LanguageMachines/PICCL

In this case you need to ensure to always run it with the ``-with-docker proycon/lamachine`` parameter:
In this case you need to ensure to always run it with the ``-with-docker proycon/lamachine:piccl-stable`` parameter:

$ nextflow run LanguageMachines/PICCL -with-docker proycon/lamachine
$ nextflow run LanguageMachines/PICCL -with-docker proycon/lamachine:piccl-stable

We have prepared PICCL for work in many languages, mainly on the basis of available open source lexicons due to [Aspell](http://aspell.net), these data files serve as the input TICCL and have to be downloaded once as follows;
We have prepared PICCL for work in many languages, mainly on the basis of open-source lexicons available from [Aspell](http://aspell.net); these data files serve as the input for TICCL and have to be downloaded once as follows:

$ nextflow run LanguageMachines/PICCL/download-data.nf
$ nextflow run LanguageMachines/PICCL/download-data.nf -with-docker proycon/lamachine:piccl-stable

This will generate a ``data/`` directory in your current directory, which is referenced in the usage examples in the
next section. In addition, you can also download example corpora (>300MB), which will be placed in a ``corpora/`` directory:

$ nextflow run LanguageMachines/PICCL/download-examples.nf
$ nextflow run LanguageMachines/PICCL/download-examples.nf -with-docker proycon/lamachine:piccl-stable
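Both downloads land in the current working directory; as a quick sanity check (a shell sketch, not part of PICCL), you can verify that both directories exist before continuing:

```shell
# Sketch: verify the download workflows populated the expected directories
# in the current working directory (data/ from download-data.nf,
# corpora/ from download-examples.nf).
for d in data corpora; do
    [ -d "$d" ] && echo "$d: ok" || echo "$d: missing"
done
```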

## Usage

@@ -70,11 +74,13 @@ PICCL comes with the following workflows, most of which complement one or more of the others:
* ``foliavalidator.nf`` - A simple validation workflow to validate FoLiA documents.
* ``dbnl.nf`` - A pipeline for linguistic enrichment of DBNL corpus data (designed for the Nederlab project, does not use TICCL)

The workflows can be explicitly invoked through NextFlow as follows (add the ``-with-docker proycon/lamachine`` parameter if you
are not already in LaMachine, this applies to all examples in this section), running with the ``--help`` parameter or absence of any parameters will output usage
If you are inside LaMachine, you can invoke these directly. If you let Nextflow manage LaMachine through Docker, then
you have to invoke them like ``nextflow run LanguageMachines/PICCL/ocr.nf -with-docker proycon/lamachine:piccl-stable``. This applies to all examples in this section.

Running with the ``--help`` parameter, or without any parameters, will output usage
information.

$ nextflow run LanguageMachines/PICCL/ocr.nf --help
$ ocr.nf --help
--------------------------
OCR Pipeline
--------------------------
@@ -102,7 +108,7 @@
(The hyphen delimiter may optionally be changed using --seqdelimiter)


$ nextflow run LanguageMachines/PICCL/ticcl.nf --help
$ ticcl.nf --help
--------------------------
TICCL Pipeline
--------------------------
@@ -131,69 +137,39 @@ An example of invoking an OCR workflow for English is provided below; it assumes the example corpora have been downloaded into the ``corpora/``
directory. It OCRs the ``OllevierGeets.pdf`` file, which contains scanned image data, therefore we choose the
``pdfimages`` input type.

$ nextflow run LanguageMachines/PICCL/ocr.nf --inputdir corpora/PDF/ENG/ --inputtype pdfimages --language eng
$ ocr.nf --inputdir corpora/PDF/ENG/ --inputtype pdfimages --language eng

Alternative input types are images per page, in which case ``inputtype`` is set to either ``tif``, ``jpg``, ``gif`` or ``png``. These input files should be placed in the designated input directory and follow the naming convention
``$documentname-$sequencenumber.$extension``, for example ``harrypotter-032.png``. An example invocation on Dutch
scanned pages in the example collection would be:

$ nextflow run LanguageMachines/PICCL/ocr.nf --inputdir corpora/TIFF/NLD/ --inputtype tif --language nld
$ ocr.nf --inputdir corpora/TIFF/NLD/ --inputtype tif --language nld
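The naming convention above can be checked mechanically. The sketch below (an illustration, not part of PICCL) splits a page-image filename into document name, sequence number, and extension, with the delimiter configurable in the same spirit as the ``--seqdelimiter`` option:

```python
import os
import re

def parse_page_filename(filename, delimiter="-"):
    """Split e.g. 'harrypotter-032.png' into ('harrypotter', 32, 'png').

    Mirrors the $documentname-$sequencenumber.$extension convention;
    the delimiter is configurable, like PICCL's --seqdelimiter option.
    """
    stem, ext = os.path.splitext(filename)
    # rpartition so document names may themselves contain the delimiter
    docname, _, seq = stem.rpartition(delimiter)
    if not docname or not re.fullmatch(r"\d+", seq):
        raise ValueError(f"{filename!r} does not follow the convention")
    return docname, int(seq), ext.lstrip(".")

print(parse_page_filename("harrypotter-032.png"))  # → ('harrypotter', 32, 'png')
```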

In the case of the first example, the result will be a file ``OllevierGeets.folia.xml`` in the ``ocr_output/`` directory. This in turn can serve as
input for the TICCL workflow, which will attempt to correct OCR errors:

$ nextflow run LanguageMachines/PICCL/ticcl.nf --inputdir ocr_output/ --lexicon data/int/eng/eng.aspell.dict --alphabet data/int/eng/eng.aspell.dict.lc.chars --charconfus data/int/eng/eng.aspell.dict.c0.d2.confusion
$ ticcl.nf --inputdir ocr_output/ --lexicon data/int/eng/eng.aspell.dict --alphabet data/int/eng/eng.aspell.dict.lc.chars --charconfus data/int/eng/eng.aspell.dict.c0.d2.confusion

Note that here we pass a language-specific lexicon file, alphabet file, and character confusion file from the data files obtained by
``download-data.nf``. The result will be a file ``OllevierGeets.folia.ticcl.xml`` in the ``ticcl_output/`` directory,
containing enriched corrections. The second example, on the Dutch corpus data, can be run as follows:

$ nextflow run LanguageMachines/PICCL/ticcl.nf --inputdir ocr_output/ --lexicon data/int/nld/nld.aspell.dict --alphabet data/int/nld/nld.aspell.dict.lc.chars --charconfus data/int/eng/nld.aspell.dict.c20.d2.confusion
$ ticcl.nf --inputdir ocr_output/ --lexicon data/int/nld/nld.aspell.dict --alphabet data/int/nld/nld.aspell.dict.lc.chars --charconfus data/int/nld/nld.aspell.dict.c20.d2.confusion
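The three language-specific files follow one pattern under ``data/int/``. A small helper (illustrative only — the cost component of the confusion file, e.g. ``c0`` for English versus ``c20`` for Dutch in these examples, varies per language, so it is a parameter here) could assemble the arguments:

```python
import os

def ticcl_data_args(lang, cost="c0", datadir="data"):
    """Build the --lexicon/--alphabet/--charconfus paths for a language,
    following the data/int/<lang>/<lang>.aspell.dict* layout produced by
    download-data.nf. The 'cost' component of the confusion file differs
    per language (e.g. c0 for English, c20 for Dutch in the examples above).
    """
    base = os.path.join(datadir, "int", lang, f"{lang}.aspell.dict")
    return {
        "--lexicon": base,
        "--alphabet": base + ".lc.chars",
        "--charconfus": f"{base}.{cost}.d2.confusion",
    }

print(ticcl_data_args("eng")["--charconfus"])
# → data/int/eng/eng.aspell.dict.c0.d2.confusion
```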


## Webapplication / RESTful webservice

### Installation

PICCL is also available as a webapplication and RESTful webservice, powered by [CLAM](https://proycon.github.io/clam).
If you are in LaMachine, the webservice is already installed; if not, you will have to clone this git repository, edit
``picclservice.py`` (the service configuration file) for your system, and then run:

$ cd webservice
$ python3 setup.py install

Before the webservice can be used in any shape or form, it is necessary to download the required data into the appropriate directory
(configured as ``PICCLDATAROOT`` in ``picclservice.py``) so the webservice can find it. Follow the instructions
according to your flavour of LaMachine:

In the LaMachine Virtual Machine or within the Docker container:

$ sudo mkdir /var/piccldata
$ cd /var/piccldata
$ sudo nextflow run LanguageMachines/PICCL/download-data.nf
$ sudo nextflow run LanguageMachines/PICCL/download-examples.nf
$ sudo mkdir clamdata && sudo chown vagrant clamdata

In the LaMachine Local Virtual Environment:

(lamachine)$ mkdir $VIRTUAL_ENV/piccldata
(lamachine)$ cd $VIRTUAL_ENV/piccldata
(lamachine)$ nextflow run LanguageMachines/PICCL/download-data.nf
(lamachine)$ nextflow run LanguageMachines/PICCL/download-examples.nf

### Usage

In the LaMachine Local Virtual Environment:

(lamachine)$ clamservice picclservice.picclservice

This will launch a development server on port 8080 and is not suitable for production use!
If you are in LaMachine with PICCL, the webservice is already installed, but you may need to run
``lamachine-start-webserver`` if it is not already running.

In the LaMachine VM, simply reboot the VM after having downloaded the data; the webservice will then be available when
connecting to http://127.0.0.1:8080 .
In the LaMachine Docker container, explicitly start the webservices after having downloaded the data for PICCL: ``sudo /usr/src/LaMachine/startwebservices.sh``, and access the aforementioned URL.
For production environments, you will want to adapt the CLAM configuration. To this end,
copy ``$LM_PREFIX/etc/piccl.config.yml`` to ``$LM_PREFIX/etc/piccl.$HOST.yml``, where ``$HOST`` corresponds to your
hostname, and edit the file with your host-specific settings. Always enable authentication if your server is world-accessible (consult the CLAM
documentation to read how).

For any kind of production use, you will want to enable some form of authentication in ``webservice/picclservice/picclservice.py`` (rerun ``setup.py install`` after editing) and hook it up to an existing webserver.



4 changes: 4 additions & 0 deletions webservice/picclservice/piccl.config.yml
@@ -0,0 +1,4 @@
port: 8080
root: "{{VIRTUAL_ENV}}/piccl.clam"
piccldir: "{{VIRTUAL_ENV}}/opt/PICCL"
piccldataroot: "{{VIRTUAL_ENV}}/opt/PICCL"
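Following the host-specific override described in the README, a file saved as ``piccl.$HOST.yml`` could look like this (hypothetical values — only the keys shown in the shipped template above are known to be recognized):

```yaml
# Hypothetical piccl.myhost.example.org.yml — adjust paths for your system
port: 443
root: "/var/www/piccl-data"
piccldir: "/opt/PICCL"
piccldataroot: "/var/piccldata"
```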
83 changes: 16 additions & 67 deletions webservice/picclservice/picclservice.py
@@ -30,7 +30,7 @@
import os
from base64 import b64decode as D

REQUIRE_VERSION = 2.1
REQUIRE_VERSION = 2.3

CLAMDIR = clam.__path__[0] #directory where CLAM is installed, detected automatically
WEBSERVICEDIR = os.path.dirname(os.path.abspath(__file__)) #directory where this webservice is installed, detected automatically
@@ -48,63 +48,18 @@
#An informative description for this system (this should be fairly short, about one paragraph, and may not contain HTML)
SYSTEM_DESCRIPTION = "PICCL"

# ======== LOCATION ===========

#Add a section for your host:

host = os.uname()[1]
if 'VIRTUAL_ENV' in os.environ:

    HOST = host
    if host in ('applejack','mlp01'): #production configuration in Nijmegen
        HOST = "webservices-lst.science.ru.nl"
        PORT = 443
        URLPREFIX = "piccl"
        USERS_MYSQL = {
            'host': 'mysql-clamopener.science.ru.nl',
            'user': 'clamopener',
            'password': D(open(os.environ['CLAMOPENER_KEYFILE']).read().strip()),
            'database': 'clamopener',
            'table': 'clamusers_clamusers'
        }
        DEBUG = True
        REALM = "WEBSERVICES-LST"
        DIGESTOPAQUE = open(os.environ['CLAM_DIGESTOPAQUEFILE']).read().strip()
        SECRET_KEY = open(os.environ['CLAM_SECRETKEYFILE']).read().strip()
        ADMINS = ['proycon','antalb','wstoop']
        MAXLOADAVG = 20.0
    else:
        PORT = 8080

    PICCLDATAROOT = os.path.join(os.environ['VIRTUAL_ENV'], 'piccldata') #Path that holds the data/ and corpora/ dirs
    if not os.path.exists(PICCLDATAROOT):
        raise Exception("Data root dir " + PICCLDATAROOT + " is not initialised yet. Create the directory, enter it and run: nextflow run LanguageMachines/PICCL/download-data.nf and nextflow run LanguageMachines/PICCL/download-examples.nf")

    if host == 'mlp01': #production configuration in Nijmegen
        ROOT = "/var/www/webservices-lst/live/writable/piccl"
    else:
        ROOT = PICCLDATAROOT + "/clamdata/"

    PICCLDIR = os.path.join(os.environ['VIRTUAL_ENV'], "src/PICCL")

elif os.path.exists('/var/piccldata'):
    #assume we are running in LaMachine docker or VM:

    HOST = host
    PORT = 80 #(for HTTPS set this to 443)
    URLPREFIX = '/piccl/'

    PICCLDATAROOT = '/var/piccldata' #Path that holds the data/ and corpora/ dirs
    if not os.path.exists(PICCLDATAROOT):
        raise Exception("Data root dir " + PICCLDATAROOT + " is not initialised yet. Create the directory, enter it and run: nextflow run LanguageMachines/PICCL/download-data.nf and nextflow run LanguageMachines/PICCL/download-examples.nf")

    ROOT = PICCLDATAROOT + "/clamdata/"
    PICCLDIR = None #let Nextflow handle it
else:
    raise Exception("I don't know where I'm running from! Add a section in the configuration corresponding to this host (" + os.uname()[1]+")")
#Amount of free memory required prior to starting a new process (in MB!), Free Memory + Cached (without swap!). Set to 0 to disable this check (not recommended)
REQUIREMEMORY = 1024

#Maximum load average at which processes are still started (first number reported by 'uptime'). Set to 0 to disable this check (not recommended)
#MAXLOADAVG = 4.0

#Minimum amount of free diskspace in MB. Set to 0 to disable this check (not recommended)
DISK = '/dev/sda1' #set this to the disk where ROOT is on
MINDISKSPACE = 0

#The amount of diskspace a user may use (in MB), this is a soft quota which can be exceeded, but creation of new projects is blocked until usage drops below the quota again
#USERQUOTA = 100

# ======== AUTHENTICATION & SECURITY ===========

@@ -124,21 +79,14 @@

#USERS = { user1': '4f8dh8337e2a5a83734b','user2': pwhash('username', REALM, 'secret') }

#Amount of free memory required prior to starting a new process (in MB!), Free Memory + Cached (without swap!). Set to 0 to disable this check (not recommended)
REQUIREMEMORY = 1024

#Maximum load average at which processes are still started (first number reported by 'uptime'). Set to 0 to disable this check (not recommended)
#MAXLOADAVG = 4.0
#The secret key is used internally for cryptographically signing session data, in production environments, you'll want to set this to a persistent value. If not set it will be randomly generated.
#SECRET_KEY = 'mysecret'

#Minimum amount of free diskspace in MB. Set to 0 to disable this check (not recommended)
DISK = '/dev/sda1' #set this to the disk where ROOT is on
MINDISKSPACE = 0

#The amount of diskspace a user may use (in MB), this is a soft quota which can be exceeded, but creation of new projects is blocked until usage drops below the quota again
#USERQUOTA = 100
#load external configuration file (see piccl.config.yml)
loadconfig(__name__)

#The secret key is used internally for cryptographically signing session data, in production environments, you'll want to set this to a persistent value. If not set it will be randomly generated.
#SECRET_KEY = 'mysecret'
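CLAM's ``loadconfig()`` reads the external file referenced above (``piccl.config.yml``); its exact behaviour is defined by CLAM and not reproduced here. A simplified, stand-alone sketch of the idea — promoting flat key/value pairs from a YAML-like file to uppercase module-level settings — looks like this:

```python
import sys

def load_simple_config(module_name, path):
    """Toy illustration (NOT CLAM's actual loadconfig): read flat
    'key: value' pairs from a YAML-like file and set them as uppercase
    attributes on the given module, the way a key such as 'port' in
    piccl.config.yml ends up as a setting like PORT."""
    module = sys.modules[module_name]
    settings = {}
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments
            if not line or ":" not in line:
                continue
            key, value = line.split(":", 1)
            value = value.strip().strip('"')
            settings[key.strip().upper()] = int(value) if value.isdigit() else value
    for key, value in settings.items():
        setattr(module, key, value)
    return settings
```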

# ======== WEB-APPLICATION STYLING =============

@@ -268,6 +216,7 @@
InputTemplate('lexicon', PlainTextFormat, "Lexicon (one word per line)",
filename="lexicon.lst",
unique=True,
optional=True,
),
OutputTemplate('ranked', PlainTextFormat, 'Ranked Variant Output',
SetMetaField('encoding','utf-8'),
