diff --git a/README.md b/README.md
index 2593abf..0efafeb 100644
--- a/README.md
+++ b/README.md
@@ -34,27 +34,31 @@ annotation.

 ## Installation

-PICCL is already shipped as a part of [LaMachine](https://proycon.github.io/LaMachine). Inside LaMachine, the command line interface is invoked as follows:
+PICCL is already shipped as a part of [LaMachine](https://proycon.github.io/LaMachine), although you may need to explicitly install it using ``lamachine-update --edit``. Once inside LaMachine, the command line interface can be invoked by directly specifying one of the workflows:

-    $ nextflow run LanguageMachines/PICCL
+    $ ocr.nf

-Alternatively, and for the command line interface only; you can install [Nextflow](https://www.nextflow.io) and [Docker](https://docker.io) manually and then run the
-following to obtain PICCL:
+Or
+
+    $ ticcl.nf
+
+If you are not in LaMachine already, you can install [Nextflow](https://www.nextflow.io) and [Docker](https://docker.io) manually and then run the
+following to obtain the latest development release of PICCL:

     $ nextflow pull LanguageMachines/PICCL

-In this case you need to ensure to always run it with the ``-with-docker proycon/lamachine`` parameter:
+In this case you need to ensure you always run it with the ``-with-docker proycon/lamachine:piccl-stable`` parameter:

-    $ nextflow run LanguageMachines/PICCL -with-docker proycon/lamachine
+    $ nextflow run LanguageMachines/PICCL -with-docker proycon/lamachine:piccl-stable

-We have prepared PICCL for work in many languages, mainly on the basis of available open source lexicons due to [Aspell](http://aspell.net), these data files serve as the input TICCL and have to be downloaded once as follows;
+We have prepared PICCL for work in many languages, mainly on the basis of open-source lexicons available from [Aspell](http://aspell.net); these data files serve as the input for TICCL and have to be downloaded once as follows:

-    $ nextflow run LanguageMachines/PICCL/download-data.nf
+    $ nextflow run LanguageMachines/PICCL/download-data.nf -with-docker proycon/lamachine:piccl-stable

 This will generate a ``data/`` directory in your current directory, and will be referenced in the usage examples in the next section. In addition, you can also download example corpora (>300MB), which will be placed in a ``corpora/`` directory:

-    $ nextflow run LanguageMachines/PICCL/download-examples.nf
+    $ nextflow run LanguageMachines/PICCL/download-examples.nf -with-docker proycon/lamachine:piccl-stable

 ## Usage

@@ -70,11 +74,13 @@ PICCL comes with the following workflows, most of them complement one or more ot
 * ``foliavalidator.nf`` - A simple validation workflow to validate FoLiA documents.
 * ``dbnl.nf`` - A pipeline for linguistic enrichment DBNL corpus data (designed for the Nederlab project, does not use TICCL)

-The workflows can be explicitly invoked through NextFlow as follows (add the ``-with-docker proycon/lamachine`` parameter if you
-are not already in LaMachine, this applies to all examples in this section), running with the ``--help`` parameter or absence of any parameters will output usage
+If you are inside LaMachine, you can invoke these directly. If you let Nextflow manage LaMachine through Docker, then
+you have to invoke them like ``nextflow run LanguageMachines/PICCL/ocr.nf -with-docker proycon/lamachine:piccl-stable``. This applies to all examples in this section.
+
+Running with the ``--help`` parameter or the absence of any parameters will output usage
 information.

-    $ nextflow run LanguageMachines/PICCL/ocr.nf --help
+    $ ocr.nf --help

     --------------------------
     OCR Pipeline
     --------------------------
@@ -102,7 +108,7 @@
     (The hyphen delimiter may optionally be changed using --seqdelimiter)

-    $ nextflow run LanguageMachines/PICCL/ticcl.nf --help
+    $ ticcl.nf --help

     --------------------------
     TICCL Pipeline
     --------------------------
@@ -131,24 +137,24 @@ An example of invoking an OCR workflow for English is provided below, it assumes
 directory. It OCRs the ``OllevierGeets.pdf`` file, which contains scanned image data, therefore we choose the ``pdfimages``
 input type.

-    $ nextflow run LanguageMachines/PICCL/ocr.nf --inputdir corpora/PDF/ENG/ --inputtype pdfimages --language eng
+    $ ocr.nf --inputdir corpora/PDF/ENG/ --inputtype pdfimages --language eng

 Alternative input types are images per page, in which case ``inputtype`` is set to either ``tif``, ``jpg``, ``gif`` or ``png``.
 These input files should be placed in the designated input directory and follow the naming convention
 ``$documentname-$sequencenumber.$extension``, for example ``harrypotter-032.png``. An example invocation on dutch scanned
 pages in the example collection would be:

-    $ nextflow run LanguageMachines/PICCL/ocr.nf --inputdir corpora/TIFF/NLD/ --inputtype tif --language nld
+    $ ocr.nf --inputdir corpora/TIFF/NLD/ --inputtype tif --language nld

 In case of the first example the result will be a file ``OllevierGeets.folia.xml`` in the ``ocr_output/`` directory.
 This in turn can serve as input for the TICCL workflow, which will attempt to correct OCR errors:

-    $ nextflow run LanguageMachines/PICCL/ticcl.nf --inputdir ocr_output/ --lexicon data/int/eng/eng.aspell.dict --alphabet data/int/eng/eng.aspell.dict.lc.chars --charconfus data/int/eng/eng.aspell.dict.c0.d2.confusion
+    $ ticcl.nf --inputdir ocr_output/ --lexicon data/int/eng/eng.aspell.dict --alphabet data/int/eng/eng.aspell.dict.lc.chars --charconfus data/int/eng/eng.aspell.dict.c0.d2.confusion

 Note that here we pass a language-specific lexicon file, alphabet file, and character confusion file from the data files
 obtained by ``download-data.nf``. Result will be a file ``OllevierGeets.folia.ticcl.xml`` in the ``ticcl_output/`` directory,
 containing enriched corrections. The second example, on the dutch corpus data, can be run as follows:

-    $ nextflow run LanguageMachines/PICCL/ticcl.nf --inputdir ocr_output/ --lexicon data/int/nld/nld.aspell.dict --alphabet data/int/nld/nld.aspell.dict.lc.chars --charconfus data/int/eng/nld.aspell.dict.c20.d2.confusion
+    $ ticcl.nf --inputdir ocr_output/ --lexicon data/int/nld/nld.aspell.dict --alphabet data/int/nld/nld.aspell.dict.lc.chars --charconfus data/int/nld/nld.aspell.dict.c20.d2.confusion

 ## Webapplication / RESTful webservice

@@ -156,44 +162,14 @@

 ### Installation

 PICCL is also available as a webapplication and RESTful webservice, powered by [CLAM](https://proycon.github.io/clam).
-If you are in LaMachine, the webservice is already installed, if not you will have to clone this git repository, edit
-``picclservice.py`` (the service configuration file) for your system and then run:
-
-    $ cd webservice
-    $ python3 setup.py install
-
-Before the webservice can be used, in any shape or form, it is necessary to download the necessary data into the appropriate directory
-(configured as ``PICCLDATAROOT`` in ``picclservice.py``) so the webservice can find it. Follow the instructions
-according to your flavour of LaMachine:
-
-In the LaMachine Virtual Machine or within the Docker container:
-
-    $ sudo mkdir /var/piccldata
-    $ cd /var/piccldata
-    $ sudo nextflow run LanguageMachines/PICCL/download-data.nf
-    $ sudo nextflow run LanguageMachines/PICCL/download-examples.nf
-    $ sudo mkdir clamdata && sudo chown vagrant clamdata
-
-In the LaMachine Local Virtual Environment:
-
-    (lamachine)$ mkdir $VIRTUAL_ENV/piccldata
-    (lamachine)$ cd $VIRTUAL_ENV/piccldata
-    (lamachine)$ nextflow run LanguageMachines/PICCL/download-data.nf
-    (lamachine)$ nextflow run LanguageMachines/PICCL/download-examples.nf
-
-### Usage
-
-In the LaMachine Local Virtual Environment:
-
-    (lamachine)$ clamservice picclservice.picclservice
-
-This will launch a development server on port 8080 and is not suitable for production use!
+If you are in LaMachine with PICCL, the webservice is already installed, but you may need to run
+``lamachine-start-webserver`` if it is not already running.

-In LaMachine VM, just reboot the VM after having downloaded the data and the webservice will be available when
-connecting to http://127.0.0.1:8080 .
-In LaMachine Docker container, explicitly start the webservices after having downloaded the data for PICCL: ``sudo /usr/src/LaMachine/startwebservices.sh``, and access the aforementioned URL.
+For production environments, you will want to adapt the CLAM configuration. To this end,
+copy ``$LM_PREFIX/etc/piccl.config.yml`` to ``$LM_PREFIX/etc/piccl.$HOST.yml``, where ``$HOST`` corresponds to your
+hostname, and edit the file with your host-specific settings. Always enable authentication if your server is world-accessible (consult the CLAM
+documentation for instructions).

-For any kind of production use, you will want to enable some form of authentication in ``webservice/picclservice/picclservice.py`` (rerun ``setup.py install`` after editing) and hook it up to an existing webserver.
diff --git a/webservice/picclservice/piccl.config.yml b/webservice/picclservice/piccl.config.yml
new file mode 100644
index 0000000..a5297c0
--- /dev/null
+++ b/webservice/picclservice/piccl.config.yml
@@ -0,0 +1,4 @@
+port: 8080
+root: "{{VIRTUAL_ENV}}/piccl.clam"
+piccldir: "{{VIRTUAL_ENV}}/opt/PICCL"
+piccldataroot: "{{VIRTUAL_ENV}}/opt/PICCL"
diff --git a/webservice/picclservice/picclservice.py b/webservice/picclservice/picclservice.py
index a69a8d7..979ad47 100644
--- a/webservice/picclservice/picclservice.py
+++ b/webservice/picclservice/picclservice.py
@@ -30,7 +30,7 @@
 import os
 from base64 import b64decode as D

-REQUIRE_VERSION = 2.1
+REQUIRE_VERSION = 2.3

 CLAMDIR = clam.__path__[0] #directory where CLAM is installed, detected automatically
 WEBSERVICEDIR = os.path.dirname(os.path.abspath(__file__)) #directory where this webservice is installed, detected automatically
@@ -48,63 +48,18 @@
 #An informative description for this system (this should be fairly short, about one paragraph, and may not contain HTML)
 SYSTEM_DESCRIPTION = "PICCL"

-# ======== LOCATION ===========
-
-#Add a section for your host:
-
-host = os.uname()[1]
-if 'VIRTUAL_ENV' in os.environ:
-
-    HOST = host
-    if host in ('applejack','mlp01'): #production configuration in Nijmegen
-        HOST = "webservices-lst.science.ru.nl"
-        PORT= 443
-        URLPREFIX = "piccl"
-        USERS_MYSQL = {
-            'host': 'mysql-clamopener.science.ru.nl',
-            'user': 'clamopener',
-            'password': D(open(os.environ['CLAMOPENER_KEYFILE']).read().strip()),
-            'database': 'clamopener',
-            'table': 'clamusers_clamusers'
-        }
-        DEBUG = True
-        REALM = "WEBSERVICES-LST"
-        DIGESTOPAQUE = open(os.environ['CLAM_DIGESTOPAQUEFILE']).read().strip()
-        SECRET_KEY = open(os.environ['CLAM_SECRETKEYFILE']).read().strip()
-        ADMINS = ['proycon','antalb','wstoop']
-        MAXLOADAVG = 20.0
-    else:
-        PORT = 8080
-
-    PICCLDATAROOT = os.path.join(os.environ['VIRTUAL_ENV'], 'piccldata') #Path that holds the data/ and corpora/ dirs
-    if not os.path.exists(PICCLDATAROOT):
-        raise Exception("Data root dir " + PICCLDATAROOT + " is not initialised yet. Create the directory, enter it and run: nextflow run LanguageMachines/PICCL/download-data.nf and nextflow run LanguageMachines/PICCL/download-examples.nf")
-
-    if host == 'mlp01': #production configuration in Nijmegen
-        ROOT = "/var/www/webservices-lst/live/writable/piccl"
-    else:
-        ROOT = PICCLDATAROOT + "/clamdata/"
-
-    PICCLDIR = os.path.join(os.environ['VIRTUAL_ENV'], "src/PICCL")
-
-elif os.path.exists('/var/piccldata'):
-    #assume we are running in LaMachine docker or VM:
-
-    HOST = host
-    PORT = 80 #(for HTTPS set this to 443)
-    URLPREFIX = '/piccl/'
-
-    PICCLDATAROOT = '/var/piccldata' #Path that holds the data/ and corpora/ dirs
-    if not os.path.exists(PICCLDATAROOT):
-        raise Exception("Data root dir " + PICCLDATAROOT + " is not initialised yet. Create the directory, enter it and run: nextflow run LanguageMachines/PICCL/download-data.nf and nextflow run LanguageMachines/PICCL/download-examples.nf")
-
-    ROOT = PICCLDATAROOT + "/clamdata/"
-    PICCLDIR = None #let Nextflow handle it
-else:
-    raise Exception("I don't know where I'm running from! Add a section in the configuration corresponding to this host (" + os.uname()[1]+")")
+#Amount of free memory required prior to starting a new process (in MB!), Free Memory + Cached (without swap!). Set to 0 to disable this check (not recommended)
+REQUIREMEMORY = 1024
+#Maximum load average at which processes are still started (first number reported by 'uptime'). Set to 0 to disable this check (not recommended)
+#MAXLOADAVG = 4.0
+#Minimum amount of free diskspace in MB. Set to 0 to disable this check (not recommended)
+DISK = '/dev/sda1' #set this to the disk where ROOT is on
+MINDISKSPACE = 0
+
+#The amount of diskspace a user may use (in MB), this is a soft quota which can be exceeded, but creation of new projects is blocked until usage drops below the quota again
+#USERQUOTA = 100

 # ======== AUTHENTICATION & SECURITY ===========
@@ -124,21 +79,14 @@
 #USERS = { user1': '4f8dh8337e2a5a83734b','user2': pwhash('username', REALM, 'secret') }

-#Amount of free memory required prior to starting a new process (in MB!), Free Memory + Cached (without swap!). Set to 0 to disable this check (not recommended)
-REQUIREMEMORY = 1024
-#Maximum load average at which processes are still started (first number reported by 'uptime'). Set to 0 to disable this check (not recommended)
-#MAXLOADAVG = 4.0
+#The secret key is used internally for cryptographically signing session data, in production environments, you'll want to set this to a persistent value. If not set it will be randomly generated.
+#SECRET_KEY = 'mysecret'

-#Minimum amount of free diskspace in MB. Set to 0 to disable this check (not recommended)
-DISK = '/dev/sda1' #set this to the disk where ROOT is on
-MINDISKSPACE = 0
-#The amount of diskspace a user may use (in MB), this is a soft quota which can be exceeded, but creation of new projects is blocked until usage drops below the quota again
-#USERQUOTA = 100
+#load external configuration file (see piccl.config.yml)
+loadconfig(__name__)

-#The secret key is used internally for cryptographically signing session data, in production environments, you'll want to set this to a persistent value. If not set it will be randomly generated.
-#SECRET_KEY = 'mysecret'

 # ======== WEB-APPLICATION STYLING =============
@@ -268,6 +216,7 @@
     InputTemplate('lexicon', PlainTextFormat, "Lexicon (one word per line)",
         filename="lexicon.lst",
         unique=True,
+        optional=True,
     ),
     OutputTemplate('ranked', PlainTextFormat, 'Ranked Variant Output',
         SetMetaField('encoding','utf-8'),
diff --git a/webservice/picclservice/picclservice_wrapper.py b/webservice/picclservice/picclservice_wrapper.py
index 89ba3d4..f388e0c 100755
--- a/webservice/picclservice/picclservice_wrapper.py
+++ b/webservice/picclservice/picclservice_wrapper.py
@@ -40,9 +40,11 @@
     #use scripts from src/ directly
     run_piccl = sys.argv[6]
     if run_piccl[-1] != '/': run_piccl += "/"
+    print("Running PICCL from " + run_piccl,file=sys.stderr)
 else:
-    #use the piccl nextflow downloads
+    #use the piccl nextflow downloads (this is not very well supported/tested currently!)
     run_piccl = "nextflow run LanguageMachines/PICCL/"
+    print("Running PICCL mediated by Nextflow",file=sys.stderr)

 #If you make use of CUSTOM_FORMATS, you need to import your service configuration file here and set clam.common.data.CUSTOM_FORMATS
@@ -58,11 +60,29 @@
 clam.common.status.write(statusfile, "Starting...")

-def fail():
+def fail(prefix=None):
+    if prefix:
+        nextflowout(prefix)
     if os.path.exists('work'):
         shutil.rmtree('work')
     sys.exit(1)

+def nextflowout(prefix):
+    print("[" + prefix + "] Nextflow standard error output",file=sys.stderr)
+    print("-------------------------------------------------",file=sys.stderr)
+    print(open(prefix+'.nextflow.err.log','r',encoding='utf-8').read(), file=sys.stderr)
+    os.unlink(prefix+'.nextflow.err.log')
+
+    print("[" + prefix + "] Nextflow standard output",file=sys.stderr)
+    print("-------------------------------------------------",file=sys.stderr)
+    print(open(prefix+'.nextflow.out.log','r',encoding='utf-8').read(), file=sys.stderr)
+    os.unlink(prefix+'.nextflow.out.log')
+
+    if os.path.exists('trace.txt'):
+        print("[" + prefix + "] Nextflow trace summary",file=sys.stderr)
+        print("-------------------------------------------------",file=sys.stderr)
+        print(open('trace.txt','r',encoding='utf-8').read(), file=sys.stderr)
+        os.unlink('trace.txt')

 #=========================================================================================================================
@@ -127,8 +147,8 @@ def fail():
 #Derive input type from used inputtemplate
 inputtype = ''
 for inputfile in clamdata.input:
-    inputtemplate = inputfile.metadata.inputtemplate
-    if inputtemplate in ('pdfimages', 'pdftext', 'tif','jpg','png','gif','foliaocr','textocr'):
+    inputtemplate = inputfile.metadata.inputtemplate
+    if inputtemplate in ('pdfimages', 'pdftext', 'tif','jpg','png','gif','foliaocr','textocr'):
         inputtype = inputtemplate

 if not inputtype:
@@ -148,14 +168,13 @@ def fail():
         ticcl_inputtype = "pdf"
     else:
         clam.common.status.write(statusfile, "Running OCR Pipeline",1) # status update
-        if os.system(run_piccl + "ocr.nf --inputdir " + shellsafe(inputdir,'"') + " --outputdir ocr_output --inputtype " + shellsafe(inputtype,'"') + " --language " + shellsafe(clamdata['lang'],'"') +" -with-trace >&2" ) != 0: #use original clamdata['lang'] (may be deu_frak)
-            fail()
+        if os.system(run_piccl + "ocr.nf --inputdir " + shellsafe(inputdir,'"') + " --outputdir ocr_output --inputtype " + shellsafe(inputtype,'"') + " --language " + shellsafe(clamdata['lang'],'"') +" -with-trace >ocr.nextflow.out.log 2>ocr.nextflow.err.log" ) != 0: #use original clamdata['lang'] (may be deu_frak)
+            fail('ocr')

-        #Print Nextflow trace information to stderr so it ends up in the CLAM error.log and is available for inspection
-        print("OCR pipeline trace summary",file=sys.stderr)
-        print("-------------------------------",file=sys.stderr)
-        print(open('trace.txt','r',encoding='utf-8').read(), file=sys.stderr)
+        #Print Nextflow information to stderr so it ends up in the CLAM error.log and is available for inspection
+        nextflowout('ocr')
+
         ticclinputdir = "ocr_output"
         ticcl_inputtype = "folia"
@@ -172,13 +191,13 @@
         ticcl_outputdir = 'ticcl_out'
     else:
         ticcl_outputdir = outputdir
-    if os.system(run_piccl + "ticcl.nf --inputdir " + ticclinputdir + " --inputtype " + ticcl_inputtype + " --outputdir " + shellsafe(ticcl_outputdir,'"') + " --lexicon lexicon.lst --alphabet alphabet.lst --charconfus confusion.lst --clip " + shellsafe(clamdata['rank']) + " --distance " + shellsafe(clamdata['distance']) + " --clip " + shellsafe(clamdata['rank']) + " --pdfhandling " + pdfhandling + " -with-trace >&2" ) != 0:
-        fail()
+    if os.system(run_piccl + "ticcl.nf --inputdir " + ticclinputdir + " --inputtype " + ticcl_inputtype + " --outputdir " + shellsafe(ticcl_outputdir,'"') + " --lexicon lexicon.lst --alphabet alphabet.lst --charconfus confusion.lst --clip " + shellsafe(clamdata['rank']) + " --distance " + shellsafe(clamdata['distance']) + " --pdfhandling " + pdfhandling + " -with-trace >ticcl.nextflow.out.log 2>ticcl.nextflow.err.log" ) != 0:
+        fail('ticcl')
+
+    #Print Nextflow information to stderr so it ends up in the CLAM error.log and is available for inspection
+    nextflowout('ticcl')
+
-    #Print Nextflow trace information to stderr so it ends up in the CLAM error.log and is available for inspection
-    print("TICCL pipeline trace summary",file=sys.stderr)
-    print("-------------------------------",file=sys.stderr)
-    print(open('trace.txt','r',encoding='utf-8').read(), file=sys.stderr)
     frog_inputdir = ticcl_outputdir
     textclass_opts = ""
 else:
@@ -190,12 +209,14 @@ def fail():
 if 'frog' in clamdata and clamdata['frog']:
     print("Running Frog...",file=sys.stderr)
     clam.common.status.write(statusfile, "Running Frog Pipeline (linguistic enrichment)",75) # status update
-    if os.system(run_piccl + "frog.nf " + textclass_opts + " --inputdir " + shellsafe(frog_inputdir,'"') + " --inputformat folia --extension folia.xml --outputdir " + shellsafe(outputdir,'"') + " -with-trace >&2" ) != 0:
-        fail()
+    if os.system(run_piccl + "frog.nf " + textclass_opts + " --inputdir " + shellsafe(frog_inputdir,'"') + " --inputformat folia --extension folia.xml --outputdir " + shellsafe(outputdir,'"') + " -with-trace >frog.nextflow.out.log 2>frog.nextflow.err.log" ) != 0:
+        fail('frog')
+    nextflowout('frog')
 elif 'tok' in clamdata and clamdata['tok']:
     clam.common.status.write(statusfile, "Running Tokeniser (ucto)",75) # status update
-    if os.system(run_piccl + "tokenize.nf " + textclass_opts + " -L " + shellsafe(lang,'"') + " --inputformat folia --inputdir " + shellsafe(frog_inputdir,'"') + " --extension folia.xml --outputdir " + shellsafe(outputdir,'"') + " -with-trace >&2" ) != 0:
-        fail()
+    if os.system(run_piccl + "tokenize.nf " + textclass_opts + " --language " + shellsafe(lang,'"') + " --inputformat folia --inputdir " + shellsafe(frog_inputdir,'"') + " --extension folia.xml --outputdir " + shellsafe(outputdir,'"') + " -with-trace >ucto.nextflow.out.log 2>ucto.nextflow.err.log" ) != 0:
+        fail('ucto')
+    nextflowout('ucto')

 #cleanup
 shutil.rmtree('work')
diff --git a/webservice/setup.py b/webservice/setup.py
index 3560af2..790ab89 100755
--- a/webservice/setup.py
+++ b/webservice/setup.py
@@ -11,7 +11,7 @@ setup(
     name = "PICCL",
     version = "0.5",
-    author = "Martin Reynaert",
+    author = "Martin Reynaert, Maarten van Gompel",
     author_email = "reynaert@uvt.nl",
     description = ("Webservice for PICCL"),
     license = "GPL",
@@ -32,7 +32,7 @@
         "Intended Audience :: Science/Research",
         "License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
     ],
-    package_data = {'picclservice':['picclservice/*.wsgi'] },
+    package_data = {'picclservice':['picclservice/*.wsgi','picclservice/*.yml'] },
    include_package_data=True,
    install_requires=['CLAM >= 2.3']
)
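The wrapper changes in this diff replace inline trace printing with a per-stage pattern: redirect each pipeline's stdout/stderr to `<prefix>.nextflow.out.log` / `<prefix>.nextflow.err.log`, then replay and delete those logs so they land in CLAM's `error.log` whether the stage succeeds or fails. A minimal standalone sketch of that pattern follows; the `run_stage` helper is hypothetical (the actual wrapper uses `os.system` with `shellsafe`-quoted arguments), only the log-file naming mirrors the diff:

```python
import os
import subprocess
import sys


def nextflowout(prefix):
    """Replay a stage's captured logs to stderr, then remove them."""
    for suffix, title in (('.nextflow.err.log', 'standard error output'),
                          ('.nextflow.out.log', 'standard output')):
        path = prefix + suffix
        if os.path.exists(path):
            print("[" + prefix + "] Nextflow " + title, file=sys.stderr)
            print("-" * 49, file=sys.stderr)
            with open(path, 'r', encoding='utf-8') as f:
                print(f.read(), file=sys.stderr)
            os.unlink(path)


def run_stage(prefix, cmd):
    """Run one pipeline stage, capturing stdout/stderr into per-stage log
    files, then replay them to stderr so they end up in the service's
    error log. Returns True if the stage exited successfully."""
    with open(prefix + '.nextflow.out.log', 'wb') as out, \
         open(prefix + '.nextflow.err.log', 'wb') as err:
        exitcode = subprocess.call(cmd, stdout=out, stderr=err)
    nextflowout(prefix)
    return exitcode == 0
```

On failure, the wrapper would additionally clean up and call `sys.exit(1)`, which is what `fail('ocr')` / `fail('ticcl')` do in the diff above.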