- Python 3, with pip
- Conda (or other environment management tool)
If you plan to use ValetRules but not to actively develop it, follow these steps to install it:
pip install .
python3 -m spacy download en_core_web_sm
pip install stanza
python3 -c 'import stanza; stanza.download("en")'
We recommend using a (development) environment management tool such as Conda. Note that the development installation uses the -e flag, as in pip install -e, which sets your pip installation to point to the local version of ValetRules.
conda create --name valet python=3.8
conda activate valet
Then install ValetRules with pip's -e option:
pip install -e .
python3 -m spacy download en_core_web_sm
pip install stanza
python3 -c 'import stanza; stanza.download("en")'
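After either installation path, it can be useful to confirm that the NLP dependencies are importable. The following is a minimal sketch, not part of ValetRules: the helper name check_modules is made up for illustration, and it only checks that the Python modules installed above can be found. It does not verify that the Stanza English models were downloaded.

```python
from importlib.util import find_spec

# Importable module names that the installation steps above are expected to provide.
MODULES = ("spacy", "en_core_web_sm", "stanza")


def check_modules(names=MODULES):
    """Return {module_name: bool}, True when the module can be located."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules().items():
        print(f"{name:16} {'ok' if found else 'MISSING'}")
```

A MISSING entry suggests rerunning the corresponding install or download command in the active environment.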
To run the unit and coverage tests, follow these instructions:
pip install -r requirements.txt
pip install -r requirements-test.txt
python setup.py test
coverage run --source="src" -m unittest discover -s tests/
coverage report
coverage html
Alternatively, you may also run just the unit tests with:
pip install -r requirements.txt
pip install -r requirements-test.txt
python -m unittest discover -s tests
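For orientation, here is a sketch of a test module that the discovery command above would pick up if saved under tests/ with a name matching test*.py. The file name, class name, and test contents are hypothetical and are not taken from the ValetRules test suite. The second test shows one way to skip an NLP-dependent test when spaCy or its English model is not installed.

```python
import unittest
from importlib.util import find_spec

# True only when both spaCy and the small English model are importable.
HAS_SPACY = (
    find_spec("spacy") is not None and find_spec("en_core_web_sm") is not None
)


class ExampleTest(unittest.TestCase):
    def test_whitespace_split(self):
        # A test with no NLP dependency always runs.
        self.assertEqual("a b".split(), ["a", "b"])

    @unittest.skipUnless(HAS_SPACY, "spaCy and en_core_web_sm are not installed")
    def test_spacy_single_token(self):
        # Imported lazily so the module still loads without spaCy.
        import spacy

        nlp = spacy.load("en_core_web_sm")
        self.assertEqual(len(nlp("cats")), 1)
```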
There is extensive documentation in Markdown format in the docs directory. The entry point of the documentation tree is Valet Rules.
The nlpcore package of ValetRules supports two NLP engine libraries, Stanza and spaCy. The scripts default to Stanza but accept a parameter to select the engine. As a practical matter, the ValetRules team has used spaCy more: the tests that involve NLP use rules that assume spaCy's style of providing NLP information, and the documentation tends to use spaCy-style examples.
Both tools provide dependency tree parsing, part-of-speech identification, lemma identification, and named entity recognition. Certain ValetRules capabilities rely on the presence of one of these tools and the information they provide, but these tools are not required if your patterns do not require that information.
Note that Stanza and spaCy behave somewhat differently, particularly in the dependency tree parses they generate. These differences can require your patterns to be written differently to conform to whichever tool you choose. In some cases, patterns can be written to work with either tool, but this requires more effort. The rule types that may be affected by NLP engine differences are token tests and parse expressions. Other rule types are generally not affected.
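To see how the two engines differ on a given sentence, you can print their annotations side by side. The sketch below calls spaCy and Stanza directly rather than going through ValetRules. The function names spacy_rows, stanza_rows, and format_rows are made up for illustration, and it assumes the models from the installation steps are present.

```python
from importlib.util import find_spec


def spacy_rows(text):
    """(token, POS, lemma, dependency label, head) per token, using spaCy."""
    import spacy

    nlp = spacy.load("en_core_web_sm")
    # In spaCy, the root token is its own head.
    return [(t.text, t.pos_, t.lemma_, t.dep_, t.head.text) for t in nlp(text)]


def stanza_rows(text):
    """(token, POS, lemma, dependency label, head) per word, using Stanza."""
    import stanza

    nlp = stanza.Pipeline("en", verbose=False)
    rows = []
    for sent in nlp(text).sentences:
        for w in sent.words:
            # word.head is a 1-based index within the sentence; 0 marks the root.
            head = sent.words[w.head - 1].text if w.head > 0 else "ROOT"
            rows.append((w.text, w.upos, w.lemma, w.deprel, head))
    return rows


def format_rows(rows):
    """Render rows as tab-separated lines."""
    return "\n".join("\t".join(str(col) for col in row) for row in rows)


if __name__ == "__main__":
    sentence = "The cat sat on the mat."
    for module, fn in (("spacy", spacy_rows), ("stanza", stanza_rows)):
        if find_spec(module) is None:
            print(f"{module}: not installed, skipping")
            continue
        try:
            print(f"--- {module} ---")
            print(format_rows(fn(sentence)))
        except Exception as exc:  # for example, a missing model
            print(f"{module}: could not run ({exc})")
```

Comparing the dependency label and head columns for the same sentence shows where a parse expression written for one engine may need adjusting for the other.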
For more information and details on the annotations and dependencies provided by the tools:
The GUI tool provides a way to develop rules and examine the annotations and dependencies in source text interactively. It can be helpful to work within the GUI until you are getting the desired results from a smaller set of source files before using the developed rules across a much broader set of source files with scripts such as the command line tool.
The nlpcore package of ValetRules provides text tokenization and sentence segmentation. ValetRules is primarily intended to be used with this type of tokenization and segmentation. If you want to modify how the tokenizer works, for example to preserve spaces or newlines, you need to create a new tokenizer class by subclassing one of the existing examples found in the NLP Core source tokenizer.py file.
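The general shape of such a subclass can be sketched as follows. This is an illustration of the subclassing idea only: SimpleTokenizer is a hypothetical stand-in defined here, not a class from tokenizer.py, and the actual base classes and method names in ValetRules may differ, so consult that file before adapting this.

```python
import re


class SimpleTokenizer:
    """Hypothetical stand-in for a base tokenizer; NOT the ValetRules class."""

    # Runs of word characters, or single punctuation characters.
    pattern = re.compile(r"\w+|[^\w\s]")

    def tokenize(self, text):
        return self.pattern.findall(text)


class NewlinePreservingTokenizer(SimpleTokenizer):
    """Subclass that additionally emits each newline as its own token."""

    pattern = re.compile(r"\w+|\n|[^\w\s]")


text = "Hello, world.\nBye"
print(SimpleTokenizer().tokenize(text))
print(NewlinePreservingTokenizer().tokenize(text))
```

Here the subclass changes only the token pattern, so the rest of the base behavior is inherited unchanged.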
The original author and lead developer of the Valet package is Dayne Freitag.