Warning: This software is currently compatible only with Python versions up to 3.7.
Extension for pie to include taggers with their models and pre/postprocessors.
Pie is a wonderful tool to train models, and most of the time it will be enough. What pie_extended proposes is to provide you with the necessary tools to share your models with customized pre- and post-processing.
The current system provides easier access to adding customized:
- normalization of your text,
- sentence tokenization,
- word tokenization,
- disambiguation,
- output formatting
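As a rough illustration of where these customizable steps sit around the tagger, here is a minimal pure-Python sketch. Every function name below is hypothetical and is not part of the pie_extended API; the library wires these stages through its own classes, and the disambiguation step is left out for brevity.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Hypothetical normalization: collapse whitespace and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def split_sentences(text: str) -> List[str]:
    """Hypothetical sentence tokenizer: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def split_words(sentence: str) -> List[str]:
    """Hypothetical word tokenizer: words and punctuation become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def format_output(tokens: List[str]) -> List[Dict[str, str]]:
    """Hypothetical output formatting: one dict per token.

    A real tagger would add lemma, POS and morphology at this point.
    """
    return [{"form": tok} for tok in tokens]


text = "Lorem  ipsum dolor sit amet. Consectetur adipiscing elit."
sentences = split_sentences(normalize(text))
annotated = [format_output(split_words(s)) for s in sentences]
print(annotated)
```

Keeping each stage as a separate function is what makes it possible to swap one of them (for example the word tokenizer) without touching the others.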
@software{thibault_clerice_2020_3883590,
author = {Clérice, Thibault},
title = {Pie Extended, an extension for Pie with pre-processing and post-processing},
month = jun,
year = 2020,
publisher = {Zenodo},
doi = {10.5281/zenodo.3883589},
url = {https://doi.org/10.5281/zenodo.3883589}
}
- Classical Latin (Model: lasla)
- Ancient Greek (Model: grc)
- Old French (Model: fro)
- Early Modern French (Model: freem)
- Classical French (Model: fr)
- Old Dutch (Model: dum)
If you trained models and want some help sharing them with Pie Extended, open an issue :)
To install, simply do pip install pie-extended. Then, look at all available models.
In case of doubt, add the PyTorch CPU wheel index as an extra package source: pip install pie-extended --extra-index-url https://download.pytorch.org/whl/cpu
On top of that, it provides a quick and easy way to use other people's models. For example, in a shell:
pie-extended download lasla
pie-extended install-addons lasla
pie-extended tag lasla your_file.txt
This will give you access to all you need!
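If you would rather drive these commands from Python, for instance to tag several files in a row, a minimal sketch could look like the following. It only reuses the three subcommands shown above; the helper names and the second file name are illustrative assumptions, and nothing is executed unless run=True is passed and pie-extended is installed.

```python
import subprocess
from typing import List


def build_commands(model: str, files: List[str]) -> List[List[str]]:
    """Build the CLI calls shown above: download, install add-ons, then one tag call per file."""
    commands = [
        ["pie-extended", "download", model],
        ["pie-extended", "install-addons", model],
    ]
    commands.extend(["pie-extended", "tag", model, path] for path in files)
    return commands


def run_commands(commands: List[List[str]], run: bool = False) -> None:
    """Print each command; execute it only when run is True."""
    for cmd in commands:
        print(" ".join(cmd))
        if run:
            # check=True raises CalledProcessError if the command fails
            subprocess.run(cmd, check=True)


commands = build_commands("lasla", ["your_file.txt", "another_file.txt"])
run_commands(commands)  # dry run: prints the commands without executing them
```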
You can run the lemmatizer in your own scripts and retrieve token annotations as dictionaries:
from typing import List
from pie_extended.cli.utils import get_tagger, get_model, download

# In case you need to download
do_download = False
if do_download:
    # Loop through download(); the yielded values are not used here
    for dl in download("lasla"):
        pass

# model_path allows you to override the model loaded by another .tar
model_name = "lasla"
tagger = get_tagger(model_name, batch_size=256, device="cpu", model_path=None)

sentences: List[str] = ["Lorem ipsum dolor sit amet, consectetur adipiscing elit. "]
# Get the main objects from the model: data iterator + postprocessor
from pie_extended.models.lasla.imports import get_iterator_and_processor
for sentence_group in sentences:
    iterator, processor = get_iterator_and_processor()
    print(tagger.tag_str(sentence_group, iterator=iterator, processor=processor))
will result in
[{'form': 'lorem', 'lemma': 'lor', 'POS': 'NOMcom', 'morph': 'Case=Acc|Numb=Sing', 'treated': 'lorem'},
{'form': 'ipsum', 'lemma': 'ipse', 'POS': 'PROdem', 'morph': 'Case=Acc|Numb=Sing', 'treated': 'ipsum'},
{'form': 'dolor', 'lemma': 'dolor', 'POS': 'NOMcom', 'morph': 'Case=Nom|Numb=Sing', 'treated': 'dolor'},
{'form': 'sit', 'lemma': 'sum1', 'POS': 'VER', 'morph': 'Numb=Sing|Mood=Sub|Tense=Pres|Voice=Act|Person=3',
'treated': 'sit'},
{'form': 'amet', 'lemma': 'amo', 'POS': 'VER', 'morph': 'Numb=Sing|Mood=Sub|Tense=Pres|Voice=Act|Person=3',
'treated': 'amet'}, {'form': ',', 'lemma': ',', 'pos': 'PUNC', 'morph': 'MORPH=empty', 'treated': ','},
{'form': 'consectetur', 'lemma': 'consector2', 'POS': 'VER',
'morph': 'Numb=Sing|Mood=Sub|Tense=Pres|Voice=Dep|Person=3', 'treated': 'consectetur'},
{'form': 'adipiscing', 'lemma': 'adipiscor', 'POS': 'VER', 'morph': 'Tense=Pres|Voice=Dep', 'treated': 'adipiscing'},
{'form': 'elit', 'lemma': 'elio', 'POS': 'VER', 'morph': 'Numb=Sing|Mood=Ind|Tense=Pres|Voice=Act|Person=3',
'treated': 'elit'}, {'form': '.', 'lemma': '.', 'pos': 'PUNC', 'morph': 'MORPH=empty', 'treated': '.'}]
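Because the result is a plain list of dictionaries, it can be post-processed with ordinary Python. The sketch below works on a hand-copied excerpt of the output above, so it runs without any model; the helper name to_tsv is illustrative and not part of pie_extended. Since the sample output uses 'POS' for words but 'pos' for punctuation, the helper looks up both spellings.

```python
from typing import Dict, List, Sequence

# Excerpt copied from the sample output above
annotations: List[Dict[str, str]] = [
    {"form": "dolor", "lemma": "dolor", "POS": "NOMcom",
     "morph": "Case=Nom|Numb=Sing", "treated": "dolor"},
    {"form": "sit", "lemma": "sum1", "POS": "VER",
     "morph": "Numb=Sing|Mood=Sub|Tense=Pres|Voice=Act|Person=3", "treated": "sit"},
    {"form": ",", "lemma": ",", "pos": "PUNC", "morph": "MORPH=empty", "treated": ","},
]


def to_tsv(tokens: List[Dict[str, str]],
           columns: Sequence[str] = ("form", "lemma", "POS", "morph")) -> str:
    """Render token dictionaries as tab-separated lines with a header row."""
    lines = ["\t".join(columns)]
    for tok in tokens:
        # Fall back to the lowercase key, then to "_" when a column is missing
        lines.append("\t".join(tok.get(col, tok.get(col.lower(), "_")) for col in columns))
    return "\n".join(lines)


print(to_tsv(annotations))

# Pulling out a single column is a one-liner
lemmas = [tok["lemma"] for tok in annotations]
```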
- Create a package in ./pie_extended/models/. Example: foo.
- Add the name of the package in ./pie_extended/models/__init__.py in the variable modules.
- In the module pie_extended.models.foo, we should find the following variables:
  - Models: a string with filenames and tasks for Pie.
  - DESC: a Metadata object that bears information about the model.
  - DOWNLOADS: a list of files to download.
from pie_extended.utils import Metadata, File, get_path

DESC = Metadata(
    "Foo",
    "language",
    ["Author 1", "Author 2"],
    "A readable description",
    "A link to more information"
)

DOWNLOADS = [
    File("/a/link/to/a/file", "local_name_of_the_file.tar")
]

# {0} is reused so that the single path fills both <...> groups
Models = "<{0},task1,task2><{0},lemma,pos>".format(
    get_path("foo", "local_name_of_the_file.tar")
)
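To make the Models format more concrete, the sketch below renders such a string with a stand-in for get_path. The function fake_get_path, its directory layout, and the file names lemma.tar and pos.tar are assumptions made only so that the example runs without pie_extended installed; the real get_path decides where downloaded files actually live. Each <...> group appears to pair one model file with the comma-separated tasks it handles, and str.format needs one argument per {} placeholder.

```python
def fake_get_path(module: str, filename: str) -> str:
    """Hypothetical stand-in for pie_extended.utils.get_path."""
    return "/".join(["downloads", module, filename])


# Hypothetical file names: one archive per group of tasks
lemma_file = fake_get_path("foo", "lemma.tar")
pos_file = fake_get_path("foo", "pos.tar")

# Two placeholders, two arguments
models = "<{},lemma><{},pos>".format(lemma_file, pos_file)
print(models)
```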
- In the module pie_extended.models.foo.imports, we should find the following content:
  - get_iterator_and_processor: a function that returns a DataIterator and a Processor
  - (optionally) addons: a function that installs add-ons
  - (optionally) Disambiguator: a disambiguator instance (or an object creator that returns one)
Check for a simple example in pie_extended.models.fro.imports and a more complex one in pie_extended.models.lasla.imports.
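The contract of get_iterator_and_processor can be sketched with placeholder classes. StandInIterator and StandInProcessor below are hypothetical and deliberately do not mirror the real DataIterator and Processor interfaces; the only point is that the function builds fresh objects and returns them as an (iterator, processor) pair, in the order the tagging script unpacks them.

```python
from typing import Dict, Iterator, List, Tuple


class StandInIterator:
    """Hypothetical placeholder for a DataIterator: turns raw text into token lists."""

    def __call__(self, text: str) -> Iterator[List[str]]:
        for sentence in text.split("."):
            tokens = sentence.split()
            if tokens:
                yield tokens


class StandInProcessor:
    """Hypothetical placeholder for a Processor: shapes one annotated token."""

    def __call__(self, form: str, lemma: str) -> Dict[str, str]:
        return {"form": form, "lemma": lemma}


def get_iterator_and_processor() -> Tuple[StandInIterator, StandInProcessor]:
    """Build new objects on each call and return them as (iterator, processor)."""
    return StandInIterator(), StandInProcessor()


iterator, processor = get_iterator_and_processor()
tokens = list(iterator("Lorem ipsum. Dolor sit"))
print(tokens)
print(processor("sit", "sum1"))
```

For the actual base classes and their constructor arguments, the fro and lasla modules mentioned above remain the reference.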
Clone the repository, create an environment, and then run:
python setup.py develop
This is an extremely early build, subject to change here and there, but it is functional!