Spannish Opennlp models generation

Opennlp documentation

Instructions for training models with Opennlp

Data for Lemmatizer Training and Testing

The Universal Dependencies Treebank (https://universaldependencies.org/) and the CoNLL 2009 datasets distribute training data for many languages. Data repositories for training and testing models:

ANCORA: https://github.com/UniversalDependencies/UD_Spanish-AnCora/tree/master
GSD: https://github.com/UniversalDependencies/UD_Spanish-GSD/tree/master
PUD: https://github.com/UniversalDependencies/UD_Spanish-PUD/tree/master Command to train:
opennlp LemmatizerTrainerME.conllu -params PerceptronTrainerParams.txt -lang es -model es-lemmatizer.bin -data es_ancora-ud-train.conllu -encoding UTF-8 Command to evaluate:
opennlp LemmatizerEvaluator.conllu -model es-lemmatizer.bin -data es_ancora-ud-test.conllu -encoding utf-8

Data for Sentence Training and Testing

Data repositories for training and testing models:

CORPUS WIKIPEDIA: https://corpora.uni-leipzig.de/es?corpusId=spa_wikipedia_2021

Command to train:

opennlp SentenceDetectorTrainer -model es-sent.bin -lang es -data spa_wikipedia_2021_1M-sentences-train.txt -encoding UTF-8

Command to evaluate:

opennlp SentenceDetectorEvaluator -model en-sent.bin -data spa-wikipedia_2021_10K-sentences-test.txt -encoding UTF-8

Data for Tokenizer Training

Data repositories for training and testing models:

CORPUS WIKIPEDIA: https://corpora.uni-leipzig.de/es?corpusId=spa_wikipedia_202

Command to train:

opennlp TokenizerTrainer -model es-token.bin -lang es -data spa_wikipedia_2021_300K-sentences-train.txt -encoding UTF-8 -params .\PerceptronTrainerParams.txt

Data for Part Of Speech Training

Data repositories for training and testing models:

Command to train:

opennlp POSTaggerTrainer.conllu -lang es -model es-pos-maxent.bin -data es_ancora-ud-train.conllu params PerceptronTrainerParams.txt -encoding UTF-8

Command to evaluate:

opennlp POSTaggerEvaluator.conllu -model es-pos.bin -data es_ancora-ud-test.conllu -encoding UTF-8

Sentence generator

https://app.inferkit.com/demo

Acknowledgements

Taulé, M., M.A. Martí, M. Recasens (2008) 'Ancora: Multilevel Annotated Corpora for Catalan and Spanish', Proceedings of 6th International Conference on Language Resources and Evaluation. Marrakesh (Morocco).

In addition, the following paper must be cited if coreference information (attributes entity, coreftype, corefsubtype, homophoricDD or entityref) is used:

Recasens, Marta, M. Antònia Martí (2010) ‘AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan’. Language Resources and Evaluation, Springer Science.

Additionally, the following paper must be cited when argumental attributes in "sn" or "grup.nom" (attributes func, arg, tem or lexicalid) are used:

Peris, Aina, Mariona Taulé, Horacio Rodríguez (2010) ‘Semantic Annotation of Deverbal Nominalizations in the Spanish AnCora corpus’. Treebanks and Linguistic Theories (TLT-2010), Estonia.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
DATA		DATA
OPENNLP_FILES		OPENNLP_FILES
TRAINED MODELS		TRAINED MODELS
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Spannish Opennlp models generation

Opennlp documentation

Instructions for training models with Opennlp

Data for Lemmatizer Training and Testing

Data for Sentence Training and Testing

Data for Tokenizer Training

Data for Part Of Speech Training

Sentence generator

Acknowledgements

About

Releases

Packages

pantxisto/Spannish_Opennlp_Models

Folders and files

Latest commit

History

Repository files navigation

Spannish Opennlp models generation

Opennlp documentation

Instructions for training models with Opennlp

Data for Lemmatizer Training and Testing

Data for Sentence Training and Testing

Data for Tokenizer Training

Data for Part Of Speech Training

Sentence generator

Acknowledgements

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages