Spanish OpenNLP models generation

OpenNLP documentation

Instructions for training models with OpenNLP

Data for Lemmatizer Training and Testing

The Universal Dependencies treebanks (https://universaldependencies.org/) and the CoNLL 2009 datasets provide annotated training data for many languages. Data repositories for training and testing models:

Data for Sentence Training and Testing

Data repositories for training and testing models:

Command to train:

  • opennlp SentenceDetectorTrainer -model es-sent.bin -lang es -data spa_wikipedia_2021_1M-sentences-train.txt -encoding UTF-8

Command to evaluate:

  • opennlp SentenceDetectorEvaluator -model es-sent.bin -data spa-wikipedia_2021_10K-sentences-test.txt -encoding UTF-8
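How the `-train` and `-test` files used above were prepared is not stated here. The sketch below shows one possible preparation step. It assumes the downloaded corpus has one sentence per line, prefixed by a numeric id and a tab, and that the sentence detector trainer should see only the sentence text. The function name `split_corpus` and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): strip the leading
# "id<TAB>" column and split a sentence file into test and training parts.
# Usage: split_corpus INPUT TRAIN_OUT TEST_OUT N_TEST
split_corpus() {
    input=$1
    train_out=$2
    test_out=$3
    n_test=$4
    # Keep only the sentence text (everything after the first tab).
    # The first N_TEST sentences become the test set ...
    cut -f2- "$input" | head -n "$n_test" > "$test_out"
    # ... and the remaining sentences become the training set.
    cut -f2- "$input" | tail -n +"$((n_test + 1))" > "$train_out"
}
```

A call such as `split_corpus corpus.txt corpus-train.txt corpus-test.txt 10000` (file names illustrative) would hold out the first 10,000 sentences. Taking the first lines is only reasonable if the input is already in random order. Otherwise, shuffle it first.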

Data for Tokenizer Training

Data repositories for training and testing models:

Command to train:

  • opennlp TokenizerTrainer -model es-token.bin -lang es -data spa_wikipedia_2021_300K-sentences-train.txt -encoding UTF-8 -params .\PerceptronTrainerParams.txt
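The contents of `PerceptronTrainerParams.txt` are not shown in this README. As a rough sketch, an OpenNLP training-parameters file is a plain properties file. The values below are illustrative assumptions, not the settings used for these models.

```shell
# Write a hypothetical training-parameters file.
# The values are illustrative and are not taken from this repository.
cat > PerceptronTrainerParams.txt <<'EOF'
Algorithm=PERCEPTRON
Iterations=100
Cutoff=5
EOF
```

`Algorithm` selects the trainer, `Iterations` the number of training passes, and `Cutoff` the minimum number of times a feature must occur to be kept.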

Data for Part Of Speech Training

Data repositories for training and testing models:

Command to train:

  • opennlp POSTaggerTrainer.conllu -lang es -model es-pos-maxent.bin -data es_ancora-ud-train.conllu -params PerceptronTrainerParams.txt -encoding UTF-8

Command to evaluate:

  • opennlp POSTaggerEvaluator.conllu -model es-pos-maxent.bin -data es_ancora-ud-test.conllu -encoding UTF-8
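Before training or evaluating, it can help to check which tags a CoNLL-U file actually contains. The sketch below counts the values in the UPOS column (the fourth tab-separated field in the CoNLL-U format). The function name `upos_counts` is hypothetical and is not an OpenNLP command.

```shell
#!/bin/sh
# Hypothetical helper: print "count<TAB>UPOS" for every tag in a CoNLL-U file,
# most frequent first.
upos_counts() {
    awk -F '\t' '
        /^#/ || NF < 10 { next }     # skip comments and blank or short lines
        $1 !~ /^[0-9]+$/ { next }    # skip multiword ranges (1-2) and empty nodes (1.1)
        { count[$4]++ }              # field 4 is the UPOS tag
        END { for (tag in count) printf "%d\t%s\n", count[tag], tag }
    ' "$1" | sort -k1,1nr -k2,2
}
```

For example, `upos_counts es_ancora-ud-train.conllu` would list the tag distribution of the training file used above.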

Sentence generator

https://app.inferkit.com/demo

Acknowledgements

  • Taulé, M., M.A. Martí, M. Recasens (2008) ‘AnCora: Multilevel Annotated Corpora for Catalan and Spanish’. Proceedings of the 6th International Conference on Language Resources and Evaluation. Marrakesh (Morocco).

In addition, the following paper must be cited if coreference information (attributes entity, coreftype, corefsubtype, homophoricDD or entityref) is used:

  • Recasens, Marta, M. Antònia Martí (2010) ‘AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan’. Language Resources and Evaluation, Springer Science.

Additionally, the following paper must be cited when argumental attributes in "sn" or "grup.nom" (attributes func, arg, tem or lexicalid) are used:

  • Peris, Aina, Mariona Taulé, Horacio Rodríguez (2010) ‘Semantic Annotation of Deverbal Nominalizations in the Spanish AnCora corpus’. Treebanks and Linguistic Theories (TLT-2010), Estonia.