Stanza v1.4.1 #1121

AngledLuffa (Maintainer) · 2022-09-14T16:41:37Z

Stanza v1.4.1: Improvements to pos, conparse, and sentiment, jupyter visualization, and wider language coverage

Overview

We improve the quality of the POS, constituency, and sentiment models, add an integration with displaCy, and add new models for a variety of languages.

New NER models

New Polish NER model based on NKJP from Karol Saputa and ryszardtuora
NER for Polish #1070
NER Polish #1110
Make GermEval2014 the default German NER model, including an optional Bert version
[QUESTION] Dependencies between UD/tokenizer model and NER model? #1018
De ner #1022
Japanese conversion of GSD by Megagon
Ja ner #1038
Marathi NER dataset from L3Cube. Includes a Sentiment model as well
Marathi #1043
Thai conversion of LST20
555fc03
Kazakh conversion of KazNERD
de6cd25

Other new models

Sentiment conversion of Tass2020 for Spanish
Spanish sent #1104
VIT constituency dataset for Italian
149f144
... and many subsequent updates
Combined UD models for Hebrew
More combined models? #1109
e4fcf00
For UD treebanks with a small train set and a larger test set, swap the two splits
UD_Buryat-BDT UD_Kazakh-KTB UD_Kurmanji-MG UD_Ligurian-GLT UD_Upper_Sorbian-UFAL
9618d60
Spanish conparse model from multiple sources - AnCora, LDC-NW, LDC-DF
47740c6
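
The train/test flip for the small UD treebanks can be pictured with a short sketch. The function name, the dictionary layout, and the sentence counts are hypothetical and are not taken from Stanza's data preparation scripts. The only idea carried over from the notes is that when the official train split is smaller than the test split, the two are swapped.

```python
def flip_if_train_is_smaller(splits):
    """Return a copy of `splits` with train and test swapped when train is smaller.

    `splits` maps a split name ("train", "test", ...) to a list of sentences.
    Hypothetical helper for illustration; not part of Stanza.
    """
    flipped = dict(splits)
    if len(splits["train"]) < len(splits["test"]):
        flipped["train"], flipped["test"] = splits["test"], splits["train"]
    return flipped


# Made-up sizes, for illustration only
tiny_treebank = {"train": ["s1", "s2"], "test": ["s3", "s4", "s5", "s6", "s7"]}
result = flip_if_train_is_smaller(tiny_treebank)
print(len(result["train"]), len(result["test"]))  # 5 2
```

A treebank whose train split is already the larger one passes through unchanged.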

Model improvements

Pretrained charlm integrated into POS. Gives a small to decent gain for most languages without much additional cost
Load the pretrained charlm, adds it as inputs to the POS model #1086
Pretrained charlm integrated into Sentiment. Improves English, others not so much
Sentiment charlm #1025
LSTM and 2D max pooling as optional components of the Sentiment model
from the paper Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling
Sentiment lstm #1098
Conparse training now starts with AdaDelta, then switches to another optimizer. Very helpful
b1d10d3
Grad clipping in conparse training
365066a
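
For readers unfamiliar with gradient clipping, here is a minimal sketch in plain Python. It assumes clipping by the global L2 norm, which is one common variant. The notes do not say which variant or threshold the conparse trainer uses, so the function name and the `max_norm` value are illustrative assumptions.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale a flat list of gradient values so its L2 norm is at most `max_norm`.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    Hypothetical helper for illustration; not Stanza's training code.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)  # 5.0
print(clipped)      # roughly [0.6, 0.8], up to floating point rounding
```

Gradients whose norm is already within the limit are returned unchanged, so clipping only affects unusually large updates.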

Pipeline interface improvements

GPU memory savings: charlm reused between different processors in the same pipeline
Charlm cache #1028
Word vectors not saved in the NER models. Saves bandwidth & disk space
Ner wv #1033
Functions to return tagsets for NER and conparse models
[QUESTION] Getting a tagset for a stanza model #1066
Add getting all possible values for each feat #1073
36b84db
2db43c8
displaCy integration with NER and dependency trees
2071413
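
The charlm reuse can be thought of as a cache keyed by model file: the first processor that asks for a charlm loads it, and later processors get the same object instead of a second copy. The sketch below is a simplified illustration under that assumption. `ModelCache`, `fake_load`, and the file name are hypothetical and do not reflect Stanza's internal classes.

```python
class ModelCache:
    """Hypothetical cache: loads each model file once and hands out the same object."""

    def __init__(self, loader):
        self._loader = loader
        self._models = {}

    def get(self, path):
        # Only call the (expensive) loader the first time a path is requested
        if path not in self._models:
            self._models[path] = self._loader(path)
        return self._models[path]


load_calls = []


def fake_load(path):
    # Stand-in for an expensive load of a model onto the GPU
    load_calls.append(path)
    return {"path": path}


cache = ModelCache(fake_load)
ner_charlm = cache.get("forward_charlm.pt")
sentiment_charlm = cache.get("forward_charlm.pt")
print(ner_charlm is sentiment_charlm)  # True
print(len(load_calls))                 # 1
```

Because both processors hold a reference to one object, the memory cost is paid once per distinct charlm rather than once per processor.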

Bugfixes

Fix extremely slow tokenization of a single very long token (catastrophic backtracking in a regex)
TY to Sk Adnan Hassan (VT) and Zainab Aamir (Stony Brook)
fix that it takes forever to tokenize a really long non-numeric token #1056
Starting a new CoreNLP client without starting a server no longer waits for the server to be available
TY to Mariano Crosetti
ensure_alive must not affect CoreNLPClient when init with StartServer.DONT_START #1059
Makes CoreNLPClient not checks ensure_alive when start_server=StartSe… #1061
Read raw glove word vectors (they have no header information)
Read in a pretrain even if it doesn't have a row/col header. This ap… #1074
Ensure that illegal languages are not chosen by the LangID model
Langid model gives languages not in langid_lang_subset on difficult strings #1076
Mask illegal langauges by setting them to -ninf. 0 means that illega… #1077
Fix cache in Multilingual pipeline
MultilingualPipeline can not remove languages from the cache #1115
cdf18d8
Fix loading of previously unseen languages in Multilingual pipeline
Pipeline is incorrect with specific lang in MultilingualPipeline if lang_config is set #1101
e551ebe
Fix conparse training occasionally diverging to NaN early in training
c4d7857
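
The LangID fix above masks disallowed languages with negative infinity. A small sketch shows why the mask value matters. It assumes the model produces scores that can be negative (log-probabilities, for example); whether that matches the exact failure reported in #1076 is not stated in these notes. The function names, language codes, and scores are made up for illustration.

```python
import math


def pick_language(scores, allowed):
    """Choose the best-scoring language, masking disallowed ones with -inf."""
    masked = {lang: (s if lang in allowed else -math.inf) for lang, s in scores.items()}
    return max(masked, key=masked.get)


def pick_language_zero_mask(scores, allowed):
    """Same idea, but masking with 0.0, which can outrank negative scores."""
    masked = {lang: (s if lang in allowed else 0.0) for lang, s in scores.items()}
    return max(masked, key=masked.get)


scores = {"en": -2.5, "fr": -1.0, "zz": -3.0}  # made-up scores
allowed = {"en", "fr"}

print(pick_language_zero_mask(scores, allowed))  # zz
print(pick_language(scores, allowed))            # fr
```

With a mask of 0.0, the disallowed `zz` wins because every allowed score is below zero. With negative infinity, no disallowed language can win as long as at least one allowed language is present.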

Improved training tools

W&B integration for all models: can be activated with the --wandb flag in the training scripts
Very simple wandb integration for NER. Other models to follow #1040
New webpages for building charlm, NER, and Sentiment
https://stanfordnlp.github.io/stanza/new_language_charlm.html
https://stanfordnlp.github.io/stanza/new_language_ner.html
https://stanfordnlp.github.io/stanza/new_language_sentiment.html
Script to download Oscar 2019 data for charlm from HF (requires datasets module)
Dump oscar #1014
Unify sentiment training into a Python script, replacing the old shell script
Sentiment #1021
Sentiment #1023
Convert sentiment to use .json inputs. In particular, this helps with languages with spaces in words such as Vietnamese
Sentiment #1024
Slightly faster charlm training
Charlm refactor #1026
Data conversion of WikiNER generalized for retraining and for adding new WikiNER models
Try to generalize wikiner reading - currently the download format is a #1039
XPOS factory now determined at start of POS training. Makes addition of new languages easier
Xpos #1082
Checkpointing and continued training for charlm, conparse, sentiment
Add a trainer for the charlm - useful for saving and loading everythi… #1090
0e6de80
e5793c9
Option to write the results of a NER model to a file
Ner results #1108
Add fake dependencies to a conllu formatted dataset for better integration with evaluation tools
6544ef3
Convert an AMT NER result to Stanza .json
cfa7e49
Add many language codes, including 3-letter codes for languages we generally treat as 2-letter
5a5e918
b32a98e and others
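
The switch to .json sentiment inputs matters for languages such as Vietnamese because a single word can contain a space. The sketch below shows the underlying problem with whitespace-delimited text and how a JSON list of tokens avoids it. The record layout (the `sentiment` and `text` keys) is an assumption for illustration and may not match Stanza's actual file format.

```python
import json

# Four Vietnamese words, two of which contain a space
tokens = ["sinh viên", "học", "rất", "chăm chỉ"]

# Space-joined text cannot be split back into the original words
flat = " ".join(tokens)
print(flat.split() == tokens)  # False

# A JSON list keeps each word intact, spaces included
record = {"sentiment": "1", "text": tokens}  # hypothetical field names
restored = json.loads(json.dumps(record, ensure_ascii=False))
print(restored["text"] == tokens)  # True
```

Splitting the flat string on whitespace yields six pieces instead of four, which is why a format with explicit token boundaries is safer here.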

This discussion was created from the release Stanza v1.4.1.
