Models

In general, spaCy expects all model packages to follow the naming convention of [lang]_[name]. For spaCy's models, we also chose to divide the name into three components:

type: Model capabilities (e.g. core for general-purpose model with vocabulary, syntax, entities and word vectors, or depent for only vocab, syntax and entities)

genre: Type of text the model is trained on (e.g. web for web text, news for news text)

size: Model size indicator (sm, md or lg)

For example, en_core_web_sm is a small English model trained on written web text (blogs, news, comments), that includes vocabulary, vectors, syntax and entities.
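The convention above can be illustrated with a small sketch that splits a model name into its parts. The `ModelName` tuple and the `parse_model_name` helper are hypothetical names introduced only for this illustration; they are not part of spaCy.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    """The four parts of a name that follows [lang]_[type]_[genre]_[size]."""
    lang: str
    model_type: str
    genre: str
    size: str


def parse_model_name(name: str) -> ModelName:
    """Split a name such as 'en_core_web_sm' into its four components."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected [lang]_[type]_[genre]_[size], got {name!r}")
    return ModelName(*parts)


# Hypothetical helper, shown on the example name from the text above.
print(parse_model_name("en_core_web_sm"))
```

The sketch assumes that none of the four components contains an underscore itself, which holds for the names mentioned on this page.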

The Greek models were trained on data from here for the POS/DEP tagger, and on data that we produced ourselves, which can be found here, for NER (check the Prodigy Wiki for more info).

Greek language models

Following the naming convention described above, two models have been trained for the Greek language:

el_core_web_sm: Vocabulary, syntax, entities.
el_core_web_lg: Vocabulary, syntax, entities, word-vectors.
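Before loading either model, it can help to check whether it is available in the current environment. The following is a minimal sketch that assumes the models were installed as regular Python packages (for example with pip); `is_installed` is a hypothetical helper, and it would not detect a model that was made available in some other way.

```python
import importlib.util

# The two Greek model names listed above.
GREEK_MODELS = ("el_core_web_sm", "el_core_web_lg")


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


for name in GREEK_MODELS:
    status = "installed" if is_installed(name) else "not installed"
    print(f"{name}: {status}")
```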

Models usage

Load a model:

```
import spacy

# Load the small Greek model (vocabulary, syntax, entities)
nlp = spacy.load('el_core_web_sm')
# For the model with word vectors, load the large model instead:
# nlp = spacy.load('el_core_web_lg')
```

Get doc object:

```
# Replace with your own text.
# The example means: "Greece is one of the most beautiful countries in the world"
text = "Η Ελλάδα είναι από τις ομορφότερες χώρες του κόσμου"
doc = nlp(text)
```

Tokenize and lemmatize your sentence:
```
for token in doc:
	print("Token:{}, Lemma:{}".format(token, token.lemma_))
```
Output:

```
Token:Η, Lemma:η
Token:Ελλάδα, Lemma:ελλάδα
Token:είναι, Lemma:είναι
Token:από, Lemma:από
Token:τις, Lemma:τις
Token:ομορφότερες, Lemma:ομορφός
Token:χώρες, Lemma:χώρα
Token:του, Lemma:του
Token:κόσμου, Lemma:κόσμου
```

Get POS tags for each of the tokens:
```
for token in doc:
	print("Token:{}, Tag:{}".format(token, token.tag_))
```
Output:

```
Token:Η, Tag:DET
Token:Ελλάδα, Tag:PROPN
Token:είναι, Tag:AUX
Token:από, Tag:ADP
Token:τις, Tag:DET
Token:ομορφότερες, Tag:ADJ
Token:χώρες, Tag:NOUN
Token:του, Tag:DET
Token:κόσμου, Tag:NOUN
```
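Once the tags are available, a common next step is to tally them. The sketch below hard-codes the (token, tag) pairs from the output above so that it runs without a model; with a real `doc` the pairs would presumably come from `(token.text, token.tag_)` instead. The `tagged` and `tag_counts` names are introduced only for this illustration.

```python
from collections import Counter

# (token, tag) pairs copied from the tagger output shown above.
tagged = [
    ("Η", "DET"), ("Ελλάδα", "PROPN"), ("είναι", "AUX"),
    ("από", "ADP"), ("τις", "DET"), ("ομορφότερες", "ADJ"),
    ("χώρες", "NOUN"), ("του", "DET"), ("κόσμου", "NOUN"),
]

# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

for tag, count in tag_counts.most_common():
    print(f"{tag}: {count}")
```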

Visualize POS tags and Dependencies

```
from spacy import displacy
displacy.serve(doc)
```

Get Named Entities out of your sentence

```
for ent in doc.ents:
	print("Entity:{}, Label:{}".format(ent.text, ent.label_))
```

Output:

Entity:Ελλάδα, Label:GPE

Visualize Named Entities

```
from spacy import displacy
displacy.serve(doc, style="ent")
```

Detect similarity between texts

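```
# For this, we need the model with the word vectors
nlp = spacy.load('el_core_web_lg')
# "The firefighters are searching desperately for survivors. The fires destroyed everything. Everything is charred."
doc1 = nlp('Οι πυροσβέστες ψάχνουν αγωνιωδώς για επιζώντες. Οι φωτιές διέλυσαν τα πάντα. Τα πάντα είναι απανθρακωμένα.')
# "The Fire Service continues to search for the missing. The fire devastated the whole area. Everything turned to ash and charcoal."
doc2 = nlp('Το Πυροσβεστικό Σώμα συνεχίζει να αναζητά τους αγνωούμενους. Η πυρκαγιά κατέλυσε όλη την περιοχή. Όλα έγιναν στάχτη και κάρβουνο.')
# "Yesterday I bought a dog! And a cat! And a rabbit!"
doc3 = nlp('Χθες αγόρασα ένα σκύλο! Και μια γάτα! Και ένα κουνέλι!')
print(doc1.similarity(doc2))
print(doc1.similarity(doc3))
```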

Output:

```
0.7155315553393391
0.46625177182352695
```

As expected, the first two texts, which are semantically close, have a high similarity score. In contrast, the first and the last texts, which discuss different topics, have a lower similarity score.
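To give some intuition for these scores: to our understanding, spaCy's default document similarity is the cosine similarity between the averaged word vectors of the two texts. The following is a minimal sketch of that idea in plain Python. The three-dimensional vectors are invented toy values, not real model vectors, and `average` and `cosine` are hypothetical helpers rather than spaCy functions.

```python
import math


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy word vectors (assumed values): two "fire-related" texts and one about pets.
fire_text = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
rescue_text = [[0.7, 0.3, 0.0], [0.9, 0.0, 0.2]]
pets_text = [[0.0, 0.2, 0.9], [0.1, 0.1, 0.8]]

close = cosine(average(fire_text), average(rescue_text))
distant = cosine(average(fire_text), average(pets_text))
print(close > distant)
```

Because the score depends only on the averaged vectors, word order is ignored, which is one reason it works best as a rough indicator of topical closeness.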

For more submodules that build on these models, check here.
