Giannis Daras edited this page Aug 12, 2018 · 4 revisions

Models

From the spaCy documentation:

In general, spaCy expects all model packages to follow the naming convention of [lang]_[name]. For spaCy's models, we also chose to divide the name into three components:

  • type: Model capabilities (e.g. core for general-purpose model with vocabulary, syntax, entities and word vectors, or depent for only vocab, syntax and entities)
  • genre: Type of text the model is trained on (e.g. web for web text, news for news text)
  • size: Model size indicator (sm, md or lg)

For example, en_core_web_sm is a small English model trained on written web text (blogs, news, comments), that includes vocabulary, vectors, syntax and entities.
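
The naming convention can be made concrete with a small sketch that splits a model name into its parts. The helper `parse_model_name` is hypothetical and not part of spaCy; it assumes the name has exactly four underscore-separated components (language, type, genre, size), as spaCy's own model names quoted above do.

```python
def parse_model_name(name):
    """Split a model name of the form [lang]_[type]_[genre]_[size].

    Hypothetical helper for illustration only; spaCy does not ship it.
    """
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(
            "expected [lang]_[type]_[genre]_[size], got {!r}".format(name)
        )
    lang, model_type, genre, size = parts
    return {"lang": lang, "type": model_type, "genre": genre, "size": size}


if __name__ == "__main__":
    for model in ("en_core_web_sm", "el_core_web_lg"):
        print(model, "->", parse_model_name(model))
```

Under this reading, `el_core_web_lg` is a large (`lg`) general-purpose (`core`) Greek (`el`) model trained on web text (`web`).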

The Greek models were trained on data from here for the POS/DEP tagger, and on data that we produced ourselves, which can be found here, for NER (check the Prodigy Wiki for more info).

Greek language models

Following the naming convention above, two models have been trained for the Greek language:

  1. el_core_web_sm: Vocabulary, syntax, entities.
  2. el_core_web_lg: Vocabulary, syntax, entities, word-vectors.

Model usage

  • Import model:

    import spacy
    nlp = spacy.load('el_core_web_sm')
    # For the model with vectors, run the following command:
    # nlp = spacy.load('el_core_web_lg')
  • Get doc object:

    # replace with your own text
    # (English: "Greece is among the most beautiful countries in the world")
    text = "Η Ελλάδα είναι από τις ομορφότερες χώρες του κόσμου"
    doc = nlp(text)
  • Tokenize and lemmatize your sentence:

    for token in doc:
        print("Token:{}, Lemma:{}".format(token, token.lemma_))

    Output:

    Token:Η, Lemma:η
    Token:Ελλάδα, Lemma:ελλάδα
    Token:είναι, Lemma:είναι
    Token:από, Lemma:από
    Token:τις, Lemma:τις
    Token:ομορφότερες, Lemma:ομορφός
    Token:χώρες, Lemma:χώρα
    Token:του, Lemma:του
    Token:κόσμου, Lemma:κόσμου

  • Get POS tags for each of the tokens

    for token in doc:
        print("Token:{}, Tag:{}".format(token, token.tag_))

    Output:

    Token:Η, Tag:DET
    Token:Ελλάδα, Tag:PROPN
    Token:είναι, Tag:AUX
    Token:από, Tag:ADP
    Token:τις, Tag:DET
    Token:ομορφότερες, Tag:ADJ
    Token:χώρες, Tag:NOUN
    Token:του, Tag:DET
    Token:κόσμου, Tag:NOUN

  • Visualize POS tags and Dependencies

    from spacy import displacy
    displacy.serve(doc)
    

  • Get Named Entities out of your sentence

    for ent in doc.ents:
        print("Entity:{}, Label:{}".format(ent.text, ent.label_))

    Output:

    Entity:Ελλάδα, Label:GPE

  • Visualize Named Entities

    from spacy import displacy
    displacy.serve(doc, style="ent")

  • Detect similarity between texts

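    # for this, we will need the model with the word vectors
    nlp = spacy.load('el_core_web_lg')
    # "The firefighters are searching desperately for survivors. The fires destroyed everything. Everything is charred."
    doc1 = nlp('Οι πυροσβέστες ψάχνουν αγωνιωδώς για επιζώντες. Οι φωτιές διέλυσαν τα πάντα. Τα πάντα είναι απανθρακωμένα.')
    # "The Fire Service continues to search for the missing. The fire destroyed the whole area. Everything turned to ash and coal."
    doc2 = nlp('Το Πυροσβεστικό Σώμα συνεχίζει να αναζητά τους αγνωούμενους. Η πυρκαγιά κατέλυσε όλη την περιοχή. Όλα έγιναν στάχτη και κάρβουνο.')
    # "Yesterday I bought a dog! And a cat! And a rabbit!"
    doc3 = nlp('Χθες αγόρασα ένα σκύλο! Και μια γάτα! Και ένα κουνέλι!')
    print(doc1.similarity(doc2))
    print(doc1.similarity(doc3))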

    Output:

    0.7155315553393391
    0.46625177182352695

    As expected, the first two texts, which are semantically close, have a high similarity score. By contrast, the first and third texts, which cover different topics, have a lower similarity score.
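
To give a rough idea of what a similarity score like the one above measures, here is a minimal sketch in plain Python. It assumes the score is the cosine similarity between the averaged word vectors of the two texts, which to our understanding is spaCy's default behaviour for `Doc.similarity`. The helper names and the 3-dimensional toy vectors are made up for illustration and do not come from the Greek models.

```python
import math


def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy "word vectors" (hypothetical values): two texts about fires, one about pets.
fire_text = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
rescue_text = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
pets_text = [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]

if __name__ == "__main__":
    print(cosine(average(fire_text), average(rescue_text)))
    print(cosine(average(fire_text), average(pets_text)))
```

With these toy vectors the two fire-related texts point in nearly the same direction and score higher than the fire/pets pair, mirroring the pattern in the output above.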

For many more submodules built on these models, check here.
