word2vec-ruscorpora-300
menshikh-iv released this 18 Dec 08:56
Word2vec Continuous Skipgram vectors trained on the full Russian National Corpus (about 250M words).
Related issue #3.
Attribute | Value |
---|---|
File size | 199 MB |
Number of vectors | 184973 |
Preprocessing | The training corpus was lemmatized and tagged with Universal PoS tags |
Window size | 10 |
Dimension | 300 |
License | https://creativecommons.org/licenses/by/4.0/deed.en |
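
Because the training corpus was lemmatized and PoS-tagged, vocabulary entries take the form `lemma_TAG` (for instance `кот_NOUN`), so a query has to be built the same way. The sketch below shows one way to assemble and split such keys. The helper names `make_key` and `split_key` are hypothetical and not part of gensim, and the code assumes the caller already supplies a lemma (it does no lemmatization or tagging itself).

```python
def make_key(lemma, pos):
    """Build a 'lemma_TAG' lookup key (hypothetical helper, not a gensim API)."""
    # Assumption: tags are stored in upper case, as in 'кот_NOUN'.
    return "{}_{}".format(lemma.strip(), pos.strip().upper())


def split_key(key):
    """Split a 'lemma_TAG' key into (lemma, tag) at the last underscore."""
    lemma, _, tag = key.rpartition("_")
    return lemma, tag


print(make_key("кот", "noun"))    # кот_NOUN
print(split_key("мяукать_VERB"))  # ('мяукать', 'VERB')
```

Once the model is loaded, a membership check such as `make_key("кот", "noun") in model` should indicate whether the key is in the vocabulary before it is passed to `most_similar`.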
Read more:
- https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models
- http://rusvectores.org/en/
Example
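import gensim.downloader as api
# Downloads the model on first use and caches it locally
model = api.load("word2vec-ruscorpora-300")
# most_similar returns (word, cosine similarity) pairs, most similar first
for word, similarity in model.most_similar(u"кот_NOUN"):
    print(u"{}: {:.3f}".format(word, similarity))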
"""
output:
кошка_NOUN: 0.757
котенок_NOUN: 0.668
пес_NOUN: 0.563
мяукать_VERB: 0.562
тобик_NOUN: 0.559
фоксик_NOUN: 0.557
собака_NOUN: 0.557
мяучать_VERB: 0.554
харлашка_NOUN: 0.552
котяра_NOUN: 0.551
"""