Example of how to learn vector representations of words in Python using Gensim,
trained on English Wikipedia articles.
- Python 3.5 + pip
- Gensim 0.12.4
Run the following command (estimated 10 hours):
./setup.sh
The shell script setup.sh will do the following:
- Install the required Python libraries using pip
- Download the compressed English Wikipedia articles dump and save it as data/enwiki-latest-pages-articles.xml.bz2
- Train the word2vec model using the train.py script
Of course, each of these steps can also be run manually.
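For orientation, the training step could look roughly like the sketch below. It is not the contents of train.py: the names WikiSentences and train, and all hyperparameter values, are assumptions made for illustration. The keyword arguments follow the older Gensim API pinned above (for example size, which later Gensim releases renamed to vector_size), so they may need adjusting for other versions.

```python
class WikiSentences:
    """Restartable iterable over tokenized articles (hypothetical helper).

    Word2Vec reads its input more than once (a vocabulary pass, then
    training), so a one-shot generator is not sufficient.
    """

    def __init__(self, corpus):
        self.corpus = corpus  # any object exposing get_texts()

    def __iter__(self):
        for tokens in self.corpus.get_texts():
            # Older Gensim releases yield UTF-8 bytes; decode them if so.
            yield [t.decode("utf-8") if isinstance(t, bytes) else t
                   for t in tokens]


def train(dump_path, model_path):
    # Imported here so the helper above works without Gensim installed.
    from gensim.corpora import WikiCorpus
    from gensim.models import Word2Vec

    # dictionary={} skips the slow dictionary scan WikiCorpus runs by default.
    corpus = WikiCorpus(dump_path, lemmatize=False, dictionary={})
    # Hyperparameters are illustrative assumptions, not the repository's values.
    model = Word2Vec(WikiSentences(corpus), size=200, window=5,
                     min_count=5, workers=4)
    model.save(model_path)
    return model
```

Calling train("data/enwiki-latest-pages-articles.xml.bz2", "model.bin") would then run the whole pass; the output filename here is made up.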
Running the test.py script shows a few example results from the trained word representations.
King - man + woman:
"queen" - similarity: 0.678644
"princess" - similarity: 0.587378
"monarch" - similarity: 0.528285
"prince" - similarity: 0.520583
"throne" - similarity: 0.488901
"empress" - similarity: 0.482006
"emperor" - similarity: 0.461451
"regnant" - similarity: 0.45579
"isabeau" - similarity: 0.455715
"berengaria" - similarity: 0.455293
Similarity between man and woman:
0.707675308594
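To make the numbers above easier to interpret, here is a small pure-Python sketch of the two operations behind them: cosine similarity between two word vectors, and an analogy query that adds and subtracts unit-normalized vectors and ranks the remaining words by cosine similarity. The three-dimensional vectors are made up for illustration and are not taken from the trained model; the function names are also hypothetical. Gensim's own implementation follows a broadly similar recipe but differs in details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def analogy(vectors, positive, negative, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    # The query words themselves are excluded from the ranking.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items()
              if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors (invented): axes loosely mean royalty, gender, person.
toy = {
    "king":   [1.0,  0.5, 1.0],
    "queen":  [1.0, -0.5, 1.0],
    "man":    [0.0,  0.5, 1.0],
    "woman":  [0.0, -0.5, 1.0],
    "prince": [0.8,  0.5, 1.0],
    "apple":  [-1.0, 0.0, 0.2],
}

for word, score in analogy(toy, positive=["king", "woman"], negative=["man"]):
    print('"%s" - similarity: %f' % (word, score))
print(cosine(toy["man"], toy["woman"]))
```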