Commit

more doc fixes
piskvorky committed Sep 20, 2020
1 parent e95ac0a commit e5057c1
Showing 1 changed file with 70 additions and 22 deletions.
92 changes: 70 additions & 22 deletions gensim/models/word2vec.py
"""
Introduction
============
This module implements the word2vec family of algorithms, using highly optimized C routines,
data streaming and Pythonic interfaces.
There are more ways to train word vectors in Gensim than just Word2Vec.
See also :class:`~gensim.models.doc2vec.Doc2Vec`, :class:`~gensim.models.fasttext.FastText` and
wrappers for :class:`~gensim.models.wrappers.varembed.VarEmbed` and :class:`~gensim.models.wrappers.wordrank.WordRank`.
The training algorithms were originally ported from the C package https://code.google.com/p/word2vec/
and extended with additional functionality and
`optimizations <https://rare-technologies.com/parallelizing-word2vec-in-python/>`_ over the years.
For a tutorial on Gensim word2vec, with an interactive web app trained on GoogleNews,
visit https://rare-technologies.com/word2vec-tutorial/.
**Make sure you have a C compiler before installing Gensim, to use the optimized word2vec routines**
(70x speedup compared to plain NumPy implementation, https://rare-technologies.com/parallelizing-word2vec-in-python/).
Usage examples
==============
>>> from gensim.test.utils import common_texts
>>> from gensim.models import Word2Vec
>>>
>>> model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
>>> model.save("word2vec.model")
**The training is streamed, so ``sentences`` can be an iterable**, reading input data
from the disk or network on-the-fly, without loading your entire corpus into RAM.
Note the ``sentences`` iterable must be *restartable* (not just a generator), to allow the algorithm
to stream over your dataset multiple times. For some examples of streamed iterables,
see :class:`~gensim.models.word2vec.BrownCorpus`,
:class:`~gensim.models.word2vec.Text8Corpus` or :class:`~gensim.models.word2vec.LineSentence`.
If you save the model you can continue training it later:
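.. sourcecode:: pycon

>>> # Reload the model saved above before continuing training ("word2vec.model" from the save() call).
>>> model = Word2Vec.load("word2vec.model")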
>>> model.train([["hello", "world"]], total_examples=1, epochs=1)
(0, 2)
The trained word vectors are stored in a :class:`~gensim.models.keyedvectors.KeyedVectors` instance, as `model.wv`:
.. sourcecode:: pycon
>>> vector = model.wv['computer'] # get numpy vector of a word
The reason for separating the trained vectors into `KeyedVectors` is that if you don't
need the full model state any more (don't need to continue training), its state can be discarded,
keeping just the vectors and their keys.
This results in a much smaller and faster object that can be mmapped for lightning
fast loading and sharing the vectors in RAM between processes:
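.. sourcecode:: pycon

>>> from gensim.models import KeyedVectors
>>>
>>> # A minimal sketch: store just the words + their trained embeddings
>>> # (the "word2vec.wordvectors" file name is illustrative).
>>> word_vectors = model.wv
>>> word_vectors.save("word2vec.wordvectors")
>>>
>>> # Load them back memory-mapped = read-only, shared across processes.
>>> wv = KeyedVectors.load("word2vec.wordvectors", mmap='r')
>>> vector = wv['computer']  # get numpy vector of a word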
To continue training later, however, you will need the
full :class:`~gensim.models.word2vec.Word2Vec` object state, as stored by :meth:`~gensim.models.word2vec.Word2Vec.save`,
not just the :class:`~gensim.models.keyedvectors.KeyedVectors`.
You can perform various NLP tasks with a trained model. Some of the operations
are already built-in - see :mod:`gensim.models.keyedvectors`.
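For example, a quick similarity query (illustrative only; the neighbours returned depend on your training data):

.. sourcecode:: pycon

>>> sims = model.wv.most_similar('computer', topn=10)  # get other similar words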
If you're finished training a model (i.e. no more updates, only querying),
you can switch to the :class:`~gensim.models.keyedvectors.KeyedVectors` instance:
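.. sourcecode:: pycon

>>> # A minimal sketch of the switch: keep only the trained KeyedVectors and drop the rest of the model.
>>> word_vectors = model.wv
>>> del model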
to trim unneeded model state = use much less RAM and allow fast loading and memory sharing (mmap).
Embeddings with multiword ngrams
================================
There is a :mod:`gensim.models.phrases` module which lets you automatically
detect phrases longer than one word, using collocation statistics.
Using phrases, you can learn a word2vec model where "words" are actually multiword expressions,
such as `new_york_times` or `financial_crisis`:
.. sourcecode:: pycon
>>> from gensim.test.utils import common_texts
>>> from gensim.models import Phrases
>>>
>>> # Train a bigram detector.
>>> bigram_transformer = Phrases(common_texts)
>>>
>>> # Apply the trained MWE detector to a corpus, using the result to train a Word2vec model.
>>> model = Word2Vec(bigram_transformer[common_texts], min_count=1)
Pretrained models
=================
Gensim comes with several already pre-trained models, in the
`Gensim-data repository <https://github.com/RaRe-Technologies/gensim-data>`_:
.. sourcecode:: pycon
>>> import gensim.downloader
>>> # Show all available models in gensim-data
>>> print(list(gensim.downloader.info()['models'].keys()))
['fasttext-wiki-news-subwords-300',
'conceptnet-numberbatch-17-06-300',
'word2vec-ruscorpora-300',
'word2vec-google-news-300',
'glove-wiki-gigaword-50',
'glove-wiki-gigaword-100',
'glove-wiki-gigaword-200',
'glove-wiki-gigaword-300',
'glove-twitter-25',
'glove-twitter-50',
'glove-twitter-100',
'glove-twitter-200',
'__testing_word2vec-matrix-synopsis']
>>>
>>> # Download the "glove-twitter-25" embeddings
>>> glove_vectors = gensim.downloader.load('glove-twitter-25')
>>>
>>> # Use the downloaded vectors as usual:
>>> glove_vectors.most_similar('twitter')
[('facebook', 0.948005199432373),
('tweet', 0.9403423070907593),
('fb', 0.9342358708381653),
('instagram', 0.9104824066162109),
('chat', 0.8964964747428894),
('hashtag', 0.8885937333106995),
('tweets', 0.8878158330917358),
('tl', 0.8778461217880249),
('link', 0.8778210878372192),
('internet', 0.8753897547721863)]
"""

from __future__ import division # py3 "true division"
logger = logging.getLogger(__name__)

try:
    from gensim.models.word2vec_inner import (  # noqa: F401
        train_batch_sg,
        train_batch_cbow,
        score_sentence_sg,
