24 Aug 08:44

mpenkov

30179c8

4.3.2 Latest

Changes

4.3.2, 2023-08-23

🔴 Bug fixes

Fix incorrect conversion of cosine distance to cosine similarity (monash849, #3441)
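For readers unfamiliar with the two quantities: cosine distance is conventionally defined as one minus cosine similarity, so converting back means subtracting the distance from one. The sketch below uses plain Python and hypothetical helper names; it only illustrates that relationship and is not the code changed in #3441.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distance_to_similarity(distance):
    """Convert a cosine distance (1 - similarity) back to a similarity."""
    return 1.0 - distance


u, v = [1.0, 0.0], [1.0, 1.0]
similarity = cosine_similarity(u, v)   # angle of 45 degrees
distance = 1.0 - similarity            # the corresponding cosine distance
```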

📚 Tutorial and doc improvements

Fix inconsistent documentation for LdaSeqModel #3474 (rsokolewicz, #3475)
Update the licence link to LGPLv2.1 (ERijck, #3471)
Replace HTTP with HTTPS in enwiki URLs (Holmes5, #3459)
Update broken/redirecting/unencrypted links (pabs3, #3456)
Update Python version in docs (gliptak, #3446)

👍 Improvements

Remove unused dependency, handle ImportError (mpenkov, #3447)
Sanity check for hs and negative in Word2Vec (gau-nernst, #3443)

🔮 Testing, CI, housekeeping

Fix CI test and wheel building workflow (mpenkov, #3488)
Build wheels with oldest supported numpy (PrimozGodec, #3467)
Bump pypa/cibuildwheel from 2.12.1 to 2.13.1 (dependabot[bot], #3483)
Doc fixes and separate workflow for building docs via CI (pabs3, #3462)
Move wheels upload into its own job (nikaro, #3454)
Enable arm64/aarch64 wheel builds (nikaro, #3448)

21 Dec 00:42

mpenkov

4.3.0

adf393c

4.3.0

What's Changed

Allow overriding the Cython version requirement by @pabs3 in #3323
Update Python module MANIFEST by @pabs3 in #3343
Clean up references to Morfessor, tox and gensim.models.wrappers by @pabs3 in #3345
Disable the Gensim 3=>4 warning in docs by @piskvorky in #3346
pin sphinx versions, add explicit gallery_top label by @mpenkov in #3383
Declare variables prior to for loop in fastss.pyx for ANSI C compatibility by @hstk30 in #3378
Fix typo in word2vec and KeyedVectors docstrings by @dymil in #3365
Replace np.multiply with np.square and copyedit in translation_matrix.py by @dymil in #3374
Copyedit and fix outdated statements in translation matrix tutorial by @dymil in #3375
Implement Okapi BM25 variants in Gensim by @Witiko in #3304
Giving missing credit in EnsembleLDA to Alex in docs by @sezanzeb in #3393
PERF: pyemd to POT for EMD computation in wmdistance by @TLouf in #3327
Fixed bug in loss computation for Word2Vec with hierarchical softmax by @TalIfargan in #3397
fix deprecation warning from pytest by @martino-vic in #3354
Switch to Cython language level 3 by @pabs3 in #3344
Implement numpy hack in setup.py to enable install under Poetry by @jaymegordo in #3363
Fixed the broken link in readme.md by @aswin2108 in #3409
Patch Coherence Model to correctly handle empty documents by @PrimozGodec in #3406
Add support for Python 3.11 and drop support for Python 3.7 by @acul3 in #3402
clarify runtime expectations by @gojomo in #3381
Fix bug that prevents loading old models by @funasshi in #3359
refactor wheel building and testing workflow by @mpenkov in #3410
Fixed FastTextKeyedVectors handling in add_vector by @globba in #3389
Flsamodel by @ERijck in #3398
Fix backwards compatibility bug in Word2Vec by @mpenkov in #3415
fix numpy hack in setup.py by @mpenkov in #3416
updated changelog for next release by @mpenkov in #3412

New Contributors

@hstk30 made their first contribution in #3378
@TLouf made their first contribution in #3327
@TalIfargan made their first contribution in #3397
@martino-vic made their first contribution in #3354
@jaymegordo made their first contribution in #3363
@aswin2108 made their first contribution in #3409
@acul3 made their first contribution in #3402
@funasshi made their first contribution in #3359
@globba made their first contribution in #3389
@ERijck made their first contribution in #3398

Full Changelog: 4.2.0...4.3.0

01 May 08:38

piskvorky

4.2.0

acbba2f

4.2.0

A number of incremental improvements, optimizations and bugfixes: CHANGELOG

18 Sep 14:21

mpenkov

4.1.2

b76108e

4.1.2

4.1.2, 2021-09-17

This is a bugfix release that addresses leftover compatibility issues with older versions of numpy and macOS.

4.1.1, 2021-09-14

This is a bugfix release that addresses compatibility issues with older versions of numpy.

4.1.0, 2021-08-15

Gensim 4.1 brings two major new functionalities:

Ensemble LDA for robust training, selection and comparison of LDA models.
FastSS module for super fast Levenshtein "fuzzy search" queries. Used e.g. for "soft term similarity" calculations.
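
As background for the "fuzzy search" item: Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is a textbook dynamic-programming version in plain Python with a hypothetical function name. It is not the FastSS algorithm and not Gensim's implementation, only the distance that such queries are ranked by.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the DP table."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]
```

This runs in time proportional to `len(a) * len(b)` per pair, which is why an index such as FastSS is attractive when many candidate terms must be searched.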

There are several minor changes that are not backwards compatible with previous versions of Gensim.
The affected functionality is used relatively rarely and is unlikely to affect most users, so we have opted not to require a major version bump.
Nevertheless, we describe them below.

Improved parameter edge-case handling in KeyedVectors most_similar and most_similar_cosmul methods

We now handle both positive and negative keyword parameters consistently.
They may now be either:

A string, in which case the value is reinterpreted as a list of one element (the string value)
A vector, in which case the value is reinterpreted as a list of one element (the vector)
A list of strings
A list of vectors

So you can now simply do:

    model.most_similar(positive='war', negative='peace')

instead of the slightly more involved

    model.most_similar(positive=['war'], negative=['peace'])

Both invocations remain correct, so you can use whichever is most convenient.
If you were somehow expecting gensim to interpret the strings as a list of characters, e.g.

    model.most_similar(positive=['w', 'a', 'r'], negative=['p', 'e', 'a', 'c', 'e'])

then you will need to specify the lists explicitly in gensim 4.1.
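
The normalization described above can be sketched as a small helper. This is an illustration under stated assumptions, not Gensim's code: `as_key_list` is a hypothetical name, and a "vector" is recognized here simply by having an `ndim` attribute equal to 1, as a one-dimensional NumPy array does.

```python
def as_key_list(value):
    """Normalize a positive/negative argument to a list of keys or vectors."""
    if value is None:
        return []
    if isinstance(value, str):
        # A bare string is one key, not a sequence of characters.
        return [value]
    if getattr(value, "ndim", None) == 1:
        # A single 1-D vector (e.g. a NumPy array) is one item.
        return [value]
    # Anything else is assumed to already be a list of strings or vectors.
    return list(value)
```

With this rule, `as_key_list('war')` and `as_key_list(['war'])` give the same result, which is why both call styles shown above behave identically.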

Deprecated obsolete `steps` parameter from doc2vec

With the newer version, do this:

    model.infer_vector(..., epochs=123)

instead of this:

    model.infer_vector(..., steps=123)
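
If older call sites build their keyword arguments dynamically, a thin shim can rename the key before the call. The helper below is a hypothetical sketch (the name `migrate_infer_kwargs` is not part of Gensim); it only rewrites a dictionary and makes no assumptions about the model itself.

```python
import warnings


def migrate_infer_kwargs(kwargs):
    """Return a copy of kwargs with the old `steps` key renamed to `epochs`."""
    migrated = dict(kwargs)
    if "steps" in migrated:
        if "epochs" in migrated:
            raise TypeError("pass either 'steps' or 'epochs', not both")
        warnings.warn("'steps' is obsolete; use 'epochs'",
                      DeprecationWarning, stacklevel=2)
        migrated["epochs"] = migrated.pop("steps")
    return migrated
```

A caller would then write something like `model.infer_vector(doc_words, **migrate_infer_kwargs(old_kwargs))`, where `doc_words` and `old_kwargs` stand in for the caller's own variables.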

Plus a large number of smaller improvements and fixes, as usual.

⚠️ If migrating from old Gensim 3.x, read the Migration guide first.

👍 New features

#3169: Implement shrink_windows argument for Word2Vec, by @M-Demay
#3163: Optimize word mover distance (WMD) computation, by @flowlight0
#3157: New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary, by @Witiko
#3153: Vectorize word2vec.predict_output_word for speed, by @M-Demay
#3146: Use FastSS for fast kNN over Levenshtein distance, by @Witiko
#3128: Materialize and copy the corpus passed to SoftCosineSimilarity, by @Witiko
#3115: Make LSI dispatcher CLI param for number of jobs optional, by @robguinness
#3091: LsiModel: Only log top words that actually exist in the dictionary, by @kmurphy4
#2980: Added EnsembleLda for stable LDA topics, by @sezanzeb
#2978: Optimize performance of Author-Topic model, by @horpto
#3000: Tidy up KeyedVectors.most_similar() API, by @simonwiles

📚 Tutorials and docs

#3155: Correct parameter name in documentation of fasttext.py, by @bizzyvinci
#3148: Fix broken link to mycorpus.txt in documentation, by @rohit901
#3142: Use more permanent pdf link and update code link, by @dymil
#3141: Update link for online LDA paper, by @dymil
#3133: Update link to Hoffman paper (online VB LDA), by @jonaschn
#3129: [MRG] Add bronze sponsor: TechTarget, by @piskvorky
#3126: Fix typos in make_wiki_online.py and make_wikicorpus.py, by @nicolasassi
#3125: Improve & unify docs for dirichlet priors, by @jonaschn
#3123: Fix hyperlink for doc2vec tutorial, by @AdityaSoni19031997
#3121: [MRG] Add bronze sponsor: eaccidents.com, by @piskvorky
#3120: Fix URL for ldamodel.py, by @jonaschn
#3118: Fix URL in doc string, by @jonaschn
#3107: Draw attention to sponsoring in README, by @piskvorky
#3105: Fix documentation links: Travis to Github Actions, by @piskvorky
#3057: Clarify doc comment in LdaModel.inference(), by @yocen
#2964: Document that preprocessing.strip_punctuation is limited to ASCII, by @sciatro
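
To make the ASCII limitation in #2964 concrete, here is a minimal stand-alone sketch built on Python's `string.punctuation`, which contains ASCII punctuation only. The function name `strip_ascii_punctuation` is hypothetical and the code is an approximation for illustration, not a copy of `gensim.parsing.preprocessing.strip_punctuation`.

```python
import re
import string

# string.punctuation covers ASCII punctuation only, e.g. !"#$%&'()*+,-./
ASCII_PUNCT = re.compile("[%s]+" % re.escape(string.punctuation))


def strip_ascii_punctuation(text):
    """Replace each run of ASCII punctuation with a single space."""
    return ASCII_PUNCT.sub(" ", text)


ascii_example = strip_ascii_punctuation("Hello, world!")
unicode_example = strip_ascii_punctuation("«Bonjour»")
```

The guillemets in the second example pass through unchanged because they are not ASCII characters, which is the behaviour the documentation change calls out.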

🔴 Bug fixes

#3178: Fix Unicode string incompatibility in gensim.similarities.fastss.editdist, by @Witiko
#3174: Fix loading Phraser models stored in Gensim 3.x into Gensim 4.0, by @emgucv
#3136: Fix indexing error in word2vec_inner.pyx, by @bluekura
#3131: Add missing import to NMF docs and models/__init__.py, by @properGrammar
#3116: Fix bug where saved Phrases model did not load its connector_words, by @aloknayak29
#2830: Fixed KeyError in coherence model, by @pietrotrope

⚠️ Removed functionality & deprecations

#3176: Eliminate obsolete step parameter from doc2vec infer_vector and similarity_unseen_docs, by @rock420
#2965: Remove strip_punctuation2 alias of strip_punctuation, by @sciatro
#3180: Move preprocessing functions from gensim.corpora.textcorpus and gensim.corpora.lowcorpus to gensim.parsing.preprocessing, by @rock420

🔮 Testing, CI, housekeeping

#3156: Update Numpy minimum version to 1.17.0, by @PrimozGodec
#3143: replace _mul function with explicit casts, by @mpenkov
#2952: Allow newer versions of the Morfessor module for the tests, by @pabs3
#2965: Remove strip_punctuation2 alias of strip_punctuation, by @sciatro

14 Sep 13:49

mpenkov

4.1.1

3d72896

4.1.1

4.1.1, 2021-09-14

This is a bugfix release that addresses compatibility issues with older versions of numpy.

4.1.0, 2021-08-15

Gensim 4.1 brings two major new functionalities:

Ensemble LDA for robust training, selection and comparison of LDA models.
FastSS module for super fast Levenshtein "fuzzy search" queries. Used e.g. for "soft term similarity" calculations.

Improved parameter edge-case handling in KeyedVectors most_similar and most_similar_cosmul methods

We now handle both positive and negative keyword parameters consistently.
They may now be either:

A string, in which case the value is reinterpreted as a list of one element (the string value)
A vector, in which case the value is reinterpreted as a list of one element (the vector)
A list of strings
A list of vectors

So you can now simply do:

    model.most_similar(positive='war', negative='peace')

instead of the slightly more involved

    model.most_similar(positive=['war'], negative=['peace'])

Both invocations remain correct, so you can use whichever is most convenient.
If you were somehow expecting gensim to interpret the strings as a list of characters, e.g.

    model.most_similar(positive=['w', 'a', 'r'], negative=['p', 'e', 'a', 'c', 'e'])

then you will need to specify the lists explicitly in gensim 4.1.

Deprecated obsolete `steps` parameter from doc2vec

With the newer version, do this:

    model.infer_vector(..., epochs=123)

instead of this:

    model.infer_vector(..., steps=123)

Plus a large number of smaller improvements and fixes, as usual.

⚠️ If migrating from old Gensim 3.x, read the Migration guide first.

👍 New features

#3169: Implement shrink_windows argument for Word2Vec, by @M-Demay
#3163: Optimize word mover distance (WMD) computation, by @flowlight0
#3157: New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary, by @Witiko
#3153: Vectorize word2vec.predict_output_word for speed, by @M-Demay
#3146: Use FastSS for fast kNN over Levenshtein distance, by @Witiko
#3128: Materialize and copy the corpus passed to SoftCosineSimilarity, by @Witiko
#3115: Make LSI dispatcher CLI param for number of jobs optional, by @robguinness
#3091: LsiModel: Only log top words that actually exist in the dictionary, by @kmurphy4
#2980: Added EnsembleLda for stable LDA topics, by @sezanzeb
#2978: Optimize performance of Author-Topic model, by @horpto
#3000: Tidy up KeyedVectors.most_similar() API, by @simonwiles

📚 Tutorials and docs

#3155: Correct parameter name in documentation of fasttext.py, by @bizzyvinci
#3148: Fix broken link to mycorpus.txt in documentation, by @rohit901
#3142: Use more permanent pdf link and update code link, by @dymil
#3141: Update link for online LDA paper, by @dymil
#3133: Update link to Hoffman paper (online VB LDA), by @jonaschn
#3129: [MRG] Add bronze sponsor: TechTarget, by @piskvorky
#3126: Fix typos in make_wiki_online.py and make_wikicorpus.py, by @nicolasassi
#3125: Improve & unify docs for dirichlet priors, by @jonaschn
#3123: Fix hyperlink for doc2vec tutorial, by @AdityaSoni19031997
#3121: [MRG] Add bronze sponsor: eaccidents.com, by @piskvorky
#3120: Fix URL for ldamodel.py, by @jonaschn
#3118: Fix URL in doc string, by @jonaschn
#3107: Draw attention to sponsoring in README, by @piskvorky
#3105: Fix documentation links: Travis to Github Actions, by @piskvorky
#3057: Clarify doc comment in LdaModel.inference(), by @yocen
#2964: Document that preprocessing.strip_punctuation is limited to ASCII, by @sciatro

🔴 Bug fixes

#3178: Fix Unicode string incompatibility in gensim.similarities.fastss.editdist, by @Witiko
#3174: Fix loading Phraser models stored in Gensim 3.x into Gensim 4.0, by @emgucv
#3136: Fix indexing error in word2vec_inner.pyx, by @bluekura
#3131: Add missing import to NMF docs and models/__init__.py, by @properGrammar
#3116: Fix bug where saved Phrases model did not load its connector_words, by @aloknayak29
#2830: Fixed KeyError in coherence model, by @pietrotrope

⚠️ Removed functionality & deprecations

#3176: Eliminate obsolete step parameter from doc2vec infer_vector and similarity_unseen_docs, by @rock420
#2965: Remove strip_punctuation2 alias of strip_punctuation, by @sciatro
#3180: Move preprocessing functions from gensim.corpora.textcorpus and gensim.corpora.lowcorpus to gensim.parsing.preprocessing, by @rock420

🔮 Testing, CI, housekeeping

#3156: Update Numpy minimum version to 1.17.0, by @PrimozGodec
#3143: replace _mul function with explicit casts, by @mpenkov
#2952: Allow newer versions of the Morfessor module for the tests, by @pabs3
#2965: Remove strip_punctuation2 alias of strip_punctuation, by @sciatro

29 Aug 22:28

piskvorky

4.1.0

109c88e

4.1.0

4.1.0, 2021-08-15

Gensim 4.1 brings two major new functionalities:

Ensemble LDA for robust training, selection and comparison of LDA models.
FastSS module for super fast Levenshtein "fuzzy search" queries. Used e.g. for "soft term similarity" calculations.

Improved parameter edge-case handling in KeyedVectors most_similar and most_similar_cosmul methods

We now handle both positive and negative keyword parameters consistently.
They may now be either:

A string, in which case the value is reinterpreted as a list of one element (the string value)
A vector, in which case the value is reinterpreted as a list of one element (the vector)
A list of strings
A list of vectors

So you can now simply do:

    model.most_similar(positive='war', negative='peace')

instead of the slightly more involved

    model.most_similar(positive=['war'], negative=['peace'])

Both invocations remain correct, so you can use whichever is most convenient.
If you were somehow expecting gensim to interpret the strings as a list of characters, e.g.

    model.most_similar(positive=['w', 'a', 'r'], negative=['p', 'e', 'a', 'c', 'e'])

then you will need to specify the lists explicitly in gensim 4.1.

Deprecated obsolete `steps` parameter from doc2vec

With the newer version, do this:

    model.infer_vector(..., epochs=123)

instead of this:

    model.infer_vector(..., steps=123)

Plus a large number of smaller improvements and fixes, as usual.

⚠️ If migrating from old Gensim 3.x, read the Migration guide first.

👍 New features

#3169: Implement shrink_windows argument for Word2Vec, by @M-Demay
#3163: Optimize word mover distance (WMD) computation, by @flowlight0
#3157: New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary, by @Witiko
#3153: Vectorize word2vec.predict_output_word for speed, by @M-Demay
#3146: Use FastSS for fast kNN over Levenshtein distance, by @Witiko
#3128: Materialize and copy the corpus passed to SoftCosineSimilarity, by @Witiko
#3115: Make LSI dispatcher CLI param for number of jobs optional, by @robguinness
#3091: LsiModel: Only log top words that actually exist in the dictionary, by @kmurphy4
#2980: Added EnsembleLda for stable LDA topics, by @sezanzeb
#2978: Optimize performance of Author-Topic model, by @horpto
#3000: Tidy up KeyedVectors.most_similar() API, by @simonwiles

📚 Tutorials and docs

#3155: Correct parameter name in documentation of fasttext.py, by @bizzyvinci
#3148: Fix broken link to mycorpus.txt in documentation, by @rohit901
#3142: Use more permanent pdf link and update code link, by @dymil
#3141: Update link for online LDA paper, by @dymil
#3133: Update link to Hoffman paper (online VB LDA), by @jonaschn
#3129: [MRG] Add bronze sponsor: TechTarget, by @piskvorky
#3126: Fix typos in make_wiki_online.py and make_wikicorpus.py, by @nicolasassi
#3125: Improve & unify docs for dirichlet priors, by @jonaschn
#3123: Fix hyperlink for doc2vec tutorial, by @AdityaSoni19031997
#3121: [MRG] Add bronze sponsor: eaccidents.com, by @piskvorky
#3120: Fix URL for ldamodel.py, by @jonaschn
#3118: Fix URL in doc string, by @jonaschn
#3107: Draw attention to sponsoring in README, by @piskvorky
#3105: Fix documentation links: Travis to Github Actions, by @piskvorky
#3057: Clarify doc comment in LdaModel.inference(), by @yocen
#2964: Document that preprocessing.strip_punctuation is limited to ASCII, by @sciatro

🔴 Bug fixes

#3178: Fix Unicode string incompatibility in gensim.similarities.fastss.editdist, by @Witiko
#3174: Fix loading Phraser models stored in Gensim 3.x into Gensim 4.0, by @emgucv
#3136: Fix indexing error in word2vec_inner.pyx, by @bluekura
#3131: Add missing import to NMF docs and models/__init__.py, by @properGrammar
#3116: Fix bug where saved Phrases model did not load its connector_words, by @aloknayak29
#2830: Fixed KeyError in coherence model, by @pietrotrope

⚠️ Removed functionality & deprecations

#3176: Eliminate obsolete step parameter from doc2vec infer_vector and similarity_unseen_docs, by @rock420
#2965: Remove strip_punctuation2 alias of strip_punctuation, by @sciatro
#3180: Move preprocessing functions from gensim.corpora.textcorpus and gensim.corpora.lowcorpus to gensim.parsing.preprocessing, by @rock420

🔮 Testing, CI, housekeeping

#3156: Update Numpy minimum version to 1.17.0, by @PrimozGodec
#3143: replace _mul function with explicit casts, by @mpenkov
#2952: Allow newer versions of the Morfessor module for the tests, by @pabs3
#2965: Remove strip_punctuation2 alias of strip_punctuation, by @sciatro

4.0.1, 2021-04-01

Bugfix release to address issues with Wheels on Windows:

#3095
#3097

4.0.0, 2021-03-24

⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.

Gensim 4.0 is a major release with lots of performance & robustness improvements, and a new website.

Main highlights

Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:

a. Efficiency

| model | 3.8.3: wall time / peak RAM / throughput ...

01 Apr 14:19

mpenkov

4.0.1

b4f64a9

4.0.1

4.0.1, 2021-04-01

Bugfix release to address issues with wheels on Windows due to Numpy binary incompatibility:

#3095
#3097

4.0.0, 2021-03-24

⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.

Gensim 4.0 is a major release with lots of performance & robustness improvements, and a new website.

Main highlights

Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:

a. Efficiency

| model | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
|---|---|---|
| fastText | 2.9h / 4.11 GB / 822k words/s | 2.3h / 1.26 GB / 914k words/s |
| word2vec | 1.7h / 0.36 GB / 1685k words/s | 1.2h / 0.33 GB / 1762k words/s |

In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. (4.0 benchmarks)
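
The ratios behind that summary can be recomputed from the table. The snippet below is a small worked calculation using only the figures listed above; the dictionary layout and the `improvement` helper are illustrative choices. The init-time and phrase-detection claims are not covered, since the table does not list those measurements.

```python
# (wall time in hours, peak RAM in GB, throughput in thousand words/s)
benchmarks = {
    "fastText": {"3.8.3": (2.9, 4.11, 822), "4.0.0": (2.3, 1.26, 914)},
    "word2vec": {"3.8.3": (1.7, 0.36, 1685), "4.0.0": (1.2, 0.33, 1762)},
}


def improvement(model):
    """Return how many times better 4.0.0 is than 3.8.3 on each metric."""
    old_time, old_ram, old_rate = benchmarks[model]["3.8.3"]
    new_time, new_ram, new_rate = benchmarks[model]["4.0.0"]
    return {
        "wall_time": old_time / new_time,   # > 1 means faster
        "peak_ram": old_ram / new_ram,      # > 1 means less memory
        "throughput": new_rate / old_rate,  # > 1 means more words/s
    }


for name in benchmarks:
    ratios = improvement(name)
    print(name, {metric: round(value, 2) for metric, value in ratios.items()})
```

For fastText the peak-RAM ratio is 4.11 / 1.26, or roughly 3.3, which matches the "3x less RAM" statement.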

b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch &co.

These improvements come to you transparently aka "for free", but see Migration guide for some changes that break the old Gensim 3.x API. Update your code accordingly.

Dropped a bunch of externally contributed modules and wrappers: summarization, pivoted TFIDF, Mallet…
- Code quality was not up to our standards. Also there was no one to maintain these modules, answer user questions, support them.
  
  So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork & publish into your own repo. They can live happily outside of Gensim.
Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.
- If you still need Python 2 for some reason, stay at Gensim 3.8.3.
A new Gensim website – finally! 🙃

So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.

This is the direction we'll keep going forward: less kitchen-sink of "latest academic algorithms", more focus on robust engineering, targeting concrete NLP & document similarity use-cases.

👍 New features

#2947: Bump minimum Python version to 3.6, by @gojomo
#2300: Use less RAM in LdaMulticore, by @horpto
#2698: Streamline KeyedVectors & X2Vec API, by @gojomo
#2864: Speed up random number generation in word2vec, by @zygm0nt
#2976: Speed up phrase (collocation) detection, by @piskvorky
#2979: Allow skipping common English words in multi-word phrases, by @piskvorky
#2867: Expose max_final_vocab parameter in fastText constructor, by @mpenkov
#2931: Clear up job queue parameters in word2vec, by @lunastera
#2939: X2Vec SaveLoad improvements, by @piskvorky
#3060: Record lifecycle events in Gensim models, by @piskvorky
#3073: Make WMD normalization optional, by @piskvorky
#3065: Default to pickle protocol 4 when saving models, by @piskvorky
#3069: Add Github sponsor + donation nags, by @piskvorky
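
For context on #3065, pickle protocol 4 is selected in the standard library by passing `protocol=4`. The sketch below uses a plain dictionary as a stand-in for model state (an assumption for illustration; it does not call Gensim's save/load code).

```python
import pickle

# Hypothetical stand-in for the state a model might persist.
model_state = {"vector_size": 100, "vocab": ["war", "peace"]}

# Serialize with protocol 4 explicitly, then round-trip the bytes.
payload = pickle.dumps(model_state, protocol=4)
restored = pickle.loads(payload)

# The second byte of the stream records the protocol version used.
protocol_used = payload[1]
```

Protocol 4 was added in Python 3.4 and supports very large objects, which is relevant for big models; files written with it cannot be read by Python 2.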

📚 Tutorials and docs

#3082: Make LDA tutorial read NIPS data on the fly, by @jonaschn
#2954: New theme for the Gensim website, by @dvorakvaclav
#2960: Added Gensim and Compatibility Wiki page, by @piskvorky
#2960: Reworked & simplified the Developer Wiki page, by @piskvorky
#2968: Migrate tutorials & how-tos to 4.0.0, by @piskvorky
#2899: Clean up of language and formatting of docstrings, by @piskvorky
#2899: Added documentation for NMSLIB indexer, by @piskvorky
#2832: Clear up LdaModel documentation, by @FyzHsn
#2871: Clarify that license is LGPL-2.1, by @pombredanne
#2896: Make docs clearer on alpha parameter in LDA model, by @xh2
#2897: Update Hoffman paper link for Online LDA, by @xh2
#2910: Refresh docs for run_annoy tutorial, by @piskvorky
#2935: Fix "generator" language in word2vec docs, by @polm
#3077: Fix various documentation warnings, by @mpenkov
#2991: Fix broken link in run_doc How-To, by @sezanzeb
#3003: Point WordEmbeddingSimilarityIndex documentation to gensim.similarities, by @Witiko
#2996: Make the website link to the old Gensim 3.8.3 documentation dynamic, by @Witiko
#3063: Update link to papers in LSI model, by @jonaschn
#3080: Fix some of the warnings/deprecated functions, by @FredHappyface

🔴 Bug fixes

#2891: Fix fastText word-vectors with ngrams off, by @gojomo
#2907: Fix doc2vec crash for large sets of doc-vectors, by @gojomo
#2899: Fix similarity bug in NMSLIB indexer, by @piskvorky
#2899: Fix deprecation warnings in Annoy integration, by @piskvorky
#2901: Fix inheritance of WikiCorpus from TextCorpus, by @jenishah
#2940: Fix deprecations in SoftCosineSimilarity, by @Witiko
#2944: Fix save_facebook_model failure after update-vocab & other initialization streamlining, by @gojomo
#2846: Fix for Python 3.9/3.10: remove xml.etree.cElementTree, by @hugovk
#2973: phrases.export_phrases() doesn't yield all bigrams, by @piskvorky
#2942: Segfault when training doc2vec, by @gojomo
[#3041](https://github.com/RaRe-Techn...

25 Mar 13:42

mpenkov

4.0.0

f46d72a

4.0.0

Changes

4.0.0, 2021-03-24

⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.

Gensim 4.0 is a major release with lots of performance & robustness improvements, and a new website.

Main highlights

Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:

a. Efficiency

| model | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
|---|---|---|
| fastText | 2.9h / 4.11 GB / 822k words/s | 2.3h / 1.26 GB / 914k words/s |
| word2vec | 1.7h / 0.36 GB / 1685k words/s | 1.2h / 0.33 GB / 1762k words/s |

In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. (4.0 benchmarks)

b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch &co.

These improvements come to you transparently aka "for free", but see Migration guide for some changes that break the old Gensim 3.x API. Update your code accordingly.

Dropped a bunch of externally contributed modules and wrappers: summarization, pivoted TFIDF, Mallet…
- Code quality was not up to our standards. Also there was no one to maintain these modules, answer user questions, support them.
  
  So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork & publish into your own repo. They can live happily outside of Gensim.
Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.
- If you still need Python 2 for some reason, stay at Gensim 3.8.3.
A new Gensim website – finally! 🙃

So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.

This is the direction we'll keep going forward: less kitchen-sink of "latest academic algorithms", more focus on robust engineering, targeting concrete NLP & document similarity use-cases.

👍 New features

#2947: Bump minimum Python version to 3.6, by @gojomo
#2300: Use less RAM in LdaMulticore, by @horpto
#2698: Streamline KeyedVectors & X2Vec API, by @gojomo
#2864: Speed up random number generation in word2vec, by @zygm0nt
#2976: Speed up phrase (collocation) detection, by @piskvorky
#2979: Allow skipping common English words in multi-word phrases, by @piskvorky
#2867: Expose max_final_vocab parameter in fastText constructor, by @mpenkov
#2931: Clear up job queue parameters in word2vec, by @lunastera
#2939: X2Vec SaveLoad improvements, by @piskvorky
#3060: Record lifecycle events in Gensim models, by @piskvorky
#3073: Make WMD normalization optional, by @piskvorky
#3065: Default to pickle protocol 4 when saving models, by @piskvorky
#3069: Add Github sponsor + donation nags, by @piskvorky

📚 Tutorials and docs

#3082: Make LDA tutorial read NIPS data on the fly, by @jonaschn
#2954: New theme for the Gensim website, by @dvorakvaclav
#2960: Added Gensim and Compatibility Wiki page, by @piskvorky
#2960: Reworked & simplified the Developer Wiki page, by @piskvorky
#2968: Migrate tutorials & how-tos to 4.0.0, by @piskvorky
#2899: Clean up of language and formatting of docstrings, by @piskvorky
#2899: Added documentation for NMSLIB indexer, by @piskvorky
#2832: Clear up LdaModel documentation, by @FyzHsn
#2871: Clarify that license is LGPL-2.1, by @pombredanne
#2896: Make docs clearer on alpha parameter in LDA model, by @xh2
#2897: Update Hoffman paper link for Online LDA, by @xh2
#2910: Refresh docs for run_annoy tutorial, by @piskvorky
#2935: Fix "generator" language in word2vec docs, by @polm
#3077: Fix various documentation warnings, by @mpenkov
#2991: Fix broken link in run_doc How-To, by @sezanzeb
#3003: Point WordEmbeddingSimilarityIndex documentation to gensim.similarities, by @Witiko
#2996: Make the website link to the old Gensim 3.8.3 documentation dynamic, by @Witiko
#3063: Update link to papers in LSI model, by @jonaschn
#3080: Fix some of the warnings/deprecated functions, by @FredHappyface

🔴 Bug fixes

#2891: Fix fastText word-vectors with ngrams off, by @gojomo
#2907: Fix doc2vec crash for large sets of doc-vectors, by @gojomo
#2899: Fix similarity bug in NMSLIB indexer, by @piskvorky
#2899: Fix deprecation warnings in Annoy integration, by @piskvorky
#2901: Fix inheritance of WikiCorpus from TextCorpus, by @jenishah
#2940: Fix deprecations in SoftCosineSimilarity, by @Witiko
#2944: Fix save_facebook_model failure after update-vocab & other initialization streamlining, by @gojomo
#2846: Fix for Python 3.9/3.10: remove xml.etree.cElementTree, by @hugovk
#2973: phrases.export_phrases() doesn't yield all bigrams, by @piskvorky
#2942: Segfault when training doc2vec, by @gojomo
#3041: Fix RuntimeError in export_phrases (change defaultdict to dict), by @thalishsajeed
#3059: Fix rac...

22 Mar 09:16

mpenkov

4.0.0.rc1

4a241f0

4.0.0.rc1 Pre-release

4.0.0.rc1, 2021-03-19

⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.

Gensim 4.0 is a major release with lots of performance & robustness improvements and a new website.

Main highlights (see also 👍 Improvements below)

Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:

a. Efficiency

| model | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
|---|---|---|
| fastText | 2.9h / 4.11 GB / 822k words/s | 2.3h / 1.26 GB / 914k words/s |
| word2vec | 1.7h / 0.36 GB / 1685k words/s | 1.2h / 0.33 GB / 1762k words/s |

In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. (4.0 benchmarks)

b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch &co.

These improvements come to you transparently aka "for free", but see Migration guide for some changes that break the old Gensim 3.x API. Update your code accordingly.

Dropped a bunch of externally contributed modules: summarization, pivoted TFIDF normalization, FIXME.
- Code quality was not up to our standards. Also there was no one to maintain them, answer user questions, support these modules.
  
  So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork them into your own repo. They can live happily outside of Gensim.
Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.
- If you still need Python 2 for some reason, stay at Gensim 3.8.3.
A new Gensim website – finally! 🙃

So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.

This is the direction we'll keep going forward: less kitchen-sink of "latest academic algorithms", more focus on robust engineering, targeting common concrete NLP & document similarity use-cases.

🌟 New Features

Default to pickle protocol 4 when saving models (piskvorky, #3065)
Record lifecycle events in Gensim models (piskvorky, #3060)
Make WMD normalization optional (piskvorky, #3073)

🔴 Bug fixes

fix RuntimeError in export_phrases (change defaultdict to dict) (thalishsajeed, #3041)

📚 Tutorial and doc improvements

fix various documentation warnings (mpenkov, #3077)
Fix broken link in run_doc how-to (sezanzeb, #2991)
Point WordEmbeddingSimilarityIndex documentation to gensim.similarities (Witiko, #3003)
Make the link to the Gensim 3.8.3 documentation dynamic (Witiko, #2996)

⚠️ Removed functionality

remove on_batch_begin and on_batch_end callbacks (mpenkov, #3078)
remove pattern dependency (mpenkov, #3012)
rm gensim.viz submodule (mpenkov, #3055)

🔮 Miscellaneous

[MRG] Add Github sponsor + donation nags (piskvorky, #3069)
Update URLs (jonaschn, #3063)
Fix race condition in FastText tests (sleepy-owl, #3059)
Add py39 wheels to travis/azure (FredHappyface, #3058)
Update repos before trying to install gdb (janaknat, #3035)
Transform camelCase test names to snake_case (sezanzeb, #3033)
Move x86 tests from Travis to GHA, add aarch64 wheel build to Travis (janaknat, #3026)
Add Github Actions x86 and mac jobs to build python wheels (janaknat, #3024)


01 Nov 13:33

mpenkov

4.0.0beta

8624aa2

4.0.0beta Pre-release


4.0.0beta, 2020-10-31

⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.

Main highlights

Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:

a. Efficiency

model	3.8.3 wall time / peak RAM / throughput	4.0.0 wall time / peak RAM / throughput
fastText	2.9h / 4.11 GB / 822k words/s	2.3h / 1.26 GB / 914k words/s
word2vec	1.7h / 0.36 GB / 1685k words/s	1.2h / 0.33 GB / 1762k words/s

In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. 4.0 benchmarks.
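
The ratios behind that summary can be re-derived from the table. The sketch below just redoes the arithmetic on the figures quoted above; the dictionary layout and the `ratios` helper are illustrative, not part of Gensim or its benchmark code.

```python
# Figures copied from the table above:
# (wall time in hours, peak RAM in GB, throughput in thousands of words/s).
benchmarks = {
    "fastText": {"3.8.3": (2.9, 4.11, 822), "4.0.0": (2.3, 1.26, 914)},
    "word2vec": {"3.8.3": (1.7, 0.36, 1685), "4.0.0": (1.2, 0.33, 1762)},
}


def ratios(model):
    """Return how much 4.0.0 improved over 3.8.3 for one model."""
    old_time, old_ram, old_throughput = benchmarks[model]["3.8.3"]
    new_time, new_ram, new_throughput = benchmarks[model]["4.0.0"]
    return {
        "time_speedup": round(old_time / new_time, 2),
        "ram_reduction": round(old_ram / new_ram, 2),
        "throughput_gain": round(new_throughput / old_throughput, 2),
    }


if __name__ == "__main__":
    for name in benchmarks:
        print(name, ratios(name))
```

For fastText the peak RAM ratio comes out at roughly 3.3, in line with the "3x less RAM" claim. The word2vec init and phrase detection speedups are not in the table, so they are not covered here.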

b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch &co.

These improvements come to you transparently, aka "for free", but see the Migration guide for some changes that break the old Gensim 3.x API. Update your code accordingly.

Dropped a bunch of externally contributed modules: summarization, pivoted TFIDF normalization, wrappers for 3rd party libraries: Mallet, scikit-learn, DTM model, Vowpal Wabbit, wordrank, varembed.
- Why? Code quality was not up to our standards. Also there was no one to maintain them, answer user questions, support these modules and wrappers.
  
  So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork them into your own repo; they can live happily outside of Gensim, linked to as "contributed" from Gensim docs.
Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.
- If you still need Python 2 for some reason, stay at Gensim 3.8.3.
A new Gensim website – finally! 🙃

So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.

This is the direction we'll keep going forward: less kitchen-sink of "latest academic fad", more focus on robust engineering, targeting common NLP & document similarity use-cases.

Why a pre-release?

This 4.0.0beta pre-release is for users who want the cutting-edge performance and bug fixes, and for users who want to help out by testing and providing feedback: code, documentation, workflows… Please let us know on the mailing list!

Install the pre-release with:

pip install --pre --upgrade gensim

What will change between this pre-release and a "full" 4.0 release?

Check progress here.

👍 Improvements

#2947: Bump minimum Python version to 3.6, by @gojomo
#2939 + #2984: Code style & py3 migration clean up, by @piskvorky
#2300: Use less RAM in LdaMulticore, by @horpto
#2698: Streamline KeyedVectors & X2Vec API, by @gojomo
#2864: Speed up random number generation in word2vec, by @zygm0nt
#2976: Speed up phrase (collocation) detection, by @piskvorky
#2979: Allow skipping common English words in multi-word phrases, by @piskvorky
#2867: Expose max_final_vocab parameter in fastText constructor, by @mpenkov
#2931: Clear up job queue parameters in word2vec, by @lunastera
#2939: X2Vec SaveLoad improvements, by @piskvorky

📚 Tutorials and docs

#2954: New theme for the Gensim website, by @dvorakvaclav
#2960: Added Gensim and Compatibility Wiki page, by @piskvorky
#2960: Reworked & simplified the Developer Wiki page, by @piskvorky
#2968: Migrate tutorials & how-tos to 4.0.0, by @piskvorky
#2899: Clean up of language and formatting of docstrings, by @piskvorky
#2899: Added documentation for NMSLIB indexer, by @piskvorky
#2832: Clear up LdaModel documentation, by @FyzHsn
#2871: Clarify that license is LGPL-2.1, by @pombredanne
#2896: Make docs clearer on alpha parameter in LDA model, by @xh2
#2897: Update Hoffman paper link for Online LDA, by @xh2
#2910: Refresh docs for run_annoy tutorial, by @piskvorky
#2935: Fix "generator" language in word2vec docs, by @polm

🔴 Bug fixes

#2891: Fix fastText word-vectors with ngrams off, by @gojomo
#2907: Fix doc2vec crash for large sets of doc-vectors, by @gojomo
#2899: Fix similarity bug in NMSLIB indexer, by @piskvorky
#2899: Fix deprecation warnings in Annoy integration, by @piskvorky
#2901: Fix inheritance of WikiCorpus from TextCorpus, by @jenishah
#2940: Fix deprecations in SoftCosineSimilarity, by @Witiko
#2944: Fix save_facebook_model failure after update-vocab & other initialization streamlining, by @gojomo
#2846: Fix for Python 3.9/3.10: remove xml.etree.cElementTree, by @hugovk
#2973: phrases.export_phrases() doesn't yield all bigrams
#2942: Segfault when training doc2vec
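
For context on the `xml.etree.cElementTree` item above: that module was removed in Python 3.9, and the plain `xml.etree.ElementTree` module is the drop-in replacement. A minimal sketch follows; the sample XML and the `page_titles` helper are hypothetical and not taken from Gensim's code.

```python
# xml.etree.cElementTree no longer exists on Python 3.9+;
# ElementTree has used the C accelerator automatically since Python 3.3.
import xml.etree.ElementTree as ET

# Hypothetical, tiny stand-in for a wiki-style XML dump.
SAMPLE = (
    "<pages>"
    "<page><title>Gensim</title></page>"
    "<page><title>Word2vec</title></page>"
    "</pages>"
)


def page_titles(xml_text):
    """Return the text of every <title> found under a <page> element."""
    root = ET.fromstring(xml_text)
    return [page.findtext("title") for page in root.iter("page")]


if __name__ == "__main__":
    print(page_titles(SAMPLE))
```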

⚠️ Removed functionality & deprecations

#6: No more binary wheels for x32 platforms, by @menshikh-iv
#2899: Renamed overly broad similarities.index to the more appropriate similarities.annoy, by @piskvorky
#2958: Remove gensim.summarization subpackage, docs and test data, by @mpenkov
#2926: Rename num_words to topn in dtm_coherence, by @MeganStodel
#2937: Remove Keras dependency, by @piskvorky
Removed all code, methods, attributes and functions marked as deprecated in Gensim 3.8.3.
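
Code that has to run on both Gensim 3.x and 4.x can work around the `similarities.index` to `similarities.annoy` rename with an import fallback. The sketch below is one way to do it; `import_first` is a hypothetical helper, and only the two module paths come from the rename listed above.

```python
import importlib


def import_first(module_names):
    """Import and return the first module from module_names that is available."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # Not available under this name; try the next candidate.
    raise ImportError("none of these modules could be imported: %r" % (module_names,))


# New (4.x) location first, then the old (3.x) location.
ANNOY_MODULES = ["gensim.similarities.annoy", "gensim.similarities.index"]
```

With Gensim and Annoy installed, `import_first(ANNOY_MODULES).AnnoyIndexer` should resolve to the indexer class under either version. Nothing is imported until the helper is called, so the snippet itself does not require Gensim.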


Releases: piskvorky/gensim
