Add numberbatch word embeddings. Fix #9 (#10)
* numberbatch to json

* numberbatch to json

* added checksum

* update conceptnet info

* update name
markroxor authored and menshikh-iv committed Dec 18, 2017
1 parent fa71854 commit 40ad3e4
Showing 2 changed files with 18 additions and 1 deletion.
README.md (4 changes: 3 additions & 1 deletion)
@@ -98,6 +98,7 @@ To load a model or corpus, use either the Python or command line interface:
### Models
| name | num vectors | file size | base dataset | read_more | description | parameters | preprocessing | license |
|------|-------------|-----------|--------------|------------|-------------|------------|---------------|---------|
| conceptnet-numberbatch-17-06-300 | 1917247 | 1168 MB | ConceptNet, word2vec, GloVe, and OpenSubtitles 2016 | <ul><li>http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972</li> <li>https://github.com/commonsense/conceptnet-numberbatch</li> <li>http://conceptnet.io/</li></ul> | ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. ConceptNet Numberbatch is part of the ConceptNet open data project. ConceptNet provides lots of ways to compute with word meanings, one of which is word embeddings. ConceptNet Numberbatch is a snapshot of just the word embeddings. It is built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting. | <ul><li>dimension - 300</li></ul> | - | https://github.com/commonsense/conceptnet-numberbatch/blob/master/LICENSE.txt |
| glove-twitter-100 | 1193514 | 387 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https://nlp.stanford.edu/projects/glove/) | <ul><li>dimension - 100</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-100.txt`. | http://opendatacommons.org/licenses/pddl/ |
| glove-twitter-200 | 1193514 | 758 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https://nlp.stanford.edu/projects/glove/). | <ul><li>dimension - 200</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-200.txt`. | http://opendatacommons.org/licenses/pddl/ |
| glove-twitter-25 | 1193514 | 104 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https://nlp.stanford.edu/projects/glove/). | <ul><li>dimension - 25</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-25.txt`. | http://opendatacommons.org/licenses/pddl/ |
@@ -107,7 +108,8 @@ To load a model or corpus, use either the Python or command line interface:
| glove-wiki-gigaword-300 | 400000 | 376 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/). | <ul><li>dimension - 300</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-300.txt`. | http://opendatacommons.org/licenses/pddl/ |
| glove-wiki-gigaword-50 | 400000 | 65 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/). | <ul><li>dimension - 50</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-50.txt`. | http://opendatacommons.org/licenses/pddl/ |
| word2vec-google-news-300 | 3000000 | 1662 MB | Google News (about 100 billion words) | <ul><li>https://code.google.com/archive/p/word2vec/</li> <li>https://arxiv.org/abs/1301.3781</li> <li>https://arxiv.org/abs/1310.4546</li> <li>https://www.microsoft.com/en-us/research/publication/linguistic-regularities-in-continuous-space-word-representations/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvecs.pdf</li></ul> | Pre-trained vectors trained on a part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in 'Distributed Representations of Words and Phrases and their Compositionality' (https://code.google.com/archive/p/word2vec/). | <ul><li>dimension - 300</li></ul> | - | not found |
| word2vec-ruscorpora-300 | 184973 | 198 MB | Russian National Corpus (about 250M words) | <ul><li>https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models</li> <li>http://rusvectores.org/en/</li> <li>https://github.com/RaRe-Technologies/gensim-data/issues/3</li></ul> | Word2vec Continuous Skipgram vectors trained on full Russian National Corpus (about 250M words). The model contains 185K words. | <ul><li>window_size - 10</li> <li>dimension - 300</li></ul> | The corpus was lemmatized and tagged with Universal PoS | not found |
| word2vec-ruscorpora-300 | 184973 | 198 MB | Russian National Corpus (about 250M words) | <ul><li>https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models</li> <li>http://rusvectores.org/en/</li> <li>https://github.com/RaRe-Technologies/gensim-data/issues/3</li></ul> | Word2vec Continuous Skipgram vectors trained on full Russian National Corpus (about 250M words). The model contains 185K words. | <ul><li>window_size - 10</li> <li>dimension - 300</li></ul> | The corpus was lemmatized and tagged with Universal PoS | https://creativecommons.org/licenses/by/4.0/deed.en |


(table generated automatically by [generate_table.py](https://github.com/RaRe-Technologies/gensim-data/blob/master/generate_table.py) based on [list.json](https://github.com/RaRe-Technologies/gensim-data/blob/master/list.json))
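
As the README's intro line above notes, a model can be loaded either from Python or from the command line via gensim's downloader. A minimal sketch of loading the newly added embeddings by name (assuming gensim with its `gensim.downloader` module is installed; the `/c/en/...` key format is an assumption about how the Numberbatch vocabulary was exported):

```python
import gensim.downloader as api

# Print the metadata for the new entry, as read from list.json.
print(api.info("conceptnet-numberbatch-17-06-300"))

# Download the archive on first use and load the vectors.
vectors = api.load("conceptnet-numberbatch-17-06-300")

# Numberbatch terms are typically ConceptNet URIs such as "/c/en/coffee"
# (assumption about the export format; adjust the lookup if the release
# uses plain tokens instead).
print(vectors.most_similar("/c/en/coffee", topn=3))
```

The same download can be triggered from the shell with `python -m gensim.downloader --download conceptnet-numberbatch-17-06-300`.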

list.json (15 changes: 15 additions & 0 deletions)
@@ -122,6 +122,21 @@
}
},
"models": {
"conceptnet-numberbatch-17-06-300": {
"num_records": 1917247,
"file_size": 1225497562,
"base_dataset": "ConceptNet, word2vec, GloVe, and OpenSubtitles 2016",
"reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/conceptnet-numberbatch-17-06-300/__init__.py",
"license": "https://github.com/commonsense/conceptnet-numberbatch/blob/master/LICENSE.txt",
"parameters": {
"dimension": 300
},
"description": "ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. ConceptNet Numberbatch is part of the ConceptNet open data project. ConceptNet provides lots of ways to compute with word meanings, one of which is word embeddings. ConceptNet Numberbatch is a snapshot of just the word embeddings. It is built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting.",
"read_more": ["http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972", "https://github.com/commonsense/conceptnet-numberbatch", "http://conceptnet.io/"],
"checksum": "fd642d457adcd0ea94da0cd21b150847",
"file_name": "conceptnet-numberbatch-17-06-300.gz",
"parts": 1
},
"word2vec-ruscorpora-300": {
"num_records": 184973,
"file_size": 208427381,
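
The new `checksum` field added above is a 32-character hex string, the length of an MD5 digest. A minimal sketch of verifying a downloaded archive against it locally, assuming the field is indeed the MD5 of the released `conceptnet-numberbatch-17-06-300.gz`:

```python
import hashlib

def md5_checksum(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as fin:
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

# Value copied from the "checksum" field added to list.json above.
expected = "fd642d457adcd0ea94da0cd21b150847"
actual = md5_checksum("conceptnet-numberbatch-17-06-300.gz")
assert actual == expected, "downloaded archive is corrupted or incomplete"
```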
