Releases: piskvorky/gensim-data

glove-wiki-gigaword-50

25 Oct 03:10

Pre-trained GloVe vectors trained on Wikipedia 2014 + Gigaword (5.6B tokens, uncased).

Attribute          Value
File size          66 MB
Number of vectors  400000
Dimension          50
License            http://opendatacommons.org/licenses/pddl/

Read more:

Example

import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")
print(model.similarity('bag', 'purse'))

"""
Output:

0.623833699175
"""
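
For intuition, the `similarity` score between two word vectors is their cosine similarity. The following is a minimal sketch of that computation in plain Python. The 3-dimensional vectors are made-up placeholders, not real GloVe values, and `cosine_similarity` is an illustrative helper rather than part of gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (the real model uses 50 dimensions).
bag = [0.2, 0.7, 0.1]
purse = [0.25, 0.6, 0.3]
print(cosine_similarity(bag, purse))
```

Scores lie in the range -1 to 1, with values near 1 indicating vectors that point in nearly the same direction.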

fake-news

24 Oct 13:56

The fake-news dataset contains text and metadata from 244 websites and comprises 12,999 posts in total, collected over a specific 30-day window. The data was pulled using the webhose.io API; because it comes from their crawler, not all websites identified by the BS Detector are present in this dataset. Data sources that were missing a label were simply assigned the label 'bs'. There are (ostensibly) no genuine, reliable, or trustworthy news sources represented in this dataset (so far), so don't trust anything you read.

Attribute        Value
File size        19 MB
Number of posts  12999
License          https://creativecommons.org/publicdomain/zero/1.0/

Read more:

Example

import gensim.downloader as api
import json

fake_news = api.load("fake-news")
for doc in fake_news: 
    print(json.dumps(doc, indent=4))
    break

"""
Output:

{
    "comments": "0",
    "title": "Muslims BUSTED: They Stole Millions In Gov\u2019t Benefits",
    "published": "2016-10-26T21:41:00.000+03:00",
    "site_url": "100percentfedup.com",
    "language": "english",
    "text": "Print They should pay all the back all the money plus interest. The entire family and everyone who came in with them need to be deported asap. Why did it take two years to bust them? \nHere we go again \u2026another group stealing from the government and taxpayers! A group of Somalis stole over four million in government benefits over just 10 months! \nWe\u2019ve reported on numerous cases like this one where the Muslim refugees/immigrants commit fraud by scamming our system\u2026It\u2019s way out of control! More Related",
    "domain_rank": "25689",
    "crawled": "2016-10-27T01:49:27.168+03:00",
    "type": "bias",
    "likes": "0",
    "shares": "0",
    "spam_score": "0",
    "country": "US",
    "author": "Barracuda Brigade",
    "participants_count": "1",
    "ord_in_thread": "0",
    "thread_title": "Muslims BUSTED: They Stole Millions In Gov\u2019t Benefits",
    "uuid": "6a175f46bcd24d39b3e962ad0f29936721db70db",
    "main_img_url": "http://bb4sp.com/wp-content/uploads/2016/10/Fullscreen-capture-10262016-83501-AM.bmp.jpg",
    "replies_count": "0"
}
"""
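
In the sample record above, every value is a string, including numeric fields such as `likes` and `shares`. The following is a minimal sketch of tallying posts per `type` label and summing a numeric field after conversion. Here `sample_docs` is a hypothetical stand-in for the iterable returned by `api.load("fake-news")`, and `count_by_type` and `total_shares` are illustrative helpers, not part of gensim.

```python
from collections import Counter

def count_by_type(docs):
    """Count posts per value of the 'type' field ('unknown' if absent)."""
    return Counter(doc.get("type", "unknown") for doc in docs)

def total_shares(docs):
    """Sum the 'shares' field, which is stored as a string in the sample record."""
    return sum(int(doc.get("shares") or 0) for doc in docs)

# Hypothetical miniature records; only the field names come from the sample above.
sample_docs = [
    {"type": "bias", "shares": "0", "site_url": "example-a.invalid"},
    {"type": "bs", "shares": "3", "site_url": "example-b.invalid"},
    {"type": "bs", "shares": "12", "site_url": "example-c.invalid"},
]

print(count_by_type(sample_docs).most_common())
print(total_shares(sample_docs))
```

To try this on the real data, the object returned by `api.load("fake-news")` could be passed in place of `sample_docs`.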

text8

14 Oct 12:04

The first 100,000,000 bytes of plain text from Wikipedia. Intended for testing purposes; see wiki-english-* for proper full Wikipedia datasets.

Attribute       Value
File size       32 MB
Number of rows  1701

Read more:

Example

import gensim.downloader as api
from gensim.models.word2vec import Word2Vec

data = api.load("text8")
model = Word2Vec(data)
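# Query the trained vectors through model.wv (Word2Vec.most_similar was removed in gensim 4.0).
print(model.wv.most_similar("human", topn=3))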

"""
Output:

[('humans', 0.6429149508476257), ('animal', 0.6419760584831238), ('biological', 0.6034130454063416)]
"""
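
The row count of 1701 listed for this dataset is consistent with the corpus being served as fixed-size chunks of tokens rather than as natural sentences. The sketch below assumes a chunk size of 10,000 tokens (the default `max_sentence_length` of gensim's `Text8Corpus`) and the commonly reported text8 size of 17,005,207 tokens. Both figures are assumptions not stated in the release notes above, and `chunk_tokens` is an illustrative helper.

```python
import math
from itertools import islice

def chunk_tokens(tokens, chunk_size):
    """Yield consecutive lists of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(tokens)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

# Toy demonstration: 25 tokens in rows of 10 gives rows of 10, 10 and 5.
toy_tokens = ["w%d" % i for i in range(25)]
rows = list(chunk_tokens(toy_tokens, 10))
print([len(row) for row in rows])

# Assumed figures (not from the release notes): token count and chunk size.
ASSUMED_TOKEN_COUNT = 17_005_207
ASSUMED_CHUNK_SIZE = 10_000
print(math.ceil(ASSUMED_TOKEN_COUNT / ASSUMED_CHUNK_SIZE))  # 1701
```

Under these assumptions, each row passed to `Word2Vec` is a list of up to 10,000 tokens, so context windows can span what were originally separate sentences.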