Open Data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. (Source: Wikipedia)
Name | Description | URL |
---|---|---|
CDC Open Data | The Centers for Disease Control and Prevention (CDC) is the national public health agency of the United States. | data.cdc.gov |
Data.world | Data.world is the enterprise data catalog for the modern data stack. | data.world |
Five Thirty Eight | FiveThirtyEight is a website using data and evidence to advance public knowledge. This is their open data portal sharing the data and code behind some of their articles and graphics. | data.fivethirtyeight.com |
GENESIS-ONLINE | The German Federal Statistical Office is the institution to contact first for official data on the society, the economy, the environment and the state. | www-genesis.destatis.de |
Kaggle | Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. | www.kaggle.com |
Project Gutenberg | Project Gutenberg is an online library of free eBooks. Books are available in most common eBook formats. | www.gutenberg.org |
Registry of Open Data on AWS | This registry exists to help people discover and share datasets that are available via AWS resources. | registry.opendata.aws |
Science On a Sphere | Science On a Sphere is a room-sized, global display system that projects visualizations of planetary data onto a six-foot-diameter sphere to help illustrate Earth System science to people of all ages. | sos.noaa.gov |
Stanford Large Network Dataset collection | The SNAP library has been actively developed since 2004 and is growing organically as a result of the group's research into the analysis of large social and information networks. | snap.stanford.edu |
Stanford Open Data | Portal for Stanford Open Data | stanfordopendata.org |
The World Bank | The World Bank is an international financial institution that provides loans and grants to the governments of low- and middle-income countries for the purpose of pursuing capital projects. | datacatalog.worldbank.org |
U.S. Census Bureau | The United States census is a census that is legally mandated by the U.S. Constitution. | data.census.gov |
U.S. Department of Commerce | Open data from the U.S. Department of Commerce | data.commerce.gov |
U.S. Education Open Data | Data profiles from the U.S. Department of Education | data.ed.gov |
U.S. Transportation Open Data | Open data from the U.S. Department of Transportation | data.transportation.gov |
UCI ML Repository | The UCI ML repository is a collection of databases, domain theories and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. | archive.ics.uci.edu |
UNICEF | UNICEF, also known as the United Nations Children's Fund, is a United Nations agency responsible for providing humanitarian and developmental aid to children worldwide. | data.unicef.org |
World Health Organization | The World Health Organization (WHO) is a specialized agency of the United Nations responsible for international public health. | www.who.int |
Yelp | The Yelp Open Dataset is a subset of their businesses, reviews, and user data for personal, educational, and academic purposes. | www.yelp.com |
Name | Description | URL |
---|---|---|
20 Newsgroups | A collection of about 20,000 documents covering 20 newsgroups | qwone.com |
Amazon question/answer data | This dataset contains question and answer data from Amazon, totaling around 1.4 million answered questions. | jmcauley.ucsd.edu |
ArXiv | This massive 270 GB dataset features all arXiv research papers in full text. | arxiv.org |
Enron Email Dataset | This dataset contains 500,000+ email messages from Enron officials and is especially useful for anyone who wants to understand the inner workings of email tools. | www.cs.cmu.edu |
Google Books Ngrams | A data set containing Google Books n-gram corpora. This data set is freely available on Amazon S3 in a Hadoop friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License. | aws.amazon.com |
IMDB Reviews | This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. | ai.stanford.edu |
Machine Translation of Various Languages | This dataset consists of training data for four European languages | statmt.org |
Multi-Domain Sentiment Dataset | A massive variety of Amazon products along with their corresponding reviews | www.cs.jhu.edu |
Reuters News Dataset | Originally appearing in 1987, this dataset has been labeled, indexed, and compiled for use in machine learning. | archive.ics.uci.edu |
Sentiment140 | Sentiment140 allows you to discover the sentiment of a brand, product or topic on Twitter | help.sentiment140.com |
Stanford Sentiment Treebank | Dataset for training a model to identify sentiment in longer phrases, built from 10,000+ Rotten Tomatoes reviews | nlp.stanford.edu |
The WikiQA Corpus | This publicly available Q&A dataset was initially compiled to aid open-domain question answering research. | www.microsoft.com |
Twenty Newsgroups Dataset | This data set consists of 20000 messages taken from 20 newsgroups | archive.ics.uci.edu |
Twitter US Airline Sentiment | Analyze how travelers in February 2015 expressed their feelings on Twitter | www.kaggle.com |
UCI's Spambase Data set | This dataset was created by a team at HP (Hewlett-Packard) to help create a spam filter. It contains a large collection of emails previously labeled as spam by users. | archive.ics.uci.edu |
Wikipedia Links Data | This Google dataset contains approximately 13 million documents, each containing at least one hyperlink to an English Wikipedia page | code.google.com |
WordNet | WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. | wordnet.princeton.edu |
Yelp Open Dataset | This Yelp dataset features 8.5M+ reviews of over 160,000 businesses. It also has 200,000+ pictures and spans 8 major metropolitan areas. | www.yelp.com |
YouTubers-saying-things | Dataset containing popular YouTubers' video subtitles | www.kaggle.com |
Name | Description | URL |
---|---|---|
CIFAR-10 | The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. | www.cs.toronto.edu |
COCO (Common Objects in Context) | COCO is a large-scale object detection, segmentation, and captioning dataset. | cocodataset.org |
Fashion-MNIST | Fashion-MNIST is a dataset consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. | github.com |
ImageNet | ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. | www.image-net.org |
Open Images Dataset | Open Images is a dataset of ~9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives. | storage.googleapis.com |
SVHN (Street View House Number) | SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. | ufldl.stanford.edu |
VisualQA | VQA is a new dataset containing open-ended questions about images. The questions require an understanding of vision, language and commonsense knowledge to answer. | visualqa.org |
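
The sizes quoted for CIFAR-10 and Fashion-MNIST in the table above can be cross-checked with a little arithmetic. The sketch below is only illustrative: `ImageDatasetSpec` and its fields are hypothetical names, and it assumes 3 channels for color images, 1 channel for grayscale, and one byte per channel, none of which the table states. The "per class" figure assumes the classes are balanced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageDatasetSpec:
    """Hypothetical container for the figures quoted in the table."""
    name: str
    train: int      # number of training images
    test: int       # number of test images
    height: int     # image height in pixels
    width: int      # image width in pixels
    channels: int   # assumption: 3 for color, 1 for grayscale
    classes: int    # number of label classes

    @property
    def total(self) -> int:
        # Total number of images across both splits.
        return self.train + self.test

    @property
    def raw_bytes(self) -> int:
        # Uncompressed pixel data, assuming one byte per channel.
        return self.total * self.height * self.width * self.channels


CIFAR10 = ImageDatasetSpec("CIFAR-10", 50_000, 10_000, 32, 32, 3, 10)
FASHION_MNIST = ImageDatasetSpec("Fashion-MNIST", 60_000, 10_000, 28, 28, 1, 10)

for spec in (CIFAR10, FASHION_MNIST):
    per_class = spec.total // spec.classes  # only meaningful if classes are balanced
    print(f"{spec.name}: {spec.total} images, "
          f"{per_class} per class, {spec.raw_bytes / 1e6:.1f} MB raw")
```

Under these assumptions, CIFAR-10 comes to 60000 images (6000 per class), which matches the table entry, and both datasets hold well under 200 MB of raw pixel data.
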
Name | Description | URL |
---|---|---|
Ballroom | This dataset provides information on ballroom dancing. Characteristic excerpts of many dance styles are provided in real audio format. Their tempi are also available. | mtg.upf.edu |
FMA (Free Music Archive) | A Dataset for Music Analysis | github.com |
Free Spoken Digit Dataset | A free audio dataset of spoken digits. Think MNIST for audio. | github.com |
LibriSpeech | LibriSpeech is a corpus of approximately 1000 hours of 16 kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. | www.openslr.org |
Urban Sound 8K Dataset | This dataset contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes. | urbansounddataset.weebly.com |
VoxCeleb | This is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. | www.robots.ox.ac.uk |