Open Data ❤️

Open Data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. (Source: Wikipedia)

Index

📊 OpenData Websites
🖼️ Image Datasets
📚 NLP Datasets
🎵 Audio Datasets
Open Government Sites

📊 OpenData Websites

Name	Description	URL
CDC Open Data	The Centers for Disease Control and Prevention (CDC) is the national public health agency of the United States.	data.cdc.gov
Data.world	Data.world is an enterprise data catalog for the modern data stack.	data.world
FiveThirtyEight	FiveThirtyEight is a website using data and evidence to advance public knowledge. This is their open data portal sharing the data and code behind some of their articles and graphics.	data.fivethirtyeight.com
GENESIS-ONLINE	The German Federal Statistical Office is the institution to contact first for official data on the society, the economy, the environment and the state.	www-genesis.destatis.de
Kaggle	Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.	www.kaggle.com
Project Gutenberg	Project Gutenberg is an online library of free eBooks. Books are available in most common e-book formats.	www.gutenberg.org
Registry of Open Data on AWS	This registry exists to help people discover and share datasets that are available via AWS resources.	registry.opendata.aws
Science On a Sphere	Science On a Sphere is a room-sized, global display system that projects visualizations of planetary data onto a six-foot-diameter sphere to help illustrate Earth System science to people of all ages.	sos.noaa.gov
Stanford Large Network Dataset collection	The SNAP library has been actively developed since 2004 and is growing organically as a result of their research into the analysis of large social and information networks.	snap.stanford.edu
Stanford Open Data	Portal for Stanford Open Data	stanfordopendata.org
The World Bank	The World Bank is an international financial institution that provides loans and grants to the governments of low- and middle-income countries for the purpose of pursuing capital projects.	datacatalog.worldbank.org
U.S. Census Bureau	The United States census is a census that is legally mandated by the U.S. Constitution.	data.census.gov
U.S. Department of Commerce	Open data from the U.S. Department of Commerce	data.commerce.gov
U.S. Education Open Data	Data profiles from the U.S. Department of Education	data.ed.gov
U.S. Transportation Open Data	Open data from the U.S. Department of Transportation	data.transportation.gov
UCI ML Repository	The UCI ML repository is a collection of databases, domain theories and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.	archive.ics.uci.edu
UNICEF	UNICEF, also known as the United Nations Children's Fund, is a United Nations agency responsible for providing humanitarian and developmental aid to children worldwide.	data.unicef.org
World Health Organization	The World Health Organization (WHO) is a specialized agency of the United Nations responsible for international public health.	www.who.int
Yelp	The Yelp Open Dataset is a subset of their businesses, reviews, and user data for use in personal, educational and academic purposes.	www.yelp.com

⬆️ Go back to index

📚 NLP Datasets

Name	Description	URL
20 Newsgroups	A collection featuring 20,000 documents that covers 20 newsgroups and subjects	qwone.com
Amazon question/answer data	This dataset contains question and answer data from Amazon, totaling around 1.4 million answered questions.	jmcauley.ucsd.edu
ArXiv	This massive 270 GB dataset features all arXiv research papers in full text.	arxiv.org
Enron Email Dataset	This dataset contains 500,000+ email messages from Enron officials and is especially useful for anyone looking to understand the inner workings of email tools.	www.cs.cmu.edu
Google Books Ngrams	A data set containing Google Books n-gram corpora. This data set is freely available on Amazon S3 in a Hadoop friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License.	aws.amazon.com
IMDB Reviews	This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets.	ai.stanford.edu
Machine Translation of Various Languages	This dataset consists of training data for four European languages	statmt.org
Multi-Domain Sentiment Dataset	A massive variety of Amazon products along with their corresponding reviews	www.cs.jhu.edu
Reuters News Dataset	Originally appearing in 1987, this dataset has been labeled, indexed, and compiled for use in machine learning.	archive.ics.uci.edu
Sentiment140	Sentiment140 allows you to discover the sentiment of a brand, product or topic on Twitter	help.sentiment140.com
Stanford Sentiment Treebank	Dataset for training a model to identify sentiment in longer phrases, built from 10,000+ Rotten Tomatoes reviews	nlp.stanford.edu
The WikiQA Corpus	This publicly-available Q&A dataset was initially compiled to aid in all open-domain question answering research.	www.microsoft.com
Twenty Newsgroups Dataset	This data set consists of 20000 messages taken from 20 newsgroups	archive.ics.uci.edu
Twitter US Airline Sentiment	Analyze how travelers in February 2015 expressed their feelings on Twitter	www.kaggle.com
UCI's Spambase Data set	This dataset was created by a team at HP (Hewlett-Packard) to help create a spam filter. It contains a large collection of emails previously labeled as spam by users.	archive.ics.uci.edu
Wikipedia Links Data	This Google dataset contains approximately 13 million documents, each containing at least one hyperlink to an English Wikipedia page	code.google.com
WordNet	WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.	wordnet.princeton.edu
Yelp Open Dataset	This Yelp dataset features 8.5M+ reviews of over 160,000 businesses. It also has 200,000+ pictures and spans 8 major metropolitan areas.	www.yelp.com
YouTubers-saying-things	Dataset containing popular YouTubers' video subtitles	www.kaggle.com

⬆️ Go back to index

🖼️ Image Datasets

Name	Description	URL
CIFAR-10	The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.	www.cs.toronto.edu
COCO (Common Objects in Context)	COCO is a large-scale object detection, segmentation, and captioning dataset.	cocodataset.org
Fashion-MNIST	Fashion-MNIST is a dataset consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.	github.com
ImageNet	ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images.	www.image-net.org
Open Images Dataset	Open Images is a dataset of ~9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives.	storage.googleapis.com
SVHN (Street View House Number)	SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting.	ufldl.stanford.edu
VisualQA	VQA is a new dataset containing open-ended questions about images. The questions require an understanding of vision, language and commonsense knowledge to answer.	visualqa.org
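
The sizes quoted for CIFAR-10 and Fashion-MNIST can be sanity-checked with a little arithmetic. The sketch below is only an illustration: it assumes three color channels for CIFAR-10 and one unsigned byte per pixel value, neither of which is stated in the table, and `raw_size_bytes` is a hypothetical helper, not part of either dataset's tooling.

```python
from math import prod

# Figures quoted in the table above. The channel count (3) is an assumption.
CIFAR10 = {"train": 50_000, "test": 10_000, "classes": 10,
           "per_class": 6_000, "shape": (32, 32, 3)}
FASHION_MNIST = {"train": 60_000, "test": 10_000, "shape": (28, 28)}


def raw_size_bytes(num_images, shape, bytes_per_value=1):
    """Uncompressed size, assuming one unsigned byte per pixel value."""
    return num_images * prod(shape) * bytes_per_value


cifar_total = CIFAR10["train"] + CIFAR10["test"]
# The train/test split should add up to classes x images per class.
assert cifar_total == CIFAR10["classes"] * CIFAR10["per_class"]

fashion_total = FASHION_MNIST["train"] + FASHION_MNIST["test"]

cifar_bytes = raw_size_bytes(cifar_total, CIFAR10["shape"])
fashion_bytes = raw_size_bytes(fashion_total, FASHION_MNIST["shape"])

print(f"CIFAR-10:      {cifar_total} images, {cifar_bytes / 1e6:.1f} MB raw")
print(f"Fashion-MNIST: {fashion_total} images, {fashion_bytes / 1e6:.1f} MB raw")
```

Under these assumptions both datasets fit comfortably in memory on a typical laptop; the files actually distributed may be smaller or larger depending on compression and format.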

⬆️ Go back to index

🎵 Audio Datasets

Name	Description	URL
Ballroom	This dataset provides information on ballroom dancing. Characteristic excerpts of many dance styles are provided in real audio format. Their tempi are also available.	mtg.upf.edu
FMA (Free Music Archive)	A Dataset for Music Analysis	github.com
Free Spoken Digit Dataset	A free audio dataset of spoken digits. Think MNIST for audio.	github.com
LibriSpeech	LibriSpeech is a corpus of approximately 1000 hours of 16 kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey.	www.openslr.org
Urban Sound 8K Dataset	This dataset contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes.	urbansounddataset.weebly.com
VoxCeleb	This is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube	www.robots.ox.ac.uk
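
To get a rough feel for scale before downloading, the durations and sample rates in the table can be turned into sample counts. This is a back-of-the-envelope sketch: the 16-bit mono encoding is an assumption (the table only gives the sample rate), and the Urban Sound 8K figure is an upper bound because excerpts are at most 4 seconds long.

```python
SECONDS_PER_HOUR = 3600

# LibriSpeech: approximately 1000 hours at 16 kHz (from the table above).
librispeech_hours = 1000
sample_rate_hz = 16_000
bytes_per_sample = 2  # assumption: 16-bit mono PCM, not stated in the table

librispeech_samples = librispeech_hours * SECONDS_PER_HOUR * sample_rate_hz
librispeech_raw_bytes = librispeech_samples * bytes_per_sample

# Urban Sound 8K: 8732 excerpts of at most 4 seconds each.
urbansound_excerpts = 8732
urbansound_max_seconds = urbansound_excerpts * 4
urbansound_max_hours = urbansound_max_seconds / SECONDS_PER_HOUR

print(f"LibriSpeech: {librispeech_samples:,} samples, "
      f"about {librispeech_raw_bytes / 1e9:.1f} GB as raw PCM")
print(f"Urban Sound 8K: at most {urbansound_max_hours:.1f} hours of audio")
```

The download size will differ from the raw estimate, since audio corpora are usually shipped in compressed formats.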
