Data Sources

Dataset Finders
Machine Learning Datasets
Computer Vision Datasets
Sentiment Analysis Datasets
Natural Language Processing Datasets
Self-driving (Autonomous Vehicles) Datasets
Clinical Datasets
Recommendation Systems Datasets

Dataset Finders

Dataset	Description
Kaggle	Kaggle hosts a large collection of datasets, suitable for everyone from enthusiasts to experts.
Google’s Dataset Search Engine	Similar to how Google Scholar works, Dataset Search lets you find datasets wherever they are hosted, whether it’s a publisher’s site, a digital library, or an author’s web page. It’s a phenomenal dataset finder, and it contains over 25 million datasets.
UCI Machine Learning Repository	The Machine Learning Repository at UCI provides an up-to-date resource for open-source datasets.
VisualData	Discover computer vision datasets by category; it allows searchable queries.
The Big Bad NLP Database	Datasets for various natural language processing tasks, created and curated by Quantum Stat.
Awesome Public Datasets Collection	A topic-centric collection of high-quality public data sources, collected and tidied from blogs, answers, and user responses. It covers a large number of data sources and categories.
CMU Libraries	Discover high-quality datasets thanks to the collection of Huajin Wang, at CMU.
Reddit	Datasets provided by the Reddit community.

Machine Learning Datasets

Dataset	Description
Mall Customers Dataset	The Mall customers dataset contains information about people visiting the mall in a particular city. The dataset consists of various columns like gender, customer id, age, annual income, and spending score. It’s generally used to segment customers based on their age, income, and interest.
IRIS Dataset	The iris dataset is a simple, beginner-friendly dataset that contains the sepal and petal length and width of iris flowers. The data is divided into three classes, with 50 rows in each class. It’s generally used for classification and regression modeling.
MNIST Dataset	This is a database of handwritten digits. It contains 60,000 training images and 10,000 testing images. This is a perfect dataset to start implementing image classification where you can classify a digit from 0 to 9.
Boston Housing Dataset	Contains information collected by the US Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the StatLib archive and has been used extensively throughout the literature to benchmark algorithms.
Fake News Detection Dataset	It is a CSV file that has 7796 rows with four columns: news, title, news text, result.
Wine quality dataset	The dataset contains different chemical information about the wine. The dataset is suitable for classification and regression tasks.
SOCR data — Heights and Weights Dataset	This is a basic dataset for beginners. It contains only the height and weights of 25,000 different humans of 18 years of age. This dataset can be used to build a model that can predict the height or weight of a human.
Credit Card Fraud Detection Dataset	The dataset contains transactions made by credit cards; they are labeled as fraudulent or genuine. This is important for companies that have transaction systems to build a model for detecting fraudulent activities.
Google-Landmarks-v2	An improved dataset for landmark recognition and retrieval. This dataset contains 5M+ images of 200k+ landmarks from across the world, sourced and annotated by the Wiki Commons community.
Stock Prices	Stock Prices stores historical data about daily stock prices, dividends, and splits for US stocks.
Restaurants Health Score	Restaurants Health Score in San Francisco developed by the local Health Department provides interesting material for researchers interested in public health and restaurant business.
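
Several entries above are small and balanced, such as the Iris data with three classes of 50 rows each. For data like that, a per-class (stratified) split keeps the class proportions the same in the training and test partitions. The following is a minimal sketch using placeholder labels rather than the real measurements; `stratified_split` and the class names are hypothetical names chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so every class keeps the same test share."""
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Placeholder labels mimicking the layout described above: 3 classes x 50 rows.
labels = ["class_a"] * 50 + ["class_b"] * 50 + ["class_c"] * 50
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), len(test_idx))  # 120 30
```

With a real copy of the data, the same index lists can be used to slice the feature rows, so the split logic stays independent of how the file is loaded.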

Computer Vision Datasets

Dataset	Description
Google’s Open Images	A vast dataset from Google AI containing over 10 million images.
xView	xView is one of the largest publicly available datasets of overhead imagery. It contains images from complex scenes around the world, annotated using bounding boxes.
ImageNet	One of the largest image datasets for computer vision. It provides an accessible image database that is organized hierarchically, according to WordNet.
Kinetics-700	A large-scale dataset of YouTube video URLs covering human-centered actions. It contains over 700,000 videos.
Cityscapes Dataset	This is an open-source dataset for computer vision projects. It contains high-quality pixel-level annotations of video sequences recorded in street scenes from 50 different cities. The dataset is useful for semantic segmentation and for training deep neural networks to understand urban scenes.
IMDB-Wiki dataset	The IMDB-Wiki dataset is one of the most extensive open-source datasets of face images with gender and age labels. The images are collected from IMDB and Wikipedia. It has more than 500,000 labeled images.
Color Detection Dataset	The dataset contains a CSV file that has 865 color names with their corresponding RGB (red, green, and blue) values. It also has the hexadecimal value of each color.
Stanford Dogs Dataset	It contains 20,580 images and 120 different dog breed categories.
CIFAR-10	CIFAR-10 is a collection of images for training deep learning computer vision algorithms. The dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images in each class. If this is not enough, try the CIFAR-100 dataset.
COCO	COCO is a regularly updated DB for object segmentation and recognition in context, sponsored by Microsoft, Facebook, and Mighty AI.
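
The Color Detection entry above pairs color names with RGB and hexadecimal values. A common use for such a table is to map an arbitrary pixel to the closest named color. The sketch below shows one way to do that with squared Euclidean distance in RGB space; the five-row `SAMPLE_CSV` and its column names are assumptions made for illustration and may not match the layout of the actual file.

```python
import csv
import io

# Tiny stand-in for the color CSV; the column layout here is an assumption.
SAMPLE_CSV = """name,hex,r,g,b
Red,#ff0000,255,0,0
Green,#00ff00,0,255,0
Blue,#0000ff,0,0,255
White,#ffffff,255,255,255
Black,#000000,0,0,0
"""


def load_colors(text):
    """Parse CSV text into a list of (name, (r, g, b)) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (row["name"], (int(row["r"]), int(row["g"]), int(row["b"])))
        for row in reader
    ]


def rgb_to_hex(rgb):
    """Format an (r, g, b) tuple as a lowercase #rrggbb string."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def nearest_color(rgb, colors):
    """Return the name whose RGB value is closest to `rgb`."""
    def squared_distance(entry):
        _, other = entry
        return sum((a - b) ** 2 for a, b in zip(rgb, other))

    return min(colors, key=squared_distance)[0]


colors = load_colors(SAMPLE_CSV)
print(rgb_to_hex((250, 10, 20)))             # #fa0a14
print(nearest_color((250, 10, 20), colors))  # Red
```

Distance in plain RGB space is a simplification; it does not match perceived color difference exactly, but it is usually adequate for a beginner project.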

Sentiment Analysis Datasets

Dataset	Description
Lexicoder Sentiment Dictionary	This dictionary is built specifically for sentiment analysis. It contains roughly 2,900 negative and 1,700 positive sentiment words.
IMDB reviews	An interesting dataset with over 50,000 movie reviews from Kaggle.
Stanford Sentiment Treebank	Standard sentiment dataset with sentiment annotations.
Twitter US Airline Sentiment	Twitter data on US airlines from February 2015, classified as positive, negative, and neutral tweets.
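
A sentiment dictionary such as the Lexicoder entry above is typically applied by counting how many positive and negative words a text contains. The sketch below illustrates that idea with two tiny, hypothetical word sets (`POSITIVE` and `NEGATIVE` are placeholders, not taken from any real dictionary) and maps the score to the positive, negative, and neutral labels used by the airline tweets.

```python
import re

# Tiny hypothetical word lists standing in for a real sentiment dictionary.
POSITIVE = {"good", "great", "excellent", "love", "friendly"}
NEGATIVE = {"bad", "terrible", "delayed", "lost", "rude"}


def lexicon_score(text):
    """Count positive words minus negative words in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(token in POSITIVE for token in tokens)
    negative_hits = sum(token in NEGATIVE for token in tokens)
    return positive_hits - negative_hits


def classify(text):
    """Map the lexicon score to one of three sentiment labels."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("Great crew and friendly service"))     # positive
print(classify("Flight delayed and my bag was lost"))  # negative
print(classify("Boarding starts at noon"))             # neutral
```

This approach ignores negation ("not good") and word order, which is one reason labeled datasets like the ones above are used to train and evaluate stronger models.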

Natural Language Processing Datasets

Dataset	Description
IMDB reviews	The large movie review dataset consists of movie reviews from the IMDB website, with 25,000 reviews for training and 25,000 for testing.
UCI Spambase Dataset	Classifying emails as spam or non-spam is a prevalent and useful task. The dataset contains 4601 emails, each described by 57 attributes. You can build models to filter out the spam.
Recommender Systems Dataset	It contains various datasets from popular websites like Goodreads book reviews, Amazon product reviews, bartending data, data from social media, and others that are used in building a recommender system.
Enron Email Dataset	It contains around 0.5 million emails from about 150 users.
SMS Spam Collection in English	A dataset that consists of 5,574 English SMS messages, each labeled as spam or legitimate (ham).
Rotten Tomatoes Reviews	Archive of more than 480,000 critic reviews (fresh or rotten).
Amazon Reviews	A vast dataset from Amazon, containing over 45 million Amazon reviews.
HotpotQA Dataset	Question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems.
VoxCeleb	VoxCeleb is an audio collection that you can use for deep learning tasks such as real-time natural language processing, voice recognition, and speech generation.
LibriSpeech	LibriSpeech contains about 1000 hours of 16kHz read English speech derived from audiobooks.
Free Spoken Digit Dataset	Free Spoken Digit Dataset consists of spoken digit recordings at 8kHz. The recordings are trimmed so that they have near minimal silence at the beginnings and ends. The dataset is open-source.
Common Voice	Common Voice is an initiative by Mozilla that contains hundreds of thousands of recordings of human voices. Every visitor to the Common Voice website can contribute to the open speech database by recording their own voice.
WordNet	WordNet is a lexical database that contains all parts of speech grouped into sets of synonyms. Such a structure makes it a fantastic tool for natural language processing and linguistic research.
Yelp Reviews	Yelp Reviews contains user reviews, business information, and images that you can use for personal and academic purposes.
20 Newsgroups	20 Newsgroups is a dataset that consists of 18,000+ text documents from 20 different newsgroups including sports, technology, art, entertainment, etc.
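
Text classification datasets like the spam collections above pair each message with a label, and a useful first step is to check how balanced the classes are. The sketch below assumes a tab-separated `label<TAB>text` layout with `ham` and `spam` labels; that layout and the three sample lines are assumptions for illustration, so check the actual download before relying on them.

```python
from collections import Counter

# Hypothetical sample lines; real files may use a different layout.
SAMPLE = (
    "ham\tAre we still meeting at 6?\n"
    "spam\tWIN a free prize now, reply YES\n"
    "ham\tRunning late, start without me\n"
)


def parse_labeled_lines(text):
    """Split each non-empty line into a (label, message) pair."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        label, message = line.split("\t", 1)
        records.append((label, message))
    return records


records = parse_labeled_lines(SAMPLE)
counts = Counter(label for label, _ in records)
spam_share = counts["spam"] / len(records)
print(counts["ham"], counts["spam"], round(spam_share, 2))  # 2 1 0.33
```

Knowing the spam share up front helps when choosing an evaluation metric, since plain accuracy can look good on a heavily imbalanced dataset even when the model misses most spam.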

Self-driving (Autonomous Vehicles) Datasets

Dataset	Description
Waymo Open Dataset	This is a fantastic dataset resource from the folks at Waymo. It includes a vast amount of autonomous driving data, enough to train deep nets from scratch.
Berkeley DeepDrive BDD100k	One of the largest datasets for self-driving cars, containing over 1,100 hours of driving experience across New York and California.
Bosch Small Traffic Light Dataset	Dataset for small traffic lights for deep learning.
WPI datasets	Datasets for traffic lights, pedestrian, and lane detection.
Comma.ai	It contains details such as a car’s speed, acceleration, steering angle, and GPS coordinates.
LISA: Laboratory for Intelligent & Safe Automobiles, UC San Diego Datasets	This dataset includes traffic signs, vehicle detection, traffic lights, and trajectory patterns.
Cityscapes Dataset	This is an extensive dataset that has street scenes in 50 different cities.

Clinical Datasets

Dataset	Description
MaskedFace-Net	MaskedFace-Net is a dataset of human faces with correctly and incorrectly worn masks. It contains over 137k images, which are based on the Flickr-Faces-HQ dataset.
COVID-19 Dataset	The Allen Institute for AI has released a vast research dataset of over 45,000 scholarly articles about COVID-19.
MIMIC-III	Openly available dataset developed by the MIT Lab for Computational Physiology, comprising de-identified health data associated with ~40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more.

Recommendation Systems Datasets

Dataset	Description
MovieLens	It contains rating data sets from the MovieLens web site.
Jester	It contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users. It’s mostly used for collaborative filtering.
Million Song Dataset	It can be used for both collaborative and content-based filtering.
Amazon Product Data	Amazon Product Data contains metadata and reviews on millions of items sold on Amazon. This is an incredible resource for anyone interested in recommender systems.
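
Rating datasets like the ones above are commonly used for collaborative filtering, where a missing rating is estimated from the ratings of users with similar taste. The sketch below shows a user-based variant with cosine similarity on a signed -10 to +10 scale, as in the Jester description. The `ratings` dictionary, user IDs, and item IDs are invented for illustration and are not taken from any of the datasets.

```python
import math

# Hypothetical ratings on a -10.0..+10.0 scale: user -> {item_id: rating}.
ratings = {
    "u1": {"j1": 8.0, "j2": -3.0, "j3": 5.0},
    "u2": {"j1": 7.0, "j2": -4.0, "j4": 6.0},
    "u3": {"j1": -9.0, "j2": 6.0, "j4": -8.0},
}


def cosine(a, b):
    """Cosine similarity over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    numerator = 0.0
    denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine(ratings[user], other_ratings)
        numerator += similarity * other_ratings[item]
        denominator += abs(similarity)
    return numerator / denominator if denominator else 0.0


print(round(predict("u1", "j4"), 2))
```

Because the scale is signed, a user with opposite taste gets a negative similarity, so their strong dislike of an item pushes the prediction upward. On a real dataset the loop over all users would be replaced by a nearest-neighbor search or a matrix factorization model.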