
Data Sources


Dataset Finders

Dataset Description
Kaggle Kaggle hosts a large collection of datasets, suited to everyone from enthusiasts to experts.
Google’s Dataset Search Engine Similar to how Google Scholar works, Dataset Search lets you find datasets wherever they are hosted, whether it’s a publisher’s site, a digital library, or an author’s web page. It’s a phenomenal dataset finder, and it contains over 25 million datasets.
UCI Machine Learning Repository The Machine Learning Repository at UCI provides an up-to-date resource for open-source datasets.
VisualData Discover computer vision datasets by category; it allows searchable queries.
The Big Bad NLP Database Datasets for various natural language processing tasks, created and curated by Quantum Stat.
Awesome Public Datasets Collection This is a topic-centric collection of high-quality public data sources, collected and tidied from blogs, answers, and user responses. It contains a large number of data sources and categories.
CMU Libraries Discover high-quality datasets thanks to the collection of Huajin Wang, at CMU.
Reddit Datasets provided by the Reddit Community


Machine Learning Datasets

Dataset Description
Mall Customers Dataset The Mall customers dataset contains information about people visiting the mall in a particular city. The dataset consists of various columns like gender, customer id, age, annual income, and spending score. It’s generally used to segment customers based on their age, income, and interest.
IRIS Dataset The iris dataset is a simple, beginner-friendly dataset that contains the sepal and petal lengths and widths of iris flowers. The data is divided into three classes, with 50 rows in each class. It’s generally used for classification and regression modeling.
MNIST Dataset This is a database of handwritten digits. It contains 60,000 training images and 10,000 testing images. This is a perfect dataset to start implementing image classification where you can classify a digit from 0 to 9.
Boston Housing Dataset Contains information collected by the US Census Service concerning housing in the area of Boston Mass. It was obtained from the StatLib archive and has been used extensively throughout the literature to benchmark algorithms.
Fake News Detection Dataset It is a CSV file with 7796 rows and four columns: news, title, news text, and result.
Wine quality dataset The dataset contains different chemical information about the wine. The dataset is suitable for classification and regression tasks.
SOCR data — Heights and Weights Dataset This is a basic dataset for beginners. It contains only the height and weights of 25,000 different humans of 18 years of age. This dataset can be used to build a model that can predict the height or weight of a human.
Credit Card Fraud Detection Dataset The dataset contains transactions made by credit cards; they are labeled as fraudulent or genuine. This is important for companies that have transaction systems to build a model for detecting fraudulent activities.
Google-Landmarks-v2 An improved dataset for landmark recognition and retrieval. This dataset contains 5M+ images of 200k+ landmarks from across the world, sourced and annotated by the Wikimedia Commons community.
Stock Prices Stock Prices stores historical data about daily stock prices, dividends, and splits for US stocks.
Restaurants Health Score Restaurants Health Score in San Francisco developed by the local Health Department provides interesting material for researchers interested in public health and restaurant business.
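
After downloading a tabular dataset such as Iris, a quick check of the class balance can confirm that the file matches its description (three classes with 50 rows each, in the Iris case). The sketch below uses only Python's standard library. The `count_classes` helper, the `species` column name, and the four inline sample rows are illustrative assumptions; a real copy of the file may use different column names.

```python
import csv
import io
from collections import Counter


def count_classes(csv_text, label_column):
    """Count how many rows fall into each class of a CSV that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_column] for row in reader)


# Tiny illustrative sample; a full Iris file would have 150 rows.
SAMPLE = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

counts = count_classes(SAMPLE, "species")
for species, n in sorted(counts.items()):
    print(species, n)
```

On a complete file, each class count should come out equal; an uneven result usually points to a truncated download or a different variant of the dataset.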


Computer Vision Datasets

Dataset Description
Google’s Open Images A vast dataset from Google AI containing over 10 million images.
xView xView is one of the most massive publicly available datasets of overhead imagery. It contains images from complex scenes around the world, annotated using bounding boxes.
ImageNet One of the largest image datasets for computer vision. It provides an accessible image database that is organized hierarchically, according to WordNet.
Kinetics-700 A large-scale dataset of YouTube video URLs covering human-centered actions. It contains over 700,000 videos.
Cityscapes Dataset This is an open-source dataset for Computer Vision projects. It contains high-quality pixel-level annotations of video sequences taken in 50 different city streets. The dataset is useful in semantic segmentation and training deep neural networks to understand the urban scene.
IMDB-Wiki dataset The IMDB-Wiki dataset is one of the most extensive open-source datasets of face images labeled with gender and age. The images are collected from IMDB and Wikipedia. It has more than 500,000 labeled images.
Color Detection Dataset The dataset contains a CSV file that lists 865 color names with their corresponding RGB (red, green, and blue) values and hexadecimal codes.
Stanford Dogs Dataset It contains 20,580 images and 120 different dog breed categories.
CIFAR-10 CIFAR-10 is a collection of images for training deep learning computer vision algorithms. The data bank consists of 60000 32x32 color images in 10 classes, 6000 images in each class. If this is not enough, try the CIFAR-100 dataset.
COCO COCO is a regularly updated DB for object segmentation and recognition in context, sponsored by Microsoft, Facebook, and Mighty AI.
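
The CIFAR-10 figures quoted above translate directly into a per-class count and a raw storage estimate, which helps when deciding whether a dataset fits in memory. The sketch below is simple arithmetic; the assumption of three color channels at one byte each is ours and is not stated in the description.

```python
# Figures taken from the CIFAR-10 description above.
NUM_IMAGES = 60_000
NUM_CLASSES = 10
HEIGHT, WIDTH = 32, 32
CHANNELS = 3  # assumption: RGB color, one byte per channel

images_per_class = NUM_IMAGES // NUM_CLASSES
bytes_per_image = HEIGHT * WIDTH * CHANNELS
total_bytes = NUM_IMAGES * bytes_per_image

print(images_per_class)   # images in each class
print(bytes_per_image)    # raw bytes for one uncompressed image
print(round(total_bytes / 1024**2, 1), "MiB uncompressed")
```

The same calculation applies to any fixed-size image dataset by changing the constants.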


Sentiment Analysis Datasets

Dataset Description
Lexicoder Sentiment Dictionary This dataset is specific for sentiment analysis. The dataset contains over 3000 negative words and over 2000 positive sentiment words.
IMDB reviews An interesting dataset with over 50,000 movie reviews from Kaggle.
Stanford Sentiment Treebank Standard sentiment dataset with sentiment annotations.
Twitter US Airline Sentiment Twitter data on US airlines from February 2015, classified as positive, negative, and neutral tweets
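
A sentiment dictionary such as the one described above is typically used by counting positive and negative word matches in a text. The following is a minimal sketch of that idea, producing the same three labels used in the airline tweets dataset. The word lists, `lexicon_score`, and `label` are hypothetical and stand in for a real lexicon with thousands of entries.

```python
import re

# Hypothetical miniature lexicon; a real dictionary has thousands of entries.
POSITIVE = {"good", "great", "love", "friendly"}
NEGATIVE = {"bad", "late", "lost", "rude"}


def lexicon_score(text):
    """Return the number of positive word matches minus negative word matches."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg


def label(text):
    """Map the lexicon score to positive, negative, or neutral."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label("Great crew, friendly and on time"))
print(label("Flight was late and they lost my bag"))
print(label("Boarding at gate 12"))
```

This approach ignores negation and context, so it is best treated as a baseline to compare against models trained on the labeled datasets listed here.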


Natural Language Processing Datasets

Dataset Description
IMDB reviews The Large Movie Review Dataset consists of movie reviews from the IMDB website, with 25,000 reviews for training and 25,000 for testing.
UCI Spambase Dataset Classifying emails as spam or non-spam is a prevalent and useful task. The dataset contains 4601 emails, each described by 57 attributes. You can use it to build models that filter out spam.
Recommender Systems Dataset It contains various datasets from popular websites like Goodreads book reviews, Amazon product reviews, bartending data, data from social media, and others that are used in building a recommender system.
Enron Email Dataset It contains around 0.5 million emails from over 150 users.
SMS Spam Collection in English A dataset that consists of 5,574 English SMS messages, each labeled as ham (legitimate) or spam.
Rotten Tomatoes Reviews Archive of more than 480,000 critic reviews (fresh or rotten).
Amazon Reviews A vast dataset from Amazon, containing over 45 million Amazon reviews.
HotpotQA Dataset Question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems.
VoxCeleb VoxCeleb is an audio collection that you can use for deep learning tasks such as real-time natural language processing, voice recognition, and speech generation.
LibriSpeech LibriSpeech offers about 1000 hours of read English speech, sampled at 16 kHz and derived from audiobooks.
Free Spoken Digit Dataset Free Spoken Digit Dataset consists of spoken digit recordings at 8 kHz, trimmed so that they have near-minimal silence at the beginning and end. The dataset is open-source.
Common Voice Common Voice is an initiative by Mozilla that contains hundreds of thousands of recordings of human voices. Every visitor to the Common Voice website can contribute to the open speech database by recording their own voice.
WordNet WordNet is a lexical database in which nouns, verbs, adjectives, and adverbs are grouped into sets of synonyms (synsets). This structure makes it a valuable tool for natural language processing and linguistic research.
Yelp Reviews Yelp Reviews contains user reviews, business information, and images that you can use for personal and academic purposes.
20 Newsgroups 20 Newsgroups is a dataset that consists of 18,000+ text documents from 20 different newsgroups including sports, technology, art, entertainment, etc.
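
Labeled text collections such as the SMS Spam Collection are often distributed as one example per line. The sketch below assumes each line holds a label (`ham` or `spam`), a tab character, and the message text; that layout is an assumption to verify against the copy you download. The `parse_sms_lines` helper and the sample lines are made up for illustration.

```python
from collections import Counter


def parse_sms_lines(lines):
    """Split 'label<TAB>message' lines into (label, message) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, _, message = line.partition("\t")
        pairs.append((label, message))
    return pairs


# Made-up example lines, not taken from the real corpus.
SAMPLE_LINES = [
    "ham\tSee you at lunch?\n",
    "spam\tYou have won a prize, reply now\n",
    "ham\tRunning late, start without me\n",
]

pairs = parse_sms_lines(SAMPLE_LINES)
label_counts = Counter(label for label, _ in pairs)
print(label_counts)
```

Counting labels first is a cheap way to see how imbalanced the classes are before training a spam filter.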


Self Driving (Autonomous Vehicles) Datasets

Dataset Description
Waymo Open Dataset This is a fantastic dataset resource from the folks at Waymo. It includes a vast autonomous driving dataset, large enough to train deep nets from scratch.
Berkeley DeepDrive BDD100k One of the largest datasets for self-driving cars, containing over 1,100 hours of driving experiences across New York and California.
Bosch Small Traffic Light Dataset Dataset for small traffic lights for deep learning.
WPI datasets Datasets for traffic lights, pedestrian, and lane detection.
Comma.ai It contains details such as a car’s speed, acceleration, steering angle, and GPS coordinates.
LISA: Laboratory for Intelligent & Safe Automobiles, UC San Diego Datasets This dataset includes traffic signs, vehicle detection, traffic lights, and trajectory patterns.
Cityscapes Dataset This is an extensive dataset of street scenes recorded in 50 different cities.


Clinical Datasets

Dataset Description
MaskedFace-Net MaskedFace-Net is a dataset of human faces with correctly and incorrectly worn masks. It contains over 137k images, which are based on the Flickr-Faces-HQ dataset.
COVID-19 Dataset The Allen Institute for AI has released a large research dataset of over 45,000 scholarly articles about COVID-19.
MIMIC-III Openly available dataset developed by the MIT Lab for Computational Physiology, comprising de-identified health data associated with ~40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more.


Recommendation Systems Datasets

Dataset Description
MovieLens It contains rating data sets from the MovieLens web site.
Jester It contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users. It’s mostly used for collaborative filtering.
Million Song Dataset It can be used for both collaborative and content-based filtering.
Amazon Product Data Amazon Product Data contains metadata and reviews on millions of items sold on Amazon. This is an incredible resource for anyone interested in recommender systems.
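
Collaborative filtering, mentioned above for Jester and the Million Song Dataset, predicts a user's rating from the ratings of similar users. The following is a minimal user-based sketch, assuming cosine similarity over co-rated items and a similarity-weighted average; it is one common variant, not the method of any particular dataset's authors. The `RATINGS` values, user names, and joke identifiers are invented and only mimic a -10.0 to +10.0 scale.

```python
import math

# Hypothetical ratings on a Jester-style -10.0 to +10.0 scale.
RATINGS = {
    "ann": {"joke1": 8.0, "joke2": -4.0, "joke3": 6.0},
    "bob": {"joke1": 7.0, "joke2": -5.0},
    "cat": {"joke1": -6.0, "joke2": 9.0, "joke3": -8.0},
}


def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict(user, item, ratings):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += abs(sim)
    return num / den if den else 0.0


# bob has not rated joke3; estimate it from ann and cat.
print(round(predict("bob", "joke3", RATINGS), 2))
```

Because bob's tastes track ann's and run opposite to cat's, the estimate comes out positive. On a real rating matrix the same loop works, but it is usually replaced by vectorized or model-based methods for speed.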