Top Public Dataset Sources for Data Analysis and Machine Learning
Need data? We’ve got you covered. Explore a great collection of free datasets on everything from finance to space.
Collecting high-quality data is a fundamental prerequisite for starting any data analysis or machine learning project.
However, finding a truly thought-provoking dataset can be a burdensome and time-consuming process. To save your time for deriving insights from the data, the WebDataRocks team has prepared a carefully selected list of free repositories with real-world data, ready to boost your project.
Let’s start exploring them!
Contents:
- Socrata OpenData
- Kaggle
- FiveThirtyEight
- UCI Machine Learning Repository
- ProPublica
- Yelp
- InsideAirbnb
- data.world
- Data Hub: Collections
- Quandl
- NASA datasets
- Wikipedia
- The World Bank
- Data.gov
- Pew Research Center
- Google Dataset Search
- Google Public Datasets
- AWS Public Datasets
- Academic Torrents
Socrata OpenData
One of the largest and most powerful dataset search engines, hosting thousands of datasets on finance, infrastructure, transportation, environment, economy, and public safety. What is more, all the datasets are categorized using machine learning algorithms, which makes this platform even more intriguing.
Dig deeper to find the most challenging datasets for your work.
Developers will appreciate that Socrata OpenData exposes the Discovery API, a powerful way to access all the public data on the platform. Another convenience is that API calls return nested JSON objects that are easy to understand and parse.
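As a rough illustration of what working with the Discovery API might look like, here is a minimal Python sketch. The endpoint URL, the `q` and `limit` parameters, and the `results` / `resource` / `permalink` field names are assumptions based on the public Discovery API documentation, so check them against the current docs before relying on them. `SAMPLE_RESPONSE` is a hypothetical, trimmed payload used only to show the parsing step offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed catalog endpoint of the Socrata Discovery API.
DISCOVERY_URL = "https://api.us.socrata.com/api/catalog/v1"


def build_search_url(query, limit=5):
    """Build a catalog search URL for a free-text query."""
    return f"{DISCOVERY_URL}?{urlencode({'q': query, 'limit': limit})}"


def summarize(payload):
    """Flatten the nested JSON response into a list of small dictionaries."""
    rows = []
    for item in payload.get("results", []):
        resource = item.get("resource", {})
        rows.append(
            {
                "name": resource.get("name"),
                "id": resource.get("id"),
                "link": item.get("permalink"),
            }
        )
    return rows


def search(query, limit=5):
    """Call the API over the network and return summarized results."""
    with urlopen(build_search_url(query, limit), timeout=30) as response:
        return summarize(json.load(response))


# Hypothetical, trimmed response used to demonstrate parsing without a network call.
SAMPLE_RESPONSE = {
    "results": [
        {
            "resource": {"name": "Bike Lanes", "id": "abcd-1234"},
            "permalink": "https://example.org/d/abcd-1234",
        }
    ]
}

print(summarize(SAMPLE_RESPONSE))
```

Calling `search("bike lanes")` would perform the real request. The nested `resource` object is what the parsing step unpacks into flat rows.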
On top of that, there are a lot of examples of data visualization and short tutorials which allow you to explore data interactively with charts. Here you can also find wrappers for accessing features of Socrata OpenData from various server-side languages.
If you want to become a contributor, read the Publisher guide to learn how to upload your data.
Kaggle
Kaggle is one of the largest data science platforms and communities, with an impressive variety of datasets, competitions, and example data science projects. Besides its educational value, it gives you a chance to win cash prizes in competitions hosted by leading companies that want to understand their data better.
But competitions are more about the journey into the data science realm than about taking first place. Make use of every available opportunity to master the skills required for a career as a data scientist.
Note that this resource contains mostly cleaned data, especially when a dataset is part of a competition’s kernel. Datasets can be searched by tags.
To experience a competitive and challenging environment and test your strengths, try participating in one of the open competitions, or build visualizations and ML models around the hosted datasets.
FiveThirtyEight
Keen on the data-driven articles and graphics created by the writers of the FiveThirtyEight blog? Have a peek at the data behind the visualizations. You can download it from the online data collection or from the GitHub repository. You can also navigate straight to the article where each dataset was used.
Most of the visualizations that you can find here are interactive. And we encourage you to create your own variant of the analysis and visualization.
UCI Machine Learning Repository
A comprehensive platform that has hosted datasets for machine learning tasks for many years. Supported by the National Science Foundation, it is a classic place to start your machine learning path. Every dataset is well described: you can check its default task, attribute types, data types, and other features. Many of the datasets are quite small but still great for educational projects.
ProPublica
This American nonprofit organization is recognized for its noteworthy investigative journalism. It is also known for a versatile, frequently updated data repository that covers health, criminal justice, education, politics, business, transportation, and finance.
The collection contains both paid and free datasets. Paid datasets are available under academic, commercial, student, and journalist licenses.
ProPublica also makes the data easier to access by exposing five APIs that simplify retrieval.
Yelp
Have you been waiting for an opportunity to start a project but didn’t know where to begin?
Then don’t miss the chance to improve your research and analysis skills with Yelp, one more platform that provides ready-to-use data and encourages both newcomers and skilled data scientists to solve problems.
Not only can you participate in the challenges, but you can also win cash prizes.
After downloading and playing with the data, you can submit your project by filling out the application form. It can be presented in any format (a paper, video presentation, website, blog, etc.), as long as it shows how you used the data.
Do not pass this place by: it’s not only for students. Feel free to participate in challenges and discover your hidden talents.
InsideAirbnb
An independent data project that is not run by Airbnb itself. It hosts a unique collection of data about Airbnb listings, categorized by regions and countries. You can browse data for your particular city and explore insightful reports with creative visualizations. But we recommend getting the data and exploring it deeper with your favorite tools.
data.world
An open community for developers, data.world is a real treasure for everyone who is passionate about data analysis. More than 450 datasets for all tastes and purposes are freely available in the collection. Most of them come from the real world and, hence, require cleaning. Since cleaning data is an important stage of any data science project, this is a good opportunity to practice the skill.
Datasets cover finance, crime, economy, education, census, environment, energy, sports, NASA, and a lot more topics.
Besides, you can even contribute your own data.
Signing up is a piece of cake – just use your GitHub account to register and get access to all the datasets.
Working with data is easy as well – you can write SQL queries through the site interface, use SDKs for Python or R or simply download the data file.
Data Hub: Collections
A rich data catalog containing datasets on various topics: economics, climate, education, logistics, healthcare, and more. Each dataset’s page has embedded visualizations built with Plotly, which give you a quick overview of data trends.
If you can’t find the data you are looking for, you can even request it for free.
You will be impressed by the variety of ways to integrate a dataset into the tool you are using. There are code snippets that show how to use the data with R, Pandas, Python, JavaScript, cURL, and data-cli. You can also simply download the datasets as CSV or JSON files.
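Once a CSV or JSON file is downloaded, loading it takes only a few lines. The sketch below is not one of Data Hub's own snippets. It is a minimal standard-library illustration, and the helper name `parse_records` and the sample data are hypothetical.

```python
import csv
import io
import json


def parse_records(text, fmt):
    """Turn the text of a downloaded CSV or JSON file into a list of dictionaries."""
    if fmt == "csv":
        # DictReader uses the header row as keys; every value stays a string.
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        data = json.loads(text)
        # Assumption: the file holds an array of objects; wrap a single object.
        return data if isinstance(data, list) else [data]
    raise ValueError(f"Unsupported format: {fmt}")


# Hypothetical sample content standing in for a downloaded file.
csv_text = "country,year\nFrance,2020\n"
print(parse_records(csv_text, "csv"))
```

Note that CSV values arrive as strings, so numeric columns need converting before analysis, while JSON keeps its native number types.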
Quandl
Quandl positions itself as a must-see platform for financial and economic data that helps power data-driven strategies. Here you can find both free and paid datasets. For data retrieval, Quandl provides a free-to-use API that acts as a single interface. You can also access the data from Python, R, and Ruby with the help of modules and packages. An add-in for Excel is available as well.
NASA datasets
Enthusiastic about space-related projects?
Then this repository is a real find for you. It contains Astrophysics, Heliophysics, Solar System Exploration data, and Image Resources.
Wikipedia
Surprised to see Wikipedia on the list? Yes, you can use it not only for educational purposes. Wikipedia also offers ways of downloading and querying data. You can read more about them in this guide.
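One way to query Wikipedia programmatically is the MediaWiki Action API. The following is a minimal sketch, assuming the `action=query` module with `prop=extracts` is available on English Wikipedia. The User-Agent string and the sample payload are hypothetical placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title):
    """Build a query URL asking for the plain-text intro of one article."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "exintro": 1,
        "explaintext": 1,
        "titles": title,
    }
    return f"{API_URL}?{urlencode(params)}"


def extract_intros(payload):
    """Map each returned page title to its intro text (empty if missing)."""
    pages = payload.get("query", {}).get("pages", {})
    return {page["title"]: page.get("extract", "") for page in pages.values()}


def fetch_intro(title):
    """Perform the network request; the User-Agent value is a placeholder."""
    request = Request(
        build_extract_url(title),
        headers={"User-Agent": "dataset-demo/0.1 (example script)"},
    )
    with urlopen(request, timeout=30) as response:
        return extract_intros(json.load(response))


# Hypothetical payload in the shape the API is assumed to return.
sample = {
    "query": {
        "pages": {
            "123": {
                "pageid": 123,
                "title": "Data set",
                "extract": "A data set is a collection of data.",
            }
        }
    }
}
print(extract_intros(sample))
```

For bulk analysis, the downloadable database dumps are usually a better fit than many individual API calls.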
The World Bank
A huge repository which provides free access to global development data. You can search datasets by countries, regions and economic or demographic indicators.
With the help of online data visualization tools, you can explore data interactively using charts, tables, and maps, and quickly build reports that you can style, share, and embed. Datasets are available as CSV, XML, and Excel files.
Data.gov
A repository of public datasets from US government agencies. Datasets on climate, consumers, education, ecosystems, energy, finance, manufacturing, and science are at your fingertips.
Datasets are available for public use but sometimes you have to agree to license agreements before downloading and using data.
Another great thing is that you can submit data stories to show the world how you use the data. There are also a lot of challenges you can participate in.
Pew Research Center
Pew Research Center is known for publishing survey reports and various kinds of analysis. Its researchers make the datasets behind the reports available to the public. Many of the datasets are provided as .sav files, so you should know how to use SPSS or R. With them you can discover religious, political, social, journalistic, and media trends.
Google Dataset Search
Dataset Search is a powerful search engine with a convenient interface through which you can access millions of datasets from around the world. This relatively new Google product is already popular with scientists, data journalists, and students who need to find scientific, social, environmental, or government data. A huge plus is that the volume of indexed data is growing fast.
After querying data, you’ll see the list of repositories, including academic ones, from which you can download it.
If you want to publish your data, follow these quality and technical guidelines, which explain how to describe your datasets.
In general, Google Dataset Search copes well with the goal of making data more accessible to everyone.
But what if you want to practice analyzing big data?
Google Public Datasets
Visit the Cloud Public Datasets Program catalog to find large and amazing datasets. All of them are stored in BigQuery and can be queried there. Although you pay for the queries you run on the data, the first 1 TB of queries each month is free.
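To stay within the free allowance, it can help to check how much data a query would scan before running it. Here is a minimal sketch that uses a dry run from the `google-cloud-bigquery` client library. It assumes the package is installed and credentials are configured. It also assumes the 1 TB allowance is counted as 2**40 bytes, so verify that figure against the current pricing page. The helper names are hypothetical.

```python
# Assumption: the free monthly allowance is treated as 1 TiB (2**40 bytes).
FREE_TIER_BYTES = 2**40


def estimate_query_bytes(sql):
    """Dry-run a query and return the number of bytes it would process.

    Requires the google-cloud-bigquery package and configured credentials;
    a dry run validates the query without running (or billing) it.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


def free_tier_left(bytes_used_this_month, query_bytes):
    """Return the free allowance remaining after the query, floored at zero."""
    return max(FREE_TIER_BYTES - bytes_used_this_month - query_bytes, 0)


# Example: a 1 GiB query on an untouched monthly allowance.
print(free_tier_left(0, 2**30))
```

Selecting only the columns you need, instead of `SELECT *`, is usually the simplest way to reduce the bytes a query scans.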
AWS Public Datasets
You can search datasets from the Amazon Web Services platform through the Registry of Open Data. Datasets are available in the public domain. Here you can also find a lot of fascinating use cases that may inspire you to start new scientific or enterprise projects. They cover how organizations use the data, how to implement recommender systems, how to predict stock prices, and more.
Besides, you can make your personal contribution by sharing data on AWS.
To start working with data, simply download it or get access from the cloud with the help of EC2 or Hadoop.
Academic Torrents
A distributed system that contains more than 45 TB of data for research. Pay attention to license terms: most datasets may be used only for non-commercial and educational purposes.
Here is the list of some popular datasets:
- ImageNet Large Scale Visual Recognition Challenge (V2017)
- AVA: A Large-Scale Database for Aesthetic Visual Analysis
- Google Open Images
Final words
You are more than welcome to explore the above-mentioned collections of data. To get an even more complete list of amazing datasets, we recommend referring to this GitHub page.
We hope you’ll find the perfect dataset for conducting data-driven research and satisfying your curiosity about trends in different areas of life.
Good luck with your data analysis and machine learning projects!
Tools for data visualization
To extract value from your data, you can try visualizing it with WebDataRocks Pivot Table and a charting library of your choice. These tutorials will help you get started:
- WebDataRocks Pivot Table with Google Charts: How to use
- How to visualize data with Pivot Table and FusionCharts
- WebDataRocks integration with Highcharts
To advance your programming and analytical skills, we recommend searching for courses and tutorials on GitConnected.