That’s where this article comes in. I am going to show you the data sets that come bundled with a Python package. Not only that, but I am also going to show you how to find them, what they contain, and what kinds of projects they are a good fit for.
There is a nice variety in the data sets, which opens the door to a wide range of data science techniques. But first, let’s figure out how to find them and what they are.
The first step is importing Seaborn; the package gives us access to all of the data sets. Next, we call for the names of those data sets and it returns a list — that looks like this.
The image above is the list of names. There are 18 in total, and each allows you to work with multiple techniques in an array of projects. However, some of them are easier to interpret than others. That is where we will focus now. Let’s start with learning how to load one of these in, and then we’ll move on to an explanation and recommended projects for each.
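In code, finding and loading the data sets looks like this. Note that `get_dataset_names()` and the first call to `load_dataset()` read from the seaborn-data GitHub repository, so they need an internet connection:

```python
import seaborn as sns

# List every data set bundled with seaborn
# (fetches the index from the seaborn-data repository online).
names = sns.get_dataset_names()
print(names)

# Load one of them into a pandas DataFrame by name.
anscombe = sns.load_dataset("anscombe")
print(anscombe.head())
```

Every data set in this article loads the same way: pass its name to `sns.load_dataset()` and you get back a pandas DataFrame.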
Anscombe’s quartet comprises four data sets that have nearly identical simple descriptive statistics — each has eleven points, and their means and variances are nearly identical. But they all have very different distributions and appear very different when graphed. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers and other influential observations on statistical properties. Both of these are important ideas to a good data scientist.
When graphed, they look like this:
The trend lines show a striking contrast to their points. Pay close attention to how their observations form unique patterns but the trend lines are all similar.
We can do a few things here:
- Clustering — build an algorithm that can detect patterns in the data and see if there is any similarity between points in datasets I, II, III, and IV.
- Logistic Regression — try and predict which point belongs to which data set
- Linear Regression — try and predict the missing value of x or y
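You can confirm the “similar trend lines, different distributions” claim numerically. Below is a quick least-squares fit per set, with the quartet’s values typed in by hand so the sketch runs even without seaborn installed (the same values come back from `sns.load_dataset("anscombe")`):

```python
import numpy as np

# Anscombe's quartet, entered by hand so this runs offline.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {}
for name, (x, y) in quartet.items():
    slope, intercept = np.polyfit(x, y, 1)  # least-squares trend line
    fits[name] = (round(slope, 2), round(intercept, 2))
print(fits)
```

All four fits land on roughly y = 0.50x + 3.00, even though the scatter plots look nothing alike.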
This data set covers test takers, whether their attention was divided or focused, and the score they received. The data set looks like this:
The data set is pretty simple to understand. Let’s talk about projects.
- Logistic Regression — we can try and predict whether the target variable, attention, is ‘focused’ or ‘divided.’
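A minimal sketch of that logistic regression, using a handful of made-up rows with the same columns as the real data set (the values are invented so the example runs offline):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for seaborn's "attention" data set:
# score, number of solutions, and the attention label.
df = pd.DataFrame({
    "score": [2.0, 3.0, 4.5, 5.5, 6.0, 7.5, 8.0, 9.0],
    "solutions": [1, 1, 2, 2, 3, 3, 3, 3],
    "attention": ["divided", "divided", "divided", "divided",
                  "focused", "focused", "focused", "focused"],
})

X = df[["score", "solutions"]]
y = df["attention"]

# Fit the classifier and check how well it separates the two labels.
model = LogisticRegression().fit(X, y)
print(model.score(X, y))
```

With the real data set you would also hold out a test split rather than scoring on the training rows.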
This data set is divided into different states in the U.S. Each state has the following features:
As you can see, the data set is flush with continuous variables. Ignoring the ‘abbrev’ feature which just tells us the state, we can use this data set to predict almost any feature.
- Multivariate Linear Regression — using the combination of features you can try to build a model that will predict any of the features.
- Clustering — grouping states together based on the features. An interesting analysis idea would be to look at the clusters to try and determine group models using outside research (ie. traffic laws).
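A clustering sketch along those lines, using random stand-in numbers rather than the real state figures (the shape mirrors the data set: 51 rows, a few continuous features):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric columns of "car_crashes"
# (e.g. total, speeding, alcohol, ins_premium).
rng = np.random.default_rng(0)
X = rng.normal(size=(51, 4))  # 51 "states", 4 continuous features

# Put the features on one scale before clustering, then group the states.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels[:10])
```

Scaling matters here: without it, whichever feature has the largest raw units dominates the distance calculation.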
Let’s take a look at the data set below:
In this data set, each observation is a unique diamond. Features include carat, color, cut, clarity, and dimensions (x,y,z). This is a well-known beginner data set that can be used in a variety of projects.
- Multivariate Linear Regression — there are a lot of continuous variables that could be predicted
- Multivariate Logistic Regression — try to predict either color, cut, or clarity using the other features
- Clustering — analyze clusters that come up and determine shared characteristics (ie. overpriced, fairly priced, etc)
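As a sketch of the regression idea, here is a linear model fit to diamonds-style features with an invented price rule (synthetic values, not the real data set):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic carat/depth/table values with a made-up price rule,
# standing in for sns.load_dataset("diamonds").
rng = np.random.default_rng(1)
carat = rng.uniform(0.2, 2.5, 200)
depth = rng.normal(61.8, 1.5, 200)
table = rng.normal(57.0, 2.0, 200)
price = 8000 * carat - 20 * depth + rng.normal(0, 200, 200)  # toy rule

# Fit price against the three continuous features.
X = np.column_stack([carat, depth, table])
model = LinearRegression().fit(X, price)
print(model.coef_.round(0))
```

The fitted coefficient on carat dwarfs the others, which mirrors how strongly carat drives price in the real data set.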
This data set is the first we are going to talk about that is time-related. It looks like this:
It is the monthly passenger count from January 1949 to December 1960. With some data wrangling, you can combine the year and month features into a single date, leaving you with a nice, even time step.
- Time Series — using the data, you can set out to predict the passenger counts at the later end of the time frame.
Be careful, there is a cyclical nature to this data set.
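The wrangling step mentioned above can look like this. A few rows stand in for the full 144-month series; the first values shown match the opening of the classic airline-passengers numbers:

```python
import pandas as pd

# Flights-style frame: year, abbreviated month name, passenger count.
df = pd.DataFrame({
    "year": [1949, 1949, 1949, 1950],
    "month": ["Jan", "Feb", "Mar", "Jan"],
    "passengers": [112, 118, 132, 115],
})

# Combine year and month into a single datetime column,
# then use it as the index to get a proper time series.
df["date"] = pd.to_datetime(df["year"].astype(str) + "-" + df["month"],
                            format="%Y-%b")
series = df.set_index("date")["passengers"].sort_index()
print(series)
```

With the full data set, this datetime index is what lets time-series tools resample, lag, and plot the cycle correctly.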
The Iris data set is another fundamental data set that most of us in data science have worked with in one way or another. It contains sepal and petal dimensions for three species of iris flowers.
- Multivariate Logistic Regression — using the dimensions above, you can try and predict the species of flower.
- Clustering — using the features, aside from species, you can try and create clusters and analyze them for patterns
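The same iris data also ships with scikit-learn, so a classification sketch can run without downloading seaborn’s copy:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the classic iris measurements and species labels.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Predict the species from the four sepal/petal dimensions.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

The species separate so cleanly that even this simple model scores very high, which is part of why iris is such a popular first data set.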
The MPG data set is a data frame full of cars with features and names attached. As you can see below, there is a mixture of variables that can be used for regression style models.
- Multivariate Linear Regression — trying to predict weight or displacement could be a worthwhile goal
- Multivariate Logistic Regression — the non-continuous variables include origin and model year
- Clustering — ignoring the name column, you can cluster the cars to determine which observations are the most similar. Interpreting the clusters would require some outside research.
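Before any of these models, it helps to split the columns by type, since the regression targets are the continuous columns and the logistic targets are the categorical ones. A sketch with a few made-up mpg-style rows:

```python
import pandas as pd

# Synthetic rows mimicking the mix of column types in sns.load_dataset("mpg").
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 26.0],
    "displacement": [307.0, 350.0, 97.0],
    "weight": [3504, 3693, 1835],
    "origin": ["usa", "usa", "europe"],
    "name": ["chevrolet chevelle", "buick skylark", "volkswagen 1131"],
})

# Continuous columns are candidates for linear regression targets;
# object columns are candidates for logistic regression targets.
numeric = df.select_dtypes("number")
categorical = df.select_dtypes("object")
print(list(numeric.columns), list(categorical.columns))
```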
The ‘planets’ data set is interesting if you are a fan of astronomy. It has a series of observations which are all exoplanets. Take a quick look at the data set itself:
As you can see, there are a number of features, not all of which are easy to understand. So let’s talk about the most important ones for a second.
- method — this reflects how the planet was detected. Be careful, there is a large class imbalance here.
- orbital period — how long the planet takes to complete one full orbit around its host star. Note: for some planets this could not be determined, thus the NaN values
- mass — the exoplanet’s mass
- distance — light-years from Earth. If you are unfamiliar, a light-year is the distance light travels in one year. Since nothing we know moves faster than light, these are enormous distances.
Now that you have some sort of understanding of the data set, let’s take a look at what we can do with it.
- Logistic Regression — you can try and predict the method of detection using the other variables. Note: large class imbalance here
- Clustering — you can do a deep dive and find patterns among the exoplanets. Techniques like PCA can help by reducing the feature space while preserving most of the variance.
- Linear Regression — you can determine whether there is a linear relationship between any of the variables and the mass of the exoplanet.
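Whichever model you pick, the NaN values noted above need handling first. A sketch with a few invented planets-style rows, showing one common approach:

```python
import pandas as pd

# Synthetic rows mimicking sns.load_dataset("planets"); the NaNs mirror
# the real data set's incomplete orbital_period and mass columns.
df = pd.DataFrame({
    "method": ["Radial Velocity", "Transit", "Imaging", "Transit"],
    "orbital_period": [269.3, 3.52, None, 4.13],
    "mass": [7.1, None, None, 0.02],
    "distance": [77.4, 56.9, 45.5, 98.0],
})

# Count the missing values per column, then keep only complete rows.
print(df.isna().sum())
clean = df.dropna()
print(len(clean))
```

Dropping rows is the bluntest option; with the real data set you might instead impute, or restrict the model to columns that are mostly complete.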
This data set is, in my honest opinion, the most well-known and most used — it is even the subject of an ongoing Kaggle competition. In this data set, you get a manifest of the passenger vessel that struck the infamous iceberg and sank on its way from Ireland to New York City.
There are many features here — whether they traveled alone, what passenger class they were in, gender, and even if they had siblings — but I will only recommend one as the target variable, for tradition’s sake. The ‘survived’ column records whether each passenger survived or not. It is interesting to think about the final product telling you which coefficients were most important for predicting whether a person lived or died.
- Logistic Regression — looking at the ‘survived’ column, you can build a model to predict which passenger will survive and which will not.
This will be a fun project and if you can master it, submit it to the Kaggle competition.
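A minimal sketch of that survival model, using a few invented rows with the column names seaborn’s titanic data set uses (the real data set has many more rows and features):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in rows with titanic-style columns.
df = pd.DataFrame({
    "survived": [1, 1, 0, 0, 1, 0, 0, 1],
    "pclass":   [1, 1, 3, 3, 2, 3, 2, 1],
    "sex":      ["female", "female", "male", "male",
                 "female", "male", "male", "female"],
    "age":      [29, 24, 35, 40, 19, 28, 45, 33],
    "fare":     [80.0, 72.5, 7.9, 8.1, 26.0, 7.8, 13.0, 90.0],
})

# One-hot encode sex so the model gets only numeric inputs.
X = pd.get_dummies(df[["pclass", "sex", "age", "fare"]], drop_first=True)
y = df["survived"]

model = LogisticRegression(max_iter=1000).fit(X, y)
# Inspect the coefficients to see which features mattered most.
print(dict(zip(X.columns, model.coef_[0].round(2))))
```

With the real data set, these coefficients are exactly the part worth interpreting: they tell you which passenger attributes pushed the predicted odds of survival up or down.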