In my previous blog post, we went through the fundamentals of data analysis. Python has plenty of libraries that help in analyzing data in depth. Features are critical in a data set because, as in Machine Learning, we are trying to find a pattern between them. Now comes the question: what is a dependent variable and what is an independent variable? Just as it sounds, the dependent variable is the output of the process and the independent variables are the inputs to the process. For example, in the data set below, “Species” is the dependent variable and the remaining variables are considered independent variables.
Independent variables are also known as “predictors”. Dependent variables are also known as “response” or “target” variables.
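As a minimal sketch of that split, the snippet below separates a tiny hand-made frame into predictors and a target. The rows are Iris-like values typed in for illustration (not the Kaggle file itself), and the names `X` and `y` are only a common convention, not something the data set prescribes.

```python
import pandas as pd

# Iris-like rows typed in for illustration only (not the real Kaggle file).
data = pd.DataFrame({
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "SepalWidthCm": [3.5, 3.2, 3.3],
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "PetalWidthCm": [0.2, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Independent variables (predictors): every column except the target.
X = data.drop(columns=["Species"])

# Dependent variable (response / target).
y = data["Species"]

print(X.columns.tolist())
print(y.tolist())
```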
For this article we will use a public Kaggle data set. It is called Iris, and it is a very common data set used for practice. You can find the data set at the link below:
Let us take a look at the data set and its information. This Iris data consists of six columns, which are:
- ID: Identification Number
- SepalLengthCm: Length of Sepal in cm
- SepalWidthCm: Width of Sepal in cm
- PetalLengthCm: Length of Petal in cm
- PetalWidthCm: Width of Petal in cm
- Species: Type of Species
For our analysis we removed the ID column and renamed the columns. We can drop a column using
data = data.drop('name', axis=1)
This is one way to do it; note that drop returns a new data frame, so the result has to be assigned. If we know the column names in advance, we can also choose which columns go into our data frame while reading the file:
df = pd.read_csv("sampleData.csv", usecols=['Col1', 'Col2'])
Let us see how the data frame looks now.
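Here is a small sketch of both approaches on an in-memory stand-in for the CSV file. The three rows are typed in for illustration, the identifier column is assumed to be called `Id`, and the dotted `Sepal.*` names are an assumption too (only `Petal.Width` and `Petal.Length` appear in the plots of this article).

```python
import io
import pandas as pd

# A tiny in-memory stand-in for the CSV file (illustrative rows only).
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,7.0,3.2,4.7,1.4,Iris-versicolor
3,6.3,3.3,6.0,2.5,Iris-virginica
"""

# Option 1: read everything, then drop the identifier column.
data = pd.read_csv(io.StringIO(csv_text)).drop("Id", axis=1)

# Option 2: skip the identifier column at read time with usecols.
wanted = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm", "Species"]
data_alt = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

# Rename to dotted names (the Sepal.* names are assumed for illustration).
data = data.rename(columns={
    "SepalLengthCm": "Sepal.Length",
    "SepalWidthCm": "Sepal.Width",
    "PetalLengthCm": "Petal.Length",
    "PetalWidthCm": "Petal.Width",
})

print(data.columns.tolist())
```

Both options end up with the same five columns; `usecols` simply avoids loading a column that would be thrown away anyway.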
Import necessary packages
These are the packages that we need to import first:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from pandas.plotting import scatter_matrix
import seaborn as sns
import plotly.io as pio
First, I am going to get the total count of data for each Species:
sns.catplot(x='Species', data=data, kind='count', aspect=1.2)
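If you want the same counts as numbers rather than bars, pandas can produce them directly. This is a small sketch on made-up labels; the real file has many more rows.

```python
import pandas as pd

# Made-up labels for illustration only.
data = pd.DataFrame({
    "Species": ["setosa", "setosa", "versicolor",
                "virginica", "virginica", "virginica"],
})

# Number of rows per species, i.e. the heights of the bars in a count plot.
counts = data["Species"].value_counts()
print(counts)
```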
If we are given a data set, it is important that we understand the relationships between the variables. How does one variable change as another changes? That relationship is known as correlation. Correlation can be either:
- Positive: the values increase together
- Negative: one value decreases as the other increases
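To make the sign of a correlation concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand with NumPy. The helper `pearson_r` and the three small lists are hypothetical and exist only for this illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-variation scaled by the spread of each variable."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical values for illustration.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 65, 70, 78]   # rises as hours_studied rises
errors_made = [9, 7, 6, 4, 2]       # falls as hours_studied rises

r_pos = pearson_r(hours_studied, exam_score)
r_neg = pearson_r(hours_studied, errors_made)

print(round(r_pos, 3), round(r_neg, 3))
```

The coefficient always lies between -1 and 1: values near 1 mean the two variables rise together, and values near -1 mean one falls as the other rises.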
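plt.rcParams['figure.figsize'] = (10, 7)
# Correlation is only defined for numeric columns, so leave out Species
corre = data.select_dtypes('number').corr()
fig = plt.figure()
ax = fig.add_subplot()
cat = ax.matshow(corre, vmin=-1, vmax=1)
ticks = np.arange(0, 4)
ax.set_xticks(ticks)
ax.set_yticks(ticks)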
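It can help to look at the numbers behind the colored grid. The sketch below builds the correlation matrix for a handful of Iris-like rows typed in for illustration; the dotted column names, in particular the `Sepal.*` ones, are assumed.

```python
import pandas as pd

# A handful of Iris-like rows for illustration only.
data = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "Sepal.Width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "Petal.Length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "Petal.Width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "Species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
corre = data.select_dtypes("number").corr()

print(corre.round(2))
```

Each cell is the correlation between the row variable and the column variable, so the diagonal is always 1 and the matrix is symmetric.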
Bar Graph/Bar Chart: Represents categorical data with rectangular bars whose height or length is proportional to the values they represent. We can use a bar graph to get the count of each value in the column “Petal.Width”.
sns.catplot(x='Petal.Width', data=data,
            kind='count', aspect=1.5)
The hue parameter draws a separate set of bars for each unique value of the chosen column and distinguishes them by color. If we want the count of values for the width of the petal but also want to categorize them by Species, we can use the code below.
sns.catplot(x='Petal.Width', data=data, kind='count', hue='Species', aspect=1.5)
The graph speaks for itself. For example, petal widths from 0.1 to 0.6 mainly belong to the Species “Setosa”. Do you see any relation between petal width and Species?
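The counts behind a plot like this can also be tabulated. Here is a rough sketch with `pd.crosstab` on a few made-up rows, which gives one row per petal width and one column per species.

```python
import pandas as pd

# Made-up rows for illustration only.
data = pd.DataFrame({
    "Petal.Width": [0.2, 0.2, 0.4, 1.3, 1.3, 1.5, 1.8, 2.3, 1.8],
    "Species": ["setosa", "setosa", "setosa",
                "versicolor", "versicolor", "versicolor",
                "virginica", "virginica", "virginica"],
})

# Rows: petal width values, columns: species, cells: number of flowers.
table = pd.crosstab(data["Petal.Width"], data["Species"])

print(table)
```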
Scatter Plot: Put simply, a scatter plot shows points on a horizontal and a vertical axis to show how much one variable is affected by another. This relation is called correlation. We can call two variables highly correlated if the data points fall close to a straight line.
The code below gives a simple scatter plot from which we can understand the relationship between the width and the length of the petal.
chart = sns.catplot(x='Petal.Width', y='Petal.Length', data=data,
                    aspect=1.5)
chart.set_xlabels('Petal Width', weight='bold', fontsize=13)
chart.set_ylabels('Petal Length', weight='bold', fontsize=13)
plt.title('Relation between petal width & length')
Let us gather a few more details from the plot by color coding the points based on the Species. This will help us understand the relation between the petal’s width and length in different Iris Species.
chart = sns.catplot(x='Petal.Width', y='Petal.Length', data=data,
                    hue='Species', aspect=1.5)
# Customize chart
chart.set_xlabels('Petal Width', weight='bold', fontsize=13)
chart.set_ylabels('Petal Length', weight='bold', fontsize=13)
plt.title('Relation between petal width & length')
According to the seaborn 0.11.0 documentation, seaborn.jointplot will:
Draw a plot of two variables with bivariate and univariate graphs.
The graph gives both a scatter plot and a separate histogram for each variable. It is clear from this that there is a positive correlation between petal width and length.
kind is a parameter of seaborn.jointplot that determines the kind of plot we want to draw. It can be scatter, kde, hist, hex, reg or resid.
Setting kind = ‘kde’ will draw both bivariate and univariate KDEs.
sns.jointplot(x='Petal.Width', y='Petal.Length',
              data=data, kind='kde')
Kernel Density Estimation (KDE) is a way to estimate the probability density function of a continuous random variable.
According to the seaborn 0.11.0 documentation:
A histogram aims to approximate the underlying probability density function that generated the data by binning and counting observations. Kernel density estimation (KDE) presents a different solution to the same problem. Rather than using discrete bins, a KDE plot smooths the observations with a Gaussian kernel, producing a continuous density estimate. While kernel density estimation produces a probability distribution, the height of the curve at each point gives a density, not a probability. A probability can be obtained only by integrating the density across a range.
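To see what "smoothing with a Gaussian kernel" and "integrating the density" mean in practice, here is a minimal hand-rolled sketch in NumPy. The helper `gaussian_kde_1d`, the sample values and the fixed bandwidth of 0.3 are all assumptions made for illustration; seaborn picks its own bandwidth and this is not its implementation.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps, one centred on each observation."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Distance of every grid point to every sample, in bandwidth units.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Made-up petal widths for illustration only.
petal_width = [0.2, 0.2, 0.4, 1.3, 1.5, 1.8, 2.3]
grid = np.linspace(-3.0, 6.0, 2001)
density = gaussian_kde_1d(petal_width, grid, bandwidth=0.3)

# The curve gives densities; probabilities come from areas under it.
step = grid[1] - grid[0]
total_area = density.sum() * step                 # close to 1 over the whole range
in_range = (grid >= 1.0) & (grid <= 2.0)
prob_1_to_2 = density[in_range].sum() * step      # rough P(1.0 <= width <= 2.0)

print(round(total_area, 3), round(prob_1_to_2, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual observations, while a larger one gives a smoother curve that hides detail.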
Similar to a heatmap, a bivariate KDE plot smooths the x and y variables with a 2D Gaussian kernel.
If we set kind = ‘reg’, it will plot the data and a linear regression model fit using regplot(), along with univariate KDE curves.
sns.jointplot(x='Petal.Width', y='Petal.Length',
              data=data, kind='reg')
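The straight line in a regression plot is, in the simplest case, a least-squares fit. As a rough sketch of that idea (using `np.polyfit` on made-up values, not seaborn's own code), the snippet below recovers the slope and intercept and also computes the residuals, which is what the resid kind visualizes.

```python
import numpy as np

# Made-up petal measurements for illustration only.
petal_width = np.array([0.2, 0.4, 1.0, 1.3, 1.5, 1.8, 2.3])
petal_length = np.array([1.4, 1.5, 3.5, 4.0, 4.5, 5.5, 6.0])

# Degree-1 polynomial fit: length is approximately slope * width + intercept.
slope, intercept = np.polyfit(petal_width, petal_length, deg=1)

# Residuals: how far each observed length is from the fitted line.
predicted = slope * petal_width + intercept
residuals = petal_length - predicted

print(round(slope, 2), round(intercept, 2))
```

A positive slope matches the positive correlation seen in the scatter plot, and residuals scattered evenly around zero suggest that a straight line is a reasonable description.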
There are two ways to obtain a bin-based jointplot. One is using kind = ‘hist’, which will use histplot() on all of the axes.
sns.jointplot(x='Petal.Width', y='Petal.Length',
              data=data, kind='hist')
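A bin-based plot boils down to counting how many points fall into each cell of a grid. The sketch below does that counting with `np.histogram2d` on made-up values; the choice of three bins per axis is arbitrary and only keeps the output small.

```python
import numpy as np

# Made-up petal measurements for illustration only.
petal_width = np.array([0.2, 0.4, 1.0, 1.3, 1.5, 1.8, 2.3])
petal_length = np.array([1.4, 1.5, 3.5, 4.0, 4.5, 5.5, 6.0])

# counts[i, j] = number of flowers in width bin i and length bin j.
counts, width_edges, length_edges = np.histogram2d(
    petal_width, petal_length, bins=(3, 3)
)

print(counts)
print(width_edges)
print(length_edges)
```

Cells with high counts correspond to the darker rectangles in the joint histogram.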
The other way is using kind = ‘hex’. This will use matplotlib.axes.Axes.hexbin() to compute a bivariate histogram using hexagonal bins.
sns.jointplot(x='Petal.Width', y='Petal.Length',
              data=data, kind='hex')
In this article we discussed the correlation plot, the scatter plot, seaborn.catplot and seaborn.jointplot. These are a few among many visualization techniques. Data visualization is very important for analyzing data sets, and it is a lot like storytelling. In the current age of big data, visualization is a key tool for telling stories: it makes data much easier to understand and highlights trends and outliers. I hope this helps you build a foundation in data analytics.
Please let me know if you have any questions.