In my previous blog post, we went through the fundamentals of data analysis. Python has plenty of libraries that help in analyzing data in depth. Features are critical in a data set because, in machine learning, we try to find patterns among them. That raises the question: what are dependent and independent variables? Just as it sounds, the dependent variable is the output of the process and the independent variables are the inputs to it. For example, in the data set below, “Species” is the dependent variable and the remaining variables are considered independent variables.

Independent variables are also known as “predictors”. The dependent variable is also known as the “response” or “target” variable.
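As a minimal sketch of that split, assuming a pandas data frame with a `Species` column: the three rows and the name `sample` below are made up for illustration and are not taken from the actual data set.

```python
import pandas as pd

# Tiny illustrative frame; the values are invented for this example.
sample = pd.DataFrame({
    "Sepal.Length": [5.1, 7.0, 6.3],
    "Petal.Length": [1.4, 4.7, 6.0],
    "Species": ["setosa", "versicolor", "virginica"],
})

# Independent variables (predictors): every column except the target.
X = sample.drop(columns="Species")

# Dependent variable (response/target): the column we want to explain.
y = sample["Species"]

print(X.columns.tolist())
print(y.tolist())
```

Keeping the predictors and the target in separate objects like this is a common convention, not a requirement.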

For this article we will use a public Kaggle dataset called *Iris*, a very common data set used for practice. You can find the data set at the link below:

Let us take a look at the data set and its information. This iris data consists of six columns, which are:

- ID: Identification Number
- SepalLengthCm: Length of Sepal in cm
- SepalWidthCm: Width of Sepal in cm
- PetalLengthCm: Length of Petal in cm
- PetalWidthCm: Width of Petal in cm
- Species: Type of Species

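For our analysis we removed the ID column and renamed the remaining columns (the code later in this article refers to them by names such as `Petal.Width` and `Petal.Length`). We can drop a column using `data.drop('name', axis=1)`; note that this returns a new data frame, so assign the result back if you want to keep it. This is one way to do it. If we know the column names in advance, we can also choose which columns to load into our data frame using `df = pd.read_csv("sampleData.csv", usecols=['Col1','Col2'])`. Let us see how the data frame looks now.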

`data.describe()`

`data.info()`
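The drop-and-rename step described above can be sketched as follows. This is only an illustration under a few assumptions: the CSV header names (including `Id`) follow the column list given earlier, the two rows are typed inline instead of being read from the real file, and the new names for the sepal columns are guesses modelled on `Petal.Width` and `Petal.Length`.

```python
import io
import pandas as pd

# Stand-in for the CSV file: two rows typed inline so the example runs anywhere.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
"""
data = pd.read_csv(io.StringIO(csv_text))

# Drop the identifier column; drop() returns a new frame, so reassign it.
data = data.drop("Id", axis=1)

# Rename the measurement columns (the target names are assumed).
data = data.rename(columns={
    "SepalLengthCm": "Sepal.Length",
    "SepalWidthCm": "Sepal.Width",
    "PetalLengthCm": "Petal.Length",
    "PetalWidthCm": "Petal.Width",
})

print(data.columns.tolist())
```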

**Import necessary packages**

These are the packages that we need to import first (install any that are missing beforehand):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from pandas.plotting import scatter_matrix
import seaborn as sns
import plotly.io as pio

# Jupyter/IPython magic: render plots inline (omit when running a plain script)
%matplotlib inline
```

**Visualization**

First, I am going to get the total count of data for each Species:

`sns.catplot(x='Species', data=data, kind='count', aspect=1.2)`
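The bar heights in a count plot are simply the number of rows per category, so the same numbers can be checked in tabular form with `value_counts()`. A small sketch, using a made-up `species` series rather than the real column:

```python
import pandas as pd

# Invented labels for illustration; the real column has many more rows.
species = pd.Series(
    ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"],
    name="Species",
)

# One count per distinct label, the same quantity the count plot draws as bars.
species_counts = species.value_counts()
print(species_counts)
```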

If we are given a data set, it is important that we understand the relation between the variables. How does one variable affect the other? That relationship is known as correlation. **Correlation** can be either:

- **Positive**: the values increase (or decrease) together
- **Negative**: one value decreases as the other increases
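To make the two cases concrete, here is a small sketch that computes the Pearson correlation coefficient with NumPy. The arrays are invented for illustration: `up` rises with `x` and `down` falls as `x` rises.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
up = 2.0 * x + 1.0      # rises as x rises
down = 10.0 - 3.0 * x   # falls as x rises

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient.
r_up = np.corrcoef(x, up)[0, 1]
r_down = np.corrcoef(x, down)[0, 1]

print(r_up, r_down)
```

A coefficient near +1 indicates a strong positive correlation, near -1 a strong negative one, and near 0 little linear relationship.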

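```python
import matplotlib as mpl  # needed for mpl.rcParams below

mpl.rcParams['figure.figsize'] = (10, 7)

# Correlate only the four numeric measurement columns (Species is text)
numeric_cols = data.columns[0:4]
corre = data[numeric_cols].corr()
print(corre)

# Draw the correlation matrix as a colour-coded grid
fig = plt.figure()
ax = fig.add_subplot()
cat = ax.matshow(corre, vmin=-1, vmax=1)
fig.colorbar(cat)

# Label both axes with the column names
ticks = np.arange(0, 4)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(numeric_cols)
ax.set_yticklabels(numeric_cols)
```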

**Bar Graph/Bar Chart**: Represents categorical data with rectangular bars whose height or length is proportional to the values they represent. We can use a bar graph to get the count of each value in the column “Petal.Width”.

```python
sns.catplot(x='Petal.Width', data=data,
            kind='count', aspect=1.5)
```

The `hue` parameter draws a separate set of bars for each of its unique values and distinguishes them by color. If we want the count of values for the width of the petal, but also want to categorize them by Species, we can use the following:

`sns.catplot(x='Petal.Width', data=data, kind='count', hue='Species', aspect=1.5)`
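The counts behind such a grouped bar chart can also be inspected as a table. A minimal sketch using `pd.crosstab`, with a small invented frame named `sample` standing in for the real data:

```python
import pandas as pd

# Invented rows for illustration only.
sample = pd.DataFrame({
    "Petal.Width": [0.2, 0.2, 0.4, 1.3, 1.3, 1.8, 1.8, 2.3],
    "Species": ["setosa", "setosa", "setosa", "versicolor",
                "versicolor", "versicolor", "virginica", "virginica"],
})

# Rows: distinct petal widths. Columns: species. Cells: number of flowers.
counts = pd.crosstab(sample["Petal.Width"], sample["Species"])
print(counts)
```

Each cell corresponds to one colored bar in the plot, which makes it easy to confirm what the chart shows.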

The graph speaks for itself. For example, petal widths from 0.1 to 0.6 mainly belong to the Species “Setosa”. Do you see any relation between petal width and Species?

**Scatter Plot**: Put simply, a scatter plot draws points against a horizontal and a vertical axis to show how much one variable is affected by another. This relation is called correlation. We can call two variables highly correlated if the data points fall close to a straight line.

The code below gives a simple scatter plot from which we can understand the relationship between the width and the length of the petal.

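```python
# catplot treats the x variable as categorical, so each distinct
# petal width gets its own position along the horizontal axis
chart = sns.catplot(x='Petal.Width', y='Petal.Length', data=data,
                    aspect=1.5)

# Customize chart
chart.set_xlabels('Petal Width', weight='bold', fontsize=13)
chart.set_ylabels('Petal Length', weight='bold', fontsize=13)
plt.title('Relation between petal width & length',
          weight='bold', fontsize=16)
```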

Let us gather a few more details from the plot by color-coding the points based on the Species. This will help us understand the relation between the petal’s width and length in different Iris Species.

```python
chart = sns.catplot(x='Petal.Width', y='Petal.Length', data=data,
                    hue='Species', aspect=1.5)

# Customize chart
chart.set_xlabels('Petal Width', weight='bold', fontsize=13)
chart.set_ylabels('Petal Length', weight='bold', fontsize=13)
plt.title('Relation between petal width & length',
          weight='bold', fontsize=16)
```

**seaborn.jointplot**

According to the seaborn 0.11.0 documentation, seaborn.jointplot will “Draw a plot of two variables with bivariate and univariate graphs.”

To get a simple jointplot, assign x and y to create a scatterplot (using **scatterplot()**) with marginal histograms (using **histplot()**).

```python
sns.jointplot(x='Petal.Width', y='Petal.Length',
              data=data)
```

The graph gives both a scatter plot and a separate histogram for each variable. It is clear from this that there is a positive correlation between petal width and length.

**kind** is a parameter of seaborn.jointplot that determines the kind of plot to draw. It can be *scatter, kde, hist, hex, reg,* or *resid*.

Setting **kind = 'kde'** will draw both bivariate and univariate KDEs.

```python
sns.jointplot(x='Petal.Width', y='Petal.Length',
              data=data, kind='kde')
```

Kernel Density Estimation (**KDE**) is a way to estimate the probability density function of a continuous random variable.

According to the seaborn 0.11.0 documentation:

A histogram aims to approximate the underlying probability density function that generated the data by binning and counting observations. Kernel density estimation (KDE) presents a different solution to the same problem. Rather than using discrete bins, a KDE plot smooths the observations with a Gaussian kernel, producing a continuous density estimate. While kernel density estimation produces a probability distribution, the height of the curve at each point gives a density, not a probability. A probability can be obtained only by integrating the density across a range.
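The difference between a density and a probability can be illustrated with a hand-rolled Gaussian KDE in NumPy. This is a sketch, not seaborn's implementation: the sample values, the `bandwidth` of 0.3 and the helper name `gaussian_kde_curve` are all assumptions made for the example.

```python
import numpy as np

# Illustrative petal-width-like values (made up for the example).
values = np.array([0.2, 0.2, 0.4, 1.3, 1.5, 1.8, 2.1, 2.3])
bandwidth = 0.3  # hypothetical smoothing width, chosen by hand


def gaussian_kde_curve(grid, samples, h):
    """Average of Gaussian bumps of width h centred on each sample."""
    z = (grid[:, None] - samples[None, :]) / h
    bumps = np.exp(-0.5 * z ** 2) / (h * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)


grid = np.linspace(-2.0, 5.0, 1401)
density = gaussian_kde_curve(grid, values, bandwidth)
dx = grid[1] - grid[0]

# The whole curve integrates to approximately 1.
total_area = density.sum() * dx

# A probability is the area under a stretch of the curve, here 1.0 to 2.0.
mask = (grid >= 1.0) & (grid <= 2.0)
prob_1_to_2 = density[mask].sum() * dx

print(total_area, prob_1_to_2)
```

The heights in `density` are not probabilities on their own; only areas such as `prob_1_to_2` are.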

Similar to a heatmap, a bivariate KDE plot smooths the x and y variables with a 2D Gaussian.

If we set **kind = 'reg'**, it will plot the data and a linear regression model fit using regplot(), along with univariate KDE curves.

```python
sns.jointplot(x='Petal.Width', y='Petal.Length',
              data=data, kind='reg')
```
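To see what a straight-line fit of this kind amounts to numerically, here is a sketch that fits an ordinary least-squares line with `np.polyfit`. The eight value pairs are invented for illustration, and this is a stand-in for the idea rather than the exact code seaborn runs.

```python
import numpy as np

# Made-up width/length pairs, loosely shaped like petal measurements.
petal_width = np.array([0.2, 0.4, 1.0, 1.3, 1.5, 1.8, 2.1, 2.5])
petal_length = np.array([1.4, 1.5, 3.5, 4.0, 4.5, 4.8, 5.6, 6.0])

# Degree-1 polynomial fit: returns the slope first, then the intercept.
slope, intercept = np.polyfit(petal_width, petal_length, deg=1)

# Differences between the observed lengths and the fitted line.
residuals = petal_length - (slope * petal_width + intercept)

print(slope, intercept)
```

A positive `slope` matches the upward trend seen in the scatter plot, and the residuals are what a plot with `kind='resid'` would display.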

There are two ways to obtain a bin-based jointplot. One is using **kind = 'hist'**, which uses histplot() on all of the axes.

```python
sns.jointplot(x='Petal.Width', y='Petal.Length',
              data=data, kind='hist')
```

The other way is to use **kind = 'hex'**. This will use matplotlib.axes.Axes.hexbin() to compute a bivariate histogram using hexagonal bins.

```python
sns.jointplot(x='Petal.Width', y='Petal.Length',
              data=data, kind='hex')
```
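Both bin-based variants count how many points fall into each cell of a grid; only the cell shape differs. A sketch of the rectangular case with `np.histogram2d`, using invented value pairs and an arbitrarily chosen 3 by 3 grid:

```python
import numpy as np

# Made-up width/length pairs for illustration.
petal_width = np.array([0.2, 0.4, 1.0, 1.3, 1.5, 1.8, 2.1, 2.5])
petal_length = np.array([1.4, 1.5, 3.5, 4.0, 4.5, 4.8, 5.6, 6.0])

# counts[i, j] = number of points in width bin i and length bin j.
counts, width_edges, length_edges = np.histogram2d(
    petal_width, petal_length, bins=3
)

print(counts)
```

In the plot, each cell is shaded according to its count, so darker cells mark where the observations are concentrated.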

## Summary

In this article we discussed the correlation plot, the scatter plot, seaborn.catplot and seaborn.jointplot. These are a few among many visualization techniques. Data visualization is very important for analyzing datasets, and it is a lot like storytelling. In the current age of big data, visualization is a key tool for telling stories: it makes data much easier to understand and highlights trends and outliers. I hope this helps you build a foundation in data analytics.

Please let me know if you have any questions.

*If you have any thoughts, comments or questions, please leave a comment below or contact me on **LinkedIn**. You could also find more similar projects in my **github**.*