Analyzing Data Distributions with Seaborn

A practical guide with many example plots

(image by author)

Data visualizations are key players in data science. They are powerful tools for exploring variables and the relationships among them. They are also usually a better way than plain numbers to communicate results and findings.

In this article, we will see how data visualizations can be used to explore the distribution of variables. The examples use Seaborn, a popular Python data visualization library.

It is essential to understand the distribution of variables. For instance, some machine learning models perform best when the variables are normally distributed. Thus, the distribution of variables guides how we approach a problem.

Distributions are also an integral part of exploratory data analysis. They help us detect outliers and skewness, and give an overview of the measures of central tendency (mean, median, and mode).
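As a rough numeric companion to these ideas, the sketch below computes the mean, the median, the skewness, and an outlier count. It assumes a synthetic right-skewed sample rather than the insurance data, and the 1.5 * IQR fence is only one common rule of thumb for flagging outliers.

```python
import numpy as np

# Synthetic right-skewed sample (an assumption, not the insurance data).
rng = np.random.default_rng(0)
values = rng.lognormal(mean=3.0, sigma=0.5, size=1000)

mean = values.mean()
median = np.median(values)

# Sample skewness: third central moment divided by the cubed standard deviation.
skewness = np.mean((values - mean) ** 3) / values.std() ** 3

# Flag outliers with the common 1.5 * IQR rule of thumb.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

print(f"mean={mean:.2f}, median={median:.2f}, skewness={skewness:.2f}")
print(f"outliers flagged: {is_outlier.sum()}")
```

A mean above the median together with a positive skewness points to a long right tail, which is the kind of shape a histogram makes visible at a glance.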

With the importance of data distributions established, we can start on the examples.

We will be using an insurance dataset that you can obtain from Kaggle. The first step is to import the libraries and read the dataset into a Pandas dataframe.

import numpy as np
import pandas as pd
import seaborn as sns

sns.set(style='darkgrid')
insurance = pd.read_csv("/content/insurance.csv")
insurance.head()

(image by author)

The dataset contains some measures (i.e. features) about the customers of an insurance company and the amount that is charged for the insurance.

The first type of visualization we will see is the histogram. It divides the value range of a continuous variable into discrete bins and shows how many values exist in each bin.
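To make the binning step concrete, here is a minimal sketch that counts values per bin with NumPy. The sample is made up (a normal distribution chosen only for illustration), and the choice of 10 bins is arbitrary.

```python
import numpy as np

# Made-up stand-in values, not the bmi column of the insurance data.
rng = np.random.default_rng(1)
bmi_like = rng.normal(loc=30, scale=6, size=500)

# Split the value range into 10 equal-width bins and count values per bin.
counts, edges = np.histogram(bmi_like, bins=10)

# Each bin covers [left, right); the last bin also includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {count}")
```

The bar heights of a histogram are these counts, drawn over the corresponding bin edges.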

The following is a basic histogram of the bmi variable.

sns.displot(insurance, x='bmi', kind='hist', aspect=1.2)

Histogram of bmi (image by author)

We can use the displot function of seaborn and specify the type of distribution plot using the kind parameter. The aspect parameter sets the width-to-height ratio of each facet (width = aspect * height).

The bmi variable looks approximately normally distributed, apart from a few outliers above 50.

The displot function can add a kde plot on top of a histogram through the kde parameter. A kde (kernel density estimation) plot is a non-parametric way to estimate the probability density function of a random variable.
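The idea behind kernel density estimation can be sketched by hand: place a small Gaussian bump on every observation and average the bumps. The code below is an illustration on made-up data with a rule-of-thumb bandwidth. It is not a copy of seaborn's implementation, which may differ in details such as how the bandwidth is chosen.

```python
import numpy as np

# Made-up stand-in values, not the insurance data.
rng = np.random.default_rng(3)
sample = rng.normal(loc=30, scale=6, size=500)

# Rule-of-thumb bandwidth (Scott's rule for one dimension).
bandwidth = sample.std(ddof=1) * len(sample) ** (-1 / 5)

# Evaluate the estimate on a grid that extends past the data on both sides.
grid = np.linspace(sample.min() - 4 * bandwidth,
                   sample.max() + 4 * bandwidth, 400)

# One Gaussian bump per observation, averaged at every grid point.
z = (grid[:, None] - sample[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (
    len(sample) * bandwidth * np.sqrt(2 * np.pi)
)

# A density should integrate to roughly 1 (simple Riemann sum here).
area = density.sum() * (grid[1] - grid[0])
print(f"bandwidth={bandwidth:.2f}, area under curve={area:.3f}")
```

A larger bandwidth gives a smoother curve and a smaller one follows the data more closely, which is why the same data can produce kde plots that look quite different.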

sns.displot(insurance, x='bmi', kind='hist', kde=True, aspect=1.2)

(image by author)

We can also create only the kde plot by setting the kind parameter of the displot function to 'kde'. In that case, we do not need the kde parameter.

We can plot the distribution of a variable separately based on the categories of another variable. One way is to use the hue parameter. The figure below shows the histogram of the bmi variable for smokers and non-smokers separately.

sns.displot(insurance, x='bmi', kind='hist', hue='smoker', aspect=1.2)

(image by author)

We can also show the bars side-by-side by using the multiple parameter.

sns.displot(insurance, x='bmi', kind='hist', hue='smoker', multiple='dodge', aspect=1.2)

(image by author)

It is possible to create a grid of plots with the displot function, which is a highly useful feature. We can create more informative visualizations by using the hue and col parameters together.

sns.displot(insurance, x='charges', kind='hist', hue='smoker', col='sex', height=6, aspect=1)

(image by author)

The figure above shows the distribution of the charges variable in different settings. We clearly see that charges tend to be higher for people who smoke. The ratio of smokers is also higher for males than it is for females.
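Readings like these can be cross-checked with a couple of pandas aggregations. The sketch below uses a tiny hand-made table named toy (hypothetical rows, not taken from the Kaggle file) that only borrows the column names sex, smoker, and charges. Applying the same two calls to the insurance dataframe would give the actual figures.

```python
import pandas as pd

# Hand-made rows for illustration only; they are not from the real dataset.
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "smoker": ["yes", "yes", "no", "yes", "no", "no"],
    "charges": [30000.0, 25000.0, 8000.0, 28000.0, 6000.0, 7000.0],
})

# Typical charge per smoking status.
median_charges = toy.groupby("smoker")["charges"].median()

# Share of smokers and non-smokers within each sex (each row sums to 1).
smoker_share = pd.crosstab(toy["sex"], toy["smoker"], normalize="index")

print(median_charges)
print(smoker_share)
```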

We can also create two-dimensional histograms that give us an overview of the joint distribution of two variables. The x and y parameters of the displot function are used to create a two-dimensional histogram.

sns.displot(insurance, x='charges', y='bmi', kind='hist',
height=6, aspect=1.2)

(image by author)

This figure shows the distribution of the bmi and charges variables. The darker parts of the grid are denser in terms of the number of data points (i.e. rows) they contain.
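The counts behind such a grid can be reproduced with NumPy. This is a minimal sketch on two made-up, loosely related variables. The 10 by 10 grid is an arbitrary choice, not necessarily the binning seaborn would pick.

```python
import numpy as np

# Two made-up variables with a loose positive relationship.
rng = np.random.default_rng(5)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

# Count how many (x, y) pairs fall into each cell of a 10 x 10 grid.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=10)

# Locate the densest cell, i.e. the one that would be drawn darkest.
i, j = np.unravel_index(counts.argmax(), counts.shape)
print(f"densest cell: x from {x_edges[i]:.2f} to {x_edges[i + 1]:.2f}, "
      f"y from {y_edges[j]:.2f} to {y_edges[j + 1]:.2f}, "
      f"rows = {int(counts[i, j])}")
```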

Another feature we can use with distribution plots is the rug plot. It draws ticks along the x and y axes to represent marginal distributions. Let’s add a rug plot to the two-dimensional histogram created in the previous step.

sns.displot(insurance, x='charges', y='bmi', kind='hist', rug=True,
height=6, aspect=1.2)

(image by author)

The plot is more informative now. In addition to the two-dimensional histogram, the rug plot on the axes provides an overview of the distribution of the individual variables.

The hue parameter can also be used with two-dimensional histograms.

sns.displot(insurance, x='charges', y='bmi', kind='hist', rug=True, hue='smoker', height=6, aspect=1.2)

(image by author)

We can also create bivariate kde plots. For instance, the plot below is the kde version of the previous two-dimensional histogram.

sns.displot(insurance, x='charges', y='bmi', kind='kde', rug=True, hue='smoker', height=6, aspect=1.2)

(image by author)

The density of the contour lines gives us an idea about the distribution. We can use the fill parameter to make it look more like a histogram.

sns.displot(insurance, x='charges', y='bmi', kind='kde', rug=True, hue='smoker', fill=True, height=6, aspect=1.2)

(image by author)

Scatter plots are mainly used to check the correlations between two numerical variables. However, they also give us an idea about the distributions.
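The correlation that a scatter plot suggests visually can also be put into a single number. The sketch below computes the Pearson correlation coefficient on made-up data in which y is built to rise with x. The variables and the slope are assumptions chosen for illustration.

```python
import numpy as np

# Made-up data: y is constructed to increase with x, plus random noise.
rng = np.random.default_rng(6)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)

# Pearson correlation: the off-diagonal entry of the 2 x 2 correlation matrix.
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.2f}")
```

A value close to 1 or -1 corresponds to points that hug a straight line in the scatter plot, while a value near 0 corresponds to a shapeless cloud.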

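Seaborn is quite flexible in terms of combining different kinds of plots to create a more informative visualization. For instance, the jointplot function by default combines a scatter plot with histograms on the margins. When hue is set, as in the example below, the margins show density curves instead.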

sns.jointplot(data=insurance, x='charges', y='bmi', hue='smoker', height=7, ratio=4)

(image by author)
