Soner Yıldırım
A highly convenient tool for exploratory data analysis
Data visualization techniques are highly useful for exploring a dataset. There is a wide variety of visualization types used in the data science ecosystem. Which one best fits a given task depends on the characteristics of the data and the variables.
In this article, we will cover HiPlot, an interactive visualization tool created by Facebook. It is essentially a parallel coordinates plot: each row (i.e. data point) is represented with a line, and that line crosses one vertical axis per variable (i.e. column) at the row's value for that variable.
A parallel coordinates plot provides a graphical representation of possible groups (or clusters) within a dataset. It also reveals patterns that can help distinguish data points.
A parallel coordinates plot is also a convenient way of exploring high-dimensional data, for which traditional visualization techniques might fail to provide a decent solution.
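To make the idea concrete, here is a minimal sketch in plain Python of how a parallel coordinates plot turns rows into lines. It is not how HiPlot is implemented; the helper name `to_polylines` and the three sample rows are made up for illustration. Each column gets its own vertical axis, values are min-max scaled to the range [0, 1] on that axis, and a row becomes the list of points that connects the axes.

```python
def to_polylines(rows, columns):
    """Map each row to a list of (axis_index, scaled_value) points."""
    # Per-column minimum and maximum, used to scale every axis to [0, 1].
    col_min = {c: min(r[c] for r in rows) for c in columns}
    col_max = {c: max(r[c] for r in rows) for c in columns}

    polylines = []
    for r in rows:
        line = []
        for i, c in enumerate(columns):
            span = col_max[c] - col_min[c]
            # A constant column has no spread, so place it mid-axis.
            y = (r[c] - col_min[c]) / span if span else 0.5
            line.append((i, y))
        polylines.append(line)
    return polylines


# Three illustrative rows (hypothetical values, not from a real dataset).
rows = [
    {"a": 1.0, "b": 10.0, "c": 0},
    {"a": 2.0, "b": 30.0, "c": 1},
    {"a": 3.0, "b": 20.0, "c": 2},
]
lines = to_polylines(rows, ["a", "b", "c"])
print(lines[1])  # [(0, 0.5), (1, 1.0), (2, 0.5)]
```

Rows with similar values produce lines that run close together, which is why clusters show up as visible bundles.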
Facebook created HiPlot for hyperparameter tuning of neural networks. However, we can apply it to pretty much any dataset. The ultimate goal is the same: explore the data well. We will use the famous iris dataset to demonstrate how HiPlot is used.
The first step is to install HiPlot. The documentation provides a detailed explanation of how to install it in various environments. I’m using pip to install it.
pip install -U hiplot
We can now import all the dependencies and read the dataset into a pandas dataframe.
import hiplot as hip
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)['frame']
iris.head()
It is extremely simple to create an interactive visualization with HiPlot. The following single line of code creates the plot we will be experimenting with throughout the article.
hip.Experiment.from_dataframe(iris).display()
HiPlot also accepts an iterable of dictionaries (e.g. a list with one dictionary per data point) as input data. In such cases, we use the from_iterable function instead of the from_dataframe function.
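As a sketch of the iterable route, the snippet below builds a list of dictionaries, one per data point, ready to hand to `from_iterable`. The hyperparameter names and values are invented for illustration, and `build_experiment` is a hypothetical wrapper; it imports HiPlot inside the function so the data preparation can be inspected on its own.

```python
# One dictionary per data point; the keys become the axes of the plot.
# The names and numbers below are made up for illustration.
runs = [
    {"learning_rate": 0.1, "dropout": 0.2, "optimizer": "sgd", "val_loss": 0.45},
    {"learning_rate": 0.01, "dropout": 0.3, "optimizer": "adam", "val_loss": 0.31},
    {"learning_rate": 0.001, "dropout": 0.5, "optimizer": "adam", "val_loss": 0.38},
]


def build_experiment(rows):
    """Wrap a list of dicts in a HiPlot experiment (requires hiplot)."""
    import hiplot as hip  # imported here so the data above has no dependency on it

    return hip.Experiment.from_iterable(rows)
```

In a notebook, `build_experiment(runs).display()` should render the same kind of interactive plot as the dataframe version.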
Here is a screenshot of the generated plot. We notice some patterns just by looking at it.
The iris dataset contains 4 independent variables (sepal length, sepal width, petal length, and petal width) and a target variable. The target takes one of three values (0, 1, or 2), one for each iris species, and the species is what the independent variables help distinguish.
What makes HiPlot exceptional is its interactive interface. For instance, we can select a value range on any of the variables on the graph.
We select a value range on the petal width column. Only the data points that have a petal width in the selected range are displayed. We immediately notice that the selected range is distinctive for target value 0.
It is possible to select a value range on multiple columns so we can create more specific patterns.
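The same selections can be reproduced in pandas, which is handy for checking what the plot suggests. This is a minimal sketch rather than anything HiPlot does internally: the range bounds are my own choices, not the ones from the screenshots, and the column names are those returned by scikit-learn's `load_iris(as_frame=True)`, which needs pandas installed.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)["frame"]

# Single-axis selection: keep rows whose petal width falls in the chosen range.
narrow_petals = iris[iris["petal width (cm)"].between(0.1, 0.6)]
print(narrow_petals["target"].unique())

# Multi-axis selection: combine range conditions with & to narrow the pattern.
subset = iris[
    iris["petal width (cm)"].between(1.0, 1.8)
    & iris["petal length (cm)"].between(3.0, 5.0)
]
print(subset["target"].value_counts())
```

Each `between` call plays the role of one brushed range on an axis, and every additional condition corresponds to brushing another column.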
We can also select a value from the target variable and see the pattern of data points that belong to that class.
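A numeric counterpart of selecting one target value is to summarize each feature per class. The sketch below assumes the same scikit-learn dataframe as before and uses a plain pandas `groupby`; it only approximates what the plot shows, since it reports ranges rather than individual lines.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)["frame"]

# Min and max of every feature within each class: roughly where the lines of
# a selected target value cross each axis.
ranges = iris.groupby("target").agg(["min", "max"])
print(ranges["petal width (cm)"])
```

Classes whose ranges do not overlap on some feature are the ones that separate cleanly on that axis of the plot.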
HiPlot allows us to rearrange the columns on the graph. For instance, we can move the target variable and place it on the far left. This feature comes in handy if you want to put the categorical variables on one side and the numerical variables on the other.
HiPlot also generates a table as part of the interactive interface. We can use this table to select data points and view them on the graph.
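Besides dragging axes in the browser, the initial order can be requested from code. The sketch below relies on my understanding of HiPlot's `display_data` settings for the parallel plot (an `order` list of column names); treat that as an assumption and check it against the documentation for your installed version. `with_axis_order` is a hypothetical helper name.

```python
# Desired left-to-right axis order, with the categorical target first.
# Column names are those of scikit-learn's iris dataframe.
axis_order = [
    "target",
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]


def with_axis_order(exp, order):
    """Ask HiPlot's parallel plot to lay out its axes in the given order."""
    import hiplot as hip  # requires hiplot; imported lazily for this sketch

    # Assumption: the parallel plot reads an "order" entry from its display data.
    exp.display_data(hip.Displays.PARALLEL_PLOT).update({"order": order})
    return exp
```

In a notebook this would be used as `with_axis_order(hip.Experiment.from_dataframe(iris), axis_order).display()`.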