Whether you are a machine learning engineer studying feature distributions to build better models, or a platform engineer monitoring metrics such as requests per minute, you need to draw and read graphs. Knowing which graph works in which situation makes it easier to tell a story with data. With so many graph types available, picking one can be overwhelming. The goal of this article is to show how the type of data determines the type of graph to use, and what information can be inferred from that graph. This lets the reader quickly pull vital information out of a graph and choose a graph based only on the types of the variables involved. The article is written with the needs of a machine learning engineer or data scientist in mind, but it is useful for anyone who wants to know which graph to draw in which scenario or how to read the basic graphs.
- Types of Variable (Categorical, Quantitative)
- Scale of Measurement (Nominal, Ordinal, Interval, Ratio)
- Examining Distributions (Pie Chart, Bar Graph, Histogram, Box Plot)
- Examining Relationships (Side-By-Side Box Plot, Scatter Plot, Two-Way Table, Heat Map)
- Categorical Variables: According to Wikipedia, a categorical (qualitative) variable is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group. For example, smoking is a categorical variable that splits people into two groups: those who smoke and those who don’t. Gender and race are also categorical, since a person belongs to one of a given set of values. Zip code is a categorical variable too, as it categorizes geographic location.
- Quantitative Variables: variables that represent some kind of measurement and take numerical values. Examples are the age, weight and height of a person.
An easy way to recognize a quantitative variable is to check whether it represents some form of measurement and has numeric values; a categorical variable does not represent a measurement. Beyond that, nothing stops a categorical variable from having numeric values or a large set of possible values, as is the case with zip code. Even though a zip code is numeric, and so could technically be added, subtracted or sorted, the results of those operations mean nothing: the value only identifies the locality a person belongs to.
- Nominal: a qualitative (categorical) measure that uses discrete categories to describe a characteristic, for example citizenship, religious affiliation or marital status. Even though these can be encoded as numbers, they have no natural way to be ranked or ordered.
- Ordinal: ranks or orders participants on some scale or attribute, but the gaps between consecutive ranks are not fixed or equal. An example is the condition of a car: Very Good, Good, Okay, Bad.
- Interval: takes numerical form, and the distance between any pair of consecutive numbers is equal. An example is temperature in degrees Celsius, where zero is just another point on the scale.
- Ratio: similar to the interval scale; the major difference is how a value of zero is interpreted. For ratio measures zero is meaningful and tells us that the attribute is absent. An example is the number of people who have polio in India.
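The difference between the scales shows up as soon as values are sorted or divided. Here is a minimal sketch in plain Python, using made-up car conditions and temperatures. The name `CONDITION_ORDER` and the Kelvin conversion are illustrative assumptions, not part of the examples above.

```python
# Ordinal categories need an explicit order; alphabetical sorting gets it wrong.
CONDITION_ORDER = ["Bad", "Okay", "Good", "Very Good"]  # worst to best (assumed)
rank = {label: position for position, label in enumerate(CONDITION_ORDER)}

cars = ["Good", "Bad", "Very Good", "Okay", "Good"]  # made-up observations

alphabetical = sorted(cars)
by_condition = sorted(cars, key=rank.__getitem__)

print(alphabetical)  # ['Bad', 'Good', 'Good', 'Okay', 'Very Good']
print(by_condition)  # ['Bad', 'Okay', 'Good', 'Good', 'Very Good']

# Interval vs ratio: dividing two values only makes sense when zero means "none".
celsius_a, celsius_b = 10.0, 20.0
kelvin_a, kelvin_b = celsius_a + 273.15, celsius_b + 273.15

celsius_ratio = celsius_b / celsius_a  # 2.0, yet 20 °C is not "twice as hot"
kelvin_ratio = kelvin_b / kelvin_a     # close to 1.035 on a scale with a true zero
print(celsius_ratio, round(kelvin_ratio, 3))
```

Differences are meaningful on both temperature scales (20 °C is 10 degrees warmer than 10 °C), which is exactly what the interval scale promises; only the ratio changes meaning.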
Examining a distribution has two components:
- What values the variable takes
- How often the variable takes those values
Suppose the data contains a categorical variable, as in the image below, which shows a snippet of data from a vintage car store listing each car and the brand it belongs to.
In such a dataset we look at the frequency distribution of the categorical variable, since it reveals any category imbalance in the data.
In this case the graphs to go for are the following:
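A frequency distribution is just a count per category. As a minimal sketch, assuming a hypothetical list of brands (the values below are made up and are not the data from the image), the standard library's `collections.Counter` is enough:

```python
from collections import Counter

# Hypothetical brand column; made-up values for illustration only.
brands = ["Mercedes", "BMW", "Mercedes", "Audi",
          "Mercedes", "BMW", "Mercedes", "Audi"]

counts = Counter(brands)
total = sum(counts.values())

# most_common() lists categories from most to least frequent.
for brand, n in counts.most_common():
    print(f"{brand:<10} {n:>3} {n / total:6.1%}")
```

A large gap between the first and last rows of this output is the category imbalance mentioned above.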
A pie chart is useful for describing a frequency distribution. The area covered by one color shows the dominance of that category in the dataset. Additional information, such as the name and percentage of each category, is useful to show. Looking at the graph above, we see that 53% of the cars in the dataset are Mercedes.
In a bar graph the x-axis usually represents the categorical labels and the y-axis represents the numerical quantity associated with each label, which is the frequency in this case. The tallest bars show the dominant categories in the dataset.
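Both charts can be drawn from the same per-category counts. The sketch below assumes matplotlib is installed and uses hypothetical counts (not the figures from the graphs above). The output file name is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical counts per brand.
counts = {"Mercedes": 8, "BMW": 4, "Audi": 3}
labels = list(counts.keys())
values = list(counts.values())

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(9, 4))

# Pie chart: slice area is proportional to the category's share.
ax_pie.pie(values, labels=labels, autopct="%1.0f%%")
ax_pie.set_title("Share of cars by brand")

# Bar graph: categories on the x-axis, frequency on the y-axis.
bars = ax_bar.bar(labels, values)
ax_bar.set_xlabel("Brand")
ax_bar.set_ylabel("Number of cars")
ax_bar.set_title("Cars per brand")

fig.tight_layout()
fig.savefig("brand_distribution.png")
```

The pie chart answers "what share does each brand have", while the bar graph makes it easier to compare categories of similar size.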
When the variable is quantitative, it usually takes values over a large range, and it is not practical to build a frequency distribution over each individual value. Instead we create bins; each bin then acts as a category, and a histogram can be drawn over the bins. An example of a quantitative variable is shown in the following graph.
We can define a set of intervals, each representing a grade for those marks, which will look like:
Although a histogram is drawn like a bar chart, it differs in that the x-axis does not show the raw values we were given; instead we create bins and show the frequency distribution over those bins. Sometimes the right number of bins is obvious, or the bins are predefined. In other cases it can be experimented with, or left to the plotting library to decide based on some mathematical rule. To explore variations of histograms and distplots, check out histogram and distplot.
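To make the two binning options concrete, here is a sketch that draws the same data twice: once with predefined bin edges and once with the bin count left to the library. It assumes matplotlib is installed; the marks and the grade boundaries in `grade_edges` are invented for the example.

```python
import random

import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

random.seed(0)
# Hypothetical exam marks out of 100, clamped to the valid range.
marks = [min(100.0, max(0.0, random.gauss(62, 15))) for _ in range(200)]

# Predefined bin edges: assumed grade bands, not taken from the article's data.
grade_edges = [0, 40, 50, 60, 70, 80, 90, 100]

fig, (ax_fixed, ax_auto) = plt.subplots(1, 2, figsize=(10, 4))

# hist() returns the count per bin, the bin edges, and the drawn bars.
counts, edges, _ = ax_fixed.hist(marks, bins=grade_edges, edgecolor="black")
ax_fixed.set_title("Predefined bins")
ax_fixed.set_xlabel("Marks")
ax_fixed.set_ylabel("Number of students")

# bins="auto" delegates the choice of bin edges to NumPy's estimator.
ax_auto.hist(marks, bins="auto", edgecolor="black")
ax_auto.set_title('bins="auto"')
ax_auto.set_xlabel("Marks")

fig.tight_layout()
fig.savefig("marks_histogram.png")
```

Note that predefined bins may have unequal widths, as here, so bar heights in the left panel are raw counts per grade band rather than densities.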
The information we can infer from a histogram is the following:
There are two things to look for in the shape of a histogram:
- Skewness: whether the distribution is left skewed, right skewed or symmetric.
A symmetric distribution can be seen in the real world when measuring the heights of students in a class: the majority of students fall within a specific range, with some exceptions on either side.
A left-skewed distribution has most of the data towards the right end. A real-world example is the age of death from natural causes: most such deaths happen at older ages.
In a right-skewed distribution most of the data is at the left end. A real-world example is people’s salaries: most people earn in the lower ranges while a few have very high salaries.
Knowing the skew tells us from which side to remove outliers for a given set of data. If we remove data from both sides of a skewed distribution, we may remove useful data, and as a consequence the trained models may not generalize well.
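Skew can also be checked numerically before any plot is drawn. The sketch below computes the moment-based skewness coefficient (third central moment divided by the cubed standard deviation) in plain Python. The data sets are made up, and the `tolerance` of 0.5 used to call a distribution "roughly symmetric" is an arbitrary assumption.

```python
import statistics


def skewness(values):
    """Moment-based skewness: positive for a right tail, negative for a left tail."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


def describe(values, tolerance=0.5):
    """Label the shape; the tolerance is an assumed cut-off, not a standard."""
    g = skewness(values)
    if g > tolerance:
        return "right skewed"
    if g < -tolerance:
        return "left skewed"
    return "roughly symmetric"


# Made-up samples mirroring the three real-world examples.
salaries = [2200, 2400, 2500, 2600, 2700, 2800, 3000, 3200, 9000, 15000]
ages_at_death = [45, 62, 70, 74, 78, 80, 82, 84, 85, 88]
heights = [150, 155, 160, 165, 170]

for name, data in [("salaries", salaries),
                   ("ages_at_death", ages_at_death),
                   ("heights", heights)]:
    print(f"{name:<14} {skewness(data):6.2f}  {describe(data)}")
```

The sign of the coefficient points to the side with the long tail, which is the side where outlier removal is worth considering first.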
- Modality: the number of peaks the distribution has.
A distribution with one peak is unimodal, one with two peaks is bimodal, and one with more than two peaks is known as multimodal.
The graph above has two peaks. In the real world such a shape can come up when looking at the distribution of money spent by people on an e-commerce website. This information can be used to create segments of users that are targeted with recommendations in different price ranges.
Outliers are the observations that fall outside the overall pattern.
The center is the midpoint of the distribution and is expected to divide the distribution into two approximately equal parts. Mode, mean and median are the three measures of center. The point to remember is that the mean is highly sensitive to outliers, while the median is almost unaffected.
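The sensitivity of the mean is easy to see by adding a single extreme value to a small sample. A minimal sketch with made-up heights in centimetres, using only the standard library:

```python
import statistics

# Made-up heights in cm; 250 is a deliberately unrealistic outlier.
heights = [150, 152, 155, 155, 157, 160]
with_outlier = heights + [250]

mean_before = statistics.mean(heights)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(heights)
median_after = statistics.median(with_outlier)
mode_before = statistics.mode(heights)
mode_after = statistics.mode(with_outlier)

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"median: {median_before} -> {median_after}")
print(f"mode:   {mode_before} -> {mode_after}")
```

In this sample one outlier moves the mean by more than 13 cm while the median and the mode stay at 155.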
Let’s look at the interquartile range (IQR), as it is the measure we need to understand the upcoming graph. The IQR is the third quartile minus the first quartile. The first quartile (Q1) is the point on the x-axis that has 25% of the data to its left and 75% to its right. The third quartile (Q3) is the point that has 75% of the data to its left and 25% to its right. So the IQR is the width of the range that holds the middle 50% of the data, around the median.
The IQR can help us detect outliers. A general rule of thumb, known as the 1.5(IQR) criterion, is:
An observation is considered a suspected outlier if it is:
- Below Q1 - 1.5(IQR) or
- Above Q3 + 1.5(IQR)
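The criterion translates directly into code. The sketch below uses `statistics.quantiles` from the standard library with `method="inclusive"`; quartile conventions differ slightly between tools, so other libraries may give marginally different fences on the same data. The function name `iqr_outliers` and the sample values are made up for illustration.

```python
import statistics


def iqr_outliers(values):
    """Return suspected outliers and the (low, high) fences of the 1.5(IQR) rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low = q1 - 1.5 * iqr
    high = q3 + 1.5 * iqr
    suspects = [v for v in values if v < low or v > high]
    return suspects, (low, high)


data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # made-up sample
outliers, fences = iqr_outliers(data)

print("fences:", fences)      # (-2.0, 14.0)
print("outliers:", outliers)  # [30]
```

For this sample Q1 is 4 and Q3 is 8, so the IQR is 4 and anything below -2 or above 14 is flagged as a suspected outlier, not automatically as an error.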
In the image above, the five numbers (Min, Q1, Median, Q3, Max) give a quick numerical description of both the center and the spread of the distribution, which brings us to the next graph, one that can show all of them.
The information in the graph above is based on data that looks like:
We will revisit box plots later and explain how they are even more useful when examining relationships.
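The five numbers that a box plot draws can be computed without any plotting library. A minimal sketch on made-up values, again using the `inclusive` quartile method as an assumption (plotting libraries may use a slightly different convention):

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


data = [12, 15, 17, 19, 21, 22, 25, 28, 31]  # made-up sample
summary = five_number_summary(data)

for label, value in zip(["Min", "Q1", "Median", "Q3", "Max"], summary):
    print(f"{label:<7}{value}")
```

The box of a box plot spans Q1 to Q3, the line inside it marks the median, and the whiskers reach towards the minimum and maximum.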
Examining a distribution involves a single variable, whereas a relationship is between two variables. We will explore three cases:
- One variable is Categorical and the other is Quantitative
- Both variables are Quantitative
- Both variables are Categorical
The following image shows a snippet of the Iris dataset, which covers three species of Iris flower along with their sepal and petal dimensions.
Let’s try to find the relationship between sepal_width and species.
Once we know what a box plot represents, we can place box plots side by side to see how the distribution of sepal_width varies across the three species. Based on the graph above one can see some patterns, for example that setosa flowers on average have noticeably larger sepal widths than the others.
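A side-by-side box plot is one box per category on a shared axis. The sketch below assumes matplotlib is installed and uses a handful of illustrative sepal widths per species; these are made-up values chosen to resemble the pattern described above, not the real Iris measurements.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

# Illustrative sepal widths in cm (made up, not the actual Iris data).
sepal_width = {
    "setosa": [3.0, 3.2, 3.4, 3.5, 3.6, 3.8, 4.0],
    "versicolor": [2.2, 2.5, 2.7, 2.8, 2.9, 3.0, 3.2],
    "virginica": [2.5, 2.7, 2.9, 3.0, 3.1, 3.3, 3.6],
}

fig, ax = plt.subplots(figsize=(6, 4))

# One list per category gives one box per category, at x = 1, 2, 3.
result = ax.boxplot(list(sepal_width.values()))
ax.set_xticklabels(list(sepal_width.keys()))
ax.set_xlabel("species")
ax.set_ylabel("sepal_width (cm)")
ax.set_title("sepal_width by species")

fig.tight_layout()
fig.savefig("sepal_width_boxplots.png")
```

Because all boxes share the y-axis, differences in median, spread and outliers between the categories can be compared at a glance.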
When both variables are quantitative, for example sepal_width and petal_width in the Iris dataset above, we can use a scatter plot.
The scatter plot shows the relationship between sepal_width and petal_width. Based on it one can fit a regression line to see a potential trend between the two variables; here the line shows that as sepal_width increases, petal_width also increases.
The scatter plot in its most basic form has become a thing of the past. Usually it is accompanied by regression lines showing possible trends, by box plots along the axes showing the distribution of each variable, and by labeled points (usually colors are used to label different points on the graph). The graph below shows an advanced version of a scatter plot.
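A regression line over a scatter plot can be sketched in a few lines. The example below assumes matplotlib is installed, uses made-up paired measurements of two generic quantitative variables (not the Iris columns), and fits the line with ordinary least squares written out by hand so that the arithmetic is visible.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

# Made-up paired measurements of two quantitative variables.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Ordinary least squares for y = slope * x + intercept.
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

fig, ax = plt.subplots(figsize=(5, 4))
points = ax.scatter(x, y, label="observations")
ax.plot(x, [slope * xi + intercept for xi in x], color="red",
        label=f"fit: y = {slope:.2f}x + {intercept:.2f}")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

fig.tight_layout()
fig.savefig("scatter_with_fit.png")
```

A positive slope means the two variables tend to increase together. A fitted line describes a trend in the sample and does not by itself show that one variable causes the other.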
The last case is when both variables are categorical. The hypothetical example below illustrates it:
In a two-way table each cell contains the value for the intersection of two categorical attributes. For example, there are 21 Male Versicolor Iris flowers in our dataset. If both variables are categorical, their counts or percentages can be shown in a two-way table to clearly show the relationship between them.
If the number of categories is large, it can take a lot of time to read through the numbers. In those cases a heat map can be used; it displays the same information using cell colors, making it easy to find ranges of interest or unusual patterns. In this figure, for instance, Male-Setosa flowers are the least represented category in our dataset. The color bar on the right shows the number associated with each shade.
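Building a two-way table amounts to counting pairs of categories. A minimal sketch in plain Python, with a short made-up list of (gender, species) observations standing in for the hypothetical dataset:

```python
from collections import Counter

# Made-up (gender, species) observations, one tuple per flower.
observations = [
    ("Male", "Setosa"), ("Male", "Versicolor"), ("Male", "Versicolor"),
    ("Female", "Setosa"), ("Female", "Setosa"), ("Female", "Virginica"),
    ("Male", "Virginica"), ("Female", "Versicolor"),
]

table = Counter(observations)  # maps (row, column) -> count
rows = sorted({gender for gender, _ in observations})
cols = sorted({species for _, species in observations})

# Print the table with one row per gender and one column per species.
print(f"{'':<8}" + "".join(f"{c:>12}" for c in cols))
for r in rows:
    print(f"{r:<8}" + "".join(f"{table[(r, c)]:>12}" for c in cols))
```

Dividing each cell by the total, or by its row total, turns the counts into the percentages mentioned above.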
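A heat map is a two-way table drawn as a colored grid. The sketch below assumes matplotlib is installed. Only the 21 Male Versicolor flowers and the fact that Male-Setosa is the smallest cell come from the text; every other count is invented to fill the grid.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

genders = ["Male", "Female"]
species = ["Setosa", "Versicolor", "Virginica"]

# Rows follow `genders`, columns follow `species`.
# Apart from the 21 for Male-Versicolor, these counts are hypothetical.
counts = [
    [4, 21, 18],
    [25, 16, 16],
]

fig, ax = plt.subplots(figsize=(5, 3))
image = ax.imshow(counts, cmap="viridis")
fig.colorbar(image, ax=ax, label="Number of flowers")

ax.set_xticks(range(len(species)))
ax.set_xticklabels(species)
ax.set_yticks(range(len(genders)))
ax.set_yticklabels(genders)

# Write the count inside each cell so exact values stay readable.
for i, row in enumerate(counts):
    for j, value in enumerate(row):
        ax.text(j, i, str(value), ha="center", va="center", color="white")

fig.tight_layout()
fig.savefig("gender_species_heatmap.png")
```

Annotating the cells is optional; with many categories the colors alone are usually what makes the extremes stand out.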
Above I’ve shown how the types of the variables can guide the choice of graph, and what insights one can find in those graphs. This is only a basic set of the graphs that are available. Most other graphs are modifications of the basic graphs described above, used to show some extra information. One should be careful with them, as packing too many insights into the same graph may confuse people instead of making the point.
It is better to use multiple graphs that each visualize a few insights at a time, and only then try to bring the entire set of insights together in some hybrid graph. Basic graphs also have the benefit that most people know how to read them, so it is easier to convey information without attaching long documentation that repeats in words what the graph was meant to depict, which defeats the point of having a graph. If a graph isn’t self-explanatory, there is little point in having it.