All you need to know about Categorical Data and Encoding it!
Machine Learning is a highly glorified field. It is said that you can teach your machine to learn without “explicitly” programming it. While that is true, a lot of programming does go into it (rather implicitly). Categorical Encoding is a Feature Engineering step that is performed before training a Machine Learning model.
What is Categorical Data?
Before we talk about categorical encoding, we first need to understand categorical data. Categorical data is data whose values fall into a finite set of groups or categories. In classification problems, the target variable is categorical, and input variables can be categorical too. For instance, gender, country name, and movie genre are all examples of categorical data. In contrast, numerical data represents measured or counted quantities; e.g. the weight of a person, which is continuous.
What is Categorical Encoding and Why Do We Need It?
Now that we have an understanding of categorical data, let’s talk about categorical encoding. Why do we need to encode the data? The problem is that we as humans can understand the data as is, but most machine learning algorithms operate on numbers, so we need to encode it in a form they can work with.
The machines (mostly our computers, we are not coding a robot here :D) need the data to be in a numeric format to process it. Therefore, we convert categorical data into numbers using categorical encoding.
Two Most Commonly Used Types of Encoding Categorical Data
Label Encoding
Consider a data set consisting of the names of countries: Japan, Pakistan, Canada, and China. Now, if we want to encode them as numbers, what can we do? We can have something as follows:
Japan: 0
Pakistan: 1
Canada: 2
China: 3
This type of encoding, where we convert each unique value to a number, is called Label Encoding or Ordinal Encoding.
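The mapping above can be sketched in plain Python. This is only an illustration, not a library API: the names countries, mapping, and encoded are made up for this example, and the repeated "Japan" is added just to show that a repeated value reuses its number.

```python
# Assign each unique country an integer in order of first appearance.
countries = ["Japan", "Pakistan", "Canada", "China", "Japan"]

mapping = {}
for name in countries:
    if name not in mapping:
        # The next free integer is simply the current size of the mapping.
        mapping[name] = len(mapping)

# Replace every value with its integer label.
encoded = [mapping[name] for name in countries]

print(mapping)
print(encoded)
```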
Problem: Encoding in such a way can make your algorithm think that these values are somehow ranked. So it may read them as Japan < Pakistan < Canada < China (because 0 < 1 < 2 < 3). Therefore, label encoding should only be used when the order of the values actually means something. For instance, it can be used for job levels ranked Junior, Middle, Senior, Lead. In such a case, there is a known relationship between these values.
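For the job-level case, here is a minimal sketch of how the ranking could be stated explicitly in Pandas, so the codes follow Junior < Middle < Senior < Lead rather than alphabetical order. The column names level and level_code and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Rank order taken from the text: Junior < Middle < Senior < Lead.
levels = ["Junior", "Middle", "Senior", "Lead"]

# Hypothetical sample data.
df = pd.DataFrame({"level": ["Senior", "Junior", "Lead", "Middle", "Junior"]})

# Passing the categories explicitly makes the codes follow our ranking
# instead of the default sorted (alphabetical) order.
ranked = pd.Categorical(df["level"], categories=levels, ordered=True)
df["level_code"] = ranked.codes

print(df)
```

Because the order is spelled out, Junior gets the smallest code and Lead the largest, which is the relationship we want the model to see.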
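Let’s quickly see how we can apply label encoding with Pandas. First, we convert the column that we want to encode to the ‘category’ dtype, which assigns an index to each unique value in alphabetical order, and then read the integer codes through the cat.codes accessor to ordinally encode the column. Note that because of the alphabetical ordering, the numbers differ from the hand-written mapping above: Canada becomes 0, China 1, Japan 2, and Pakistan 3.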
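A minimal sketch of the Pandas approach, assuming a DataFrame with a single column named country (the column names are made up for this example):

```python
import pandas as pd

df = pd.DataFrame({"country": ["Japan", "Pakistan", "Canada", "China"]})

# Convert the column to the 'category' dtype; categories are sorted,
# so for strings they end up in alphabetical order.
df["country"] = df["country"].astype("category")

# cat.codes gives the integer position of each value within the categories.
df["country_code"] = df["country"].cat.codes

print(df)
```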
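Here is how you can apply label encoding with Sklearn. In Sklearn, there is an OrdinalEncoder that we can initialize and then call fit_transform on to ordinally encode a DataFrame column. It expects two-dimensional input, so a single column is passed as a one-column DataFrame or as a list of single-item lists.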
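A short sketch with Sklearn, again assuming a hypothetical one-column DataFrame named df. With default settings the encoder sorts the categories, so the codes match the alphabetical ones from Pandas, returned as floats.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"country": ["Japan", "Pakistan", "Canada", "China"]})

encoder = OrdinalEncoder()

# Double brackets keep the input two-dimensional (a one-column DataFrame).
codes = encoder.fit_transform(df[["country"]])

print(encoder.categories_)  # the categories learned for each column
print(codes)                # one row per sample, one column per feature
```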
One-hot Encoding
One-hot encoding comes to our rescue when we need to encode values that have no ordered relationship with each other. We can take the same example from the previous section (Japan, Pakistan, Canada, and China). In one-hot encoding, we create a binary column for each unique value, and for any given row only one of those columns is true/hot (hence one-hot).
So, for each row, we set the column that matches its value to 1 and the rest to 0. With the columns in the order Japan, Pakistan, Canada, China, Japan will be encoded as 1000, Pakistan as 0100, Canada as 0010, and China as 0001.
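To make the idea concrete, here is a hand-rolled sketch in plain Python. The helper one_hot is a made-up name for illustration, not a library function, and the column order simply follows the list as written.

```python
countries = ["Japan", "Pakistan", "Canada", "China"]

# Position of each country's column in the one-hot vector.
index = {name: i for i, name in enumerate(countries)}


def one_hot(name):
    """Return a list of 0s with a single 1 at the position of `name`."""
    vector = [0] * len(index)
    vector[index[name]] = 1
    return vector


for name in countries:
    print(name, "".join(str(bit) for bit in one_hot(name)))
```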
Please see examples of one-hot encoding using Pandas and Sklearn below.
Pandas provides a function named get_dummies, which creates an indicator/dummy column for each unique value of a column, thereby one-hot encoding it.
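A minimal sketch of get_dummies, assuming a hypothetical column named country. The dtype=int argument is used here only so that the result shows 0/1 instead of booleans; note that the columns come out in sorted order.

```python
import pandas as pd

df = pd.DataFrame({"country": ["Japan", "Pakistan", "Canada", "China"]})

# One indicator column per unique value; columns are sorted alphabetically.
dummies = pd.get_dummies(df["country"], dtype=int)

print(dummies)
```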
Just like label encoding, Sklearn has built-in one-hot encoding. We initialize the OneHotEncoder and call fit_transform on it, and it returns a matrix (by default a sparse matrix) that is one-hot encoded against the provided values.
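A short sketch with Sklearn, using the same hypothetical one-column DataFrame. The default output is sparse, so toarray() is called here just to make the result easy to read.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"country": ["Japan", "Pakistan", "Canada", "China"]})

encoder = OneHotEncoder()

# Input must be two-dimensional; the result is a sparse matrix by default.
encoded = encoder.fit_transform(df[["country"]])

# Convert to a dense array for display.
dense = encoded.toarray()

print(encoder.categories_)  # column order of the one-hot matrix
print(dense)
```

The sparse default can save memory when a column has many unique values, since most entries in a one-hot matrix are 0.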
Here is a Colab notebook that you can play around with to explore these two types of categorical encoding even more.
Conclusion
We often dive into ML knowing the basics of supervised and unsupervised learning, classification, and regression. But in order to actually work in the field, it is good to know which preprocessing steps come into play before training a model. Categorical encoding methods are not limited to the two mentioned above, but these two are the basis of most of the others. Data scientists need to look at the individual problem they are trying to solve and pick the most relevant encoding scheme on that basis.
If you have any questions, leave them in the comments below. Happy learning!