Any dataset may contain two types of data: numerical and categorical. Machines have no problem understanding numerical data.
But some machine learning algorithms have a problem understanding categorical data, so it has to be converted into numerical form before being passed to the algorithm.
The process of converting categorical variables into numerical ones is called encoding. In this blog post, I will be discussing 2 of the most commonly used feature encoding techniques, Label Encoding and One Hot Encoding, and how to implement them in Python.
The categorical variables in the dataset are of type `object`. You can use the `dataset.info()` function to check the datatypes of the different variables in your dataset.
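As a quick sketch of that check (using a tiny made-up DataFrame rather than a real dataset, so the column names and values here are purely illustrative):

```python
import pandas as pd

# Tiny illustrative frame; a real dataset will have many more rows and columns.
df = pd.DataFrame({
    "Age": [22, 38, 26],                   # numerical
    "Sex": ["male", "female", "female"],   # categorical
    "Embarked": ["S", "C", "S"],           # categorical
})

df.info()  # the object-dtype columns are the categorical candidates
print(df.dtypes["Sex"])  # object
```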
I will be going through an example along with each explanation to make things clearer.
Label Encoding is one of the most commonly used feature encoding techniques. Here, each label in a categorical variable is assigned a unique integer.
For example, if there are 3 categories under the variable “Color” (Red, Blue, and Green), then they will be numbered as 2, 0, and 1 respectively.
You might be wondering why it is not 0, 1, 2. Nothing special: the categories are simply numbered in lexicographic (alphabetical) order, so Blue gets 0, Green gets 1, and Red gets 2.
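A minimal sketch of this numbering, using scikit-learn's `LabelEncoder` (introduced properly in the next section):

```python
from sklearn.preprocessing import LabelEncoder

colors = ["Red", "Blue", "Green"]
le = LabelEncoder()

# Categories are sorted alphabetically before numbering.
print(le.fit_transform(colors).tolist())  # [2, 0, 1]
print(le.classes_.tolist())               # ['Blue', 'Green', 'Red']
```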
The `sklearn` library provides a `LabelEncoder` class (in `sklearn.preprocessing`) to feature encode the values.
I will be using the well-known Titanic dataset as an example.
See that the dataset contains many categorical variables like `Sex`, `Cabin`, and `Embarked`.
If you are not sure which variables are categorical, use `df.info()` and check for the columns of `object` datatype. I will be trying to encode the `Sex` variable. See that it contains 2 categories, male and female.
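A sketch of the encoding step (I build a tiny stand-in DataFrame here instead of loading the actual Titanic CSV, so the rows are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for the Titanic 'Sex' column.
df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

le = LabelEncoder()
df["Sex"] = le.fit_transform(df["Sex"])

print(le.classes_.tolist())  # ['female', 'male'] -> female=0, male=1
print(df["Sex"].tolist())    # [1, 0, 0, 1]
```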
Now see that the categories male and female are replaced with the numbers 1 and 0 respectively, based on alphabetical order.
Now that you have read about label encoding, you might be wondering why you need to read about one hot encoding at all. Is there a drawback to label encoding? Before moving on to the next part, try to figure out the answer yourself.
Limitations of Label Encoding
Yeah, you guessed it right: in the previous example you saw that male is encoded as 1 and female as 0.
The machine only understands numbers; it doesn't know that these are just 2 categories of the same variable. The algorithm may think that male > female simply because male is encoded with a higher number than female.
This is known as the priority issue: with alphabetical label encoding, the algorithm effectively assumes Z > W > M > F > D > A.
One Hot Encoding
In one hot encoding, additional columns are created based on the number of unique categories in the categorical feature. For example, in the color example that we have seen, there are 3 unique categories (Red, Blue, and Green), so 3 columns will be created.
A row with the color value Red will have a value of 1 in the Red column and 0 in the Green and Blue columns; similarly, for the color value Blue, the value will be 1 in the Blue column and 0 in the Red and Green columns.
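That mapping can be sketched with pandas' `get_dummies` helper (an alternative one hot implementation to scikit-learn's, used here just to show the resulting table):

```python
import pandas as pd

colors = pd.DataFrame({"Color": ["Red", "Blue", "Green"]})

# One column per unique category, in alphabetical order: Blue, Green, Red.
one_hot = pd.get_dummies(colors["Color"]).astype(int)
print(one_hot)
```

Each row has exactly one 1, in the column matching its original category.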
Even though this method eliminates the priority problem, it adds more columns to the dataset. This can drastically increase training time or hurt the model's performance.
It is not advised to use one hot encoding if you have too many categories in a column; as a rule of thumb, one hot encoding is used when a column has fewer than 15 unique values.
Use the `OneHotEncoder` class in the scikit-learn library to one hot encode a certain column of the dataset. Let's use the same Titanic dataset.
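A sketch with a stand-in DataFrame (the same would apply to the loaded Titanic frame; note that `fit_transform` expects a 2D input, and `.toarray()` converts the default sparse result to a dense array for display):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Stand-in for the Titanic 'Sex' column.
df = pd.DataFrame({"Sex": ["male", "female", "female"]})

enc = OneHotEncoder()
# df[["Sex"]] (2D) rather than df["Sex"] (1D); the result is a
# sparse matrix by default, so convert it to see the values.
encoded = enc.fit_transform(df[["Sex"]]).toarray()

print(enc.categories_[0].tolist())  # ['female', 'male']
print(encoded)  # column 0 = female flag, column 1 = male flag
```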
See that 2 new columns are added at the end of the dataset. The “0” column indicates whether the sex in that row is female, and the “1” column indicates whether it is male.
The “0” column will output 1.0 if the passenger is female and 0.0 otherwise. Note that, by default, `OneHotEncoder` returns its result as a sparse matrix, a memory-efficient representation that stores only the nonzero entries.
Note: If you use one hot encoding, make sure you delete the original categorical column and also drop one of the newly created columns.
If you don't, it will lead to the Dummy Variable Trap. This is the case when some of the variables are highly correlated with each other and you can predict some of the columns with the help of the others.
In the one hot encoded Sex example above, if you know the value in the male column, you can always predict the value of the female column. That is why we need to delete one of the columns.
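One way to drop a column automatically, sketched with pandas: `get_dummies` has a `drop_first` flag that removes the first dummy column for you (scikit-learn's `OneHotEncoder` offers the same via its `drop='first'` option):

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# drop_first=True keeps only the 'male' column; female is implied by male == 0.
dummies = pd.get_dummies(df["Sex"], drop_first=True).astype(int)

print(dummies.columns.tolist())  # ['male']
print(dummies["male"].tolist())  # [1, 0, 0, 1]
```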
Thanks for reading through the blog post.
Follow us to get notified of our future content like this.