How to do feature engineering beyond scaling and one-hot encoding
Being a data scientist is like being a craftsman. You are equipped with a set of tools and required to create something beautiful yet functional out of simple material. Some of your work might be done by automatic machinery, but you know that your true power lies in creativity and the intricate skill of working by hand.
In this series, we will hone your skill set by exploring several approaches to representing data as features. These approaches can help your model learn better, especially if you have tons of data in hand.
Imagine that you have data with the following pattern: the horizontal axis represents a feature X₁, the vertical axis represents another feature X₂, and each instance (a point in the plot) belongs to either the -1 or the 1 class (represented by red and green).
Now let me challenge you to draw a linear boundary that separates the two classes in this data. I bet you can't, and indeed this is an example of data that is not linearly separable.
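If you want to play with a pattern like this yourself, here is a minimal sketch that generates a synthetic XOR-style dataset (the exact quadrant-to-class assignment is my assumption, not taken from the plot) and confirms that a plain linear classifier barely beats random guessing on it:

```python
# Synthetic XOR-like data: the class depends on which quadrant a point
# falls into, so no single straight line can separate the two classes.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))       # columns play the role of X1, X2
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # label determined by the quadrant

# A linear classifier on the raw features is roughly as good as a coin flip.
linear = LogisticRegression().fit(X, y)
print("accuracy on raw features:", linear.score(X, y))  # ~0.5
```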
There are several ways to handle this kind of data, such as using inherently non-linear classification models like a decision tree or a complex neural network. However, there is a simple technique we can use to make a simple linear classifier work very well on this kind of data.
Here is the trick. First, let's discretize each continuous feature into two buckets based on its sign, following the split shown on the plot. For X₁, let A denote its positive values and B its negative values. Similarly, let C denote the positive values of X₂ and D its negative values.
Then, we can create a new categorical feature by taking all possible combinations of the newly created buckets (sketched in code after the list):
- AD = {X₁ > 0 and X₂ < 0}
- AC = {X₁ > 0 and X₂ > 0}
- BC = {X₁ < 0 and X₂ > 0}
- BD = {X₁ < 0 and X₂ < 0}
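Here is one way this bucketing-and-crossing step could look in code; a minimal sketch using NumPy, with made-up variable names (x1, x2) standing in for the two features:

```python
import numpy as np

# Made-up feature values standing in for X1 and X2.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 500)
x2 = rng.uniform(-1, 1, 500)

# Bucket each feature by sign: A/B for X1, C/D for X2.
bucket_x1 = np.where(x1 > 0, "A", "B")
bucket_x2 = np.where(x2 > 0, "C", "D")

# Cross the two buckets into a single categorical feature: AC, AD, BC or BD.
crossed = np.char.add(bucket_x1, bucket_x2)
print(crossed[:5])   # e.g. ['BD' 'AC' 'AD' 'AC' 'BC']
```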
With this brand-new feature, we can now easily predict the class of an instance using a simple linear classifier.
If you look carefully, you can immediately see that an appropriate set of weights is the labels themselves: each bucket's weight equals the label of the points that fall into it (w₀ = -1, w₁ = 1, w₂ = 1, w₃ = -1).
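As a sanity check, here is a hedged sketch that one-hot encodes the crossed feature and fits a linear model on it; assuming an XOR-like labelling of the quadrants (my assumption, since the plot is not reproduced here), each learned weight comes out equal to the label of its bucket:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Assumed labelling: quadrants I and III are -1, quadrants II and IV are 1.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, -1, 1)

# Rebuild the crossed categorical feature and one-hot encode it.
crossed = np.char.add(np.where(X[:, 0] > 0, "A", "B"),
                      np.where(X[:, 1] > 0, "C", "D"))
enc = OneHotEncoder()
Z = enc.fit_transform(crossed.reshape(-1, 1)).toarray()

# A plain linear model now fits the data perfectly, and each weight equals
# the label of the bucket it corresponds to.
model = LinearRegression(fit_intercept=False).fit(Z, y)
for bucket, w in zip(enc.categories_[0], model.coef_):
    print(bucket, round(w, 2))   # AC -1.0, AD 1.0, BC 1.0, BD -1.0
```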
The transformation we just did is an example of a feature cross, where we combine multiple features into a new one to help our model learn better.