Feature Scaling in Machine Learning

Normalization vs Standardization

In this article we will discover answers to the following questions:

What is feature scaling, and why is it required in Machine Learning (ML)?
Normalization: pros and cons.
Standardization: pros and cons.
Normalization or standardization: which one is better?

First things first, let’s use an analogy to understand why we need feature scaling. Think of building an ML model as making a smoothie, and this time you are making a strawberry-banana smoothie. You have to mix the strawberries and bananas carefully to make the smoothie taste good. If you just mix one strawberry and one banana, chances are you will taste only the banana. The fruits need to be mixed in equal proportions, not in equal numbers. This is exactly what happens with models that have many input features: if left unscaled, some features completely dominate the others. Thus, we normalize or standardize all the features to bring them onto a common scale.

Every feature in a dataset consists of two parts:

Magnitude and Unit

Most of the time, a dataset contains features that vary widely in magnitude, unit, and range. This becomes a problem for algorithms such as K-Nearest Neighbours (KNN) or K-Means clustering, which measure the Euclidean distance between data points in their computations. To understand this problem better, consider 3 features from a housing dataset as shown in the figure:

“Age in years” indicates the age of the house, “Amount in Dollars” indicates the listed price of the house, and “Garage” is a flag feature that indicates if the house has a garage or not.

If the features are left unscaled, a distance-based ML algorithm only takes the magnitude of each feature into account and ignores its unit. This means it would treat 1 dollar as equivalent to 1 year, which makes no sense. Also, as the values become larger, the data points lie further apart. This increases the Euclidean distance between the points and, for some algorithms, can also increase the computation time.

Another problem is that features with high magnitudes and ranges weigh far more in the distance calculations than features with low magnitudes and ranges. For example, a feature that ranges between 0 and 10M will completely dominate a feature that ranges between 0 and 60. In effect, the feature with the larger magnitude and range gets more priority simply because of its scale, not because it is more informative. This makes no sense either. Therefore, to suppress all these effects, we want to scale the features.
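To make the dominance effect concrete, here is a small sketch in plain Python. The three houses and the feature ranges used for rescaling are made-up values for illustration only; they are not taken from any real dataset.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical houses: (age in years, price in dollars, garage flag)
house_a = (5, 300_000, 1)
house_b = (55, 301_000, 0)   # very different age and garage, similar price
house_c = (5, 310_000, 1)    # same age and garage, slightly higher price

# Unscaled: the price column decides almost everything.
raw_ab = euclidean(house_a, house_b)
raw_ac = euclidean(house_a, house_c)
print(f"unscaled  A-B: {raw_ab:.2f}   A-C: {raw_ac:.2f}")

# Assumed (min, max) range of each feature, used only for this illustration.
RANGES = [(0, 60), (200_000, 400_000), (0, 1)]

def rescale(point):
    """Map every feature of a point onto the 0..1 scale using RANGES."""
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(point, RANGES))

scaled_ab = euclidean(rescale(house_a), rescale(house_b))
scaled_ac = euclidean(rescale(house_a), rescale(house_c))
print(f"scaled    A-B: {scaled_ab:.3f}   A-C: {scaled_ac:.3f}")
```

Before scaling, house A looks closer to house B than to house C, even though C differs from A only by a small price change. After scaling, the ordering flips, because age and garage now contribute on the same footing as price.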

For this article, I will use some features from sklearn’s Boston housing dataset to demonstrate the effects of scaling. You don’t need to scale features for this dataset since this is a simple Linear Regression problem. I am just utilizing the data for illustration.

The two most common ways of scaling features are:

Normalization
Standardization

Note: Neither normalization nor standardization changes the shape of a feature’s distribution. Both are linear transformations that only shift and rescale the values.

Let’s look at them individually and understand the pros and cons of each.

1. Normalization: Normalization (min-max scaling) transforms features with different scales to a fixed scale of 0 to 1. This ensures that no single feature dominates the others purely because of its range.

Normalization can be achieved with the Min-Max Scaler. By default, the Min-Max Scaler scales features to between 0 and 1. We can also choose different min and max values using the “feature_range” argument of sklearn’s MinMaxScaler. The formula for the Min-Max Scaler is:

X’ = (X - X_min) / (X_max - X_min)
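As a minimal sketch, min-max scaling can be written in a few lines of plain Python. The function name `min_max_scale`, the sample `ages` list, and the handling of a constant feature are my own choices for illustration; the `feature_range` parameter only mimics the idea of the argument mentioned above and is not sklearn’s implementation.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly rescale values so the minimum maps to lo and the maximum to hi."""
    lo, hi = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Constant feature: map everything to the lower bound (a choice made here).
        return [lo for _ in values]
    return [lo + (v - v_min) / span * (hi - lo) for v in values]

ages = [2, 10, 25, 40, 60]          # hypothetical house ages in years
default_scaled = min_max_scale(ages)
custom_scaled = min_max_scale(ages, feature_range=(-1.0, 1.0))

print(default_scaled)
print(custom_scaled)
```

The smallest age maps to the lower bound of the range and the largest to the upper bound; everything else keeps its relative position in between.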

It is important to note that normalization is sensitive to outliers. If the data has outliers, the max value of the feature will be very high, and most of the data will get squeezed into the lower part of the scale.
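The squeezing effect is easy to reproduce. The sketch below uses made-up crime-rate values with a single large outlier; the numbers are hypothetical and not taken from the Boston dataset.

```python
def min_max_scale(values):
    """Plain min-max scaling to the 0..1 range."""
    v_min, v_max = min(values), max(values)
    return [(v - v_min) / (v_max - v_min) for v in values]

# Hypothetical per-capita crime rates; the last value is an outlier.
crime = [0.1, 0.2, 0.3, 0.4, 0.5, 90.0]

scaled = min_max_scale(crime)
typical = scaled[:-1]   # everything except the outlier

print([round(s, 4) for s in scaled])
print("largest non-outlier value after scaling:", round(max(typical), 4))
```

The outlier takes the value 1, while all the other points end up below 0.01, so the differences between them become almost invisible on the new scale.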

Also, the min and max values are learned only from the training data. An issue arises when new data contains a value of X that lies outside those bounds: the resulting X’ value will not be in the range of 0 to 1.
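Here is a small sketch of that situation, assuming a hand-written `MinMax` helper class (hypothetical, written only for this example) that learns the bounds from the training values and reuses them on new values. Clipping the result back into the range is shown as one possible way of dealing with it, not as a recommendation from this article.

```python
class MinMax:
    """Tiny min-max scaler that remembers the bounds seen during fit."""

    def fit(self, values):
        self.v_min = min(values)
        self.v_max = max(values)
        return self

    def transform(self, values):
        span = self.v_max - self.v_min
        return [(v - self.v_min) / span for v in values]

train = [10, 20, 30, 40, 50]   # hypothetical training values
new = [5, 30, 70]              # new data, partly outside the training bounds

scaler = MinMax().fit(train)
scaled_new = scaler.transform(new)
print(scaled_new)

# One possible remedy: clip the scaled values back into the 0..1 range.
clipped = [min(1.0, max(0.0, s)) for s in scaled_new]
print(clipped)
```

The values 5 and 70 fall below and above the training bounds, so their scaled values land outside 0 to 1. Clipping restores the range but throws away how far outside the bounds the new values were.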

For example, observe the effect of normalization on the ‘CRIM’ feature from the Boston housing dataset. ‘CRIM’ is the per capita crime rate by town.

Normalization did not change the shape of the feature’s distribution; only the scale of the values changed.
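One way to check this numerically is to compare the skewness of a feature before and after scaling. The sketch below assumes the population skewness (the mean of the cubed z-scores) as the measure of shape, and uses made-up right-skewed values rather than the real ‘CRIM’ column.

```python
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

def min_max_scale(values):
    """Plain min-max scaling to the 0..1 range."""
    v_min, v_max = min(values), max(values)
    return [(v - v_min) / (v_max - v_min) for v in values]

# Hypothetical right-skewed feature (many small values, a few large ones).
crime = [0.1, 0.2, 0.2, 0.3, 0.5, 1.0, 4.0, 9.0]
scaled = min_max_scale(crime)

skew_before = skewness(crime)
skew_after = skewness(scaled)
print(round(skew_before, 6), round(skew_after, 6))
```

Both numbers agree up to floating-point rounding: the feature is just as right-skewed after normalization as it was before.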

2. Standardization: Standardization transforms features such that their mean (μ) equals 0 and their standard deviation (σ) equals 1. The resulting min and max values are not fixed; they depend on how many standard deviations the original values lie from the mean.

Standardization can be achieved by Z-score Normalization. The Z-score is given by:

z = (x - μ) / σ
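A minimal sketch of the z-score transformation in plain Python follows. The function name `standardize` and the sample `prices` are made up for illustration, and the population standard deviation (dividing by n) is an assumption on my part; using the sample standard deviation instead would give slightly different values.

```python
import statistics

def standardize(values):
    """Z-score each value using the mean and population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)   # population std: divides by n
    return [(v - mu) / sigma for v in values]

prices = [21.0, 24.0, 18.5, 30.1, 22.4]   # hypothetical values in $1000's
z = standardize(prices)

print([round(v, 3) for v in z])
print("mean:", round(statistics.fmean(z), 10))
print("std: ", round(statistics.pstdev(z), 10))
```

After the transformation the feature has a mean of 0 and a standard deviation of 1, up to floating-point rounding, while its min and max are not pinned to any fixed range.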

Unlike the min and max values used in normalization, the mean and standard deviation of a feature are more robust to new data. Standardization is more effective if the feature has a Gaussian distribution to begin with. Observe what happens when you standardize the ‘CRIM’ feature, which has a right-skewed distribution.

In contrast, if you do the same on the ‘MEDV’ feature, which has a Gaussian or Gaussian-like distribution, the z-score transformation is more effective. ‘MEDV’ is the median value of owner-occupied homes (in $1000’s).
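The difference between the two cases can be sketched with two small made-up samples, one right-skewed and one symmetric. These are hypothetical values, not the real ‘CRIM’ and ‘MEDV’ columns, and the population standard deviation is again an assumption.

```python
import statistics

def standardize(values):
    """Z-score each value using the mean and population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

skewed = [0.1, 0.2, 0.2, 0.3, 0.5, 1.0, 4.0, 9.0]   # hypothetical, right-skewed
symmetric = [10, 15, 18, 20, 22, 25, 30]            # hypothetical, symmetric

z_skewed = standardize(skewed)
z_symmetric = standardize(symmetric)

print("skewed    min/max z:", round(min(z_skewed), 2), round(max(z_skewed), 2))
print("symmetric min/max z:", round(min(z_symmetric), 2), round(max(z_symmetric), 2))
```

For the symmetric sample the z-scores spread evenly around 0. For the skewed sample most values sit slightly below 0 while a few lie far above it, so the feature is centered and rescaled but still lopsided.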

So, normalization or standardization: which one is better? You guessed it right: it depends. It is sometimes good to scale the numerical features selectively, but the better option is to try out different combinations of data scaling and then compare the performance of the model.

The different combinations could be:

a. Simply normalizing all the features

b. Simply standardizing all the features

c. Selectively normalizing features with a non-Gaussian distribution and standardizing features with a Gaussian distribution.
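Option c can be sketched as follows. Deciding whether a feature is “Gaussian enough” by looking at its skewness, and the cutoff of 1.0, are my own assumptions for illustration; a formal normality test or a visual check of the histogram would be equally reasonable. The column values are made up and only borrow the names ‘CRIM’ and ‘MEDV’.

```python
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

def min_max_scale(values):
    v_min, v_max = min(values), max(values)
    return [(v - v_min) / (v_max - v_min) for v in values]

def standardize(values):
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

SKEW_THRESHOLD = 1.0   # hypothetical cutoff for "clearly non-Gaussian"

def scale_selectively(columns):
    """Normalize strongly skewed columns, standardize the rest."""
    scaled, choices = {}, {}
    for name, values in columns.items():
        if abs(skewness(values)) > SKEW_THRESHOLD:
            choices[name] = "normalize"
            scaled[name] = min_max_scale(values)
        else:
            choices[name] = "standardize"
            scaled[name] = standardize(values)
    return scaled, choices

# Hypothetical columns, not the real Boston data.
data = {
    "CRIM": [0.1, 0.2, 0.2, 0.3, 0.5, 1.0, 4.0, 9.0],
    "MEDV": [10, 15, 18, 20, 20, 22, 25, 30],
}
scaled, choices = scale_selectively(data)
print(choices)
```

The same function could be swapped for options a or b by forcing a single choice for every column, which makes it straightforward to train the model once per option and compare the results.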

Conclusion:

Normalization allows us to transform all features with varying scales to a common scale, but it does not handle outliers well. In contrast, standardization is less sensitive to outliers and to new data, and it facilitates faster convergence of the loss function for some algorithms. Therefore, standardization is typically preferred over normalization.

As a final note, here is a quick cheat sheet for reference.

Thank you for reading. Get in touch if you have further questions via LinkedIn.
