Data leakage is one of the biggest issues in machine learning. It can lead to deceptively good offline results and poor real-world performance, so it needs to be dealt with properly before you deploy your model to production.
Data leakage is one of the most difficult problems when developing a machine learning model. It happens when you train your algorithm on a dataset that includes information which would not be available at prediction time, once you apply the model to data collected in the future.
In simpler terms, data leakage happens when we accidentally share information between the test and training datasets while creating the model.
“Any other feature whose value would not actually be available in practice at the time you’d want to use the model to make a prediction is a feature that can introduce leakage to your model.” — Data Skeptic
Before we start, please note that this tutorial is part of the Python Data Analysis For Data Science & Machine Learning. Feel free to check it out for a more detailed explanation of this and other concepts.
- It causes a model to overstate its generalization performance, which makes its evaluation useless for any real-world application. Caution must therefore be taken, or the model will fail miserably when deployed in production.
- It can lead investors to make bad investments, at huge financial cost.
- It is especially dangerous in the healthcare sector, where wrong predictions can cost human lives.
- It can produce wrong predictions about customer behaviour, leading business leaders to take bad decisions that push the business into debt or, at worst, collapse it entirely.
Data Leakage is therefore one of the most important concepts to know as a Data Scientist or Machine Learning Engineer.
Data leakage often results in unrealistically high performance on the test set, because the model is being run on data it has already seen, to some extent, in the training set.
It has already memorised the patterns and everything, so why not? Of course it will perform well.
However, this is misleading, and the model will fail to generalise when deployed in production.
There are several causes of data leakage, including:
A. Duplicates
B. Leaky Predictors
C. Pre-processing activities
Duplicate values are a common problem in real-world datasets; you can’t run away from them. They occur when your dataset contains several data points that are identical.
For example, if you are working with a customer-reviews dataset for sentiment analysis, you are very likely to find customers who have written the same review for a product multiple times, partly because some product owners ask for extra reviews to drive sales, or because a customer who strongly likes or dislikes a product keeps writing the same positive or negative review over and over again.
In this situation you may experience data leakage, because your train and test sets can contain the same data point even though they correspond to different observations, and the model will then fail when you use it in production on new sets of reviews.
You may not explicitly leak your data, yet you can still experience leakage if there are dependencies between your test and train sets. This mostly happens when you are dealing with data where time is important (like time-series data).
Leaky Predictors include data that will not be available at the time you make predictions.
Let’s demonstrate this concept below:
We have created a dummy dataset (a minimal sketch of it is shown in code below) which contains:
— ‘Purchase’, whether the person purchased the item or not
— ‘QTY’, the quantity of the item purchased
— ‘Product’, the particular item purchased
— ‘Discount’, whether there was a discount on the item purchased or not
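The exact values aren’t reproduced here, so this is a hypothetical version of such a dataset built with pandas (the column values are made up purely for illustration):

```python
import pandas as pd

# Hypothetical dummy data: whether a customer purchased an item,
# the quantity bought, the product, and whether a discount was offered
df = pd.DataFrame({
    'Purchase': ['Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes'],
    'QTY':      [2, 1, 0, 3, 0, 1, 0, 2],
    'Product':  ['Bag', 'Shoe', 'Bag', 'Shirt', 'Shoe', 'Shirt', 'Bag', 'Shoe'],
    'Discount': ['Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes'],
})
print(df)
```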
People will mostly buy a product when they are given a good discount and the product is what they need. If you look at the data we have above, most of the people who got Discount also Purchased the product.
Let’s check the relationship or correlation between the two features below:
First, we will convert the object type values to numerical values with label encoding.
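A minimal sketch of that step, assuming we use sklearn’s LabelEncoder on every object-type column of the dummy DataFrame above:

```python
from sklearn.preprocessing import LabelEncoder

# Encode each string column as integers (e.g. 'No' -> 0, 'Yes' -> 1)
encoder = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    df[col] = encoder.fit_transform(df[col])
print(df)
```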
Now all the values are numerical, as shown above.
Let’s proceed to check the correlation between these variables.
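With everything numerical, pandas can compute the pairwise correlations directly (the exact numbers will depend on the dummy values used above):

```python
# Pairwise correlation matrix of all features
print(df.corr())
```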
We can see that there is a very strong relationship between these two features (Purchase and Discount), about 0.7, while the other features have much weaker correlations of -0.3 and 0.47.
Having a highly correlated feature is actually a good thing: if we want to build a model that predicts whether a customer will purchase a product or not, this variable will help us get good predictions on our dataset.
However, we should also note that discounts are normally given under certain conditions, such as festive seasons or customer type, or they may run only for a certain period of time. In short, discounts are not available all the time.
Considering the correlation between the two features (Purchase and Discount), if we build a model to predict whether a customer will purchase an item based on the given data, the model will learn that anyone who has a discount is highly likely to purchase an item. Validation data comes from the same source, so the pattern will repeat itself in validation, and the model will get great validation scores. But the model will fail when we deploy it in the real world, since the data that comes later may not include discounts.
Probably the most common cause of data leakage happens during the data pre-processing steps of machine learning.
Approach 1
Most of the time, we
- prepare our data
- split it into training and testing sets, and
- build and evaluate our model
While this is how most machine learning problems are approached, it exposes our test or validation set to the model during training and typically leads to data leakage.
Take, for instance, data normalisation, where we want to normalise our data so that it has a range of 0 to 1. This means that the largest value for each attribute is 1 and the smallest value is 0.
X’ = (X - Xmin) / (Xmax - Xmin)

where Xmax is the maximum value and Xmin is the minimum value of the column.

When X is the minimum value in the column, the numerator is 0, and hence X’ is 0. On the other hand, when X is the maximum value in the column, the numerator equals the denominator, and thus X’ is 1. If X is between the minimum and the maximum values, X’ falls between 0 and 1.
Now, normalising our data requires that we first calculate the minimum and maximum values of each variable and then use those values to scale it. If we split the dataset into train and test sets after that, the examples in the training set know something about the data in the test set: they have been scaled by the global minimum and maximum, so every data point carries information about every other.
Similarly, standardisation estimates the mean and standard deviation from the whole dataset in order to scale the variables.
Each data point gets a taste of the others, whether it sits in the train set or the test set.
Missing-value imputation suffers from the same problem: filling gaps with statistics computed on the full dataset lets the test set leak into training.
This happens with almost all data preparation techniques.
Approach 2
We can therefore reorganise our process in this way:
- Split the data into training and testing sets.
- Perform data preparation on the training set only.
- Fit the model on the training set.
- Evaluate the model on the test/validation set.
Let’s see an example below:
The Wrong Way
First, let’s try Approach 1 and evaluate our results, i.e.:
- prepare our data
- split it into training and testing sets, and
- build and evaluate our model
We will use sklearn’s MinMaxScaler to scale our data into the range 0–1.
Let’s create some dummy dataset here.
We will use sklearn’s make_classification() function to create a dataset with 1,000 records and 10 features.
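Something along these lines, with the random seed being an arbitrary choice:

```python
from sklearn.datasets import make_classification

# Synthetic binary classification data: 1,000 rows, 10 features
X, y = make_classification(n_samples=1000, n_features=10, random_state=7)
print(X.shape, y.shape)  # (1000, 10) (1000,)
```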
Now we have our dataset.
Step 1. Now let’s normalise our dataset
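Here is what that looks like; note that the scaler is fitted on the full dataset, which is precisely the mistake:

```python
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on ALL the data (train rows AND future test rows) -- leakage!
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
```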
Step 2. Now let’s split our data into training and testing set
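A standard split; the test size and seed here are arbitrary choices for the sketch:

```python
from sklearn.model_selection import train_test_split

# Split the ALREADY-scaled data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=7)
```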
Step 3. Let’s build and evaluate the model
Build the model:
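The original post doesn’t pin down the classifier, so assume a plain logistic regression:

```python
from sklearn.linear_model import LogisticRegression

# Fit a simple classifier on the training set
model = LogisticRegression()
model.fit(X_train, y_train)
```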
Evaluate the model:
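Evaluating on both splits (your exact numbers may differ from the ones quoted below, depending on the generated data and seed):

```python
from sklearn.metrics import accuracy_score

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f'Train accuracy: {train_acc * 100:.2f}%')
print(f'Test accuracy:  {test_acc * 100:.2f}%')
```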
We are achieving 88.38% accuracy on the training data and 91.50% on the testing data, which looks quite good as it is.
Now let’s do it the right way and see what happens.
The Right Way
Step1: We will first split the data into train and test sets.
Step 2: We now scale our data using the MinMaxScaler
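The key difference: the scaler is fitted on the training set only, and the test set is merely transformed using those training-set statistics:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)  # learn min/max from the train set only
X_test = scaler.transform(X_test)        # reuse the train set's min/max
```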
NB: We did not scale y_test, since we want it to represent a real-world dataset that we can use for testing or validation. In most cases you will not need to scale y_train either, since it will already be in a small range; scaling X_train and X_test is enough to get going. However, depending on your dataset and problem statement, you can scale y_train, BUT never before splitting.
Step 3: Build and evaluate the model
Build the model:
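Same assumed classifier as before:

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
```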
Evaluate the model:
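And the same evaluation on both splits:

```python
from sklearn.metrics import accuracy_score

print(f'Train accuracy: {accuracy_score(y_train, model.predict(X_train)) * 100:.2f}%')
print(f'Test accuracy:  {accuracy_score(y_test, model.predict(X_test)) * 100:.2f}%')
```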
What happened?
Now the reality of the model is revealed: it is overfitting. In Approach 1 the model had tasted both the training and test sets, so it had memorised the patterns and still appeared to do well. Approach 2, however, shows that this model won’t work in production; it will fail miserably, so we need to tune it.
K-fold cross-validation involves splitting a dataset into K non-overlapping groups (folds) of rows. You train your model on all but one fold and evaluate it on the held-out fold, then repeat the process K times so that each fold gets a chance to be the held-out test set. Finally, you average performance across all evaluations.
With K = 5, for each estimate the dataset is divided into 5 folds: 4 for training and the remaining 1 for testing.
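A small sketch of the mechanics using sklearn’s KFold (the shuffle and seed are arbitrary):

```python
from sklearn.model_selection import KFold

# 5 non-overlapping folds; each iteration holds one fold out for testing
kf = KFold(n_splits=5, shuffle=True, random_state=7)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f'Fold {fold}: {len(train_idx)} train rows, {len(test_idx)} test rows')
```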
Let’s check the two approaches discussed above using cross-validation.
Approach 1: The Wrong Way
Step 1: Scale the data using the MinMaxScaler
Step 2: Perform cross-validation and check the accuracy
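Both steps together, as a sketch: the scaler sees the whole dataset before the folds are formed, so every fold leaks into every other:

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Step 1 (the leak): scale the FULL dataset up front
X_scaled = MinMaxScaler().fit_transform(X)

# Step 2: cross-validate on the pre-scaled data
scores = cross_val_score(LogisticRegression(), X_scaled, y,
                         cv=5, scoring='accuracy')
print(f'Mean accuracy: {scores.mean() * 100:.2f}%')
```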
Our model achieves 88.76% accuracy using Approach 1.
Let’s consider Approach 2 as well.
Approach 2 puts everything in a pipeline.
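A sketch of that pipeline: cross_val_score refits the whole pipeline, scaler included, on each training fold, so the held-out fold never influences the preprocessing step:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Scaling happens INSIDE each cross-validation fold
pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression()),
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f'Mean accuracy: {scores.mean() * 100:.2f}%')
```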
Running the example normalises the data correctly within the cross-validation folds of the evaluation procedure to avoid data leakage.
Comparing the two accuracies, we can expect Approach 2, with an accuracy of 85.43%, to perform better in production than Approach 1, even though the latter scored 88.76%.
End notes: I highly recommend using Approach 2 in all the scenarios above, i.e. split your data into training and testing sets first, before performing any data pre-processing activities, in order to avoid unnecessary data leakage.
If you liked this tutorial, check out the Python Data Analysis For Data Science & Machine Learning for a more detailed explanation of this and other concepts.
If you liked this tutorial, please give it a clap, and don’t forget to follow me for more tutorials; I will be posting tutorials with in-depth explanations so that we can learn from each other.
Have a nice day.