Credit Card Fraud Detection is an online challenge on Kaggle where we aim to determine whether a transaction is fraudulent or not. I’ve divided this article into two parts: Part-1 covers the dataset and Exploratory Data Analysis, while Part-2 deals with class imbalance and a comparison of various classification models.
We’re given features V1, V2, …, V28, which are the principal components obtained with PCA. The only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. ‘Amount’ is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. ‘Class’ is a binary variable that takes the value 1 in case of a fraudulent transaction and 0 otherwise. The competition link is given below.
First, we’ll import all the dependencies needed for Exploratory Data Analysis, where we’ll analyze the dataset for significant patterns and deal with missing data, duplicates, heatmaps, distributions, etc.
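The article’s original import cell isn’t shown here; a typical set of dependencies for this kind of EDA looks like the following (the exact list is an assumption on my part):

```python
# Common dependencies for the EDA below (a typical stack; extend as needed)
import numpy as np               # numerical operations
import pandas as pd              # tabular data handling
import matplotlib                # plotting backend
matplotlib.use('Agg')            # headless-safe backend; drop this line in a notebook
import matplotlib.pyplot as plt  # plotting interface
```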
Reading the given dataset using pandas:
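A sketch of the loading step. The file name `creditcard.csv` is what the Kaggle download usually provides, but adjust the path to wherever your copy lives:

```python
from pathlib import Path

import pandas as pd

# Assumed file name from the Kaggle competition download; change if yours differs.
DATA_PATH = Path('creditcard.csv')

def load_transactions(path=DATA_PATH):
    """Read the transactions CSV into a DataFrame."""
    return pd.read_csv(path)

# df = load_transactions()
# df.head()
```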
Now, we’ll see if there is any missing data in the dataset.
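The check itself is a one-liner with pandas; a small helper (the function name is my own):

```python
import pandas as pd

def missing_report(df):
    """Nulls per column; a column of all zeros means there is nothing to impute."""
    return df.isnull().sum()

# missing_report(df)
```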
As we can see, there are no null values to deal with in the dataset, so we proceed further.
The dataset might consist of some duplicates, which we are going to check for and remove. To do this, we’ll compare the shape of the data before and after removing the duplicates.
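A sketch of the before/after comparison (the helper name is mine, not the article’s):

```python
import pandas as pd

def remove_duplicates(df):
    """Drop exact duplicate rows, printing the shape before and after."""
    print('Shape before:', df.shape)
    deduped = df.drop_duplicates()
    print('Shape after :', deduped.shape)
    return deduped

# df = remove_duplicates(df)
```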
As we can see, the shape of the data has changed after removing duplicates: 1081 rows containing duplicated data have been deleted.
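The two-peak pattern discussed next comes from a histogram of the ‘Time’ feature; a sketch of how to plot it (bin count is my choice):

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe backend; drop this line in a notebook
import matplotlib.pyplot as plt

def plot_time_distribution(df, bins=48):
    """Histogram of 'Time' (seconds elapsed since the first transaction)."""
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.hist(df['Time'], bins=bins)
    ax.set_xlabel('Seconds since first transaction')
    ax.set_ylabel('Number of transactions')
    return ax

# plot_time_distribution(df)
```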
There are two peaks in the graph. The dataset spans two days, so the two peaks correspond to the time of each day when the maximum number of transactions happen (and the dips correspond to night time, when people make few transactions).
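The imbalance referred to below can be quantified by counting the labels; a sketch (function and column names for the output are my own):

```python
import pandas as pd

def class_balance(df):
    """Absolute counts and percentages of each label in 'Class'."""
    counts = df['Class'].value_counts()
    pct = df['Class'].value_counts(normalize=True).mul(100).round(3)
    return pd.DataFrame({'count': counts, 'percent': pct})

# class_balance(df)
```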
We can see that we have a huge class-imbalance here.
Why is Class Imbalance a Problem?
When a statistical classifier is trained on a highly imbalanced dataset, it tends to pick up the patterns of the majority class and ignore the rest.
For example, in this dataset, about 99.9% of the transactions are labelled ‘Not Fraud’ and the rest are ‘Fraud’. So even if a model classifies everything it sees as ‘Not Fraud’, its accuracy will be about 99.9%, which seems excellent.
But is the model good? No, because it does not classify a single transaction as ‘Fraud’. So even with 99.9% accuracy, it is completely useless!
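This accuracy paradox is easy to demonstrate with a toy label vector of 999 legitimate and 1 fraudulent transaction (the numbers here are illustrative, not the dataset’s actual counts):

```python
import numpy as np

def always_not_fraud_accuracy(y):
    """Accuracy of a trivial model that predicts 0 ('Not Fraud') for every row."""
    y = np.asarray(y)
    preds = np.zeros_like(y)        # the "model": predict 'Not Fraud' everywhere
    return (preds == y).mean()

# 999 legitimate labels and 1 fraud: the do-nothing model still scores 99.9%.
y = np.array([0] * 999 + [1])
print(always_not_fraud_accuracy(y))  # 0.999
```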
We need strategies for working with such a dataset, or we need to use metrics other than accuracy in such scenarios. We are going to discuss the solutions in Part-2 of this blog.
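The pattern check mentioned next is typically a correlation heatmap; since V1–V28 come from PCA, they are uncorrelated with one another by construction, so little structure is expected. A sketch using plain Matplotlib (the seaborn-style styling of the original plot is not reproduced):

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe backend; drop this line in a notebook
import matplotlib.pyplot as plt

def correlation_heatmap(df):
    """Plot pairwise Pearson correlations of all numeric columns."""
    corr = df.corr(numeric_only=True)
    fig, ax = plt.subplots(figsize=(10, 8))
    im = ax.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.columns)))
    ax.set_yticklabels(corr.columns)
    fig.colorbar(im, ax=ax)
    return corr

# correlation_heatmap(df)
```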
We don’t observe any significant patterns so we’ll move ahead.
It is a better idea to scale the features before using the dataset so that all values fall within a similar range. This is important so that features of lesser significance do not end up dominating more significant features merely because of their larger range.
E.g., in some dataset, the Salary column might be in lakhs or crores, while the Age column would be under 100. This would lead the Salary column to dominate the prediction even though it might be less significant. For this reason, different types of scaling are used: Log Scaling, Standardization, and Normalization. We’ll decide which of these to choose depending on our dataset.
Log is a scaling technique which is done when the variables span several orders of magnitude.
Standardization is a scaling technique in which values are centered around the mean with a unit standard deviation. This means the mean of the attribute becomes zero and the resulting distribution has a unit standard deviation.
Normalization (Min-Max Scaling) is a scaling technique in which values are shifted and are then rescaled so that they end up ranging between 0 and 1.
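The three techniques above can be applied to ‘Amount’ in a few lines; a sketch (the new column names are my own, and `log1p` is used so zero-valued amounts don’t break the log):

```python
import numpy as np
import pandas as pd

def add_scaled_amounts(df):
    """Add log-, standard- and min-max-scaled copies of 'Amount'."""
    out = df.copy()
    out['Amount_log'] = np.log1p(out['Amount'])            # log(1 + x) tolerates zeros
    mean, std = out['Amount'].mean(), out['Amount'].std()
    out['Amount_std'] = (out['Amount'] - mean) / std       # zero mean, unit std
    lo, hi = out['Amount'].min(), out['Amount'].max()
    out['Amount_minmax'] = (out['Amount'] - lo) / (hi - lo)  # rescaled to [0, 1]
    return out

# df = add_scaled_amounts(df)
```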
We’re going to compare which scaling technique suits our dataset best, so we’ll make box-plots of each scaled ‘Amount’ feature, split by class.
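A sketch of that comparison, assuming the scaled columns already exist in the DataFrame (the column and function names are my own):

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe backend; drop this line in a notebook
import matplotlib.pyplot as plt

def boxplot_scalings_by_class(df, cols):
    """One box-plot per scaled column, grouped by the 'Class' label."""
    fig, axes = plt.subplots(1, len(cols), figsize=(4 * len(cols), 4))
    for ax, col in zip(axes, cols):
        df.boxplot(column=col, by='Class', ax=ax)
        ax.set_title(col)
    fig.suptitle('Scaled Amount by Class')
    return fig

# boxplot_scalings_by_class(df, ['Amount_log', 'Amount_std', 'Amount_minmax'])
```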
The minimum difference between the 0 and 1 classes is seen with Log Scaling; the others show a huge difference in amounts between the two classes. Thus, we’ll go forward with Log Scaling.
Part-2 coming soon!