The Azdias dataset has 366 features (columns) and 891,211 individuals (rows), while the Customers dataset has 369 features (the extra three being ‘CUSTOMER_GROUP’, ‘ONLINE_PURCHASE’ and ‘PRODUCT_GROUP’) and 191,652 individuals. Almost all of the features are categorical; most of them are ordinal and some are nominal.
The image on the left shows the number of features per information level. There are more than 100 PLZ8-related features, which describe in detail the car(s) owned by a person, and we also have features about the person, the household, the building, and so on.
Every information level has its own related features, and as mentioned before, most features hold ordinal categorical data that starts from 0 or 1.
Check out the last table: it shows two attributes and their levels. HH_EINKOMMEN_SCORE has six levels, from 1 to 6 (highest income to very low income), plus -1 and 0 for unknown. We will deal with the unknowns in the next steps. INNENSTADT, likewise, encodes distance in ordered levels. Most of the features contain ordinal data like the ones in the image above.
1.2 Missing Data
In this step we deal with the missing data as well as the unknown values. Before handling the unknowns, there is one more problem to take care of: the mixed-type warning for columns 18 and 19 shown in the image below.
I converted those columns to a numeric type with the parameter errors=”coerce”, so any value that fails to convert is replaced with NaN. After fixing the mixed-type problem, all of the data is numeric.
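A minimal sketch of that conversion, assuming the data is loaded into a pandas DataFrame called azdias and the warning refers to column positions 18 and 19:

```python
import pandas as pd

# Coerce the two mixed-type columns to numeric; anything that cannot be
# parsed (e.g. stray strings) becomes NaN.
mixed_cols = azdias.columns[[18, 19]]
azdias[mixed_cols] = azdias[mixed_cols].apply(pd.to_numeric, errors="coerce")
```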
We also need to convert the unknown labels to NaN, because many values of -1 or 0 actually mean “unknown”. I converted all of them to NaN. Note that not every feature uses -1 or 0 for unknowns; some use 9 instead, which is why I had to build a lookup DataFrame like the one in the image above.
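A rough sketch of that replacement, assuming a per-attribute mapping of “unknown” labels built from the attribute description files (the example entries are only illustrative):

```python
import numpy as np

# Illustrative mapping: each attribute -> the labels that mean "unknown".
unknown_labels = {
    "HH_EINKOMMEN_SCORE": [-1, 0],
    # ... one entry per attribute, taken from the attribute descriptions
}

# Replace the unknown labels with NaN, column by column.
for col, labels in unknown_labels.items():
    if col in azdias.columns:
        azdias[col] = azdias[col].replace(labels, np.nan)
```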
Now that all of the data is numeric and the unknowns are NaN, we need to check the percentage of missing values in each of the 366 columns. The image shows the columns with more than 40% missing values; the first subplot is for Azdias and the second for Customers.
I observed that 20 columns in Azdias have more than 20% missing values, but when I checked for more than 15% missing values there were 80 such columns. That would be too much data loss, so I decided to drop only the columns with more than 20% missing values.
After dropping columns I did the same check for rows, i.e. calculated the percentage of missing values per row.
As you can see in the image, around 80–90k rows have more than 20% missing values. Based on that, I chose a threshold of 20% and dropped all rows above it.
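A sketch of both 20% thresholds, continuing with the same azdias DataFrame assumption:

```python
# Percentage of missing values per column; drop columns above 20%.
col_missing = azdias.isnull().mean() * 100
azdias = azdias.drop(columns=col_missing[col_missing > 20].index)

# Percentage of missing values per row; keep rows at or below 20%.
row_missing = azdias.isnull().mean(axis=1) * 100
azdias = azdias[row_missing <= 20]
```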
It is just a coincidence that both thresholds ended up at 20% 😉
So, after dropping the unnecessary columns and rows, we now have shape (763864, 366) in the Azdias dataset and (138691, 369) in the Customers dataset.
1.3 Data engineering and Missing Data Imputation
After looking into the dataset more deeply, I found four features that are not numeric. One, EINGEFUEGT_AM, holds a timestamp and can be converted to a datetime type; two have many labels that are not ordinal but completely random, so we drop them; and the last one, OST_WEST_KZ, has only two unique values {“W”, “O”}, which we can map to numeric as {“W”: 1, “O”: 0}.
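A sketch of that handling (the names of the two dropped columns are omitted here, since the post does not list them):

```python
import pandas as pd

# Timestamp column -> datetime type.
azdias["EINGEFUEGT_AM"] = pd.to_datetime(azdias["EINGEFUEGT_AM"], errors="coerce")

# Binary East/West flag -> numeric.
azdias["OST_WEST_KZ"] = azdias["OST_WEST_KZ"].map({"W": 1, "O": 0})
```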
CAMEO_DEUINTL_2015 has two-digit labels that we can split to create two new features. If you look closely, for labels 11–15 the household is wealthy while the family type changes, so the first digit encodes wealth and the second the family type.
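A minimal sketch of that split, assuming the tens digit is the wealth part and the units digit the family/life-stage part (the new column names are my own):

```python
import pandas as pd

# Split the two-digit CAMEO code into a wealth part and a life-stage part.
cameo = pd.to_numeric(azdias["CAMEO_DEUINTL_2015"], errors="coerce")
azdias["CAMEO_WEALTH"] = cameo // 10
azdias["CAMEO_LIFE_STAGE"] = cameo % 10
azdias = azdias.drop(columns=["CAMEO_DEUINTL_2015"])
```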
After cleaning all the features and doing some feature engineering, I used a simple missing-data imputer with strategy=”most_frequent”, because all the features are categorical. Now we don’t have any missing values. It’s all fixed!
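The imputation step, sketched with scikit-learn’s SimpleImputer on the cleaned azdias DataFrame:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Fill every remaining NaN with the most frequent value of its column.
imputer = SimpleImputer(strategy="most_frequent")
azdias_imputed = pd.DataFrame(imputer.fit_transform(azdias),
                              columns=azdias.columns)
```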
TL;DR
Data preprocessing :-
- Fixed mixed-type features by converting them to numeric
- Replaced unknown values with NaNs
- Converted some object columns to numeric
- Dropped columns with more than 20% missing values, and likewise for rows
- Created new features from an existing one and did some other feature engineering
- Used SimpleImputer with strategy=”most_frequent” to fill all the missing values
1.4 Dimensionality reduction using PCA
So why do we need PCA, and what does it actually do?
As you know, we have around 346 features left after removing 20 columns in the preprocessing step. Not all of these features are important, and to reduce the complexity of the machine learning models we need to reduce the dimensionality while keeping as much of the variability as possible.
PCA stands for Principal Component Analysis, a method that uses simple matrix operations from linear algebra and statistics to project the original data onto the same number of dimensions or fewer. In short, it reduces the number of columns.
From the image above we can conclude that to reach more than 90% explained variance we need at least 170 components, and I decided to keep 172 components as the final dimensionality.
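A sketch of that analysis; the standardization step before PCA is my assumption, since the post does not state it explicitly here:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fit a full PCA and look at the cumulative explained variance.
scaled = StandardScaler().fit_transform(azdias_imputed)
pca_full = PCA().fit(scaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
print(np.argmax(cum_var >= 0.90) + 1)   # number of components for ~90%

# Re-fit keeping 172 components.
pca = PCA(n_components=172)
azdias_pca = pca.fit_transform(scaled)
```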
Analysis of the components and the original features :-
This image is for component number 0:
- the number of 1–2 family houses in the PLZ8 has positive weights, while more than 6 family houses has negative weights
- the most common building type within the PLZ8 has negative weights too
- the car-related features also have positive weights
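The per-feature weights can be read straight off the fitted PCA; a sketch of how I would inspect component 0:

```python
import pandas as pd

# Map component 0's weights back to the original feature names and show the
# strongest positive and negative contributors.
weights = pd.Series(pca.components_[0], index=azdias_imputed.columns)
print(weights.sort_values(ascending=False).head(10))  # most positive
print(weights.sort_values().head(10))                 # most negative
```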
1.5 KMeans Clustering
Having decided to keep 172 components, we will use the KMeans clustering algorithm. KMeans is a very popular unsupervised learning technique for finding clusters in an unlabeled dataset; here it is used to estimate customer segments/groups.
Here we use KMeans’s elbow method to decide the optimal number of clusters. In this method we apply KMeans for different numbers of clusters and calculate the sum of squared distances (SSD) of every point to its assigned cluster, then plot the number of clusters vs. SSD.
In the image we can see that the SSD (distortion) keeps decreasing as the number of clusters increases. Zooming in, there is a small elbow at 10, so we can conclude that the optimal number of clusters is 10 based on the elbow method.
Taking 10 clusters as optimal, we run the model on the Azdias and Customers data to predict a cluster number for every individual.
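A sketch of the elbow loop and the final 10-cluster model, assuming customers_pca is the Customers data passed through the same cleaning and PCA pipeline:

```python
from sklearn.cluster import KMeans

# Elbow method: inertia (sum of squared distances) vs. number of clusters.
inertias = []
for k in range(2, 21):
    km = KMeans(n_clusters=k, random_state=42).fit(azdias_pca)
    inertias.append(km.inertia_)

# Final model with the 10 clusters read off the elbow plot.
kmeans = KMeans(n_clusters=10, random_state=42).fit(azdias_pca)
azdias_clusters = kmeans.predict(azdias_pca)
customers_clusters = kmeans.predict(customers_pca)
```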
1.6 Analyze the Cluster distribution
Let’s compare the two datasets’ cluster distributions.
From the image above we can observe that most individuals come from clusters 1, 4, 8 and 9, while in the Customers data the mail-order company’s customers mostly come from clusters 1, 4, 5 and 8. Let’s also look at the percentages of the population.
For the clusters whose ratio is below 1, we can infer that individuals belonging to those clusters are more likely to become future customers, so the mail-order company should target those clusters to attract new individuals to its customer list.
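A rough sketch of that comparison, assuming the ratio is the population share divided by the customer share (so a ratio below 1 means the cluster is overrepresented among customers):

```python
import pandas as pd

# Cluster shares in the general population vs. the customer base.
pop_share = pd.Series(azdias_clusters).value_counts(normalize=True).sort_index()
cust_share = pd.Series(customers_clusters).value_counts(normalize=True).sort_index()

ratio = pop_share / cust_share
print(ratio[ratio < 1])   # clusters worth targeting
```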
Conclusion for part 1 :- The real-life demographic data provided by Arvato Financial Services helped us build a customer segmentation, which can be used for targeting the general population.
Now it is time to build a prediction model, which is the part we are always most interested in. 😜
We have two datasets, MAILOUT-TRAIN and MAILOUT-TEST, for training and testing the machine learning model.
2.1 Training and Testing Process
MAILOUT-TRAIN :- It has 366 features, the same number as the Azdias dataset, and 42,962 rows. We will use the same analysis and cleaning techniques that we applied to Azdias and Customers to clean this data.
After getting the cleaned data we follow the standard ML training process (a sketch of these steps follows the list):
1. Standardize the dataset and split it into train and test sets
2. Try different machine learning classification models
3. Make predictions on the test set and compare models using the ROC-AUC score
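A sketch of that loop under a few assumptions: X and y are the cleaned MAILOUT-TRAIN features and target column, and the model list shown is only illustrative of the classifiers compared:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score

# 1. Standardize and split.
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y)

# 2.-3. Fit each candidate model and compare test ROC-AUC scores.
for model in [AdaBoostClassifier(), GradientBoostingClassifier(), RandomForestClassifier()]:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    print(type(model).__name__, roc_auc_score(y_test, proba))
```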
The table shows that the boosting algorithms achieve good scores, with GradientBoostingClassifier having the highest ROC-AUC score and an elapsed time of 37.63 secs. We could use this model directly to make predictions for the Kaggle competition, but I wanted to try hyper-parameter tuning to improve and generalize the model.
2.2 Hyper-parameter tuning
Machine learning models have many hyper-parameters. Take the Gradient Boosting classifier first: it has hyper-parameters like min_samples_split, min_samples_leaf, max_depth, max_leaf_nodes, max_features, learning_rate, etc.
In the last table we observed that three models perform well: AdaBoost, XGBoost and Gradient Boosting. We will take these three models and try to improve their performance by tuning some of their hyper-parameters using GridSearchCV.
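A sketch of the tuning step for one of the three models; the grid values are illustrative, since the post does not list the exact ranges searched:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative grid over a few of the hyper-parameters mentioned above.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
    "min_samples_split": [2, 10],
}
grid = GridSearchCV(GradientBoostingClassifier(), param_grid,
                    scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```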
With AdaBoost I got a test ROC-AUC score of 0.6822, with GBM 0.7051, and with XGBoost 0.7046.
I saved the above three best models to make predictions on the test dataset.
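One way to persist them, assuming joblib is used (the variable and file names here are my own):

```python
import joblib

# Save the three tuned estimators for later reuse.
joblib.dump(best_ada, "adaboost_best.pkl")
joblib.dump(best_gbm, "gbm_best.pkl")
joblib.dump(best_xgb, "xgboost_best.pkl")
```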
Finally, we are at the last step of the project: it’s time to test the three best models above in the competition on Kaggle.
In this step I had to clean the MAILOUT-TEST dataset, which has the same number of rows and columns: 42,962 rows and 366 columns. We use the same cleaning function. I used the three saved boosting models to get the predictions and submitted the three different outputs in CSV format.
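A sketch of the submission step; the id and target column names (LNR, RESPONSE) are my assumption about the expected Kaggle format:

```python
import pandas as pd

# Predict a purchase probability for every individual in MAILOUT-TEST and
# write it out in the competition's CSV format.
test_proba = best_gbm.predict_proba(X_mailout_test)[:, 1]
submission = pd.DataFrame({"LNR": mailout_test_ids, "RESPONSE": test_proba})
submission.to_csv("submission_gbm.csv", index=False)
```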
The best score I got is 0.70876, with a rank of 237.
Below are some steps we could try to improve the predictions:
- Analyze all the features in more depth (I ignored some of the multi-level categorical features), remove outliers, and do more feature engineering on the data.
- Try different approaches to filling the missing values.
- Include more hyper-parameters in the tuning.
This real-life project, provided by a mail-order service, was about creating a customer segmentation using the general population of Germany. The analysis can help the marketing team target individuals who are more likely to be interested in the mail-order service.