After the features are generated, I did some post-processing to prepare the data for modelling: the categorical columns need to be encoded and the numerical columns assembled into a feature vector within a pipeline. pyspark.ml provides these capabilities, so we don't need to do one-hot encoding ourselves.
I used StringIndexer to create an index for each categorical column (it encodes a string column of labels into a column of label indices) and VectorAssembler, a feature transformer that merges multiple columns into a single feature vector column. These are added as stages to a Pipeline, which we then fit to the data. Code and details can be found in the GitHub repository mentioned at the beginning.
Our goal is to predict which users are likely to churn, so this is essentially a binary classification problem.
It is important to evaluate your models with the right metrics.
With that in mind, I chose the following metrics for model selection:
# f1 score: the harmonic mean of precision and recall; it summarizes how precise and robust the model is.
# AUC: measures how well the model can distinguish between classes; it equals the probability that the classifier ranks a randomly chosen positive example higher than a randomly chosen negative example.
It is also important to classify non-churn customers correctly; otherwise we might take actions we shouldn't, such as sending retention offers that confuse customers who never intended to leave.
I chose a Random Forest model and applied hyperparameter tuning based on the f1 score:
Result: Test F1-score: 0.702075702075702
The accuracy and f1 score look acceptable. However, we shouldn't forget that our dataset may not represent the whole customer base, since I used only a portion of the original dataset. Moreover, in real applications an accuracy above 70% is often considered good and acceptable.