Modeling
Four ML classification models are used to identify the offer based on the features given. These algorithms are:
- Logistic Regression (LR)
- K Nearest Neighbors (KNN)
- Decision Tree Classifier (CART)
- Gaussian Naive Bayes (NB)
The following shows the code and the estimation accuracy for each of these models:
KNN and CART show similar estimation accuracy at 78%, while LR and NB score lower at 66% each.
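A comparison like the one above might be set up roughly as follows. This is only a sketch under stated assumptions: the data is a synthetic stand-in generated with `make_classification`, the three-class target, the 80/20 split, and the 10-fold stratified cross-validation used as the "estimation accuracy" are all illustrative choices, not necessarily what was used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the offer data (assumption: the real features and labels differ).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5, n_classes=3, random_state=7
)

# Hold out 20% as validation data (assumed split ratio).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=7),
    "NB": GaussianNB(),
}

# Estimate accuracy on the training portion with 10-fold cross-validation (assumed scheme).
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
estimation_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    estimation_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The numbers printed for the synthetic data will not match the 78% and 66% reported here; only the procedure is illustrated.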
Let’s see the prediction accuracy on the validation data.
The prediction accuracy is similar to the estimation accuracy for all the models.
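Checking prediction accuracy on held-out data could look something like the sketch below: each model is fit on the training portion and scored on the validation portion. The synthetic data and the split are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption, not the project's dataset).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5, n_classes=3, random_state=7
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=7),
    "NB": GaussianNB(),
}

# Fit on the training data, then measure accuracy on data the model has not seen.
validation_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    validation_accuracy[name] = accuracy_score(y_val, y_pred)
    print(f"{name}: {validation_accuracy[name]:.3f}")
```

A validation score close to the cross-validated estimate suggests the model is not overfitting the training data badly.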
Refinement & Model Evaluation
Based on the above results, I will focus on the two highest-scoring models, KNN and CART, and try to improve their accuracy.
KNN
The value of K in this algorithm is the parameter that determines the number of neighbors. The default value for K is 5. Here, I tried to find the most suitable K value in the range between 1 and 40 that would minimize the error and improve accuracy.
The values of K were plotted against their mean error to find the best K value.
All values of K produce similar accuracy, with the mean error ranging between 21.2% and 21.7%. The model's accuracy therefore ranges from ~78.3% (K=3) to ~78.8% (K=20), so K=20 gives the best result.
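The search over K can be sketched as a simple loop that records the mean error (the fraction of misclassified validation samples) for each candidate. The synthetic data, the split, and the choice to include both 1 and 40 in the range are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption, not the project's dataset).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5, n_classes=3, random_state=7
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

k_values = list(range(1, 41))  # K = 1 .. 40
mean_errors = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # Mean error = share of validation samples predicted incorrectly.
    mean_errors.append(float(np.mean(knn.predict(X_val) != y_val)))

best_k = k_values[int(np.argmin(mean_errors))]
print(f"Lowest mean error {min(mean_errors):.3f} at K={best_k}")
```

Passing `k_values` and `mean_errors` to a plotting call such as matplotlib's `plt.plot` would give an error curve of the kind described above.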
The accuracy score, confusion matrix, and classification report for KNN with K=20 are shown below.
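These three outputs can be produced with scikit-learn's `accuracy_score`, `confusion_matrix`, and `classification_report`. The sketch below uses K=20 as in the text, but on synthetic stand-in data (an assumption), so its numbers will differ from the ones reported here.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption, not the project's dataset).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5, n_classes=3, random_state=7
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_val)

acc = accuracy_score(y_val, y_pred)      # overall share of correct predictions
cm = confusion_matrix(y_val, y_pred)     # rows: true class, columns: predicted class
report = classification_report(y_val, y_pred)  # per-class precision, recall, F1

print(f"Accuracy: {acc:.3f}")
print(cm)
print(report)
```

The diagonal of the confusion matrix counts correct predictions per class, so its sum divided by the total number of samples equals the accuracy score.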
CART
The criterion, splitter, and max_depth parameters of this algorithm determine the attribute selection measure (entropy or Gini), the split strategy (best or random), and the maximum depth of the tree, respectively. Here, criterion is set to entropy and splitter to best, and I tried to find the most suitable max_depth value in the range between 1 and 20 that would minimize the error and improve accuracy.
The values of max_depth were plotted against their mean error to find the value that results in minimal error.
All values of max_depth larger than 5 produce similar accuracy, with the mean error at ~20%. The best accuracy the model can produce is therefore ~80%, reached at max_depth=5.
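A max_depth search of this kind might look like the sketch below, with `criterion="entropy"` and `splitter="best"` held fixed as described. The synthetic data, the split, and the fixed `random_state` are assumptions added for illustration and reproducibility.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption, not the project's dataset).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5, n_classes=3, random_state=7
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

depths = list(range(1, 21))  # max_depth = 1 .. 20
depth_errors = []
for depth in depths:
    tree = DecisionTreeClassifier(
        criterion="entropy", splitter="best", max_depth=depth, random_state=7
    )
    tree.fit(X_train, y_train)
    # Mean error = share of validation samples predicted incorrectly.
    depth_errors.append(float(np.mean(tree.predict(X_val) != y_val)))

best_depth = depths[int(np.argmin(depth_errors))]
print(f"Lowest mean error {min(depth_errors):.3f} at max_depth={best_depth}")
```

When several depths give nearly the same error, preferring the shallowest one keeps the tree simpler and less prone to overfitting.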
The accuracy score, confusion matrix, and classification report for CART with max_depth=5 are shown below.