We will first build a random model to serve as a baseline, so that we can compare the performance and efficiency of our other models against it.
How to compute log-loss for a random model in a multi-class setting?
For every point in our test and cross-validation data, we randomly generate as many numbers as there are classes (9 in our problem) and then normalize them so that they sum to one.
import numpy as np
from sklearn.metrics import log_loss

test_data_len = test_df.shape[0]
cv_data_len = cv_df.shape[0]

# Cross-validation error.
# Create an output array with one row per CV point and one column per class (9 classes).
cv_predicted_y = np.zeros((cv_data_len, 9))
for i in range(cv_data_len):  # iterate over each row of the CV data
    rand_probs = np.random.rand(1, 9)  # 9 random values drawn uniformly from [0, 1)
    cv_predicted_y[i] = (rand_probs / rand_probs.sum())[0]  # normalize so the row sums to 1
print("Log loss on Cross Validation Data using Random Model", log_loss(y_cv, cv_predicted_y))

# Test-set error.
# Create an output array with one row per test point.
test_predicted_y = np.zeros((test_data_len, 9))
for i in range(test_data_len):
    rand_probs = np.random.rand(1, 9)
    test_predicted_y[i] = (rand_probs / rand_probs.sum())[0]
print("Log loss on Test Data using Random Model", log_loss(y_test, test_predicted_y))

# Predicted class = index of the largest probability; +1 because class labels start at 1.
predicted_y = np.argmax(test_predicted_y, axis=1)
plot_confusion_matrix(y_test, predicted_y + 1)
In the code above, we first created an array of zeros with 9 columns (one per class label) for every data point, then filled each row with randomly generated probabilities, computed the log-loss, and plotted the confusion matrix.
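The same baseline can be reproduced without the dataset. The following is a minimal sketch using only the Python standard library; the helper names `random_probabilities` and `multiclass_log_loss` and the synthetic labels `y_fake` are assumptions made for illustration and do not come from the notebook.

```python
import math
import random

NUM_CLASSES = 9  # matches the 9 class labels used in this article


def random_probabilities(n_points, n_classes, rng):
    """Return n_points rows of random class probabilities, each summing to 1."""
    rows = []
    for _ in range(n_points):
        raw = [rng.random() for _ in range(n_classes)]  # uniform in [0, 1)
        total = sum(raw)
        rows.append([value / total for value in raw])  # normalize the row
    return rows


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability given to the true class (labels start at 1)."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label - 1], eps), 1.0)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)


rng = random.Random(42)  # fixed seed so the run is repeatable
# Synthetic labels stand in for y_cv / y_test, which are not available here.
y_fake = [rng.randint(1, NUM_CLASSES) for _ in range(5000)]
probs = random_probabilities(len(y_fake), NUM_CLASSES, rng)
baseline_loss = multiclass_log_loss(y_fake, probs)
print("Random-model log loss on synthetic labels:", round(baseline_loss, 3))
```

Because the predictions ignore the inputs entirely, the result depends only on the number of classes, which is what makes it usable as a floor for comparison.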
We can see that our random model has a log-loss of 2.4 on both the cross-validation and test data, so any useful model needs to do better than this, that is, achieve a lower log-loss. Let's check the precision and recall for this model.
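For reference, the standard multi-class log-loss can be written as below, together with the value that a constant uniform guess over 9 classes would score. The symbols are notation introduced here, not taken from the notebook: N is the number of points, M the number of classes, y_ij is 1 if point i belongs to class j (0 otherwise), and p_ij is the predicted probability.

```latex
\[
\text{log-loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log p_{ij}
\]
\[
p_{ij} = \tfrac{1}{9}\ \text{for all } i, j
\;\Longrightarrow\;
\text{log-loss} = -\log\tfrac{1}{9} = \ln 9 \approx 2.197
\]
```

Randomly drawn probabilities score somewhat worse than the uniform guess on average, which is consistent with the 2.4 reported for the random model.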
How to interpret the above precision and recall matrices?
Precision
1. Take cell (1×1) as an example: it has a value of 0.127, which means that of all the points predicted to be class 1, only 12.7% actually are class 1.
2. For original class 4 and predicted class 2, of all the points that our model predicted as class 2, 23.6% actually belong to class 4.
Recall
1. Cell (1×1) has a value of 0.079, which means that of all the points that actually belong to class 1, our model predicted only 7.9% to be class 1.
2. For original class 8 and predicted class 5, the value is 0.250, which means that of all the points that are actually class 8, our model predicted 25% to be class 5.
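The readings above suggest that the precision matrix is the confusion matrix with each column divided by its sum, and the recall matrix is the same matrix with each row divided by its sum. Here is a small sketch of that normalization, assuming rows are original classes and columns are predicted classes; the function name `precision_recall_matrices` and the toy counts are made up for illustration and may differ from what the notebook's `plot_confusion_matrix` does internally.

```python
def precision_recall_matrices(confusion):
    """confusion[i][j] = points with original class i+1 predicted as class j+1."""
    n = len(confusion)
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    row_sums = [sum(row) for row in confusion]
    # Precision matrix: divide each column by its sum, so columns sum to 1.
    precision = [
        [confusion[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Recall matrix: divide each row by its sum, so rows sum to 1.
    recall = [
        [confusion[i][j] / row_sums[i] if row_sums[i] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    return precision, recall


# Made-up 3-class confusion matrix (rows: original class, columns: predicted class).
toy_confusion = [
    [6, 1, 1],
    [2, 3, 0],
    [2, 1, 4],
]
precision, recall = precision_recall_matrices(toy_confusion)
print("precision cell (1x1):", precision[0][0])  # 6 / (6 + 2 + 2)
print("recall cell (1x1):", recall[0][0])        # 6 / (6 + 1 + 1)
```

In this toy matrix, 60% of the points predicted as class 1 really are class 1 (precision), while 75% of the points that are actually class 1 were predicted as class 1 (recall).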
We will now train our models after some exploratory data analysis and feature encoding, both of which you can find in my notebook. We trained multiple models, and Logistic Regression and Support Vector Machine stand out from the rest.
Logistic Regression
Support Vector Machine
Comparison of all the models
We can see that Logistic Regression and Support Vector Machine perform better than the others in terms of both log-loss and percentage of misclassified points.
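The percentage of misclassified points used in this comparison can be computed from predicted probabilities roughly as follows. This is only a sketch on hypothetical data: `misclassified_percentage`, `y_true`, and `probs` are illustrative names, and labels are assumed to start at 1 as in the random-model code.

```python
def misclassified_percentage(y_true, probs):
    """Percentage of points whose most probable class differs from the true label."""
    wrong = 0
    for label, row in zip(y_true, probs):
        predicted = row.index(max(row)) + 1  # argmax index + 1, since labels start at 1
        if predicted != label:
            wrong += 1
    return 100.0 * wrong / len(y_true)


# Hypothetical 3-class example: true labels and predicted probabilities.
y_true = [1, 2, 3, 2]
probs = [
    [0.7, 0.2, 0.1],  # predicted 1, correct
    [0.1, 0.6, 0.3],  # predicted 2, correct
    [0.5, 0.3, 0.2],  # predicted 1, wrong
    [0.2, 0.3, 0.5],  # predicted 3, wrong
]
print(misclassified_percentage(y_true, probs))  # 50.0
```

Unlike log-loss, this metric ignores how confident each prediction was, which is why it helps to look at both when ranking models.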