Just yesterday I was watching a YouTube video by Krish Naik, a YouTube influencer who also has a Kaggle account. He made a video on how important it is for people seeking to enter the field of data science to take part in programming competitions, because doing so helps them develop the logical, organised approach to programming that potential employers like to see.
Data science competitions are varied, so something can be learned from each one that is entered. In addition, as skills are acquired, the code for these competitions can always be modified to improve accuracy and move up the leaderboard.
With this in mind, I have selected a competition question from Analytics Vidhya concerning developing an incentive plan for salespersons and determining whether an insurance policy will be renewed. The datasets for this competition can be found here:- McKinsey Analytics Online Hackathon (analyticsvidhya.com)
The problem statement for this question reads as follows:-
“Your client is an Insurance company and they need your help in building a model to predict the propensity to pay renewal premium and build an incentive plan for its agents to maximise the net revenue (i.e. renewals — incentives given to collect the renewals) collected from the policies post their issuance.
You have information about past transactions from the policy holders along with their demographics. The client has provided aggregated historical transactional data like number of premiums delayed by 3/ 6/ 12 months across all the products, number of premiums paid, customer sourcing channel and customer demographics like age, monthly income and area type.
In addition to the information above, the client has provided the following relationships:
Expected effort in hours put in by an agent for incentives provided; and Expected increase in chances of renewal, given the effort from the agent.
Given the information, the client wants you to predict the propensity of renewal collection and create an incentive plan for agents (at policy level) to maximise the net revenues from these policies.
Equation for the effort-incentives curve: Y = 10*(1-exp(-X/400))
Equation for the % improvement in renewal prob vs effort curve:
Y = 20*(1-exp(-X/5))”
I opened an .ipynb file on Google Colab, a very versatile Jupyter Notebook environment that I can use from virtually any computer with an internet connection and Google access. Because the files for this competition question were so large, I saved them to the Google Drive that houses all of my files, which also means they can be used from any computer that allows Google access.
Because many Python libraries come pre-installed on Google Colab, I only had to import two of the main Python libraries, namely pandas and numpy.
I also read in the datasets that I had saved onto my Google drive. I did this by copying the path of the file and pasting it into the line of code that I had created:-
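A minimal sketch of this step is below. The Drive paths are only illustrative (your own file locations will differ), and a tiny in-memory CSV stands in for the real competition files so the snippet runs anywhere:

```python
import io
import pandas as pd
import numpy as np

# On Colab, mount Google Drive first so the CSVs are reachable:
# from google.colab import drive
# drive.mount('/content/drive')
# train = pd.read_csv('/content/drive/MyDrive/train.csv')   # illustrative path
# test = pd.read_csv('/content/drive/MyDrive/test.csv')     # illustrative path

# Self-contained stand-in: read a tiny CSV from memory instead of Drive.
csv_text = "id,premium,age_in_days,Income,renewal\n1,1200,12000,90000,1\n2,3300,20000,50000,0\n"
train = pd.read_csv(io.StringIO(csv_text))
print(train.shape)  # (2, 5)
```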
I checked for null values and in this instance three columns in the train and test datasets contained null values that would need to be imputed:-
Because the null values were in numeric columns, I replaced all of the null values with the median value of each column.
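A sketch of the median imputation, using a toy frame with made-up column names standing in for the real train and test sets:

```python
import numpy as np
import pandas as pd

# Toy frame with missing numeric values (column names are illustrative).
train = pd.DataFrame({
    'late_payments': [0, 1, np.nan, 2],
    'Income': [50000, np.nan, 70000, 90000],
})

# Replace every null with the median of its own column.
train = train.fillna(train.median(numeric_only=True))

print(train.isnull().sum().sum())  # 0 nulls remain
```

The median is a sensible default for skewed numeric columns such as income, since it is not pulled around by outliers the way the mean is.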
I created a new column called “age_in_years”, where I divided “age_in_days” by 365.25, which is the number of days in a year. I felt it was better to convert the age to years to help out with the computations.
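The conversion itself is a single line; the toy ages below are made up:

```python
import pandas as pd

train = pd.DataFrame({'age_in_days': [10958, 18263]})  # toy values

# 365.25 accounts for leap years.
train['age_in_years'] = train['age_in_days'] / 365.25
print(train['age_in_years'].round(1).tolist())  # [30.0, 50.0]
```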
I then created a new column called “incentives”, which I used to calculate the incentives for the insurance salespersons. I did not find the formulas provided by the insurance company very helpful because they did not define X and Y, making it difficult to determine what the incentive should be. I therefore researched what a sales incentive for insurance typically is and learned that it can be anywhere between 2% and 8%. I then used trial and error to select the best rate, which is 4.5% in this post, although any value between 2% and 8% could prove optimal.
It is important to create the “incentives” column at the beginning of the program because the incentive will determine how many hours of work the insurance salesperson is willing to put in to secure the sale. A higher incentive indicates the salesperson will be willing to put in more work than a lower one, and a salesperson who already has a heavy workload is likely to prioritise higher-incentive policies over lower-incentive ones. Because of this, the incentive has an effect on whether or not the customer renews the policy:-
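A sketch of the calculation, assuming the incentive is taken as a percentage of each policy's premium (the `premium` column and the toy values are illustrative):

```python
import pandas as pd

# Toy premiums; the real dataset has a per-policy premium column (assumption).
train = pd.DataFrame({'premium': [1200.0, 3300.0, 5400.0]})

RATE = 0.045  # chosen by trial and error; anywhere between 2% and 8% is plausible
train['incentives'] = train['premium'] * RATE

print(train['incentives'].round(2).tolist())  # [54.0, 148.5, 243.0]
```

Keeping the rate in a single constant makes the later trial-and-error easy: change `RATE` once and rerun the notebook.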
I used seaborn code to produce a graph of the target, being train.renewal, and found that there is a class imbalance in this dataset.
I put the target values on a counter and found that the 0 class makes up only 6.259% of the column:-
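A sketch of the imbalance check, with a toy target series of roughly the same shape:

```python
import pandas as pd

# Toy target with an imbalance similar to train.renewal.
renewal = pd.Series([1] * 94 + [0] * 6)

# For the plot: import seaborn as sns; sns.countplot(x=renewal)
counts = renewal.value_counts()
pct_zero = 100 * counts[0] / len(renewal)
print(counts.to_dict(), f'{pct_zero:.1f}% zeros')
```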
I used matplotlib to produce graphs of all of the numeric columns in the dataset, which gives a picture of how the independent variables affect the dependent variable, being “renewal”.
I then ordinal-encoded the two categorical columns in the dataset because most models, especially those in the sklearn library, will only train and predict on numeric data.
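A sketch of the encoding; the column names and values below are illustrative stand-ins for the two categorical columns:

```python
import pandas as pd

# Toy categorical columns (names are assumptions, not the real dataset's).
train = pd.DataFrame({
    'sourcing_channel': ['A', 'B', 'A', 'C'],
    'residence_area_type': ['Urban', 'Rural', 'Urban', 'Urban'],
})

# Map each category to an integer code; sklearn's OrdinalEncoder would also work.
for col in ['sourcing_channel', 'residence_area_type']:
    train[col] = train[col].astype('category').cat.codes

print(train.values.tolist())
```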
Once the datasets had been fully preprocessed, I defined the X, y and X_test variables.
I created a variable, test_id, which contains the data from test.id and will be used at the end of the program when the dataframe with predictions is created.
The y variable, being the target, is composed of train.renewal and contains binary data of either 0 or 1.
X is a dataframe that is composed of the train dataset with the following columns dropped: “renewal”, “age_in_days”, and “id”.
X_test is a dataframe that is composed of the test dataset with the following columns dropped: “renewal” and “age_in_days”.
I also put X and X_test on a scaler, using sklearn’s StandardScaler() function. It is important to scale the data because it often improves the accuracy of the predictions:-
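The variable definitions and scaling can be sketched as below. The toy frames and column names are illustrative; in this sketch “id” is dropped from both frames so the columns line up, and the scaler is fitted on the training features only and then applied to both sets:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train/test frames mirroring the column layout described above.
train = pd.DataFrame({
    'id': [1, 2, 3], 'age_in_days': [12000, 15000, 20000],
    'age_in_years': [32.9, 41.1, 54.8], 'Income': [50000, 70000, 90000],
    'renewal': [1, 0, 1],
})
test = pd.DataFrame({
    'id': [4, 5], 'age_in_days': [11000, 16000],
    'age_in_years': [30.1, 43.8], 'Income': [60000, 80000],
})

test_id = test['id']                                      # kept for the submission
y = train['renewal']                                      # binary target, 0 or 1
X = train.drop(['renewal', 'age_in_days', 'id'], axis=1)
X_test = test.drop(['age_in_days', 'id'], axis=1)

# Fit the scaler on the training features, apply it to both sets.
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_test = scaler.transform(X_test)
print(X.shape, X_test.shape)
```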
Because I like visual representations of the data, I created a two dimensional graph of the target values. As can be seen, the 0’s are intermingled with the 1’s, and this is going to affect the accuracy of the predictions:-
I used sklearn’s train_test_split() to break the X dataset up into training and validation sets. I set the validation set to 10% of the X dataset because I wanted to have as much trainable data as possible and hopefully improve the accuracy of the model:-
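A sketch of the split on toy data; the `stratify` argument is my own suggestion here, since with a 6% minority class it keeps the 0/1 ratio the same in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and imbalanced target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 6 + [1] * 94)

# Hold out 10% for validation; stratify preserves the class ratio.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)

print(X_train.shape, X_val.shape)  # (90, 3) (10, 3)
```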
I defined the model as sklearn’s LinearSVC() because I have had good results with it in the past when dealing with class imbalances. I achieved an accuracy of 82%, but when I changed the value of the incentive the accuracy varied, so the parameters will need to be retuned whenever the insurance salespersons’ incentives change:-
I predicted on the validation set and attained an accuracy of 82%.
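A self-contained sketch of the fit-and-score step, using sklearn's `make_classification` to stand in for the scaled insurance features (so the printed accuracy will not match the 82% reported above):

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Synthetic imbalanced data standing in for the real, scaled features.
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.06, 0.94], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)

model = LinearSVC(max_iter=5000)   # raise max_iter to help convergence
model.fit(X_train, y_train)
preds = model.predict(X_val)
print(f'validation accuracy: {model.score(X_val, y_val):.2f}')
```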
I put the predictions on a confusion matrix and found that 186 0’s and 1,255 1’s had been misidentified. It is therefore very important to put any classification problem on a confusion matrix to determine how many examples of each class are correctly predicted:-
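A sketch of the confusion matrix on toy labels (the real counts above came from the full validation set):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true labels and predictions.
y_val = np.array([0, 0, 1, 1, 1, 1])
preds = np.array([1, 0, 1, 1, 0, 1])

# Rows are true classes, columns are predicted classes:
# cm[0, 1] counts 0's misidentified as 1's; cm[1, 0] counts 1's misidentified as 0's.
cm = confusion_matrix(y_val, preds)
print(cm)
```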
In order to visualise the accuracy of the model, I created a two dimensional graph that depicts the correct examples in purple and the incorrect ones in yellow:-
I then predicted on the test set.
I created a variable, incentives, and placed test.incentives in it because this data is going to be necessary when the submission dataframe is created:-
I prepared my submission by creating a dataframe that included test_id, which had been created at the beginning of the program, the model’s predictions, and the incentives, which had recently been created.
I converted the submission dataframe to a .csv file, which was ready to be downloaded:-
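The submission step can be sketched as below, with toy stand-ins for the ids, predictions and incentives built earlier:

```python
import pandas as pd

# Toy stand-ins for the variables created earlier in the program.
test_id = pd.Series([101, 102, 103])
predictions = pd.Series([1, 0, 1])
incentives = pd.Series([54.0, 148.5, 243.0])

submission = pd.DataFrame({'id': test_id,
                           'renewal': predictions,
                           'incentives': incentives})
submission.to_csv('submission.csv', index=False)
print(submission.head())
```

On Colab, `from google.colab import files; files.download('submission.csv')` then pulls the file down to the local machine.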
I then downloaded the submission .csv file, put it on Analytics Vidhya’s Solution Checker and checked my scores. The score obtained is dependent on the incentives, so if the incentives are changed then the score will change. With this particular competition question I was only allowed to check my predictions eleven times in a day and I had exhausted all of my submissions for this day.
I would highly suggest you try this code out and see for yourself how the incentive affects the predictions and also the score:-
The code for this program can be found in its entirety in my personal GitHub account, the web link being found here:- Misc-Predictions/AV_Hack_Insurance.ipynb at main · TracyRenee61/Misc-Predictions (github.com)