Linear regression Insurance

This is a model I built to predict the cost of insurance based on age, BMI, and whether the person smokes.

The dataset was found on Kaggle (link).

This is the form of the original dataset from Kaggle:

Original Dataset without preprocessing

The first step was to convert the string values to numbers, split the sex feature into separate Female and Male columns, and remove the Region column, because that feature was not of interest for this implementation.

After preprocessing, the data looks like this:
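A minimal sketch of this preprocessing step, using only the Python standard library. The column names (`age`, `sex`, `bmi`, `children`, `smoker`, `region`, `charges`), the made-up sample rows, and the `preprocess` helper are assumptions for illustration and may not match the code in this repository.

```python
# Made-up rows in the assumed shape of the dataset (not real records).
raw_rows = [
    {"age": 30, "sex": "female", "bmi": 25.0, "children": 0,
     "smoker": "yes", "region": "southwest", "charges": 20000.0},
    {"age": 45, "sex": "male", "bmi": 31.5, "children": 2,
     "smoker": "no", "region": "southeast", "charges": 8000.0},
]


def preprocess(row):
    """Return a numeric copy of one row, without the sex and region columns."""
    out = {k: v for k, v in row.items() if k not in ("sex", "region")}
    # Split the sex column into two 0/1 indicator columns.
    out["female"] = 1 if row["sex"] == "female" else 0
    out["male"] = 1 if row["sex"] == "male" else 0
    # Encode the smoker flag as 0/1.
    out["smoker"] = 1 if row["smoker"] == "yes" else 0
    return out


clean_rows = [preprocess(r) for r in raw_rows]
```

After this step every remaining value is numeric, which is what the correlation and regression steps need.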

Dataset after preprocessing

Next, we compute the correlation between features to see which ones are the most important.

As the correlation matrix shows, the most important features for predicting charges are Smoker, Age, and BMI, in that order, so we build our model on these features.
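Each entry of a correlation matrix is typically a Pearson coefficient between two columns. The sketch below shows how one such entry could be computed by hand. The `pearson` helper and the toy values are assumptions for illustration, not the code or data used here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy values, not taken from the dataset.
smoker = [1, 0, 0, 1, 0]
charges = [30000.0, 4000.0, 6000.0, 25000.0, 5000.0]
r = pearson(smoker, charges)  # close to 1 means a strong positive relation
```

Repeating this for every pair of columns gives the full matrix. Features whose coefficient with charges is largest in absolute value are the strongest linear predictors.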

BMI meaning

Life expectancy is the basis for life insurance rates. So factors that impact your potential “mortality,” or life insurance expectancy, are factored into life insurance quotes.

Cigarette smoking among adults is at an all-time low of 14%, according to a 2020 report from the Surgeon General. But 16 million Americans have a smoking-related disease. And you don’t have to already have health consequences in order to get stuck with higher life insurance rates. Simply being a smoker will usually push you into higher rates when you shop for life insurance.

Taken from Forbes:

What Smokers Should Know About Buying Life Insurance — Forbes Advisor

Nowadays, if you want to estimate the cost of health insurance, the calculation is based on your salary and on how many members you want to add to the plan, which is a lengthy process. With the recent COVID-19 pandemic, many people are trying to acquire health insurance, yet there is no simple way to estimate the cost for a single covered person.

I based my model on the one published by Amit Yadav on Coursera, called Linear Regression with Python. This model uses mean squared error as the loss and gradient descent to optimize toward the minimum of that loss.
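For reference, the loss and its gradients can be written out as follows. The symbols m (number of samples), x_ij (feature j of sample i) and alpha (learning rate), as well as the 1/(2m) scaling of the loss, are assumptions made here; the original course code may scale the loss differently.

```latex
\begin{aligned}
\hat{y}_i &= \sum_{j} W_j \, x_{ij} + b \\
L(W, b) &= \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2 \\
\frac{\partial L}{\partial W_j} &= \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right) x_{ij} \\
\frac{\partial L}{\partial b} &= \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right) \\
W_j &\leftarrow W_j - \alpha \, \frac{\partial L}{\partial W_j}, \qquad
b \leftarrow b - \alpha \, \frac{\partial L}{\partial b}
\end{aligned}
```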

The gradient descent algorithm can be summarized in four steps:

1. Get predictions y_hat for X with the current values of W and b.
2. Compute the loss between y and y_hat.
3. Find the gradients of the loss with respect to the parameters W and b.
4. Update the values of W and b by subtracting the gradient values obtained in the previous step, scaled by the learning rate.
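The four steps above can be sketched in plain Python as follows. This is an illustrative version under stated assumptions, not the code from the course or from this repository: the function names, the learning rate `lr`, the iteration count, the 1/(2m) loss scaling, and the toy data are all hypothetical.

```python
def predict(X, W, b):
    """Step 1: y_hat = X . W + b for every row of X."""
    return [sum(w * x for w, x in zip(W, row)) + b for row in X]


def compute_loss(y, y_hat):
    """Step 2: mean squared error, with the 1/(2m) scaling assumed here."""
    m = len(y)
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / (2 * m)


def gradients(X, y, y_hat):
    """Step 3: gradients of the loss with respect to W and b."""
    m = len(y)
    errors = [p - t for p, t in zip(y_hat, y)]
    dW = [sum(e * row[j] for e, row in zip(errors, X)) / m
          for j in range(len(X[0]))]
    db = sum(errors) / m
    return dW, db


def train(X, y, lr=0.1, iterations=2000):
    """Repeat the four steps, starting from W = 0 and b = 0."""
    W, b = [0.0] * len(X[0]), 0.0
    losses = []
    for _ in range(iterations):
        y_hat = predict(X, W, b)
        losses.append(compute_loss(y, y_hat))
        dW, db = gradients(X, y, y_hat)
        # Step 4: move against the gradient, scaled by the learning rate.
        W = [w - lr * g for w, g in zip(W, dW)]
        b -= lr * db
    return W, b, losses


# Made-up toy data generated from y = 2*x1 + 3*x2 + 1 (not the insurance data).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
y = [2 * x1 + 3 * x2 + 1 for x1, x2 in X]
W, b, losses = train(X, y)
```

On real data with features on very different scales (age versus a 0/1 smoker flag, for example), the inputs would usually need to be normalized first, or a much smaller learning rate chosen, for the updates to stay stable.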

An addition to the code was a function to split the data into train and test sets, along with the selection of inputs according to the features of our model.

I also added an implementation with the scikit-learn (sklearn) library to compare the results of both models.
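A minimal sketch of what such a split and feature selection might look like. The helper names `select_features` and `split_train_test`, the 80/20 ratio, the fixed seed, and the made-up rows are assumptions for illustration, not the functions from this repository.

```python
import random


def select_features(rows, feature_names, target_name):
    """Build X and y from a list of dict rows using the chosen columns."""
    X = [[float(row[name]) for name in feature_names] for row in rows]
    y = [float(row[target_name]) for row in rows]
    return X, y


def split_train_test(X, y, test_ratio=0.2, seed=0):
    """Shuffle row indices, then split X and y into train and test parts."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(X) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Made-up rows, only to show the shapes involved.
rows = [{"smoker": i % 2, "age": 20 + i, "bmi": 22.0 + i, "charges": 1000.0 * i}
        for i in range(10)]
X, y = select_features(rows, ["smoker", "age", "bmi"], "charges")
X_train, X_test, y_train, y_test = split_train_test(X, y)
```

Shuffling before splitting avoids a test set made only of the last rows of the file, and the fixed seed keeps the split reproducible between runs.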
