written by: Oren Atia & Elad Palensia
Everyone wants to assess risk better. A high degree of confidence in risk assessment allows a business to offer a competitive price with a healthy profit margin, and to avoid significant losses by passing on high-risk opportunities (risk mitigation). It would not be presumptuous to say that one of the essential elements of risk assessment and mitigation in any insurance company is the person who holds the policy and the environment that person lives in.
Human decisions, or human behavior, are neither random nor detached from prior events. Most of the time, what seems to be a random, spontaneous action is tied to a person's interests and life journey. Even when it doesn't look like it, a person goes through a process that leads him or her to take an action that, from the outside, would seem uncorrelated. For example, say a person wakes up one day and decides to buy a plane ticket to Hawaii. It probably isn't the first time he has heard of Hawaii. Most likely he did some reading, got excited about the place, and went through an internal process before he "spontaneously" swiped his credit card and bought that ticket.
In addition, the environment in which a person lives also has an effect. Sometimes global factors affect the risk without the policy holder being aware of them. In the domain of car policies, for example: weather conditions, visibility, road conditions, and the daytime and nighttime hours to which the policy is exposed during the calendar year.
True, randomness cannot be ignored. You can be driving and suddenly a lightning strike hits your car. Or, for a more realistic example, you park in a public lot and someone dents your car, leaving no details…
The goal is to reduce the unknown by gathering relevant information and influencing the degree of risk, wherever it lies. One can try to minimize the unknown by finding influential data: collecting relevant data, reducing dimensionality, and feeding the data into a model that yields good risk-assessment results.
So as not to stay at the level of theory, we will move on to a small example that illustrates the process. In this single example we will show how to gather data and prepare it for a model. It is a sample process that walks through the steps of building a predictive model from data, including an example of integrating external data to enrich the dataset.
For the purpose of the example, suppose we have the following basic DataFrame:
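The original DataFrame appears as an image in the article. As a stand-in, here is a minimal sketch of what such a DataFrame might look like; the column names (City, Car, Underwriting_date) are taken from the code that follows, while the specific rows and values are assumptions for illustration only.
import pandas as pd

# Hypothetical base DataFrame (the real one is shown as an image in the article).
# Column names match those used later; the row values are made up.
df = pd.DataFrame({
    'City': ['Tel Aviv', 'Haifa', 'Beer Sheva', 'Jerusalem'],
    'Car': ['Private', 'SUV', 'Truck', 'Private'],
    'Underwriting_date': ['01/03/2020', '15/06/2020', '20/09/2020', '05/11/2020'],
})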
Suppose we know that all vehicles reach the same point on a daily basis, so we can calculate the average distance a vehicle travels in a year from its city. We will use the Calc_Km_Per_Year function, which calculates this value and feeds a new column.
def Calc_Km_Per_Year(nCity):
    # Rough annual mileage, estimated from the city of residence
    if nCity == 'Tel Aviv':
        return 24000
    elif nCity == 'Haifa':
        return 28000
    elif nCity == 'Beer Sheva':
        return 35000
    else:
        return 18000

df['Km_Year'] = df['City'].apply(Calc_Km_Per_Year)
The Results:
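The result appears as an image in the article. Under the hypothetical DataFrame sketched above, the new Km_Year column would look roughly like this (the values follow directly from the function; the rows themselves are assumed):
# Illustrative output for the assumed rows, not the article's actual data:
#          City      Car Underwriting_date  Km_Year
# 0    Tel Aviv  Private        01/03/2020    24000
# 1       Haifa      SUV        15/06/2020    28000
# 2  Beer Sheva    Truck        20/09/2020    35000
# 3   Jerusalem  Private        05/11/2020    18000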
We assume that driving in the dark or in daylight affects the number of accidents. Of course, the insight needed to reach such a conclusion, that daylight or darkness affects the number of accidents, is the result of research in itself. So we go out to gather relevant information on the subject and add the monthly average number of daylight hours, from the month in which the policy was purchased until the end of the year:
# Enrich Data: scrape climate tables for Israel
url = 'https://www.climatestotravel.com/climate/israel'
dfs = pd.read_html(url)  # read_html returns a list of all tables found on the page
# dfs[2] holds the average sunshine hours per day, by month
dfs[2]
The result is the following DataFrame:
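A note of caution: pd.read_html returns every HTML table it finds on the page, and the index of the sunshine-hours table (2 here) depends on the page layout at the time of scraping. The snippet below is an assumption about how one might confirm which index to use, not part of the original article:
# Inspect the tables read_html returned, to confirm which index holds
# the average-sunshine-hours table (the index may shift if the page changes).
for i, t in enumerate(dfs):
    print(i, t.shape, list(t.columns)[:5])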
In order to integrate the new information, we will perform the following action:
def getSunByMonth(strDate):
    # The underwriting date is assumed to be in dd/mm/yyyy format,
    # so the second field is the month number.
    strDate = strDate.split("/")
    # Average the monthly sunshine values from the purchase month to the end of the year
    return dfs[2].iloc[0, int(strDate[1]):].mean()

df['sunInMounth'] = df['Underwriting_date'].apply(getSunByMonth)
df
The result is that a new column is added:
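As a quick sanity check, you can call the function on a single date. The date below is an assumption, chosen only to match the dd/mm/yyyy format used above:
# For an assumed October underwriting date, this averages the sunshine values
# in row 0 of dfs[2] from column 10 through the end of the year.
print(getSunByMonth('15/10/2020'))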
At this point we look at where the data needs to be changed. As step one, we will perform One-Hot Encoding. It is important to expand on this topic, especially on when to apply this manipulation to a column. The rule of thumb is to one-hot-encode fields that are categories; that is, categorical data must be converted to a numerical form. If the number itself carries meaning, you should scale it instead. If it carries no meaning, consider one-hot encoding. Fields that are numeric should go to the next step, scaling, later in this article. In our case we will handle the vehicle type column and the city column.
# One Hot Encoder
# pd.get_dummies creates one indicator column per category value
y = pd.get_dummies(df.City, prefix='City')
df = df.join(y, how='outer')

y = pd.get_dummies(df.Car, prefix='Car')
df = df.join(y, how='outer')
df
The results:
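The resulting frame appears as an image in the article. With get_dummies and the prefixes above, you would expect new indicator columns such as City_Tel Aviv and Car_Private (the exact names depend on the actual category values in the data). A common follow-up, not shown in the article, is to drop the original categorical columns once their encoded versions exist; a minimal sketch, assuming the column names used here:
# Optional: drop the original categorical columns now that they are encoded
# (an assumed follow-up step, not part of the original article).
df = df.drop(columns=['City', 'Car'])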