Roy Lai

An analysis of Insurance Application Forms

Updated: Jun 19, 2019



The aim of this project was to accurately categorise customer risk using insurance application forms, thereby making it quicker and less labour intensive for customers to get an insurance offer. To do that, we’ll need to understand the application process in a bit more detail.


A graphical representation of the process

To give a bit of context, I would like to start with a short story from early in my career. It was a typical Monday, and my boss asked me to process a life insurance application form for a customer before retreating to his office. Let's call the customer Bob. Bob had signed the form over the weekend, and my boss was now rushing to complete the process before the end of the week. I was left sitting there thinking… "Impossible!" Not unless we got lucky with a clean skin (an applicant the underwriter can accept without further investigation). The customer still needs to get a blood test, and if anything is unclear, there will be more back and forth between the customer, our office and the insurance underwriter. On average we are looking at a month to complete an application, maybe more.


And… Well, before I finished my thought, I realised my boss was once again standing in front of me. He looked at me and said “Don’t worry about the application for now. Bob just passed away”.


You'll be glad to know that, down the track, Bob's family received the sum insured even though the application had not been submitted at the time, but that is a whole other story. The point I am trying to illustrate is that time is critical; you never know what is going to happen. Yes, this was an extreme example and Bob's family got the money in the end, but is this something you would want to worry about when a loved one has just passed away?


So, here's the challenge: how can we make this process simpler, quicker and better? There are many benefits in doing so, from both the business's and the customer's perspective.


But before delving deeper into the topic, let's have a quick look at the data I worked with.

 

Exploratory Data Analysis


The data I worked with was provided by Prudential Life but, given the sensitive nature of the information, it has all been de-identified. We have roughly 60,000 rows of data, all already normalised.


I note that there are also a lot of null values. Given the information is taken from insurance applications, it stands to reason that the nulls represent a lack of response to the questionnaire. As such, my first assumption is to treat the nulls as 0, indicating the lack of a response. I then considered other methods, including logistic regression and k-means, to impute the values.
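As a minimal sketch of that first pass (the file name here is illustrative, and the KNN-based alternative is just one of the imputation options I considered):

```python
import pandas as pd

# Load the de-identified application data (file name is illustrative).
df = pd.read_csv("prudential_train.csv")

# First pass: assume a null means the question was not answered,
# so flag it with 0 rather than dropping the row or guessing a value.
df = df.fillna(0)

# One alternative I considered: impute with a nearest-neighbours approach.
# from sklearn.impute import KNNImputer
# df[df.columns] = KNNImputer(n_neighbors=5).fit_transform(df)
```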


The first thing I want to draw your attention to is our target column, 'Response'. 'Response' takes integer values from 1 to 8, which I am treating as 8 different categories.


Of note here is that we have a bit of class imbalance, with significantly fewer applications falling into Response categories 3 and 4.
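A quick way to see this imbalance, assuming the data frame loaded above:

```python
import matplotlib.pyplot as plt

# Count applications per Response category; 3 and 4 stand out as the rarest.
counts = df["Response"].value_counts().sort_index()
print(counts)

counts.plot(kind="bar", title="Response class counts")
plt.xlabel("Response")
plt.ylabel("Number of applications")
plt.show()
```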


Next, the columns with intuitive meaning were Age, Weight, Height and BMI, where BMI incorporates both Weight and Height. So, let's take a closer look at Age and BMI.


Starting with Age, the graph here illustrates the average age in each of our response categories. Intuitively, I assumed a higher Response means higher risk, so we should see category 8 with the highest average age, and as we move from category 8 to 1, the average age should decrease in a relatively linear manner.


However, this graph tells a different story.

Firstly, we see category 1 has the highest mean age while category 8 has the lowest. This is the opposite of my assumption and suggests that, in fact, a higher Response means lower risk.

Secondly, the mean ages across groups did not change in a relatively linear manner as I expected. There could be several reasons for this, such as younger people exhibiting riskier behaviour like excessive drinking and speeding.


Next we look at BMI. We see response group 8 with the lowest average BMI and group 1 with a BMI on the higher side, but not the highest. I expected Response 8 to have the highest BMI if it were the riskiest group, so again this supports the idea that a lower Response indicates higher risk. There are clear differences between the groups, but no specific pattern was identifiable here.
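For reference, the group means behind both of these charts can be pulled with a single groupby; the column names below follow the prose and may differ in the de-identified file, where the values are also normalised:

```python
# Mean Age and BMI per Response category (column names are illustrative).
print(df.groupby("Response")[["Age", "BMI"]].mean())
```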



 

Modelling


On to the fun part: the modelling! I will first go through my process of selecting a model, then how I approached tuning the hyperparameters of the selected model, and finally how I interpreted the results.


As I am treating this as a classification problem, I compared several classifiers, including KNN, Logistic Regression, SVM, XGBoost and Extra Trees.


The models were compared on training time, accuracy and log loss. Log loss was chosen because, to my understanding, it generally deals well with multiclass classification and unbalanced data, and it handles unbalanced data in a more neutral way than F1 and ROC-AUC. The closer the log loss is to 0, the better.


Accuracy was also used for comparison, including against the baseline, which we can see here is 0.37. In this case, the closer the accuracy is to 1, the better.

We can see here that the log loss scores across the different models were quite comparable, but on accuracy XGBoost performed best despite evidence of overfitting (its training score was significantly higher than its test score). Logistic Regression, SVM and Extra Trees all had more consistent results between train and test, but also lower accuracies.
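To make the comparison concrete, here is a minimal sketch of this kind of loop, assuming the data frame from the EDA section with all features numeric after the null handling; the exact model settings I used are not reproduced here:

```python
import time

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

X = df.drop(columns=["Response"])
y = df["Response"] - 1  # shift 1-8 to 0-7; XGBoost expects 0-based labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True),  # probability=True so log loss can be scored
    "Extra Trees": ExtraTreesClassifier(),
    "XGBoost": XGBClassifier(),
}

for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    fit_time = time.time() - start
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    test_ll = log_loss(y_test, model.predict_proba(X_test))
    print(f"{name}: fit {fit_time:.0f}s, train acc {train_acc:.3f}, "
          f"test acc {test_acc:.3f}, log loss {test_ll:.3f}")
```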

Not shown here, but I also attempted to create an ensemble through a Voting Classifier and built a meta-learner using a Gradient Boosting Classifier, both manually and through the MLENS package (a sketch of the voting approach follows below). However, I was not able to achieve better results.
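For those curious, a soft-voting ensemble along these lines can be put together with scikit-learn; the base estimators here are illustrative and the sketch reuses the train/test split from the comparison above:

```python
from sklearn.ensemble import ExtraTreesClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Soft voting averages the predicted probabilities of the base models.
voter = VotingClassifier(
    estimators=[
        ("xgb", XGBClassifier()),
        ("extra_trees", ExtraTreesClassifier()),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
voter.fit(X_train, y_train)
print("Voting ensemble test accuracy:",
      accuracy_score(y_test, voter.predict(X_test)))
```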


Based on these results, I chose XGBoost as my model of choice to explore further and to see if I could improve its predictions.


The process was quite simple: I performed a grid search over the parameters shown here. Using the best parameters, I tested the model and found that it now gave me an accuracy of 63% on the training data, but it was over-fitting, so I played around with the regularisation parameters further.
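A sketch of that grid search; the parameter ranges below are illustrative only, since the exact grid I used is shown in the image that follows:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative grid; the actual ranges searched are in the image below.
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.2],
    "n_estimators": [100, 300],
    "reg_alpha": [0, 1],    # L1 regularisation
    "reg_lambda": [1, 5],   # L2 regularisation
}

search = GridSearchCV(
    XGBClassifier(),
    param_grid,
    scoring="neg_log_loss",
    cv=3,
    n_jobs=-1,
)
search.fit(X_train, y_train)

print(search.best_params_)
print("Train accuracy:", search.best_estimator_.score(X_train, y_train))
print("Test accuracy:", search.best_estimator_.score(X_test, y_test))
```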


The Parameters

The results

With that, we have the new accuracy and log loss as shown. The accuracy has dropped as a result, but there is less over-fitting. On the positive side, the log loss was reduced significantly compared to the values of around 1.5 from before.



 

Interpretation

Moving on to interpretation, I took advantage of two more modules: SHAP and LIME.


Let's start with SHAP. Without going into too much detail on how Shapley values are calculated (they are based on game theory, for those interested!), the following chart looks at the overall relevance of each feature to the model as a whole. The higher the average Shapley value (the x-axis), the more impact the feature has on the model. The graph also shows the average Shapley value for each class, in other words, the importance of each variable to a particular class specifically. You will note that the Shapley values are additive. Taking Medical History 4 as an example, we can clearly see it had the most impact of all our features, and within that it was particularly important to the blue and purple classes.
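A sketch of how a summary like this can be produced with the shap package, assuming the tuned XGBoost model from the grid search above:

```python
import shap

# TreeExplainer is the fast explainer for tree ensembles such as XGBoost.
explainer = shap.TreeExplainer(search.best_estimator_)
shap_values = explainer.shap_values(X_test)

# Bar summary: mean |SHAP value| per feature, broken down by class.
shap.summary_plot(shap_values, X_test, plot_type="bar")
```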


With the overall model interpretation shown here, the business could use this insight to minimise irrelevant questions in application forms and focus on those that are more important, making the process more efficient.


The SHAP module can also interpret individual rows of data, but for that I am going to defer to the LIME module, as it is slightly easier to read for our purposes.


Example of LIME output

The above is an example of LIME's output. There is a lot going on here, so let's work from left to right. The first graph shows the prediction probability for each class: we see the model predicting this person has an 82% chance of being in class 8, a 5% chance of being in class 2, and so forth. But how did the model determine that? The next two illustrations show just that. Here we see the top five contributing features, where the orange values contributed towards a higher probability of class 8 while the blue values reduced the probability of the person being in class 8. On the right, we see the actual values this person scored on the respective features.
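A sketch of how an explanation like this can be generated with the lime package, again assuming the tuned XGBoost model and the train/test split from earlier; the row index is arbitrary:

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=X_train.columns.tolist(),
    class_names=[str(c + 1) for c in sorted(y_train.unique())],  # show 1-8
    mode="classification",
)

# Explain a single application and its most likely class.
model = search.best_estimator_
explanation = explainer.explain_instance(
    X_test.values[0], model.predict_proba, num_features=5, top_labels=1
)
explanation.show_in_notebook(show_table=True)
```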


This level of interpretation not only gives the company more confidence in what the model is doing, it also helps us delve one level deeper. For example, a family history of diabetes may push one person's risk level up, but for another person, given their healthy lifestyle, it may be less of a determining factor. These are the types of interpretation we are hoping to build. The aim is not just time savings but also a methodical and consistent framework that can assist the underwriting process.


One application is that the output could be used by the insurance underwriter as a starting point for reviewing an insurance application. It could potentially serve as a tool for a more guided approach to the underwriting process, improving the consistency of the underwriting methodology. This could help the business apply a more consistent standard when making decisions and reduce the back and forth with customers by focusing only on what's important.


 

An Experimental Approach - Clustering


A bit small, but the illustration serves to show that, based on the colours, the groups are not clearly definable.

As an experiment, I also performed a clustering analysis to see if I could find groupings that reflect my target groups (i.e. the ground truth). I used PCA to create 5 principal components and KMeans to perform the clustering, specifying 8 clusters to match the number of groups in my target.


Doing a pairplot of the principal components, colour-coded by my target groups, the plots indicated that I do not have clearly defined target groups. And while I was able to get nicely separated clusters with a silhouette score of 0.59, other measures (V-measure, homogeneity, completeness) suggested that this analysis was not producing very good results.
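A minimal sketch of this experiment with scikit-learn, reusing the feature matrix X and the (0-based) target y from the modelling section; the silhouette score is computed on a sample to keep it tractable:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import (
    completeness_score,
    homogeneity_score,
    silhouette_score,
    v_measure_score,
)

# Reduce to 5 principal components, then cluster into 8 groups
# to mirror the 8 Response categories.
components = PCA(n_components=5).fit_transform(X)
clusters = KMeans(n_clusters=8, random_state=42).fit_predict(components)

print("Silhouette:  ", silhouette_score(components, clusters,
                                        sample_size=10000, random_state=42))
print("V-measure:   ", v_measure_score(y, clusters))
print("Homogeneity: ", homogeneity_score(y, clusters))
print("Completeness:", completeness_score(y, clusters))
```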


 

Conclusions


In the end, I do not think I have adequate accuracy to claim something extreme, such as being able to skip the blood test in the application process. That said, the findings in this analysis could be a starting point for building a tool that supports a more guided approach to the underwriting process, improving the consistency of the underwriting methodology. We may be able to focus more on the specific questions that have been shown to influence the Response more, and even remove questions that generally do not contribute as much to the decision.


But for that, we would still need to improve the accuracy of the model, which was partially hampered by having to work with de-identified data.


Nonetheless, the next steps are to continue exploring ways to improve the accuracy of the model. One method I have considered is to treat this as a regression problem and round the resulting predictions up or down to fit them into my 8 categories. I have made a start on this, including trying to fit a polynomial of degree 2, with mixed results, and need to do more work before I can draw conclusions.
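As a rough sketch of the rounding idea, reusing the earlier train/test split (where the target was shifted to 0-7); the degree-2 polynomial variant is not reproduced here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

# Fit Response as a continuous target...
reg = LinearRegression()
reg.fit(X_train, y_train)

# ...then snap the predictions back onto the 8 ordinal categories
# (clipped to the 0-7 range used by the shifted target).
rounded = np.clip(np.rint(reg.predict(X_test)), 0, 7).astype(int)
print("Rounded regression accuracy:", accuracy_score(y_test, rounded))
```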


 

For those interested, the code I used is on my GitHub.


Thank you for reading and if you have any feedback, please do not hesitate to contact me!


Disclaimer: I am not an expert in the insurance underwriting process. The scenario in this project is based on my understanding of the process working from the advice side only. If anyone reading this has any feedback, I would love to hear from you!
