Analysis of life insurance data to support a data product proposal for an insurance brokerage

Author: Robbie Sharma (robbie.sharma@gmail.com)

Mentor: Hobson Lane (hobsonlane@gmail.com)

Prepared for: Springboard - Data Science Intensive Course

January 2016 - February 2016

Introduction

Insurance brokers are challenged to retain their client base while growing business in diverse markets. Personalizing insurance products and providing policies at competitive premiums are primary goals for brokers, but to achieve them brokers must understand the factors that influence claims risk, the likelihood that policies will continue, and the administrative overhead of managing client policies.

For the Data Science Intensive Capstone project, my goal was to understand data wrangling and predictive modelling with machine learning so that I could help my insurance broker client meet or exceed their strategic retention and growth goals. To achieve this, I analyzed a life insurance data set offered by Prudential Life Insurance through a competition on Kaggle.com. Using data analytics and machine learning techniques, I studied a structured and fairly "clean" data set in order to predict an ordinal categorical risk rating from the profiles of life insurance customers.

The premise of this study is to show my client how insurance and customer data can be manipulated to identify important features, trends and statistics in order to make predictions and decisions in meeting their strategic goals.

Using the life insurance data set, methodologies and analytical techniques are demonstrated for identifying interesting data features, normalizing and transforming selected features, and applying and tuning simple machine learning algorithms for predictive purposes.

Multiple interviews were conducted with the CEO and CFO of the insurance brokerage in order to understand their strategic goals, business processes, and data sources.

This report will make useful connections between solving a structured machine learning problem involving the life insurance dataset and how a similar methodology could be applied to the datasets from the insurance brokerage.

The project is separated into two sub-sections: a discussion of the life insurance dataset and its machine learning analysis, and a discussion of the insurance brokerage data analytics project and how such a project could be developed.

The conclusion will connect the results and methodologies of the life insurance data set to the machine learning piece of the insurance brokerage's data analytics project.

Life Insurance: Problem description

The goal of the life insurance Kaggle competition is to solve a multi-class classification problem by assigning an ordinal categorical risk rating from 1 to 8 to each customer profile. Algorithm accuracy is evaluated with the quadratic weighted kappa (QWK), a measure of the inter-rater agreement between the predicted and the actual risk ratings.
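
For reference, QWK is commonly defined as

\[ \kappa = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}}, \qquad w_{ij} = \frac{(i-j)^2}{(N-1)^2} \]

where O_ij is the observed count of profiles with actual rating i and predicted rating j, E_ij is the expected count obtained from the outer product of the two rating histograms (scaled to the same total), and N is the number of rating levels. A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance.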

The data set contains 127 features (excluding the Id column), consisting of categorical (ordinal), continuous (normalized), and discrete feature types. Forty-eight of these are Medical_Keyword features, which act as dummy variables with binary discrete values.

The deliverable was a machine learning algorithm for predicting the risk response in the Kaggle dataset.

An appropriate process was needed to transform the data, select useful features, implement a machine learning algorithm, and calculate the QWK to measure classification error. The top competitors achieved a QWK score of 0.67939.

Methods & Analysis

The following methods were used to analyze the data, with supporting code written for each step (a brief code sketch of the main preparation steps follows the list). The primary goal was to compare the Response ratings to the other features in the data set.

  1. Data exploration and plotting
    1. The 48 Medical_Keyword_X columns were summed to derive a total medical keyword count feature.
    2. Histogram plots of all the features were created to perform a preliminary exploration.
    3. The following features were explored in this project in more detail: Ins_Age, Ht, Wt, BMI, Product_Info_2, and Medical_Keyword_Sum
    4. Histogram plots and a scatter matrix plot were saved for the features listed above.
  2. Data transformation and normalization
    1. Replaced alpha-numeric labels in Product_Info_2 with an enumerated dictionary of dummy integers.
    2. Replaced all the NaNs in the dataset with -1.
    3. Normalization based on the min-max range of each feature.
      1. Risk rating was normalized to values between 0 and 1.
      2. Product_Info_2 normalized between 0 and 1.
      3. Categorical and discrete features were normalized to the range 0 to 1. Elements with a value of -1 (former NaNs) were mapped to 0.
  3. Machine learning
    1. Training/Test set
      1. 10% of the train.csv data set used for a testing set and 90% used for a training set.
    2. Test and fitting using the following classifiers
      1. Linear Model - LASSO
      2. Random Forest
    3. Evaluation
      1. Quadratic Weighted Kappa function from the skll library used to score the classification error
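
The following is a minimal sketch of these preparation steps, assuming the Kaggle train.csv file and the competition column names; the exact column lists used in the project may differ.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the Kaggle training data (column names as in the competition files).
df = pd.read_csv("train.csv")

# 1. Sum the 48 binary Medical_Keyword_X columns into a single count feature.
keyword_cols = [c for c in df.columns if c.startswith("Medical_Keyword_")]
df["Medical_Keyword_Sum"] = df[keyword_cols].sum(axis=1)

# 2. Replace the alpha-numeric Product_Info_2 labels with enumerated dummy integers.
codes = {label: i for i, label in enumerate(sorted(df["Product_Info_2"].unique()))}
df["Product_Info_2"] = df["Product_Info_2"].map(codes)

# 3. Replace all remaining NaNs with -1 (no feature uses -1 as a legitimate value).
df = df.fillna(-1)

# 4. Min-max scale selected categorical/discrete columns and the Response to [0, 1].
cols_to_scale = ["Product_Info_2", "Medical_Keyword_Sum", "Response"]  # extend as needed
df[cols_to_scale] = MinMaxScaler().fit_transform(df[cols_to_scale])
```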

Data exploration - Life Insurance

Basic statistics, histograms and scatter plots were created to determine useful features for use in the machine learning algorithms.

Descriptive statistics

There are 59380 rows in the dataset. Detailed descriptive statistics can be seen in the following notebook: capstone-data-story-project.ipynb

Scatter matrix plot of Response, Ht, BMI, Ins_Age, Product_Info_2, Medical_Keyword_Sum

Histograms

Histograms were created for all the features and the following were selected to be the most interesting.

Product_Info_2 Categorical data set

  • Shows an exponential distribution of categories
  • Categories could be normalized and re-ordered to make the distribution closer to normal
  • The D3 category is a very common occurrence across all risk levels

Product_Info_4 Continuous data set

  • I suspect this may be related to the cost of a life insurance plan.
  • Values between 0.2 and 0.3 are common occurrences.

Risk response

  • Risk rating 8 is the most common type
  • Very small occurrences of Risk rating 3 and 5
  • The number of records per risk rating is uneven, which suggests the ML algorithms may encounter classification issues with the rarer ratings.

Further histogram exploration

Histograms of selected features, created for each response rating (1-8), are shown below.

'Ht' feature - Risk response 1-8

'BMI' feature

'Ins_Age' feature

'Product_Info_2' feature

'Medical Keyword Sum' feature

Data cleaning & preparation

NaN replacement

The following features had NaN values in their respective columns of data:

Feature NaN count
Employment_Info_1 19
Employment_Info_4 6779
Employment_Info_6 10854
Insurance_History_5 25396
Family_Hist_2 28656
Family_Hist_3 34241
Family_Hist_4 19184
Family_Hist_5 41811
Medical_History_1 8889
Medical_History_10 58824
Medical_History_15 44596
Medical_History_24 55580
Medical_History_32 58274

As no datum had a -1 value, all NaN values were replaced with a -1.

Data normalization

As Product_Info_2 contained string values, each string category was mapped to an enumerated dummy integer. All columns with NaN values had already been replaced with -1. The goal was to convert the data set into numeric values that could be normalized and used in the regression algorithms. A min-max scaling operation was then performed on all categorical and discrete features.

Data split into test-training sets

The dataset was split into risk rating groups because the end goal is to predict risk rating from a test data set.

The first "len(group)/100" portion of each risk rating group was pooled into a test data frame, and the remaning into a training data frame. This was performed in a loop, however the cross-validation library in sklearn would be a better option to perform the data split-test.

Predictive modelling - Life Insurance

After cleaning, NaN conversion, normalization, and splitting, the training and test data sets were each separated into an X set and a Y set. The Y set was the normalized risk rating feature, and the X set contained the remaining features. These were converted to numpy arrays for classifier fitting and prediction. After fitting and prediction, the normalized risk rating values were transformed back into their original risk rating categories (1-8).

Lasso linear regression and random forest classifiers were used to fit the training data sets. The QWK was determined for each test data set and compared to the top QWK score from Kaggle.
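
A rough sketch of the fitting step (not the exact project code), assuming the frames from the earlier sketches; the helper that maps normalized values back to the 1-8 scale reflects the min-max normalization described above:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestClassifier

# Feature matrices (X) and normalized risk-rating targets (Y).
X_train = train_df.drop(["Id", "Response"], axis=1).values
y_train = train_df["Response"].values  # normalized to [0, 1]
X_test = test_df.drop(["Id", "Response"], axis=1).values
y_test = test_df["Response"].values

def to_rating(y_norm):
    """Map normalized [0, 1] values back to the 1-8 rating scale."""
    return np.clip(np.round(y_norm * 7 + 1), 1, 8).astype(int)

# Lasso linear regression (alpha was swept in the tests below).
lasso = Lasso(alpha=0.0099).fit(X_train, y_train)
lasso_pred = to_rating(lasso.predict(X_test))

# Random forest classifier (the number of estimators was swept in the tests below).
rf = RandomForestClassifier(n_estimators=100).fit(X_train, to_rating(y_train))
rf_pred = rf.predict(X_test)
```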

Evaluation criteria - Quadratic Weighted Kappa

Accuracy_score, mean_squared_error, and Quadratic Weighted Kappa were considered for evaluating the classifiers. QWK was chosen because it is the most appropriate measure of how well a classifier predicts ordinal ratings, and it is the metric used by the competition. The QWK score was then determined by comparing the predicted and actual values for the test data set.
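
A minimal evaluation sketch, assuming the kappa function from the skll library (skll.metrics.kappa) and the predictions from the sketch above:

```python
from skll.metrics import kappa

# Quadratic weighted kappa between actual and predicted ratings (1-8 scale).
actual = to_rating(y_test)
print("Lasso QWK:        ", kappa(actual, lasso_pred, weights="quadratic"))
print("Random forest QWK:", kappa(actual, rf_pred, weights="quadratic"))
```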

Lasso Linear Regression modelling

  • The alpha value tunes the strength of the L1 regularization penalty.
  • Tuned with alpha values between 0.0001 and 0.099 across the two tests
  • A very low optimal alpha suggests high variance and that the model is over-fitting
  • The feature set needs to be reduced to make the model generalize better

Regularization reduces the overfitting problem

  • L1 regularization (Lasso regression) shrinks the coefficient norms and can drive some coefficients to zero; a larger alpha produces stronger shrinkage
  • L2 regularization spreads out the shrinkage so that interdependent (correlated) variables remain similarly influential
  • The alpha parameter controls the strength of regularization in the linear regression model

Random Forest Classifier modelling

  • Tuned with the number of estimators ranging from 1 to 900 across the two tests

Classifier Results

alpha/time vs. kappa: linear lasso - test#1

       alpha       kappa       time
count  50.000000   50.000000   50.00000
mean    0.050000    0.162239    0.47140
std     0.029155    0.086191    0.09196
min     0.001000    0.111002    0.36900
25%     0.025500    0.116683    0.41125
50%     0.050000    0.126014    0.44200
75%     0.074500    0.146227    0.49200
max     0.099000    0.461167    0.79400

alpha/time vs. kappa: linear lasso: test2

       alpha       kappa       time
count  50.000000   50.000000   50.000000
mean    0.005000    0.386544    1.049080
std     0.002915    0.049576    2.596456
min     0.000100    0.309687    0.524000
25%     0.002550    0.340325    0.618250
50%     0.005000    0.384313    0.651500
75%     0.007450    0.430685    0.680500
max     0.009900    0.473927   19.014000

Estimators/Time vs. kappa: RandomForest: Test1

       est          kappa       time
count   10.000000   10.000000   10.000000
mean    46.000000    0.337425   19.335500
std     30.276504    0.035419   13.075482
min      1.000000    0.259207    0.530000
25%     23.500000    0.325625    9.757500
50%     46.000000    0.346819   18.733000
75%     68.500000    0.358949   28.620500
max     91.000000    0.376106   41.517000

Estimators/Time vs. kappa: RandomForest: Test2

       est           kappa       time
count    9.000000    9.000000     9.000000
mean   500.000000    0.357689   210.043556
std    273.861279    0.006708   114.229644
min    100.000000    0.346036    42.440000
25%    300.000000    0.357782   123.817000
50%    500.000000    0.358111   214.093000
75%    700.000000    0.362640   291.163000
max    900.000000    0.366685   356.299000

Discussion of results

The results show that the linear lasso algorithm trained and predicted faster and achieved a higher QWK score than the random forest. I had expected the random forest to achieve better QWK scores with more estimators, but this was not the case. The lasso classifier's best QWK was reached at an alpha value of 0.0099, and such a low alpha suggests the algorithm is over-fitting the data.

Insurance brokerage: Problem description

My client is an insurance broker that has been in business for 21 years. They have collected a significant amount of data on their clients, insurance agencies, products and sales representatives over the past 20 years. Due to confidentiality concerns, I cannot release any data, although the strategies employed will be discussed.

Having shown that risk ratings can be predicted from life insurance applicant profiles, I can use the methods learned in this course and project to facilitate a discussion of how similar methods could resolve my client's problem of improving retention.

I explored their Applied Systems TAMS software (an insurance brokerage management tool). The tool produces client product summaries, claims revenue, sales summaries, and accounting reports, which can be exported as CSV files.

The deliverables were a memo outlining an approach to solving my client’s retention problem and a machine learning algorithm for predicting risk response in the Kaggle dataset. A portion of the email memo is attached in the Appendix.

In the context of this project, the goal is to determine what tasks would be useful in a data science project that could help my insurance broker client meet their company goals.

Methods and Analysis

The following steps were taken to prepare a data analytics project proposal:

  1. Interviews conducted with CEO and CFO
    1. Determined key company goals for the year
      1. Retain 90% of clients on renewal every month.
      2. Grow client base by X% every month (X needs to be determined).
      3. Cut overhead costs by 10%.
  2. Discussed financial and operational health of the company.
    1. Company overhead was 70%
    2. Top 10% of producers/sales people are the CEO and 3 long term employees
      1. Opportunities for training and development for rising stars.
  3. Provided information and education on data analytics
    1. Industry trends in adopting data analytics in the insurance industry.
    2. Methodologies and procedures to perform data analytics.
    3. Useful outcomes to make decisions in the insurance business.
  4. Preparation of data analytics project proposal
    1. Observational summary
    2. Data analytics project recommendation

Observational summary

The following was submitted to the company after interviews were conducted.

  • Applied Systems TAM major issues:
    • Under-utilization of useful reports, e.g. marketing, activity and call reports
    • Duplicate/incorrect/missing data
    • Poor logging of management, producer and CSR activities
  • A strategic retention plan is required: the major renewal cycle from March to September is approaching. What is the plan?
  • Developing the employee handbook and operational policies are key initiatives.
  • The Insurance Brokerage earns about 12.5% commission revenue on Personal lines and 7%-17% commission revenue on Commercial.
    • Higher commission is earned on new business; growth strategy could focus on markets that offer balance between retention risk, claims risk and high commission revenue
  • Limited performance/development incentives for CSRs and producers who are not named CEO, TopRep A or TopRep B (Names replaced for privacy)
    • There are opportunities for personal to commercial cross-sells at renewals
  • Clients are observed to be happy with company service
  • Company culture is A+; employees are generally happy, they respect your management team, and reception is good at taking care of walk-in clients

Data analytics project - Phase 1

Project proposal details

After reviewing the client's database, I proposed the following workflow and processes to assist my client in meeting their business goals.

  1. Data cleanup/transformation.
    1. Observed duplicates, missing data, information not properly filled in etc.
    2. Investigate how the platform supports mass changes and what needs to be changed.
      1. The SQL database can be directly queried if Applied Systems TAMS approves database read/write access.
  2. Data exploration
    1. Perform ETL processes on TAM data using Python
    2. Identification of data types (continuous, discrete, categorical etc.)
    3. Identification of data features related to retention and cross-selling goals
  3. Data analytics
    1. Basic descriptive statistics on
      1. products
      2. representatives
      3. sales activities
      4. claims losses
      5. premium revenues
    2. Basic Tables/charts -> top 20%, histograms, pie charts
    3. Retention rates of different premium brackets
      1. New policies/Total policies, Lost policies/Total Policies
    4. Customer segments (building 1st- and 2nd-order models)
      1. Premium brackets
      2. Combinations of meta data
        1. Income brackets, postal code, city, province, gender, age, personal? commercial? both?
    5. Discrete, continuous, and categorical time series signatures of “customer features”.
    6. Experimentation with machine learning and predictive models using simple linear regression, random forest and decision trees.
  4. Data visualization
    1. Excel, Qlikview, or Tableau dashboards… TBD after exploration and further needs assessments
  5. Management Consulting
    1. Recommending reporting, decision-making and operating procedures/policies on retention and product cross-selling
    2. Identifying an appropriate reporting and analytics toolchain and workflow for the company

Personal and commercial lines data table

The project would be to create Python scripts that extract and process the data from the reports generated by the TAMS platform. The dataset structure would be similar to the life insurance project data set (a sketch of such an extraction follows below). For the personal lines business, examples of the features to generate would be:

  • Customer ID
  • Producer_ID (person who manages the client account)
  • CSR_ID (customer service assistant to Producer_ID)
  • 1st contact date
  • 1st premium date
  • Final premium date
  • Insurance product (1-XXX) date added
  • Insurance product (1-XXX) date removed
  • Postal code
  • Income bracket
  • Gender

A very similar dataset for commercial would be made.
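
As an illustrative sketch only: the file names and column names below are hypothetical assumptions about how TAMS CSV exports might be combined into such a table, not the actual TAMS export schema.

```python
import pandas as pd

# Combine CSV reports exported from the TAMS platform into one personal-lines table.
clients = pd.read_csv("tams_client_report.csv")    # Customer_ID, Producer_ID, CSR_ID, ...
policies = pd.read_csv("tams_policy_report.csv")   # product codes, premium dates, ...

personal = clients.merge(policies, on="Customer_ID", how="left")

# Parse the date fields so retention windows (1, 2, 3 years, ...) can be derived.
for col in ["First_Contact_Date", "First_Premium_Date", "Final_Premium_Date"]:
    personal[col] = pd.to_datetime(personal[col], errors="coerce")

# Example derived feature: policy tenure in years, useful for retention analysis.
personal["Tenure_Years"] = (
    personal["Final_Premium_Date"] - personal["First_Premium_Date"]
).dt.days / 365.25
```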

Actionable intelligence

Data can be normalized and anonymized for confidentiality before off-site analysis. The following information could then be derived:

  • Descriptive and inferential statistics on all the features.
  • Retention rates between different premium brackets
  • Customer segments (building 1st and 2nd order models)
  • Discrete, continuous, and categorical time series signatures of “customer features”.

Time series data signatures to consider:

  • Day of week
  • Month of year
  • Quarter of the year

Unique value proposition

The information derived can provide “actionable intelligence” for The Insurance Brokerage; the critical and key deliverable to meet retention and progressive growth goals.

A few “actionable intelligence” measures come from viewing the data through different lenses reflecting when financial decisions are made by The Insurance Brokerage, its customers, the insurance companies, and the industry as a whole.

Customer segments can be compared in different premium brackets, income brackets, product line combinations, or renewal times of the year.

The customers who cancel after one, two, or three years can be explored. A predictive model can then be built to determine which clients are prone to cancelling after a certain time period.

The unique value is the ability to rank the high-revenue, low-claims-risk clients that are predicted most likely to cancel during the renewal period. The retention strategy would then focus on ensuring these clients are taken care of.
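
A hypothetical sketch of how such a ranked watch list could be produced from the (assumed) personal-lines table built above; the churn window and feature names are illustrative only:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Illustrative label: did the client cancel within three years of their first premium?
personal["Cancelled_Within_3y"] = (personal["Tenure_Years"] <= 3).astype(int)

# Encode a few illustrative categorical features as dummy variables.
features = pd.get_dummies(personal[["Income_Bracket", "Gender", "Postal_Code"]])

model = RandomForestClassifier(n_estimators=200)
model.fit(features, personal["Cancelled_Within_3y"])

# Probability of cancelling; sort descending to prioritize retention outreach.
personal["Cancel_Risk"] = model.predict_proba(features)[:, 1]
watch_list = personal.sort_values("Cancel_Risk", ascending=False)
```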

Lessons learned

Data wrangling

  1. Learned the difficulty in processing data for fitting machine learning algorithms
  2. Collected a lot of good code snippets to aid in future work
  3. When installing packages through a Windows-based Anaconda environment, use the 'conda install' command rather than 'pip install' to perform the package install. I installed the 'skll' package using pip and my package environment was compromised, so I had to manually remove packages and reinstall many of them to work out the module error bugs in the code.

Predictive Model Improvements

To further improve the predictive model I would perform the following in the future (a brief sketch of items 1 and 3 follows the list):

  1. Split data set into 70% train and 30% test using the Cross Validation package.
  2. Select partial features to train the classifiers. Product_Info_2, Product_Info_4, Ins_Age and Medical_Keyword_Sum look like promising base features to include. Employment_Info and Insurance_History would need to be explored in more detail.
  3. Use LassoCV to perform some further cross validation checks for linear regression.
  4. Tune the random forest. Reduce the number of branches for faster processing.
  5. Tune lasso with alpha of magnitudes greater than 1.
  6. Create scatter plots of Response vs. each label then explore R^2 relationships.
  7. Try replacing NaN values with 0 or disregard the NaNs in data set when fitting.
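
A minimal sketch of items 1 and 3, assuming the prepared DataFrame from the earlier sketches:

```python
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
from sklearn.linear_model import LassoCV

# Item 1: a stratified 70/30 train/test split.
X = df.drop(["Id", "Response"], axis=1).values
y = df["Response"].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Item 3: LassoCV searches a grid of alpha values with built-in cross-validation.
lasso_cv = LassoCV(cv=5).fit(X_train, y_train)
print("Best alpha found by cross-validation:", lasso_cv.alpha_)
```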

Client Interactions

  1. Present more visual information to the client than textual information.
  2. Present information in a single memo or proposal rather than in multiple submissions.
  3. Reduce technical jargon and present information in simple terms.
  4. Estimate work and present a proposal faster.

Conclusion

Life Insurance

The Prudential life insurance data set offered a good basis to learn and apply data analytics methodologies using Python, data analysis, visualization and predictive modelling libraries. The goal was to determine a predictive model for risk rating prediction and classification.

Histograms and a scatter plot matrix were created to explore the relationships between risk rating and other data features. The data set consisted of categorical, continuous and discrete datasets that were processed for NaN values and normalized. Lasso linear regression and random forest classifiers were used. Quadratic Weighted Kappa (QWK) was used to evaluate classifier performance as the risk ratings were categorical.

The lasso linear regression offered the fastest training time and the best QWK value of 0.473, obtained with an alpha value of 0.0099. Although the pipeline produced a valid QWK score, the model did not perform well: the low alpha value suggests overfitting, and the QWK value would have placed about 2300th out of 2619 scores in the Kaggle competition, where the top score was 0.679. Reducing the number of features should allow the algorithm to be fine-tuned to approach the top score.

As the dataset given was fairly clean and has features similar to those of my insurance broker's data, the life insurance project forms a good outline of how a predictive modelling project could be applied to the insurance broker's data.

Insurance brokerage

Through managerial interviews and observations of the insurance brokerage, a realistic and feasible data project was determined. The primary goal of the project is to help form the broker's retention strategy. Data would be extracted from the broker's TAMS platform (an insurance brokerage customer management program), from which various managerial reports can be exported as CSV files. The CSV files could be processed to create a data table similar to that of the Prudential Life Insurance data set. Product lines, the sales people associated with accounts, and premium start and end dates were determined to be useful features for retention goals. Descriptive statistics on various customer segments can be useful in determining the retention goals for the brokerage. The most useful intelligence would be to determine which features of the datasets contribute the most to a customer cancelling a policy after one, two or three years. Predicting the most valuable customers prone to cancelling would form the basis of the retention strategy.

The client has shown interest in performing this project, although they are not convinced it is a feasible venture at this time. They are interested in knowing how a predictive model can aid retention. I will approach the client again in the upcoming months regarding the project.

Overall, the project was a useful learning tool in applying data analysis using Pandas and machine learning using Sci-kit learn.

Thanks

I would like to thank Hobson Lane for his mentorship in completing this project and appreciate the support, understanding and patience he has given. In most of the 30 min mentor sessions, we were able to communicate well and I received a lot of useful information from him on the world of machine learning. His knowledge, expertise and insane level of intelligence was very helpful in boosting my understanding of the concepts.

I also thank Springboard for organizing quality content to learn "intensively" the fundamentals of data science and machine learning. The capstone project allowed me to learn Python, Pandas, Sci-kit learn and other data analysis fundamentals.