Insurance brokers are challenged to retain their client base while growing business in diverse markets. Personalizing insurance products and offering policies at competitive premiums are primary goals for brokers, but to achieve them brokers must understand the factors that influence claims risk, the likelihood that policies will be renewed, and the administrative overhead of managing their client policies.
For the Data Science Intensive Capstone project, my goal was to learn data wrangling and predictive modelling with machine learning so that I can help my insurance broker client meet or exceed their strategic retention and growth goals. To achieve this, I analyzed a life insurance data set provided by Prudential Life Insurance through a competition on Kaggle.com. Using data analytics and machine learning techniques, I studied this structured and fairly "clean" data set in order to predict an ordinal categorical risk rating from the profiles of life insurance customers.
The premise of this study is to show my client how insurance and customer data can be processed to identify important features, trends, and statistics, and how these can support predictions and decisions that advance their strategic goals.
Using the life insurance data set, methodologies and analytical techniques are demonstrated for identifying interesting data features, normalizing and transforming certain features, and applying and tuning simple machine learning algorithms for predictive purposes.
Multiple interviews were conducted with the CEO and CFO of the insurance brokerage in order to understand their strategic goals, business processes, and data sources.
This report draws connections between solving a structured machine learning problem on the life insurance dataset and applying a similar methodology to the datasets from the insurance brokerage.
The project is separated into two sub-sections: a discussion of the "Life Insurance" dataset and its machine learning analysis, and a discussion of developing a data analytics project for the "Insurance Brokerage".
The conclusion connects the results and methodologies from the life insurance data set to the machine learning component of the insurance brokerage's data analytics project.
The goal of the life insurance Kaggle competition is to solve a multi-class classification problem: assigning a categorical (ordinal) risk rating from 1-8 to each customer profile. Algorithm accuracy is measured by the quadratic weighted kappa (QWK), the inter-rater agreement between the predicted and the actual set of risk ratings.
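For reference, the standard textbook definition of the quadratic weighted kappa (not taken from the project code) is:

$$\kappa = 1 - \frac{\sum_{i,j} w_{i,j}\, O_{i,j}}{\sum_{i,j} w_{i,j}\, E_{i,j}}, \qquad w_{i,j} = \frac{(i-j)^2}{(N-1)^2}$$

where \(O\) is the observed matrix of predicted versus actual ratings, \(E\) is the expected matrix under chance agreement (scaled to the same total as \(O\)), and \(N\) is the number of rating categories.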
The data set contains 127 features (minus the ID feature) consisting of categorical (ordinal), continuous (normalized), and discrete feature types. There are 48 Medical Keyword features acting as dummy variables with binary, discrete values.
The deliverable was a machine learning algorithm for predicting the risk response in the Kaggle dataset.
An appropriate process needed to be applied: transform the data, select useful features, implement a machine learning algorithm, and calculate the QWK to measure classification error. The top competitor achieved a QWK score of 0.67939.
The following methods and code were written to analyze the data. The primary goal was to compare the Response ratings to the other features in the data set.
The following features had NaN values in their respective columns of data:
| Feature | NaN count |
|---|---|
| Employment_Info_1 | 19 |
| Employment_Info_4 | 6779 |
| Employment_Info_6 | 10854 |
| Insurance_History_5 | 25396 |
| Family_Hist_2 | 28656 |
| Family_Hist_3 | 34241 |
| Family_Hist_4 | 19184 |
| Family_Hist_5 | 41811 |
| Medical_History_1 | 8889 |
| Medical_History_10 | 58824 |
| Medical_History_15 | 44596 |
| Medical_History_24 | 55580 |
| Medical_History_32 | 58274 |
As no datum had a -1 value, all NaN values were replaced with a -1.
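A minimal sketch of this step, assuming the training data has been loaded into a pandas DataFrame named df from the Kaggle train.csv file:

```python
import pandas as pd

# Load the Kaggle training data (file name assumed to be train.csv).
df = pd.read_csv("train.csv")

# Confirm that -1 does not occur anywhere before using it as a sentinel.
assert not (df.select_dtypes(include="number") == -1).any().any()

# Replace every NaN with the sentinel value -1.
df = df.fillna(-1)
```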
Because Product_Info_2 contained string values, its string categories were mapped to an enumerated set of integer codes. (All columns containing NaN values had already had those values replaced with -1.) The goal was to convert the data set into numeric values so it could be normalized and used in the regression algorithms. A min-max scaling operation was then performed on all the categorical and discrete features.
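A minimal sketch of this transformation, assuming the DataFrame df from the previous step (the exact column selection used in the project may differ):

```python
from sklearn.preprocessing import MinMaxScaler

# Map the string categories of Product_Info_2 to enumerated integer codes.
codes = {cat: i for i, cat in enumerate(sorted(df["Product_Info_2"].unique()))}
df["Product_Info_2"] = df["Product_Info_2"].map(codes)

# Scale every column except Id to the [0, 1] range; the Response column is
# scaled too, matching the back-transform of the risk rating described later.
cols = [c for c in df.columns if c != "Id"]
df[cols] = MinMaxScaler().fit_transform(df[cols])
```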
The dataset was split into risk rating groups because the end goal is to predict risk rating from a test data set.
The first "len(group)/100" portion of each risk rating group was pooled into a test data frame, and the remaining rows into a training data frame. This was performed in a loop; however, scikit-learn's model selection utilities would be a better option for performing this stratified train/test split, as sketched below.
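A minimal sketch of the scikit-learn equivalent; the 1% test fraction mirrors the len(group)/100 rule used in the loop:

```python
from sklearn.model_selection import train_test_split

# Stratify on the risk rating so every rating keeps the same proportion in the
# training and test sets; test_size=0.01 mirrors the len(group)/100 rule.
train_df, test_df = train_test_split(
    df, test_size=0.01, stratify=df["Response"], random_state=0
)
```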
After cleaning, NaN conversion, normalization, and splitting, the training and test data sets were each separated into an X data set (all remaining features) and a Y data set (the normalized risk rating feature). These were converted to NumPy arrays for classifier fitting and prediction. After fitting and prediction, the normalized risk rating values were transformed back into their original risk rating categories from 1-8.
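Because the regression output is continuous, predictions have to be snapped back onto the 1-8 rating scale. A small sketch of one way to do this, assuming the Response column was min-max scaled from the original 1-8 range:

```python
import numpy as np

def to_rating(y_scaled):
    """Map min-max scaled values back to integer risk ratings 1-8."""
    y = y_scaled * 7 + 1  # undo min-max scaling of a column ranging 1-8
    return np.clip(np.rint(y), 1, 8).astype(int)
```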
Lasso linear regression and random forest classifiers were used to fit the training data sets. The QWK was determined for each test data set and compared to the top QWK score from Kaggle.
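A minimal sketch of the two models, assuming X_train, y_train, and X_test are the NumPy arrays described above and using the to_rating helper sketched earlier (the hyperparameter values shown are illustrative, not the project's final choices):

```python
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestClassifier

# Lasso regression: predicts a continuous value that is mapped back to a rating.
lasso = Lasso(alpha=0.005)
lasso.fit(X_train, y_train)
lasso_pred = to_rating(lasso.predict(X_test))

# Random forest: trained on the integer ratings as class labels, so its
# predictions need no back-transform.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, to_rating(y_train))
forest_pred = forest.predict(X_test)
```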
accuracy_score, mean_squared_error, and the Quadratic Weighted Kappa were considered for evaluating the classifiers. QWK was chosen because it is the most appropriate measure of how well a classifier predicts ordinal ratings. The QWK score was then determined by comparing the predicted and actual values for the test data set.
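scikit-learn's cohen_kappa_score supports quadratic weighting, so the evaluation can be sketched as follows (y_test is assumed to hold the true integer ratings, back-transformed with to_rating where necessary):

```python
from sklearn.metrics import cohen_kappa_score

# Quadratic weighted kappa between predicted and actual risk ratings.
qwk_lasso = cohen_kappa_score(y_test, lasso_pred, weights="quadratic")
qwk_forest = cohen_kappa_score(y_test, forest_pred, weights="quadratic")
print(f"Lasso QWK: {qwk_lasso:.3f}, Random forest QWK: {qwk_forest:.3f}")
```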
Regularization reduces the risk of overfitting, so the Lasso alpha parameter and the number of random forest estimators were swept over a range of values. The summary statistics of each sweep (parameter value, resulting QWK, and run time) are shown below.
Lasso regression, alpha swept from 0.001 to 0.099:

| Statistic | alpha | kappa | time |
|---|---|---|---|
| count | 50.000000 | 50.000000 | 50.00000 |
| mean | 0.050000 | 0.162239 | 0.47140 |
| std | 0.029155 | 0.086191 | 0.09196 |
| min | 0.001000 | 0.111002 | 0.36900 |
| 25% | 0.025500 | 0.116683 | 0.41125 |
| 50% | 0.050000 | 0.126014 | 0.44200 |
| 75% | 0.074500 | 0.146227 | 0.49200 |
| max | 0.099000 | 0.461167 | 0.79400 |
Lasso regression, alpha swept from 0.0001 to 0.0099:

| Statistic | alpha | kappa | time |
|---|---|---|---|
| count | 50.000000 | 50.000000 | 50.000000 |
| mean | 0.005000 | 0.386544 | 1.049080 |
| std | 0.002915 | 0.049576 | 2.596456 |
| min | 0.000100 | 0.309687 | 0.524000 |
| 25% | 0.002550 | 0.340325 | 0.618250 |
| 50% | 0.005000 | 0.384313 | 0.651500 |
| 75% | 0.007450 | 0.430685 | 0.680500 |
| max | 0.009900 | 0.473927 | 19.014000 |
Random forest, number of estimators swept from 1 to 91:

| Statistic | est | kappa | time |
|---|---|---|---|
| count | 10.000000 | 10.000000 | 10.000000 |
| mean | 46.000000 | 0.337425 | 19.335500 |
| std | 30.276504 | 0.035419 | 13.075482 |
| min | 1.000000 | 0.259207 | 0.530000 |
| 25% | 23.500000 | 0.325625 | 9.757500 |
| 50% | 46.000000 | 0.346819 | 18.733000 |
| 75% | 68.500000 | 0.358949 | 28.620500 |
| max | 91.000000 | 0.376106 | 41.517000 |
Random forest, number of estimators swept from 100 to 900:

| Statistic | est | kappa | time |
|---|---|---|---|
| count | 9.000000 | 9.000000 | 9.000000 |
| mean | 500.000000 | 0.357689 | 210.043556 |
| std | 273.861279 | 0.006708 | 114.229644 |
| min | 100.000000 | 0.346036 | 42.440000 |
| 25% | 300.000000 | 0.357782 | 123.817000 |
| 50% | 500.000000 | 0.358111 | 214.093000 |
| 75% | 700.000000 | 0.362640 | 291.163000 |
| max | 900.000000 | 0.366685 | 356.299000 |
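The tables above have the shape of a pandas describe() summary. A hedged sketch of how such a sweep could be generated for the Lasso alpha parameter (the random forest estimator sweep would follow the same pattern; X_train, y_train, X_test, y_test, and to_rating are assumed from the earlier sketches):

```python
import time
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import cohen_kappa_score

records = []
for alpha in np.linspace(0.0001, 0.0099, 50):
    start = time.time()
    model = Lasso(alpha=alpha).fit(X_train, y_train)
    pred = to_rating(model.predict(X_test))
    records.append({
        "alpha": alpha,
        "kappa": cohen_kappa_score(y_test, pred, weights="quadratic"),
        "time": time.time() - start,
    })

# Summarize the sweep in the same form as the tables above.
print(pd.DataFrame(records).describe())
```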
The data show that the Lasso linear algorithm trained and predicted faster and achieved a higher QWK score than the random forest. I had expected the random forest to score better on QWK as the number of estimators increased, but this was not the case. The Lasso model had the higher QWK score, although the low alpha value of 0.0099 (i.e. weak regularization) suggests the model may be over-fitting the data.
My client is an insurance broker and has been in business for 21 years. They have collected a significant amount of data on their clients, insurance agencies, products, and sales representatives over the past 20 years. Due to confidentiality concerns, I cannot release any data, although the strategies employed will be discussed.
Having found risk rating correlations among life insurance applicants, I can use the methods learned in this course and project to facilitate a discussion of how similar methods could address my client's problem of improving retention.
I explored their Applied Systems TAMS software (an insurance brokerage management tool). The tool produces client product summaries, claims revenue, sales summaries, and accounting reports, and it can export these reports as CSV files.
The deliverables were a memo outlining an approach to solving my client’s retention problem and a machine learning algorithm for predicting risk response in the Kaggle dataset. A portion of the email memo is attached in the Appendix.
In the context of this project, the goal is to determine what tasks would be useful in a data science project that could help my insurance broker client meet their company goals.
The following steps were taken to prepare a data analytics project proposal:
The following was submitted to the company after interviews were conducted.
After observing the client's database, I arrived at the following workflow and processes to assist my client in meeting their business goals.
The project would be to create Python scripts that extract and process the data from the reports generated by the TAMS platform. The dataset structure would be similar to the life insurance project data set. For the personal lines business, examples of the features to generate would be:
A very similar dataset would be made for the commercial lines business.
Data can be normalized and secured for confidentiality before off-site analysis. The following information could then be derived:
Time series data signatures to consider:
The derived information can provide “actionable intelligence” for The Insurance Brokerage: the key deliverable for meeting its retention and progressive growth goals.
A few “actionable intelligence” measures involve viewing the data through different lenses: when financial decisions are made by The Insurance Brokerage, its customers, the insurance companies, the industry, and so on.
Customer segments can be compared in different premium brackets, income brackets, product line combinations, or renewal times of the year.
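As a purely hypothetical illustration of such a segment comparison (the file name and the columns annual_premium, product_line, and renewed are placeholders, not actual TAMS fields):

```python
import pandas as pd

# Hypothetical CSV export from TAMS; column names are placeholders.
clients = pd.read_csv("tams_personal_lines.csv")

# Bucket clients into premium brackets and compare renewal rates per product line.
clients["premium_bracket"] = pd.cut(
    clients["annual_premium"],
    bins=[0, 1000, 2500, 5000, float("inf")],
    labels=["<1k", "1k-2.5k", "2.5k-5k", "5k+"],
)
print(clients.groupby(["premium_bracket", "product_line"])["renewed"].mean())
```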
Customers who cancel after one, two, or three years can be explored. A predictive model can be built to determine which clients are prone to cancelling after a certain period.
The unique value is the ability to rank the high-revenue, low-claims-risk clients that are predicted most likely to cancel during the renewal period; the retention strategy would focus on ensuring these clients are taken care of, as sketched below.
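A hedged sketch of this ranking idea, again with placeholder file and column names (client_id, churned, annual_premium, claims_total) rather than actual TAMS fields:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical TAMS export; only numeric columns are used as model features.
clients = pd.read_csv("tams_personal_lines.csv")
features = clients.drop(columns=["client_id", "churned"]).select_dtypes("number")

# Fit a simple churn classifier on historical cancellations (churned = 0/1).
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(features, clients["churned"])

# Rank clients by predicted cancellation probability, weighted by revenue and
# discounted by claims, so the highest-value at-risk clients surface first.
clients["churn_prob"] = model.predict_proba(features)[:, 1]
clients["priority"] = (
    clients["churn_prob"] * clients["annual_premium"] / (1.0 + clients["claims_total"])
)
print(clients.sort_values("priority", ascending=False).head(20))
```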
To further improve the predictive model, I would perform the following in the future:
The Prudential life insurance data set offered a good basis to learn and apply data analytics methodologies using Python, data analysis, visualization and predictive modelling libraries. The goal was to determine a predictive model for risk rating prediction and classification.
Histograms and a scatter plot matrix were created to explore the relationships between risk rating and other data features. The data set consisted of categorical, continuous and discrete datasets that were processed for NaN values and normalized. Lasso linear regression and random forest classifiers were used. Quadratic Weighted Kappa (QWK) was used to evaluate classifier performance as the risk ratings were categorical.
The Lasso linear regression offered the fastest training time and the best QWK value of 0.473, obtained with an alpha value of 0.0099. While the model produces a workable QWK score, it did not perform well overall: the low alpha value suggests overfitting, and the QWK value placed 2300/2619 among all scores in the Kaggle competition, where the top score was 0.679. By reducing the number of features used, the algorithm could be improved and fine-tuned to approach the top score.
As the dataset given was fairly clean and has features similar to my insurance broker's data, the life insurance project forms a good outline of how a predictive modelling project could be applied to the insurance broker's data.
Through managerial interviews and observations of the insurance brokerage, a realistic and feasible data project was determined. The primary goal of the project is to help form the broker's retention strategy. Data would be extracted from the broker's TAMS platform (an insurance brokerage customer management program), from which various managerial reports can be exported as CSV files. The CSV files could be processed to create a data table similar to the Prudential Life Insurance data set. Product lines, the sales persons associated with accounts, and premium start and end dates were determined to be useful features for the retention goals. Descriptive statistics on various customer segments can help in setting retention goals for the brokerage. The most useful intelligence would be to determine which features of the dataset contribute the most to a customer cancelling a policy after one, two, or three years. Predicting the most valuable customers prone to cancelling would form the basis of the retention strategy.
The client has shown interest in performing this project, although they are not convinced it is a feasible venture at this time. They are interested in knowing how a predictive model can aid retention. I will approach the client again in the coming months regarding the project.
Overall, the project was a useful learning tool in applying data analysis using Pandas and machine learning using Sci-kit learn.
I would like to thank Hobson Lane for his mentorship in completing this project, and I appreciate the support, understanding, and patience he has given. In most of the 30-minute mentor sessions, we communicated well and I received a lot of useful information from him about the world of machine learning. His knowledge, expertise, and remarkable intelligence were very helpful in boosting my understanding of the concepts.
I also thank Springboard for organizing quality content to learn "intensively" the fundamentals of data science and machine learning. The capstone project allowed me to learn Python, Pandas, Sci-kit learn and other data analysis fundamentals.