Data Story Project: How can insurance companies retain $\$$100,000, make $\$$1M, and only spend $\$$10,000 to do it?

  1. Spend $\$$1 on BI for every $\$$10 you expect from it.
  2. 80% of your revenue is generated by 20% of your customers and reps.
  3. Retain the top 20% of the 80% of your first and second year renewal clients.
  4. Transfer the skills of the top 20% of your reps to the next 20%. Scrap the bottom 10%.
  5. Dominate an industry with 80% market share.

These are the rules or a set of hypotheses to find statistical significance in business strategy. My favourite law is Pareto's principle of 80/20.

TODO: Write a story on this and post on website... data story needs further work...

Summary of life insurance data set

Data retrieved from:

https://www.kaggle.com/c/prudential-life-insurance-assessment

File descriptions:

  • train.csv - the training set, contains the Response values
  • test.csv - the test set, you must predict the Response variable for all rows in this file
  • sample_submission.csv - a sample submission file in the correct format

Data fields:

Variable Description
Id A unique identifier associated with an application.
Product_Info_1-7 A set of normalized variables relating to the product applied for
Ins_Age Normalized age of applicant
Ht Normalized height of applicant
Wt Normalized weight of applicant
BMI Normalized BMI of applicant
Employment_Info_1-6 A set of normalized variables relating to the employment history of the applicant.
InsuredInfo_1-6 A set of normalized variables providing information about the applicant.
Insurance_History_1-9 A set of normalized variables relating to the insurance history of the applicant.
Family_Hist_1-5 A set of normalized variables relating to the family history of the applicant.
Medical_History_1-41 A set of normalized variables relating to the medical history of the applicant.
Medical_Keyword_1-48 A set of dummy variables relating to the presence of/absence of a medical keyword being associated with the application.
Response This is the target variable, an ordinal variable relating to the final decision associated with an application

Categorical (nominal) data fields

Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41

Continuous data fields

Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5

Discrete data fields

  • Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32
  • Medical_Keyword_1-48 are dummy variables.
  1. Can you count something interesting?

Descriptive statistics and Histogram plots for Response data set.

Name: Response, dtype: float64, CATEGORICAL

count 59381.000000 mean 5.636837 std 2.456833 min 1.000000 25% 4.000000 50% 6.000000 75% 8.000000 max 8.000000

Name: Ins_Age, dtype: float64, CONTINUOUS

count 59381.000000 mean 0.405567 std 0.197190 min 0.000000 25% 0.238806 50% 0.402985 75% 0.567164 max 1.000000

Name: BMI, dtype: float64, CONTINUOUS

count 59381.000000 mean 0.469462 std 0.122213 min 0.000000 25% 0.385517 50% 0.451349 75% 0.532858 max 1.000000

Name: Wt, dtype: float64, CONTINUOUS

count 59381.000000 mean 0.292587 std 0.089037 min 0.000000 25% 0.225941 50% 0.288703 75% 0.345188 max 1.000000

Descriptive statistics and Histogram plots for Product_Info_1-7 data set.

Name: Product_Info_1, dtype: float64, CATEGORICAL


In [3]:
for i in range(1,8):
    print "![Product_Info_"+str(i)+"](images/hist_Product_Info_"+str(i)+".png \"Product_Info_"+str(i)+"\")"


![Product_Info_1](images/hist_Product_Info_1.png "Product_Info_1")
![Product_Info_2](images/hist_Product_Info_2.png "Product_Info_2")
![Product_Info_3](images/hist_Product_Info_3.png "Product_Info_3")
![Product_Info_4](images/hist_Product_Info_4.png "Product_Info_4")
![Product_Info_5](images/hist_Product_Info_5.png "Product_Info_5")
![Product_Info_6](images/hist_Product_Info_6.png "Product_Info_6")
![Product_Info_7](images/hist_Product_Info_7.png "Product_Info_7")
* count 59381.000000 * mean 1.026355 * std 0.160191 * min 1.000000 * 25% 1.000000 * 50% 1.000000 * 75% 1.000000 * max 2.000000

Name: Product_Info_2, dtype: object, CATEGORICAL

count 59381 unique 19 top D3 freq 14321 * Response * 1 count 6207 * unique 19 * top D3 * freq 1440 * 2 count 6552 * unique 15 * top D3 * freq 1675 * 3 count 1013 * unique 19 * top D3 * freq 237 * 4 count 1428 * unique 19 * top D3 * freq 420 * 5 count 5432 * unique 19 * top D3 * freq 1256 * 6 count 11233 * unique 19 top D3 freq 3281 7 count 8027 unique 19 top D3 freq 2080 8 count 19489 unique 19 top D4 freq 5148 dtype: object

Name: Product_Info_3, dtype: float64, CATEGORICAL

count 59381.000000 mean 24.415655 std 5.072885 min 1.000000 25% 26.000000 50% 26.000000 75% 26.000000 max 38.000000

Name: Product_Info_4, dtype: float64, CONTINUOUS


In [ ]:

Name: Product_Info_5, dtype: float64, CATEGORICAL

count 59381.000000 mean 2.006955 std 0.083107 min 2.000000 25% 2.000000 50% 2.000000 75% 2.000000 max 3.000000

Name: Product_Info_6, dtype: float64, CATEGORICAL

count 59381.000000 mean 2.673599 std 0.739103 min 1.000000 25% 3.000000 50% 3.000000 75% 3.000000 max 3.000000

Name: Product_Info_7, dtype: float64, CATEGORICAL

count 59381.000000 mean 1.043583 std 0.291949 min 1.000000 25% 1.000000 50% 1.000000 75% 1.000000 max 3.000000

Categorical types for each column

Product_Info_1 : ['1, ', '2, '] Product_Info_2 : ['C2, ', 'B2, ', 'E1, ', 'A4, ', 'B1, ', 'A1, ', 'A3, ', 'A2, ', 'A5, ', 'C1, ', 'A7, ', 'A6, ', 'C3, ', 'A8, ', 'D4, ', 'C4, ', 'D2, ', 'D3, ', 'D1, '] Product_Info_3 : ['1, ', '2, ', '3, ', '4, ', '5, ', '6, ', '8, ', '9, ', '10, ', '11, ', '12, ', '13, ', '15, ', '16, ', '17, ', '18, ', '19, ', '20, ', '21, ', '22, ', '23, ', '24, ', '26, ', '27, ', '28, ', '29, ', '30, ', '31, ', '32, ', '33, ', '34, ', '36, ', '37, ', '38, '] Product_Info_5 : ['2, ', '3, '] Product_Info_6 : ['1, ', '3, '] Product_Info_7 : ['1, ', '2, ', '3, '] Employment_Info_2 : ['1, ', '2, ', '3, ', '4, ', '5, ', '6, ', '7, ', '9, ', '10, ', '11, ', '12, ', '13, ', '14, ', '15, ', '16, ', '17, ', '18, ', '19, ', '20, ', '21, ', '22, ', '23, ', '25, ', '26, ', '27, ', '28, ', '29, ', '30, ', '31, ', '32, ', '33, ', '34, ', '35, ', '36, ', '37, ', '38, '] Employment_Info_3 : ['1, ', '3, '] Employment_Info_5 : ['2, ', '3, '] InsuredInfo_1 : ['1, ', '2, ', '3, '] InsuredInfo_2 : ['2, ', '3, '] InsuredInfo_3 : ['1, ', '2, ', '3, ', '4, ', '5, ', '6, ', '7, ', '8, ', '9, ', '10, ', '11, '] InsuredInfo_4 : ['2, ', '3, '] InsuredInfo_5 : ['1, ', '3, '] InsuredInfo_6 : ['1, ', '2, '] InsuredInfo_7 : ['1, ', '3, '] Insurance_History_1 : ['1, ', '2, '] Insurance_History_2 : ['1, ', '2, ', '3, '] Insurance_History_3 : ['1, ', '2, ', '3, '] Insurance_History_4 : ['1, ', '2, ', '3, '] Insurance_History_7 : ['1, ', '2, ', '3, '] Insurance_History_8 : ['1, ', '2, ', '3, '] Insurance_History_9 : ['1, ', '2, ', '3, '] Family_Hist_1 : ['1, ', '2, ', '3, '] Medical_History_2 : ['1, ', '2, ', '3, ', '5, ', '6, ', '7, ', '8, ', '9, ', '10, ', '12, ', '13, ', '14, ', '15, ', '16, ', '17, ', '18, ', '19, ', '20, ', '21, ', '22, ', '23, ', '24, ', '25, ', '26, ', '27, ', '28, ', '29, ', '30, ', '32, ', '33, ', '34, ', '35, ', '36, ', '37, ', '38, ', '39, ', '40, ', '41, ', '42, ', '43, ', '44, ', '45, ', '46, ', '47, ', '48, ', '50, ', '51, ', '52, ', '53, ', '54, ', '55, ', '56, ', '57, ', '58, ', '59, ', '60, ', '61, ', '62, ', '63, ', '64, ', '66, ', '67, ', '68, ', '69, ', '70, ', '71, ', '72, ', '73, ', '74, ', '75, ', '76, ', '77, ', '78, ', '79, ', '81, ', '82, ', '84, ', '85, ', '86, ', '87, ', '88, ', '89, ', '90, ', '91, ', '93, ', '94, ', '95, ', '96, ', '97, ', '98, ', '99, ', '100, ', '101, ', '102, ', '104, ', '105, ', '106, ', '107, ', '108, ', '109, ', '110, ', '111, ', '112, ', '113, ', '114, ', '115, ', '116, ', '117, ', '120, ', '121, ', '122, ', '123, ', '124, ', '125, ', '127, ', '128, ', '129, ', '131, ', '132, ', '133, ', '134, ', '135, ', '136, ', '137, ', '138, ', '139, ', '140, ', '141, ', '142, ', '143, ', '144, ', '145, ', '146, ', '147, ', '148, ', '149, ', '150, ', '151, ', '152, ', '153, ', '154, ', '155, ', '156, ', '157, ', '158, ', '159, ', '160, ', '161, ', '162, ', '163, ', '164, ', '165, ', '166, ', '167, ', '169, ', '170, ', '171, ', '172, ', '173, ', '174, ', '175, ', '177, ', '179, ', '180, ', '181, ', '182, ', '183, ', '184, ', '185, ', '186, ', '187, ', '188, ', '189, ', '190, ', '191, ', '192, ', '193, ', '195, ', '196, ', '197, ', '198, ', '199, ', '200, ', '201, ', '202, ', '203, ', '204, ', '205, ', '207, ', '208, ', '209, ', '210, ', '212, ', '213, ', '214, ', '215, ', '216, ', '217, ', '218, ', '219, ', '220, ', '221, ', '222, ', '223, ', '224, ', '225, ', '226, ', '227, ', '228, ', '229, ', '230, ', '231, ', '232, ', '233, ', '234, ', '235, ', '236, ', '238, ', '239, ', '240, ', '241, ', '242, ', '243, ', '245, ', '247, ', '248, ', '249, ', '250, ', '251, ', '252, ', '253, ', '255, ', '256, ', '257, ', '258, ', '259, ', '260, ', '261, ', '262, ', '264, ', '265, ', '266, ', '267, ', '268, ', '270, ', '271, ', '272, ', '273, ', '274, ', '275, ', '276, ', '277, ', '278, ', '279, ', '280, ', '281, ', '282, ', '283, ', '285, ', '286, ', '287, ', '288, ', '289, ', '290, ', '291, ', '293, ', '294, ', '295, ', '296, ', '297, ', '298, ', '299, ', '301, ', '302, ', '303, ', '305, ', '306, ', '307, ', '310, ', '311, ', '313, ', '314, ', '315, ', '316, ', '317, ', '318, ', '319, ', '320, ', '321, ', '322, ', '323, ', '324, ', '326, ', '327, ', '328, ', '329, ', '330, ', '331, ', '332, ', '333, ', '334, ', '335, ', '336, ', '337, ', '338, ', '343, ', '344, ', '345, ', '346, ', '347, ', '348, ', '349, ', '350, ', '351, ', '352, ', '353, ', '354, ', '355, ', '357, ', '358, ', '360, ', '361, ', '362, ', '363, ', '364, ', '366, ', '368, ', '369, ', '370, ', '371, ', '372, ', '373, ', '374, ', '375, ', '376, ', '377, ', '378, ', '379, ', '380, ', '381, ', '382, ', '383, ', '384, ', '385, ', '386, ', '387, ', '388, ', '389, ', '390, ', '391, ', '392, ', '393, ', '394, ', '395, ', '396, ', '397, ', '398, ', '399, ', '400, ', '403, ', '404, ', '405, ', '406, ', '407, ', '408, ', '409, ', '410, ', '411, ', '412, ', '413, ', '414, ', '415, ', '416, ', '417, ', '418, ', '419, ', '420, ', '421, ', '422, ', '426, ', '427, ', '428, ', '430, ', '431, ', '432, ', '433, ', '434, ', '435, ', '436, ', '437, ', '438, ', '439, ', '440, ', '441, ', '443, ', '444, ', '446, ', '447, ', '448, ', '449, ', '451, ', '452, ', '453, ', '455, ', '456, ', '457, ', '458, ', '459, ', '461, ', '462, ', '464, ', '465, ', '466, ', '467, ', '468, ', '469, ', '470, ', '471, ', '472, ', '473, ', '474, ', '475, ', '476, ', '477, ', '478, ', '479, ', '480, ', '481, ', '482, ', '483, ', '484, ', '486, ', '487, ', '488, ', '489, ', '490, ', '491, ', '492, ', '493, ', '494, ', '495, ', '496, ', '497, ', '498, ', '499, ', '501, ', '502, ', '503, ', '504, ', '505, ', '506, ', '507, ', '509, ', '510, ', '511, ', '512, ', '513, ', '514, ', '515, ', '516, ', '517, ', '518, ', '519, ', '520, ', '522, ', '523, ', '524, ', '525, ', '526, ', '527, ', '528, ', '529, ', '530, ', '531, ', '532, ', '533, ', '534, ', '536, ', '537, ', '538, ', '540, ', '541, ', '542, ', '543, ', '544, ', '545, ', '546, ', '548, ', '549, ', '550, ', '551, ', '552, ', '553, ', '554, ', '557, ', '558, ', '559, ', '560, ', '561, ', '562, ', '563, ', '564, ', '565, ', '566, ', '567, ', '568, ', '569, ', '570, ', '571, ', '572, ', '573, ', '575, ', '576, ', '577, ', '578, ', '579, ', '580, ', '581, ', '582, ', '583, ', '584, ', '586, ', '587, ', '588, ', '589, ', '590, ', '591, ', '592, ', '593, ', '595, ', '596, ', '598, ', '599, ', '600, ', '601, ', '602, ', '603, ', '605, ', '606, ', '607, ', '608, ', '609, ', '610, ', '611, ', '613, ', '614, ', '615, ', '616, ', '617, ', '618, ', '619, ', '620, ', '621, ', '622, ', '623, ', '624, ', '626, ', '627, ', '628, ', '629, ', '630, ', '631, ', '632, ', '633, ', '634, ', '635, ', '636, ', '637, ', '638, ', '639, ', '640, ', '641, ', '642, ', '643, ', '644, ', '645, ', '646, ', '647, ', '648, '] Medical_History_3 : ['1, ', '2, ', '3, '] Medical_History_4 : ['1, ', '2, '] Medical_History_5 : ['1, ', '2, ', '3, '] Medical_History_6 : ['1, ', '2, ', '3, '] Medical_History_7 : ['1, ', '2, ', '3, '] Medical_History_8 : ['1, ', '2, ', '3, '] Medical_History_9 : ['1, ', '2, ', '3, '] Medical_History_11 : ['1, ', '2, ', '3, '] Medical_History_12 : ['1, ', '2, ', '3, '] Medical_History_13 : ['1, ', '2, ', '3, '] Medical_History_14 : ['1, ', '2, ', '3, '] Medical_History_16 : ['1, ', '2, ', '3, '] Medical_History_17 : ['1, ', '2, ', '3, '] Medical_History_18 : ['1, ', '2, ', '3, '] Medical_History_19 : ['1, ', '2, ', '3, '] Medical_History_20 : ['1, ', '2, ', '3, '] Medical_History_21 : ['1, ', '2, ', '3, '] Medical_History_22 : ['1, ', '2, '] Medical_History_23 : ['1, ', '2, ', '3, '] Medical_History_25 : ['1, ', '2, ', '3, '] Medical_History_26 : ['1, ', '2, ', '3, '] Medical_History_27 : ['1, ', '2, ', '3, '] Medical_History_28 : ['1, ', '2, ', '3, '] Medical_History_29 : ['1, ', '2, ', '3, '] Medical_History_30 : ['1, ', '2, ', '3, '] Medical_History_31 : ['1, ', '2, ', '3, '] Medical_History_33 : ['1, ', '3, '] Medical_History_34 : ['1, ', '2, ', '3, '] Medical_History_35 : ['1, ', '2, ', '3, '] Medical_History_36 : ['1, ', '2, ', '3, '] Medical_History_37 : ['1, ', '2, ', '3, '] Medical_History_38 : ['1, ', '2, '] Medical_History_39 : ['1, ', '2, ', '3, '] Medical_History_40 : ['1, ', '2, ', '3, '] Medical_History_41 : ['1, ', '2, ', '3, '] Response : ['1, ', '2, ', '3, ', '4, ', '5, ', '6, ', '7, ', '8, ']

Columns with NaN values and their counts.

The following had NaN values and they were converted to -1 and renormalized Minimum and max values were checked.

Employment_Info_1 19 Employment_Info_4 6779 Employment_Info_6 10854 Insurance_History_5 25396 Family_Hist_2 28656 Family_Hist_3 34241 Family_Hist_4 19184 Family_Hist_5 41811 Medical_History_1 8889 Medical_History_10 58824 Medical_History_15 44596 Medical_History_24 55580 Medical_History_32 58274

In [ ]:
nullList = ['Family_Hist_4',
 'Medical_History_1',
 'Medical_History_10',
 'Medical_History_15',
 'Medical_History_24',
 'Medical_History_32']

The following columns had values that were not equal to 1 and NaN values.

Family_Hist_4 Medical_History_1 Medical_History_10 Medical_History_15 Medical_History_24 Medical_History_32
max 0.943662 240 240 240 240 240
  1. Can you find some trends (high, low, increase, decrease, anomalies)?
  2. Can you make a bar plot or a histogram?
  1. Can you compare two related quantities?
  2. Can you make a scatterplot?
  3. Can you make a time-series plot?

In [ ]: