Exploration of Prudential Life Insurance Data

Data retrieved from:

https://www.kaggle.com/c/prudential-life-insurance-assessment

File descriptions:
  • train.csv - the training set, contains the Response values
  • test.csv - the test set, you must predict the Response variable for all rows in this file
  • sample_submission.csv - a sample submission file in the correct format
Data fields:
Variable                 Description
Id                       A unique identifier associated with an application.
Product_Info_1-7         A set of normalized variables relating to the product applied for.
Ins_Age                  Normalized age of applicant.
Ht                       Normalized height of applicant.
Wt                       Normalized weight of applicant.
BMI                      Normalized BMI of applicant.
Employment_Info_1-6      A set of normalized variables relating to the employment history of the applicant.
InsuredInfo_1-7          A set of normalized variables providing information about the applicant.
Insurance_History_1-9    A set of normalized variables relating to the insurance history of the applicant.
Family_Hist_1-5          A set of normalized variables relating to the family history of the applicant.
Medical_History_1-41     A set of normalized variables relating to the medical history of the applicant.
Medical_Keyword_1-48     A set of dummy variables indicating the presence or absence of a medical keyword associated with the application.
Response                 The target variable: an ordinal variable relating to the final decision associated with an application.

The following variables are all categorical (nominal):

Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41

The following variables are continuous:

Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5

The following variables are discrete:

Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32

Medical_Keyword_1-48 are dummy variables.


In [14]:
# Group the variable names into categorical, continuous, discrete,
# and dummy lists, and store the lists in a dictionary

In [ ]:
# The following variables are all categorical (nominal):

In [33]:
s = ["Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41",
    "Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5",
     "Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32"]

varTypes = dict()
varTypes['categorical'] = s[0].split(', ')
varTypes['continuous'] = s[1].split(', ')
varTypes['discrete'] = s[2].split(', ')
# Medical_Keyword_1 through Medical_Keyword_48 are the dummy variables
varTypes['dummy'] = ["Medical_Keyword_" + str(i) for i in range(1, 49)]

In [56]:
#Checking over the variable types
#for i in iter(varTypes['dummy']):
    #print i
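
A quick consistency check on these lists, as a sketch only: the snippet below rebuilds the four groups from the variable lists above so that it runs on its own, then confirms that the groups do not overlap and that their sizes, plus Id and Response, add up to the 128 columns of train.csv reported in the df.head() output. The helper `names` and the dictionary `var_types` are illustrative names and are not part of the notebook.

```python
def names(prefix, numbers):
    """Build column names such as 'Medical_History_2' from a prefix and numbers."""
    return ["{0}_{1}".format(prefix, n) for n in numbers]

# Medical_History columns listed as discrete; the remaining ones are categorical
discrete_mh = [1, 10, 15, 24, 32]

var_types = {
    'categorical': (names('Product_Info', [1, 2, 3, 5, 6, 7])
                    + names('Employment_Info', [2, 3, 5])
                    + names('InsuredInfo', range(1, 8))
                    + names('Insurance_History', [1, 2, 3, 4, 7, 8, 9])
                    + names('Family_Hist', [1])
                    + names('Medical_History',
                            [n for n in range(1, 42) if n not in discrete_mh])),
    'continuous': (['Product_Info_4', 'Ins_Age', 'Ht', 'Wt', 'BMI']
                   + names('Employment_Info', [1, 4, 6])
                   + names('Insurance_History', [5])
                   + names('Family_Hist', [2, 3, 4, 5])),
    'discrete': names('Medical_History', discrete_mh),
    'dummy': names('Medical_Keyword', range(1, 49)),
}

# Size of each group
counts = {kind: len(cols) for kind, cols in var_types.items()}

# Every name should appear in exactly one group
all_names = [col for cols in var_types.values() for col in cols]
has_duplicates = len(set(all_names)) != len(all_names)

# Id and Response are the only columns outside the four groups
total_columns = len(all_names) + 2

print(counts)
print(has_duplicates)
print(total_columns)
```

If the total differs from the column count of the loaded frame, a name is missing from one of the lists or is listed twice.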

In [69]:
%pylab inline
%matplotlib inline
import pandas as pd 
import matplotlib.pyplot as plt


Populating the interactive namespace from numpy and matplotlib

In [6]:
df = pd.read_csv('prud_files/train.csv')

In [7]:
df.head(5)


Out[7]:
Id Product_Info_1 Product_Info_2 Product_Info_3 Product_Info_4 Product_Info_5 Product_Info_6 Product_Info_7 Ins_Age Ht ... Medical_Keyword_40 Medical_Keyword_41 Medical_Keyword_42 Medical_Keyword_43 Medical_Keyword_44 Medical_Keyword_45 Medical_Keyword_46 Medical_Keyword_47 Medical_Keyword_48 Response
0 2 1 D3 10 0.076923 2 1 1 0.641791 0.581818 ... 0 0 0 0 0 0 0 0 0 8
1 5 1 A1 26 0.076923 2 3 1 0.059701 0.600000 ... 0 0 0 0 0 0 0 0 0 4
2 6 1 E1 26 0.076923 2 3 1 0.029851 0.745455 ... 0 0 0 0 0 0 0 0 0 8
3 7 1 D4 10 0.487179 2 3 1 0.164179 0.672727 ... 0 0 0 0 0 0 0 0 0 8
4 8 1 D2 26 0.230769 2 3 1 0.417910 0.654545 ... 0 0 0 0 0 0 0 0 0 8

5 rows × 128 columns


In [37]:
df.describe()


Out[37]:
Id Product_Info_1 Product_Info_3 Product_Info_4 Product_Info_5 Product_Info_6 Product_Info_7 Ins_Age Ht Wt ... Medical_Keyword_40 Medical_Keyword_41 Medical_Keyword_42 Medical_Keyword_43 Medical_Keyword_44 Medical_Keyword_45 Medical_Keyword_46 Medical_Keyword_47 Medical_Keyword_48 Response
count 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 ... 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000
mean 39507.211515 1.026355 24.415655 0.328952 2.006955 2.673599 1.043583 0.405567 0.707283 0.292587 ... 0.056954 0.010054 0.045536 0.010710 0.007528 0.013691 0.008488 0.019905 0.054496 5.636837
std 22815.883089 0.160191 5.072885 0.282562 0.083107 0.739103 0.291949 0.197190 0.074239 0.089037 ... 0.231757 0.099764 0.208479 0.102937 0.086436 0.116207 0.091737 0.139676 0.226995 2.456833
min 2.000000 1.000000 1.000000 0.000000 2.000000 1.000000 1.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 19780.000000 1.000000 26.000000 0.076923 2.000000 3.000000 1.000000 0.238806 0.654545 0.225941 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.000000
50% 39487.000000 1.000000 26.000000 0.230769 2.000000 3.000000 1.000000 0.402985 0.709091 0.288703 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000
75% 59211.000000 1.000000 26.000000 0.487179 2.000000 3.000000 1.000000 0.567164 0.763636 0.345188 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.000000
max 79146.000000 2.000000 38.000000 1.000000 3.000000 3.000000 3.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 8.000000

8 rows × 127 columns


In [70]:
?df.describe()

In [38]:
df.index


Out[38]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')

In [39]:
df.columns


Out[39]:
Index([u'Id', u'Product_Info_1', u'Product_Info_2', u'Product_Info_3', u'Product_Info_4', u'Product_Info_5', u'Product_Info_6', u'Product_Info_7', u'Ins_Age', u'Ht', u'Wt', u'BMI', u'Employment_Info_1', u'Employment_Info_2', u'Employment_Info_3', u'Employment_Info_4', u'Employment_Info_5', u'Employment_Info_6', u'InsuredInfo_1', u'InsuredInfo_2', u'InsuredInfo_3', u'InsuredInfo_4', u'InsuredInfo_5', u'InsuredInfo_6', u'InsuredInfo_7', u'Insurance_History_1', u'Insurance_History_2', u'Insurance_History_3', u'Insurance_History_4', u'Insurance_History_5', u'Insurance_History_7', u'Insurance_History_8', u'Insurance_History_9', u'Family_Hist_1', u'Family_Hist_2', u'Family_Hist_3', u'Family_Hist_4', u'Family_Hist_5', u'Medical_History_1', u'Medical_History_2', u'Medical_History_3', u'Medical_History_4', u'Medical_History_5', u'Medical_History_6', u'Medical_History_7', u'Medical_History_8', u'Medical_History_9', u'Medical_History_10', u'Medical_History_11', u'Medical_History_12', u'Medical_History_13', u'Medical_History_14', u'Medical_History_15', u'Medical_History_16', u'Medical_History_17', u'Medical_History_18', u'Medical_History_19', u'Medical_History_20', u'Medical_History_21', u'Medical_History_22', u'Medical_History_23', u'Medical_History_24', u'Medical_History_25', u'Medical_History_26', u'Medical_History_27', u'Medical_History_28', u'Medical_History_29', u'Medical_History_30', u'Medical_History_31', u'Medical_History_32', u'Medical_History_33', u'Medical_History_34', u'Medical_History_35', u'Medical_History_36', u'Medical_History_37', u'Medical_History_38', u'Medical_History_39', u'Medical_History_40', u'Medical_History_41', u'Medical_Keyword_1', u'Medical_Keyword_2', u'Medical_Keyword_3', u'Medical_Keyword_4', u'Medical_Keyword_5', u'Medical_Keyword_6', u'Medical_Keyword_7', u'Medical_Keyword_8', u'Medical_Keyword_9', u'Medical_Keyword_10', u'Medical_Keyword_11', u'Medical_Keyword_12', u'Medical_Keyword_13', u'Medical_Keyword_14', u'Medical_Keyword_15', 
u'Medical_Keyword_16', u'Medical_Keyword_17', u'Medical_Keyword_18', u'Medical_Keyword_19', u'Medical_Keyword_20', u'Medical_Keyword_21', ...], dtype='object')

In [67]:
df.loc??

In [71]:
#Exploration of the categorical data
catD = df.loc[:,varTypes['categorical']]
catD.describe()


Out[71]:
Product_Info_1 Product_Info_3 Product_Info_5 Product_Info_6 Product_Info_7 Employment_Info_2 Employment_Info_3 Employment_Info_5 InsuredInfo_1 InsuredInfo_2 ... Medical_History_31 Medical_History_33 Medical_History_34 Medical_History_35 Medical_History_36 Medical_History_37 Medical_History_38 Medical_History_39 Medical_History_40 Medical_History_41
count 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 ... 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000
mean 1.026355 24.415655 2.006955 2.673599 1.043583 8.641821 1.300904 2.142958 1.209326 2.007427 ... 2.985265 2.804618 2.689076 1.002055 2.179468 1.938398 1.004850 2.830720 2.967599 1.641064
std 0.160191 5.072885 0.083107 0.739103 0.291949 4.227082 0.715034 0.350033 0.417939 0.085858 ... 0.170989 0.593798 0.724661 0.063806 0.412633 0.240574 0.069474 0.556665 0.252427 0.933361
min 1.000000 1.000000 2.000000 1.000000 1.000000 1.000000 1.000000 2.000000 1.000000 2.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 1.000000 26.000000 2.000000 3.000000 1.000000 9.000000 1.000000 2.000000 1.000000 2.000000 ... 3.000000 3.000000 3.000000 1.000000 2.000000 2.000000 1.000000 3.000000 3.000000 1.000000
50% 1.000000 26.000000 2.000000 3.000000 1.000000 9.000000 1.000000 2.000000 1.000000 2.000000 ... 3.000000 3.000000 3.000000 1.000000 2.000000 2.000000 1.000000 3.000000 3.000000 1.000000
75% 1.000000 26.000000 2.000000 3.000000 1.000000 9.000000 1.000000 2.000000 1.000000 2.000000 ... 3.000000 3.000000 3.000000 1.000000 2.000000 2.000000 1.000000 3.000000 3.000000 3.000000
max 2.000000 38.000000 3.000000 3.000000 3.000000 38.000000 3.000000 3.000000 3.000000 3.000000 ... 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 2.000000 3.000000 3.000000 3.000000

8 rows × 59 columns


In [100]:
catD.head(5)


Out[100]:
Product_Info_1 Product_Info_2 Product_Info_3 Product_Info_5 Product_Info_6 Product_Info_7 Employment_Info_2 Employment_Info_3 Employment_Info_5 InsuredInfo_1 ... Medical_History_31 Medical_History_33 Medical_History_34 Medical_History_35 Medical_History_36 Medical_History_37 Medical_History_38 Medical_History_39 Medical_History_40 Medical_History_41
0 1 D3 10 2 1 1 12 1 3 1 ... 3 1 3 1 2 2 1 3 3 3
1 1 A1 26 2 3 1 1 3 2 1 ... 3 3 1 1 2 2 1 3 3 1
2 1 E1 26 2 3 1 9 1 2 1 ... 3 3 3 1 3 2 1 3 3 1
3 1 D4 10 2 3 1 9 1 3 2 ... 3 3 3 1 2 2 1 3 3 1
4 1 D2 26 2 3 1 9 1 2 1 ... 3 3 3 1 3 2 1 3 3 1

5 rows × 60 columns
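
Note that catD.describe() reports 59 columns while catD.head() shows 60. A likely explanation, assuming default pandas behaviour, is that describe() summarizes only the numeric columns of a mixed-dtype frame, so the string-coded Product_Info_2 is left out. The sketch below reproduces this on a small frame copied from the first five rows of the catD.head() output; `sample` and `numeric_summary` are hypothetical names.

```python
import pandas as pd

# First five rows of three categorical columns, copied from the catD.head() output
sample = pd.DataFrame({
    'Product_Info_1': [1, 1, 1, 1, 1],
    'Product_Info_2': ['D3', 'A1', 'E1', 'D4', 'D2'],
    'Product_Info_3': [10, 26, 26, 10, 26],
})

# With mixed dtypes, describe() keeps only the numeric columns by default
numeric_summary = sample.describe()
print(list(numeric_summary.columns))

# For nominal codes, per-category counts say more than a mean or a std
for column in sample.columns:
    print(column)
    print(sample[column].value_counts())
```

The mean and standard deviation of a nominal code carry little meaning, which is why the per-category counts are the more useful summary here.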


In [ ]:
#Bar chart of the number of rows in each category of one column
key = varTypes['categorical'][0]
catD.groupby(key).size().plot(kind='bar')

In [113]:
#Iterate through each categorical column of data
#Perform a 2D histogram later
for key in varTypes['categorical']:

    #Select the data in each key iteration and count the rows per category
    d = catD.loc[:,key]
    counts = d.value_counts().sort_index()

    #One figure per column; plt.subplots(1, 2) returns a 1-D array of two axes
    fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (12, 4))

    #Left panel: frequency of each category
    ax = axes[0]
    counts.plot(kind='bar', alpha=0.5, ax = ax)
    ax.set_title('Histogram: ' + str(key))
    ax.set_xlabel('Category: '+str(key))
    ax.set_ylabel('Frequency')

    #Right panel: cumulative frequency over the sorted categories
    ax = axes[1]
    counts.cumsum().plot(kind='bar', alpha=0.5, ax = ax)
    ax.set_title('Cumulative Histogram: ' + str(key))
    ax.set_xlabel('Category: '+str(key))
    ax.set_ylabel('Cumulative frequency')

    #Stop after the first column while testing the layout
    break

In [ ]:
#Print the category counts for every categorical column
for col in catD.columns:
    print(col)
    print(catD[col].value_counts())

In [72]:
#fig, axes = plt.subplots(1, 2, figsize=(12,4))
#fig.tight_layout()
num_bins = 50

plt.hist(catD['Product_Info_1'], num_bins, facecolor='green',alpha=0.5)


Out[72]:
(array([ 57816.,      0.,      0.,      0.,      0.,      0.,      0.,
             0.,      0.,   1565.]),
 array([ 1. ,  1.1,  1.2,  1.3,  1.4,  1.5,  1.6,  1.7,  1.8,  1.9,  2. ]),
 <a list of 10 Patch objects>)
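
Because Product_Info_1 takes only the values 1 and 2, every bin except the first and the last is empty. For a nominal variable, counting rows per category is usually clearer than binning. A minimal sketch, using a synthetic list rebuilt from the two non-empty bin counts above rather than the real column (`values`, `category_counts` and `shares` are illustrative names):

```python
from collections import Counter

# Synthetic stand-in for catD['Product_Info_1'], rebuilt from the bin counts in Out[72]
values = [1] * 57816 + [2] * 1565

# Count the rows in each category directly instead of binning them
category_counts = Counter(values)

# Share of rows per category, as a fraction of all rows
shares = {category: float(n) / len(values)
          for category, n in category_counts.items()}

for category in sorted(category_counts):
    print("{0}: {1} rows ({2:.1%})".format(
        category, category_counts[category], shares[category]))
```

On the real column, catD['Product_Info_1'].value_counts() gives the same per-category counts.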

In [ ]:
disD = df.loc[:,varTypes['discrete']]
contD = df.loc[:,varTypes['continuous']]
respD = df.loc[:,['Id','Response']]
