Exploration of Prudential Life Insurance Data

Data retrieved from:

https://www.kaggle.com/c/prudential-life-insurance-assessment

File descriptions:
  • train.csv - the training set, contains the Response values
  • test.csv - the test set, you must predict the Response variable for all rows in this file
  • sample_submission.csv - a sample submission file in the correct format
Data fields:
  • Id - a unique identifier associated with an application
  • Product_Info_1-7 - a set of normalized variables relating to the product applied for
  • Ins_Age - normalized age of applicant
  • Ht - normalized height of applicant
  • Wt - normalized weight of applicant
  • BMI - normalized BMI of applicant
  • Employment_Info_1-6 - a set of normalized variables relating to the employment history of the applicant
  • InsuredInfo_1-7 - a set of normalized variables providing information about the applicant
  • Insurance_History_1-9 - a set of normalized variables relating to the insurance history of the applicant
  • Family_Hist_1-5 - a set of normalized variables relating to the family history of the applicant
  • Medical_History_1-41 - a set of normalized variables relating to the medical history of the applicant
  • Medical_Keyword_1-48 - a set of dummy variables relating to the presence or absence of a medical keyword being associated with the application
  • Response - the target variable, an ordinal variable relating to the final decision associated with an application

The following variables are all categorical (nominal):

Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41

The following variables are continuous:

Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5

The following variables are discrete:

Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32

Medical_Keyword_1-48 are dummy variables.

My thoughts are as follows:

The main dependent variable is the risk Response (1-8). Which variables are correlated with the response, and how do I perform correlation analysis between variables?
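One first pass, once the training frame d is loaded below: Spearman rank correlation suits an ordinal target like Response better than Pearson. A minimal sketch (the column list is illustrative, not exhaustive):


In [ ]:
# Sketch: Spearman rank correlation of a few continuous variables with the
# ordinal Response (assumes the training frame d imported further below)
cont_cols = ["Product_Info_4", "Ins_Age", "Ht", "Wt", "BMI"]
corr = d[cont_cols + ["Response"]].corr(method='spearman')
print corr["Response"]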

Import libraries


In [2]:
# Importing libraries

%pylab inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from sklearn import preprocessing


Populating the interactive namespace from numpy and matplotlib

In [3]:
# Build a dictionary mapping variable type (categorical, continuous,
# discrete, dummy) to the corresponding column names

Separation of columns into categorical, continuous, and discrete


In [4]:
s = ["Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41",
    "Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5",
     "Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32"]
 

varTypes = dict()

varTypes['categorical'] = s[0].split(', ')
varTypes['continuous'] = s[1].split(', ')
varTypes['discrete'] = s[2].split(', ')

# Medical_Keyword_1-48 are 0/1 dummy columns
varTypes['dummy'] = ["Medical_Keyword_"+str(i) for i in range(1,49)]

In [5]:
#Prints out each of the variable types as a check
#for i in iter(varTypes['dummy']):
    #print i

Importing life insurance data set


In [6]:
#Import training data 
d_raw = pd.read_csv('prud_files/train.csv')
d = d_raw.copy()

In [9]:
len(d.columns)


Out[9]:
128

Pre-processing raw dataset for NaN values


In [181]:
# Get all the columns that have NaNs
d = d_raw.copy()
a = pd.isnull(d).sum()
nullColumns = a[a>0].index.values

#for c in nullColumns:
    #d[c].fillna(-1)

#Determine the min and max values for the NaN columns
#(rows 3 and 7 of describe() are the 'min' and 'max' rows)
a = pd.DataFrame(d, columns=nullColumns).describe()
a_min = a[3:4]
a_max = a[7:8]


Out[181]:
array(['Employment_Info_1', 'Employment_Info_4', 'Employment_Info_6',
       'Insurance_History_5', 'Family_Hist_2', 'Family_Hist_3',
       'Family_Hist_4', 'Family_Hist_5', 'Medical_History_1',
       'Medical_History_10', 'Medical_History_15', 'Medical_History_24',
       'Medical_History_32'], dtype=object)
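To gauge how severe the missingness is, the counts can be turned into per-column fractions. A minimal sketch:


In [ ]:
# Sketch: fraction of NaNs per affected column
null_frac = pd.isnull(d).mean()
print null_frac[null_frac > 0]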

In [175]:
nullList = ['Family_Hist_4',
 'Medical_History_1',
 'Medical_History_10',
 'Medical_History_15',
 'Medical_History_24',
 'Medical_History_32']

pd.DataFrame(a_max, columns=nullList)


Out[175]:
Family_Hist_4 Medical_History_1 Medical_History_10 Medical_History_15 Medical_History_24 Medical_History_32
max 0.943662 240 240 240 240 240

In [303]:
# Convert all NaNs to -1 and sum up all medical keywords across columns

df = d.fillna(-1)
b = pd.DataFrame(df[varTypes["dummy"]].sum(axis=1), columns=["Medical_Keyword_Sum"])
df= pd.concat([df,b], axis=1, join='outer')

In [334]:
df


Out[334]:
Id Product_Info_1 Product_Info_2 Product_Info_3 Product_Info_4 Product_Info_5 Product_Info_6 Product_Info_7 Ins_Age Ht ... Medical_Keyword_47 Medical_Keyword_48 Response Medical_Keyword_Sum
0 2 1 D3 10 0.076923 2 1 1 0.641791 0.581818 ... 0 0 8 0
1 5 1 A1 26 0.076923 2 3 1 0.059701 0.600000 ... 0 0 4 0
2 6 1 E1 26 0.076923 2 3 1 0.029851 0.745455 ... 0 0 8 0
3 7 1 D4 10 0.487179 2 3 1 0.164179 0.672727 ... 0 0 8 1
4 8 1 D2 26 0.230769 2 3 1 0.417910 0.654545 ... 0 0 8 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

59381 rows × 129 columns


In [ ]:

Create or import the test data set


In [328]:
#Toggle split-train-to-test on or off.

#If on, the first 10% of each Response group is held out as a test set
#If off, the test set is loaded from file (note: prud_files/test.csv has
#no Response column)

splitTrainToTest = True

if(splitTrainToTest):
    
    d_gb = df.groupby("Response")
    
    df_test = pd.DataFrame()
    
    for name, group in d_gb:
        df_test = pd.concat([df_test, group[:len(group)/10]], axis=0, join='outer')
    print "test data is 10% training data"
    
else:
    d_test = pd.read_csv('prud_files/test.csv')
    df_test = d_test.fillna(-1)
    # sum the dummy columns of the test frame itself
    b = pd.DataFrame(df_test[varTypes["dummy"]].sum(axis=1), columns=["Medical_Keyword_Sum"])
    df_test= pd.concat([df_test,b], axis=1, join='outer')
    print "test data is prud_files/test.csv"


test data is 10% training data
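The slice above takes the first 10% of each Response group, which preserves class balance but is not randomized. A randomized stratified holdout is an alternative; a sketch using scikit-learn (sklearn.cross_validation in this vintage, moved to model_selection in later versions; df_train_alt/df_test_alt are illustrative names):


In [ ]:
# Sketch: randomized stratified 10% holdout
from sklearn.cross_validation import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(df["Response"].values, n_iter=1,
                             test_size=0.1, random_state=0)
for train_idx, test_idx in sss:
    df_train_alt, df_test_alt = df.iloc[train_idx], df.iloc[test_idx]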

Data transformation and extraction

Data groupings


In [275]:
df_cat = df[["Id","Response"]+varTypes["categorical"]]
df_disc = df[["Id","Response"]+varTypes["discrete"]]
df_cont = df[["Id","Response"]+varTypes["continuous"]]
df_dummy = df[["Id","Response"]+varTypes["dummy"]]

df_cat_test = df_test[["Id","Response"]+varTypes["categorical"]]
df_disc_test = df_test[["Id","Response"]+varTypes["discrete"]]
df_cont_test = df_test[["Id","Response"]+varTypes["continuous"]]
df_dummy_test = df_test[["Id","Response"]+varTypes["dummy"]]

In [355]:
## Assemble modeling frames: response, keyword sum, and all predictor columns

df_n = df[["Response", "Medical_Keyword_Sum"]+varTypes["categorical"]+varTypes["discrete"]+varTypes["continuous"]].copy()
df_test_n = df_test[["Response","Medical_Keyword_Sum"]+varTypes["categorical"]+varTypes["discrete"]+varTypes["continuous"]].copy()

In [356]:
#Get all the Product Info 2 categories
a = pd.get_dummies(df["Product_Info_2"]).columns.tolist()

norm_PI2_dict = dict()

#Create an enumerated (1-based) dictionary of Product Info 2 categories
for i, c in enumerate(a, start=1):
    norm_PI2_dict.update({c:i})

print norm_PI2_dict

df_n = df_n.replace(to_replace={'Product_Info_2':norm_PI2_dict})
df_test_n = df_test_n.replace(to_replace={'Product_Info_2':norm_PI2_dict})

df_n


{'B2': 10, 'D1': 15, 'E1': 19, 'D3': 17, 'A1': 1, 'D4': 18, 'A3': 3, 'A2': 2, 'A5': 5, 'A4': 4, 'A7': 7, 'A6': 6, 'C3': 13, 'A8': 8, 'C1': 11, 'C4': 14, 'D2': 16, 'C2': 12, 'B1': 9}
Out[356]:
Response Medical_Keyword_Sum Product_Info_1 Product_Info_2 Product_Info_3 Product_Info_5 Product_Info_6 Product_Info_7 Employment_Info_2 Employment_Info_3 ... Family_Hist_2 Family_Hist_3 Family_Hist_4 Family_Hist_5
0 8 0 1 17 10 2 1 1 12 1 ... -1.000000 0.598039 -1.000000 0.526786
1 4 0 1 1 26 2 3 1 1 3 ... 0.188406 -1.000000 0.084507 -1.000000
2 8 0 1 19 26 2 3 1 9 1 ... 0.304348 -1.000000 0.225352 -1.000000
3 8 1 1 18 10 2 3 1 9 1 ... 0.420290 -1.000000 0.352113 -1.000000
4 8 0 1 16 26 2 3 1 9 1 ... 0.463768 -1.000000 0.408451 -1.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

59381 rows × 80 columns
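The same encoding can be had with less bookkeeping from scikit-learn's LabelEncoder; note it numbers classes from 0 rather than 1. A sketch (Product_Info_2_le is an illustrative column name):


In [ ]:
# Sketch: LabelEncoder alternative to the hand-built norm_PI2_dict
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df_n["Product_Info_2_le"] = le.fit_transform(df["Product_Info_2"])
print dict(zip(le.classes_, le.transform(le.classes_)))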

Categorical normalization


In [359]:
# min-max normalizes a single dataframe column and returns the result
def normalize_df(d):
    min_max_scaler = preprocessing.MinMaxScaler()
    x = d.values.astype(np.float)
    return pd.DataFrame(min_max_scaler.fit_transform(x))


# t = categorical, discrete, continuous
# min-max normalizes every column of type t in d and returns one frame
# of "n"-prefixed columns
def normalize_cols(d, t = "categorical"):
    out = pd.DataFrame(index=d.index)
    for x in varTypes[t]:
            try:
                a = pd.DataFrame(normalize_df(d[x]))
                a.columns=[str("n"+x)]
                a.index = d.index
                out = pd.concat([out, a], axis=1, join='outer')
            except Exception as e:
                print e.args
                print "Error on "+str(x)+" w error: "+str(e)

    return out


def normalize_cat(d_cat):
    return normalize_cols(d_cat, "categorical")


def normalize_disc(d_disc):
    return normalize_cols(d_disc, "discrete")


def normalize_response(d):
    
    a = pd.DataFrame(normalize_df(d["Response"]))
    a.columns=["nResponse"]
    
    return a

In [15]:
df_n_2 = df_n.copy()
df_n_test_2 = df_test_n.copy()

df_n_2 = df_n_2[["Response"]+varTypes["categorical"]+varTypes["discrete"]]


df_n_test_2 = df_n_test_2[["Response"]+varTypes["categorical"]+varTypes["discrete"]]

df_n_2 = df_n_2.apply(normalize_df, axis=1)
df_n_test_2 = df_n_test_2.apply(normalize_df, axis=1)


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-88ac19475656> in <module>()
      7 df_n_test_2 = df_n_test_2[["Response"]+varTypes["categorical"]+varTypes["discrete"]]
      8 
----> 9 df_n_2 = df_n_2.apply(normalize_df, axis=1)
     10 df_n_test_2 = df_n_test_2.apply(normalize_df, axis=1)

(... pandas internals elided ...)

TypeError: 'numpy.int64' object is not iterable
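The TypeError comes from apply with axis=1: normalize_df receives one row at a time and returns a DataFrame per row, which pandas cannot reassemble into a frame. Scaling column-wise over the whole selection avoids this; a sketch, fitting the scaler on the training frame and reusing it on the test frame:


In [ ]:
# Sketch: column-wise min-max scaling of the selected columns
cols = ["Response"] + varTypes["categorical"] + varTypes["discrete"]
scaler = preprocessing.MinMaxScaler()
df_n_2 = pd.DataFrame(scaler.fit_transform(df_n[cols].values.astype(np.float)),
                      columns=cols, index=df_n.index)
# reuse the fitted scaler so train and test share the same min/max
df_n_test_2 = pd.DataFrame(scaler.transform(df_test_n[cols].values.astype(np.float)),
                           columns=cols, index=df_test_n.index)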

In [382]:
df_n_3 = pd.concat([df["Id"],df_n["Medical_Keyword_Sum"],df_n_2, df_n[varTypes["continuous"]]],axis=1,join='outer')
df_n_test_3 = pd.concat([df_test["Id"],df_test_n["Medical_Keyword_Sum"],df_n_test_2, df_test_n[varTypes["continuous"]]],axis=1,join='outer')

In [ ]:
train_data = df_n_3.values
test_data = df_n_test_3.values
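The classifier cells below refer to X_train, Y_train, X_test and Y_test, which this notebook never defines; a minimal sketch deriving them by column name (safer than positional slicing, which depends on the concat order above):


In [ ]:
# Sketch: feature/target arrays; everything except Id and Response is a feature
feature_cols = [c for c in df_n_3.columns if c not in ("Id", "Response")]
X_train = df_n_3[feature_cols].values
Y_train = df_n_3["Response"].values
X_test = df_n_test_3[feature_cols].values
Y_test = df_n_test_3["Response"].values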

In [ ]:
from sklearn import linear_model
from sklearn.metrics import accuracy_score

clf = linear_model.Lasso(alpha = 0.1)
clf.fit(X_train, Y_train)
pred = clf.predict(X_test)

# Lasso is a regressor: round and clip its continuous output to the
# 1-8 ordinal scale before computing a classification accuracy
print accuracy_score(np.clip(np.round(pred), 1, 8).astype(int), Y_test)
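Accuracy treats the eight ratings as unordered labels; the competition itself scores quadratic weighted kappa, which penalizes a prediction by its squared distance from the true rating. A self-contained sketch (this vintage of scikit-learn has no built-in for it):


In [ ]:
# Sketch: quadratic weighted kappa for ratings on a 1-8 scale
def quadratic_weighted_kappa(y_true, y_pred, min_rating=1, max_rating=8):
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    n = max_rating - min_rating + 1
    O = np.zeros((n, n))                      # observed rating-pair counts
    for t, p in zip(y_true, y_pred):
        O[t - min_rating, p - min_rating] += 1
    # squared-distance weights between rating levels
    W = np.array([[(i - j) ** 2 for j in range(n)] for i in range(n)],
                 dtype=np.float) / (n - 1) ** 2
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()  # chance-expected counts
    return 1.0 - (W * O).sum() / (W * E).sum()

print quadratic_weighted_kappa(Y_test, np.clip(np.round(pred), 1, 8))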

In [ ]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 1)  # a single tree; see the sketch below
#model = model.fit(train_data[0:,2:],train_data[0:,0])
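A single tree is a noisy baseline; with the arrays defined above, fitting and scoring a larger forest is a short step. A sketch:


In [ ]:
# Sketch: fit a larger forest and score it on the holdout
from sklearn.metrics import accuracy_score

model = RandomForestClassifier(n_estimators=100, random_state=0)
model = model.fit(X_train, Y_train)
print accuracy_score(model.predict(X_test), Y_test)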

In [409]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score


clf = GaussianNB()

clf.fit(X_train, Y_train)

pred = clf.predict(X_test)

print accuracy_score(pred, Y_test)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-409-2c278d58102b> in <module>()
      5 clf = GaussianNB()
      6 
----> 7 clf.fit(X_train, Y_train)
      8 
      9 pred = clf.predict(X_test)

C:\Users\Robbie\Anaconda\lib\site-packages\sklearn\naive_bayes.pyc in fit(self, X, y)
    150         y = column_or_1d(y, warn=True)
    151 
--> 152         n_samples, n_features = X.shape
    153 
    154         self.classes_ = unique_y = np.unique(y)

ValueError: need more than 1 value to unpack

In [410]:
from sklearn.metrics import accuracy_score


  File "<ipython-input-410-b1fabbd013ba>", line 1
    from sklearn import
                        ^
SyntaxError: invalid syntax

In [ ]:


In [407]:
from sklearn import linear_model
from sklearn.metrics import accuracy_score

clf = linear_model.Lasso(alpha = 0.1)
clf.fit(X_train, Y_train)
pred = clf.predict(X_test)

print accuracy_score(pred, Y_test)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-407-83dc42ad503c> in <module>()
      4 clf = linear_model.Lasso(alpha = 0.1)
      5 clf.fit(X_train, Y_train)
----> 6 pred = clf.predict(X_test)
      7 
      8 print accuracy_score(pred, Y_test)

(... sklearn internals elided ...)

ValueError: shapes (5934,) and (59381,) not aligned: 5934 (dim 0) != 59381 (dim 0)

In [340]:
df_n.columns.tolist()


Out[340]:
['nMedical_History_41']

In [32]:
d_cat = df_cat.copy()
d_cat_test = df_cat_test.copy()

d_cont = df_cont.copy()
d_cont_test = df_cont_test.copy()

d_disc = df_disc.copy()
d_disc_test = df_disc_test.copy()

In [ ]:
#df_cont_n = normalize_cols(d_cont, "continuous")
#df_cont_test_n = normalize_cols(d_cont_test, "continuous")

In [31]:
df_cat_n = normalize_cols(d_cat, "categorical")
df_cat_test_n = normalize_cols(d_cat_test, "categorical")


('could not convert string to float: A8',)
Error on Product_Info_2 w error: could not convert string to float: A8
('could not convert string to float: A8',)
Error on Product_Info_2 w error: could not convert string to float: A8
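Product_Info_2 values such as A8 are alphanumeric, so MinMaxScaler cannot cast them to float; the column has to be encoded first. A sketch using one-hot dummies (d_cat_enc is an illustrative name):


In [ ]:
# Sketch: one-hot encode Product_Info_2 before any numeric scaling
pi2 = pd.get_dummies(d_cat["Product_Info_2"], prefix="PI2")
d_cat_enc = pd.concat([d_cat.drop("Product_Info_2", axis=1), pi2], axis=1)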

In [33]:
df_disc_n = normalize_cols(d_disc, "discrete")
df_disc_test_n = normalize_cols(d_disc_test, "discrete")


("Input contains NaN, infinity or a value too large for dtype('float64').",)
Error on Medical_History_1 w error: Input contains NaN, infinity or a value too large for dtype('float64').
("Input contains NaN, infinity or a value too large for dtype('float64').",)
Error on Medical_History_10 w error: Input contains NaN, infinity or a value too large for dtype('float64').
("Input contains NaN, infinity or a value too large for dtype('float64').",)
Error on Medical_History_15 w error: Input contains NaN, infinity or a value too large for dtype('float64').
("Input contains NaN, infinity or a value too large for dtype('float64').",)
Error on Medical_History_24 w error: Input contains NaN, infinity or a value too large for dtype('float64').
("Input contains NaN, infinity or a value too large for dtype('float64').",)
Error on Medical_History_32 w error: Input contains NaN, infinity or a value too large for dtype('float64').
("Input contains NaN, infinity or a value too large for dtype('float64').",)
Error on Medical_History_1 w error: Input contains NaN, infinity or a value too large for dtype('float64').
("Input contains NaN, infinity or a value too large for dtype('float64').",)
Error on Medical_History_10 w error: Input contains NaN, infinity or a value too large for dtype('float64').
("Input contains NaN, infinity or a value too large for dtype('float64').",)
Error on Medical_History_15 w error: Input contains NaN, infinity or a value too large for dtype('float64').
("Input contains NaN, infinity or a value too large for dtype('float64').",)
Error on Medical_History_24 w error: Input contains NaN, infinity or a value too large for dtype('float64').
("Input contains NaN, infinity or a value too large for dtype('float64').",)
Error on Medical_History_32 w error: Input contains NaN, infinity or a value too large for dtype('float64').
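These five discrete Medical_History columns are exactly the ones flagged earlier as containing NaNs; in this run d_disc was built from the unfilled frame. Filling first, with the same -1 convention used above, lets the scaling go through. A sketch:


In [ ]:
# Sketch: fill the NaNs before min-max scaling the discrete columns
d_disc = d_disc.fillna(-1)
d_disc_test = d_disc_test.fillna(-1)
df_disc_n = normalize_cols(d_disc, "discrete")
df_disc_test_n = normalize_cols(d_disc_test, "discrete")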

In [21]:
a = df_cat_n.iloc[:,62:]

# TODO: Clump into function
#rows are normalized into binary columns of groupings


Out[21]:
<bound method Index.tolist of Index([u'nResponse', u'nProduct_Info_1', u'nProduct_Info_3', u'nProduct_Info_5', u'nProduct_Info_6', u'nProduct_Info_7', u'nEmployment_Info_2', u'nEmployment_Info_3', u'nEmployment_Info_5', u'nInsuredInfo_1', u'nInsuredInfo_2', u'nInsuredInfo_3', u'nInsuredInfo_4', u'nInsuredInfo_5', u'nInsuredInfo_6', u'nInsuredInfo_7', u'nInsurance_History_1', u'nInsurance_History_2', u'nInsurance_History_3', u'nInsurance_History_4', u'nInsurance_History_7', u'nInsurance_History_8', u'nInsurance_History_9', u'nFamily_Hist_1', u'nMedical_History_2', u'nMedical_History_3', u'nMedical_History_4', u'nMedical_History_5', u'nMedical_History_6', u'nMedical_History_7', u'nMedical_History_8', u'nMedical_History_9', u'nMedical_History_11', u'nMedical_History_12', u'nMedical_History_13', u'nMedical_History_14', u'nMedical_History_16', u'nMedical_History_17', u'nMedical_History_18', u'nMedical_History_19', u'nMedical_History_20', u'nMedical_History_21', u'nMedical_History_22', u'nMedical_History_23', u'nMedical_History_25', u'nMedical_History_26', u'nMedical_History_27', u'nMedical_History_28', u'nMedical_History_29', u'nMedical_History_30', u'nMedical_History_31', u'nMedical_History_33', u'nMedical_History_34', u'nMedical_History_35', u'nMedical_History_36', u'nMedical_History_37', u'nMedical_History_38', u'nMedical_History_39', u'nMedical_History_40', u'nMedical_History_41'], dtype='object')>

In [14]:
# Define various group by data streams

df = d  # note: this rebinds df to the raw (unfilled) frame

gb_PI1 = df.groupby('Product_Info_1')
gb_PI2 = df.groupby('Product_Info_2')

gb_Ins_Age = df.groupby('Ins_Age')
gb_Ht = df.groupby('Ht')
gb_Wt = df.groupby('Wt')

gb_response = df.groupby('Response')

In [ ]:
#Prints the distinct levels of the different categorical groups

for c in df.columns:
    if (c in varTypes['categorical']):
        if(c != 'Id'):
            a = [ str(x)+", " for x in df.groupby(c).groups ]
            print c + " : " + str(a)

In [ ]:
df_prod_info = pd.DataFrame(d, columns=["Response"]+ [ "Product_Info_"+str(x) for x in range(1,8)])

# Employment_Info_1-6, so range(1,7)
df_emp_info = pd.DataFrame(d, columns=["Response"]+ [ "Employment_Info_"+str(x) for x in range(1,7)])

# continuous
df_bio = pd.DataFrame(d, columns=["Response", "Ins_Age", "Ht", "Wt","BMI"])

# dummy values (0 or 1); Medical_Keyword_1-48, so range(1,49)
df_med_kw = pd.DataFrame(d, columns=["Response"]+ [ "Medical_Keyword_"+str(x) for x in range(1,49)])

In [ ]:

Grouping of various categorical data sets

Histograms and descriptive statistics for Risk Response, Ins_Age, BMI, Wt


In [ ]:
plt.figure(0)
plt.subplot(121)
plt.title("Categorical - Histogram for Risk Response")
plt.xlabel("Risk Response (1-7)")
plt.ylabel("Frequency")
plt.hist(df.Response)
plt.savefig('images/hist_Response.png')
print df.Response.describe()
print ""

plt.subplot(122)
plt.title("Normalized - Histogram for Risk Response")
plt.xlabel("Normalized Risk Response (1-7)")
plt.ylabel("Frequency")
plt.hist(df_cat_n.nResponse)
plt.savefig('images/hist_norm_Response.png')
print df_cat_n.nResponse.describe()
print ""

In [ ]:
def plotContinuous(d, t):
    plt.title("Continuous - Histogram for "+ str(t))
    plt.xlabel("Normalized "+str(t)+" [0,1]")
    plt.ylabel("Frequency")
    plt.hist(d)
    plt.savefig("images/hist_"+str(t)+".png")
    print ""


# plot each continuous variable in its own figure (dropna: plt.hist
# cannot handle NaNs)
for i, c in enumerate(varTypes['continuous']):
    plt.figure(i)
    plotContinuous(df[c].dropna(), c)
plt.show()

In [26]:
df_disc.describe()[7:8]


Out[26]:
Id Response Medical_History_1 Medical_History_10 Medical_History_15 Medical_History_24 Medical_History_32
max 79146 8 240 240 240 240 240

In [ ]:
plt.figure(1)
plt.title("Continuous - Histogram for Ins_Age")
plt.xlabel("Normalized Ins_Age [0,1]")
plt.ylabel("Frequency")
plt.hist(df.Ins_Age)
plt.savefig('images/hist_Ins_Age.png')
print df.Ins_Age.describe()
print ""

plt.figure(2)
plt.title("Continuous - Histogram for BMI")
plt.xlabel("Normalized BMI [0,1]")
plt.ylabel("Frequency")
plt.hist(df.BMI)
plt.savefig('images/hist_BMI.png')
print df.BMI.describe()
print ""

plt.figure(3)
plt.title("Continuous - Histogram for Wt")
plt.xlabel("Normalized Wt [0,1]")
plt.ylabel("Frequency")
plt.hist(df.Wt)
plt.savefig('images/hist_Wt.png')
print df.Wt.describe()
print ""

plt.show()

In [ ]:


In [ ]:
plt.show()

Histograms and descriptive statistics for Product_Info_1-7


In [ ]:
for i in range(1,8):
    '''
    print "The iteration is: "+str(i)
    print df['Product_Info_'+str(i)].describe()
    print ""
    '''
    
    plt.figure(i)
        
    if(i == 4):
       
        plt.title("Continuous - Histogram for Product_Info_"+str(i))
        plt.xlabel("Normalized value: [0,1]")
        plt.ylabel("Frequency")
        plt.hist(df['Product_Info_'+str(i)])
        plt.savefig('images/hist_Product_Info_'+str(i)+'.png')
            
        
    else:
        
        if(i != 2):
            
            plt.subplot(1,2,1)
            plt.title("Cat-Hist- Product_Info_"+str(i))
            plt.xlabel("Categories")
            plt.ylabel("Frequency")
            plt.hist(df['Product_Info_'+str(i)])
            plt.savefig('images/hist_Product_Info_'+str(i)+'.png')
            
            
            plt.subplot(1,2,2)
            plt.title("Normalized - Histogram of Product_Info_"+str(i))
            plt.xlabel("Categories")
            plt.ylabel("Frequency")
            plt.hist(df_cat_n['nProduct_Info_'+str(i)])
            plt.savefig('images/hist_norm_Product_Info_'+str(i)+'.png')
            
        elif(i == 2):
            plt.title("Cat-Hist Product_Info_"+str(i))
            plt.xlabel("Categories")
            plt.ylabel("Frequency")
            df.Product_Info_2.value_counts().plot(kind='bar') 
            plt.savefig('images/hist_Product_Info_'+str(i)+'.png')
         
plt.show()

Split dataframes into categorical, continuous, discrete, dummy, and response


In [ ]:
catD = df.loc[:,varTypes['categorical']]
contD = df.loc[:,varTypes['continuous']]
disD = df.loc[:,varTypes['discrete']]
dummyD = df.loc[:,varTypes['dummy']]
respD = df.loc[:,['Id','Response']]

Descriptive statistics and scatter plot relating Product_Info_2 and Response


In [ ]:
prod_info = [ "Product_Info_"+str(i) for i in range(1,8)]

a = catD.loc[:, prod_info[1]]

stats = catD.groupby(prod_info[1]).describe()

In [ ]:
c = gb_PI2.Response.count()
plt.figure(0)

# response counts per Product_Info_2 category
plt.scatter(range(len(c)), c.values)

In [ ]:
plt.figure(0)
plt.title("Histogram of "+"Product_Info_"+str(i))
plt.xlabel("Categories " + str((a.describe())['count']))
plt.ylabel("Frequency")

In [ ]:
for i in range(1,8):
    col = "Product_Info_"+str(i)
    a = catD.loc[:, col]
    if i != 4:
        print a.describe()
    print ""
    
    plt.figure(i)
    plt.title("Histogram of "+col)
    plt.xlabel("Categories " + str((a.describe())['count']))
    plt.ylabel("Frequency")
    
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #catD[col].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(col))
    #catD[col].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(col))
    
    if a.dtype in (np.int64, np.float, float, int):
        a.hist()
        
# Random functions
#catD.Product_Info_1.describe()
#catD.loc[:, prod_info].groupby('Product_Info_2').describe()
#df[varTypes['categorical']].hist()

In [ ]:
catD.head(5)

In [ ]:
#Exploration of the discrete data
disD.describe()

In [ ]:
disD.head(5)

In [ ]:
#Iterate through each categorical column of data
#Perform a 2D histogram later

i=0    
for key in varTypes['categorical']:
    
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    plt.figure(i)
    plt.title("Histogram of "+str(key))
    plt.xlabel("Categories " + str((df.groupby(key).describe())['count']))
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    if df[key].dtype in (np.int64, np.float, float, int):
        df[key].hist()
    
    i+=1

In [ ]:
#Iterate through each 'discrete' column of data
#Perform a 2D histogram later

i=0    
for key in varTypes['discrete']:
    
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    fig, axes = plt.subplots(nrows = 1, ncols = 2)
    
    #Histogram based on normalized value counts of the data set
    disD[key].value_counts().hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    
    #Cumulative histogram based on normalized value counts of the data set
    disD[key].value_counts().hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    i+=1

In [ ]:
#2D Histogram

i=0    
for key in varTypes['categorical']:
    
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    plt.figure(i)
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    
    # encode category labels as integer codes so they can be binned
    x = pd.factorize(catD[key])[0]
    y = df['Response']
    
    plt.hist2d(x, y, bins=40, norm=LogNorm())
    plt.colorbar()
    
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    i+=1

In [ ]:
#Iterate through each categorical column of data
#Perform a 2D histogram later

i=0    
for key in varTypes['categorical']:
    
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    plt.figure(i)
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    if df[key].dtype in (np.int64, np.float, float, int):
        #(1.*df[key].value_counts()/len(df[key])).hist()
        df[key].value_counts(normalize=True).plot(kind='bar')
    
    i+=1

In [1]:
df.loc[:, 'Product_Info_1']


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-a71792ee057d> in <module>()
----> 1 df.loc[:, 'Product_Info_1']

NameError: name 'df' is not defined

In [6]:
%ls images/


hist_BMI.png
hist_Ins_Age.png
hist_norm_Product_Info_1.png
hist_norm_Product_Info_3.png
hist_norm_Product_Info_5.png
hist_norm_Product_Info_6.png
hist_norm_Product_Info_7.png
hist_norm_Response.png
hist_product_info_1.png
hist_Product_Info_2.png
hist_Product_Info_3.png
hist_product_info_4.png
hist_Product_Info_5.png
hist_Product_Info_6.png
hist_Product_Info_7.png
hist_response.png
hist_Wt.png
RFC_scatter_alpha_kappa_test1.png
RFC_scatter_alpha_kappa_test2.png
scatterLasso_alpha_kappa_test1.png
scatterLasso_alpha_kappa_test2.png
scatter_alpha_kappa.png
scatter_alpha_kappa_test1.png
scatter_alpha_kappa_test2.png
subplot_demo.py