https://www.kaggle.com/c/prudential-life-insurance-assessment
Variable | Description |
---|---|
Id | A unique identifier associated with an application. |
Product_Info_1-7 | A set of normalized variables relating to the product applied for |
Ins_Age | Normalized age of applicant |
Ht | Normalized height of applicant |
Wt | Normalized weight of applicant |
BMI | Normalized BMI of applicant |
Employment_Info_1-6 | A set of normalized variables relating to the employment history of the applicant. |
InsuredInfo_1-6 | A set of normalized variables providing information about the applicant. |
Insurance_History_1-9 | A set of normalized variables relating to the insurance history of the applicant. |
Family_Hist_1-5 | A set of normalized variables relating to the family history of the applicant. |
Medical_History_1-41 | A set of normalized variables relating to the medical history of the applicant. |
Medical_Keyword_1-48 | A set of dummy variables relating to the presence of/absence of a medical keyword being associated with the application. |
Response | This is the target variable, an ordinal variable relating to the final decision associated with an application |
The following variables are all categorical (nominal):
Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41
The following variables are continuous:
Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5
The following variables are discrete:
Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32
Medical_Keyword_1-48 are dummy variables.
My thoughts are as follows:
The main dependent variable is the risk Response (1-8). Which variables are correlated with it, and how do I perform a correlation analysis between variables?
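One possible starting point for the correlation question is a rank (Spearman) correlation of each numeric column against Response, which suits an ordinal target. The sketch below runs on a small synthetic frame rather than the real train.csv; the helper name `rank_correlation_with_response` and the toy values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv: the column names mirror the real data,
# but every value here is made up for illustration.
rng = np.random.RandomState(0)
n = 200
toy = pd.DataFrame({
    "BMI": rng.rand(n),
    "Ins_Age": rng.rand(n),
})
# Make the toy response depend on BMI so one correlation is clearly non-zero.
toy["Response"] = pd.cut(toy["BMI"], bins=8, labels=False) + 1


def rank_correlation_with_response(frame, target="Response"):
    """Spearman correlation of each numeric column with the target, strongest first."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr(method="spearman")[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


ranking = rank_correlation_with_response(toy)
print(ranking)
```

Spearman correlation only captures monotonic relationships, so for the nominal columns a contingency table or a chi-squared test may be a better fit than a correlation coefficient.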
In [26]:
# Importing libraries
%pylab inline
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from sklearn import preprocessing
import numpy as np
In [3]:
# Convert the variable lists above (categorical, continuous, discrete,
# and dummy) into a dictionary keyed by variable type
In [71]:
s = ["Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41",
"Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5",
"Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32"]
varTypes = dict()
#Very hacky way of inserting and appending ID and Response columns to the required dataframes
#Make this better
varTypes['categorical'] = s[0].split(', ')
#varTypes['categorical'].insert(0, 'Id')
#varTypes['categorical'].append('Response')
varTypes['continuous'] = s[1].split(', ')
#varTypes['continuous'].insert(0, 'Id')
#varTypes['continuous'].append('Response')
varTypes['discrete'] = s[2].split(', ')
#varTypes['discrete'].insert(0, 'Id')
#varTypes['discrete'].append('Response')
varTypes['dummy'] = ["Medical_Keyword_"+str(i) for i in range(1,49)]
varTypes['dummy'].insert(0, 'Id')
varTypes['dummy'].append('Response')
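As a cross-check on the hand-typed strings, the same four lists can be built from number ranges and then tested for size and overlap. This is only a sketch: the helper `expand` and the name `var_types` are hypothetical, and unlike `varTypes['dummy']` above, the dummy list here does not include Id or Response.

```python
def expand(prefix, numbers):
    """Return names such as 'Medical_History_2' for each number given."""
    return ["{0}_{1}".format(prefix, n) for n in numbers]


# Medical_History columns listed as discrete in the data description
discrete_ids = [1, 10, 15, 24, 32]

var_types = {
    "categorical": (
        expand("Product_Info", [1, 2, 3, 5, 6, 7])
        + expand("Employment_Info", [2, 3, 5])
        + expand("InsuredInfo", range(1, 8))
        + expand("Insurance_History", [1, 2, 3, 4, 7, 8, 9])
        + expand("Family_Hist", [1])
        + expand("Medical_History",
                 [n for n in range(1, 42) if n not in discrete_ids])
    ),
    "continuous": (
        ["Product_Info_4", "Ins_Age", "Ht", "Wt", "BMI"]
        + expand("Employment_Info", [1, 4, 6])
        + ["Insurance_History_5"]
        + expand("Family_Hist", [2, 3, 4, 5])
    ),
    "discrete": expand("Medical_History", discrete_ids),
    "dummy": expand("Medical_Keyword", range(1, 49)),
}

# Sanity checks: group sizes, and no column assigned to two groups
counts = {kind: len(names) for kind, names in var_types.items()}
all_names = [name for names in var_types.values() for name in names]
assert len(all_names) == len(set(all_names)), "a column appears in two groups"
print(counts)
```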
In [5]:
#Prints out each of the the variable types as a check
#for i in iter(varTypes['dummy']):
#print i
In [11]:
#Import training data
d = pd.read_csv('prud_files/train.csv')
In [93]:
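def normalize_df(d):
    # Min-max scale a Series or DataFrame to [0, 1]
    min_max_scaler = preprocessing.MinMaxScaler()
    # MinMaxScaler expects 2D input, so wrap a Series in a one-column frame
    x = pd.DataFrame(d).values.astype(float)
    return pd.DataFrame(min_max_scaler.fit_transform(x))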
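For reference, min-max scaling maps each value to (x - min) / (max - min). A minimal pandas-only sketch of that formula, using a hypothetical helper `min_max` and made-up values, could look like this:

```python
import pandas as pd


def min_max(series):
    """Rescale a numeric Series to [0, 1]; a constant column maps to 0."""
    low, high = series.min(), series.max()
    if high == low:
        # Avoid dividing by zero when every value is identical
        return series * 0.0
    return (series - low) / (high - low)


# Made-up values, not taken from train.csv
toy = pd.Series([2.0, 4.0, 6.0], name="Response")
scaled = min_max(toy)
print(scaled.tolist())
```

Mapping a constant column to 0 is a choice made in this sketch so that the function never divides by zero.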
In [130]:
# Import training data
d = pd.read_csv('prud_files/train.csv')
#Separation into groups
df_cat = pd.DataFrame(d, columns=["Id","Response"]+varTypes["categorical"])
df_disc = pd.DataFrame(d, columns=["Id","Response"]+varTypes["discrete"])
df_cont = pd.DataFrame(d, columns=["Id","Response"]+varTypes["continuous"])
In [159]:
d_cat = df_cat.copy()
# One-hot encode Product_Info_2, which holds string codes and cannot be min-max scaled
norm_product_info_2 = [pd.get_dummies(d_cat["Product_Info_2"])]
# Add a min-max scaled copy of Response and of each categorical column, prefixed with "n"
a = pd.DataFrame(normalize_df(d_cat["Response"]))
a.columns = ["nResponse"]
d_cat = pd.concat([d_cat, a], axis=1, join='outer')
for x in varTypes["categorical"]:
    try:
        a = pd.DataFrame(normalize_df(d_cat[x]))
        a.columns = ["n" + x]
        d_cat = pd.concat([d_cat, a], axis=1, join='outer')
    except Exception as e:
        # Non-numeric columns end up here
        print(e.args)
        print("Error on " + str(x) + " w error: " + str(e))
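Min-max scaling treats category codes as if they were ordered numbers, which may not be meaningful for nominal columns. One alternative is one-hot encoding with `pd.get_dummies`, sketched below on a toy frame; the sample codes are made up and only resemble the kind of string values found in Product_Info_2.

```python
import pandas as pd

# Toy frame with made-up category codes
toy = pd.DataFrame({"Product_Info_2": ["D3", "A1", "D3", "E1"]})

# One indicator column per distinct category, named <column>_<category>
indicators = pd.get_dummies(toy["Product_Info_2"], prefix="Product_Info_2")

# Keep the original column next to its indicator columns
encoded = pd.concat([toy, indicators], axis=1)
print(list(indicators.columns))
```

Each row has exactly one indicator set, so no artificial ordering between categories is introduced.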
In [163]:
d_cat.iloc[:,62:66].head(5)
# Normalization of columns
# Create a minimum and maximum processor object
Out[163]:
In [107]:
# Define various group by data streams
df = d
gb_PI1 = df.groupby('Product_Info_1')
gb_PI2 = df.groupby('Product_Info_2')
gb_Ins_Age = df.groupby('Ins_Age')
gb_Ht = df.groupby('Ht')
gb_Wt = df.groupby('Wt')
gb_response = df.groupby('Response')
In [66]:
#Print the distinct values of each categorical column
for c in df.columns:
    if c in varTypes['categorical']:
        a = [str(x) + ", " for x in df.groupby(c).groups]
        print(c + " : " + str(a))
In [28]:
# range() excludes its upper bound, hence 7 and 49 below
df_prod_info = pd.DataFrame(d, columns=(["Response"]+ [ "Product_Info_"+str(x) for x in range(1,8)]))
df_emp_info = pd.DataFrame(d, columns=(["Response"]+ [ "Employment_Info_"+str(x) for x in range(1,7)]))
df_bio = pd.DataFrame(d, columns=["Response", "Ins_Age", "Ht", "Wt","BMI"])
df_med_kw = pd.DataFrame(d, columns=(["Response"]+ [ "Medical_Keyword_"+str(x) for x in range(1,49)]))
df_med_kw.describe()
In [6]:
df.head(5)
Out[6]:
In [7]:
df.describe()
Out[7]:
In [9]:
plt.figure(0)
plt.title("Categorical - Histogram for Risk Response")
plt.xlabel("Risk Response (1-8)")
plt.ylabel("Frequency")
plt.hist(df.Response)
plt.savefig('images/hist_Response.png')
print(df.Response.describe())
print("")
plt.figure(1)
plt.title("Continuous - Histogram for Ins_Age")
plt.xlabel("Normalized Ins_Age [0,1]")
plt.ylabel("Frequency")
plt.hist(df.Ins_Age)
plt.savefig('images/hist_Ins_Age.png')
print(df.Ins_Age.describe())
print("")
plt.figure(2)
plt.title("Continuous - Histogram for BMI")
plt.xlabel("Normalized BMI [0,1]")
plt.ylabel("Frequency")
plt.hist(df.BMI)
plt.savefig('images/hist_BMI.png')
print(df.BMI.describe())
print("")
plt.figure(3)
plt.title("Continuous - Histogram for Wt")
plt.xlabel("Normalized Wt [0,1]")
plt.ylabel("Frequency")
plt.hist(df.Wt)
plt.savefig('images/hist_Wt.png')
print(df.Wt.describe())
print("")
plt.show()
In [10]:
for i in range(1,8):
    print("The iteration is: "+str(i))
    print(df['Product_Info_'+str(i)].describe())
    print("")
    plt.figure(i)
    # Product_Info_4 is the only continuous variable in this group
    if i == 4:
        plt.title("Continuous - Histogram for Product_Info_"+str(i))
        plt.xlabel("Normalized value: [0,1]")
        plt.ylabel("Frequency")
    else:
        plt.title("Categorical - Histogram of Product_Info_"+str(i))
        plt.xlabel("Categories")
        plt.ylabel("Frequency")
    # Product_Info_2 holds string codes, so plot its value counts as bars
    if i == 2:
        df.Product_Info_2.value_counts().plot(kind='bar')
    else:
        plt.hist(df['Product_Info_'+str(i)])
    plt.savefig('images/hist_Product_Info_'+str(i)+'.png')
plt.show()
In [11]:
catD = df.loc[:,varTypes['categorical']]
contD = df.loc[:,varTypes['continuous']]
disD = df.loc[:,varTypes['discrete']]
dummyD = df.loc[:,varTypes['dummy']]
respD = df.loc[:,['Id','Response']]
In [12]:
prod_info = [ "Product_Info_"+str(i) for i in range(1,8)]
a = catD.loc[:, prod_info[1]]
stats = catD.groupby(prod_info[1]).describe()
In [192]:
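c = gb_PI2.Response.count()
plt.figure(0)
# Plot the number of responses for each Product_Info_2 category
plt.scatter(range(len(c)), c.values)
plt.xticks(range(len(c)), c.index, rotation=90)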
Out[192]:
In [64]:
plt.figure(0)
plt.title("Histogram of "+"Product_Info_"+str(i))
plt.xlabel("Categories " + str((a.describe())['count']))
plt.ylabel("Frequency")
Out[64]:
In [61]:
for i in range(1,8):
    # Product_Info_4 is continuous and therefore not part of catD
    if i == 4:
        continue
    key = "Product_Info_"+str(i)
    a = catD.loc[:, key]
    print(a.describe())
    print("")
    plt.figure(i)
    plt.title("Histogram of "+key)
    # Label the axis with the number of distinct categories
    plt.xlabel("Categories " + str(a.nunique()))
    plt.ylabel("Frequency")
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    # hist() needs numeric data; string-valued columns are skipped
    if pd.api.types.is_numeric_dtype(a):
        a.hist()
# Random functions
#catD.Product_Info_1.describe()
#catD.loc[:, prod_info].groupby('Product_Info_2').describe()
#df[varTypes['categorical']].hist()
In [15]:
catD.head(5)
Out[15]:
In [16]:
#Exploration of the discrete data
disD.describe()
Out[16]:
In [17]:
disD.head(5)
Out[17]:
In [60]:
#Iterate through each categorical column of data
#Perform a 2D histogram later
i=0
for key in varTypes['categorical']:
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    plt.figure(i)
    plt.title("Histogram of "+str(key))
    # Label the axis with the number of distinct categories
    plt.xlabel("Categories " + str(df[key].nunique()))
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    # hist() needs numeric data; string-valued columns are skipped
    if pd.api.types.is_numeric_dtype(df[key]):
        df[key].hist()
    i+=1
In [ ]:
#Iterate through each 'discrete' column of data
#Perform a 2D histogram later
i=0
for key in varTypes['discrete']:
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    # plt.subplots() opens its own figure, so no separate plt.figure(i) call is needed
    fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #Histogram of the (unnormalized) value counts of the column
    disD[key].value_counts().hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #Cumulative histogram of the (unnormalized) value counts of the column
    disD[key].value_counts().hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    i+=1
In [ ]:
#2D histogram of each numeric categorical column against Response
i=0
for key in varTypes['categorical']:
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    # hist2d needs numeric input, so string-valued columns are skipped
    if not pd.api.types.is_numeric_dtype(catD[key]):
        continue
    plt.figure(i)
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    # Pair each applicant's category value with that applicant's Response
    x = catD[key]
    y = df['Response']
    plt.hist2d(x, y, bins=40, norm=LogNorm())
    plt.colorbar()
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    i+=1
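A tabular counterpart to a 2D histogram of category against Response is a contingency table. The sketch below uses `pd.crosstab` on a toy frame with made-up values; `normalize="index"` turns each row into the share of responses within one category.

```python
import pandas as pd

# Toy data with made-up values; the column names mirror the real data set
toy = pd.DataFrame({
    "Product_Info_1": [1, 1, 2, 1, 2, 1],
    "Response":       [8, 8, 1, 6, 1, 8],
})

# Rows: category values, columns: Response levels, cells: share within the row
table = pd.crosstab(toy["Product_Info_1"], toy["Response"], normalize="index")
print(table)
```

Unlike `hist2d`, this also works for string-valued columns, since the categories are only used as row labels.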
In [ ]:
#Iterate through each categorical column of data
#Plot the relative frequency of each category as a bar chart
i=0
for key in varTypes['categorical']:
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    plt.figure(i)
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    if pd.api.types.is_numeric_dtype(df[key]):
        #(1.*df[key].value_counts()/len(df[key])).hist()
        df[key].value_counts(normalize=True).plot(kind='bar')
    i+=1
In [12]:
df.loc[:, 'Product_Info_1']