https://www.kaggle.com/c/prudential-life-insurance-assessment
Variable | Description |
---|---|
Id | A unique identifier associated with an application. |
Product_Info_1-7 | A set of normalized variables relating to the product applied for. |
Ins_Age | Normalized age of applicant. |
Ht | Normalized height of applicant. |
Wt | Normalized weight of applicant. |
BMI | Normalized BMI of applicant. |
Employment_Info_1-6 | A set of normalized variables relating to the employment history of the applicant. |
InsuredInfo_1-6 | A set of normalized variables providing information about the applicant. |
Insurance_History_1-9 | A set of normalized variables relating to the insurance history of the applicant. |
Family_Hist_1-5 | A set of normalized variables relating to the family history of the applicant. |
Medical_History_1-41 | A set of normalized variables relating to the medical history of the applicant. |
Medical_Keyword_1-48 | A set of dummy variables relating to the presence of/absence of a medical keyword being associated with the application. |
Response | This is the target variable: an ordinal variable (1-8) relating to the final decision associated with an application. |
The following variables are all categorical (nominal):
Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41
The following variables are continuous:
Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5
The following variables are discrete:
Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32
Medical_Keyword_1-48 are dummy variables.
My thoughts are as follows:
The main dependent variable is the risk Response, an ordinal value from 1 to 8. Which variables are correlated with the risk response, and how do I perform correlation analysis between variables?
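Spearman rank correlation is a natural first tool here, since Response is ordinal rather than interval-scaled. A minimal sketch, assuming the training data has already been loaded into the DataFrame `df` built in the cells below:
In [ ]:
# Sketch: rank-correlate every numeric column with the ordinal Response
# (DataFrame.corr silently skips non-numeric columns like Product_Info_2)
corr = df.corr(method='spearman')['Response']
print corr.drop('Response').abs().sort_values(ascending=False).head(10)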
In [2]:
# Importing libraries
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from sklearn import preprocessing
import numpy as np
In [3]:
# Organize the variable names into categorical, continuous,
# discrete, and dummy lists, stored in a dictionary
In [4]:
s = ["Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41",
"Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5",
"Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32"]
varTypes = dict()
varTypes['categorical'] = s[0].split(', ')
varTypes['continuous'] = s[1].split(', ')
varTypes['discrete'] = s[2].split(', ')
varTypes['dummy'] = ["Medical_Keyword_"+str(i) for i in range(1,49)]
In [5]:
# Print out each of the variable types as a sanity check
#for i in iter(varTypes['dummy']):
#    print i
In [6]:
#Import training data
d_raw = pd.read_csv('prud_files/train.csv')
d = d_raw.copy()
In [9]:
len(d.columns)
In [181]:
# Get all the columns that have NaNs
d = d_raw.copy()
a = pd.isnull(d).sum()
nullColumns = a[a>0].index.values
#for c in nullColumns:
#    d[c].fillna(-1)
# Determine the min and max values for the NaN columns
a = pd.DataFrame(d, columns=nullColumns).describe()
a_min = a.loc[['min']]
a_max = a.loc[['max']]
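Since the normalized variables are non-negative, a sentinel like -1 falls outside every observed range. To judge whether filling is sensible at all, a quick sketch of how much of each column is actually missing:
In [ ]:
# Sketch: fraction of missing values per affected column
print pd.isnull(d[nullColumns]).mean().sort_values(ascending=False)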
In [175]:
nullList = ['Family_Hist_4',
'Medical_History_1',
'Medical_History_10',
'Medical_History_15',
'Medical_History_24',
'Medical_History_32']
pd.DataFrame(a_max, columns=nullList)
In [303]:
# Convert all NaNs to -1 and sum up all medical keywords across columns
df = d.fillna(-1)
b = pd.DataFrame(df[varTypes["dummy"]].sum(axis=1), columns=["Medical_Keyword_Sum"])
df= pd.concat([df,b], axis=1, join='outer')
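A quick look at the new keyword-count feature, to confirm the row-wise sum behaves as expected:
In [ ]:
# Sketch: distribution of the keyword-count feature just created
print df["Medical_Keyword_Sum"].describe()
print df["Medical_Keyword_Sum"].value_counts().head()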
In [328]:
# Turn split-train-to-test on or off.
# If on, 10% of the training set is held out for feature testing.
# If off, the test set is loaded from file.
splitTrainToTest = 1
if(splitTrainToTest):
    d_gb = df.groupby("Response")
    df_test = pd.DataFrame()
    for name, group in d_gb:
        df_test = pd.concat([df_test, group[:len(group)//10]], axis=0, join='outer')
    print "test data is 10% training data"
else:
    d_test = pd.read_csv('prud_files/test.csv')
    df_test = d_test.fillna(-1)
    # sum the keyword dummies of the *test* frame, not the training frame
    b = pd.DataFrame(df_test[varTypes["dummy"]].sum(axis=1), columns=["Medical_Keyword_Sum"])
    df_test = pd.concat([df_test, b], axis=1, join='outer')
    print "test data is prud_files/test.csv"
In [275]:
# Note: prud_files/test.csv has no Response column, so the *_test frames
# below assume the 10% holdout path was taken above
df_cat = df[["Id","Response"]+varTypes["categorical"]]
df_disc = df[["Id","Response"]+varTypes["discrete"]]
df_cont = df[["Id","Response"]+varTypes["continuous"]]
df_dummy = df[["Id","Response"]+varTypes["dummy"]]
df_cat_test = df_test[["Id","Response"]+varTypes["categorical"]]
df_disc_test = df_test[["Id","Response"]+varTypes["discrete"]]
df_cont_test = df_test[["Id","Response"]+varTypes["continuous"]]
df_dummy_test = df_test[["Id","Response"]+varTypes["dummy"]]
In [355]:
# Assemble the combined frames of predictive columns for train and test
df_n = df[["Response", "Medical_Keyword_Sum"]+varTypes["categorical"]+varTypes["discrete"]+varTypes["continuous"]].copy()
df_test_n = df_test[["Response","Medical_Keyword_Sum"]+varTypes["categorical"]+varTypes["discrete"]+varTypes["continuous"]].copy()
In [356]:
# Get all the Product_Info_2 categories
a = pd.get_dummies(df["Product_Info_2"]).columns.tolist()
norm_PI2_dict = dict()
# Create an enumerated dictionary of Product_Info_2 categories
for i, c in enumerate(a, start=1):
    norm_PI2_dict[c] = i
print norm_PI2_dict
df_n = df_n.replace(to_replace={'Product_Info_2':norm_PI2_dict})
df_test_n = df_test_n.replace(to_replace={'Product_Info_2':norm_PI2_dict})
df_n
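pandas can produce the same enumeration in one call; a sketch using pd.factorize, whose codes start at 0 rather than 1:
In [ ]:
# Sketch: equivalent label encoding of Product_Info_2 via pd.factorize
codes, categories = pd.factorize(df["Product_Info_2"], sort=True)
print dict(zip(categories, range(len(categories))))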
In [359]:
# Min-max normalizes a single dataframe column and returns the result
def normalize_df(d):
    min_max_scaler = preprocessing.MinMaxScaler()
    # MinMaxScaler expects a 2D array, so reshape the 1D column first;
    # keep the original index so later concats align row-wise
    x = d.values.astype(np.float).reshape(-1, 1)
    return pd.DataFrame(min_max_scaler.fit_transform(x), index=d.index)

# Appends a normalized ("n"-prefixed) copy of each categorical column
def normalize_cat(d_cat):
    for x in varTypes["categorical"]:
        try:
            a = pd.DataFrame(normalize_df(d_cat[x]))
            a.columns = [str("n"+x)]
            d_cat = pd.concat([d_cat, a], axis=1, join='outer')
        except Exception as e:
            print e.args
            print "Error on "+str(x)+" w error: "+str(e)
    return d_cat

# Appends a normalized ("n"-prefixed) copy of each discrete column
def normalize_disc(d_disc):
    for x in varTypes["discrete"]:
        try:
            a = pd.DataFrame(normalize_df(d_disc[x]))
            a.columns = [str("n"+x)]
            d_disc = pd.concat([d_disc, a], axis=1, join='outer')
        except Exception as e:
            print e.args
            print "Error on "+str(x)+" w error: "+str(e)
    return d_disc

# Generic version: t = "categorical", "discrete", or "continuous"
def normalize_cols(d, t = "categorical"):
    for x in varTypes[t]:
        try:
            a = pd.DataFrame(normalize_df(d[x]))
            a.columns = [str("n"+x)]
            # accumulate the normalized columns onto d instead of
            # overwriting a single frame on every iteration
            d = pd.concat([d, a], axis=1, join='outer')
        except Exception as e:
            print e.args
            print "Error on "+str(x)+" w error: "+str(e)
    return d

def normalize_response(d):
    a = pd.DataFrame(normalize_df(d["Response"]))
    a.columns = ["nResponse"]
    return a
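For reference, a short usage sketch of the corrected normalize_cols, which returns the input frame with "n"-prefixed normalized copies appended:
In [ ]:
# Usage sketch: append min-max-normalized copies of the discrete columns
demo = normalize_cols(df_disc.copy(), t="discrete")
print [c for c in demo.columns if c.startswith("n")]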
In [15]:
df_n_2 = df_n.copy()
df_n_test_2 = df_test_n.copy()
df_n_2 = df_n_2[["Response"]+varTypes["categorical"]+varTypes["discrete"]]
df_n_test_2 = df_n_test_2[["Response"]+varTypes["categorical"]+varTypes["discrete"]]
# normalize per column (axis=0), not per row; each column ends up in [0,1]
df_n_2 = df_n_2.apply(lambda col: normalize_df(col)[0], axis=0)
df_n_test_2 = df_n_test_2.apply(lambda col: normalize_df(col)[0], axis=0)
In [382]:
df_n_3 = pd.concat([df["Id"],df_n["Medical_Keyword_Sum"],df_n_2, df_n[varTypes["continuous"]]],axis=1,join='outer')
df_n_test_3 = pd.concat([df_test["Id"],df_test_n["Medical_Keyword_Sum"],df_n_test_2, df_test_n[varTypes["continuous"]]],axis=1,join='outer')
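Before dropping down to raw numpy arrays it is worth checking that the train and test frames line up; a quick sanity-check sketch:
In [ ]:
# Sanity check (sketch): both frames should share width and column order
print df_n_3.shape, df_n_test_3.shape
print (df_n_3.columns == df_n_test_3.columns).all()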
In [ ]:
train_data = df_n_3.values
test_data = df_n_test_3.values
# features: everything except the identifier and the (normalized) target;
# labels: the original 1-8 Response values
X_train = df_n_3.drop(["Id", "Response"], axis=1).values
Y_train = df["Response"].values
X_test = df_n_test_3.drop(["Id", "Response"], axis=1).values
Y_test = df_test["Response"].values
In [ ]:
from sklearn import linear_model
from sklearn.metrics import accuracy_score
clf = linear_model.Lasso(alpha = 0.1)
clf.fit(X_train, Y_train)
# Lasso is a regressor, so round and clip its continuous output to the
# 1-8 classes before computing a classification accuracy
pred = np.clip(np.rint(clf.predict(X_test)), 1, 8).astype(int)
print accuracy_score(Y_test, pred)
In [ ]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators = 1)  # a single tree; raise n_estimators for a real forest
model = model.fit(X_train, Y_train)
In [409]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
clf = GaussianNB()
# fit on the feature matrix and the Response labels
clf.fit(X_train, Y_train)
pred = clf.predict(X_test)
print accuracy_score(Y_test, pred)
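Accuracy weighs all errors equally, but this competition is scored with quadratic weighted kappa, which penalizes predictions far from the true ordinal class more heavily. A sketch, assuming a scikit-learn version that provides cohen_kappa_score with quadratic weights:
In [ ]:
# Sketch: quadratic weighted kappa, the competition's evaluation metric
from sklearn.metrics import cohen_kappa_score
print cohen_kappa_score(Y_test, pred, weights='quadratic')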
In [340]:
df_n.columns.tolist()
In [32]:
d_cat = df_cat.copy()
d_cat_test = df_cat_test.copy()
d_cont = df_cont.copy()
d_cont_test = df_cont_test.copy()
d_disc = df_disc.copy()
d_disc_test = df_disc_test.copy()
In [ ]:
#df_cont_n = normalize_cols(d_cont, "continuous")
#df_cont_test_n = normalize_cols(d_cont_test, "continuous")
In [31]:
df_cat_n = normalize_cols(d_cat, "categorical")
df_cat_test_n = normalize_cols(d_cat_test, "categorical")
In [33]:
df_disc_n = normalize_cols(d_disc, "discrete")
df_disc_test_n = normalize_cols(d_disc, "discrete")
In [21]:
# TODO: wrap this slicing into a helper function
# the normalized ("n"-prefixed) columns appended by normalize_cols start at index 62
a = df_cat_n.iloc[:,62:]
In [14]:
# Define various groupby views
# (note: this rebinds df to the raw copy d, which still contains NaNs)
df = d
gb_PI1 = df.groupby('Product_Info_1')
gb_PI2 = df.groupby('Product_Info_2')
gb_Ins_Age = df.groupby('Ins_Age')
gb_Ht = df.groupby('Ht')
gb_Wt = df.groupby('Wt')
gb_response = df.groupby('Response')
In [ ]:
# Print the distinct categories of each categorical column
for c in df.columns:
    if (c in varTypes['categorical']) and (c != 'Id'):
        a = [ str(x)+", " for x in df.groupby(c).groups ]
        print c + " : " + str(a)
In [ ]:
df_prod_info = pd.DataFrame(d, columns=["Response"]+ [ "Product_Info_"+str(x) for x in range(1,8)])
df_emp_info = pd.DataFrame(d, columns=["Response"]+ [ "Employment_Info_"+str(x) for x in range(1,7)])
# continuous
df_bio = pd.DataFrame(d, columns=["Response", "Ins_Age", "Ht", "Wt","BMI"])
# all the keyword values are binary (0 or 1)
df_med_kw = pd.DataFrame(d, columns=["Response"]+ [ "Medical_Keyword_"+str(x) for x in range(1,49)])
In [ ]:
plt.figure(0)
plt.subplot(121)
plt.title("Categorical - Histogram for Risk Response")
plt.xlabel("Risk Response (1-7)")
plt.ylabel("Frequency")
plt.hist(df.Response)
plt.savefig('images/hist_Response.png')
print df.Response.describe()
print ""
plt.subplot(122)
plt.title("Normalized - Histogram for Risk Response")
plt.xlabel("Normalized Risk Response (1-7)")
plt.ylabel("Frequency")
plt.hist(df_cat_n.nResponse)
plt.savefig('images/hist_norm_Response.png')
print df_cat_n.nResponse.describe()
print ""
In [ ]:
def plotContinuous(d, t):
    plt.title("Continuous - Histogram for "+ str(t))
    plt.xlabel("Normalized "+str(t)+" [0,1]")
    plt.ylabel("Frequency")
    plt.hist(d)
    plt.savefig("images/hist_"+str(t)+".png")

plt.figure(1)
plotContinuous(df.Ins_Age, "Ins_Age")
plt.show()
In [26]:
df_disc.describe().loc[['max']]
In [ ]:
plt.figure(1)
plt.title("Continuous - Histogram for Ins_Age")
plt.xlabel("Normalized Ins_Age [0,1]")
plt.ylabel("Frequency")
plt.hist(df.Ins_Age)
plt.savefig('images/hist_Ins_Age.png')
print df.Ins_Age.describe()
print ""
plt.figure(2)
plt.title("Continuous - Histogram for BMI")
plt.xlabel("Normalized BMI [0,1]")
plt.ylabel("Frequency")
plt.hist(df.BMI)
plt.savefig('images/hist_BMI.png')
print df.BMI.describe()
print ""
plt.figure(3)
plt.title("Continuous - Histogram for Wt")
plt.xlabel("Normalized Wt [0,1]")
plt.ylabel("Frequency")
plt.hist(df.Wt)
plt.savefig('images/hist_Wt.png')
print df.Wt.describe()
print ""
plt.show()
In [ ]:
for i in range(1,8):
    '''
    print "The iteration is: "+str(i)
    print df['Product_Info_'+str(i)].describe()
    print ""
    '''
    plt.figure(i)
    if(i == 4):
        # Product_Info_4 is the only continuous product variable
        plt.title("Continuous - Histogram for Product_Info_"+str(i))
        plt.xlabel("Normalized value: [0,1]")
        plt.ylabel("Frequency")
        plt.hist(df['Product_Info_'+str(i)])
        plt.savefig('images/hist_Product_Info_'+str(i)+'.png')
    elif(i == 2):
        # Product_Info_2 is string-valued, so plot its value counts as a bar chart
        plt.title("Cat-Hist Product_Info_"+str(i))
        plt.xlabel("Categories")
        plt.ylabel("Frequency")
        df.Product_Info_2.value_counts().plot(kind='bar')
        plt.savefig('images/hist_Product_Info_'+str(i)+'.png')
    else:
        plt.subplot(1,2,1)
        plt.title("Cat-Hist- Product_Info_"+str(i))
        plt.xlabel("Categories")
        plt.ylabel("Frequency")
        plt.hist(df['Product_Info_'+str(i)])
        plt.savefig('images/hist_Product_Info_'+str(i)+'.png')
        plt.subplot(1,2,2)
        plt.title("Normalized - Histogram of Product_Info_"+str(i))
        plt.xlabel("Categories")
        plt.ylabel("Frequency")
        plt.hist(df_cat_n['nProduct_Info_'+str(i)])
        plt.savefig('images/hist_norm_Product_Info_'+str(i)+'.png')
plt.show()
In [ ]:
catD = df.loc[:,varTypes['categorical']]
contD = df.loc[:,varTypes['continuous']]
disD = df.loc[:,varTypes['discrete']]
dummyD = df.loc[:,varTypes['dummy']]
respD = df.loc[:,['Id','Response']]
In [ ]:
prod_info = [ "Product_Info_"+str(i) for i in range(1,8)]
a = catD.loc[:, prod_info[1]]   # prod_info[1] is "Product_Info_2"
stats = catD.groupby(prod_info[1]).describe()
In [ ]:
c = gb_PI2.Response.count()
plt.figure(0)
# response counts are indexed by category label, so plot them as a bar chart
c.plot(kind='bar')
In [ ]:
plt.figure(0)
# relies on `a` and `i` left over from the cells above
plt.title("Histogram of "+"Product_Info_"+str(i))
plt.xlabel("Categories " + str((a.describe())['count']))
plt.ylabel("Frequency")
In [ ]:
for i in range(1,8):
    if(i == 4):
        continue  # Product_Info_4 is continuous and not part of catD
    a = catD.loc[:, "Product_Info_"+str(i)]
    print a.describe()
    print ""
    plt.figure(i)
    plt.title("Histogram of "+"Product_Info_"+str(i))
    plt.xlabel("Categories " + str((a.describe())['count']))
    plt.ylabel("Frequency")
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    # only numeric columns can be histogrammed directly
    if a.dtype in (np.int64, np.float, float, int):
        a.hist()

# Miscellaneous exploration commands
#catD.Product_Info_1.describe()
#catD.loc[:, prod_info].groupby('Product_Info_2').describe()
#df[varTypes['categorical']].hist()
In [ ]:
catD.head(5)
In [ ]:
#Exploration of the discrete data
disD.describe()
In [ ]:
disD.head(5)
In [ ]:
# Iterate through each categorical column of data
# TODO: perform a 2D histogram later
i=0
for key in varTypes['categorical']:
    plt.figure(i)
    plt.title("Histogram of "+str(key))
    plt.xlabel("Categories " + str((df.groupby(key).describe())['count']))
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    if df[key].dtype in (np.int64, np.float, float, int):
        df[key].hist()
    i+=1
In [ ]:
# Iterate through each 'discrete' column of data
# TODO: perform a 2D histogram later
i=0
for key in varTypes['discrete']:
    # plt.subplots creates its own figure, so no separate plt.figure call
    fig, axes = plt.subplots(nrows = 1, ncols = 2)
    # Histogram of the column's value counts
    disD[key].value_counts().hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    # Cumulative histogram of the column's value counts
    disD[key].value_counts().hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    i+=1
In [ ]:
# 2D histogram of each categorical variable against the response
i=0
for key in varTypes['categorical']:
    plt.figure(i)
    # hist2d needs numeric x values, so factorize the category labels first
    x = pd.factorize(df[key])[0]
    y = df['Response']
    plt.hist2d(x, y, bins=40, norm=LogNorm())
    plt.colorbar()
    i+=1
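For category-versus-response counts, a contingency table is often easier to read than a 2D histogram; a sketch using pd.crosstab on one illustrative column:
In [ ]:
# Sketch: contingency table of one categorical variable against Response
ct = pd.crosstab(df['Product_Info_1'], df['Response'])
print ct
# the same table as an image, log-scaled like the hist2d above
plt.imshow(ct.values, norm=LogNorm(), interpolation='nearest')
plt.colorbar()
plt.xlabel('Response'); plt.ylabel('Product_Info_1')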
In [ ]:
# Bar chart of the normalized value counts of each categorical column
i=0
for key in varTypes['categorical']:
    plt.figure(i)
    if df[key].dtype in (np.int64, np.float, float, int):
        #(1.*df[key].value_counts()/len(df[key])).hist()
        df[key].value_counts(normalize=True).plot(kind='bar')
    i+=1
In [1]:
df.loc('Product_Info_1')
In [6]: