https://www.kaggle.com/c/prudential-life-insurance-assessment
Variable | Description |
---|---|
Id | A unique identifier associated with an application. |
Product_Info_1-7 | A set of normalized variables relating to the product applied for |
Ins_Age | Normalized age of applicant |
Ht | Normalized height of applicant |
Wt | Normalized weight of applicant |
BMI | Normalized BMI of applicant |
Employment_Info_1-6 | A set of normalized variables relating to the employment history of the applicant. |
InsuredInfo_1-6 | A set of normalized variables providing information about the applicant. |
Insurance_History_1-9 | A set of normalized variables relating to the insurance history of the applicant. |
Family_Hist_1-5 | A set of normalized variables relating to the family history of the applicant. |
Medical_History_1-41 | A set of normalized variables relating to the medical history of the applicant. |
Medical_Keyword_1-48 | A set of dummy variables relating to the presence of/absence of a medical keyword being associated with the application. |
Response | This is the target variable, an ordinal variable relating to the final decision associated with an application |
The following variables are all categorical (nominal):
Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41
The following variables are continuous:
Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5
The following variables are discrete:
Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32
Medical_Keyword_1-48 are dummy variables.
In [14]:
# Collect the categorical, continuous, discrete, and dummy variable names
# listed above into a dictionary of lists
In [ ]:
# The following variables are all categorical (nominal):
In [33]:
s = ["Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41",
"Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5",
"Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32"]
varTypes = dict()
varTypes['categorical'] = s[0].split(', ')
varTypes['continuous'] = s[1].split(', ')
varTypes['discrete'] = s[2].split(', ')
# Medical_Keyword_1 through Medical_Keyword_48 are the dummy variables
varTypes['dummy'] = ["Medical_Keyword_"+str(i) for i in range(1,49)]
In [56]:
#Checking over the variable types
#for i in iter(varTypes['dummy']):
#print i
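One quick check on the lists above is to confirm that no variable name was assigned to more than one type. The sketch below is only an illustration: `find_duplicates` is a hypothetical helper, and `sample_types` is an abbreviated stand-in for `varTypes` that holds just a few of the names listed above.

```python
from collections import Counter

def find_duplicates(var_types):
    """Return the sorted names that occur more than once across all lists."""
    counts = Counter(name for names in var_types.values() for name in names)
    return sorted(name for name, n in counts.items() if n > 1)

# Abbreviated stand-in for varTypes (hypothetical subset, for illustration only)
sample_types = {
    'categorical': ['Product_Info_1', 'Product_Info_2', 'Family_Hist_1'],
    'continuous': ['Product_Info_4', 'Ins_Age', 'BMI'],
    'discrete': ['Medical_History_1', 'Medical_History_10'],
    'dummy': ['Medical_Keyword_' + str(i) for i in range(1, 49)],
}

# Number of names per type, and any name that was listed twice
group_sizes = {kind: len(names) for kind, names in sample_types.items()}
duplicates = find_duplicates(sample_types)
print(group_sizes)
print(duplicates)
```

Passing the full `varTypes` dictionary instead of `sample_types` would run the same check on the real lists.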
In [69]:
%pylab inline
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
In [6]:
df = pd.read_csv('prud_files/train.csv')
In [7]:
df.head(5)
Out[7]:
In [37]:
df.describe()
Out[37]:
In [70]:
df.describe?
In [38]:
df.index
Out[38]:
In [39]:
df.columns
Out[39]:
In [67]:
df.loc??
In [71]:
#Exploration of the categorical data
catD = df.loc[:,varTypes['categorical']]
catD.describe()
Out[71]:
In [100]:
catD.head(5)
Out[100]:
In [ ]:
# Bar chart of the category counts for one categorical column
key = varTypes['categorical'][0]
catD.groupby(key).size().plot(kind='bar')
In [113]:
#Iterate through each categorical column of data
#Perform a 2D histogram later
for key in varTypes['categorical']:
    #Select the data in each key iteration
    d = catD.loc[:,key]
    counts = d.value_counts()
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, counts.count(), counts)
    #One figure per column, with two panels side by side
    fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #Left panel: histogram of the category counts
    ax = axes[0]
    ax.set_title('Histogram: ' + str(key))
    ax.set_xlabel('Category: '+str(key))
    ax.set_ylabel('Frequency')
    counts.hist(alpha=0.5, ax = ax)
    #Right panel: cumulative histogram of the same counts
    ax = axes[1]
    ax.set_title('Cumulative Histogram: ' + str(key))
    ax.set_xlabel('Category: '+str(key))
    ax.set_ylabel('Frequency')
    counts.hist(alpha=0.5, cumulative=True, ax = ax)
    #Stop after the first column for now
    break
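For a nominal column, the per-category frequencies and their running total can also be tabulated directly, which is roughly the information the two panels above are meant to convey. This is a minimal sketch on made-up data, not the notebook's own method: `toy` and `freq_table` are hypothetical names, and the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for one categorical column (values are made up)
toy = pd.Series([1, 2, 1, 3, 1, 2], name='Product_Info_1')

# Count of each category, ordered by category label
counts = toy.value_counts().sort_index()

# Frequency table with a running (cumulative) total
freq_table = pd.DataFrame({
    'count': counts,
    'cumulative': counts.cumsum(),
})
print(freq_table)
```

On the real data, `catD[key]` would take the place of `toy`, and `freq_table.plot(kind='bar')` is one way to draw the result.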
In [ ]:
# Print the value counts of every categorical column
for col in catD.columns:
    print(col)
    print(catD[col].value_counts())
In [72]:
#fig, axes = plt.subplots(1, 2, figsize=(12,4))
#fig.tight_layout()
num_bins = 50
plt.hist(catD['Product_Info_1'], num_bins, facecolor='green',alpha=0.5)
Out[72]:
In [ ]:
disD = df.loc[:,varTypes['discrete']]
contD = df.loc[:,varTypes['continuous']]
respD = df.loc[:,['Id','Response']]
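Column labels in pandas are case-sensitive, and the variable table above spells the identifier `Id`. Selecting a label that does not match raises an error or, in older pandas versions, may return a column of missing values. The sketch below shows one check that could run before the selections. `missing_columns` is a hypothetical helper and `toy_df` is a made-up frame, neither of which is part of the notebook.

```python
import pandas as pd

def missing_columns(frame, names):
    """Return the requested names that are not column labels of frame."""
    return [name for name in names if name not in frame.columns]

# Made-up frame with a few of the real column names (values are invented)
toy_df = pd.DataFrame({
    'Id': [2, 5],
    'Medical_History_1': [4.0, 5.0],
    'Response': [8, 4],
})

# A lower-case 'id' is reported as missing; 'Id' is found
print(missing_columns(toy_df, ['id', 'Response']))
print(missing_columns(toy_df, ['Id', 'Response']))
```

The same call with `df` and each list in `varTypes` would flag any name that does not appear in `train.csv`.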
In [ ]: