Exploration of Prudential Life Insurance Data

Data retrieved from:

https://www.kaggle.com/c/prudential-life-insurance-assessment

File descriptions:

train.csv - the training set, contains the Response values
test.csv - the test set, you must predict the Response variable for all rows in this file
sample_submission.csv - a sample submission file in the correct format

Data fields:

Variable	Description
Id	A unique identifier associated with an application.
Product_Info_1-7	A set of normalized variables relating to the product applied for
Ins_Age	Normalized age of applicant
Ht	Normalized height of applicant
Wt	Normalized weight of applicant
BMI	Normalized BMI of applicant
Employment_Info_1-6	A set of normalized variables relating to the employment history of the applicant.
InsuredInfo_1-6	A set of normalized variables providing information about the applicant.
Insurance_History_1-9	A set of normalized variables relating to the insurance history of the applicant.
Family_Hist_1-5	A set of normalized variables relating to the family history of the applicant.
Medical_History_1-41	A set of normalized variables relating to the medical history of the applicant.
Medical_Keyword_1-48	A set of dummy variables relating to the presence of/absence of a medical keyword being associated with the application.
Response	This is the target variable, an ordinal variable relating to the final decision associated with an application

The following variables are all categorical (nominal):

Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41

The following variables are continuous:

Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5

The following variables are discrete:

Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32

Medical_Keyword_1-48 are dummy variables.

My thoughts are as follows:

The main dependent variable is the Risk Response (1-8) What are variables are correlated to the risk response? How do I perform correlation analysis between variables?

Import libraries



In [26]:

    
# Importing libraries

%pylab inline
%matplotlib inline
import pandas as pd 
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from sklearn import preprocessing
import numpy as np









    



Populating the interactive namespace from numpy and matplotlib



In [3]:

    
# Convert variable data into categorical, continuous, discrete, 
# and dummy variable lists the following into a dictionary

Define categorical data types



In [71]:

    
s = ["Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41",
    "Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5",
     "Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32"]
 

varTypes = dict()


#Very hacky way of inserting and appending ID and Response columns to the required dataframes
#Make this better

varTypes['categorical'] = s[0].split(', ')
#varTypes['categorical'].insert(0, 'Id')
#varTypes['categorical'].append('Response')

varTypes['continuous'] = s[1].split(', ')
#varTypes['continuous'].insert(0, 'Id')
#varTypes['continuous'].append('Response')

varTypes['discrete'] = s[2].split(', ')
#varTypes['discrete'].insert(0, 'Id') 
#varTypes['discrete'].append('Response')



varTypes['dummy'] = ["Medical_Keyword_"+str(i) for i in range(1,49)]
varTypes['dummy'].insert(0, 'Id')
varTypes['dummy'].append('Response')









    



['Id', 'Medical_Keyword_1', 'Medical_Keyword_2', 'Medical_Keyword_3', 'Medical_Keyword_4', 'Medical_Keyword_5', 'Medical_Keyword_6', 'Medical_Keyword_7', 'Medical_Keyword_8', 'Medical_Keyword_9', 'Medical_Keyword_10', 'Medical_Keyword_11', 'Medical_Keyword_12', 'Medical_Keyword_13', 'Medical_Keyword_14', 'Medical_Keyword_15', 'Medical_Keyword_16', 'Medical_Keyword_17', 'Medical_Keyword_18', 'Medical_Keyword_19', 'Medical_Keyword_20', 'Medical_Keyword_21', 'Medical_Keyword_22', 'Medical_Keyword_23', 'Medical_Keyword_24', 'Medical_Keyword_25', 'Medical_Keyword_26', 'Medical_Keyword_27', 'Medical_Keyword_28', 'Medical_Keyword_29', 'Medical_Keyword_30', 'Medical_Keyword_31', 'Medical_Keyword_32', 'Medical_Keyword_33', 'Medical_Keyword_34', 'Medical_Keyword_35', 'Medical_Keyword_36', 'Medical_Keyword_37', 'Medical_Keyword_38', 'Medical_Keyword_39', 'Medical_Keyword_40', 'Medical_Keyword_41', 'Medical_Keyword_42', 'Medical_Keyword_43', 'Medical_Keyword_44', 'Medical_Keyword_45', 'Medical_Keyword_46', 'Medical_Keyword_47', 'Medical_Keyword_48', 'Response']



In [5]:

    
#Prints out each of the the variable types as a check
#for i in iter(varTypes['dummy']):
    #print i

Importing life insurance data set

The following variables are all categorical (nominal):

The following variables are continuous: Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5

The following variables are discrete: Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32

Medical_Keyword_1-48 are dummy variables.



In [11]:

    
#Import training data 
d = pd.read_csv('prud_files/train.csv')



In [93]:

    
def normalize_df(d):
    min_max_scaler = preprocessing.MinMaxScaler()
    x = d.values.astype(np.float)
    return pd.DataFrame(min_max_scaler.fit_transform(x))



In [130]:

    
# Import training data 
d = pd.read_csv('prud_files/train.csv')

#Separation into groups

df_cat = pd.DataFrame(d, columns=["Id","Response"]+varTypes["categorical"])
df_disc = pd.DataFrame(d, columns=["Id","Response"]+varTypes["categorical"])
df_cont = pd.DataFrame(d, columns=["Id","Response"]+varTypes["categorical"])



In [159]:

    
d_cat = df_cat.copy()

#normalizes the columns for binary classification
norm_product_info_2 = [pd.get_dummies(d_cat["Product_Info_2"])]

a = pd.DataFrame(normalize_df(d_cat["Response"]))
a.columns=["nResponse"]
d_cat = pd.concat([d_cat, a], axis=1, join='outer')

for x in varTypes["categorical"]:
        try:

            a = pd.DataFrame(normalize_df(d_cat[x]))
            a.columns=[str("n"+x)]
            d_cat = pd.concat([d_cat, a], axis=1, join='outer')
        except Exception as e:
            print e.args
            print "Error on "+str(x)+" w error: "+str(e)









    



('could not convert string to float: A8',)
Error on Product_Info_2 w error: could not convert string to float: A8



In [145]:



In [163]:

    
d_cat.iloc[:,62:66].head(5)

# Normalization of columns
# Create a minimum and maximum processor object









    Out[163]:






  
    
      
      nResponse
      nProduct_Info_1
      nProduct_Info_3
      nProduct_Info_5
    
  
  
    
      0
       1.000000
       0
       0.243243
       0
    
    
      1
       0.428571
       0
       0.675676
       0
    
    
      2
       1.000000
       0
       0.675676
       0
    
    
      3
       1.000000
       0
       0.243243
       0
    
    
      4
       1.000000
       0
       0.675676
       0



In [107]:

    
# Define various group by data streams

df = d
    
gb_PI2 = df.groupby('Product_Info_1')
gb_PI2 = df.groupby('Product_Info_2')

gb_Ins_Age = df.groupby('Ins_Age')
gb_Ht = df.groupby('Ht')
gb_Wt = df.groupby('Wt')

gb_response = df.groupby('Response')



In [66]:

    
#Outputs rows the differnet categorical groups

for c in df.columns:
    if (c in varTypes['categorical']):
        if(c != 'Id'):
            a = [ str(x)+", " for x in df.groupby(c).groups ]
            print c + " : " + str(a)









    



Product_Info_1 : ['1, ', '2, ']
Product_Info_2 : ['C2, ', 'B2, ', 'E1, ', 'A4, ', 'B1, ', 'A1, ', 'A3, ', 'A2, ', 'A5, ', 'C1, ', 'A7, ', 'A6, ', 'C3, ', 'A8, ', 'D4, ', 'C4, ', 'D2, ', 'D3, ', 'D1, ']
Product_Info_3 : ['1, ', '2, ', '3, ', '4, ', '5, ', '6, ', '8, ', '9, ', '10, ', '11, ', '12, ', '13, ', '15, ', '16, ', '17, ', '18, ', '19, ', '20, ', '21, ', '22, ', '23, ', '24, ', '26, ', '27, ', '28, ', '29, ', '30, ', '31, ', '32, ', '33, ', '34, ', '36, ', '37, ', '38, ']
Product_Info_5 : ['2, ', '3, ']
Product_Info_6 : ['1, ', '3, ']
Product_Info_7 : ['1, ', '2, ', '3, ']
Employment_Info_2 : ['1, ', '2, ', '3, ', '4, ', '5, ', '6, ', '7, ', '9, ', '10, ', '11, ', '12, ', '13, ', '14, ', '15, ', '16, ', '17, ', '18, ', '19, ', '20, ', '21, ', '22, ', '23, ', '25, ', '26, ', '27, ', '28, ', '29, ', '30, ', '31, ', '32, ', '33, ', '34, ', '35, ', '36, ', '37, ', '38, ']
Employment_Info_3 : ['1, ', '3, ']
Employment_Info_5 : ['2, ', '3, ']
InsuredInfo_1 : ['1, ', '2, ', '3, ']
InsuredInfo_2 : ['2, ', '3, ']
InsuredInfo_3 : ['1, ', '2, ', '3, ', '4, ', '5, ', '6, ', '7, ', '8, ', '9, ', '10, ', '11, ']
InsuredInfo_4 : ['2, ', '3, ']
InsuredInfo_5 : ['1, ', '3, ']
InsuredInfo_6 : ['1, ', '2, ']
InsuredInfo_7 : ['1, ', '3, ']
Insurance_History_1 : ['1, ', '2, ']
Insurance_History_2 : ['1, ', '2, ', '3, ']
Insurance_History_3 : ['1, ', '2, ', '3, ']
Insurance_History_4 : ['1, ', '2, ', '3, ']
Insurance_History_7 : ['1, ', '2, ', '3, ']
Insurance_History_8 : ['1, ', '2, ', '3, ']
Insurance_History_9 : ['1, ', '2, ', '3, ']
Family_Hist_1 : ['1, ', '2, ', '3, ']
Medical_History_2 : ['1, ', '2, ', '3, ', '5, ', '6, ', '7, ', '8, ', '9, ', '10, ', '12, ', '13, ', '14, ', '15, ', '16, ', '17, ', '18, ', '19, ', '20, ', '21, ', '22, ', '23, ', '24, ', '25, ', '26, ', '27, ', '28, ', '29, ', '30, ', '32, ', '33, ', '34, ', '35, ', '36, ', '37, ', '38, ', '39, ', '40, ', '41, ', '42, ', '43, ', '44, ', '45, ', '46, ', '47, ', '48, ', '50, ', '51, ', '52, ', '53, ', '54, ', '55, ', '56, ', '57, ', '58, ', '59, ', '60, ', '61, ', '62, ', '63, ', '64, ', '66, ', '67, ', '68, ', '69, ', '70, ', '71, ', '72, ', '73, ', '74, ', '75, ', '76, ', '77, ', '78, ', '79, ', '81, ', '82, ', '84, ', '85, ', '86, ', '87, ', '88, ', '89, ', '90, ', '91, ', '93, ', '94, ', '95, ', '96, ', '97, ', '98, ', '99, ', '100, ', '101, ', '102, ', '104, ', '105, ', '106, ', '107, ', '108, ', '109, ', '110, ', '111, ', '112, ', '113, ', '114, ', '115, ', '116, ', '117, ', '120, ', '121, ', '122, ', '123, ', '124, ', '125, ', '127, ', '128, ', '129, ', '131, ', '132, ', '133, ', '134, ', '135, ', '136, ', '137, ', '138, ', '139, ', '140, ', '141, ', '142, ', '143, ', '144, ', '145, ', '146, ', '147, ', '148, ', '149, ', '150, ', '151, ', '152, ', '153, ', '154, ', '155, ', '156, ', '157, ', '158, ', '159, ', '160, ', '161, ', '162, ', '163, ', '164, ', '165, ', '166, ', '167, ', '169, ', '170, ', '171, ', '172, ', '173, ', '174, ', '175, ', '177, ', '179, ', '180, ', '181, ', '182, ', '183, ', '184, ', '185, ', '186, ', '187, ', '188, ', '189, ', '190, ', '191, ', '192, ', '193, ', '195, ', '196, ', '197, ', '198, ', '199, ', '200, ', '201, ', '202, ', '203, ', '204, ', '205, ', '207, ', '208, ', '209, ', '210, ', '212, ', '213, ', '214, ', '215, ', '216, ', '217, ', '218, ', '219, ', '220, ', '221, ', '222, ', '223, ', '224, ', '225, ', '226, ', '227, ', '228, ', '229, ', '230, ', '231, ', '232, ', '233, ', '234, ', '235, ', '236, ', '238, ', '239, ', '240, ', '241, ', '242, ', '243, ', '245, ', '247, ', '248, ', '249, ', '250, ', '251, ', '252, ', '253, ', '255, ', '256, ', '257, ', '258, ', '259, ', '260, ', '261, ', '262, ', '264, ', '265, ', '266, ', '267, ', '268, ', '270, ', '271, ', '272, ', '273, ', '274, ', '275, ', '276, ', '277, ', '278, ', '279, ', '280, ', '281, ', '282, ', '283, ', '285, ', '286, ', '287, ', '288, ', '289, ', '290, ', '291, ', '293, ', '294, ', '295, ', '296, ', '297, ', '298, ', '299, ', '301, ', '302, ', '303, ', '305, ', '306, ', '307, ', '310, ', '311, ', '313, ', '314, ', '315, ', '316, ', '317, ', '318, ', '319, ', '320, ', '321, ', '322, ', '323, ', '324, ', '326, ', '327, ', '328, ', '329, ', '330, ', '331, ', '332, ', '333, ', '334, ', '335, ', '336, ', '337, ', '338, ', '343, ', '344, ', '345, ', '346, ', '347, ', '348, ', '349, ', '350, ', '351, ', '352, ', '353, ', '354, ', '355, ', '357, ', '358, ', '360, ', '361, ', '362, ', '363, ', '364, ', '366, ', '368, ', '369, ', '370, ', '371, ', '372, ', '373, ', '374, ', '375, ', '376, ', '377, ', '378, ', '379, ', '380, ', '381, ', '382, ', '383, ', '384, ', '385, ', '386, ', '387, ', '388, ', '389, ', '390, ', '391, ', '392, ', '393, ', '394, ', '395, ', '396, ', '397, ', '398, ', '399, ', '400, ', '403, ', '404, ', '405, ', '406, ', '407, ', '408, ', '409, ', '410, ', '411, ', '412, ', '413, ', '414, ', '415, ', '416, ', '417, ', '418, ', '419, ', '420, ', '421, ', '422, ', '426, ', '427, ', '428, ', '430, ', '431, ', '432, ', '433, ', '434, ', '435, ', '436, ', '437, ', '438, ', '439, ', '440, ', '441, ', '443, ', '444, ', '446, ', '447, ', '448, ', '449, ', '451, ', '452, ', '453, ', '455, ', '456, ', '457, ', '458, ', '459, ', '461, ', '462, ', '464, ', '465, ', '466, ', '467, ', '468, ', '469, ', '470, ', '471, ', '472, ', '473, ', '474, ', '475, ', '476, ', '477, ', '478, ', '479, ', '480, ', '481, ', '482, ', '483, ', '484, ', '486, ', '487, ', '488, ', '489, ', '490, ', '491, ', '492, ', '493, ', '494, ', '495, ', '496, ', '497, ', '498, ', '499, ', '501, ', '502, ', '503, ', '504, ', '505, ', '506, ', '507, ', '509, ', '510, ', '511, ', '512, ', '513, ', '514, ', '515, ', '516, ', '517, ', '518, ', '519, ', '520, ', '522, ', '523, ', '524, ', '525, ', '526, ', '527, ', '528, ', '529, ', '530, ', '531, ', '532, ', '533, ', '534, ', '536, ', '537, ', '538, ', '540, ', '541, ', '542, ', '543, ', '544, ', '545, ', '546, ', '548, ', '549, ', '550, ', '551, ', '552, ', '553, ', '554, ', '557, ', '558, ', '559, ', '560, ', '561, ', '562, ', '563, ', '564, ', '565, ', '566, ', '567, ', '568, ', '569, ', '570, ', '571, ', '572, ', '573, ', '575, ', '576, ', '577, ', '578, ', '579, ', '580, ', '581, ', '582, ', '583, ', '584, ', '586, ', '587, ', '588, ', '589, ', '590, ', '591, ', '592, ', '593, ', '595, ', '596, ', '598, ', '599, ', '600, ', '601, ', '602, ', '603, ', '605, ', '606, ', '607, ', '608, ', '609, ', '610, ', '611, ', '613, ', '614, ', '615, ', '616, ', '617, ', '618, ', '619, ', '620, ', '621, ', '622, ', '623, ', '624, ', '626, ', '627, ', '628, ', '629, ', '630, ', '631, ', '632, ', '633, ', '634, ', '635, ', '636, ', '637, ', '638, ', '639, ', '640, ', '641, ', '642, ', '643, ', '644, ', '645, ', '646, ', '647, ', '648, ']
Medical_History_3 : ['1, ', '2, ', '3, ']
Medical_History_4 : ['1, ', '2, ']
Medical_History_5 : ['1, ', '2, ', '3, ']
Medical_History_6 : ['1, ', '2, ', '3, ']
Medical_History_7 : ['1, ', '2, ', '3, ']
Medical_History_8 : ['1, ', '2, ', '3, ']
Medical_History_9 : ['1, ', '2, ', '3, ']
Medical_History_11 : ['1, ', '2, ', '3, ']
Medical_History_12 : ['1, ', '2, ', '3, ']
Medical_History_13 : ['1, ', '2, ', '3, ']
Medical_History_14 : ['1, ', '2, ', '3, ']
Medical_History_16 : ['1, ', '2, ', '3, ']
Medical_History_17 : ['1, ', '2, ', '3, ']
Medical_History_18 : ['1, ', '2, ', '3, ']
Medical_History_19 : ['1, ', '2, ', '3, ']
Medical_History_20 : ['1, ', '2, ', '3, ']
Medical_History_21 : ['1, ', '2, ', '3, ']
Medical_History_22 : ['1, ', '2, ']
Medical_History_23 : ['1, ', '2, ', '3, ']
Medical_History_25 : ['1, ', '2, ', '3, ']
Medical_History_26 : ['1, ', '2, ', '3, ']
Medical_History_27 : ['1, ', '2, ', '3, ']
Medical_History_28 : ['1, ', '2, ', '3, ']
Medical_History_29 : ['1, ', '2, ', '3, ']
Medical_History_30 : ['1, ', '2, ', '3, ']
Medical_History_31 : ['1, ', '2, ', '3, ']
Medical_History_33 : ['1, ', '3, ']
Medical_History_34 : ['1, ', '2, ', '3, ']
Medical_History_35 : ['1, ', '2, ', '3, ']
Medical_History_36 : ['1, ', '2, ', '3, ']
Medical_History_37 : ['1, ', '2, ', '3, ']
Medical_History_38 : ['1, ', '2, ']
Medical_History_39 : ['1, ', '2, ', '3, ']
Medical_History_40 : ['1, ', '2, ', '3, ']
Medical_History_41 : ['1, ', '2, ', '3, ']
Response : ['1, ', '2, ', '3, ', '4, ', '5, ', '6, ', '7, ', '8, ']



In [28]:

    
df_prod_info = pd.DataFrame(d, columns=(["Response"]+ [ "Product_Info_"+str(x) for x in range(1,8)])) 
df_emp_info = pd.DataFrame(d, columns=(["Response"]+ [ "Employment_Info_"+str(x) for x in range(1,6)])) 
df_bio = pd.DataFrame(d, columns=["Response", "Ins_Age", "Ht", "Wt","BMI"])
df_med_kw = pd.DataFrame(d, columns=(["Response"]+ [ "Medical_Keyword_"+str(x) for x in range(1,48)])).add(axis=[ "Medical_Keyword_"+str(x) for x in range(1,48)])
df_med_kw.describe()









    



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-49219c99689d> in <module>()
      2 df_emp_info = pd.DataFrame(d, columns=(["Response"]+ [ "Employment_Info_"+str(x) for x in range(1,6)]))
      3 df_bio = pd.DataFrame(d, columns=["Response", "Ins_Age", "Ht", "Wt","BMI"])
----> 4 df_med_kw = pd.DataFrame(d, columns=(["Response"]+ [ "Medical_Keyword_"+str(x) for x in range(1,48)])).add(axis=[ "Medical_Keyword_"+str(x) for x in range(1,48)])
      5 df_med_kw.describe()

TypeError: f() takes at least 2 arguments (2 given)



In [6]:

    
df.head(5)









    Out[6]:






  
    
      
      Id
      Product_Info_1
      Product_Info_2
      Product_Info_3
      Product_Info_4
      Product_Info_5
      Product_Info_6
      Product_Info_7
      Ins_Age
      Ht
      ...
      Medical_Keyword_40
      Medical_Keyword_41
      Medical_Keyword_42
      Medical_Keyword_43
      Medical_Keyword_44
      Medical_Keyword_45
      Medical_Keyword_46
      Medical_Keyword_47
      Medical_Keyword_48
      Response
    
  
  
    
      0
       2
       1
       D3
       10
       0.076923
       2
       1
       1
       0.641791
       0.581818
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       8
    
    
      1
       5
       1
       A1
       26
       0.076923
       2
       3
       1
       0.059701
       0.600000
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       4
    
    
      2
       6
       1
       E1
       26
       0.076923
       2
       3
       1
       0.029851
       0.745455
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       8
    
    
      3
       7
       1
       D4
       10
       0.487179
       2
       3
       1
       0.164179
       0.672727
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       8
    
    
      4
       8
       1
       D2
       26
       0.230769
       2
       3
       1
       0.417910
       0.654545
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       8
    
  

5 rows × 128 columns



In [7]:

    
df.describe()









    Out[7]:






  
    
      
      Id
      Product_Info_1
      Product_Info_3
      Product_Info_4
      Product_Info_5
      Product_Info_6
      Product_Info_7
      Ins_Age
      Ht
      Wt
      ...
      Medical_Keyword_40
      Medical_Keyword_41
      Medical_Keyword_42
      Medical_Keyword_43
      Medical_Keyword_44
      Medical_Keyword_45
      Medical_Keyword_46
      Medical_Keyword_47
      Medical_Keyword_48
      Response
    
  
  
    
      count
       59381.000000
       59381.000000
       59381.000000
       59381.000000
       59381.000000
       59381.000000
       59381.000000
       59381.000000
       59381.000000
       59381.000000
      ...
       59381.000000
       59381.000000
       59381.000000
       59381.000000
       59381.000000
       59381.000000
       59381.000000
       59381.000000
       59381.000000
       59381.000000
    
    
      mean
       39507.211515
           1.026355
          24.415655
           0.328952
           2.006955
           2.673599
           1.043583
           0.405567
           0.707283
           0.292587
      ...
           0.056954
           0.010054
           0.045536
           0.010710
           0.007528
           0.013691
           0.008488
           0.019905
           0.054496
           5.636837
    
    
      std
       22815.883089
           0.160191
           5.072885
           0.282562
           0.083107
           0.739103
           0.291949
           0.197190
           0.074239
           0.089037
      ...
           0.231757
           0.099764
           0.208479
           0.102937
           0.086436
           0.116207
           0.091737
           0.139676
           0.226995
           2.456833
    
    
      min
           2.000000
           1.000000
           1.000000
           0.000000
           2.000000
           1.000000
           1.000000
           0.000000
           0.000000
           0.000000
      ...
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           1.000000
    
    
      25%
       19780.000000
           1.000000
          26.000000
           0.076923
           2.000000
           3.000000
           1.000000
           0.238806
           0.654545
           0.225941
      ...
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           4.000000
    
    
      50%
       39487.000000
           1.000000
          26.000000
           0.230769
           2.000000
           3.000000
           1.000000
           0.402985
           0.709091
           0.288703
      ...
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           6.000000
    
    
      75%
       59211.000000
           1.000000
          26.000000
           0.487179
           2.000000
           3.000000
           1.000000
           0.567164
           0.763636
           0.345188
      ...
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           0.000000
           8.000000
    
    
      max
       79146.000000
           2.000000
          38.000000
           1.000000
           3.000000
           3.000000
           3.000000
           1.000000
           1.000000
           1.000000
      ...
           1.000000
           1.000000
           1.000000
           1.000000
           1.000000
           1.000000
           1.000000
           1.000000
           1.000000
           8.000000
    
  

8 rows × 127 columns

Grouping of various categorical data sets

Histograms and descriptive statistics for Risk Response, Ins_Age, BMI, Wt



In [9]:

    
plt.figure(0)
plt.title("Categorical - Histogram for Risk Response")
plt.xlabel("Risk Response (1-7)")
plt.ylabel("Frequency")
plt.hist(df.Response)
plt.savefig('images/hist_Response.png')
print df.Response.describe()
print ""


plt.figure(1)
plt.title("Continuous - Histogram for Ins_Age")
plt.xlabel("Normalized Ins_Age [0,1]")
plt.ylabel("Frequency")
plt.hist(df.Ins_Age)
plt.savefig('images/hist_Ins_Age.png')
print df.Ins_Age.describe()
print ""

plt.figure(2)
plt.title("Continuous - Histogram for BMI")
plt.xlabel("Normalized BMI [0,1]")
plt.ylabel("Frequency")
plt.hist(df.BMI)
plt.savefig('images/hist_BMI.png')
print df.BMI.describe()
print ""

plt.figure(3)
plt.title("Continuous - Histogram for Wt")
plt.xlabel("Normalized Wt [0,1]")
plt.ylabel("Frequency")
plt.hist(df.Wt)
plt.savefig('images/hist_Wt.png')
print df.Wt.describe()
print ""

plt.show()









    



count    59381.000000
mean         5.636837
std          2.456833
min          1.000000
25%          4.000000
50%          6.000000
75%          8.000000
max          8.000000
Name: Response, dtype: float64

count    59381.000000
mean         0.405567
std          0.197190
min          0.000000
25%          0.238806
50%          0.402985
75%          0.567164
max          1.000000
Name: Ins_Age, dtype: float64

count    59381.000000
mean         0.469462
std          0.122213
min          0.000000
25%          0.385517
50%          0.451349
75%          0.532858
max          1.000000
Name: BMI, dtype: float64

count    59381.000000
mean         0.292587
std          0.089037
min          0.000000
25%          0.225941
50%          0.288703
75%          0.345188
max          1.000000
Name: Wt, dtype: float64

Histograms and descriptive statistics for Product_Info_1-7



In [10]:

    
for i in range(1,8):
    
    print "The iteration is: "+str(i)
    print df['Product_Info_'+str(i)].describe()
    print ""
    
    plt.figure(i)

    if(i == 4):
        plt.title("Continuous - Histogram for Product_Info_"+str(i))
        plt.xlabel("Normalized value: [0,1]")
        plt.ylabel("Frequency")
    else:
        plt.title("Categorical - Histogram of Product_Info_"+str(i))
        plt.xlabel("Categories")
        plt.ylabel("Frequency")
    
    if(i == 2):
        df.Product_Info_2.value_counts().plot(kind='bar')
    else:
        plt.hist(df['Product_Info_'+str(i)])
        
    plt.savefig('images/hist_Product_Info_'+str(i)+'.png')

plt.show()









    



The iteration is: 1
count    59381.000000
mean         1.026355
std          0.160191
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          2.000000
Name: Product_Info_1, dtype: float64

The iteration is: 2
count     59381
unique       19
top          D3
freq      14321
Name: Product_Info_2, dtype: object

The iteration is: 3
count    59381.000000
mean        24.415655
std          5.072885
min          1.000000
25%         26.000000
50%         26.000000
75%         26.000000
max         38.000000
Name: Product_Info_3, dtype: float64

The iteration is: 4
count    59381.000000
mean         0.328952
std          0.282562
min          0.000000
25%          0.076923
50%          0.230769
75%          0.487179
max          1.000000
Name: Product_Info_4, dtype: float64

The iteration is: 5
count    59381.000000
mean         2.006955
std          0.083107
min          2.000000
25%          2.000000
50%          2.000000
75%          2.000000
max          3.000000
Name: Product_Info_5, dtype: float64

The iteration is: 6
count    59381.000000
mean         2.673599
std          0.739103
min          1.000000
25%          3.000000
50%          3.000000
75%          3.000000
max          3.000000
Name: Product_Info_6, dtype: float64

The iteration is: 7
count    59381.000000
mean         1.043583
std          0.291949
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          3.000000
Name: Product_Info_7, dtype: float64

Split dataframes into categorical, continuous, discrete, dummy, and response



In [11]:

    
catD = df.loc[:,varTypes['categorical']]
contD = df.loc[:,varTypes['continuous']]
disD = df.loc[:,varTypes['discrete']]
dummyD = df.loc[:,varTypes['dummy']]
respD = df.loc[:,['id','Response']]

Descriptive statistics and scatter plot relating Product_Info_2 and Response



In [12]:

    
prod_info = [ "Product_Info_"+str(i) for i in range(1,8)]

a = catD.loc[:, prod_info[1]]

stats = catD.groupby(prod_info[1]).describe()



In [192]:

    
c = gb_PI2.Response.count()
plt.figure(0)

plt.scatter(c[0],c[1])









    Out[192]:





<matplotlib.collections.PathCollection at 0x419ea828>



In [64]:

    
plt.figure(0)
plt.title("Histogram of "+"Product_Info_"+str(i))
plt.xlabel("Categories " + str((a.describe())['count']))
plt.ylabel("Frequency")









    Out[64]:





'Product_Info_3'



In [61]:

    
for i in range(1,8):
    a = catD.loc[:, "Product_Info_"+str(i)]
    if(i is not 4):
        print a.describe()
    print ""
    
    plt.figure(i)
    plt.title("Histogram of "+"Product_Info_"+str(i))
    plt.xlabel("Categories " + str((catD.groupby(key).describe())['count']))
    plt.ylabel("Frequency")
    
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    
    if a.dtype in (np.int64, np.float, float, int):
        a.hist()
        
# Random functions
#catD.Product_Info_1.describe()
#catD.loc[:, prod_info].groupby('Product_Info_2').describe()
#df[varTypes['categorical']].hist()









    



count    59381.000000
mean         1.026355
std          0.160191
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          2.000000
Name: Product_Info_1, dtype: float64







    



---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-61-64afcafe78d9> in <module>()
      9     plt.figure(i)
     10     plt.title("Histogram of "+"Product_Info_"+str(i))
---> 11     plt.xlabel("Categories " + str((catD.groupby(key).describe())['count']))
     12     plt.ylabel("Frequency")
     13     #fig, axes = plt.subplots(nrows = 1, ncols = 2)

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in describe(self, percentile_width, percentiles, include, exclude)

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in wrapper(*args, **kwargs)
    559             except Exception:
    560                 try:
--> 561                     return self.apply(curried)
    562                 except Exception:
    563 

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in apply(self, func, *args, **kwargs)
    660         # ignore SettingWithCopy here in case the user mutates
    661         with option_context('mode.chained_assignment',None):
--> 662             return self._python_apply_general(f)
    663 
    664     def _python_apply_general(self, f):

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in _python_apply_general(self, f)
    664     def _python_apply_general(self, f):
    665         keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 666                                                    self.axis)
    667 
    668         return self._wrap_applied_output(keys, values,

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in apply(self, f, data, axis)
   1272                 hasattr(splitter, 'fast_apply') and axis == 0):
   1273             try:
-> 1274                 values, mutated = splitter.fast_apply(f, group_keys)
   1275                 return group_keys, values, mutated
   1276             except (lib.InvalidApply):

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in fast_apply(self, f, names)
   3444 
   3445         sdata = self._get_sorted_data()
-> 3446         results, mutated = lib.apply_frame_axis0(sdata, f, names, starts, ends)
   3447 
   3448         return results, mutated

pandas\src\reduce.pyx in pandas.lib.apply_frame_axis0 (pandas\lib.c:38246)()

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in f(g)
    656         @wraps(func)
    657         def f(g):
--> 658             return func(g, *args, **kwargs)
    659 
    660         # ignore SettingWithCopy here in case the user mutates

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in curried(x)
    544 
    545             def curried(x):
--> 546                 return f(x, *args, **kwargs)
    547 
    548             # preserve the name so we can detect it when calling plot methods,

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\generic.pyc in describe(self, percentile_width, percentiles, include, exclude)
   3841             data = self.select_dtypes(include=include, exclude=exclude)
   3842 
-> 3843         ldesc = [describe_1d(s, percentiles) for _, s in data.iteritems()]
   3844         # set a convenient order for rows
   3845         names = []

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\generic.pyc in describe_1d(data, percentiles)
   3819         def describe_1d(data, percentiles):
   3820             if com.is_numeric_dtype(data):
-> 3821                 return describe_numeric_1d(data, percentiles)
   3822             elif com.is_timedelta64_dtype(data):
   3823                 return describe_numeric_1d(data, percentiles)

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\generic.pyc in describe_numeric_1d(series, percentiles)
   3793                   [pretty_name(x) for x in percentiles] + ['max'])
   3794             d = ([series.count(), series.mean(), series.std(), series.min()] +
-> 3795                  [series.quantile(x) for x in percentiles] + [series.max()])
   3796             return pd.Series(d, index=stat_index, name=series.name)
   3797 

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\series.pyc in quantile(self, q)
   1264                 return _quantile(values, qs*100)
   1265 
-> 1266         return self._maybe_box(lambda values: multi(values, q), dropna=True)
   1267 
   1268     def ptp(self, axis=None, out=None):

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\series.pyc in _maybe_box(self, func, dropna)
   2121                 if len(values) == 0:
   2122                     return np.nan
-> 2123             result = func(values)
   2124 
   2125         return result

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\series.pyc in <lambda>(values)
   1264                 return _quantile(values, qs*100)
   1265 
-> 1266         return self._maybe_box(lambda values: multi(values, q), dropna=True)
   1267 
   1268     def ptp(self, axis=None, out=None):

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\series.pyc in multi(values, qs)
   1258 
   1259         def multi(values, qs):
-> 1260             if com.is_list_like(qs):
   1261                 return Series([_quantile(values, x*100)
   1262                                for x in qs], index=qs)

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\common.pyc in is_list_like(arg)
   2501 
   2502 def is_list_like(arg):
-> 2503     return (hasattr(arg, '__iter__') and
   2504             not isinstance(arg, compat.string_and_binary_types))
   2505 

KeyboardInterrupt:



In [15]:

    
catD.head(5)









    Out[15]:






  
    
      
      Id
      Product_Info_1
      Product_Info_2
      Product_Info_3
      Product_Info_5
      Product_Info_6
      Product_Info_7
      Employment_Info_2
      Employment_Info_3
      Employment_Info_5
      ...
      Medical_History_33
      Medical_History_34
      Medical_History_35
      Medical_History_36
      Medical_History_37
      Medical_History_38
      Medical_History_39
      Medical_History_40
      Medical_History_41
      Response
    
  
  
    
      0
       2
       1
       D3
       10
       2
       1
       1
       12
       1
       3
      ...
       1
       3
       1
       2
       2
       1
       3
       3
       3
       8
    
    
      1
       5
       1
       A1
       26
       2
       3
       1
        1
       3
       2
      ...
       3
       1
       1
       2
       2
       1
       3
       3
       1
       4
    
    
      2
       6
       1
       E1
       26
       2
       3
       1
        9
       1
       2
      ...
       3
       3
       1
       3
       2
       1
       3
       3
       1
       8
    
    
      3
       7
       1
       D4
       10
       2
       3
       1
        9
       1
       3
      ...
       3
       3
       1
       2
       2
       1
       3
       3
       1
       8
    
    
      4
       8
       1
       D2
       26
       2
       3
       1
        9
       1
       2
      ...
       3
       3
       1
       3
       2
       1
       3
       3
       1
       8
    
  

5 rows × 62 columns



In [16]:

    
#Exploration of the discrete data
disD.describe()









    Out[16]:






  
    
      
      Id
      Medical_History_1
      Medical_History_10
      Medical_History_15
      Medical_History_24
      Medical_History_32
      Response
    
  
  
    
      count
       59381.000000
       50492.000000
       557.000000
       14785.000000
       3801.000000
       1107.000000
       59381.000000
    
    
      mean
       39507.211515
           7.962172
       141.118492
         123.760974
         50.635622
         11.965673
           5.636837
    
    
      std
       22815.883089
          13.027697
       107.759559
          98.516206
         78.149069
         38.718774
           2.456833
    
    
      min
           2.000000
           0.000000
         0.000000
           0.000000
          0.000000
          0.000000
           1.000000
    
    
      25%
       19780.000000
           2.000000
         8.000000
          17.000000
          1.000000
          0.000000
           4.000000
    
    
      50%
       39487.000000
           4.000000
       229.000000
         117.000000
          8.000000
          0.000000
           6.000000
    
    
      75%
       59211.000000
           9.000000
       240.000000
         240.000000
         64.000000
          2.000000
           8.000000
    
    
      max
       79146.000000
         240.000000
       240.000000
         240.000000
        240.000000
        240.000000
           8.000000



In [17]:

    
disD.head(5)









    Out[17]:






  
    
      
      Id
      Medical_History_1
      Medical_History_10
      Medical_History_15
      Medical_History_24
      Medical_History_32
      Response
    
  
  
    
      0
       2
        4
      NaN
       240
      NaN
      NaN
       8
    
    
      1
       5
        5
      NaN
         0
      NaN
      NaN
       4
    
    
      2
       6
       10
      NaN
       NaN
      NaN
      NaN
       8
    
    
      3
       7
        0
      NaN
       NaN
      NaN
      NaN
       8
    
    
      4
       8
      NaN
      NaN
       NaN
      NaN
      NaN
       8



In [60]:

    
#Iterate through each categorical column of data
#Perform a 2D histogram later

i=0    
for key in varTypes['categorical']:
    
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    plt.figure(i)
    plt.title("Histogram of "+str(key))
    plt.xlabel("Categories " + str((df.groupby(key).describe())['count']))
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    if df[key].dtype in (np.int64, np.float, float, int):
        df[key].hist()
    
    i+=1









    



---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-60-515469ef3abc> in <module>()
      8     plt.figure(i)
      9     plt.title("Histogram of "+str(key))
---> 10     plt.xlabel("Categories " + str((df.groupby(key).describe())['count']))
     11     #fig, axes = plt.subplots(nrows = 1, ncols = 2)
     12     #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in describe(self, percentile_width, percentiles, include, exclude)

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in wrapper(*args, **kwargs)
    559             except Exception:
    560                 try:
--> 561                     return self.apply(curried)
    562                 except Exception:
    563 

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in apply(self, func, *args, **kwargs)
    660         # ignore SettingWithCopy here in case the user mutates
    661         with option_context('mode.chained_assignment',None):
--> 662             return self._python_apply_general(f)
    663 
    664     def _python_apply_general(self, f):

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in _python_apply_general(self, f)
    664     def _python_apply_general(self, f):
    665         keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 666                                                    self.axis)
    667 
    668         return self._wrap_applied_output(keys, values,

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in apply(self, f, data, axis)
   1272                 hasattr(splitter, 'fast_apply') and axis == 0):
   1273             try:
-> 1274                 values, mutated = splitter.fast_apply(f, group_keys)
   1275                 return group_keys, values, mutated
   1276             except (lib.InvalidApply):

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in fast_apply(self, f, names)
   3444 
   3445         sdata = self._get_sorted_data()
-> 3446         results, mutated = lib.apply_frame_axis0(sdata, f, names, starts, ends)
   3447 
   3448         return results, mutated

pandas\src\reduce.pyx in pandas.lib.apply_frame_axis0 (pandas\lib.c:38246)()

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in f(g)
    656         @wraps(func)
    657         def f(g):
--> 658             return func(g, *args, **kwargs)
    659 
    660         # ignore SettingWithCopy here in case the user mutates

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in curried(x)
    544 
    545             def curried(x):
--> 546                 return f(x, *args, **kwargs)
    547 
    548             # preserve the name so we can detect it when calling plot methods,

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\generic.pyc in describe(self, percentile_width, percentiles, include, exclude)
   3841             data = self.select_dtypes(include=include, exclude=exclude)
   3842 
-> 3843         ldesc = [describe_1d(s, percentiles) for _, s in data.iteritems()]
   3844         # set a convenient order for rows
   3845         names = []

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\generic.pyc in describe_1d(data, percentiles)
   3819         def describe_1d(data, percentiles):
   3820             if com.is_numeric_dtype(data):
-> 3821                 return describe_numeric_1d(data, percentiles)
   3822             elif com.is_timedelta64_dtype(data):
   3823                 return describe_numeric_1d(data, percentiles)

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\generic.pyc in describe_numeric_1d(series, percentiles)
   3794             d = ([series.count(), series.mean(), series.std(), series.min()] +
   3795                  [series.quantile(x) for x in percentiles] + [series.max()])
-> 3796             return pd.Series(d, index=stat_index, name=series.name)
   3797 
   3798 

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\series.pyc in __init__(self, data, index, dtype, name, copy, fastpath)
    129 
    130             if index is not None:
--> 131                 index = _ensure_index(index)
    132 
    133             if data is None:

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\index.pyc in _ensure_index(index_like, copy)
   4767             index_like = copy(index_like)
   4768 
-> 4769     return Index(index_like)
   4770 
   4771 

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\index.pyc in __new__(cls, data, dtype, copy, name, fastpath, tupleize_cols, **kwargs)
    212                     return PeriodIndex(subarr, name=name, **kwargs)
    213 
--> 214         return cls._simple_new(subarr, name)
    215 
    216     @classmethod

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\index.pyc in _simple_new(cls, values, name, **kwargs)
    221         for k, v in compat.iteritems(kwargs):
    222             setattr(result,k,v)
--> 223         result._reset_identity()
    224         return result
    225 

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\index.pyc in _reset_identity(self)
    249     def _reset_identity(self):
    250         """Initializes or resets ``_id`` attribute with new object"""
--> 251         self._id = _Identity()
    252 
    253     # ndarray compat

KeyboardInterrupt:



In [ ]:

    
#Iterate through each 'discrete' column of data
#Perform a 2D histogram later

i=0    
for key in varTypes['discrete']:
    
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    plt.figure(i)
    fig, axes = plt.subplots(nrows = 1, ncols = 2)
    
    #Histogram based on normalized value counts of the data set
    disD[key].value_counts().hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    
    #Cumulative histogram based on normalized value counts of the data set
    disD[key].value_counts().hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    i+=1



In [ ]:

    
#2D Histogram

i=0    
for key in varTypes['categorical']:
    
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    plt.figure(i)
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    
    x = catD[key].value_counts(normalize=True)
    y = df['Response']
    
    plt.hist2d(x[1], y, bins=40, norm=LogNorm())
    plt.colorbar()
    
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    i+=1



In [ ]:

    
#Iterate through each categorical column of data
#Perform a 2D histogram later

i=0    
for key in varTypes['categorical']:
    
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    plt.figure(i)
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    if df[key].dtype in (np.int64, np.float, float, int):
        #(1.*df[key].value_counts()/len(df[key])).hist()
        df[key].value_counts(normalize=True).plot(kind='bar')
    
    i+=1



In [12]:

    
df.loc('Product_Info_1')









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-78d3c441c23a> in <module>()
----> 1 df.loc('Product_Info_1').head()

AttributeError: '_LocIndexer' object has no attribute 'head'

	nResponse	nProduct_Info_3
0	1.000000	0.243243
1	0.428571	0.675676
2	1.000000	0.675676
3	1.000000	0.243243
4	1.000000	0.675676

	Id	Product_Info_1	Product_Info_2	Product_Info_3	Product_Info_4	Product_Info_5	Product_Info_6	Product_Info_7	Ins_Age	Ht	...	Response
0	2	1	D3	10	0.076923	2	1	1	0.641791	0.581818	...	8
1	5	1	A1	26	0.076923	2	3	1	0.059701	0.600000	...	4
2	6	1	E1	26	0.076923	2	3	1	0.029851	0.745455	...	8
3	7	1	D4	10	0.487179	2	3	1	0.164179	0.672727	...	8
4	8	1	D2	26	0.230769	2	3	1	0.417910	0.654545	...	8

	Id	Product_Info_1	Product_Info_3	Product_Info_4	Product_Info_5	Product_Info_6	Product_Info_7	Ins_Age	Ht	Wt	...	Medical_Keyword_40	Medical_Keyword_41	Medical_Keyword_42	Medical_Keyword_43	Medical_Keyword_44	Medical_Keyword_45	Medical_Keyword_46	Medical_Keyword_47	Medical_Keyword_48	Response
count	59381.000000	59381.000000	59381.000000	59381.000000	59381.000000	59381.000000	59381.000000	59381.000000	59381.000000	59381.000000	...	59381.000000	59381.000000	59381.000000	59381.000000	59381.000000	59381.000000	59381.000000	59381.000000	59381.000000	59381.000000
mean	39507.211515	1.026355	24.415655	0.328952	2.006955	2.673599	1.043583	0.405567	0.707283	0.292587	...	0.056954	0.010054	0.045536	0.010710	0.007528	0.013691	0.008488	0.019905	0.054496	5.636837
std	22815.883089	0.160191	5.072885	0.282562	0.083107	0.739103	0.291949	0.197190	0.074239	0.089037	...	0.231757	0.099764	0.208479	0.102937	0.086436	0.116207	0.091737	0.139676	0.226995	2.456833
min	2.000000	1.000000	1.000000	0.000000	2.000000	1.000000	1.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000
25%	19780.000000	1.000000	26.000000	0.076923	2.000000	3.000000	1.000000	0.238806	0.654545	0.225941	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	4.000000
50%	39487.000000	1.000000	26.000000	0.230769	2.000000	3.000000	1.000000	0.402985	0.709091	0.288703	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	6.000000
75%	59211.000000	1.000000	26.000000	0.487179	2.000000	3.000000	1.000000	0.567164	0.763636	0.345188	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	8.000000
max	79146.000000	2.000000	38.000000	1.000000	3.000000	3.000000	3.000000	1.000000	1.000000	1.000000	...	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	8.000000

	Id	Product_Info_1	Product_Info_2	Product_Info_3	Product_Info_5	Product_Info_6	Product_Info_7	Employment_Info_2	Employment_Info_3	Employment_Info_5	...	Medical_History_33	Medical_History_34	Medical_History_35	Medical_History_36	Medical_History_37	Medical_History_38	Medical_History_39	Medical_History_40	Medical_History_41	Response
0	2	1	D3	10	2	1	1	12	1	3	...	1	3	1	2	2	1	3	3	3	8
1	5	1	A1	26	2	3	1	1	3	2	...	3	1	1	2	2	1	3	3	1	4
2	6	1	E1	26	2	3	1	9	1	2	...	3	3	1	3	2	1	3	3	1	8
3	7	1	D4	10	2	3	1	9	1	3	...	3	3	1	2	2	1	3	3	1	8
4	8	1	D2	26	2	3	1	9	1	2	...	3	3	1	3	2	1	3	3	1	8

	Id	Medical_History_1	Medical_History_10	Medical_History_15	Medical_History_24	Medical_History_32	Response
count	59381.000000	50492.000000	557.000000	14785.000000	3801.000000	1107.000000	59381.000000
mean	39507.211515	7.962172	141.118492	123.760974	50.635622	11.965673	5.636837
std	22815.883089	13.027697	107.759559	98.516206	78.149069	38.718774	2.456833
min	2.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000
25%	19780.000000	2.000000	8.000000	17.000000	1.000000	0.000000	4.000000
50%	39487.000000	4.000000	229.000000	117.000000	8.000000	0.000000	6.000000
75%	59211.000000	9.000000	240.000000	240.000000	64.000000	2.000000	8.000000
max	79146.000000	240.000000	240.000000	240.000000	240.000000	240.000000	8.000000

	Id	Medical_History_1	Medical_History_10	Medical_History_15	Medical_History_24	Medical_History_32	Response
0	2	4	NaN	240	NaN	NaN	8
1	5	5	NaN	0	NaN	NaN	4
2	6	10	NaN	NaN	NaN	NaN	8
3	7	0	NaN	NaN	NaN	NaN	8
4	8	NaN	NaN	NaN	NaN	NaN	8

	Id	Product_Info_1	Product_Info_2	Product_Info_3	Product_Info_5	Product_Info_6	Product_Info_7	Employment_Info_2	Employment_Info_3	Employment_Info_5	...	Medical_History_33	Medical_History_34	Medical_History_35	Medical_History_36	Medical_History_37	Medical_History_38	Medical_History_39	Medical_History_40	Medical_History_41	Response
0	2	1	D3	10	2	1	1	12	1	3	...	1	3	1	2	2	1	3	3	3	8
1	5	1	A1	26	2	3	1	1	3	2	...	3	1	1	2	2	1	3	3	1	4
2	6	1	E1	26	2	3	1	9	1	2	...	3	3	1	3	2	1	3	3	1	8
3	7	1	D4	10	2	3	1	9	1	3	...	3	3	1	2	2	1	3	3	1	8
4	8	1	D2	26	2	3	1	9	1	2	...	3	3	1	3	2	1	3	3	1	8

	Id	Medical_History_1	Medical_History_10	Medical_History_15	Medical_History_24	Medical_History_32	Response
0	2	4	NaN	240	NaN	NaN	8
1	5	5	NaN	0	NaN	NaN	4
2	6	10	NaN	NaN	NaN	NaN	8
3	7	0	NaN	NaN	NaN	NaN	8
4	8	NaN	NaN	NaN	NaN	NaN	8

	Id	Product_Info_1	Product_Info_2	Product_Info_3	Product_Info_5	Product_Info_6	Product_Info_7	Employment_Info_2	Employment_Info_3	Employment_Info_5	...	Medical_History_33	Medical_History_34	Medical_History_35	Medical_History_36	Medical_History_37	Medical_History_38	Medical_History_39	Medical_History_40	Medical_History_41	Response
0	2	1	D3	10	2	1	1	12	1	3	...	1	3	1	2	2	1	3	3	3	8
1	5	1	A1	26	2	3	1	1	3	2	...	3	1	1	2	2	1	3	3	1	4
2	6	1	E1	26	2	3	1	9	1	2	...	3	3	1	3	2	1	3	3	1	8
3	7	1	D4	10	2	3	1	9	1	3	...	3	3	1	2	2	1	3	3	1	8
4	8	1	D2	26	2	3	1	9	1	2	...	3	3	1	3	2	1	3	3	1	8

	Id	Medical_History_1	Medical_History_10	Medical_History_15	Medical_History_24	Medical_History_32	Response
0	2	4	NaN	240	NaN	NaN	8
1	5	5	NaN	0	NaN	NaN	4
2	6	10	NaN	NaN	NaN	NaN	8
3	7	0	NaN	NaN	NaN	NaN	8
4	8	NaN	NaN	NaN	NaN	NaN	8