Exploration of Prudential Life Insurance Data

Data retrieved from:

https://www.kaggle.com/c/prudential-life-insurance-assessment

File descriptions:
  • train.csv - the training set, contains the Response values
  • test.csv - the test set, you must predict the Response variable for all rows in this file
  • sample_submission.csv - a sample submission file in the correct format
Data fields:
Variable Description
Id A unique identifier associated with an application.
Product_Info_1-7 A set of normalized variables relating to the product applied for
Ins_Age Normalized age of applicant
Ht Normalized height of applicant
Wt Normalized weight of applicant
BMI Normalized BMI of applicant
Employment_Info_1-6 A set of normalized variables relating to the employment history of the applicant.
InsuredInfo_1-6 A set of normalized variables providing information about the applicant.
Insurance_History_1-9 A set of normalized variables relating to the insurance history of the applicant.
Family_Hist_1-5 A set of normalized variables relating to the family history of the applicant.
Medical_History_1-41 A set of normalized variables relating to the medical history of the applicant.
Medical_Keyword_1-48 A set of dummy variables relating to the presence of/absence of a medical keyword being associated with the application.
Response This is the target variable, an ordinal variable relating to the final decision associated with an application

The following variables are all categorical (nominal):

Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41

The following variables are continuous:

Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5

The following variables are discrete:

Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32

Medical_Keyword_1-48 are dummy variables.

My thoughts are as follows:

The main dependent variable is the Risk Response (1-8) What are variables are correlated to the risk response? How do I perform correlation analysis between variables?

Import libraries


In [26]:
# Importing libraries

%pylab inline
%matplotlib inline
import pandas as pd 
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from sklearn import preprocessing
import numpy as np


Populating the interactive namespace from numpy and matplotlib

In [3]:
# Convert variable data into categorical, continuous, discrete, 
# and dummy variable lists the following into a dictionary

Define categorical data types


In [71]:
s = ["Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41",
    "Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5",
     "Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32"]
 

varTypes = dict()


#Very hacky way of inserting and appending ID and Response columns to the required dataframes
#Make this better

varTypes['categorical'] = s[0].split(', ')
#varTypes['categorical'].insert(0, 'Id')
#varTypes['categorical'].append('Response')

varTypes['continuous'] = s[1].split(', ')
#varTypes['continuous'].insert(0, 'Id')
#varTypes['continuous'].append('Response')

varTypes['discrete'] = s[2].split(', ')
#varTypes['discrete'].insert(0, 'Id') 
#varTypes['discrete'].append('Response')



varTypes['dummy'] = ["Medical_Keyword_"+str(i) for i in range(1,49)]
varTypes['dummy'].insert(0, 'Id')
varTypes['dummy'].append('Response')


['Id', 'Medical_Keyword_1', 'Medical_Keyword_2', 'Medical_Keyword_3', 'Medical_Keyword_4', 'Medical_Keyword_5', 'Medical_Keyword_6', 'Medical_Keyword_7', 'Medical_Keyword_8', 'Medical_Keyword_9', 'Medical_Keyword_10', 'Medical_Keyword_11', 'Medical_Keyword_12', 'Medical_Keyword_13', 'Medical_Keyword_14', 'Medical_Keyword_15', 'Medical_Keyword_16', 'Medical_Keyword_17', 'Medical_Keyword_18', 'Medical_Keyword_19', 'Medical_Keyword_20', 'Medical_Keyword_21', 'Medical_Keyword_22', 'Medical_Keyword_23', 'Medical_Keyword_24', 'Medical_Keyword_25', 'Medical_Keyword_26', 'Medical_Keyword_27', 'Medical_Keyword_28', 'Medical_Keyword_29', 'Medical_Keyword_30', 'Medical_Keyword_31', 'Medical_Keyword_32', 'Medical_Keyword_33', 'Medical_Keyword_34', 'Medical_Keyword_35', 'Medical_Keyword_36', 'Medical_Keyword_37', 'Medical_Keyword_38', 'Medical_Keyword_39', 'Medical_Keyword_40', 'Medical_Keyword_41', 'Medical_Keyword_42', 'Medical_Keyword_43', 'Medical_Keyword_44', 'Medical_Keyword_45', 'Medical_Keyword_46', 'Medical_Keyword_47', 'Medical_Keyword_48', 'Response']

In [5]:
#Prints out each of the the variable types as a check
#for i in iter(varTypes['dummy']):
    #print i

Importing life insurance data set

The following variables are all categorical (nominal):

Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41

The following variables are continuous: Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5

The following variables are discrete: Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32

Medical_Keyword_1-48 are dummy variables.


In [11]:
#Import training data 
d = pd.read_csv('prud_files/train.csv')

In [93]:
def normalize_df(d):
    min_max_scaler = preprocessing.MinMaxScaler()
    x = d.values.astype(np.float)
    return pd.DataFrame(min_max_scaler.fit_transform(x))

In [130]:
# Import training data 
d = pd.read_csv('prud_files/train.csv')

#Separation into groups

df_cat = pd.DataFrame(d, columns=["Id","Response"]+varTypes["categorical"])
df_disc = pd.DataFrame(d, columns=["Id","Response"]+varTypes["categorical"])
df_cont = pd.DataFrame(d, columns=["Id","Response"]+varTypes["categorical"])

In [159]:
d_cat = df_cat.copy()

#normalizes the columns for binary classification
norm_product_info_2 = [pd.get_dummies(d_cat["Product_Info_2"])]

a = pd.DataFrame(normalize_df(d_cat["Response"]))
a.columns=["nResponse"]
d_cat = pd.concat([d_cat, a], axis=1, join='outer')

for x in varTypes["categorical"]:
        try:

            a = pd.DataFrame(normalize_df(d_cat[x]))
            a.columns=[str("n"+x)]
            d_cat = pd.concat([d_cat, a], axis=1, join='outer')
        except Exception as e:
            print e.args
            print "Error on "+str(x)+" w error: "+str(e)


('could not convert string to float: A8',)
Error on Product_Info_2 w error: could not convert string to float: A8

In [145]:


In [163]:
d_cat.iloc[:,62:66].head(5)

# Normalization of columns
# Create a minimum and maximum processor object


Out[163]:
nResponse nProduct_Info_1 nProduct_Info_3 nProduct_Info_5
0 1.000000 0 0.243243 0
1 0.428571 0 0.675676 0
2 1.000000 0 0.675676 0
3 1.000000 0 0.243243 0
4 1.000000 0 0.675676 0

In [107]:
# Define various group by data streams

df = d
    
gb_PI2 = df.groupby('Product_Info_1')
gb_PI2 = df.groupby('Product_Info_2')

gb_Ins_Age = df.groupby('Ins_Age')
gb_Ht = df.groupby('Ht')
gb_Wt = df.groupby('Wt')

gb_response = df.groupby('Response')

In [66]:
#Outputs rows the differnet categorical groups

for c in df.columns:
    if (c in varTypes['categorical']):
        if(c != 'Id'):
            a = [ str(x)+", " for x in df.groupby(c).groups ]
            print c + " : " + str(a)


Product_Info_1 : ['1, ', '2, ']
Product_Info_2 : ['C2, ', 'B2, ', 'E1, ', 'A4, ', 'B1, ', 'A1, ', 'A3, ', 'A2, ', 'A5, ', 'C1, ', 'A7, ', 'A6, ', 'C3, ', 'A8, ', 'D4, ', 'C4, ', 'D2, ', 'D3, ', 'D1, ']
Product_Info_3 : ['1, ', '2, ', '3, ', '4, ', '5, ', '6, ', '8, ', '9, ', '10, ', '11, ', '12, ', '13, ', '15, ', '16, ', '17, ', '18, ', '19, ', '20, ', '21, ', '22, ', '23, ', '24, ', '26, ', '27, ', '28, ', '29, ', '30, ', '31, ', '32, ', '33, ', '34, ', '36, ', '37, ', '38, ']
Product_Info_5 : ['2, ', '3, ']
Product_Info_6 : ['1, ', '3, ']
Product_Info_7 : ['1, ', '2, ', '3, ']
Employment_Info_2 : ['1, ', '2, ', '3, ', '4, ', '5, ', '6, ', '7, ', '9, ', '10, ', '11, ', '12, ', '13, ', '14, ', '15, ', '16, ', '17, ', '18, ', '19, ', '20, ', '21, ', '22, ', '23, ', '25, ', '26, ', '27, ', '28, ', '29, ', '30, ', '31, ', '32, ', '33, ', '34, ', '35, ', '36, ', '37, ', '38, ']
Employment_Info_3 : ['1, ', '3, ']
Employment_Info_5 : ['2, ', '3, ']
InsuredInfo_1 : ['1, ', '2, ', '3, ']
InsuredInfo_2 : ['2, ', '3, ']
InsuredInfo_3 : ['1, ', '2, ', '3, ', '4, ', '5, ', '6, ', '7, ', '8, ', '9, ', '10, ', '11, ']
InsuredInfo_4 : ['2, ', '3, ']
InsuredInfo_5 : ['1, ', '3, ']
InsuredInfo_6 : ['1, ', '2, ']
InsuredInfo_7 : ['1, ', '3, ']
Insurance_History_1 : ['1, ', '2, ']
Insurance_History_2 : ['1, ', '2, ', '3, ']
Insurance_History_3 : ['1, ', '2, ', '3, ']
Insurance_History_4 : ['1, ', '2, ', '3, ']
Insurance_History_7 : ['1, ', '2, ', '3, ']
Insurance_History_8 : ['1, ', '2, ', '3, ']
Insurance_History_9 : ['1, ', '2, ', '3, ']
Family_Hist_1 : ['1, ', '2, ', '3, ']
Medical_History_2 : ['1, ', '2, ', '3, ', '5, ', '6, ', '7, ', '8, ', '9, ', '10, ', '12, ', '13, ', '14, ', '15, ', '16, ', '17, ', '18, ', '19, ', '20, ', '21, ', '22, ', '23, ', '24, ', '25, ', '26, ', '27, ', '28, ', '29, ', '30, ', '32, ', '33, ', '34, ', '35, ', '36, ', '37, ', '38, ', '39, ', '40, ', '41, ', '42, ', '43, ', '44, ', '45, ', '46, ', '47, ', '48, ', '50, ', '51, ', '52, ', '53, ', '54, ', '55, ', '56, ', '57, ', '58, ', '59, ', '60, ', '61, ', '62, ', '63, ', '64, ', '66, ', '67, ', '68, ', '69, ', '70, ', '71, ', '72, ', '73, ', '74, ', '75, ', '76, ', '77, ', '78, ', '79, ', '81, ', '82, ', '84, ', '85, ', '86, ', '87, ', '88, ', '89, ', '90, ', '91, ', '93, ', '94, ', '95, ', '96, ', '97, ', '98, ', '99, ', '100, ', '101, ', '102, ', '104, ', '105, ', '106, ', '107, ', '108, ', '109, ', '110, ', '111, ', '112, ', '113, ', '114, ', '115, ', '116, ', '117, ', '120, ', '121, ', '122, ', '123, ', '124, ', '125, ', '127, ', '128, ', '129, ', '131, ', '132, ', '133, ', '134, ', '135, ', '136, ', '137, ', '138, ', '139, ', '140, ', '141, ', '142, ', '143, ', '144, ', '145, ', '146, ', '147, ', '148, ', '149, ', '150, ', '151, ', '152, ', '153, ', '154, ', '155, ', '156, ', '157, ', '158, ', '159, ', '160, ', '161, ', '162, ', '163, ', '164, ', '165, ', '166, ', '167, ', '169, ', '170, ', '171, ', '172, ', '173, ', '174, ', '175, ', '177, ', '179, ', '180, ', '181, ', '182, ', '183, ', '184, ', '185, ', '186, ', '187, ', '188, ', '189, ', '190, ', '191, ', '192, ', '193, ', '195, ', '196, ', '197, ', '198, ', '199, ', '200, ', '201, ', '202, ', '203, ', '204, ', '205, ', '207, ', '208, ', '209, ', '210, ', '212, ', '213, ', '214, ', '215, ', '216, ', '217, ', '218, ', '219, ', '220, ', '221, ', '222, ', '223, ', '224, ', '225, ', '226, ', '227, ', '228, ', '229, ', '230, ', '231, ', '232, ', '233, ', '234, ', '235, ', '236, ', '238, ', '239, ', '240, ', '241, ', '242, ', '243, ', '245, ', '247, ', '248, ', '249, ', '250, ', '251, ', '252, ', '253, ', '255, ', '256, ', '257, ', '258, ', '259, ', '260, ', '261, ', '262, ', '264, ', '265, ', '266, ', '267, ', '268, ', '270, ', '271, ', '272, ', '273, ', '274, ', '275, ', '276, ', '277, ', '278, ', '279, ', '280, ', '281, ', '282, ', '283, ', '285, ', '286, ', '287, ', '288, ', '289, ', '290, ', '291, ', '293, ', '294, ', '295, ', '296, ', '297, ', '298, ', '299, ', '301, ', '302, ', '303, ', '305, ', '306, ', '307, ', '310, ', '311, ', '313, ', '314, ', '315, ', '316, ', '317, ', '318, ', '319, ', '320, ', '321, ', '322, ', '323, ', '324, ', '326, ', '327, ', '328, ', '329, ', '330, ', '331, ', '332, ', '333, ', '334, ', '335, ', '336, ', '337, ', '338, ', '343, ', '344, ', '345, ', '346, ', '347, ', '348, ', '349, ', '350, ', '351, ', '352, ', '353, ', '354, ', '355, ', '357, ', '358, ', '360, ', '361, ', '362, ', '363, ', '364, ', '366, ', '368, ', '369, ', '370, ', '371, ', '372, ', '373, ', '374, ', '375, ', '376, ', '377, ', '378, ', '379, ', '380, ', '381, ', '382, ', '383, ', '384, ', '385, ', '386, ', '387, ', '388, ', '389, ', '390, ', '391, ', '392, ', '393, ', '394, ', '395, ', '396, ', '397, ', '398, ', '399, ', '400, ', '403, ', '404, ', '405, ', '406, ', '407, ', '408, ', '409, ', '410, ', '411, ', '412, ', '413, ', '414, ', '415, ', '416, ', '417, ', '418, ', '419, ', '420, ', '421, ', '422, ', '426, ', '427, ', '428, ', '430, ', '431, ', '432, ', '433, ', '434, ', '435, ', '436, ', '437, ', '438, ', '439, ', '440, ', '441, ', '443, ', '444, ', '446, ', '447, ', '448, ', '449, ', '451, ', '452, ', '453, ', '455, ', '456, ', '457, ', '458, ', '459, ', '461, ', '462, ', '464, ', '465, ', '466, ', '467, ', '468, ', '469, ', '470, ', '471, ', '472, ', '473, ', '474, ', '475, ', '476, ', '477, ', '478, ', '479, ', '480, ', '481, ', '482, ', '483, ', '484, ', '486, ', '487, ', '488, ', '489, ', '490, ', '491, ', '492, ', '493, ', '494, ', '495, ', '496, ', '497, ', '498, ', '499, ', '501, ', '502, ', '503, ', '504, ', '505, ', '506, ', '507, ', '509, ', '510, ', '511, ', '512, ', '513, ', '514, ', '515, ', '516, ', '517, ', '518, ', '519, ', '520, ', '522, ', '523, ', '524, ', '525, ', '526, ', '527, ', '528, ', '529, ', '530, ', '531, ', '532, ', '533, ', '534, ', '536, ', '537, ', '538, ', '540, ', '541, ', '542, ', '543, ', '544, ', '545, ', '546, ', '548, ', '549, ', '550, ', '551, ', '552, ', '553, ', '554, ', '557, ', '558, ', '559, ', '560, ', '561, ', '562, ', '563, ', '564, ', '565, ', '566, ', '567, ', '568, ', '569, ', '570, ', '571, ', '572, ', '573, ', '575, ', '576, ', '577, ', '578, ', '579, ', '580, ', '581, ', '582, ', '583, ', '584, ', '586, ', '587, ', '588, ', '589, ', '590, ', '591, ', '592, ', '593, ', '595, ', '596, ', '598, ', '599, ', '600, ', '601, ', '602, ', '603, ', '605, ', '606, ', '607, ', '608, ', '609, ', '610, ', '611, ', '613, ', '614, ', '615, ', '616, ', '617, ', '618, ', '619, ', '620, ', '621, ', '622, ', '623, ', '624, ', '626, ', '627, ', '628, ', '629, ', '630, ', '631, ', '632, ', '633, ', '634, ', '635, ', '636, ', '637, ', '638, ', '639, ', '640, ', '641, ', '642, ', '643, ', '644, ', '645, ', '646, ', '647, ', '648, ']
Medical_History_3 : ['1, ', '2, ', '3, ']
Medical_History_4 : ['1, ', '2, ']
Medical_History_5 : ['1, ', '2, ', '3, ']
Medical_History_6 : ['1, ', '2, ', '3, ']
Medical_History_7 : ['1, ', '2, ', '3, ']
Medical_History_8 : ['1, ', '2, ', '3, ']
Medical_History_9 : ['1, ', '2, ', '3, ']
Medical_History_11 : ['1, ', '2, ', '3, ']
Medical_History_12 : ['1, ', '2, ', '3, ']
Medical_History_13 : ['1, ', '2, ', '3, ']
Medical_History_14 : ['1, ', '2, ', '3, ']
Medical_History_16 : ['1, ', '2, ', '3, ']
Medical_History_17 : ['1, ', '2, ', '3, ']
Medical_History_18 : ['1, ', '2, ', '3, ']
Medical_History_19 : ['1, ', '2, ', '3, ']
Medical_History_20 : ['1, ', '2, ', '3, ']
Medical_History_21 : ['1, ', '2, ', '3, ']
Medical_History_22 : ['1, ', '2, ']
Medical_History_23 : ['1, ', '2, ', '3, ']
Medical_History_25 : ['1, ', '2, ', '3, ']
Medical_History_26 : ['1, ', '2, ', '3, ']
Medical_History_27 : ['1, ', '2, ', '3, ']
Medical_History_28 : ['1, ', '2, ', '3, ']
Medical_History_29 : ['1, ', '2, ', '3, ']
Medical_History_30 : ['1, ', '2, ', '3, ']
Medical_History_31 : ['1, ', '2, ', '3, ']
Medical_History_33 : ['1, ', '3, ']
Medical_History_34 : ['1, ', '2, ', '3, ']
Medical_History_35 : ['1, ', '2, ', '3, ']
Medical_History_36 : ['1, ', '2, ', '3, ']
Medical_History_37 : ['1, ', '2, ', '3, ']
Medical_History_38 : ['1, ', '2, ']
Medical_History_39 : ['1, ', '2, ', '3, ']
Medical_History_40 : ['1, ', '2, ', '3, ']
Medical_History_41 : ['1, ', '2, ', '3, ']
Response : ['1, ', '2, ', '3, ', '4, ', '5, ', '6, ', '7, ', '8, ']

In [28]:
df_prod_info = pd.DataFrame(d, columns=(["Response"]+ [ "Product_Info_"+str(x) for x in range(1,8)])) 
df_emp_info = pd.DataFrame(d, columns=(["Response"]+ [ "Employment_Info_"+str(x) for x in range(1,6)])) 
df_bio = pd.DataFrame(d, columns=["Response", "Ins_Age", "Ht", "Wt","BMI"])
df_med_kw = pd.DataFrame(d, columns=(["Response"]+ [ "Medical_Keyword_"+str(x) for x in range(1,48)])).add(axis=[ "Medical_Keyword_"+str(x) for x in range(1,48)])
df_med_kw.describe()


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-49219c99689d> in <module>()
      2 df_emp_info = pd.DataFrame(d, columns=(["Response"]+ [ "Employment_Info_"+str(x) for x in range(1,6)]))
      3 df_bio = pd.DataFrame(d, columns=["Response", "Ins_Age", "Ht", "Wt","BMI"])
----> 4 df_med_kw = pd.DataFrame(d, columns=(["Response"]+ [ "Medical_Keyword_"+str(x) for x in range(1,48)])).add(axis=[ "Medical_Keyword_"+str(x) for x in range(1,48)])
      5 df_med_kw.describe()

TypeError: f() takes at least 2 arguments (2 given)

In [6]:
df.head(5)


Out[6]:
Id Product_Info_1 Product_Info_2 Product_Info_3 Product_Info_4 Product_Info_5 Product_Info_6 Product_Info_7 Ins_Age Ht ... Medical_Keyword_40 Medical_Keyword_41 Medical_Keyword_42 Medical_Keyword_43 Medical_Keyword_44 Medical_Keyword_45 Medical_Keyword_46 Medical_Keyword_47 Medical_Keyword_48 Response
0 2 1 D3 10 0.076923 2 1 1 0.641791 0.581818 ... 0 0 0 0 0 0 0 0 0 8
1 5 1 A1 26 0.076923 2 3 1 0.059701 0.600000 ... 0 0 0 0 0 0 0 0 0 4
2 6 1 E1 26 0.076923 2 3 1 0.029851 0.745455 ... 0 0 0 0 0 0 0 0 0 8
3 7 1 D4 10 0.487179 2 3 1 0.164179 0.672727 ... 0 0 0 0 0 0 0 0 0 8
4 8 1 D2 26 0.230769 2 3 1 0.417910 0.654545 ... 0 0 0 0 0 0 0 0 0 8

5 rows × 128 columns


In [7]:
df.describe()


Out[7]:
Id Product_Info_1 Product_Info_3 Product_Info_4 Product_Info_5 Product_Info_6 Product_Info_7 Ins_Age Ht Wt ... Medical_Keyword_40 Medical_Keyword_41 Medical_Keyword_42 Medical_Keyword_43 Medical_Keyword_44 Medical_Keyword_45 Medical_Keyword_46 Medical_Keyword_47 Medical_Keyword_48 Response
count 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 ... 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000 59381.000000
mean 39507.211515 1.026355 24.415655 0.328952 2.006955 2.673599 1.043583 0.405567 0.707283 0.292587 ... 0.056954 0.010054 0.045536 0.010710 0.007528 0.013691 0.008488 0.019905 0.054496 5.636837
std 22815.883089 0.160191 5.072885 0.282562 0.083107 0.739103 0.291949 0.197190 0.074239 0.089037 ... 0.231757 0.099764 0.208479 0.102937 0.086436 0.116207 0.091737 0.139676 0.226995 2.456833
min 2.000000 1.000000 1.000000 0.000000 2.000000 1.000000 1.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 19780.000000 1.000000 26.000000 0.076923 2.000000 3.000000 1.000000 0.238806 0.654545 0.225941 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.000000
50% 39487.000000 1.000000 26.000000 0.230769 2.000000 3.000000 1.000000 0.402985 0.709091 0.288703 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000
75% 59211.000000 1.000000 26.000000 0.487179 2.000000 3.000000 1.000000 0.567164 0.763636 0.345188 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.000000
max 79146.000000 2.000000 38.000000 1.000000 3.000000 3.000000 3.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 8.000000

8 rows × 127 columns

Grouping of various categorical data sets

Histograms and descriptive statistics for Risk Response, Ins_Age, BMI, Wt


In [9]:
plt.figure(0)
plt.title("Categorical - Histogram for Risk Response")
plt.xlabel("Risk Response (1-7)")
plt.ylabel("Frequency")
plt.hist(df.Response)
plt.savefig('images/hist_Response.png')
print df.Response.describe()
print ""


plt.figure(1)
plt.title("Continuous - Histogram for Ins_Age")
plt.xlabel("Normalized Ins_Age [0,1]")
plt.ylabel("Frequency")
plt.hist(df.Ins_Age)
plt.savefig('images/hist_Ins_Age.png')
print df.Ins_Age.describe()
print ""

plt.figure(2)
plt.title("Continuous - Histogram for BMI")
plt.xlabel("Normalized BMI [0,1]")
plt.ylabel("Frequency")
plt.hist(df.BMI)
plt.savefig('images/hist_BMI.png')
print df.BMI.describe()
print ""

plt.figure(3)
plt.title("Continuous - Histogram for Wt")
plt.xlabel("Normalized Wt [0,1]")
plt.ylabel("Frequency")
plt.hist(df.Wt)
plt.savefig('images/hist_Wt.png')
print df.Wt.describe()
print ""

plt.show()


count    59381.000000
mean         5.636837
std          2.456833
min          1.000000
25%          4.000000
50%          6.000000
75%          8.000000
max          8.000000
Name: Response, dtype: float64

count    59381.000000
mean         0.405567
std          0.197190
min          0.000000
25%          0.238806
50%          0.402985
75%          0.567164
max          1.000000
Name: Ins_Age, dtype: float64

count    59381.000000
mean         0.469462
std          0.122213
min          0.000000
25%          0.385517
50%          0.451349
75%          0.532858
max          1.000000
Name: BMI, dtype: float64

count    59381.000000
mean         0.292587
std          0.089037
min          0.000000
25%          0.225941
50%          0.288703
75%          0.345188
max          1.000000
Name: Wt, dtype: float64

Histograms and descriptive statistics for Product_Info_1-7


In [10]:
for i in range(1,8):
    
    print "The iteration is: "+str(i)
    print df['Product_Info_'+str(i)].describe()
    print ""
    
    plt.figure(i)

    if(i == 4):
        plt.title("Continuous - Histogram for Product_Info_"+str(i))
        plt.xlabel("Normalized value: [0,1]")
        plt.ylabel("Frequency")
    else:
        plt.title("Categorical - Histogram of Product_Info_"+str(i))
        plt.xlabel("Categories")
        plt.ylabel("Frequency")
    
    if(i == 2):
        df.Product_Info_2.value_counts().plot(kind='bar')
    else:
        plt.hist(df['Product_Info_'+str(i)])
        
    plt.savefig('images/hist_Product_Info_'+str(i)+'.png')

plt.show()


The iteration is: 1
count    59381.000000
mean         1.026355
std          0.160191
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          2.000000
Name: Product_Info_1, dtype: float64

The iteration is: 2
count     59381
unique       19
top          D3
freq      14321
Name: Product_Info_2, dtype: object

The iteration is: 3
count    59381.000000
mean        24.415655
std          5.072885
min          1.000000
25%         26.000000
50%         26.000000
75%         26.000000
max         38.000000
Name: Product_Info_3, dtype: float64

The iteration is: 4
count    59381.000000
mean         0.328952
std          0.282562
min          0.000000
25%          0.076923
50%          0.230769
75%          0.487179
max          1.000000
Name: Product_Info_4, dtype: float64

The iteration is: 5
count    59381.000000
mean         2.006955
std          0.083107
min          2.000000
25%          2.000000
50%          2.000000
75%          2.000000
max          3.000000
Name: Product_Info_5, dtype: float64

The iteration is: 6
count    59381.000000
mean         2.673599
std          0.739103
min          1.000000
25%          3.000000
50%          3.000000
75%          3.000000
max          3.000000
Name: Product_Info_6, dtype: float64

The iteration is: 7
count    59381.000000
mean         1.043583
std          0.291949
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          3.000000
Name: Product_Info_7, dtype: float64

Split dataframes into categorical, continuous, discrete, dummy, and response


In [11]:
catD = df.loc[:,varTypes['categorical']]
contD = df.loc[:,varTypes['continuous']]
disD = df.loc[:,varTypes['discrete']]
dummyD = df.loc[:,varTypes['dummy']]
respD = df.loc[:,['id','Response']]

Descriptive statistics and scatter plot relating Product_Info_2 and Response


In [12]:
prod_info = [ "Product_Info_"+str(i) for i in range(1,8)]

a = catD.loc[:, prod_info[1]]

stats = catD.groupby(prod_info[1]).describe()

In [192]:
c = gb_PI2.Response.count()
plt.figure(0)

plt.scatter(c[0],c[1])


Out[192]:
<matplotlib.collections.PathCollection at 0x419ea828>

In [64]:
plt.figure(0)
plt.title("Histogram of "+"Product_Info_"+str(i))
plt.xlabel("Categories " + str((a.describe())['count']))
plt.ylabel("Frequency")


Out[64]:
'Product_Info_3'

In [61]:
for i in range(1,8):
    a = catD.loc[:, "Product_Info_"+str(i)]
    if(i is not 4):
        print a.describe()
    print ""
    
    plt.figure(i)
    plt.title("Histogram of "+"Product_Info_"+str(i))
    plt.xlabel("Categories " + str((catD.groupby(key).describe())['count']))
    plt.ylabel("Frequency")
    
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    
    if a.dtype in (np.int64, np.float, float, int):
        a.hist()
        
# Random functions
#catD.Product_Info_1.describe()
#catD.loc[:, prod_info].groupby('Product_Info_2').describe()
#df[varTypes['categorical']].hist()


count    59381.000000
mean         1.026355
std          0.160191
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          2.000000
Name: Product_Info_1, dtype: float64

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-61-64afcafe78d9> in <module>()
      9     plt.figure(i)
     10     plt.title("Histogram of "+"Product_Info_"+str(i))
---> 11     plt.xlabel("Categories " + str((catD.groupby(key).describe())['count']))
     12     plt.ylabel("Frequency")
     13     #fig, axes = plt.subplots(nrows = 1, ncols = 2)

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in describe(self, percentile_width, percentiles, include, exclude)

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in wrapper(*args, **kwargs)
    559             except Exception:
    560                 try:
--> 561                     return self.apply(curried)
    562                 except Exception:
    563 

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in apply(self, func, *args, **kwargs)
    660         # ignore SettingWithCopy here in case the user mutates
    661         with option_context('mode.chained_assignment',None):
--> 662             return self._python_apply_general(f)
    663 
    664     def _python_apply_general(self, f):

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in _python_apply_general(self, f)
    664     def _python_apply_general(self, f):
    665         keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 666                                                    self.axis)
    667 
    668         return self._wrap_applied_output(keys, values,

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in apply(self, f, data, axis)
   1272                 hasattr(splitter, 'fast_apply') and axis == 0):
   1273             try:
-> 1274                 values, mutated = splitter.fast_apply(f, group_keys)
   1275                 return group_keys, values, mutated
   1276             except (lib.InvalidApply):

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in fast_apply(self, f, names)
   3444 
   3445         sdata = self._get_sorted_data()
-> 3446         results, mutated = lib.apply_frame_axis0(sdata, f, names, starts, ends)
   3447 
   3448         return results, mutated

pandas\src\reduce.pyx in pandas.lib.apply_frame_axis0 (pandas\lib.c:38246)()

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in f(g)
    656         @wraps(func)
    657         def f(g):
--> 658             return func(g, *args, **kwargs)
    659 
    660         # ignore SettingWithCopy here in case the user mutates

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in curried(x)
    544 
    545             def curried(x):
--> 546                 return f(x, *args, **kwargs)
    547 
    548             # preserve the name so we can detect it when calling plot methods,

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\generic.pyc in describe(self, percentile_width, percentiles, include, exclude)
   3841             data = self.select_dtypes(include=include, exclude=exclude)
   3842 
-> 3843         ldesc = [describe_1d(s, percentiles) for _, s in data.iteritems()]
   3844         # set a convenient order for rows
   3845         names = []

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\generic.pyc in describe_1d(data, percentiles)
   3819         def describe_1d(data, percentiles):
   3820             if com.is_numeric_dtype(data):
-> 3821                 return describe_numeric_1d(data, percentiles)
   3822             elif com.is_timedelta64_dtype(data):
   3823                 return describe_numeric_1d(data, percentiles)

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\generic.pyc in describe_numeric_1d(series, percentiles)
   3793                   [pretty_name(x) for x in percentiles] + ['max'])
   3794             d = ([series.count(), series.mean(), series.std(), series.min()] +
-> 3795                  [series.quantile(x) for x in percentiles] + [series.max()])
   3796             return pd.Series(d, index=stat_index, name=series.name)
   3797 

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\series.pyc in quantile(self, q)
   1264                 return _quantile(values, qs*100)
   1265 
-> 1266         return self._maybe_box(lambda values: multi(values, q), dropna=True)
   1267 
   1268     def ptp(self, axis=None, out=None):

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\series.pyc in _maybe_box(self, func, dropna)
   2121                 if len(values) == 0:
   2122                     return np.nan
-> 2123             result = func(values)
   2124 
   2125         return result

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\series.pyc in <lambda>(values)
   1264                 return _quantile(values, qs*100)
   1265 
-> 1266         return self._maybe_box(lambda values: multi(values, q), dropna=True)
   1267 
   1268     def ptp(self, axis=None, out=None):

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\series.pyc in multi(values, qs)
   1258 
   1259         def multi(values, qs):
-> 1260             if com.is_list_like(qs):
   1261                 return Series([_quantile(values, x*100)
   1262                                for x in qs], index=qs)

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\common.pyc in is_list_like(arg)
   2501 
   2502 def is_list_like(arg):
-> 2503     return (hasattr(arg, '__iter__') and
   2504             not isinstance(arg, compat.string_and_binary_types))
   2505 

KeyboardInterrupt: 

In [15]:
catD.head(5)


Out[15]:
Id Product_Info_1 Product_Info_2 Product_Info_3 Product_Info_5 Product_Info_6 Product_Info_7 Employment_Info_2 Employment_Info_3 Employment_Info_5 ... Medical_History_33 Medical_History_34 Medical_History_35 Medical_History_36 Medical_History_37 Medical_History_38 Medical_History_39 Medical_History_40 Medical_History_41 Response
0 2 1 D3 10 2 1 1 12 1 3 ... 1 3 1 2 2 1 3 3 3 8
1 5 1 A1 26 2 3 1 1 3 2 ... 3 1 1 2 2 1 3 3 1 4
2 6 1 E1 26 2 3 1 9 1 2 ... 3 3 1 3 2 1 3 3 1 8
3 7 1 D4 10 2 3 1 9 1 3 ... 3 3 1 2 2 1 3 3 1 8
4 8 1 D2 26 2 3 1 9 1 2 ... 3 3 1 3 2 1 3 3 1 8

5 rows × 62 columns


In [16]:
#Exploration of the discrete data
disD.describe()


Out[16]:
Id Medical_History_1 Medical_History_10 Medical_History_15 Medical_History_24 Medical_History_32 Response
count 59381.000000 50492.000000 557.000000 14785.000000 3801.000000 1107.000000 59381.000000
mean 39507.211515 7.962172 141.118492 123.760974 50.635622 11.965673 5.636837
std 22815.883089 13.027697 107.759559 98.516206 78.149069 38.718774 2.456833
min 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 19780.000000 2.000000 8.000000 17.000000 1.000000 0.000000 4.000000
50% 39487.000000 4.000000 229.000000 117.000000 8.000000 0.000000 6.000000
75% 59211.000000 9.000000 240.000000 240.000000 64.000000 2.000000 8.000000
max 79146.000000 240.000000 240.000000 240.000000 240.000000 240.000000 8.000000

In [17]:
disD.head(5)


Out[17]:
Id Medical_History_1 Medical_History_10 Medical_History_15 Medical_History_24 Medical_History_32 Response
0 2 4 NaN 240 NaN NaN 8
1 5 5 NaN 0 NaN NaN 4
2 6 10 NaN NaN NaN NaN 8
3 7 0 NaN NaN NaN NaN 8
4 8 NaN NaN NaN NaN NaN 8

In [60]:
#Iterate through each categorical column of data
#Perform a 2D histogram later

i=0    
for key in varTypes['categorical']:
    
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    plt.figure(i)
    plt.title("Histogram of "+str(key))
    plt.xlabel("Categories " + str((df.groupby(key).describe())['count']))
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    if df[key].dtype in (np.int64, np.float, float, int):
        df[key].hist()
    
    i+=1


---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-60-515469ef3abc> in <module>()
      8     plt.figure(i)
      9     plt.title("Histogram of "+str(key))
---> 10     plt.xlabel("Categories " + str((df.groupby(key).describe())['count']))
     11     #fig, axes = plt.subplots(nrows = 1, ncols = 2)
     12     #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in describe(self, percentile_width, percentiles, include, exclude)

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in wrapper(*args, **kwargs)
    559             except Exception:
    560                 try:
--> 561                     return self.apply(curried)
    562                 except Exception:
    563 

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in apply(self, func, *args, **kwargs)
    660         # ignore SettingWithCopy here in case the user mutates
    661         with option_context('mode.chained_assignment',None):
--> 662             return self._python_apply_general(f)
    663 
    664     def _python_apply_general(self, f):

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in _python_apply_general(self, f)
    664     def _python_apply_general(self, f):
    665         keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 666                                                    self.axis)
    667 
    668         return self._wrap_applied_output(keys, values,

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in apply(self, f, data, axis)
   1272                 hasattr(splitter, 'fast_apply') and axis == 0):
   1273             try:
-> 1274                 values, mutated = splitter.fast_apply(f, group_keys)
   1275                 return group_keys, values, mutated
   1276             except (lib.InvalidApply):

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in fast_apply(self, f, names)
   3444 
   3445         sdata = self._get_sorted_data()
-> 3446         results, mutated = lib.apply_frame_axis0(sdata, f, names, starts, ends)
   3447 
   3448         return results, mutated

pandas\src\reduce.pyx in pandas.lib.apply_frame_axis0 (pandas\lib.c:38246)()

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in f(g)
    656         @wraps(func)
    657         def f(g):
--> 658             return func(g, *args, **kwargs)
    659 
    660         # ignore SettingWithCopy here in case the user mutates

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\groupby.pyc in curried(x)
    544 
    545             def curried(x):
--> 546                 return f(x, *args, **kwargs)
    547 
    548             # preserve the name so we can detect it when calling plot methods,

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\generic.pyc in describe(self, percentile_width, percentiles, include, exclude)
   3841             data = self.select_dtypes(include=include, exclude=exclude)
   3842 
-> 3843         ldesc = [describe_1d(s, percentiles) for _, s in data.iteritems()]
   3844         # set a convenient order for rows
   3845         names = []

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\generic.pyc in describe_1d(data, percentiles)
   3819         def describe_1d(data, percentiles):
   3820             if com.is_numeric_dtype(data):
-> 3821                 return describe_numeric_1d(data, percentiles)
   3822             elif com.is_timedelta64_dtype(data):
   3823                 return describe_numeric_1d(data, percentiles)

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\generic.pyc in describe_numeric_1d(series, percentiles)
   3794             d = ([series.count(), series.mean(), series.std(), series.min()] +
   3795                  [series.quantile(x) for x in percentiles] + [series.max()])
-> 3796             return pd.Series(d, index=stat_index, name=series.name)
   3797 
   3798 

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\series.pyc in __init__(self, data, index, dtype, name, copy, fastpath)
    129 
    130             if index is not None:
--> 131                 index = _ensure_index(index)
    132 
    133             if data is None:

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\index.pyc in _ensure_index(index_like, copy)
   4767             index_like = copy(index_like)
   4768 
-> 4769     return Index(index_like)
   4770 
   4771 

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\index.pyc in __new__(cls, data, dtype, copy, name, fastpath, tupleize_cols, **kwargs)
    212                     return PeriodIndex(subarr, name=name, **kwargs)
    213 
--> 214         return cls._simple_new(subarr, name)
    215 
    216     @classmethod

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\index.pyc in _simple_new(cls, values, name, **kwargs)
    221         for k, v in compat.iteritems(kwargs):
    222             setattr(result,k,v)
--> 223         result._reset_identity()
    224         return result
    225 

C:\Users\Robbie\Anaconda\lib\site-packages\pandas\core\index.pyc in _reset_identity(self)
    249     def _reset_identity(self):
    250         """Initializes or resets ``_id`` attribute with new object"""
--> 251         self._id = _Identity()
    252 
    253     # ndarray compat

KeyboardInterrupt: 

In [ ]:
#Iterate through each 'discrete' column of data
#Perform a 2D histogram later

i=0    
for key in varTypes['discrete']:
    
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    plt.figure(i)
    fig, axes = plt.subplots(nrows = 1, ncols = 2)
    
    #Histogram based on normalized value counts of the data set
    disD[key].value_counts().hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    
    #Cumulative histogram based on normalized value counts of the data set
    disD[key].value_counts().hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    i+=1

In [ ]:
#2D Histogram

i=0    
for key in varTypes['categorical']:
    
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    plt.figure(i)
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    
    x = catD[key].value_counts(normalize=True)
    y = df['Response']
    
    plt.hist2d(x[1], y, bins=40, norm=LogNorm())
    plt.colorbar()
    
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    i+=1

In [ ]:
#Iterate through each categorical column of data
#Perform a 2D histogram later

i=0    
for key in varTypes['categorical']:
    
    #print "The category is: {0} with value_counts: {1} and detailed tuple: {2} ".format(key, l.count(), l)
    plt.figure(i)
    #fig, axes = plt.subplots(nrows = 1, ncols = 2)
    #catD[key].value_counts(normalize=True).hist(ax=axes[0]); axes[0].set_title("Histogram: "+str(key))
    #catD[key].value_counts(normalize=True).hist(cumulative=True,ax=axes[1]); axes[1].set_title("Cumulative HG: "+str(key))
    if df[key].dtype in (np.int64, np.float, float, int):
        #(1.*df[key].value_counts()/len(df[key])).hist()
        df[key].value_counts(normalize=True).plot(kind='bar')
    
    i+=1

In [12]:
df.loc('Product_Info_1')


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-78d3c441c23a> in <module>()
----> 1 df.loc('Product_Info_1').head()

AttributeError: '_LocIndexer' object has no attribute 'head'