Homework 2:

For Homework 2, Build models to predict credit card approval using dataset http://archive.ics.uci.edu/ml/datasets/Credit+Approval



In [2]:

    
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# This enables inline Plots
%matplotlib inline

# Limit the rows displayed in dataframe by inserting this line along with your imports.
pd.set_option('display.max_rows', 10)

Part 1 - Data exploration

First, create 2 data frames: `listings` and `bookings` from their respective data files



In [3]:

    
# Create a data frame from the listings dataset

crx_data = pd.read_csv('../hw2/CRX_Data.csv', header=None)



In [4]:

    
# Replace the field names

crx_data.rename(columns={0: 'A1', 1: 'A2', 2: 'A3', 3: 'A4', 4: 'A5', 5: 'A6', 6: 'A7', 7: 'A8', 8: 'A9', 9: 'A10', 10: 'A11', 11: 'A12', 12: 'A13', 13: 'A14', 14: 'A15', 15: 'A16'}, inplace=True)



In [5]:

    
# Check the header of the file

crx_data.head(5)

You may need to do the following -

1. Impute missing data



In [6]:

    
# Check for missing values

crx_data.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 690 entries, 0 to 689
Data columns (total 16 columns):
A1     690 non-null object
A2     690 non-null object
A3     690 non-null float64
A4     690 non-null object
A5     690 non-null object
A6     690 non-null object
A7     690 non-null object
A8     690 non-null float64
A9     690 non-null object
A10    690 non-null object
A11    690 non-null int64
A12    690 non-null object
A13    690 non-null object
A14    690 non-null object
A15    690 non-null int64
A16    690 non-null object
dtypes: float64(2), int64(2), object(12)
memory usage: 91.6+ KB



In [7]:

    
# Remove NaN values and convert A2 and A14 to continuous variables

crx_data.A2.replace('?', np.nan, inplace = True)
crx_data.A2 = crx_data.A2.astype(float)

crx_data.A14.replace('?', np.nan, inplace = True)
crx_data.A14 = crx_data.A14.astype(float)



In [8]:

    
# Run some descriptive statistics on numeric variables

crx_data.describe()









    Out[8]:






  
    
      
      A2
      A3
      A8
      A11
      A14
      A15
    
  
  
    
      count
       678.000000
       690.000000
       690.000000
       690.00000
        677.000000
          690.000000
    
    
      mean
        31.568171
         4.758725
         2.223406
         2.40000
        184.014771
         1017.385507
    
    
      std
        11.957862
         4.978163
         3.346513
         4.86294
        173.806768
         5210.102598
    
    
      min
        13.750000
         0.000000
         0.000000
         0.00000
          0.000000
            0.000000
    
    
      25%
        22.602500
         1.000000
         0.165000
         0.00000
         75.000000
            0.000000
    
    
      50%
        28.460000
         2.750000
         1.000000
         0.00000
        160.000000
            5.000000
    
    
      75%
        38.230000
         7.207500
         2.625000
         3.00000
        276.000000
          395.500000
    
    
      max
        80.250000
        28.000000
        28.500000
        67.00000
       2000.000000
       100000.000000



In [9]:

    
# Impute values for A1
# Replace '?' with NaN

crx_data.A1.replace('?', np.nan, inplace = True)

# Get the distribution of values for A1

crx_data.A1.value_counts()

# Create a function that randomly assigns these values

a1 = ['b', 'a']
p = [0.69, 0.31]

def get_a1_impute_values(n):
    return np.random.choice(a1, n, p)

# Get the NULL values for A1

crx_data.loc[crx_data.A1.isnull(), 'A1']

# Set these values to the values we picked from Random Normal Distribution

crx_data.loc[crx_data.A1.isnull(), 'A1'] = get_a1_impute_values(n=12)



In [10]:

    
# Impute values for A2
# View the values for A2

crx_data.A2.value_counts()

# Get the Mean and Std of A2 Data

print 'Mean A2:', crx_data.A2.mean()
print 'Std A2:', crx_data.A2.std()

# Create a Normal Distribution centered on Mean of 31.57 and Standard Dev of 11.96

def get_a2_impute_values(n):
    return np.random.normal(31.57, 11.96, n)

# Get the NULL values for A2

crx_data.loc[crx_data.A2.isnull(), 'A2']

# Set these values to the values we picked from Random Normal Distribution

crx_data.loc[crx_data.A2.isnull(), 'A2'] = get_a2_impute_values(n=12)









    



Mean A2: 31.5681710914
Std A2: 11.9578624983



In [11]:

    
# Impute values for A4
# Replace '?' with NaN

crx_data.A4.replace('?', np.nan, inplace = True)

# Get the distribution of values for A4

crx_data.A4.value_counts()

# Create a function that randomly assigns these values

a4 = ['u', 'y', 'l']
p = [0.76, 0.24, 0.]

def get_a4_impute_values(n):
    return np.random.choice(a4, n, p)

# Get the NaN values for A4

crx_data.loc[crx_data.A4.isnull(), 'A4']

# Set these values to the values we picked from Random Normal Distribution

crx_data.loc[crx_data.A4.isnull(), 'A4'] = get_a4_impute_values(n=6)



In [12]:

    
# Impute values for A5
# Replace '?' with NaN

crx_data.A5.replace('?', np.nan, inplace = True)

# Get the distribution of values for A1

crx_data.A5.value_counts()

# Create a function that randomly assigns these values

a5 = ['g', 'p', 'gg']
p = [0.76, 0.24, 0.00]

def get_a5_impute_values(n):
    return np.random.choice(a5, n, p)

# Get the NaN values for A5

crx_data.loc[crx_data.A5.isnull(), 'A5']

# Set these values to the values we picked from Random Normal Distribution

crx_data.loc[crx_data.A5.isnull(), 'A5'] = get_a1_impute_values(n=6)



In [13]:

    
# Impute values for A6
# Replace '?' with NaN

crx_data.A6.replace('?', np.nan, inplace = True)

# Get the distribution of values for A1

crx_data.A6.value_counts()

# Create a function that randomly assigns these values

a6 = ['aa', 'c', 'cc', 'd', 'e', 'ff', 'i', 'j', 'k', 'm', 'q', 'r', 'w', 'x']
p = [0.08, 0.20, 0.06, 0.04, 0.04, 0.08, 0.09, 0.01, 0.07, 0.06, 0.11, 0.00, 0.09, 0,06]

def get_a6_impute_values(n):
    return np.random.choice(a6, n, p)

# Get the NaN values for A6

crx_data.loc[crx_data.A6.isnull(), 'A6']

# Set these values to the values we picked from Random Normal Distribution

crx_data.loc[crx_data.A6.isnull(), 'A6'] = get_a6_impute_values(n=9)



In [14]:

    
# Impute values for A7
# Replace '?' with NaN

crx_data.A7.replace('?', np.nan, inplace = True)

# Get the distribution of values for A7

crx_data.A7.value_counts()

# Create a function that randomly assigns these values

a7 = ['v', 'h', 'bb', 'ff', 'z', 'j', 'dd', 'n', 'o']
p = [0.59, 0.20, 0.09, 0.08, 0.01, 0.01, 0.01, 0.01, 0.00]

def get_a7_impute_values(n):
    return np.random.choice(a7, n, p)

# Get the NaN values for A7

crx_data.loc[crx_data.A7.isnull(), 'A7']

# Set these values to the values we picked from Random Normal Distribution

crx_data.loc[crx_data.A7.isnull(), 'A7'] = get_a7_impute_values(n=9)



In [15]:

    
# Impute values for A14
# View the values for A14

crx_data.A14.value_counts()

# Get the Mean and Std of A14 Data

print 'Mean A14:', crx_data.A14.mean()
print 'Std A14:', crx_data.A14.std()

# Create a Normal Distribution centered on Mean of 184.01 and Standard Dev of 173.81

def get_a14_impute_values(n):
    return np.random.normal(184.01, 173.81, n)

# Get the NULL values for A2

crx_data.loc[crx_data.A14.isnull(), 'A14']

# Set these values to the values we picked from Random Normal Distribution

crx_data.loc[crx_data.A14.isnull(), 'A14'] = get_a14_impute_values(n=13)









    



Mean A14: 184.014771049
Std A14: 173.806768225



In [16]:

    
# Convert the approval variable from a categorical to a continuous variable

for elem in crx_data['A16'].unique():
    crx_data[str(elem)] = crx_data['A16'] == elem

# Rename the fields

crx_data.rename(columns={'+': 'Approved', '-': 'Denied'}, inplace=True)

2. Plot and visualize data to see any patterns



In [75]:

    
# Plotting histograms with A1

fig, ax = plt.subplots(1, 2, figsize=(20, 5))
crx_data[crx_data.Approved].A1.value_counts().plot(kind='bar', ax=ax[0])
crx_data[~crx_data.Approved].A1.value_counts().plot(kind='bar', ax=ax[1])









    Out[75]:





<matplotlib.axes._subplots.AxesSubplot at 0x10d305610>



In [63]:

    
# Plotting histograms with A4

fig, ax = plt.subplots(1, 2, figsize=(20, 5))
crx_data[crx_data.Approved].A4.value_counts().plot(kind='bar', ax=ax[0])
crx_data[~crx_data.Approved].A4.value_counts().plot(kind='bar', ax=ax[1])









    Out[63]:





<matplotlib.axes._subplots.AxesSubplot at 0x119a03650>



In [64]:

    
# Plotting histograms with A5

fig, ax = plt.subplots(1, 2, figsize=(20, 5))
crx_data[crx_data.Approved].A5.value_counts().plot(kind='bar', ax=ax[0])
crx_data[~crx_data.Approved].A5.value_counts().plot(kind='bar', ax=ax[1])









    Out[64]:





<matplotlib.axes._subplots.AxesSubplot at 0x10d5bd2d0>



In [65]:

    
# Plotting histograms with A6

fig, ax = plt.subplots(1, 2, figsize=(20, 5))
crx_data[crx_data.Approved].A6.value_counts().plot(kind='bar', ax=ax[0])
crx_data[~crx_data.Approved].A6.value_counts().plot(kind='bar', ax=ax[1])









    Out[65]:





<matplotlib.axes._subplots.AxesSubplot at 0x10d4036d0>



In [66]:

    
# Plotting histograms with A7

fig, ax = plt.subplots(1, 2, figsize=(20, 5))
crx_data[crx_data.Approved].A7.value_counts().plot(kind='bar', ax=ax[0])
crx_data[~crx_data.Approved].A7.value_counts().plot(kind='bar', ax=ax[1])









    Out[66]:





<matplotlib.axes._subplots.AxesSubplot at 0x10cf13310>



In [67]:

    
# Plotting histograms with A9

fig, ax = plt.subplots(1, 2, figsize=(20, 5))
crx_data[crx_data.Approved].A9.value_counts().plot(kind='bar', ax=ax[0])
crx_data[~crx_data.Approved].A9.value_counts().plot(kind='bar', ax=ax[1])









    Out[67]:





<matplotlib.axes._subplots.AxesSubplot at 0x10dcd7790>



In [69]:

    
# Plotting histograms with A10

fig, ax = plt.subplots(1, 2, figsize=(20, 5))
crx_data[crx_data.Approved].A10.value_counts().plot(kind='bar', ax=ax[0])
crx_data[~crx_data.Approved].A10.value_counts().plot(kind='bar', ax=ax[1])









    Out[69]:





<matplotlib.axes._subplots.AxesSubplot at 0x10dbf8b10>



In [70]:

    
# Plotting histograms with A12

fig, ax = plt.subplots(1, 2, figsize=(20, 5))
crx_data[crx_data.Approved].A12.value_counts().plot(kind='bar', ax=ax[0])
crx_data[~crx_data.Approved].A12.value_counts().plot(kind='bar', ax=ax[1])









    Out[70]:





<matplotlib.axes._subplots.AxesSubplot at 0x10c1d8c50>



In [71]:

    
# Plotting histograms with A13

fig, ax = plt.subplots(1, 2, figsize=(20, 5))
crx_data[crx_data.Approved].A13.value_counts().plot(kind='bar', ax=ax[0])
crx_data[~crx_data.Approved].A13.value_counts().plot(kind='bar', ax=ax[1])









    Out[71]:





<matplotlib.axes._subplots.AxesSubplot at 0x10d0984d0>



In [57]:

    
# Import scatter_matrix functionality

from pandas.tools.plotting import scatter_matrix

# Generate a scatterplot matrix with the continuous variables

scat = scatter_matrix(crx_data[['A2', 'A3', 'A8', 'A11', 'A14', 'A15', 'Approved']], figsize=(15,15))

For the actual model, the submission Notebook should have the following -

1. Build Models using Logistics Regression and SVM (you will learn tonight - Wed)



In [27]:

    
# Generate x data frame from credit card dataset

x_data = crx_data[['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12', 'A13', 'A14', 'A15']]



In [28]:

    
# Create dummy variables

x_data = pd.get_dummies(x_data)
x_data









    Out[28]:






  
    
      
      A2
      A3
      A8
      A11
      A14
      A15
      A1_a
      A1_b
      A4_l
      A4_u
      ...
      A7_z
      A9_f
      A9_t
      A10_f
      A10_t
      A12_f
      A12_t
      A13_g
      A13_p
      A13_s
    
  
  
    
      0  
       30.83
        0.000
       1.25
       1
       202
         0
       0
       1
       0
       1
      ...
       0
       0
       1
       0
       1
       1
       0
       1
       0
       0
    
    
      1  
       58.67
        4.460
       3.04
       6
        43
       560
       1
       0
       0
       1
      ...
       0
       0
       1
       0
       1
       1
       0
       1
       0
       0
    
    
      2  
       24.50
        0.500
       1.50
       0
       280
       824
       1
       0
       0
       1
      ...
       0
       0
       1
       1
       0
       1
       0
       1
       0
       0
    
    
      3  
       27.83
        1.540
       3.75
       5
       100
         3
       0
       1
       0
       1
      ...
       0
       0
       1
       0
       1
       0
       1
       1
       0
       0
    
    
      4  
       20.17
        5.625
       1.71
       0
       120
         0
       0
       1
       0
       1
      ...
       0
       0
       1
       1
       0
       1
       0
       0
       0
       1
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      685
       21.08
       10.085
       1.25
       0
       260
         0
       0
       1
       0
       0
      ...
       0
       1
       0
       1
       0
       1
       0
       1
       0
       0
    
    
      686
       22.67
        0.750
       2.00
       2
       200
       394
       1
       0
       0
       1
      ...
       0
       1
       0
       0
       1
       0
       1
       1
       0
       0
    
    
      687
       25.25
       13.500
       2.00
       1
       200
         1
       1
       0
       0
       0
      ...
       0
       1
       0
       0
       1
       0
       1
       1
       0
       0
    
    
      688
       17.92
        0.205
       0.04
       0
       280
       750
       0
       1
       0
       1
      ...
       0
       1
       0
       1
       0
       1
       0
       1
       0
       0
    
    
      689
       35.00
        3.375
       8.29
       0
         0
         0
       0
       1
       0
       1
      ...
       0
       1
       0
       1
       0
       0
       1
       1
       0
       0
    
  

690 rows × 48 columns



In [29]:

    
# Generate y data frame from credit card dataset

y_data = crx_data['Approved']
y_data









    Out[29]:





0    True
1    True
2    True
...
687    False
688    False
689    False
Name: Approved, Length: 690, dtype: bool



In [30]:

    
# Import train_test_split from scikit-learn

from sklearn.cross_validation import train_test_split

# Divide the dataset into training and test data

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, random_state=12, test_size=0.2)



In [31]:

    
# Import logistic regression package

from sklearn.linear_model import LogisticRegression

# Create estimator with logistic regression

clf = LogisticRegression()

# Fit the model with the training data

clf.fit(x_train, y_train)

# Score the model using test data

clf.score(x_test, y_test)









    Out[31]:





0.8623188405797102



In [32]:

    
# Score of 0.86 seems to suggest that the model predicts the actual approval outcomes extremely well



In [33]:

    
# Import SVC from scikit-learn

from sklearn.svm import SVC

# Create estimator with non-linear kernel

est = SVC()

# Fit the model with the training data

est.fit(x_train, y_train)

# Score the model using test data

est.score(x_test, y_test)









    Out[33]:





0.54347826086956519



In [34]:

    
# Score of 0.54 suggests that the SVM approach does a good job of predicting approval outcomes
# but not as good as the linear regression

2. Use Grid Search to evaluate model parameters (Wed Lab) and select a model



In [35]:

    
# Import GridSearchCV from scikit-learn

from sklearn.grid_search import GridSearchCV

# Establish the search space for parameters of C and Gamma

param = {'C':np.logspace(-3,3,10),'gamma':np.logspace(-3,3,10)}

# Set up the grid search

gs = GridSearchCV(SVC(),param)

# Run the grid search on our model

gs.fit(x_train, y_train)









    Out[35]:





GridSearchCV(cv=None,
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'C': array([  1.00000e-03,   4.64159e-03,   2.15443e-02,   1.00000e-01,
         4.64159e-01,   2.15443e+00,   1.00000e+01,   4.64159e+01,
         2.15443e+02,   1.00000e+03]), 'gamma': array([  1.00000e-03,   4.64159e-03,   2.15443e-02,   1.00000e-01,
         4.64159e-01,   2.15443e+00,   1.00000e+01,   4.64159e+01,
         2.15443e+02,   1.00000e+03])},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)



In [36]:

    
# Display the parameters and score for the best fitting model

gs.best_params_,gs.best_score_









    Out[36]:





({'C': 46.415888336127729, 'gamma': 0.001}, 0.67210144927536231)



In [37]:

    
# GridSearchCV finds that the best fitting model has a C = 46.4 and a gamma = 0.001
# The score of 0.67 suggests that this model behaves better than the ordinary SVC model (with the defaults)
# but that it does worse than the logistic regression model

3. Build a Confusion Matrix (Mon Lab) to show how well your prediction did.



In [38]:

    
# Import confusion matrix and classification report from scikit-learn

from sklearn.metrics import confusion_matrix, classification_report

# Predict values of Y based on values of X

y_pred = clf.predict(x_test)



In [39]:

    
# Confusion Matrix for Type 1 and Type 2 Error

print confusion_matrix(y_test, y_pred)









    



[[63 14]
 [ 5 56]]



In [40]:

    
# The confusion matrix suggests that the SVM model accurately predicted 63 + 56 = 119 credit card approvals / denials
# But that it mis-categorized 5 + 14 = 19 approvals / denials, which are Type I or Type II errors



In [41]:

    
# Examine Precision and Recall

print classification_report(y_test, y_pred)









    



             precision    recall  f1-score   support

      False       0.93      0.82      0.87        77
       True       0.80      0.92      0.85        61

avg / total       0.87      0.86      0.86       138



In [42]:

    
# The precision - representing the classifier's ability to not label as positive a sample that is negative - is 0.80
# The recall - representing the classifier's ability to find all the positive samples - is 0.90

	A1	A2	A3	A4	A5	A6	A7	A8	A9	A10	A11	A12	A13	A14	A15	A16
0	b	30.83	0.000	u	g	w	v	1.25	t	t	1	f	g	202	0	+
1	a	58.67	4.460	u	g	q	h	3.04	t	t	6	f	g	43	560	+
2	a	24.5	0.500	u	g	q	h	1.50	t	f	0	f	g	280	824	+
3	b	27.83	1.540	u	g	w	v	3.75	t	t	5	t	g	100	3	+
4	b	20.17	5.625	u	g	w	v	1.71	t	f	0	f	s	120	0	+

	A2	A3	A8	A11	A14	A15
count	678.000000	690.000000	690.000000	690.00000	677.000000	690.000000
mean	31.568171	4.758725	2.223406	2.40000	184.014771	1017.385507
std	11.957862	4.978163	3.346513	4.86294	173.806768	5210.102598
min	13.750000	0.000000	0.000000	0.00000	0.000000	0.000000
25%	22.602500	1.000000	0.165000	0.00000	75.000000	0.000000
50%	28.460000	2.750000	1.000000	0.00000	160.000000	5.000000
75%	38.230000	7.207500	2.625000	3.00000	276.000000	395.500000
max	80.250000	28.000000	28.500000	67.00000	2000.000000	100000.000000

Homework 2:

For Homework 2, Build models to predict credit card approval using dataset http://archive.ics.uci.edu/ml/datasets/Credit+Approval

Part 1 - Data exploration

First, create 2 data frames: listings and bookings from their respective data files

You may need to do the following -

1. Impute missing data

2. Plot and visualize data to see any patterns

For the actual model, the submission Notebook should have the following -

1. Build Models using Logistics Regression and SVM (you will learn tonight - Wed)

2. Use Grid Search to evaluate model parameters (Wed Lab) and select a model

3. Build a Confusion Matrix (Mon Lab) to show how well your prediction did.

First, create 2 data frames: `listings` and `bookings` from their respective data files