""" For Homework 2, Build models to predict "Credit Card Approval" using dataset

http://archive.ics.uci.edu/ml/datasets/Credit+Approval

You may need to do the following -

  1. Impute missing data

  2. Plot and visualize data to see any patterns

For the actual model, the submission Notebook should have the following -

  1. Build models using Logistic Regression and SVM (you will learn tonight - Wed).

  2. Use Grid Search to evaluate model parameters (Wed Lab) and select a model

  3. Build a Confusion Matrix (Mon Lab) to show how well your prediction did.

The homework is due by Monday, Dec 15th, midnight. Upload your submission the same way as Homework 1. """

Obtain and clean dataset


In [1]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# This enables inline Plots
%matplotlib inline

# Limit rows displayed in notebook
pd.set_option('display.max_rows', 10)
pd.set_option('display.precision', 2)

In [2]:
pd.__version__


Out[2]:
'0.14.1'

In [3]:
!pwd


/home/Sam/DAT_SF_11/homeworks/hw2

In [4]:
crx_cols = ['A' + str(idx) for idx in range(1, 17)]
creditcheck = pd.read_csv('crx.data.txt', header=None, names=crx_cols)

In [5]:
creditcheck.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 690 entries, 0 to 689
Data columns (total 16 columns):
A1     690 non-null object
A2     690 non-null object
A3     690 non-null float64
A4     690 non-null object
A5     690 non-null object
A6     690 non-null object
A7     690 non-null object
A8     690 non-null float64
A9     690 non-null object
A10    690 non-null object
A11    690 non-null int64
A12    690 non-null object
A13    690 non-null object
A14    690 non-null object
A15    690 non-null int64
A16    690 non-null object
dtypes: float64(2), int64(2), object(12)

In [6]:
creditcheck.head(5)


Out[6]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
0 b 30.83 0.0 u g w v 1.2 t t 1 f g 00202 0 +
1 a 58.67 4.5 u g q h 3.0 t t 6 f g 00043 560 +
2 a 24.50 0.5 u g q h 1.5 t f 0 f g 00280 824 +
3 b 27.83 1.5 u g w v 3.8 t t 5 t g 00100 3 +
4 b 20.17 5.6 u g w v 1.7 t f 0 f s 00120 0 +

In [7]:
""" 
From documentation:
    Attribute Information:
    A1:    b, a.
    A2:    continuous.
    A3:    continuous.
    A4:    u, y, l, t.
    A5:    g, p, gg.
    A6:    c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
    A7:    v, h, bb, j, n, z, dd, ff, o.
    A8:    continuous.
    A9:    t, f.
    A10:   t, f.
    A11:   continuous.
    A12:   t, f.
    A13:   g, p, s.
    A14:   continuous.
    A15:   continuous.
    A16:   +,-         (class attribute)


  File "<ipython-input-7-ee7ab9009067>", line 19
    A16:   +,-         (class attribute)
                                        
^
SyntaxError: EOF while scanning triple-quoted string literal

In [8]:
# so we aren't seeing the missing values...
""" 
8. Missing Attribute Values:
   37 cases (5%) have one or more missing values. The missing values from particular attributes are:
   A1:  12
   A2:  12
   A4:   6
   A5:   6
   A6:   9
   A7:   9
   A14: 13
   
Missing value quantities appear to be turning their rows into strings instead of int / float.
"""


Out[8]:
' \n8. Missing Attribute Values:\n   37 cases (5%) have one or more missing values. The missing values from particular attributes are:\n   A1:  12\n   A2:  12\n   A4:   6\n   A5:   6\n   A6:   9\n   A7:   9\n   A14: 13\n   \nMissing value quantities appear to be turning their rows into strings instead of int / float.\n'
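An alternative worth noting (a sketch on a hypothetical inline sample, not crx.data itself): pd.read_csv accepts an na_values argument, which maps the '?' markers to NaN at load time, so the numeric columns parse directly as floats with no post-hoc replace() step.

```python
import io
import pandas as pd

# Hypothetical three-column sample mimicking crx.data; '?' marks missing values.
sample = "b,30.83,0\n?,58.67,560\na,?,824\n"
df = pd.read_csv(io.StringIO(sample), header=None,
                 names=['A1', 'A2', 'A15'], na_values='?')

# A2 now parses as float with NaN in place of '?', so no replace() step
# or astype(float) conversion is needed afterwards.
```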

Split out test data to avoid tainting scoring


In [9]:
# from sklearn.cross_validation import train_test_split
# NOTE train_test_split generates numpy arrays instead of pandas dataframes, so we're going to use something different at this step

In [10]:
rows = np.random.binomial(1, 0.7, size=len(creditcheck)).astype('bool')

In [11]:
credittrain = creditcheck[rows]
credittest = creditcheck[~rows]
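The boolean-mask split above can be made reproducible by seeding the random number generator; a minimal sketch (the seed value is an arbitrary choice):

```python
import numpy as np

# Seeding a dedicated RandomState makes the 70/30 mask reproducible run-to-run.
rng = np.random.RandomState(42)  # arbitrary seed
rows = rng.binomial(1, 0.7, size=690).astype(bool)

train_idx = np.where(rows)[0]
test_idx = np.where(~rows)[0]
# Every one of the 690 rows lands in exactly one partition.
```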

1. Impute missing data


In [12]:
credittrain.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 483 entries, 0 to 689
Data columns (total 16 columns):
A1     483 non-null object
A2     483 non-null object
A3     483 non-null float64
A4     483 non-null object
A5     483 non-null object
A6     483 non-null object
A7     483 non-null object
A8     483 non-null float64
A9     483 non-null object
A10    483 non-null object
A11    483 non-null int64
A12    483 non-null object
A13    483 non-null object
A14    483 non-null object
A15    483 non-null int64
A16    483 non-null object
dtypes: float64(2), int64(2), object(12)

In [13]:
credittrain.head(5)


Out[13]:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
0 b 30.83 0.0 u g w v 1.2 t t 1 f g 00202 0 +
1 a 58.67 4.5 u g q h 3.0 t t 6 f g 00043 560 +
2 a 24.50 0.5 u g q h 1.5 t f 0 f g 00280 824 +
4 b 20.17 5.6 u g w v 1.7 t f 0 f s 00120 0 +
6 b 33.17 1.0 u g r h 6.5 t f 0 t g 00164 31285 +

In [14]:
"""
From documentation, columns A1, A2, A4, A5, A6, A7, and A14 have missing values.
Based on .info() result, these missing values are not currently listed as NaN.
"""


Out[14]:
'\nFrom documentation, columns A1, A2, A4, A5, A6, A7, and A14 have missing values.\nBased on .info() result, these missing values are not currently listed as NaN.\n'

In [15]:
credittrain.A14.unique()
# checks were run on other relevant variables


Out[15]:
array(['00202', '00043', '00280', '00120', '00164', '00128', '00000',
       '00096', '00200', '00145', '00100', '00168', '00583', '00260',
       '00240', '00455', '00311', '00216', '00400', '00320', '00080',
       '00250', '00520', '00420', '?', '00980', '00160', '00180', '00140',
       '00288', '00300', '00928', '00188', '00171', '00268', '00167',
       '00152', '00360', '00410', '00274', '00375', '00408', '00350',
       '00204', '00040', '00399', '00093', '00060', '00070', '00181',
       '00393', '00021', '00029', '00440', '00102', '00431', '00370',
       '00024', '00020', '00129', '00195', '00144', '00380', '00050',
       '00381', '00150', '00117', '00056', '00211', '00156', '00022',
       '00228', '00519', '00253', '00220', '00088', '00073', '00121',
       '00470', '00136', '00292', '00154', '00272', '00340', '00720',
       '00112', '00450', '00500', '00232', '00170', '01160', '00411',
       '00348', '00480', '00640', '00372', '00352', '00132', '00141',
       '00178', '00600', '00550', '02000', '00225', '00210', '00108',
       '00356', '00045', '00062', '00092', '00174', '00086', '00254',
       '00028', '00263', '00333', '00312', '00371', '00099', '00252',
       '00760', '00130', '00680', '00163', '00208', '00330', '00290',
       '00432', '00032', '00186', '00349', '00396', '00224', '00369',
       '00076', '00231', '00309', '00465', '00256', '00176'], dtype=object)

In [16]:
"""
Missing values for A1/A4/A5/A6/A7 (all strings) are input as '?', 
for A2 are '?' and A2 is float, 
for A14 are '?' but A14 is continuous in specs and "Z5" in practice?
"""

# replace '?' with NaN for both numerics and strings


Out[16]:
'\nMissing values for A1/A4/A5/A6/A7 (all strings) are input as \'?\', \nfor A2 are \'?\' and A2 is float, \nfor A14 are \'?\' but A14 is continuous in specs and "Z5" in practice?\n'

In [17]:
credittrain['A1'].value_counts()


Out[17]:
b    320
a    153
?     10
dtype: int64

In [18]:
credittrainfull = credittrain.replace(to_replace='?',value=np.nan)
credittrainfull.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 483 entries, 0 to 689
Data columns (total 16 columns):
A1     473 non-null object
A2     477 non-null object
A3     483 non-null float64
A4     480 non-null object
A5     480 non-null object
A6     478 non-null object
A7     478 non-null object
A8     483 non-null float64
A9     483 non-null object
A10    483 non-null object
A11    483 non-null int64
A12    483 non-null object
A13    483 non-null object
A14    475 non-null object
A15    483 non-null int64
A16    483 non-null object
dtypes: float64(2), int64(2), object(12)

In [19]:
# convert A2 and A14 to numerics, now that the '?'s are gone
credittrainfull.A2 = credittrainfull.A2.astype(float)
credittrainfull.A14 = credittrainfull.A14.astype(float)

credittrainfull.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 483 entries, 0 to 689
Data columns (total 16 columns):
A1     473 non-null object
A2     477 non-null float64
A3     483 non-null float64
A4     480 non-null object
A5     480 non-null object
A6     478 non-null object
A7     478 non-null object
A8     483 non-null float64
A9     483 non-null object
A10    483 non-null object
A11    483 non-null int64
A12    483 non-null object
A13    483 non-null object
A14    475 non-null float64
A15    483 non-null int64
A16    483 non-null object
dtypes: float64(4), int64(2), object(10)

Now we can impute over the NaNs: for the string columns (A1, A4, A5, A6, A7) we draw replacements from the observed category distributions, and for the numerics (A2, A14) we draw from normals fitted to the observed means and standard deviations.

Note that A1, 4, and 5 are heavily concentrated in one category, while 6 and 7 have more spread.


In [20]:
col_dist = {}
def get_col_dist(col_name):
    # the '?' markers were already replaced with NaN above, so filter with dropna()
    non_null = credittrainfull[col_name].dropna()
    counts = non_null.value_counts()
    col_data = {}
    col_data['prob'] = (counts / float(non_null.size)).values
    col_data['values'] = counts.index.values
    return col_data

In [21]:
col_dist['A1'] = get_col_dist('A1')
col_dist['A4'] = get_col_dist('A4')
col_dist['A5'] = get_col_dist('A5')
col_dist['A6'] = get_col_dist('A6')
col_dist['A7'] = get_col_dist('A7')

In [22]:
def impute_cols(val, options):
    # missing entries are NaN at this point (not '?'), so test with pd.isnull
    if pd.isnull(val):
        return np.random.choice(options['values'], p=options['prob'])
    return val
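The same distribution-weighted imputation idea can be exercised end-to-end on a toy series; a minimal sketch (the series and seed are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy column: 'b' observed 3x, 'a' 1x, plus two missing entries.
s = pd.Series(['b', 'b', 'b', 'a', np.nan, np.nan])

# Probability of each observed category among the non-missing rows.
probs = s.dropna().value_counts(normalize=True)

# Fill each NaN with a draw from that observed distribution.
rng = np.random.RandomState(0)
filled = s.copy()
filled[s.isnull()] = rng.choice(probs.index.values,
                                size=int(s.isnull().sum()),
                                p=probs.values)
```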

In [23]:
def impute_a1(val):
    return impute_cols(val, col_dist['A1'])

def impute_a4(val):
    return impute_cols(val, col_dist['A4'])

def impute_a5(val):
    return impute_cols(val, col_dist['A5'])

def impute_a6(val):
    return impute_cols(val, col_dist['A6'])

def impute_a7(val):
    return impute_cols(val, col_dist['A7'])

In [24]:
credittrainfull['A1imp'] = credittrainfull.A1.map(impute_a1)
credittrainfull['A4imp'] = credittrainfull.A4.map(impute_a4)
credittrainfull['A5imp'] = credittrainfull.A5.map(impute_a5)
credittrainfull['A6imp'] = credittrainfull.A6.map(impute_a6)
credittrainfull['A7imp'] = credittrainfull.A7.map(impute_a7)

In [25]:
# Imputing the numeric vars in place because I'm afraid of breaking the function from class...

def impute_numeric_cols(col_data):
    # draw imputed values from a normal fitted to the observed mean / std
    na_row_count = col_data.isnull().sum()
    impute_vals = np.random.normal(col_data.mean(), col_data.std(), na_row_count)
    return impute_vals

In [26]:
A2_rows_mask = credittrainfull['A2'].isnull()
credittrainfull.loc[A2_rows_mask, 'A2'] = impute_numeric_cols(credittrainfull['A2'])

A14_rows_mask = credittrainfull['A14'].isnull()
credittrainfull.loc[A14_rows_mask, 'A14'] = impute_numeric_cols(credittrainfull['A14'])

credittrainfull.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 483 entries, 0 to 689
Data columns (total 21 columns):
A1       473 non-null object
A2       483 non-null float64
A3       483 non-null float64
A4       480 non-null object
A5       480 non-null object
A6       478 non-null object
A7       478 non-null object
A8       483 non-null float64
A9       483 non-null object
A10      483 non-null object
A11      483 non-null int64
A12      483 non-null object
A13      483 non-null object
A14      483 non-null float64
A15      483 non-null int64
A16      483 non-null object
A1imp    483 non-null object
A4imp    483 non-null object
A5imp    483 non-null object
A6imp    483 non-null object
A7imp    483 non-null object
dtypes: float64(4), int64(2), object(15)
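One caveat with drawing numeric imputations from a fitted normal: the draws can fall below zero even for a non-negative field like A14. A minimal sketch (toy values, not the real column) that clips the draws at zero:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a non-negative column like A14, with two missing entries.
col = pd.Series([202.0, 43.0, 280.0, np.nan, 120.0, np.nan])

rng = np.random.RandomState(1)
draws = rng.normal(col.mean(), col.std(), int(col.isnull().sum()))

# Clip at zero so the imputed values respect the column's natural floor.
col[col.isnull()] = np.clip(draws, 0, None)
```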

In [27]:
"""
# This was my own attempt over the weekend:

# Manually find the most requent values (for strings) and the averages (for numerics).
# I didn't see an easy way to automate this without moving from pandas back to numpy/sklearn, but I'm sure one exists.
print 'A1', credittrainfull.A1.value_counts()[0:1]
print 'A2', credittrainfull.A2.mean
print 'A4', credittrainfull.A4.value_counts()[0:1]
print 'A5', credittrainfull.A5.value_counts()[0:1]
print 'A6', credittrainfull.A6.value_counts()[0:1]
print 'A7', credittrainfull.A7.value_counts()[0:1]
print 'A14', credittrainfull.A14.mean
"""


Out[27]:
"\n# This was my own attempt over the weekend:\n\n# Manually find the most requent values (for strings) and the averages (for numerics).\n# I didn't see an easy way to automate this without moving from pandas back to numpy/sklearn, but I'm sure one exists.\nprint 'A1', credittrainfull.A1.value_counts()[0:1]\nprint 'A2', credittrainfull.A2.mean\nprint 'A4', credittrainfull.A4.value_counts()[0:1]\nprint 'A5', credittrainfull.A5.value_counts()[0:1]\nprint 'A6', credittrainfull.A6.value_counts()[0:1]\nprint 'A7', credittrainfull.A7.value_counts()[0:1]\nprint 'A14', credittrainfull.A14.mean\n"

In [28]:
"""
# We create a dictionary with each variable's desired fillna value.
dfill = {}
dfill['A1'] = 'b'
dfill['A2'] = 30.8
dfill['A4'] = 'u'
dfill['A5'] = 'g'
dfill['A6'] = 'c'
dfill['A7'] = 'v'
dfill['A14'] = 202

dfill
"""


Out[28]:
"\n# We create a dictionary with each variable's desired fillna value.\ndfill = {}\ndfill['A1'] = 'b'\ndfill['A2'] = 30.8\ndfill['A4'] = 'u'\ndfill['A5'] = 'g'\ndfill['A6'] = 'c'\ndfill['A7'] = 'v'\ndfill['A14'] = 202\n\ndfill\n"

In [29]:
"""
credittrainfull['A1imp'] = credittrainfull.A1.fillna(dfill['A1'])
credittrainfull['A2imp'] = credittrainfull.A2.fillna(dfill['A2'])
credittrainfull['A4imp'] = credittrainfull.A4.fillna(dfill['A4'])
credittrainfull['A5imp'] = credittrainfull.A5.fillna(dfill['A5'])
credittrainfull['A6imp'] = credittrainfull.A6.fillna(dfill['A6'])
credittrainfull['A7imp'] = credittrainfull.A7.fillna(dfill['A7'])
credittrainfull['A14imp'] = credittrainfull.A14.fillna(dfill['A14'])

credittrainfull.info()
"""


Out[29]:
"\ncredittrainfull['A1imp'] = credittrainfull.A1.fillna(dfill['A1'])\ncredittrainfull['A2imp'] = credittrainfull.A2.fillna(dfill['A2'])\ncredittrainfull['A4imp'] = credittrainfull.A4.fillna(dfill['A4'])\ncredittrainfull['A5imp'] = credittrainfull.A5.fillna(dfill['A5'])\ncredittrainfull['A6imp'] = credittrainfull.A6.fillna(dfill['A6'])\ncredittrainfull['A7imp'] = credittrainfull.A7.fillna(dfill['A7'])\ncredittrainfull['A14imp'] = credittrainfull.A14.fillna(dfill['A14'])\n\ncredittrainfull.info()\n"

2. Plot and visualize data to see any patterns


In [30]:
grid_plot = sns.FacetGrid(credittrainfull, row='A9', col='A12')
grid_plot.map(sns.regplot, 'A2', 'A3', color='.3', fit_reg=False)


Out[30]:
<seaborn.axisgrid.FacetGrid at 0x1644c6d8>

In [31]:
credittrainnum = credittrainfull.ix[:,['A2','A3','A8','A11','A14','A15']]

from pandas.tools.plotting import scatter_matrix
scatter_matrix(credittrainnum, alpha=0.2, figsize=(15, 10), diagonal='hist')


Out[31]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000016E39588>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000016D3F438>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000016F84208>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000017612358>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000000176CAF98>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000016D582B0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000000178BF208>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000001797C9E8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000017A702B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000017B6CD68>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000017BD19E8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000017CDC668>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000000017C04208>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000017E9F748>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000017F5AE48>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000017FEF630>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000000180E8D30>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000001804C208>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000000182A8390>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000000183A9A90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000001845D0B8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000001855D898>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000000186075C0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000018701EF0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000001880E630>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000000188C1320>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000000189C1A20>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000018AB0940>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000018BBC160>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000018ACE7B8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000000018D2C6A0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000018E28DA0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000018F1D3C8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000018F95BA8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000000190718D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000001919A2E8>]], dtype=object)

Building models to classify the data

We'll need to start by turning the categorical variables into sets of dummies, using the schema from the documentation:

""" A1: b, a. -- could be boolean, but since b is more common than a, ambiguous as to which should be True A4: u, y, l, t. A5: g, p, gg. A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff. A7: v, h, bb, j, n, z, dd, ff, o. A9: t, f. [replace with boolean] A10: t, f. [replace with boolean] A12: t, f. [replace with boolean] A13: g, p, s. Noting that A1 / 4/ 5/ 6/ 7 are all imputed """


In [32]:
A1_dummies = pd.get_dummies(credittrainfull.A1imp, prefix='A1')
A4_dummies = pd.get_dummies(credittrainfull.A4imp, prefix='A4')
A5_dummies = pd.get_dummies(credittrainfull.A5imp, prefix='A5')
A6_dummies = pd.get_dummies(credittrainfull.A6imp, prefix='A6')
A7_dummies = pd.get_dummies(credittrainfull.A7imp, prefix='A7')
A13_dummies = pd.get_dummies(credittrainfull.A13, prefix='A13')

creditmodel = credittrainfull.drop(['A1','A2','A4','A5','A6','A7','A13','A14','A1imp','A4imp','A5imp','A6imp','A7imp'],1)

creditmodel = creditmodel.merge(A1_dummies,left_index=True, right_index=True)
creditmodel = creditmodel.merge(A4_dummies,left_index=True, right_index=True)
creditmodel = creditmodel.merge(A5_dummies,left_index=True, right_index=True)
creditmodel = creditmodel.merge(A6_dummies,left_index=True, right_index=True)
creditmodel = creditmodel.merge(A7_dummies,left_index=True, right_index=True)
creditmodel = creditmodel.merge(A13_dummies,left_index=True, right_index=True)

creditmodel.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 483 entries, 0 to 689
Data columns (total 42 columns):
A3       483 non-null float64
A8       483 non-null float64
A9       483 non-null object
A10      483 non-null object
A11      483 non-null int64
A12      483 non-null object
A15      483 non-null int64
A16      483 non-null object
A1_a     483 non-null float64
A1_b     483 non-null float64
A4_l     483 non-null float64
A4_u     483 non-null float64
A4_y     483 non-null float64
A5_g     483 non-null float64
A5_gg    483 non-null float64
A5_p     483 non-null float64
A6_aa    483 non-null float64
A6_c     483 non-null float64
A6_cc    483 non-null float64
A6_d     483 non-null float64
A6_e     483 non-null float64
A6_ff    483 non-null float64
A6_i     483 non-null float64
A6_j     483 non-null float64
A6_k     483 non-null float64
A6_m     483 non-null float64
A6_q     483 non-null float64
A6_r     483 non-null float64
A6_w     483 non-null float64
A6_x     483 non-null float64
A7_bb    483 non-null float64
A7_dd    483 non-null float64
A7_ff    483 non-null float64
A7_h     483 non-null float64
A7_j     483 non-null float64
A7_n     483 non-null float64
A7_o     483 non-null float64
A7_v     483 non-null float64
A7_z     483 non-null float64
A13_g    483 non-null float64
A13_p    483 non-null float64
A13_s    483 non-null float64
dtypes: float64(36), int64(2), object(4)
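A side note on the dummy encoding above: keeping every level (e.g. both A1_a and A1_b) makes the columns perfectly collinear, which can destabilize linear models. Later pandas versions (newer than the 0.14.1 used here) accept drop_first=True to omit one reference level per category:

```python
import pandas as pd

# With drop_first=True only 'A1_b' survives; 'A1_a' is implied when A1_b == 0.
s = pd.Series(['b', 'a', 'b', 'b'])
dummies = pd.get_dummies(s, prefix='A1', drop_first=True)
```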

In [33]:
bdict = {'t' : True, 'f' : False}

creditmodel['A9bool'] = creditmodel['A9'].map(bdict)
creditmodel['A10bool'] = creditmodel['A10'].map(bdict)
creditmodel['A12bool'] = creditmodel['A12'].map(bdict)

creditmodel.drop(['A9','A10','A12'],1,inplace=True)

creditmodel['A16'] = creditmodel['A16'].replace('-', 0)
creditmodel['A16'] = creditmodel['A16'].replace('+', 1)

In [34]:
# And now we're ready to move back into SKLEARN!
from sklearn.cross_validation import train_test_split

creditmodelout = creditmodel['A16']
creditmodelin = creditmodel.drop('A16',1)

# we'll start with a train/validation split, using the cleaned data
X_train, X_test, y_train, y_test = train_test_split(creditmodelin,creditmodelout,test_size=0.2)
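Since the +/- classes are not perfectly balanced, a stratified split would keep the class ratio identical in the train and validation sets. Newer scikit-learn releases support this directly via the stratify argument (a sketch on toy data):

```python
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older releases

# Toy 60/40 class balance.
y = np.array([0] * 60 + [1] * 40)
X = np.arange(100).reshape(-1, 1)

# stratify=y preserves the 60/40 ratio in both resulting partitions.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
```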

1. Logistic regression / SVM


In [35]:
from sklearn.svm import LinearSVC

est = LinearSVC(C=1e-3)
est.fit(X_train, y_train)


Out[35]:
LinearSVC(C=0.001, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
     random_state=None, tol=0.0001, verbose=0)

In [36]:
est.score(X_test, y_test)


Out[36]:
0.79381443298969068

Pretty good! Now let's look at a non-linear (rbf) kernel!


In [37]:
from sklearn.svm import SVC

my_svc = SVC()
my_svc.fit(X_train, y_train)


Out[37]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [38]:
my_svc.score(X_test, y_test)


Out[38]:
0.68041237113402064

Hmm, it's worse than the linear kernel! Maybe there's a parameter issue we aren't seeing -- good motivation to move on to grid search.
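One likely culprit (an assumption, not verified against this dataset): the features are unscaled -- A15 ranges into the tens of thousands while the dummy columns are 0/1, and RBF kernels are very sensitive to feature scale. A sketch of standardizing before fitting, on made-up data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data: one 0/1 feature (like a dummy column) that fully determines the
# label, next to a feature on a huge scale (like A15).
rng = np.random.RandomState(0)
X = np.column_stack([rng.randint(0, 2, 200), rng.randn(200) * 10000])
y = (X[:, 0] == 1).astype(int)

# Fit the scaler on the training rows only, then apply it to both partitions.
scaler = StandardScaler().fit(X[:150])
X_train_s = scaler.transform(X[:150])
X_test_s = scaler.transform(X[150:])

score = SVC(kernel='rbf').fit(X_train_s, y[:150]).score(X_test_s, y[150:])
```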


In [39]:
from sklearn.grid_search import GridSearchCV

In [40]:
# first with linear
d_l = {'C': np.logspace(-3.,3.,10)}

gs_l = GridSearchCV(LinearSVC(), d_l)

In [41]:
gs_l.fit(X_train,y_train)


Out[41]:
GridSearchCV(cv=None,
       estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
     random_state=None, tol=0.0001, verbose=0),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'C': array([  1.00000e-03,   4.64159e-03,   2.15443e-02,   1.00000e-01,
         4.64159e-01,   2.15443e+00,   1.00000e+01,   4.64159e+01,
         2.15443e+02,   1.00000e+03])},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

In [42]:
gs_l.best_params_, gs_l.best_score_


Out[42]:
({'C': 0.10000000000000001}, 0.84455958549222798)

This looks like an improvement on the linear model with C=0.001 that we used initially.
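A detail worth remembering about GridSearchCV: because refit=True by default, the fitted grid-search object exposes the best estimator directly, so it can score held-out data itself (sketch on toy data; the import path is sklearn.model_selection in newer scikit-learn releases):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older releases
from sklearn.svm import LinearSVC

# Toy linearly separable problem.
rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

gs = GridSearchCV(LinearSVC(), {'C': np.logspace(-3, 3, 5)}, cv=3)
gs.fit(X[:80], y[:80])

# score() delegates to the refit best estimator -- no manual refit needed.
holdout_score = gs.score(X[80:], y[80:])
```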


In [43]:
# Now with SVC
d = {}
d['C'] = np.logspace(-3.,3.,10)
d['gamma'] = np.logspace(-3.,3.,10)

gs_rbf = GridSearchCV(SVC(), d)

In [44]:
gs_rbf.fit(X_train,y_train)


Out[44]:
GridSearchCV(cv=None,
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'C': array([  1.00000e-03,   4.64159e-03,   2.15443e-02,   1.00000e-01,
         4.64159e-01,   2.15443e+00,   1.00000e+01,   4.64159e+01,
         2.15443e+02,   1.00000e+03]), 'gamma': array([  1.00000e-03,   4.64159e-03,   2.15443e-02,   1.00000e-01,
         4.64159e-01,   2.15443e+00,   1.00000e+01,   4.64159e+01,
         2.15443e+02,   1.00000e+03])},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

In [45]:
gs_rbf.best_params_, gs_rbf.best_score_


Out[45]:
({'C': 46.415888336127729, 'gamma': 0.0046415888336127772},
 0.76683937823834192)

So that's also an improvement on the default parameters, but not enough to catch the linear estimator. If I have enough time, I may try one or more of the alternative kernel function types built into SVC (‘poly’ or ‘sigmoid’)


In [46]:
sig_svc = SVC(kernel='sigmoid')

sig_svc.fit(X_train, y_train)


Out[46]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='sigmoid', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [47]:
sig_svc.score(X_test, y_test)


Out[47]:
0.50515463917525771

In [48]:
gs_sig = GridSearchCV(SVC(kernel='sigmoid'), d)

In [49]:
gs_sig.fit(X_train,y_train)


Out[49]:
GridSearchCV(cv=None,
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='sigmoid', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'C': array([  1.00000e-03,   4.64159e-03,   2.15443e-02,   1.00000e-01,
         4.64159e-01,   2.15443e+00,   1.00000e+01,   4.64159e+01,
         2.15443e+02,   1.00000e+03]), 'gamma': array([  1.00000e-03,   4.64159e-03,   2.15443e-02,   1.00000e-01,
         4.64159e-01,   2.15443e+00,   1.00000e+01,   4.64159e+01,
         2.15443e+02,   1.00000e+03])},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

In [50]:
gs_sig.best_params_, gs_sig.best_score_


Out[50]:
({'C': 0.001, 'gamma': 0.001}, 0.55440414507772018)

Next step: scoring on the 'true' test data (credittest, from the first section)


In [51]:
# but the big question here is: how much cleaning (i.e. fixing variable types, creating dummies) 
# should we be doing with the credittest df

In [52]:
"""
I'm out of time, but it looks to me like we should:
- convert '?'s into NaNs, so that A2 and A14 show up correctly as numeric
- convert A9, A10, and A12 to booleans
- run our two main types of model (linear, SVC) with the parameters determined through grid search
"


  File "<ipython-input-52-f4e0566e30d8>", line 6
    "
     
^
SyntaxError: EOF while scanning triple-quoted string literal

3. Build a Confusion Matrix (Mon Lab) to show how well your prediction did.


In [53]:
from sklearn.metrics import confusion_matrix

In [54]:
# this function takes the following form:
# confusion_matrix(y_true,y_pred)

# We have y_true as the y_test array from train_test_split (using the validation data)
# y_pred is probably some method call on my_SVC, with X_test as the input parameter

# in other words, we'll need to run this function separately for each estimator type

In [55]:
est_pred = est.predict(X_test)

cm_l = confusion_matrix(y_test, est_pred)
print cm_l


[[45  4]
 [16 32]]
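To read these matrices: scikit-learn's convention puts true labels on the rows and predictions on the columns, so cm[0, 0] counts true negatives and cm[1, 1] true positives, and overall accuracy is the diagonal over the total. A tiny worked example:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])

# rows = true labels, columns = predictions
cm = confusion_matrix(y_true, y_pred)   # [[1, 1], [1, 2]]

# Accuracy: correctly classified (diagonal) over all cases.
accuracy = np.trace(cm) / float(cm.sum())  # 3 / 5 = 0.6
```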

In [56]:
my_svc_pred = my_svc.predict(X_test)

cm_rbf = confusion_matrix(y_test, my_svc_pred)
print cm_rbf


[[33 16]
 [15 33]]

In [57]:
"""
For a pretty plot of confusion matrix 'cm':
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
"""


Out[57]:
"\nFor a pretty plot of confusion matrix 'cm':\nplt.matshow(cm)\nplt.title('Confusion matrix')\nplt.colorbar()\nplt.ylabel('True label')\nplt.xlabel('Predicted label')\nplt.show()\n"

In [58]:
plt.matshow(cm_l)
plt.title('Confusion matrix - linear estimator')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()



In [59]:
plt.matshow(cm_rbf)
plt.title('Confusion matrix - rbf estimator')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()


Note that reading the two diagrams side by side suggests an improper conclusion (that rbf is producing the more accurate estimate), because the color bar shading each quadrant is not scaled to a zero minimum shared across both plots...
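To make the two plots visually comparable, both could share one color scale with a zero floor and a common maximum (a sketch reusing the matrices printed above; the Agg backend lets the sketch run headlessly):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch
import matplotlib.pyplot as plt

cm_l = np.array([[45, 4], [16, 32]])
cm_rbf = np.array([[33, 16], [15, 33]])

# Shared scale: zero floor, common ceiling across both matrices.
vmax = max(cm_l.max(), cm_rbf.max())
for cm, name in [(cm_l, 'linear'), (cm_rbf, 'rbf')]:
    plt.matshow(cm, vmin=0, vmax=vmax)
    plt.title('Confusion matrix - %s estimator' % name)
    plt.colorbar()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
```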


In [ ]: