Chapter 4 Classification

Concepts and data from "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013), with permission from the authors G. James, D. Witten, T. Hastie and R. Tibshirani, available at www.StatLearning.com.

For Tables reference see http://data8.org/datascience/tables.html


In [63]:
# HIDDEN
# This useful nonsense should just go at the top of your notebook.
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
import numpy as np
from sklearn import linear_model
plots.style.use('fivethirtyeight')
plots.rc('lines', linewidth=1, color='r')
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
# datascience version number of last run of this notebook
version.__version__


import sys
sys.path.append("..")
from ml_table import ML_Table

import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')


Out[63]:
'en_US.UTF-8'

1-dimensional simulation of classification by logistic regression

In logistic regression, we use the logistic function,

$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$

We seek estimates for $\beta_0$ and $\beta_1$ such that the predicted probability $\hat{p}(x_i)$ for each input corresponds as closely as possible to the observed class for that input.

The estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to maximize the likelihood function.
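
Concretely, the estimates maximize the likelihood of the observed classes,

$\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} \left(1 - p(x_{i'})\right)$

The logistic function itself is just a squashed linear model; a standalone sketch (for illustration only, not part of ML_Table):

In [ ]:
# The logistic function maps the linear predictor b0 + b1*x into (0, 1)
def logistic(x, b0, b1):
    return 1 / (1 + np.exp(-(b0 + b1 * x)))

logistic(0.0, -1.0, 4.0)  # probability at x = 0 for illustrative parameters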


In [64]:
# Simulation of a process that produces a 1d training set
n1 = 100
eps1 = 0.2
test1 = ML_Table.runiform('ix', n1)
# Categories
test1['Cat'] = test1.apply(lambda x: 'A' if x < 0.2 else 'B', 'ix')
# Noise in the relationship of input to category
test1['x'] = test1['ix'] + eps1*np.random.normal(size=n1)
test1['Class'] = test1.apply(lambda x: 0 if x == 'A' else 1, 'Cat')
test1 = test1.drop('ix')
test1.scatter('x', 'Class')


We could look at this in terms of the distribution of each category over the input, but that does not capture the likelihood of each category given the input.


In [65]:
test1.pivot_hist('Cat', 'x')


This is the corresponding density of each category over the input:


In [66]:
cc = test1.density('Cat', 'x').scatter('x')


Logistic regression to predict 'Class', given input 'x'


In [67]:
logit1d = test1.logit_regression('Class', 'x')

In [68]:
# Visualize the accuracy of the classifier on the training set (training error)
test1.plot_fit_1d('Class', 'x', logit1d.model)


Out[68]:
<matplotlib.axes._subplots.AxesSubplot at 0x119d755f8>

In [69]:
test1.classification_error_model('Class', logit1d.model, 'x')


Out[69]:
0.05
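
This training error is just the fraction of points whose predicted class differs from the observed one. A minimal sketch of the same computation, assuming logit1d.model(x) returns the hard 0/1 prediction (as the fit plot above suggests):

In [ ]:
# Fraction of training points misclassified by the fitted model
np.mean([logit1d.model(x) != c for x, c in zip(test1['x'], test1['Class'])])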

In [70]:
test1.plot_fit_1d('Class', 'x', logit1d.likelihood)


Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x11aacd9e8>

In [71]:
logit1d.params


Out[71]:
(-0.9462124581477489, array([ 3.84156702]))

In [72]:
# Compute the cutting "plane" (in 1d, a single point) for the classifier:
# p(x) = 0.5 where x = -b0/b1
p50 = -logit1d.params[0]/logit1d.params[1][0]
p50


Out[72]:
0.24630898125576017
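
This follows from setting the log-odds to zero: $\log \frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x = 0$ exactly when $x = -\beta_0 / \beta_1$, which is where the likelihood crosses 0.5 below.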

In [73]:
logit1d.likelihood(-1), logit1d.likelihood(p50), logit1d.likelihood(1)


Out[73]:
(0.0082621048504563765, 0.5, 0.94761631780288358)

2D Logistic Regression of simulated data

In 2D, logistic regression finds a line that splits the plane.
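
With two inputs the model is $p(x, y) = \frac{e^{\beta_0 + \beta_1 x + \beta_2 y}}{1 + e^{\beta_0 + \beta_1 x + \beta_2 y}}$, so the $p = 0.5$ boundary is the line $\beta_0 + \beta_1 x + \beta_2 y = 0$.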


In [12]:
n = 200
eps = 0.1
test2 = ML_Table.runiform('ix', n)
test2['iy'] = np.random.rand(n)
test2['Cat'] = test2.apply(lambda x, y: 'A' if x + y < 0 else 'B', ['ix', 'iy'])
test2['Class A'] = test2.apply(lambda x: 1 if x == 'A' else 0, 'Cat')
test2['x'] = test2['ix'] + eps*np.random.normal(size=n)
test2['y'] = test2['iy'] + eps*np.random.normal(size=n)

In [13]:
test2.pivot_scatter('Cat', 'x', 'y')



In [14]:
logit2d = test2.logit_regression('Class A', ['x', 'y'])
model_2d = logit2d.model

In [15]:
test2.plot_fit_2d('Class A', 'x', 'y', model_2d)


Out[15]:
<matplotlib.axes._subplots.Axes3DSubplot at 0x114e5fa20>

In [74]:
test2.plot_cut_2d('Class A', 'x', 'y', model_2d, n_grid=50)


Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x11aa7d630>
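
The cut shown above is where the predicted probability crosses 0.5, and it can be recovered directly from the fitted parameters. A sketch, assuming logit2d.params has the same (intercept, coefficient array) layout as the 1d fit above:

In [ ]:
# Recover the implied cut line y = -(b0 + bx*x) / by from the parameters
b0, (bx, by) = logit2d.params
cut_y = lambda x: -(b0 + bx * x) / by
cut_y(0.0), cut_y(1.0)  # the boundary's endpoints over the unit interval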

In [17]:
# error rate
test2.classification_error_model('Class A', model_2d, ['x', 'y'])


Out[17]:
0.05

In [18]:
test2.plot_cut_2d('Class A', 'x', 'y', logit2d.likelihood, n_grid=50)


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x11631c0f0>

In [19]:
knn_reg = test2.knn_regression('Class A', ['x', 'y'], n_neighbors=3)
test2.plot_cut_2d('Class A', 'x', 'y', knn_reg.model, n_grid=50, levels=[0,1])
test2.classification_error_model('Class A', knn_reg.model, ['x', 'y'])


Out[19]:
0.085
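
kNN regression predicts the average of the k nearest labels, a fraction between 0 and 1, so thresholding at 0.5 turns it into a classifier; the jagged cut above reflects those local averages. A hypothetical probe of one point, assuming the fitted model is callable on (x, y) as plot_cut_2d suggests:

In [ ]:
# The kNN prediction at a point is a neighbor-label average; >= 0.5 means class 1
knn_reg.model(0.5, 0.5)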

In [ ]:

Classification with logistic regression


In [20]:
raw_default = ML_Table.read_table("data/Default.csv")
raw_default


Out[20]:
Unnamed: 0 default student balance income
1 No No 729.526 44361.6
2 No Yes 817.18 12106.1
3 No No 1073.55 31767.1
4 No No 529.251 35704.5
5 No No 785.656 38463.5
6 No Yes 919.589 7491.56
7 No No 825.513 24905.2
8 No Yes 808.668 17600.5
9 No No 1161.06 37468.5
10 No No 0 29275.3

... (9990 rows omitted)


In [21]:
default = raw_default.drop('Unnamed: 0')
default['Default'] = np.where(default['default']=='Yes', 1, 0)
default['Student'] = np.where(default['student']=='Yes', 1, 0)
default


Out[21]:
default student balance income Default Student
No No 729.526 44361.6 0 0
No Yes 817.18 12106.1 0 1
No No 1073.55 31767.1 0 0
No No 529.251 35704.5 0 0
No No 785.656 38463.5 0 0
No Yes 919.589 7491.56 0 1
No No 825.513 24905.2 0 0
No Yes 808.668 17600.5 0 1
No No 1161.06 37468.5 0 0
No No 0 29275.3 0 0

... (9990 rows omitted)


In [22]:
# Look at the trend in the data
default.density('default', 'balance').scatter('balance')



In [23]:
# Look at the trend in the data
default.density('default', 'income').scatter('income')



In [24]:
# Predict default based on balance
default_balance = default.logit_regression('Default', 'balance')

In [25]:
default_balance.summary()


Out[25]:
Param Coefficient
Intercept -9.46507
balance 0.00478248

In [26]:
# Default when balance gets too high
default.plot_fit_1d('Default', 'balance', default_balance.model, connect=False)


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x116d2d470>

In [27]:
default.classification_error_model('Default', default_balance.model, 'balance')


Out[27]:
0.0274

In [28]:
# How impressive is this error rate? A classifier that always predicts 'No'
# would err at the overall default rate:
default.where('Default').num_rows/default.num_rows


Out[28]:
0.0333

In [29]:
default.plot_fit_1d('Default', 'balance', default_balance.likelihood, connect=False)


Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x117812e10>

In [30]:
default_balance.likelihood(1000), default_balance.likelihood(2000)


Out[30]:
(0.0091701599811269373, 0.52495160381519868)

In [31]:
default_balance.obj.decision_function([[1000], [2000]])


Out[31]:
array([-4.68258808,  0.09988939])
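
The decision function is the linear predictor $\beta_0 + \beta_1 x$; passing it through the logistic function reproduces the likelihoods above:

In [ ]:
# sigmoid of the linear predictor matches likelihood(1000) and likelihood(2000)
1 / (1 + np.exp(-default_balance.obj.decision_function([[1000], [2000]])))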

In [32]:
default.pivot_scatter('Default', 'income', 'balance')



In [33]:
# Predict default based on balance and income
default_BI = default.logit_regression('Default', ['balance', 'income'])

In [34]:
default_BI.summary()


Out[34]:
Param Coefficient
Intercept -1.94164e-06
balance 0.000407565
income -0.000125881

In [35]:
default_BI.obj.decision_function([[10000, 1000]])


Out[35]:
array([ 3.94976362])

In [36]:
# This classifier does not seem to work at all on this data
default.plot_cut_2d('Default', 'balance','income', default_BI.model, levels=[0,1])


Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x11783fa20>

In [37]:
# It essentially predicts 'No' for everyone, discarding all the defaults
default.classification_error_model('Default', default_BI.model, ['balance', 'income'])


Out[37]:
0.0336
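
One plausible explanation (an assumption about ML_Table's internals, since the notebook imports sklearn's linear_model): sklearn's LogisticRegression regularizes by default, and with unscaled inputs on the order of tens of thousands the penalty drives every coefficient toward zero, which is exactly what the summary above shows. Standardizing the inputs before fitting is the usual fix:

In [ ]:
# Sketch: refit after standardizing balance and income to zero mean, unit sd
scaled = default.copy()
for col in ['balance', 'income']:
    scaled[col] = (default[col] - np.mean(default[col])) / np.std(default[col])
scaled.logit_regression('Default', ['balance', 'income']).summary()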

In [38]:
default.logit_regression('Default', ['balance', 'income', 'Student']).summary()


Out[38]:
Param Coefficient
Intercept -1.9418e-06
balance 0.000407584
income -0.000125882
Student -2.51031e-06

Using kNN regression for the classifier


In [39]:
default_knn_BI = default.knn_regression('Default', ['balance', 'income'], n_neighbors=5)

default.plot_cut_2d('Default', 'balance', 'income', default_knn_BI.model, levels=[0,1])
default.classification_error_model('Default', default_knn_BI.model, ['balance', 'income'])


Out[39]:
0.0963

In [40]:
default.where('default', 'Yes').pivot_scatter('default',  'balance', 'income')


The book claims to have been able to do some classification with balance and income, but it is very hard to find a cut that pulls out the defaults without incurring many false positives.


In [41]:
default.where('default', 'No').pivot_scatter('default',  'balance', 'income')



In [42]:
default_sample = default.sample(1000)

In [43]:
default_sample.where('default', 'Yes').num_rows, default_sample.num_rows


Out[43]:
(41, 1000)

In [44]:
ds_lr = default_sample.logit_regression('Default', ['income', 'balance'])
default_sample.plot_cut_2d('Default', 'income', 'balance', ds_lr.likelihood, levels=[0,1])


Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x117db5320>

In [45]:
ds_lr.summary()


Out[45]:
Param Coefficient
Intercept -2.29916e-06
income -0.000136097
balance 0.00075572

In [46]:
default_student = default.logit_regression('Default', 'Student')

In [47]:
default_student.summary()


Out[47]:
Param Coefficient
Intercept -3.48496
Student 0.382569

In [48]:
default_student.likelihood(0), default_student.likelihood(1)


Out[48]:
(0.029743137630008021, 0.043008638239327428)

In [49]:
default_BIS = default.logit_regression('Default', ['balance', 'income', 'Student'])
default_BIS.summary()


Out[49]:
Param Coefficient
Intercept -1.9418e-06
balance 0.000407584
income -0.000125882
Student -2.51031e-06

In [50]:
default.summary()


Out[50]:
statistic default student balance income Default Student
min No No 0 771.968 0 0
FirstQu 481.731 21340.5 0 0
median 823.637 34552.6 0 0
mean 835.375 33517 0.0333 0.2944
ThirdQu 0.42466 13431.8 0 0
max Yes Yes 2654.32 73554.2 1 1

In [51]:
# Default rate as a function of balance, for students and non-students
default_rates_stu = default.where('student', 'Yes').density('default', 'balance', bins=np.arange(0, 3000, 100)).drop('No')
default_rates_stu.relabel('Yes', 'Student')
default_rates_no = default.where('student', 'No').density('default', 'balance', bins=np.arange(0, 3000, 100)).drop('No')
default_rates_no.relabel('Yes', 'Non-Student')
default_rates = default_rates_stu.join('balance', default_rates_no)
default_rates.plot('balance')
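
This plot shows the confounding in ISLR's student example: at any given balance, students are less likely to default, even though their marginal default rate is higher (0.043 vs 0.030 above), because students tend to carry higher balances. A quick check of the marginal rates:

In [ ]:
# Marginal default rate for students vs non-students
for s in ['Yes', 'No']:
    grp = default.where('student', s)
    print(s, grp.where('Default', 1).num_rows / grp.num_rows)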



In [52]:
raw_credit = ML_Table.read_table("data/Credit.csv")
credit = raw_credit.drop('Unnamed: 0')
# Strip stray whitespace from the Gender labels
credit['Gender'] = credit.apply(lambda x: x.strip(), 'Gender')
credit


Out[52]:
Income Limit Rating Cards Age Education Gender Student Married Ethnicity Balance
14.891 3,606 283 2 34 11 Male No Yes Caucasian 333
106.025 6,645 483 3 82 15 Female Yes Yes Asian 903
104.593 7,075 514 4 71 11 Male No No Asian 580
148.924 9,504 681 3 36 11 Female No No Asian 964
55.882 4,897 357 2 68 16 Male No Yes Caucasian 331
80.18 8,047 569 4 77 10 Male No No Caucasian 1,151
20.996 3,388 259 2 37 12 Female No No African American 203
71.408 7,114 512 2 87 9 Male No No Asian 872
15.125 3,300 266 5 66 13 Female No No Caucasian 279
71.061 6,819 491 3 41 19 Female Yes Yes African American 1,350

... (390 rows omitted)


In [53]:
credit.where('Gender', 'Female').num_rows


Out[53]:
207

In [54]:
credit.where('Gender', 'Male').num_rows


Out[54]:
193

In [55]:
credit['Female'] = credit.apply(lambda x: 1 if x=='Female' else 0, 'Gender')

In [56]:
credit.Cor()


Out[56]:
Param Income Limit Rating Cards Age Education Balance Female
Income 1 0.792088 0.791378 -0.0182726 0.175338 -0.027692 0.463656 -0.0107375
Limit 0.792088 1 0.99688 0.0102313 0.100888 -0.0235485 0.861697 0.00939668
Rating 0.791378 0.99688 1 0.053239 0.103165 -0.0301356 0.863625 0.00888459
Cards -0.0182726 0.0102313 0.053239 1 0.0429483 -0.0510842 0.0864563 -0.022658
Age 0.175338 0.100888 0.103165 0.0429483 1 0.00361928 0.00183512 0.0040155
Education -0.027692 -0.0235485 -0.0301356 -0.0510842 0.00361928 1 -0.00806158 -0.00504907
Balance 0.463656 0.861697 0.863625 0.0864563 0.00183512 -0.00806158 1 0.021474
Female -0.0107375 0.00939668 0.00888459 -0.022658 0.0040155 -0.00504907 0.021474 1

In [57]:
credit.pivot_scatter('Student', 'Balance', select=['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education'])



In [58]:
credit['Student Class'] = credit.apply(lambda x: 1 if x=='Yes' else 0, 'Student')
credit


Out[58]:
Income Limit Rating Cards Age Education Gender Student Married Ethnicity Balance Female Student Class
14.891 3,606 283 2 34 11 Male No Yes Caucasian 333 0 0
106.025 6,645 483 3 82 15 Female Yes Yes Asian 903 1 1
104.593 7,075 514 4 71 11 Male No No Asian 580 0 0
148.924 9,504 681 3 36 11 Female No No Asian 964 1 0
55.882 4,897 357 2 68 16 Male No Yes Caucasian 331 0 0
80.18 8,047 569 4 77 10 Male No No Caucasian 1,151 0 0
20.996 3,388 259 2 37 12 Female No No African American 203 1 0
71.408 7,114 512 2 87 9 Male No No Asian 872 0 0
15.125 3,300 266 5 66 13 Female No No Caucasian 279 1 0
71.061 6,819 491 3 41 19 Female Yes Yes African American 1,350 1 1

... (390 rows omitted)


In [59]:
cr = credit.logit_regression('Student Class', ['Balance', 'Limit'])

credit.plot_cut_2d('Student', 'Balance', 'Limit', cr.model, levels=[0,1])
print('error', credit.classification_error_model('Student Class', cr.model, ['Balance', 'Limit']))
print('density', credit.where('Student Class').num_rows/credit.num_rows)


error 0.0575
density 0.1

In [60]:
credit.plot_cut_2d('Student', 'Balance', 'Limit', cr.likelihood)


Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x118a6d5c0>

Linear discriminant analysis
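
LDA models the inputs of each class as Gaussian with a shared covariance matrix $\Sigma$ and classifies with Bayes' rule, assigning $x$ to the class $k$ with the largest discriminant

$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$

Because $\delta_k$ is linear in $x$, the resulting boundary is again a line.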


In [61]:
lda = credit.LDA('Student Class', ['Balance', 'Limit'])

credit.plot_cut_2d('Student', 'Balance', 'Limit', lda.model, levels=[0,1])
print('error', credit.classification_error_model('Student Class', lda.model, ['Balance', 'Limit']))


error 0.0575

In [62]:
default_lda_BI = default.LDA('Default', ['balance', 'income'])

default.plot_cut_2d('Default', 'balance', 'income', default_lda_BI.model, levels=[0,1])
default.classification_error_model('Default', default_lda_BI.model, ['balance', 'income'])


Out[62]:
0.0276

In [ ]: