Chapter 4 Classification

Concepts and data from "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013), used with permission from the authors G. James, D. Witten, T. Hastie and R. Tibshirani, available at www.StatLearning.com.

For Tables reference see http://data8.org/datascience/tables.html


In [63]:
# HIDDEN
# This useful nonsense should just go at the top of your notebook.
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
import numpy as np
from sklearn import linear_model
plots.style.use('fivethirtyeight')
plots.rc('lines', linewidth=1, color='r')
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
# datascience version number of last run of this notebook
version.__version__


import sys
sys.path.append("..")
from ml_table import ML_Table

import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')


Out[63]:
'en_US.UTF-8'

1-dimensional simulation of classification by logistic regression

In logistic regression, we use the logistic function,

$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}.$

We seek estimates for $\beta_0$ and $\beta_1$ such that the predicted probability $\hat{p}(x_i)$ for each input corresponds as closely as possible to the class observed for that input.

The estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to maximize the likelihood function.
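
There is no closed form for these estimates; they are found numerically. As a minimal sketch in plain numpy (not part of the ML_Table API used below), the logistic function and the log-likelihood being maximized are:


In [ ]:
# Sketch: the logistic function and the log-likelihood that the fit maximizes
def logistic(x, b0, b1):
    # p(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
    return 1 / (1 + np.exp(-(b0 + b1*x)))

def log_likelihood(b0, b1, x, y):
    # log p for the points labeled 1, log(1 - p) for the points labeled 0
    p = logistic(x, b0, b1)
    return np.sum(y*np.log(p) + (1 - y)*np.log(1 - p))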


In [64]:
# Simulation of a process that produces a 1d training set
n1 = 100
eps1 = 0.2
test1 = ML_Table.runiform('ix', n1)
# Categories
test1['Cat'] = test1.apply(lambda x: 'A' if x < 0.2 else 'B', 'ix')
# Noise in the relationship of input to category
test1['x'] = test1['ix'] + eps1*np.random.normal(size=n1)
test1['Class'] = test1.apply(lambda x: 0 if x == 'A' else 1, 'Cat')
test1 = test1.drop('ix')
test1.scatter('x', 'Class')


We could look at the distribution of each category over the input, but that doesn't capture what we actually want: the likelihood of each category, given the input.


In [65]:
test1.pivot_hist('Cat', 'x')


This would be the density of each category over x.


In [66]:
cc = test1.density('Cat', 'x').scatter('x')


Logistic regression to predict 'Class', given input 'x'


In [67]:
logit1d = test1.logit_regression('Class', 'x')

In [68]:
# Visualize the accuracy of the classifier on the training set (training error)
test1.plot_fit_1d('Class', 'x', logit1d.model)


Out[68]:
<matplotlib.axes._subplots.AxesSubplot at 0x119d755f8>

In [69]:
test1.classification_error_model('Class', logit1d.model, 'x')


Out[69]:
0.05
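
Presumably (an assumption about ML_Table's internals) this error is the fraction of training points whose predicted class differs from the label. A hand check under that assumption, treating logit1d.model as a map from x to a 0/1 class as the plot above suggests:


In [ ]:
# Hand check of the training error (assumes the model returns the 0/1 class)
preds = test1.apply(logit1d.model, 'x')
np.mean(preds != test1['Class'])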

In [70]:
test1.plot_fit_1d('Class', 'x', logit1d.likelihood)


Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x11aacd9e8>

In [71]:
logit1d.params


Out[71]:
(-0.9462124581477489, array([ 3.84156702]))

In [72]:
# Compute the cutting "plane", i.e., the decision point for the classifier.
# The log-odds b0 + b1*x are zero, and hence p(x) = 0.5, at x = -b0/b1
p50 = -logit1d.params[0]/logit1d.params[1][0]
p50


Out[72]:
0.24630898125576017

In [73]:
logit1d.likelihood(-1), logit1d.likelihood(p50), logit1d.likelihood(1)


Out[73]:
(0.0082621048504563765, 0.5, 0.94761631780288358)

2D logistic regression of simulated data

In 2D, logistic regression finds a line that splits the plane.
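
The cut line is where the fitted log-odds vanish: with intercept $b_0$ and slopes $b_1, b_2$, the boundary satisfies $b_0 + b_1 x + b_2 y = 0$, i.e., $y = -(b_0 + b_1 x)/b_2$. A sketch (coefficient names assumed, mirroring the 1D params tuple above):


In [ ]:
# Sketch: recover the 2D cut line from fitted coefficients (names assumed)
def boundary_y(x, b0, b1, b2):
    # the points where the log-odds are zero, and hence p = 0.5
    return -(b0 + b1*x) / b2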


In [12]:
n = 200
eps = 0.1
test2 = ML_Table.runiform('ix', n)
test2['iy'] = np.random.rand(n)
test2['Cat'] = test2.apply(lambda x, y: 'A' if x+y <0 else 'B', ['ix', 'iy'])
test2['Class A'] = test2.apply(lambda x: 1 if x=='A' else 0, 'Cat')
test2['x'] = test2['ix'] + eps*np.random.normal(size=n)
test2['y'] = test2['iy'] + eps*np.random.normal(size=n)

In [13]:
test2.pivot_scatter('Cat', 'x', 'y')



In [14]:
logit2d = test2.logit_regression('Class A', ['x', 'y'])
model_2d = logit2d.model

In [15]:
test2.plot_fit_2d('Class A', 'x', 'y', model_2d)


Out[15]:
<matplotlib.axes._subplots.Axes3DSubplot at 0x114e5fa20>

In [74]:
test2.plot_cut_2d('Class A', 'x', 'y', model_2d, n_grid=50)


Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x11aa7d630>

In [17]:
# error rate
test2.classification_error_model('Class A', model_2d, ['x', 'y'])


Out[17]:
0.05

In [18]:
test2.plot_cut_2d('Class A', 'x', 'y', logit2d.likelihood, n_grid=50)


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x11631c0f0>

In [19]:
knn_reg = test2.knn_regression('Class A', ['x', 'y'], n_neighbors=3)
test2.plot_cut_2d('Class A', 'x', 'y', knn_reg.model, n_grid=50, levels=[0,1])
test2.classification_error_model('Class A', knn_reg.model, ['x', 'y'])


Out[19]:
0.085
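
For comparison, the same classifier built directly with scikit-learn (a sketch: k-NN regression thresholded at 0.5, which recovers the majority vote of the 3 neighbors):


In [ ]:
# Sketch in plain scikit-learn: k-NN regression thresholded at 0.5
from sklearn.neighbors import KNeighborsRegressor
X = np.column_stack([test2['x'], test2['y']])
y = test2['Class A']
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
pred = (knn.predict(X) >= 0.5).astype(int)
np.mean(pred != y)   # training error, should match the 0.085 above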


Classification with logistic regression


In [20]:
raw_default = ML_Table.read_table("data/Default.csv")
raw_default


Out[20]:
Unnamed: 0 default student balance income
1 No No 729.526 44361.6
2 No Yes 817.18 12106.1
3 No No 1073.55 31767.1
4 No No 529.251 35704.5
5 No No 785.656 38463.5
6 No Yes 919.589 7491.56
7 No No 825.513 24905.2
8 No Yes 808.668 17600.5
9 No No 1161.06 37468.5
10 No No 0 29275.3

... (9990 rows omitted)


In [21]:
default = raw_default.drop('Unnamed: 0')
default['Default'] = np.where(default['default']=='Yes', 1, 0)
default['Student'] = np.where(default['student']=='Yes', 1, 0)
default


Out[21]:
default student balance income Default Student
No No 729.526 44361.6 0 0
No Yes 817.18 12106.1 0 1
No No 1073.55 31767.1 0 0
No No 529.251 35704.5 0 0
No No 785.656 38463.5 0 0
No Yes 919.589 7491.56 0 1
No No 825.513 24905.2 0 0
No Yes 808.668 17600.5 0 1
No No 1161.06 37468.5 0 0
No No 0 29275.3 0 0

... (9990 rows omitted)


In [22]:
# Look at the trend in the data
default.density('default', 'balance').scatter('balance')



In [23]:
# Look at the trend in the data
default.density('default', 'income').scatter('income')



In [24]:
# Predict default based on balance
default_balance = default.logit_regression('Default', 'balance')

In [25]:
default_balance.summary()


Out[25]:
Param Coefficient
Intercept -9.46507
balance 0.00478248
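
Plugging a balance of 1000 into the fitted log-odds gives $-9.46507 + 0.00478248 \times 1000 = -4.68259$, so $\hat{p} = e^{-4.68259}/(1 + e^{-4.68259}) \approx 0.009$; at a balance of 2000 the log-odds are $0.09989$ and $\hat{p} \approx 0.525$. Both values are confirmed below.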

In [26]:
# Default when balance gets too high
default.plot_fit_1d('Default', 'balance', default_balance.model, connect=False)


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x116d2d470>

In [27]:
default.classification_error_model('Default', default_balance.model, 'balance')


Out[27]:
0.0274

In [28]:
# How impressive is this error rate? Compare against the base rate: a trivial
# classifier that always predicts 'no default' errs at the fraction of defaults
default.where('Default').num_rows/default.num_rows


Out[28]:
0.0333

In [29]:
default.plot_fit_1d('Default', 'balance', default_balance.likelihood, connect=False)


Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x117812e10>

In [30]:
default_balance.likelihood(1000), default_balance.likelihood(2000)


Out[30]:
(0.0091701599811269373, 0.52495160381519868)

In [31]:
default_balance.obj.decision_function([[1000], [2000]])


Out[31]:
array([-4.68258808,  0.09988939])
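
The decision_function values are the log-odds, so applying the logistic function to them recovers the likelihoods computed above:


In [ ]:
# The logistic of the log-odds reproduces the fitted probabilities
from scipy.special import expit
expit([-4.68258808, 0.09988939])   # ~ (0.00917, 0.52495), as above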

In [32]:
default.pivot_scatter('Default', 'income', 'balance')



In [33]:
# Predict default based on balance and income
default_BI = default.logit_regression('Default', ['balance', 'income'])

In [34]:
default_BI.summary()


Out[34]:
Param Coefficient
Intercept -1.94164e-06
balance 0.000407565
income -0.000125881

In [35]:
default_BI.obj.decision_function([[10000, 1000]])


Out[35]:
array([ 3.94976362])

In [36]:
# This classifier does not seem to work at all on this data; the near-zero
# intercept in the summary above suggests the unscaled features distort the fit
default.plot_cut_2d('Default', 'balance','income', default_BI.model, levels=[0,1])


Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x11783fa20>

In [37]:
# Basically discards all the defaults: the error rate below is essentially
# the base rate of defaults computed earlier
default.classification_error_model('Default', default_BI.model, ['balance', 'income'])


Out[37]:
0.0336
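
One plausible fix (a sketch in plain scikit-learn, outside the ML_Table API; whether ML_Table scales or regularizes underneath is an assumption here) is to standardize the features before fitting:


In [ ]:
# Sketch: standardize balance and income, then refit the logistic model
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
X = np.column_stack([default['balance'], default['income']])
y = default['Default']
clf = LogisticRegression().fit(StandardScaler().fit_transform(X), y)
clf.intercept_, clf.coef_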

In [38]:
default.logit_regression('Default', ['balance', 'income', 'Student']).summary()


Out[38]:
Param Coefficient
Intercept -1.9418e-06
balance 0.000407584
income -0.000125882
Student -2.51031e-06

Using knn regression for the classifier


In [39]:
default_knn_BI = default.knn_regression('Default', ['balance', 'income'], n_neighbors=5)

default.plot_cut_2d('Default', 'balance', 'income', default_knn_BI.model, levels=[0,1])
default.classification_error_model('Default', default_knn_BI.model, ['balance', 'income'])


Out[39]:
0.0963

In [40]:
default.where('default', 'Yes').pivot_scatter('default',  'balance', 'income')


The book reports some success classifying with balance and income, but it is very hard to find a cut that pulls out the defaults without incurring many false positives.
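
One way to quantify that trade-off (a sketch, assuming the k-NN model above takes (balance, income) arguments and returns a value in [0, 1]):


In [ ]:
# Sketch: false positives vs. false negatives for the k-NN classifier
preds = np.round(default.apply(default_knn_BI.model, ['balance', 'income']))
actual = default['Default']
np.sum((preds == 1) & (actual == 0)), np.sum((preds == 0) & (actual == 1))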


In [41]:
default.where('default', 'No').pivot_scatter('default',  'balance', 'income')