Introduction to Machine Learning

Andreas Müller and Sarah Guido (2017), O'Reilly

Ch. 2 Supervised Learning

Linear Models for Classification

  • Decision boundary is a linear function of the input
  • A binary linear classifier separates two classes using a line, plane, or hyperplane (see the sketch below)
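
A minimal sketch (my own illustration, not a cell from the book) of what "linear" means here: the model predicts class 1 when w[0]*x[0] + ... + w[p-1]*x[p-1] + b > 0 and class 0 otherwise. The make_blobs data and variable names are assumptions for illustration only.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# two-class toy data (illustrative assumption)
X_demo, y_demo = make_blobs(n_samples=50, centers=2, random_state=0)
clf_demo = LogisticRegression().fit(X_demo, y_demo)

w, b = clf_demo.coef_[0], clf_demo.intercept_[0]      # learned coefficients and intercept
manual_pred = (X_demo @ w + b > 0).astype(int)        # thresholded linear function of the input
print(np.array_equal(manual_pred, clf_demo.predict(X_demo)))  # True: matches predict()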

Algorithms for learning linear models differ in the following two ways:

  • The way they measure how well a particular combination of coefficients and intercept fits the training data
  • Whether they use regularization, and if so, what kind

Logistic Regression

  • Implemented in linear_model.LogisticRegression

Linear Support Vector Machines (linear SVMs)

  • Implemented in svm.LinearSVC

Import packages

  • Load forge dataset and assign variables

In [1]:
import sklearn
import mglearn

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

In [3]:
X, y = mglearn.datasets.make_forge()

In [10]:
fig, axes = plt.subplots(1, 2, figsize=(10,3))

for model, ax in zip([LinearSVC(), LogisticRegression()], axes):
    clf = model.fit(X, y)
    mglearn.plots.plot_2d_separator(clf, X, fill=False, eps=0.5,
                                   ax=ax, alpha=0.7)
    mglearn.discrete_scatter(X[:,0], X[:,1], y, ax=ax)
    ax.set_title("{}".format(clf.__class__.__name__))
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")
axes[0].legend(loc=4)


Out[10]:
<matplotlib.legend.Legend at 0x11de218d0>

Figure 1. Decision boundaries of linear SVM and logistic regression on the forge dataset with default parameters

  • Any new data point above the decision boundary is classified as class 1; any point below is classified as class 0 (see the small prediction sketch below)
  • Both models apply L2 regularization by default, in the same way as Ridge regression
  • For both models, C is the trade-off parameter that determines the strength of regularization
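
A small sketch (my own addition, not a cell from the book): refit a logistic regression on the forge data and classify a made-up new point; the predicted label and class probabilities show which side of the decision boundary it falls on. The point's coordinates are arbitrary.

logreg_forge = LogisticRegression().fit(X, y)    # X, y are the forge data loaded above
new_point = np.array([[9.5, 1.5]])               # hypothetical (feature 0, feature 1) values
print(logreg_forge.predict(new_point))           # predicted class label (0 or 1)
print(logreg_forge.predict_proba(new_point))     # class membership probabilities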

Regularization Parameter C

High values of C:

  • Correspond to less regularization; the models will fit the training set as well as possible
  • Stress the importance of classifying each individual data point correctly

Low values of C:

  • Models put more emphasis on finding a coefficient vector (w) that is close to zero
  • Cause the models to try to adjust to the 'majority' of data points (see the accuracy sketch after this list)
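
As a rough numerical counterpart to the plot below (my own addition; the exact accuracies are not claimed from the book), fitting LinearSVC on the forge data with different values of C shows how C controls the fit to the training set:

for C_value in [0.01, 1, 100]:
    # max_iter raised only to help LinearSVC converge on this tiny dataset (an assumption)
    linear_svm = LinearSVC(C=C_value, max_iter=10000).fit(X, y)
    print("C={:>6}: training accuracy {:.2f}".format(C_value, linear_svm.score(X, y)))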

In [11]:
mglearn.plots.plot_linear_svc_regularization()


Figure 2. Decision boundaries of a linear SVM for three different settings of C

  • Left panel: Small value of C gives lots of regularization; two points are misclassified
  • Middle panel: Moderate C; the model focuses more on the misclassified points, tilting the decision boundary
  • Right panel: High C
    • The model tries to classify all points correctly with a straight line
    • The decision boundary is tilted a lot; all points in class 0 are now correctly classified
    • But it may not capture the overall layout of the classes well; the model is likely OVERFITTING!

Linear Models for Classification in Low- and High-Dimensional Spaces

Low Dimensional spaces

  • A linear model for classification may seem restrictive in low-dimensional spaces
  • It only allows decision boundaries that are straight lines or planes

High Dimensional spaces

  • Linear models for classification become very powerful
  • Guarding against OVERFITTING becomes important when considering many features

Example: WI Breast Cancer Study

  • Load WI Breast Cancer study data
  • Show features
  • Visualize dataset

In [19]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

In [20]:
print(cancer.keys())


dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

In [21]:
print(cancer['target_names'])


['malignant' 'benign']

In [22]:
print(cancer['feature_names'])


['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']

In [25]:
type(cancer)


Out[25]:
sklearn.datasets.base.Bunch

In [26]:
cancer.data.shape


Out[26]:
(569, 30)
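
A small sketch (my own addition, mirroring the kind of class-count summary the book prints): how many samples fall into each of the two classes.

# counts per class (np.bincount counts occurrences of 0 and 1 in cancer.target)
print({n: v for n, v in zip(cancer.target_names, np.bincount(cancer.target))})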

In [28]:
# note: X_train is created by the train_test_split cell further below (In [16]);
# in the notebook this cell was executed after that split
cancer_df = pd.DataFrame(X_train, columns=cancer.feature_names)
cancer_df.head()


Out[28]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 19.89 20.26 130.50 1214.0 0.10370 0.13100 0.1411 0.09431 0.1802 0.06188 ... 23.73 25.23 160.5 1646.0 0.14170 0.3309 0.4185 0.16130 0.2549 0.09136
1 12.89 13.12 81.89 515.9 0.06955 0.03729 0.0226 0.01171 0.1337 0.05581 ... 13.62 15.54 87.4 577.0 0.09616 0.1147 0.1186 0.05366 0.2309 0.06915
2 17.14 16.40 116.00 912.7 0.11860 0.22760 0.2229 0.14010 0.3040 0.07413 ... 22.25 21.40 152.4 1461.0 0.15450 0.3949 0.3853 0.25500 0.4066 0.10590
3 17.30 17.08 113.00 928.2 0.10080 0.10410 0.1266 0.08353 0.1813 0.05613 ... 19.85 25.09 130.9 1222.0 0.14160 0.2405 0.3378 0.18570 0.3138 0.08113
4 22.01 21.90 147.20 1482.0 0.10630 0.19540 0.2448 0.15010 0.1824 0.06140 ... 27.66 25.80 195.0 2227.0 0.12940 0.3885 0.4756 0.24320 0.2741 0.08574

5 rows × 30 columns

Logistic Regression: Cancer Data

  • Split data into TRAIN and TEST sets
  • Fit the model on Training data
  • Evaluate model on Test data

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

In [17]:
logreg = LogisticRegression().fit(X_train, y_train)

In [18]:
print("Training set score: {:.3f}".format(logreg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg.score(X_test, y_test)))


Training set score: 0.953
Test set score: 0.958

Regularization Parameter settings

  • The default setting C=1 provides good performance on both the training and test sets
  • Training and test performance are very close, so the model is very likely UNDERFITTING

Use higher value of C to fit more 'flexible' model

  • C=100 gives higher training set accuracy and slightly higher test set accuracy
  • More complex model (flexible) performs better

In [31]:
logreg100 = LogisticRegression(C=100).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logreg100.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg100.score(X_test, y_test)))


Training set score: 0.967
Test set score: 0.965

Use lower value of C to fit more 'regularized' model

  • Setting C=0.01 leads the model to try to adjust to the 'majority' of data points
  • Model accuracy decreases on both the training and test sets

In [32]:
logreg001 = LogisticRegression(C=0.01).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logreg001.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg001.score(X_test, y_test)))


Training set score: 0.934
Test set score: 0.930

Plot Coefficients of Logistic Regression for different values of C

  • By default LogisticRegression applies L2 regularization similar to Ridge regression
  • Stronger regularization pushes coefficients closer to zero
  • The value of the regularization parameter C strongly influences the coefficient values

In [35]:
plt.plot(logreg.coef_.T, 'o', label="C=1")
plt.plot(logreg100.coef_.T, '^', label="C=100")
plt.plot(logreg001.coef_.T, 'v', label="C=0.01")

plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
plt.hlines(0,0, cancer.data.shape[1])
plt.ylim(-5, 5)

plt.xlabel("Coefficient Index")
plt.ylabel("Coefficient Magnitude")
plt.legend()


Out[35]:
<matplotlib.legend.Legend at 0x11e72fc50>

Creating More Interpretable Model: L1 Regularization (Lasso)

  • L1 regularization (Lasso) drives the coefficients of most features to exactly zero
  • The model is limited to using only a few features, making it more interpretable

In [37]:
for C, marker in zip([0.01, 1, 100], ['v', 'o', '^']):
    # solver="liblinear" supports the L1 penalty (the default solver in newer scikit-learn does not)
    lr_l1 = LogisticRegression(C=C, penalty="l1", solver="liblinear").fit(X_train, y_train)
    print("Training accuracy of L1 logreg with C={:.3f}: {:.2f}".format(
        C, lr_l1.score(X_train, y_train)))
    print("Test accuracy of L1 logreg with C={:.3f}: {:.2f}".format(
        C, lr_l1.score(X_test, y_test)))
    plt.plot(lr_l1.coef_.T, marker, label="C={:.3f}".format(C))
    
plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
plt.hlines(0,0, cancer.data.shape[1])
plt.xlabel("Coefficient Index")
plt.ylabel("Coefficient Magnitude")

plt.ylim(-5, 5)
plt.legend()


Training accuracy of L1 logreg with C=0.010: 0.92
Test accuracy of L1 logreg with C=0.010: 0.93
Training accuracy of L1 logreg with C=1.000: 0.96
Test accuracy of L1 logreg with C=1.000: 0.96
Training accuracy of L1 logreg with C=100.000: 0.99
Test accuracy of L1 logreg with C=100.000: 0.98
Out[37]:
<matplotlib.legend.Legend at 0x11e8f0438>

Penalty Parameter and Linear Classification Models

  • The main difference between linear models for classification is the penalty parameter
  • The L2 (Ridge) penalty uses all available features; stronger regularization (smaller C) pushes the coefficients toward zero
  • The L1 (Lasso) penalty sets the coefficients of most features to zero, so the model uses only a subset of the features
    • Improved interpretability with the L1 penalty (Lasso); see the quick check sketched below
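
A minimal check (my own addition, not a cell from the book) that makes the difference concrete: count the nonzero coefficients under each penalty at the same C, assuming the liblinear solver, which supports both penalties.

lr_l2 = LogisticRegression(penalty="l2", C=1, solver="liblinear").fit(X_train, y_train)
lr_l1 = LogisticRegression(penalty="l1", C=1, solver="liblinear").fit(X_train, y_train)
print("Nonzero coefficients with L2:", np.sum(lr_l2.coef_ != 0))   # typically all 30 features
print("Nonzero coefficients with L1:", np.sum(lr_l1.coef_ != 0))   # typically only a subset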
