In [4]:
import sklearn
import mglearn

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

from IPython.display import display

Introduction to Machine Learning

Andreas Mueller and Sarah Guido (2017) O'Reilly

Ch. 2 Supervised Learning

Neural Networks (Deep Learning)

MLP feedforward neural network

  • Generalization of linear models for classification and regression
  • Prediction by a linear regressor is given as: y_hat = w[0]*x[0] + w[1]*x[1] + ... + w[p]*x[p] + b (see the sketch below)
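
A minimal sketch of this weighted sum in NumPy (the arrays w, x and the intercept b below are hypothetical placeholders, not from the book's code):

In [ ]:
# Hypothetical coefficients, input features, and intercept for a linear regressor
w = np.array([0.5, -1.2, 0.3, 2.0])   # w[0]..w[p]
x = np.array([1.0, 0.2, -0.7, 0.4])   # x[0]..x[p]
b = 0.1                               # intercept

# y_hat = w[0]*x[0] + w[1]*x[1] + ... + w[p]*x[p] + b
y_hat = np.dot(w, x) + b
print(y_hat)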

Visualization of logistic regression

  • Input features and predictions are shown as nodes
  • Coefficients are connections between the nodes

In [6]:
display(mglearn.plots.plot_logistic_regression_graph())


[Graph: logistic regression shown as a network, with inputs x[0]..x[3] connected to the output y by weights w[0]..w[3]]

MLP feedforward neural network

  • Process of computing weighted sums is repeated multiple times
  • Hidden units are computed first, then combined to yield the final result

Non-linear function

  • After computing a weighted sum for each hidden unit, a non-linear function is applied to the result
  • Usually the rectifying nonlinearity (i.e., rectified linear unit, or relu) or the tangens hyperbolicus (tanh)
  • The result of this function is then used in the weighted sum that computes the output, or target, y_hat

Either non-linear function allows the neural network to learn much more complicated functions than a linear model could (both are sketched below)
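
A quick way to see the two nonlinearities side by side (a minimal sketch, not one of the executed cells; np and plt come from the imports at the top):

In [ ]:
line = np.linspace(-3, 3, 100)
# tanh saturates at -1 and +1; relu sets all negative inputs to zero
plt.plot(line, np.tanh(line), label="tanh")
plt.plot(line, np.maximum(line, 0), label="relu")
plt.legend(loc="best")
plt.xlabel("x")
plt.ylabel("relu(x), tanh(x)")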


In [3]:
display(mglearn.plots.plot_single_hidden_layer_graph())


[Graph: MLP with a single hidden layer; inputs x[0]..x[3] connect to hidden units h[0]..h[2], which connect to the output y]

Parameter: # Nodes in hidden layer

  • Number of nodes in the hidden layer needs to be set by the user
  • As small as 10 for a simple dataset, or as high as 10,000 for a complex dataset
  • Can also add additional hidden layers (see the configuration sketch below)
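
For illustration, a configuration sketch (not an executed cell): hidden_layer_sizes controls both the number of hidden layers and the number of units per layer.

In [ ]:
from sklearn.neural_network import MLPClassifier

# One hidden layer with 10 units
mlp_small = MLPClassifier(hidden_layer_sizes=[10])
# Two hidden layers with 100 units each
mlp_deeper = MLPClassifier(hidden_layer_sizes=[100, 100])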

Plot: MLP with two hidden layers

  • Having large neural networks, made up of many layers of computation and hidden units, inspired the term 'deep learning'

In [8]:
display(mglearn.plots.plot_two_hidden_layer_graph())


[Graph: MLP with two hidden layers; inputs x[0]..x[3] connect to hidden units h1[0]..h1[2], which connect to h2[0]..h2[2], which connect to the output y]

Tuning Neural Networks

  • By default, the MLP uses a single hidden layer with 100 nodes, which is a lot for a small dataset (see the default fit sketched below)
  • With only 10 hidden units, the decision boundary looks more ragged, as in the cell that follows
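
A sketch of the default fit on the two_moons dataset, mirroring the book's first example (not one of the executed cells here):

In [ ]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

# Default architecture: a single hidden layer with 100 units
mlp = MLPClassifier(solver='lbfgs', random_state=0).fit(X_train, y_train)
mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=0.3)
mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")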

In [13]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, 
                                                    random_state=42)

mlp = MLPClassifier(solver='lbfgs', random_state=0,
                    hidden_layer_sizes=[10])
mlp.fit(X_train, y_train)
mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=0.3)
mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")


Out[13]:
<matplotlib.text.Text at 0x118cd1908>

MLP with two layers for smoother boundary

  • Can add more hidden units, add a second hidden layer, or use the tanh nonlinearity
  • The cell below does the latter two: two hidden layers of 10 units each with tanh activation

In [15]:
mlp = MLPClassifier(solver='lbfgs', activation='tanh', 
                    random_state=0, hidden_layer_sizes=[10,10])
mlp.fit(X_train, y_train)
mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=0.3)
mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")


Out[15]:
<matplotlib.text.Text at 0x1156700b8>

L2 Penalty and Neural Network

  • Control the complexity of a neural network using an L2 penalty to shrink the weights toward zero, as with Ridge regression and linear classifiers
  • The alpha parameter in MLPClassifier controls this and is set to a low value by default (little regularization)
  • The plots below show the effect of different alpha values for networks with two hidden layers of 10 or 100 units each:

In [19]:
fig, axes = plt.subplots(2, 4, figsize=(20, 8))
for axx, n_hidden_nodes in zip(axes, [10, 100]):
    for ax, alpha in zip(axx, [0.0001, 0.01, 0.1, 1]):
        mlp = MLPClassifier(solver='lbfgs', random_state=0,
                           hidden_layer_sizes=[n_hidden_nodes, n_hidden_nodes],
                           alpha=alpha)
        mlp.fit(X_train, y_train)
        mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=0.3, ax=ax)
        ax.set_title("n_hidden=[{}, {}]\nalpha={:.4f}".format(n_hidden_nodes, 
                                                              n_hidden_nodes, alpha))


Neural Network Weights

  • Weights are set randomly before learning starts, and this random initialization affects the model that is learned
  • Even when using exactly the same parameters, we can get very different models with different random seeds (see the sketch below)
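
A sketch of this effect, following the book's corresponding figure (not an executed cell here): the same architecture is refit on the two_moons training data from above with eight different seeds.

In [ ]:
# Same parameters, different random initializations (random_state 0 through 7)
fig, axes = plt.subplots(2, 4, figsize=(20, 8))
for i, ax in enumerate(axes.ravel()):
    mlp = MLPClassifier(solver='lbfgs', random_state=i,
                        hidden_layer_sizes=[100, 100])
    mlp.fit(X_train, y_train)
    mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=0.3, ax=ax)
    mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train, ax=ax)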

Apply MLPClassifier to Breast Cancer Dataset

  • Start with default parameters

In [21]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

print("Cancer data per-feature maxima\n{}".format(cancer.data.max(axis=0)))


Cancer data per-feature maxima
[  2.81100000e+01   3.92800000e+01   1.88500000e+02   2.50100000e+03
   1.63400000e-01   3.45400000e-01   4.26800000e-01   2.01200000e-01
   3.04000000e-01   9.74400000e-02   2.87300000e+00   4.88500000e+00
   2.19800000e+01   5.42200000e+02   3.11300000e-02   1.35400000e-01
   3.96000000e-01   5.27900000e-02   7.89500000e-02   2.98400000e-02
   3.60400000e+01   4.95400000e+01   2.51200000e+02   4.25400000e+03
   2.22600000e-01   1.05800000e+00   1.25200000e+00   2.91000000e-01
   6.63800000e-01   2.07500000e-01]

In [28]:
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

mlp = MLPClassifier(random_state=42)
mlp.fit(X_train, y_train)

print("Accurary on Training set: {:.2f}".format(mlp.score(X_train, y_train)))
print("Accuracy Test set: {:.2f}".format(mlp.score(X_test, y_test)))


Accuracy on Training set: 0.90
Accuracy on Test set: 0.90

Rescale the data

  • Accuracy of the MLP is fairly good, but as with the SVC model, the scaling of the data is a problem
  • Rescale the data so each feature has mean=0 and stdev=1, using statistics computed on the training set
  • Rerun the MLP analysis on the rescaled data

In [31]:
# Compute mean value per feature on Training set
mean_on_train = X_train.mean(axis=0)

# Compute standard deviation of each feature on Training set
std_on_train = X_train.std(axis=0)

# Subtract the mean, and scale by inverse standard deviation
X_train_scaled = (X_train - mean_on_train) / std_on_train

# Do the same for the test set, using the mean and standard deviation of the training set
X_test_scaled = (X_test - mean_on_train) / std_on_train
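
The same standardization can be done with scikit-learn's StandardScaler; this equivalent alternative is only a sketch, not part of the original notebook:

In [ ]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only, then transform both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)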

In [34]:
mlp = MLPClassifier(random_state=0)
mlp.fit(X_train_scaled, y_train)

print("Accurary on Training set: {:.3f}".format(mlp.score(X_train_scaled, y_train)))
print("Accuracy Test set: {:.3f}".format(mlp.score(X_test_scaled, y_test)))


Accuracy on Training set: 0.998
Accuracy on Test set: 0.951
/Users/seanshiverick/anaconda/lib/python3.6/site-packages/sklearn/neural_network/multilayer_perceptron.py:563: ConvergenceWarning: Stochastic Optimizer: Maximum iterations reached and the optimization hasn't converged yet.
  % (), ConvergenceWarning)

Warning from model

  • Results are better after scaling, but the warning tells us the maximum number of iterations has been reached
  • The warning comes from the adam algorithm used to learn the model, and tells us we should increase the number of iterations
  • Increasing the number of iterations only improves training set performance, not generalization (see the sketch below)
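
A sketch of that intermediate step, raising max_iter while leaving alpha at its default (not an executed cell, so no scores are reproduced here):

In [ ]:
# Give the default adam solver more iterations to converge
mlp = MLPClassifier(max_iter=1000, random_state=0)
mlp.fit(X_train_scaled, y_train)

print("Accuracy on Training set: {:.3f}".format(mlp.score(X_train_scaled, y_train)))
print("Accuracy on Test set: {:.3f}".format(mlp.score(X_test_scaled, y_test)))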

Tuning Parameters: alpha

  • Decrease model complexity to get better generalization performance
  • Increase the alpha parameter quite aggressively (from the default of 0.0001 to 1.0)
  • This adds stronger regularization to the coefficient weights

In [35]:
mlp = MLPClassifier(max_iter=1000, alpha=1, random_state=0)
mlp.fit(X_train_scaled, y_train)

print("Accurary on Training set: {:.3f}".format(mlp.score(X_train_scaled, y_train)))
print("Accuracy Test set: {:.3f}".format(mlp.score(X_test_scaled, y_test)))


Accuracy on Training set: 0.991
Accuracy on Test set: 0.965

Analysis of Model

  • Analyzing a neural network is trickier than analyzing a linear model or a tree-based model
  • One way is to look at the weights (coefficients) the model has learned

Heatmap Plot

  • Shows weights learned connecting the input to the first hidden layer
  • Rows in the plot correspond to the 30 input features
  • Columns in plot correspond to the 100 hidden units

Light colors show large positive values, dark colors represent negative numbers


In [36]:
plt.figure(figsize=(20,5))
plt.imshow(mlp.coefs_[0], interpolation='none', cmap='viridis')
plt.yticks(range(30), cancer.feature_names)
plt.xlabel("Columns in weight matrix")
plt.ylabel("Input feature")
plt.colorbar()


Out[36]:
<matplotlib.colorbar.Colorbar at 0x1228292b0>

Interpretation of Figure

  • Features that have very small weights for all hidden units are 'less important' to the model
  • Could also visualize the weights connecting the hidden layer to the output layer, but those are even harder to interpret (a sketch follows this list)
  • MLPClassifier and MLPRegressor only capture a small subset of what is possible with neural networks
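
A sketch of that second heatmap (not an executed cell): mlp.coefs_[1] holds the weights connecting the 100 hidden units to the single output unit.

In [ ]:
# Hidden-to-output weights; shape is (n_hidden_units, n_outputs)
plt.figure(figsize=(20, 2))
plt.imshow(mlp.coefs_[1].T, interpolation='none', cmap='viridis')
plt.xlabel("Hidden unit")
plt.ylabel("Output")
plt.colorbar()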
