Basic Concepts of Machine Learning and Overview of Classic Machine Learning Strategies


In [0]:
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
%pylab inline

import matplotlib.pyplot as plt
plt.xkcd()


Populating the interactive namespace from numpy and matplotlib
Out[0]:
<contextlib._GeneratorContextManager at 0x7fe786775c18>

Loading and exploring our data set

This is a database of customers of an insurance company. Each data point is one customer. The group reflects the number of past accidents the customer has been involved in:

  • 0 - red: many accidents
  • 1 - green: few or no accidents
  • 2 - yellow: in the middle

In [0]:
!curl -O https://raw.githubusercontent.com/DJCordhose/deep-learning-crash-course-notebooks/master/data/insurance-customers-1500.csv


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 26783  100 26783    0     0   142k      0 --:--:-- --:--:-- --:--:--  142k

In [0]:
import pandas as pd
df = pd.read_csv('./insurance-customers-1500.csv', sep=';')

In [0]:
df.head()


Out[0]:
speed age miles group
0 98.0 44.0 25.0 1
1 118.0 54.0 24.0 1
2 111.0 26.0 34.0 0
3 97.0 25.0 10.0 2
4 114.0 38.0 22.0 1

In [0]:
df.describe()


Out[0]:
speed age miles group
count 1500.000000 1500.000000 1500.000000 1500.000000
mean 122.492667 44.980667 30.434000 0.998667
std 17.604333 17.130400 15.250815 0.816768
min 68.000000 16.000000 1.000000 0.000000
25% 108.000000 32.000000 18.000000 0.000000
50% 120.000000 42.000000 29.000000 1.000000
75% 137.000000 55.000000 42.000000 2.000000
max 166.000000 100.000000 84.000000 2.000000

A pairplot gives you a nice overview of your data with just a few lines of code


In [0]:
import seaborn as sns

sample_df = df.sample(n=120, random_state=42)
sns.pairplot(sample_df, 
             hue="group", palette={0: '#AA4444', 1: '#006000', 2: '#EEEE44'},
             kind='reg',
             diag_kind='kde', vars=['age', 'speed', 'miles'])


Out[0]:
<seaborn.axisgrid.PairGrid at 0x7fe765101630>

Concepts

First important concept: You train a machine with your data to make it learn the relationship between some input data and a certain label - this is called supervised learning


In [0]:
# we deliberately decide this is going to be our label, it is often called lower case y
y=df['group']

In [0]:
# since 'group' is now the label we want to predict, we need to remove it from the training data 
df.drop('group', axis='columns', inplace=True)

In [0]:
# input data is often named upper case X; the upper case indicates a matrix in which each row is one sample vector
X = df.values  # as_matrix() is deprecated and has been removed from newer pandas versions

We restrict ourselves to two dimensions for now

Because this is all we really can visualize in 2d


In [0]:
# ignore this, it is just technical code to plot decision boundaries
# Adapted from:
# http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
# http://jponttuset.cat/xkcd-deep-learning/

from matplotlib.colors import ListedColormap

cmap_print = ListedColormap(['#AA8888', '#004000', '#FFFFDD'])
cmap_bold = ListedColormap(['#AA4444', '#006000', '#EEEE44'])
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#FFFFDD'])
font_size=25
title_font_size=40

def meshGrid(x_data, y_data):
    h = 1  # step size in the mesh
    x_min, x_max = x_data.min() - 1, x_data.max() + 1
    y_min, y_max = y_data.min() - 1, y_data.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return (xx,yy)
    
def plotPrediction(clf, x_data, y_data, x_label, y_label, ground_truth, title="", 
                   mesh=True, fname=None, size=(15, 8)):
    xx,yy = meshGrid(x_data, y_data)
    fig, ax = plt.subplots(figsize=size)

    if clf and mesh:
        Z = clf.predict(np.c_[yy.ravel(), xx.ravel()])
        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        ax.pcolormesh(xx, yy, Z, cmap=cmap_light)
    
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.scatter(x_data, y_data, c=ground_truth, cmap=cmap_bold, s=100, marker='o', edgecolors='k')
        
    ax.set_xlabel(x_label, fontsize=font_size)
    ax.set_ylabel(y_label, fontsize=font_size)
    ax.set_title(title, fontsize=title_font_size)

def plot_keras_prediction(clf, x_data, y_data, x_label, y_label, ground_truth, title="", 
                          mesh=True, fixed=None, fname=None, size=(15, 8)):
    xx,yy = meshGrid(x_data, y_data)
    fig, ax = plt.subplots(figsize=size)

    if clf and mesh:
        grid_X = np.array(np.c_[yy.ravel(), xx.ravel()])
        if fixed is not None:  # also allow a fixed value of 0
            fill_values = np.full((len(grid_X), 1), fixed)
            grid_X = np.append(grid_X, fill_values, axis=1)
        Z = clf.predict(grid_X)
        Z = np.argmax(Z, axis=1)
        Z = Z.reshape(xx.shape)
        ax.pcolormesh(xx, yy, Z, cmap=cmap_light)
        
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.scatter(x_data, y_data, c=ground_truth, cmap=cmap_bold, s=100, marker='o', edgecolors='k')
        
    ax.set_xlabel(x_label, fontsize=font_size)
    ax.set_ylabel(y_label, fontsize=font_size)
    ax.set_title(title, fontsize=title_font_size)

def plot_history(history, samples=100, init_phase_samples=None, plot_line=False):
    epochs = history.params['epochs']
    
    acc = history.history['acc']
    val_acc = history.history['val_acc']
    loss = history.history['loss']
    val_loss = history.history['val_loss']

    every_sample = max(1, epochs // samples)  # never 0, which would be an invalid slice step
    acc = pd.DataFrame(acc).iloc[::every_sample, :]
    val_acc = pd.DataFrame(val_acc).iloc[::every_sample, :]
    loss = pd.DataFrame(loss).iloc[::every_sample, :]
    val_loss = pd.DataFrame(val_loss).iloc[::every_sample, :]

    if init_phase_samples:
        acc = acc.loc[init_phase_samples:]
        val_acc = val_acc.loc[init_phase_samples:]
        loss = loss.loc[init_phase_samples:]
        val_loss = val_loss.loc[init_phase_samples:]
    
    fig, ax = plt.subplots(nrows=2, figsize=(15, 8))

    ax[0].plot(acc, 'bo', label='Training acc')
    ax[0].plot(val_acc, 'b', label='Validation acc')
    ax[0].set_title('Training and validation accuracy')
    ax[0].legend()
    
    if plot_line:
        x, y, _ = linear_regression(acc)
        ax[0].plot(x, y, 'bo', color='red')
        x, y, _ = linear_regression(val_acc)
        ax[0].plot(x, y, 'b', color='red')
    
    ax[1].plot(loss, 'bo', label='Training loss')
    ax[1].plot(val_loss, 'b', label='Validation loss')
    ax[1].set_title('Training and validation loss')
    ax[1].legend()
    
    if plot_line:
        x, y, _ = linear_regression(loss)
        ax[1].plot(x, y, 'bo', color='red')
        x, y, _ = linear_regression(val_loss)
        ax[1].plot(x, y, 'b', color='red')
    
from sklearn import linear_model

def linear_regression(data):
    x = np.array(data.index).reshape(-1, 1)
    y = data.values.reshape(-1, 1)

    regr = linear_model.LinearRegression()
    regr.fit(x, y)
    y_pred = regr.predict(x)
    return x, y_pred, regr.coef_

In [0]:
plotPrediction(None, X[:, 1], X[:, 0], 
               'Age', 'Max Speed', y, mesh=False,
                title="All Data",
                fname='all.png')


Second important concept: To have an idea how well the training worked, we save some data to test our model on previously unseen data.

  • The real objective is to have a generalized model that works well on the test data.
  • How well it performs on this test data as opposed to the training data tells us quite a bit as well.
  • Typical splits are 60% for training and 40% for testing or 80/20
  • It is important that we do not use the test data to tweak the hyper parameters of our learning strategy - in this case the test data would (indirectly) influence the training and can no longer tell how well we did
  • evaluate on the test data set only once, at the end of your experiment
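
To illustrate what a balanced (stratified) split does, here is a minimal sketch in plain Python. The helper `stratified_split` and the toy labels are assumptions made up for this illustration; the notebook itself relies on scikit-learn's `train_test_split`, which may pick different samples.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.4, seed=42):
    """Return (train_idx, test_idx) so that each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random choice within each class
        n_test = round(len(indices) * test_size)  # same fraction for every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# hypothetical toy labels: 10 samples for each of the three groups
labels = [0] * 10 + [1] * 10 + [2] * 10
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Both parts contain the three groups in the same proportion, which keeps the scores comparable between training and test data.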


In [0]:
from sklearn.model_selection import train_test_split

In [0]:
# using stratify we get a balanced number of samples per category (important!)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)

In [0]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape


Out[0]:
((900, 3), (900,), (600, 3), (600,))

In [0]:
np.unique(y_train, return_counts=True)


Out[0]:
(array([0, 1, 2]), array([301, 300, 299]))

In [0]:
np.unique(y_test, return_counts=True)


Out[0]:
(array([0, 1, 2]), array([200, 200, 200]))

In [0]:
X_train_2_dim = X_train[:, :2]
X_test_2_dim = X_test[:, :2]

In [0]:
plotPrediction(None, X_train_2_dim[:, 1], X_train_2_dim[:, 0], 
               'Age', 'Max Speed', y_train, mesh=False,
                title="Train Data",
                fname='train.png')



In [0]:
plotPrediction(None, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test, mesh=False,
                title="Test Data",
                fname='test.png')


KNN - Most basic learning strategy: Look at the neighbors to make a prediction for a sample yet unknown

Interactive introduction to KNN: https://beta.observablehq.com/@djcordhose/how-to-build-a-teachable-machine-with-tensorflow-js#knndataset
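
The idea behind KNN fits into a few lines of plain Python. This is only a sketch to show the principle: the function `knn_predict` and the toy (speed, age) points are assumptions for illustration, and scikit-learn's `KNeighborsClassifier` uses more efficient data structures.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# hypothetical toy data: (speed, age) pairs with their group
points = [(100, 20), (105, 22), (150, 60), (155, 65), (120, 40)]
groups = [0, 0, 1, 1, 2]

print(knn_predict(points, groups, (102, 21), k=1))
print(knn_predict(points, groups, (152, 62), k=3))
```

With k=1 only the single closest customer decides; with a larger k a single odd neighbor can be outvoted.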


In [0]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(1)

In [0]:
%time clf.fit(X_train_2_dim, y_train)


CPU times: user 681 µs, sys: 1.86 ms, total: 2.54 ms
Wall time: 6.43 ms
Out[0]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

In [0]:
plotPrediction(clf, X_train_2_dim[:, 1], X_train_2_dim[:, 0], 
               'Age', 'Max Speed', y_train,
                title="Train Data, KNN, k=1",
                fname='knn1-train.png')



In [0]:
clf.score(X_train_2_dim, y_train)


Out[0]:
0.9577777777777777

In [0]:
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, KNN, k=1",
                fname='knn1-test.png')



In [0]:
clf.score(X_test_2_dim, y_test)


Out[0]:
0.5983333333333334

In [0]:
# http://scikit-learn.org/stable/modules/cross_validation.html
from sklearn.model_selection import cross_val_score

In [0]:
# cross_val_score?

Cross Validation splits the train data in different ways and performs a number of training runs (3 in this case)
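
How the training data is cut into folds can be sketched as follows. This is a simplified, unshuffled and unstratified version written for illustration (the helper `k_fold_indices` is an assumption); for classifiers scikit-learn uses a stratified variant that also balances the classes in each fold.

```python
def k_fold_indices(n_samples, n_folds=3):
    """Yield (train_idx, val_idx) pairs; every sample is used for validation exactly once."""
    indices = list(range(n_samples))
    # distribute the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, n_folds=3))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

Each run trains on two thirds of the data and scores on the remaining third, which is why we get three scores back.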


In [0]:
scores = cross_val_score(clf, X_train_2_dim, y_train, n_jobs=-1)
scores


Out[0]:
array([0.59468439, 0.61666667, 0.62876254])

In [0]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


Accuracy: 0.61 (+/- 0.03)

Exercise: Try to manually find a better k

  • execute the notebook up to this point
  • change k based on your previous experiments
  • what is the important value to check?

Third important concept: Our objective is to make the best prediction for unknown samples. This is called generalization. If we perform well on known data, but worse on unknown data, this is called overfitting. This is to be avoided. Measures taken to avoid overfitting are also known as regularization.

In KNN we reduce overfitting by taking more neighbors into account

We can search for the best number of neighbors manually, but grid search does the same thing with less manual effort. This one tries every number of neighbors from 1 to 49, each with uniform and with distance-based weighting.
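
Conceptually, a grid search is just a loop over all parameter combinations. The sketch below shows that loop in plain Python; `grid_search` and `toy_score` are hypothetical helpers, and the toy score is a made-up stand-in for a cross-validated accuracy, not a result from our data.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(n_neighbors, weights):
    """Made-up score that peaks at n_neighbors=10 with uniform weights."""
    penalty = 0.0 if weights == "uniform" else 0.01
    return 0.7 - abs(n_neighbors - 10) / 100 - penalty

param_grid = {"n_neighbors": list(range(1, 50)), "weights": ["uniform", "distance"]}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, round(best_score, 2))
```

The grid above has 49 x 2 = 98 combinations, and each of them costs one full cross validation in the real search.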


In [0]:
# KNeighborsClassifier?

In [0]:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_neighbors': list(range(1, 50)),
    'weights': ['uniform', 'distance'] # are points that are nearer more important?
    }
clf = GridSearchCV(KNeighborsClassifier(), param_grid, n_jobs=-1)
%time clf.fit(X_train_2_dim, y_train)
clf.best_params_


CPU times: user 164 ms, sys: 22 ms, total: 186 ms
Wall time: 1.41 s
Out[0]:
{'n_neighbors': 39, 'weights': 'uniform'}

In [0]:
clf = KNeighborsClassifier(n_neighbors=39, weights='uniform')
%time clf.fit(X_train_2_dim, y_train)


CPU times: user 1 ms, sys: 1.58 ms, total: 2.58 ms
Wall time: 2.16 ms
Out[0]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=39, p=2,
           weights='uniform')

In [0]:
plotPrediction(clf, X_train_2_dim[:, 1], X_train_2_dim[:, 0], 
               'Age', 'Max Speed', y_train,
                title="Train Data, KNN, k=39")


A rule of thumb: Smoother decision boundaries imply less overfitting


In [0]:
clf.score(X_train_2_dim, y_train)


Out[0]:
0.6955555555555556

In [0]:
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, KNN, k=39")



In [0]:
clf.score(X_test_2_dim, y_test)


Out[0]:
0.7133333333333334

In [0]:
scores = cross_val_score(clf, X_train_2_dim, y_train, n_jobs=-1)
scores


Out[0]:
array([0.6910299 , 0.70666667, 0.66889632])

In [0]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


Accuracy: 0.69 (+/- 0.03)

Logistic Regression


In [0]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
%time clf.fit(X_train_2_dim, y_train)


CPU times: user 7.19 ms, sys: 490 µs, total: 7.68 ms
Wall time: 11.6 ms
Out[0]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [0]:
plotPrediction(clf, X_train_2_dim[:, 1], X_train_2_dim[:, 0], 
               'Age', 'Max Speed', y_train,
                title="Train Data, Logistic Regression")



In [0]:
clf.score(X_train_2_dim, y_train)


Out[0]:
0.3788888888888889

In [0]:
scores = cross_val_score(clf, X_train_2_dim, y_train, n_jobs=-1)
scores


Out[0]:
array([0.34883721, 0.40333333, 0.39464883])

In [0]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


Accuracy: 0.38 (+/- 0.05)

In [0]:
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, Logistic Regression")



In [0]:
clf.score(X_test_2_dim, y_test)


Out[0]:
0.37666666666666665

Decision Trees

  • Another learning strategy, just like KNN is one
  • Splits our data set on a certain variable
  • Similar to what we have done in the manual classifier, but here the rules are actually learned

In [0]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
%time clf.fit(X_train_2_dim, y_train)


CPU times: user 1.89 ms, sys: 0 ns, total: 1.89 ms
Wall time: 1.9 ms
Out[0]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

How is the Decision Tree being Constructed?

We are using the CART algorithm:

  • top-down split the set of examples into two new sets
  • choose a variable and a value at each step that best splits our customer examples
  • terminal node when no further gain possible or regularization kicks in

http://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart

What is the best split?

  • assign a category to each node containing a certain set of samples
  • use a metric (Gini or Entropy) to decide how good a node would be based on that category
  • sum up weighted metric for both child nodes
  • optimize the split for that summed metric

https://machinelearningmastery.com/classification-and-regression-trees-for-machine-learning/
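
The steps above can be sketched in a few lines of plain Python using the Gini metric. The function names and the toy data are assumptions for illustration; a real CART implementation searches all variables and typically uses midpoints between sorted values as candidate thresholds.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, higher for mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_score(values, labels, threshold):
    """Weighted Gini impurity of both child nodes for the rule `value <= threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Pick the candidate threshold with the lowest weighted impurity."""
    candidates = sorted(set(values))[:-1]  # splitting at the maximum leaves one side empty
    return min(candidates, key=lambda t: split_score(values, labels, t))

ages = [18, 22, 25, 40, 45, 60]   # hypothetical toy data
groups = [0, 0, 0, 1, 1, 1]
print(gini(groups))
print(best_threshold(ages, groups))
```

A split that separates the two groups completely gets a weighted impurity of 0, so it wins over all other candidates.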


In [0]:
# the depth of the fitted tree is the maximum number of splits we perform until we decide where a data point belongs

clf.tree_.max_depth


Out[0]:
18

Complete Decision Tree


In [0]:
plotPrediction(clf, X_train_2_dim[:, 1], X_train_2_dim[:, 0], 
               'Age', 'Max Speed', y_train,
                title="Train Data, Decision Tree",
                fname='dt-overfit-train.png')



In [0]:
clf.score(X_train_2_dim, y_train)


Out[0]:
0.96

In [0]:
scores = cross_val_score(clf, X_train_2_dim, y_train, n_jobs=-1)
scores


Out[0]:
array([0.57142857, 0.57666667, 0.62207358])

In [0]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


Accuracy: 0.59 (+/- 0.05)

In [0]:
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, Decision Tree",
                fname='dt-overfit-test.png')



In [0]:
clf.score(X_test_2_dim, y_test)


Out[0]:
0.5966666666666667

We overfit heavily and need to change the relevant parameters of our tree

  • its maximum number of splits (depth) - if there is no limit, we can make as many splits as it takes to perfectly match all train data (overfitting)
  • how many samples we need at least for a leaf - if it is just one, we could perfectly fit all training data (overfitting)
  • how many samples do we need to make another split - not as crucial as the other two, but can still limit overfitting

In [0]:
param_grid = {
    'max_depth': list(range(2, 25)),
    'min_samples_split': list(range(2, 11)),
    'min_samples_leaf': list(range(1, 11))
}
clf = GridSearchCV(DecisionTreeClassifier(), param_grid, n_jobs=-1)
%time clf.fit(X_train_2_dim, y_train)
clf.best_params_


CPU times: user 4.02 s, sys: 45 ms, total: 4.07 s
Wall time: 10.8 s
Out[0]:
{'max_depth': 6, 'min_samples_leaf': 3, 'min_samples_split': 4}

In [0]:
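# min_samples_split=2 behaves like the 4 found above: with min_samples_leaf=3 a node needs at least 6 samples to be split
clf = DecisionTreeClassifier(max_depth=6,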
                              min_samples_leaf=3,
                              min_samples_split=2)
%time clf.fit(X_train_2_dim, y_train)


CPU times: user 2.32 ms, sys: 791 µs, total: 3.11 ms
Wall time: 2.13 ms
Out[0]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=3, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [0]:
clf.tree_.max_depth


Out[0]:
6

Reduced Decision Tree


In [0]:
plotPrediction(clf, X_train_2_dim[:, 1], X_train_2_dim[:, 0], 
               'Age', 'Max Speed', y_train,
                title="Train Data, Regularized Decision Tree",
                fname='dt-sweet-train.png')



In [0]:
clf.score(X_train_2_dim, y_train)


Out[0]:
0.7444444444444445

In [0]:
scores = cross_val_score(clf, X_train_2_dim, y_train, n_jobs=-1)
scores


Out[0]:
array([0.68438538, 0.68      , 0.67558528])

In [0]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


Accuracy: 0.68 (+/- 0.01)

In [0]:
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, Regularized Decision Tree",
                fname='dt-sweet-test.png')



In [0]:
clf.score(X_test_2_dim, y_test)


Out[0]:
0.6833333333333333

Random Forest

  • We fight overfitting in decision trees with some success
  • However, inherent to their nature, decision trees tend to overfit
  • Random Forest is an ensemble technique that trains a number of simple decision trees and uses a majority vote over all of them for prediction
  • While each decision tree still overfits, using many of them softens this problem
  • You still need to regularize the underlying decision trees
  • the sklearn version used here defaults to 10 decision trees per random forest (releases from 0.22 on default to 100)
  • Random Forest is the swiss army knife of machine learning
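
The two ingredients of the ensemble, random resampling of the training data and a vote over all trees, can be sketched like this. The threshold rules stand in for fitted trees and are purely hypothetical; note also that scikit-learn averages the predicted class probabilities of the trees instead of counting hard votes.

```python
import random
from collections import Counter

def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement; each tree trains on its own sample."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def majority_vote(predictions):
    """Return the most frequent class among the individual tree predictions."""
    return Counter(predictions).most_common(1)[0][0]

# hypothetical stand-ins for three fitted trees: simple threshold rules on age
trees = [
    lambda age: 0 if age < 30 else 1,
    lambda age: 0 if age < 35 else 1,
    lambda age: 0 if age < 50 else 1,
]

sample = bootstrap_sample(10, random.Random(0))
votes = [tree(40) for tree in trees]
print(sample)
print(votes, majority_vote(votes))
```

The third rule disagrees for a 40 year old customer, but it is outvoted by the other two.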

In [0]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_jobs=-1)
%time clf.fit(X_train_2_dim, y_train)


CPU times: user 27.9 ms, sys: 5.67 ms, total: 33.6 ms
Wall time: 112 ms
Out[0]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [0]:
clf.score(X_train_2_dim, y_train)


Out[0]:
0.9411111111111111

In [0]:
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, Random Forest",
                fname='rf-overfit-train.png')



In [0]:
clf.score(X_test_2_dim, y_test)


Out[0]:
0.6316666666666667

In [0]:
# brute force grid search is far too expensive

param_grid = {
    'n_estimators': list(range(3,20)),
    'max_depth': list(range(2, 25)),
    'min_samples_split': list(range(2, 11)),
    'min_samples_leaf': list(range(1, 11))
}
clf = GridSearchCV(RandomForestClassifier(), param_grid, n_jobs=-1)
# %time clf.fit(X_train_2_dim, y_train)
# clf.best_params_

Unfortunately, training a random forest is more expensive than training a single decision tree, roughly by a factor of the number of estimators it uses (10 by default here). This makes an exhaustive grid search over all parameters prohibitively expensive. Instead, we use a randomized search that tries 1000 randomly drawn parameter combinations and hope that a good one is among them.
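
To get a feeling for the numbers, the sketch below counts the combinations in the full grid and draws random candidates from the same ranges. It uses only the standard library; `sample_params` is a hypothetical helper and not how `RandomizedSearchCV` is implemented internally.

```python
import random

search_space = {
    "n_estimators": range(3, 20),
    "max_depth": range(2, 25),
    "min_samples_split": range(2, 11),
    "min_samples_leaf": range(1, 11),
}

# size of the full grid: the product of the number of values per parameter
grid_size = 1
for values in search_space.values():
    grid_size *= len(values)

def sample_params(space, rng):
    """Draw one random parameter combination from the search space."""
    return {name: rng.choice(values) for name, values in space.items()}

rng = random.Random(0)
candidates = [sample_params(search_space, rng) for _ in range(1000)]
print(grid_size, len(candidates))
```

The full grid has 35190 combinations, so 1000 random draws cover only a small fraction of it at a correspondingly small cost.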


In [0]:
# http://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_dist = {
    'n_estimators': randint(3,20),
    "max_depth": randint(2, 25),
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 11)
}

clf = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=1000, n_jobs=-1)
%time clf.fit(X_train_2_dim, y_train)
clf.best_params_


CPU times: user 3.29 s, sys: 67.5 ms, total: 3.35 s
Wall time: 50.3 s
Out[0]:
{'max_depth': 5,
 'min_samples_leaf': 2,
 'min_samples_split': 8,
 'n_estimators': 18}

In [0]:
# parameters might vary a bit with each run, because it is a random search
clf = RandomForestClassifier(max_depth=5, min_samples_leaf=3, min_samples_split=7, n_estimators=12, n_jobs=-1)
%time clf.fit(X_train_2_dim, y_train)


CPU times: user 26.9 ms, sys: 10.2 ms, total: 37.1 ms
Wall time: 118 ms
Out[0]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=3, min_samples_split=7,
            min_weight_fraction_leaf=0.0, n_estimators=12, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [0]:
clf.score(X_train_2_dim, y_train)


Out[0]:
0.7311111111111112

In [0]:
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, Regularized Random Forest",
                fname='rf-sweet-test.png')



In [0]:
clf.score(X_test_2_dim, y_test)


Out[0]:
0.6866666666666666

Exercise: Try to optimize on the parameters

  • we only have a good guess about the optimal parameters here
  • re-run the randomized search and try to get better results
  • you can also try the grid search with a limited search range

Support Vector Machines (SVM)

  • SVMs used to be the hot stuff before neural networks stole the show
  • SVMs choose a small number of data points to decide where to draw the decision boundary, they are called the support vectors

Interactive Introduction: https://dash-gallery.plotly.host/dash-svm

In their basic version, SVMs can only use lines as decision boundaries. Let's see how well this works.


In [0]:
from sklearn.svm import SVC
clf = SVC(kernel='linear')
%time clf.fit(X_train_2_dim, y_train)


CPU times: user 403 ms, sys: 1.69 ms, total: 404 ms
Wall time: 407 ms
Out[0]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [0]:
clf.score(X_train_2_dim, y_train)


Out[0]:
0.4533333333333333

In [0]:
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, SVM linear Kernel",
                fname='svm-underfit-linear-train.png')


SVM Kernels

  • As we can see, using lines only we cannot even fit the training data; this is called underfitting.
  • For most realistic examples we need something better, called the 'kernel trick'
  • We transform the original problem space into another space in which the classes can be separated by lines
  • Radial Basis Functions ('rbf') can approximate any function and are trained to perform this transformation
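
The idea of transforming the problem space can be illustrated with a tiny one-dimensional example. The data and the helper `separable_by_threshold` are made up for this sketch, and the transformation is spelled out explicitly here, whereas a kernel SVM never computes it directly.

```python
# hypothetical 1-D data: class 1 sits in the middle, class 0 on both sides
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
ys = [0, 0, 1, 1, 1, 0, 0]

def separable_by_threshold(values, labels):
    """True if a single threshold on `values` puts each class on its own side."""
    ones = [v for v, l in zip(values, labels) if l == 1]
    zeros = [v for v, l in zip(values, labels) if l == 0]
    return max(ones) < min(zeros) or max(zeros) < min(ones)

squared = [x ** 2 for x in xs]  # explicit transformation into a new feature

print(separable_by_threshold(xs, ys))
print(separable_by_threshold(squared, ys))
```

No single cut on the original values separates the classes, but after squaring them one threshold is enough.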

In [0]:
from sklearn.svm import SVC
clf = SVC()
%time clf.fit(X_train_2_dim, y_train)


CPU times: user 61.5 ms, sys: 1.97 ms, total: 63.4 ms
Wall time: 68 ms
Out[0]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [0]:
clf.score(X_train_2_dim, y_train)


Out[0]:
0.9055555555555556

In [0]:
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, SVM rbf Kernel",
                fname='svm-overfit-test.png')



In [0]:
clf.score(X_test_2_dim, y_test)


Out[0]:
0.64

In [0]:
# SVC?

Again we strongly overfit and need to regularize our model. The two important parameters are

  • Gamma
    • reach of a single training sample - low values: far, high values: close
    • the lower the value, the more points far away from the support vectors influence where the decision boundaries go
  • C (Cost)
    • decides how expensive it is to misclassify one of our support vectors; the smaller the value, the more tolerant the model is to misclassified samples
    • trade-off between smooth decision boundaries and classifying samples correctly
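
How gamma controls the reach of a single sample can be seen directly from the usual form of the RBF kernel, exp(-gamma * squared distance). The two (speed, age) points below are hypothetical and only serve as an illustration.

```python
import math

def rbf_kernel(a, b, gamma):
    """RBF similarity: 1.0 for identical points, approaching 0 for distant ones."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

support_vector = (100.0, 40.0)   # hypothetical (speed, age) sample
other = (110.0, 45.0)            # a customer some distance away

for gamma in (0.001, 0.1, 1.0):
    print(gamma, rbf_kernel(support_vector, other, gamma))
```

With a low gamma the distant customer still counts as fairly similar, so the sample reaches far; with a high gamma the similarity drops to practically zero and only very close points matter.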

Interactively experiment with C and Gamma: https://dash-gallery.plotly.host/dash-svm

http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html


In [0]:
param_grid = {
    'C': list(np.append(np.arange(0.1, 1.0, 0.1), np.arange(2, 10, 1))),
    'gamma': list(np.append(np.arange(0.001, 0.1, 0.0005), np.arange(.02, 1.0, 0.1))),
}
clf = GridSearchCV(SVC(), param_grid, n_jobs=-1)
%time clf.fit(X_train_2_dim, y_train)
clf.best_params_


CPU times: user 9.39 s, sys: 328 ms, total: 9.72 s
Wall time: 4min 58s
Out[0]:
{'C': 9.0, 'gamma': 0.001}

In [0]:
clf = SVC(C=9, gamma=0.001)
%time clf.fit(X_train_2_dim, y_train)


CPU times: user 29.5 ms, sys: 1.31 ms, total: 30.8 ms
Wall time: 34.8 ms
Out[0]:
SVC(C=9, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [0]:
clf.score(X_train_2_dim, y_train)


Out[0]:
0.7122222222222222

In [0]:
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, SVM Regularized rbf Kernel",
                fname='svm-reg-test.png')



In [0]:
clf.score(X_test_2_dim, y_test)


Out[0]:
0.71

Exercise: Improve Cost and Gamma

  • use https://dash-gallery.plotly.host/dash-svm to get a better intuition for C and Gamma
  • Change C and Gamma based on your intuition
  • Can you do better than the results coming from the grid search?
  • If not: Why not?

Neural Networks using TensorFlow and Keras Layers

Important: if you run this on Colab, switch on the GPU option, because otherwise this part will take quite some time.


In [0]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR)
print(tf.__version__)


1.11.0

In [0]:
# let's see what compute devices we have available, hopefully a GPU
sess = tf.Session()
devices = sess.list_devices()
for d in devices:
    print(d.name)
hello = tf.constant('Hello TF!')
print(sess.run(hello))


/job:localhost/replica:0/task:0/device:CPU:0
/job:localhost/replica:0/task:0/device:GPU:0
b'Hello TF!'

In [0]:
from tensorflow import keras
print(keras.__version__)


2.1.6-tf

Neuron (aka node or unit)

A neuron takes a number of numerical inputs, multiplies each with a weight, sums up all weighted inputs and adds a bias (a constant) to that sum. From this it creates a single numerical output. For one input (one dimension) this would be a description of a line. For more dimensions this describes a hyperplane that can serve as a decision boundary. Typically, this output is transformed using an activation function which compresses the output to a value between 0 and 1 (sigmoid), or between -1 and 1 (tanh), or sets all negative values to zero (relu).

It is not really important to understand the details of a neural network. Practically, how you configure them to form something more powerful is much more important. This, however, is still a very experimental domain, so there really is no concise explanation or understanding of how they work.
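
A single neuron as described here fits into a few lines of plain Python. The inputs, weights and bias below are hypothetical values chosen for illustration; in a real network the weights and the bias are learned during training.

```python
import math

def sigmoid(z):
    """Compress z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Set negative values to zero, keep positive values unchanged."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation=math.tanh):
    """Weighted sum of the inputs plus bias, passed through an activation function."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

inputs = [0.5, -1.0]    # hypothetical, already scaled inputs
weights = [0.8, 0.2]    # hypothetical weights
bias = 0.1

for activation in (sigmoid, math.tanh, relu):
    print(activation.__name__, neuron(inputs, weights, bias, activation))
```

The weighted sum is the same in all three cases; only the activation function changes what the neuron finally outputs.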

From Neuron to Layer

  • Neural Networks consist of artificial neurons you organize in layers
  • each neuron is very simple, but, theoretically, having enough of them in a single layer can approximate any function
  • practically, we use 2 or 3 layers, as this has turned out to work well
  • the more neurons and the more layers you use the longer the network takes to train
  • neural networks often are no longer approachable using cross validation and grid search to find suitable hyper parameters

We start with two hidden layers each having 500 neurons


In [0]:
# tf.keras.layers.Dense?

In [0]:
from tensorflow.keras.layers import Dense

model = keras.Sequential()

model.add(Dense(units=500, name='hidden1', activation='tanh', input_dim=2))
model.add(Dense(units=500, name='hidden2', activation='tanh'))

model.add(Dense(3, name='softmax', activation='softmax'))

model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
hidden1 (Dense)              (None, 500)               1500      
_________________________________________________________________
hidden2 (Dense)              (None, 500)               250500    
_________________________________________________________________
softmax (Dense)              (None, 3)                 1503      
=================================================================
Total params: 253,503
Trainable params: 253,503
Non-trainable params: 0
_________________________________________________________________
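
The parameter counts in this summary can be checked by hand: each neuron of a dense layer has one weight per input plus one bias. The helper `dense_params` is just an illustration of that arithmetic.

```python
def dense_params(n_inputs, n_units):
    """One weight per input for each unit, plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [2, 500, 500, 3]   # input dimension, hidden1, hidden2, softmax
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))
```

Almost all of the parameters sit in the connection between the two hidden layers.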

If you do not train for too long, this network will not overfit by much even without any further regularization, but look at how strange the decision boundaries look

Validation Split

As test evaluation is a one-shot test, we again set aside some data for continuous validation

We can not use cross validation as each training run takes far too much time
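
What a validation split of 20% means for our 900 training samples can be sketched as follows. The helper `hold_out_validation` is an assumption for illustration; it mirrors the documented behavior of Keras' `validation_split`, which takes the validation samples from the end of the data before any shuffling.

```python
def hold_out_validation(samples, validation_split=0.2):
    """Split off the last fraction of the samples for validation."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]

samples = list(range(900))   # stand-in for the 900 training rows
train_part, val_part = hold_out_validation(samples)
print(len(train_part), len(val_part))
```

The validation part is never used to update the weights, so its loss and accuracy tell us after each epoch how well the model generalizes.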


In [0]:
BATCH_SIZE=1000
EPOCHS = 2000

model.compile(loss='sparse_categorical_crossentropy',
             optimizer='adam',
             metrics=['accuracy'])

early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=100, verbose=1)
# checkpoint = tf.keras.callbacks.ModelCheckpoint('keras-model.epoch-{epoch:02d}-val_loss-{val_loss:.2f}.hdf5',
#                                                 verbose = 1, save_best_only=True)
checkpoint = tf.keras.callbacks.ModelCheckpoint('keras-model.hdf5', verbose = 1, save_best_only=True)

# %time model.fit(X_train_2_dim, y_train_categorical, epochs=EPOCHS, batch_size=BATCH_SIZE, validation_split=0.2, callbacks=[checkpoint, early_stopping])
%time history = model.fit(X_train_2_dim, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE, validation_split=0.2, verbose=0)


CPU times: user 18.8 s, sys: 2.84 s, total: 21.7 s
Wall time: 16.1 s

In [0]:
train_loss, train_accuracy = model.evaluate(X_train_2_dim, y_train, batch_size=BATCH_SIZE)
train_accuracy


900/900 [==============================] - 0s 4us/step
Out[0]:
0.7177777886390686

In [0]:
test_loss, test_accuracy = model.evaluate(X_test_2_dim, y_test, batch_size=BATCH_SIZE)
test_accuracy


600/600 [==============================] - 0s 12us/step
Out[0]:
0.6983333230018616

In [0]:
plot_history(history)



In [0]:
plot_keras_prediction(model, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, NN")


Even though the scores do not look too bad, the decision boundaries tell us that this is not a good result.

We have several means of regularization

We use a combination of them:

  • reduce capacity of model
  • dropout
  • batch normalization
  • change activation to relu for faster training
  • reduce the number of training epochs

An experimental approach:

  • keep adding regularization to make test and train scores come closer to each other
  • this will come at the cost of train scores going down
  • if both values start going down you have gone too far
  • each experiment takes some time
  • for larger datasets and more complex models some people start by overfitting on a subsample of the data (because it trains much faster)
    • then you can be sure you have an architecture that at least has the capacity to solve the problem
    • then keep adding regularization
    • eventually try using the complete data
  • if you want to use batch normalization, place it between the raw output of the neuron and the activation function

In [0]:
from tensorflow.keras.layers import Dense, Activation, BatchNormalization, Dropout

# https://stackoverflow.com/questions/34716454/where-do-i-call-the-batchnormalization-function-in-keras
# regularisation:
# - dropout
# - batch normalization
# - reduce capacity of model

dropout = 0.6
model = keras.Sequential()

model.add(Dense(100, name='hidden1', input_dim=2))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(dropout))

model.add(Dense(100, name='hidden2'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(dropout))

model.add(Dense(3, name='softmax', activation='softmax'))

model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
hidden1 (Dense)              (None, 100)               300       
_________________________________________________________________
batch_normalization (BatchNo (None, 100)               400       
_________________________________________________________________
activation (Activation)      (None, 100)               0         
_________________________________________________________________
dropout (Dropout)            (None, 100)               0         
_________________________________________________________________
hidden2 (Dense)              (None, 100)               10100     
_________________________________________________________________
batch_normalization_1 (Batch (None, 100)               400       
_________________________________________________________________
activation_1 (Activation)    (None, 100)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
softmax (Dense)              (None, 3)                 303       
=================================================================
Total params: 11,503
Trainable params: 11,103
Non-trainable params: 400
_________________________________________________________________
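The parameter counts in this summary can be checked by hand. A dense layer has one weight per input-output pair plus one bias per output. A batch normalization layer keeps four values per unit: a trainable scale and offset, and a non-trainable moving mean and moving variance. The helper names below are hypothetical and only used for this arithmetic.

```python
def dense_params(n_in, n_out):
    # weights (n_in * n_out) plus one bias per output unit
    return n_in * n_out + n_out

def batchnorm_params(n_units):
    trainable = 2 * n_units      # scale (gamma) and offset (beta)
    non_trainable = 2 * n_units  # moving mean and moving variance
    return trainable, non_trainable

bn_trainable, bn_frozen = batchnorm_params(100)

trainable = (dense_params(2, 100) + bn_trainable      # hidden1 + batch norm
             + dense_params(100, 100) + bn_trainable  # hidden2 + batch norm
             + dense_params(100, 3))                  # softmax output
non_trainable = 2 * bn_frozen

print(trainable, non_trainable, trainable + non_trainable)
```

This reproduces the 11,103 trainable and 400 non-trainable parameters reported above. Activation and dropout layers add no parameters.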

In [0]:
BATCH_SIZE=1000
EPOCHS = 3000

model.compile(loss='sparse_categorical_crossentropy',
             optimizer='adam',
             metrics=['accuracy'])

early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=500, verbose=1)

%time history = model.fit(X_train_2_dim, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE, validation_split=0.2, callbacks=[early_stopping], verbose=0)


Epoch 02354: early stopping
CPU times: user 36.6 s, sys: 4.64 s, total: 41.2 s
Wall time: 28.8 s
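Early stopping with `patience=500` ends training once the monitored `val_loss` has not improved for 500 epochs in a row, so a stop at epoch 2354 suggests the last improvement happened roughly 500 epochs earlier. The function below is a simplified sketch of that patience rule, not the Keras implementation; `early_stopping_epoch` and the toy loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# best value at epoch 3, then three epochs without improvement
toy_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
print(early_stopping_epoch(toy_losses, patience=3))
```

Unless the callback is created with `restore_best_weights=True` (available in recent Keras versions) or a saved checkpoint is reloaded, the model keeps the weights from the final epoch rather than from the best one.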

In [0]:
# use the best model
# from keras.models import load_model

# model = load_model('keras-model.hdf5')

In [0]:
train_loss, train_accuracy = model.evaluate(X_train_2_dim, y_train, batch_size=BATCH_SIZE)
train_accuracy


900/900 [==============================] - 0s 10us/step
Out[0]:
0.7144444584846497

In [0]:
test_loss, test_accuracy = model.evaluate(X_test_2_dim, y_test, batch_size=BATCH_SIZE)
test_accuracy


600/600 [==============================] - 0s 17us/step
Out[0]:
0.7099999785423279

In [0]:
plot_history(history)



In [0]:
plot_history(history, init_phase_samples=250, plot_line=True)


Scores around 70% look good now. There might even be a bit more potential here, but we are not chasing the last percentage point
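One way to follow the experimental approach described earlier is to track the gap between train and test accuracy from run to run. The snippet below simply plugs in the accuracies printed for the two networks in this section; the labels "unregularized" and "regularized" are ours.

```python
# (train accuracy, test accuracy) as printed by model.evaluate above
scores = {
    "unregularized": (0.7177777886390686, 0.6983333230018616),
    "regularized": (0.7144444584846497, 0.7099999785423279),
}

gaps = {name: train - test for name, (train, test) in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.4f}")
```

The gap shrinks from about 0.019 to about 0.004 while the train score drops only slightly, which is the pattern the regularization steps were aiming for.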


In [0]:
plot_keras_prediction(model, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, Dropout, BatchNormalization, Reduced Cap NN",
                fname='nn-reg-test.png')


It is surprising how smooth these decision boundaries are and how similar they are to the ones created by the SVM

What you should be aware of when you go into PRODUCTION


In [0]:
def meshGrid(x_data, y_data, xlim=None, ylim=None):
    h = 1  # step size in the mesh
    if xlim is None:
        xlim = x_data.min(), x_data.max()
    if ylim is None:
        ylim = y_data.min(), y_data.max()
        
    x_min, x_max = xlim
    y_min, y_max = ylim
    xx, yy = np.meshgrid(np.arange(x_min - 1, x_max + 1, h),
                         np.arange(y_min - 1, y_max + 1, h))
    return xx, yy, xlim, ylim
  
def ext_plot_keras_prediction(clf, x_data, y_data, x_label, y_label, ground_truth, title="", 
                              size=(15, 8),
                              xlim=(16, 90), ylim=(70, 170)):
    xx,yy, xlim, ylim = meshGrid(x_data, y_data, xlim, ylim)
    fig, ax = plt.subplots(figsize=size)

    if clf:
        grid_X = np.array(np.c_[yy.ravel(), xx.ravel()])
        Z = clf.predict(grid_X)
        Z = np.argmax(Z, axis=1)
        Z = Z.reshape(xx.shape)
        ax.pcolormesh(xx, yy, Z, cmap=cmap_light)
        
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
    ax.scatter(x_data, y_data, c=ground_truth, cmap=cmap_bold, s=100, marker='o', edgecolors='k')
        
    ax.set_xlabel(x_label, fontsize=font_size)
    ax.set_ylabel(y_label, fontsize=font_size)
    ax.set_title(title, fontsize=title_font_size)

In [0]:
ext_plot_keras_prediction(model, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                xlim=(6, 400), ylim=(10, 250),
                title="Extended Area - Simulating Production")


Be careful with predictions outside of the area covered by training data

First: Be aware of the reasonable range of values

Options

  • Add a warning
  • Reject requests in extended area
  • Keep in mind that scores in the extended area often look very biased towards one category (score of 1.0)
  • Add likelihoods to scores
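The "reject requests in extended area" option can be sketched as a thin wrapper around the prediction call. This is only an illustration under assumptions: `fit_bounds`, `in_training_range`, `guarded_predict` and `dummy_predict` are hypothetical names, the bounds are taken as the per-feature minimum and maximum of a small stand-in array (columns ordered speed, age), and a real service would pass `model.predict` and the actual training data instead.

```python
import numpy as np

def fit_bounds(X_train):
    """Per-feature minimum and maximum seen during training."""
    return X_train.min(axis=0), X_train.max(axis=0)

def in_training_range(X, lower, upper):
    """Boolean mask: True where every feature lies inside the training range."""
    X = np.atleast_2d(X)
    return np.all((X >= lower) & (X <= upper), axis=1)

def guarded_predict(predict_fn, X, lower, upper):
    """Predict only for in-range rows; out-of-range rows yield None."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    ok = in_training_range(X, lower, upper)
    results = [None] * len(X)
    if ok.any():
        for i, probs in zip(np.flatnonzero(ok), predict_fn(X[ok])):
            results[i] = probs
    return results

# Stand-in training data, columns: speed, age
X_train_demo = np.array([[68.0, 16.0], [166.0, 100.0], [120.0, 42.0]])
lower, upper = fit_bounds(X_train_demo)

def dummy_predict(X):
    # placeholder for model.predict: uniform scores over three classes
    return np.full((len(X), 3), 1.0 / 3.0)

requests = [[100, 100], [200, 200], [200, 85]]
results = guarded_predict(dummy_predict, requests, lower, upper)
for request, result in zip(requests, results):
    print(request, "rejected" if result is None else result)
```

A simple bounding box like this is deliberately conservative. It does not catch combinations of values that are individually in range but jointly unlike anything in the training data.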

In [0]:
# red makes sense here
age = 100
speed = 100
model.predict(np.array([[speed, age]]))


Out[0]:
array([[0.8450663 , 0.00088674, 0.15404704]], dtype=float32)

In [0]:
# does not make any sense, but has 1.0 score for yellow
age = 200
speed = 200
model.predict(np.array([[speed, age]]))


Out[0]:
array([[2.1591406e-08, 7.0600289e-29, 1.0000000e+00]], dtype=float32)

In [0]:
# such a customer could exist, but the prediction should be red, not yellow
age = 85
speed = 200
model.predict(np.array([[speed, age]]))


Out[0]:
array([[3.3167265e-07, 3.2714050e-13, 9.9999964e-01]], dtype=float32)
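A plausible explanation for these near-1.0 scores, stated here as an assumption rather than something this notebook verifies, is that the values fed into the final softmax grow in magnitude as the inputs move far away from the training data. Softmax then turns large differences between those values into scores close to 0 and 1, regardless of whether the prediction is sensible. The sketch below shows this saturation on made-up logits.

```python
import numpy as np

def softmax(logits):
    # subtract the maximum for numerical stability
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.0, 2.0, 3.0])  # made-up pre-softmax values
for scale in (1, 10, 100):
    print(scale, softmax(scale * logits))
```

The winning class stays the same in all three cases, but its score moves from moderate to practically 1.0 as the logits are scaled up. A score of 1.0 is therefore not a reliable sign of confidence for inputs outside the training range.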

In [0]: