Basic Concepts of Machine Learning and Overview of Classic Machine Learning Strategies



In [0]:

    
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
%pylab inline

import matplotlib.pyplot as plt
plt.xkcd()









    



Populating the interactive namespace from numpy and matplotlib






    Out[0]:





<contextlib._GeneratorContextManager at 0x7fe786775c18>

Loading and exploring our data set

This is a database of customers of an insurance company. Each data point is one customer. The group encodes how many past accidents the customer has been involved in:

0 - red: many accidents
1 - green: few or no accidents
2 - yellow: in the middle



In [0]:

    
!curl -O https://raw.githubusercontent.com/DJCordhose/deep-learning-crash-course-notebooks/master/data/insurance-customers-1500.csv









    



  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 26783  100 26783    0     0   142k      0 --:--:-- --:--:-- --:--:--  142k



In [0]:

    
import pandas as pd
df = pd.read_csv('./insurance-customers-1500.csv', sep=';')



In [0]:

    
df.head()



In [0]:

    
df.describe()









    Out[0]:







  
    
      
             speed          age        miles        group
count  1500.000000  1500.000000  1500.000000  1500.000000
mean    122.492667    44.980667    30.434000     0.998667
std      17.604333    17.130400    15.250815     0.816768
min      68.000000    16.000000     1.000000     0.000000
25%     108.000000    32.000000    18.000000     0.000000
50%     120.000000    42.000000    29.000000     1.000000
75%     137.000000    55.000000    42.000000     2.000000
max     166.000000   100.000000    84.000000     2.000000

A pairplot gives you a nice overview of your data with just a few lines of code



In [0]:

    
import seaborn as sns

sample_df = df.sample(n=120, random_state=42)
sns.pairplot(sample_df, 
             hue="group", palette={0: '#AA4444', 1: '#006000', 2: '#EEEE44'},
             kind='reg',
             diag_kind='kde', vars=['age', 'speed', 'miles'])









    Out[0]:





<seaborn.axisgrid.PairGrid at 0x7fe765101630>

Concepts

First important concept: You train a machine with your data to make it learn the relationship between some input data and a certain label - this is called supervised learning



In [0]:

    
# we deliberately decide this is going to be our label, it is often called lower case y
y=df['group']



In [0]:

    
# since 'group' is now the label we want to predict, we need to remove it from the training data 
df.drop('group', axis='columns', inplace=True)



In [0]:

    
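# input data is often named upper case X; the upper case indicates a matrix in which each row is one sample vector
# (DataFrame.as_matrix() was removed in newer pandas versions, .values works in old and new ones)
X = df.values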

We restrict ourselves to two dimensions for now

Because two dimensions are all we can really visualize in a 2D plot



In [0]:

    
# ignore this, it is just technical code to plot decision boundaries
# Adapted from:
# http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
# http://jponttuset.cat/xkcd-deep-learning/

from matplotlib.colors import ListedColormap

cmap_print = ListedColormap(['#AA8888', '#004000', '#FFFFDD'])
cmap_bold = ListedColormap(['#AA4444', '#006000', '#EEEE44'])
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#FFFFDD'])
font_size=25
title_font_size=40

def meshGrid(x_data, y_data):
    h = 1  # step size in the mesh
    x_min, x_max = x_data.min() - 1, x_data.max() + 1
    y_min, y_max = y_data.min() - 1, y_data.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return (xx,yy)
    
def plotPrediction(clf, x_data, y_data, x_label, y_label, ground_truth, title="", 
                   mesh=True, fname=None, size=(15, 8)):
    xx,yy = meshGrid(x_data, y_data)
    fig, ax = plt.subplots(figsize=size)

    if clf and mesh:
        Z = clf.predict(np.c_[yy.ravel(), xx.ravel()])
        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        ax.pcolormesh(xx, yy, Z, cmap=cmap_light)
    
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.scatter(x_data, y_data, c=ground_truth, cmap=cmap_bold, s=100, marker='o', edgecolors='k')
        
    ax.set_xlabel(x_label, fontsize=font_size)
    ax.set_ylabel(y_label, fontsize=font_size)
    ax.set_title(title, fontsize=title_font_size)

def plot_keras_prediction(clf, x_data, y_data, x_label, y_label, ground_truth, title="", 
                          mesh=True, fixed=None, fname=None, size=(15, 8)):
    xx,yy = meshGrid(x_data, y_data)
    fig, ax = plt.subplots(figsize=size)

    if clf and mesh:
        grid_X = np.array(np.c_[yy.ravel(), xx.ravel()])
        if fixed:
            fill_values = np.full((len(grid_X), 1), fixed)
            grid_X = np.append(grid_X, fill_values, axis=1)
        Z = clf.predict(grid_X)
        Z = np.argmax(Z, axis=1)
        Z = Z.reshape(xx.shape)
        ax.pcolormesh(xx, yy, Z, cmap=cmap_light)
        
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.scatter(x_data, y_data, c=ground_truth, cmap=cmap_bold, s=100, marker='o', edgecolors='k')
        
    ax.set_xlabel(x_label, fontsize=font_size)
    ax.set_ylabel(y_label, fontsize=font_size)
    ax.set_title(title, fontsize=title_font_size)

def plot_history(history, samples=100, init_phase_samples=None, plot_line=False):
    epochs = history.params['epochs']
    
    acc = history.history['acc']
    val_acc = history.history['val_acc']
    loss = history.history['loss']
    val_loss = history.history['val_loss']

    # plot only every n-th epoch; guard against a step of 0 when epochs < samples
    every_sample = max(1, epochs // samples)
    acc = pd.DataFrame(acc).iloc[::every_sample, :]
    val_acc = pd.DataFrame(val_acc).iloc[::every_sample, :]
    loss = pd.DataFrame(loss).iloc[::every_sample, :]
    val_loss = pd.DataFrame(val_loss).iloc[::every_sample, :]

    if init_phase_samples:
        acc = acc.loc[init_phase_samples:]
        val_acc = val_acc.loc[init_phase_samples:]
        loss = loss.loc[init_phase_samples:]
        val_loss = val_loss.loc[init_phase_samples:]
    
    fig, ax = plt.subplots(nrows=2, figsize=(15, 8))

    ax[0].plot(acc, 'bo', label='Training acc')
    ax[0].plot(val_acc, 'b', label='Validation acc')
    ax[0].set_title('Training and validation accuracy')
    ax[0].legend()
    
    if plot_line:
        x, y, _ = linear_regression(acc)
        ax[0].plot(x, y, 'bo', color='red')
        x, y, _ = linear_regression(val_acc)
        ax[0].plot(x, y, 'b', color='red')
    
    ax[1].plot(loss, 'bo', label='Training loss')
    ax[1].plot(val_loss, 'b', label='Validation loss')
    ax[1].set_title('Training and validation loss')
    ax[1].legend()
    
    if plot_line:
        x, y, _ = linear_regression(loss)
        ax[1].plot(x, y, 'bo', color='red')
        x, y, _ = linear_regression(val_loss)
        ax[1].plot(x, y, 'b', color='red')
    
from sklearn import linear_model

def linear_regression(data):
    x = np.array(data.index).reshape(-1, 1)
    y = data.values.reshape(-1, 1)

    regr = linear_model.LinearRegression()
    regr.fit(x, y)
    y_pred = regr.predict(x)
    return x, y_pred, regr.coef_



In [0]:

    
plotPrediction(None, X[:, 1], X[:, 0], 
               'Age', 'Max Speed', y, mesh=False,
                title="All Data",
                fname='all.png')

Second important concept: To get an idea of how well the training worked, we set aside some data to test our model on previously unseen data.

The real objective is to have a generalized model that works well on the test data.
How well it performs on the test data compared to the training data tells us quite a bit as well.
Typical splits are 60% for training and 40% for testing, or 80/20.
It is important that we do not use the test data to tweak the hyperparameters of our learning strategy. Otherwise the test data would (indirectly) influence the training and could no longer tell us how well we did.
Evaluate the test data set only once, at the end of your experiment.
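
To make the idea of a stratified train/test split concrete, here is a minimal sketch in plain Python. It is not how scikit-learn implements `train_test_split`; the function name `stratified_split` and the toy labels are assumptions made for illustration. The point it shows: the split is done separately per class, so each class keeps roughly the same share in both parts.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed=42):
    """Return (train_idx, test_idx) so that each class keeps roughly its share."""
    rng = random.Random(seed)

    # collect the sample indices of each class separately
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# toy labels (hypothetical): 10 samples for each of the three groups
labels = [0] * 10 + [1] * 10 + [2] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.4)
print(len(train_idx), len(test_idx))
```

Because the shuffle happens within each class, no sample ends up in both parts and the class balance of the full data set carries over to the test set.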



In [0]:

    
from sklearn.model_selection import train_test_split



In [0]:

    
# using stratify we keep the share of each category the same in train and test data (important!)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)



In [0]:

    
X_train.shape, y_train.shape, X_test.shape, y_test.shape









    Out[0]:





((900, 3), (900,), (600, 3), (600,))



In [0]:

    
np.unique(y_train, return_counts=True)









    Out[0]:





(array([0, 1, 2]), array([301, 300, 299]))



In [0]:

    
np.unique(y_test, return_counts=True)









    Out[0]:





(array([0, 1, 2]), array([200, 200, 200]))



In [0]:

    
X_train_2_dim = X_train[:, :2]
X_test_2_dim = X_test[:, :2]



In [0]:

    
plotPrediction(None, X_train_2_dim[:, 1], X_train_2_dim[:, 0], 
               'Age', 'Max Speed', y_train, mesh=False,
                title="Train Data",
                fname='train.png')



In [0]:

    
plotPrediction(None, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test, mesh=False,
                title="Test Data",
                fname='test.png')

KNN - Most basic learning strategy: Look at the neighbors to make a prediction for a sample yet unknown

Interactive introduction to KNN: https://beta.observablehq.com/@djcordhose/how-to-build-a-teachable-machine-with-tensorflow-js#knndataset
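
Before using the scikit-learn classifier, it may help to see how little is behind the idea. The following is a minimal sketch of KNN in plain Python, not the library's implementation; the function `knn_predict` and the six (speed, age) points are made up for illustration. It measures the Euclidean distance from the query to every training point and lets the k nearest ones vote.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # distance from the query to every training point, nearest first
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# hypothetical (speed, age) samples with their group
points = [(150, 20), (145, 25), (110, 45), (105, 50), (125, 70), (120, 75)]
groups = [0, 0, 1, 1, 2, 2]

print(knn_predict(points, groups, query=(148, 22), k=1))
print(knn_predict(points, groups, query=(108, 47), k=3))
```

Note that there is no real training step: "fitting" only stores the data, and all the work happens at prediction time.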



In [0]:

    
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(1)



In [0]:

    
%time clf.fit(X_train_2_dim, y_train)









    



CPU times: user 681 µs, sys: 1.86 ms, total: 2.54 ms
Wall time: 6.43 ms






    Out[0]:





KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')



In [0]:

    
plotPrediction(clf, X_train_2_dim[:, 1], X_train_2_dim[:, 0], 
               'Age', 'Max Speed', y_train,
                title="Train Data, KNN, k=1",
                fname='knn1-train.png')



In [0]:

    
clf.score(X_train_2_dim, y_train)









    Out[0]:





0.9577777777777777



In [0]:

    
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, KNN, k=1",
                fname='knn1-test.png')



In [0]:

    
clf.score(X_test_2_dim, y_test)









    Out[0]:





0.5983333333333334



In [0]:

    
# http://scikit-learn.org/stable/modules/cross_validation.html
from sklearn.model_selection import cross_val_score



In [0]:

    
# cross_val_score?

Cross-validation splits the training data in different ways and performs a number of training runs (3 in this case; newer scikit-learn versions default to 5)
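
As a rough sketch of what happens behind the scenes: the training data is cut into folds, and each fold serves once as validation data while the remaining folds are used for training. The helper `k_fold_indices` below is an assumption written for illustration. It produces plain consecutive folds, whereas scikit-learn uses stratified folds for classifiers.

```python
def k_fold_indices(n_samples, n_folds=3):
    """Yield (training_indices, validation_indices) once per fold."""
    # distribute any remainder over the first folds
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

# 900 training samples, 3 folds: each run trains on 600 and validates on 300
for training, validation in k_fold_indices(900, 3):
    print(len(training), len(validation))
```

Each sample is used for validation exactly once, which is why the spread of the resulting scores gives a hint about how stable the model is.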



In [0]:

    
scores = cross_val_score(clf, X_train_2_dim, y_train, n_jobs=-1)
scores









    Out[0]:





array([0.59468439, 0.61666667, 0.62876254])



In [0]:

    
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))









    



Accuracy: 0.61 (+/- 0.03)

Exercise: Try to manually find a better k

execute the notebook up to this point
change k based on your previous experiments
what is the important value to check?

Third important concept: Our objective is to make the best prediction for unknown samples. This is called generalization. If we perform well on known data but worse on unknown data, this is called overfitting, and it is to be avoided. Measures taken to avoid overfitting are also known as regularization.

In KNN we reduce overfitting by taking more neighbors into account

We can search for the best number of neighbors manually, but grid search does the same thing with less manual effort. This one tries every number of neighbors from 1 to 49
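
Conceptually, a grid search is just an exhaustive loop over all parameter combinations that keeps the best-scoring one. The sketch below shows that loop in plain Python. The helpers `iterate_grid` and `grid_search` are illustrative names, and `toy_score` is a made-up stand-in for the cross-validated accuracy that a real search would compute.

```python
from itertools import product

param_grid = {
    'n_neighbors': list(range(1, 50)),
    'weights': ['uniform', 'distance'],
}

def iterate_grid(grid):
    """Yield every combination of parameter values as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

def grid_search(grid, score_fn):
    """Return the parameter combination with the highest score."""
    return max(iterate_grid(grid), key=score_fn)

def toy_score(params):
    # made-up score that peaks at 10 neighbors; a real search would
    # train and cross-validate a classifier here instead
    return -abs(params['n_neighbors'] - 10)

candidates = list(iterate_grid(param_grid))
print(len(candidates))  # 49 values for n_neighbors times 2 for weights
best = grid_search(param_grid, toy_score)
print(best)
```

The number of candidates is the product of the list lengths, so every additional parameter multiplies the cost of the search.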



In [0]:

    
# KNeighborsClassifier?



In [0]:

    
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_neighbors': list(range(1, 50)),
    'weights': ['uniform', 'distance'] # are points that are nearer more important?
    }
clf = GridSearchCV(KNeighborsClassifier(), param_grid, n_jobs=-1)
%time clf.fit(X_train_2_dim, y_train)
clf.best_params_









    



CPU times: user 164 ms, sys: 22 ms, total: 186 ms
Wall time: 1.41 s






    Out[0]:





{'n_neighbors': 39, 'weights': 'uniform'}



In [0]:

    
clf = KNeighborsClassifier(n_neighbors=39, weights='uniform')
%time clf.fit(X_train_2_dim, y_train)









    



CPU times: user 1 ms, sys: 1.58 ms, total: 2.58 ms
Wall time: 2.16 ms






    Out[0]:





KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=39, p=2,
           weights='uniform')



In [0]:

    
plotPrediction(clf, X_train_2_dim[:, 1], X_train_2_dim[:, 0], 
               'Age', 'Max Speed', y_train,
                title="Train Data, KNN, k=39")

A rule of thumb: Smoother decision boundaries imply less overfitting



In [0]:

    
clf.score(X_train_2_dim, y_train)









    Out[0]:





0.6955555555555556



In [0]:

    
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, KNN, k=39")



In [0]:

    
clf.score(X_test_2_dim, y_test)









    Out[0]:





0.7133333333333334



In [0]:

    
scores = cross_val_score(clf, X_train_2_dim, y_train, n_jobs=-1)
scores









    Out[0]:





array([0.6910299 , 0.70666667, 0.66889632])



In [0]:

    
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))









    



Accuracy: 0.69 (+/- 0.03)

Logistic Regression



In [0]:

    
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
%time clf.fit(X_train_2_dim, y_train)









    



CPU times: user 7.19 ms, sys: 490 µs, total: 7.68 ms
Wall time: 11.6 ms






    Out[0]:





LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)



In [0]:

    
plotPrediction(clf, X_train_2_dim[:, 1], X_train_2_dim[:, 0], 
               'Age', 'Max Speed', y_train,
                title="Train Data, Logistic Regression")



In [0]:

    
clf.score(X_train_2_dim, y_train)









    Out[0]:





0.3788888888888889



In [0]:

    
scores = cross_val_score(clf, X_train_2_dim, y_train, n_jobs=-1)
scores









    Out[0]:





array([0.34883721, 0.40333333, 0.39464883])



In [0]:

    
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))









    



Accuracy: 0.38 (+/- 0.05)



In [0]:

    
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, Logistic Regression")



In [0]:

    
clf.score(X_test_2_dim, y_test)









    Out[0]:





0.37666666666666665

Decision Trees

Another learning strategy, just like KNN
Splits our data set on a certain variable
Similar to what we have done in the manual classifier, but here the rules are actually learned



In [0]:

    
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
%time clf.fit(X_train_2_dim, y_train)









    



CPU times: user 1.89 ms, sys: 0 ns, total: 1.89 ms
Wall time: 1.9 ms






    Out[0]:





DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

How is the Decision Tree being Constructed?

We are using the CART algorithm:

top-down, split the set of examples into two new sets
at each step, choose the variable and the value that best split our customer examples
a node becomes terminal when no further gain is possible or regularization kicks in

http://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart

What is the best split?

assign a category to each node containing a certain set of samples
use a metric (Gini or Entropy) to decide how good a node would be based on that category
sum up weighted metric for both child nodes
optimize the split for that summed metric

https://machinelearningmastery.com/classification-and-regression-trees-for-machine-learning/
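
The steps above can be sketched for a single variable using the Gini impurity. This is a simplified illustration, not scikit-learn's CART implementation; the function names and the toy ages are assumptions. For every candidate threshold, the samples are divided into a left and a right child, the impurity of both children is weighted by their size and summed, and the threshold with the lowest sum wins.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, higher for mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Try each observed value as threshold, return (threshold, score) of the best split."""
    best = None
    for threshold in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        if not left or not right:
            continue  # not a real split
        score = weighted_gini(left, right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# toy data (hypothetical): young customers in group 0, older ones in group 1
ages = [20, 25, 30, 60, 65, 70]
groups = [0, 0, 0, 1, 1, 1]
print(best_threshold(ages, groups))
```

A full tree repeats this search over all variables and then recurses into both children until a stopping rule applies.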



In [0]:

    
# the depth of the tree: the maximum number of splits we perform until we decide where a data point belongs

clf.tree_.max_depth









    Out[0]:





18

Complete Decision Tree



In [0]:

    
plotPrediction(clf, X_train_2_dim[:, 1], X_train_2_dim[:, 0], 
               'Age', 'Max Speed', y_train,
                title="Train Data, Decision Tree",
                fname='dt-overfit-train.png')



In [0]:

    
clf.score(X_train_2_dim, y_train)









    Out[0]:





0.96



In [0]:

    
scores = cross_val_score(clf, X_train_2_dim, y_train, n_jobs=-1)
scores









    Out[0]:





array([0.57142857, 0.57666667, 0.62207358])



In [0]:

    
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))









    



Accuracy: 0.59 (+/- 0.05)



In [0]:

    
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, Decision Tree",
                fname='dt-overfit-test.png')



In [0]:

    
clf.score(X_test_2_dim, y_test)









    Out[0]:





0.5966666666666667

We overfit heavily and need to change the relevant parameters of our tree

its maximum number of splits (depth) - if there is no limit, we can make as many splits as it takes to perfectly match all train data (overfitting)
how many samples we need at least for a leaf - if it is just one, we could perfectly fit all training data (overfitting)
how many samples do we need to make another split - not as crucial as the other two, but can still limit overfitting



In [0]:

    
param_grid = {
    'max_depth': list(range(2, 25)),
    'min_samples_split': list(range(2, 11)),
    'min_samples_leaf': list(range(1, 11))
}
clf = GridSearchCV(DecisionTreeClassifier(), param_grid, n_jobs=-1)
%time clf.fit(X_train_2_dim, y_train)
clf.best_params_









    



CPU times: user 4.02 s, sys: 45 ms, total: 4.07 s
Wall time: 10.8 s






    Out[0]:





{'max_depth': 6, 'min_samples_leaf': 3, 'min_samples_split': 4}



In [0]:

    
clf = DecisionTreeClassifier(max_depth=6,
                              min_samples_leaf=3,
                              min_samples_split=2)
%time clf.fit(X_train_2_dim, y_train)









    



CPU times: user 2.32 ms, sys: 791 µs, total: 3.11 ms
Wall time: 2.13 ms






    Out[0]:





DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=3, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')



In [0]:

    
clf.tree_.max_depth









    Out[0]:





6

Reduced Decision Tree



In [0]:

    
plotPrediction(clf, X_train_2_dim[:, 1], X_train_2_dim[:, 0], 
               'Age', 'Max Speed', y_train,
                title="Train Data, Regularized Decision Tree",
                fname='dt-sweet-train.png')



In [0]:

    
clf.score(X_train_2_dim, y_train)









    Out[0]:





0.7444444444444445



In [0]:

    
scores = cross_val_score(clf, X_train_2_dim, y_train, n_jobs=-1)
scores









    Out[0]:





array([0.68438538, 0.68      , 0.67558528])



In [0]:

    
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))









    



Accuracy: 0.68 (+/- 0.01)



In [0]:

    
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, Regularized Decision Tree",
                fname='dt-sweet-test.png')



In [0]:

    
clf.score(X_test_2_dim, y_test)









    Out[0]:





0.6833333333333333

Random Forest

We fight overfitting in decision trees with some success
However, decision trees inherently tend to overfit
Random Forest is an ensemble technique that trains a number of simple decision trees and uses a majority vote over all of them for prediction
While each decision tree still overfits, using many of them softens this problem
You still need to regularize the underlying decision trees
In the sklearn version used here, a random forest has 10 decision trees by default (newer versions default to 100)
Random Forest is the Swiss Army knife of machine learning
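
The majority vote itself is simple, as the following sketch in plain Python shows. The function `majority_vote` and the three lists of per-tree predictions are made up for illustration. As far as the scikit-learn documentation describes it, its random forest averages the predicted class probabilities of the trees rather than counting hard votes, so treat this as the textbook idea and not as the library's exact behavior.

```python
from collections import Counter

def majority_vote(predictions_per_tree):
    """predictions_per_tree holds one list of predicted labels per tree."""
    # regroup so that all votes for the same sample sit together
    votes_per_sample = zip(*predictions_per_tree)
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]

# hypothetical predictions of three trees for four samples
tree_predictions = [
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [2, 1, 2, 0],
]
print(majority_vote(tree_predictions))
```

A single tree that is wrong about a sample gets outvoted as long as most of the other trees get it right.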



In [0]:

    
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_jobs=-1)
%time clf.fit(X_train_2_dim, y_train)









    



CPU times: user 27.9 ms, sys: 5.67 ms, total: 33.6 ms
Wall time: 112 ms






    Out[0]:





RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)



In [0]:

    
clf.score(X_train_2_dim, y_train)









    Out[0]:





0.9411111111111111



In [0]:

    
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, Random Forest",
                fname='rf-overfit-train.png')



In [0]:

    
clf.score(X_test_2_dim, y_test)









    Out[0]:





0.6316666666666667



In [0]:

    
# brute force grid search is far too expensive

param_grid = {
    'n_estimators': list(range(3,20)),
    'max_depth': list(range(2, 25)),
    'min_samples_split': list(range(2, 11)),
    'min_samples_leaf': list(range(1, 11))
}
clf = GridSearchCV(RandomForestClassifier(), param_grid, n_jobs=-1)
# %time clf.fit(X_train_2_dim, y_train)
# clf.best_params_

Unfortunately, training a random forest classifier is more expensive than training a single decision tree, roughly by a factor of the number of estimators it uses (10 by default here). This makes an exhaustive grid search over all parameters prohibitively expensive. Instead, we use a randomized search that tries 1000 randomly sampled parameter combinations, and we hope that one of them is close to the best.
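
A quick count shows why the full grid is out of reach. The sketch below multiplies the sizes of the four parameter ranges from the grid above and then draws random combinations instead. The helper `sample_candidates` is an illustrative stand-in for what a randomized search does, not the scikit-learn implementation.

```python
import random
from math import prod

param_ranges = {
    'n_estimators': range(3, 20),
    'max_depth': range(2, 25),
    'min_samples_split': range(2, 11),
    'min_samples_leaf': range(1, 11),
}

# a full grid search tries every combination: 17 * 23 * 9 * 10
grid_size = prod(len(values) for values in param_ranges.values())
print(grid_size)

def sample_candidates(ranges, n_iter, seed=0):
    """Draw n_iter random parameter combinations instead of trying all of them."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in ranges.items()}
        for _ in range(n_iter)
    ]

candidates = sample_candidates(param_ranges, n_iter=1000)
print(len(candidates), candidates[0])
```

With 1000 draws the randomized search covers only a small fraction of the grid, and every candidate is still trained once per cross-validation fold.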



In [0]:

    
# http://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_dist = {
    'n_estimators': randint(3,20),
    "max_depth": randint(2, 25),
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 11)
}

clf = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=1000, n_jobs=-1)
%time clf.fit(X_train_2_dim, y_train)
clf.best_params_









    



CPU times: user 3.29 s, sys: 67.5 ms, total: 3.35 s
Wall time: 50.3 s






    Out[0]:





{'max_depth': 5,
 'min_samples_leaf': 2,
 'min_samples_split': 8,
 'n_estimators': 18}



In [0]:

    
# parameters might vary a bit with each run, because it is a random search
clf = RandomForestClassifier(max_depth=5, min_samples_leaf=3, min_samples_split=7, n_estimators=12, n_jobs=-1)
%time clf.fit(X_train_2_dim, y_train)









    



CPU times: user 26.9 ms, sys: 10.2 ms, total: 37.1 ms
Wall time: 118 ms






    Out[0]:





RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=3, min_samples_split=7,
            min_weight_fraction_leaf=0.0, n_estimators=12, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)



In [0]:

    
clf.score(X_train_2_dim, y_train)









    Out[0]:





0.7311111111111112



In [0]:

    
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, Regularized Random Forest",
                fname='rf-sweet-test.png')



In [0]:

    
clf.score(X_test_2_dim, y_test)









    Out[0]:





0.6866666666666666

Exercise: Try to optimize on the parameters

we only have a good guess about the optimal parameters here
re-run the randomized search and try to get better results
you can also try the grid search with a limited search range

Support Vector Machines (SVM)

SVMs used to be the hot stuff before neural networks stole the show
SVMs choose a small number of data points to decide where to draw the decision boundary; these points are called the support vectors

Interactive Introduction: https://dash-gallery.plotly.host/dash-svm

In their basic version, SVMs can only use lines as decision boundaries. Let's see how well this works.



In [0]:

    
from sklearn.svm import SVC
clf = SVC(kernel='linear')
%time clf.fit(X_train_2_dim, y_train)









    



CPU times: user 403 ms, sys: 1.69 ms, total: 404 ms
Wall time: 407 ms






    Out[0]:





SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)



In [0]:

    
clf.score(X_train_2_dim, y_train)









    Out[0]:





0.4533333333333333



In [0]:

    
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, SVM linear Kernel",
                fname='svm-underfit-linear-train.png')

SVM Kernels

As we can see, using lines only we cannot even fit the training data; this is called underfitting.
For most realistic examples we need something better, called the 'kernel trick'
We transform the original problem space into another one in which the data is separable by lines only
Radial Basis Functions ('rbf') can approximate any function and are used to perform this transformation



In [0]:

    
from sklearn.svm import SVC
clf = SVC()
%time clf.fit(X_train_2_dim, y_train)









    



CPU times: user 61.5 ms, sys: 1.97 ms, total: 63.4 ms
Wall time: 68 ms






    Out[0]:





SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)



In [0]:

    
clf.score(X_train_2_dim, y_train)









    Out[0]:





0.9055555555555556



In [0]:

    
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, SVM rbf Kernel",
                fname='svm-overfit-test.png')



In [0]:

    
clf.score(X_test_2_dim, y_test)









    Out[0]:





0.64



In [0]:

    
# SVC?

Again we strongly overfit and need to regularize our model. The two important parameters are

Gamma
- reach of a single training sample - low values: far, high values: close
  - https://www.youtube.com/watch?v=m2a2K4lprQw
- the lower the value, the more points remote from the support vectors influence where the decision boundaries go
C (Cost)
- decides how expensive it is to misclassify one of our support vectors; the smaller the value, the more tolerant the model is of misclassified samples
- trade-off between smooth decision boundaries and classifying samples correctly
  - https://www.youtube.com/watch?v=joTa_FeMZ2s
  - high cost means rough, complex decision boundaries

Interactively experiment with C and Gamma: https://dash-gallery.plotly.host/dash-svm

http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
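
To get a feeling for what "reach" means, here is a small sketch of the RBF kernel, exp(-gamma * squared distance), in plain Python. The two (speed, age) samples are hypothetical and chosen to be 10 units apart. The kernel value can be read as a similarity: 1.0 for identical points, falling towards 0 as the points move apart, and falling faster the higher gamma is.

```python
import math

def rbf_kernel(a, b, gamma):
    """RBF kernel value for two points: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

# two hypothetical (speed, age) samples, 10 units apart
a, b = (120, 40), (130, 40)

for gamma in (0.001, 0.01, 0.5):
    print(gamma, rbf_kernel(a, b, gamma))
```

With a low gamma the two samples still count as similar, so a training sample influences predictions far away from it. With a high gamma the similarity drops to almost zero, so each sample only matters in its immediate surroundings, which allows very rough decision boundaries.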



In [0]:

    
param_grid = {
    'C': list(np.append(np.arange(0.1, 1.0, 0.1), np.arange(2, 10, 1))),
    'gamma': list(np.append(np.arange(0.001, 0.1, 0.0005), np.arange(.02, 1.0, 0.1))),
}
clf = GridSearchCV(SVC(), param_grid, n_jobs=-1)
%time clf.fit(X_train_2_dim, y_train)
clf.best_params_









    



CPU times: user 9.39 s, sys: 328 ms, total: 9.72 s
Wall time: 4min 58s






    Out[0]:





{'C': 9.0, 'gamma': 0.001}



In [0]:

    
clf = SVC(C=9, gamma=0.001)
%time clf.fit(X_train_2_dim, y_train)









    



CPU times: user 29.5 ms, sys: 1.31 ms, total: 30.8 ms
Wall time: 34.8 ms






    Out[0]:





SVC(C=9, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)



In [0]:

    
clf.score(X_train_2_dim, y_train)









    Out[0]:





0.7122222222222222



In [0]:

    
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, SVM Regularized rbf Kernel",
                fname='svm-reg-test.png')



In [0]:

    
clf.score(X_test_2_dim, y_test)









    Out[0]:





0.71

Exercise: Improve Cost and Gamma

use https://dash-gallery.plotly.host/dash-svm to get a better intuition for C and Gamma
Change C and Gamma based on your intuition
Can you do better than the results coming from the grid search?
If not: Why not?

Neural Networks using TensorFlow and Keras Layers

Important: if you run this on Colab, switch on the GPU option; otherwise this part will take quite some time



In [0]:

    
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR)
print(tf.__version__)



In [0]:

    
# let's see what compute devices we have available, hopefully a GPU
sess = tf.Session()
devices = sess.list_devices()
for d in devices:
    print(d.name)
hello = tf.constant('Hello TF!')
print(sess.run(hello))









    



/job:localhost/replica:0/task:0/device:CPU:0
/job:localhost/replica:0/task:0/device:GPU:0
b'Hello TF!'



In [0]:

    
from tensorflow import keras
print(keras.__version__)









    



2.1.6-tf

Neuron (aka node or unit)

A neuron takes a number of numerical inputs, multiplies each with a weight, sums up all weighted inputs and adds a bias (a constant) to that sum. From this it creates a single numerical output. For one input (one dimension) this would be a description of a line. For more dimensions this describes a hyperplane that can serve as a decision boundary. Typically, this output is transformed using an activation function which compresses the output to a value between 0 and 1 (sigmoid), or between -1 and 1 (tanh), or sets all negative values to zero (relu).
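
The description above translates almost word for word into code. The following sketch uses plain Python instead of TensorFlow, and the inputs, weights and bias are arbitrary example numbers, not values taken from a trained network.

```python
import math

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus bias, passed through an activation function."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))  # output between 0 and 1

def relu(z):
    return max(0.0, z)  # negative values become zero

# arbitrary example values; the weighted sum is 0.5*0.8 + (-1.0)*0.2 + 0.1
inputs, weights, bias = [0.5, -1.0], [0.8, 0.2], 0.1

for activation in (sigmoid, math.tanh, relu):
    print(activation.__name__, neuron(inputs, weights, bias, activation))
```

Training a network means nothing more than adjusting the weights and the bias of every neuron so that the final output matches the labels better.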

It is not really important to understand the details of a neural network. In practice, how you configure neurons to form something more powerful is much more important. This, however, is still a very experimental domain, so there really is no concise explanation or full understanding of how they work.

From Neuron to Layer

Neural Networks consist of artificial neurons you organize in layers
each neuron is very simple, but, theoretically, a single layer with enough of them can approximate any function
practically, we use 2 or 3 layers, as this has turned out to work well
the more neurons and the more layers you use the longer the network takes to train
neural networks often are no longer approachable using cross validation and grid search to find suitable hyper parameters

We start with two hidden layers each having 500 neurons



In [0]:

    
# tf.keras.layers.Dense?



In [0]:

    
from tensorflow.keras.layers import Dense

model = keras.Sequential()

model.add(Dense(units=500, name='hidden1', activation='tanh', input_dim=2))
model.add(Dense(units=500, name='hidden2', activation='tanh'))

model.add(Dense(3, name='softmax', activation='softmax'))

model.summary()









    



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
hidden1 (Dense)              (None, 500)               1500      
_________________________________________________________________
hidden2 (Dense)              (None, 500)               250500    
_________________________________________________________________
softmax (Dense)              (None, 3)                 1503      
=================================================================
Total params: 253,503
Trainable params: 253,503
Non-trainable params: 0
_________________________________________________________________
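The parameter counts in the summary can be reproduced by hand: a fully connected layer has one weight per input and unit, plus one bias per unit. The helper `dense_params` below is a hypothetical name introduced only for this illustration.

```python
def dense_params(n_inputs, n_units):
    # one weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

hidden1 = dense_params(2, 500)    # 2 input features -> 500 neurons
hidden2 = dense_params(500, 500)  # 500 -> 500 neurons
softmax = dense_params(500, 3)    # 500 -> 3 output categories

print(hidden1, hidden2, softmax, hidden1 + hidden2 + softmax)
# prints: 1500 250500 1503 253503
```

Almost all parameters sit in the connection between the two hidden layers, which is why reducing the layer width shrinks the model so quickly.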

If you do not train for too long, this network will not overfit by much even without any further regularization, but look at how strange the decision boundaries are

Validation Split

As test evaluation is a one-shot test, we again set aside part of the training data for continuous validation

We cannot use cross validation, as each training run takes far too much time
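According to the Keras documentation, `validation_split` holds out the last fraction of the samples passed to `fit`, taken before any shuffling. The NumPy sketch below only mimics that behaviour on placeholder data of the same size as our training set (900 samples); it is not the Keras implementation, and `split_for_validation` is a name made up for this example.

```python
import numpy as np

def split_for_validation(X, y, validation_split):
    # the last fraction of the samples becomes the validation set
    split_at = int(len(X) * (1 - validation_split))
    return (X[:split_at], y[:split_at]), (X[split_at:], y[split_at:])

# placeholder data with the same number of samples as our training set
X = np.arange(900 * 2).reshape(900, 2)
y = np.arange(900) % 3

(X_fit, y_fit), (X_val, y_val) = split_for_validation(X, y, 0.2)
print(len(X_fit), len(X_val))
```

Because the held-out part is simply the tail of the array, the training data should already be in random order before it is passed to `fit`.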



In [0]:

    
BATCH_SIZE = 1000
EPOCHS = 2000

# sparse_categorical_crossentropy works directly on the integer labels 0, 1, 2
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# both callbacks are prepared here, but they are not passed to fit() below,
# so this run always trains for the full number of epochs
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=100, verbose=1)
# checkpoint = tf.keras.callbacks.ModelCheckpoint('keras-model.epoch-{epoch:02d}-val_loss-{val_loss:.2f}.hdf5',
#                                                 verbose = 1, save_best_only=True)
checkpoint = tf.keras.callbacks.ModelCheckpoint('keras-model.hdf5', verbose=1, save_best_only=True)

# %time model.fit(X_train_2_dim, y_train_categorical, epochs=EPOCHS, batch_size=BATCH_SIZE, validation_split=0.2, callbacks=[checkpoint, early_stopping])
%time history = model.fit(X_train_2_dim, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE, validation_split=0.2, verbose=0)









    



CPU times: user 18.8 s, sys: 2.84 s, total: 21.7 s
Wall time: 16.1 s



In [0]:

    
train_loss, train_accuracy = model.evaluate(X_train_2_dim, y_train, batch_size=BATCH_SIZE)
train_accuracy









    



900/900 [==============================] - 0s 4us/step






    Out[0]:





0.7177777886390686



In [0]:

    
test_loss, test_accuracy = model.evaluate(X_test_2_dim, y_test, batch_size=BATCH_SIZE)
test_accuracy









    



600/600 [==============================] - 0s 12us/step






    Out[0]:





0.6983333230018616



In [0]:

    
plot_history(history)



In [0]:

    
plot_keras_prediction(model, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, NN")

Even though the scores do not look too bad, the decision boundaries tell us that this is not a good result.

We have several means of regularization

We use a combination of them:

reduce capacity of model
dropout
batch normalization
change activation to relu for faster training
reduce the number of training epochs

Dropout explained in a funny way

https://twitter.com/Smerity/status/980175898119778304
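For a less funny but more concrete picture, here is a minimal NumPy sketch of what dropout does to the outputs of a layer during training. It uses the common "inverted dropout" formulation, in which the surviving values are scaled up so that the expected sum stays the same; the function name and the test data are made up for this example. At prediction time dropout is switched off and all values pass through unchanged.

```python
import numpy as np

def dropout(activations, rate, rng):
    # keep each value with probability (1 - rate), set it to zero otherwise
    keep_mask = rng.random(activations.shape) >= rate
    # scale the survivors so the expected value of each output is unchanged
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(10000)  # placeholder layer outputs
dropped = dropout(activations, rate=0.6, rng=rng)

print('fraction set to zero:', (dropped == 0).mean())
print('mean after dropout:', dropped.mean())
```

With a rate of 0.6 roughly 60% of the values are zeroed in every training step, so the network cannot rely on any single neuron.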

An experimental approach:

keep adding regularization to make test and train scores come closer to each other
this will come at the cost of train scores going down
if both values start going down you have gone too far
each experiment takes some time
for larger datasets and more complex models some people start by overfitting on a subsample of the data (because it trains much faster)
- then you can be sure you have an architecture that at least has the capacity to solve the problem
- then keep adding regularizations
- eventually try using the complete data
if you want to use batch normalization, place it between the raw output of the neuron and the activation function



In [0]:

    
from tensorflow.keras.layers import Dense, Activation, BatchNormalization, Dropout

# https://stackoverflow.com/questions/34716454/where-do-i-call-the-batchnormalization-function-in-keras
# regularisation:
# - dropout
# - batch normalization
# - reduce capacity of model

dropout = 0.6
model = keras.Sequential()

model.add(Dense(100, name='hidden1', input_dim=2))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(dropout))

model.add(Dense(100, name='hidden2'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(dropout))

model.add(Dense(3, name='softmax', activation='softmax'))

model.summary()









    



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
hidden1 (Dense)              (None, 100)               300       
_________________________________________________________________
batch_normalization (BatchNo (None, 100)               400       
_________________________________________________________________
activation (Activation)      (None, 100)               0         
_________________________________________________________________
dropout (Dropout)            (None, 100)               0         
_________________________________________________________________
hidden2 (Dense)              (None, 100)               10100     
_________________________________________________________________
batch_normalization_1 (Batch (None, 100)               400       
_________________________________________________________________
activation_1 (Activation)    (None, 100)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
softmax (Dense)              (None, 3)                 303       
=================================================================
Total params: 11,503
Trainable params: 11,103
Non-trainable params: 400
_________________________________________________________________
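The 400 non-trainable parameters in this summary come from the two batch normalization layers. Each of them keeps four values per feature: a scale and an offset that are learned, and a moving mean and moving variance that are only tracked. The helper names below are hypothetical and only serve this calculation.

```python
def dense_params(n_inputs, n_units):
    # one weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

def batch_norm_params(n_features):
    trainable = 2 * n_features      # gamma (scale) and beta (offset)
    non_trainable = 2 * n_features  # moving mean and moving variance
    return trainable, non_trainable

bn_trainable, bn_non_trainable = batch_norm_params(100)

trainable = (dense_params(2, 100) + dense_params(100, 100)
             + dense_params(100, 3) + 2 * bn_trainable)
non_trainable = 2 * bn_non_trainable

print(trainable, non_trainable, trainable + non_trainable)
# prints: 11103 400 11503
```

Compared to the first network with 253,503 parameters, this model has less than a twentieth of the capacity.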



In [0]:

    
BATCH_SIZE=1000
EPOCHS = 3000

model.compile(loss='sparse_categorical_crossentropy',
             optimizer='adam',
             metrics=['accuracy'])

early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=500, verbose=1)

%time history = model.fit(X_train_2_dim, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE, validation_split=0.2, callbacks=[early_stopping], verbose=0)









    



Epoch 02354: early stopping
CPU times: user 36.6 s, sys: 4.64 s, total: 41.2 s
Wall time: 28.8 s
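Training stopped before the 3000 epochs were used up because the validation loss had not improved for 500 epochs in a row. The following is a simplified sketch of that patience logic in plain Python, not the actual Keras callback; the function name and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # new best validation loss: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # no improvement: stop once this has happened `patience` times in a row
            wait += 1
            if wait >= patience:
                return epoch
    return None

# made-up validation losses: the best value is reached at epoch 2
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
print(early_stopping_epoch(losses, patience=3))   # prints: 5
print(early_stopping_epoch(losses, patience=10))  # prints: None
```

Note that the model at the moment of stopping is not the best one seen; it has trained for `patience` more epochs since then, which is why saving the best model with a checkpoint can be worthwhile.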



In [0]:

    
# use the best model
# from keras.models import load_model

# model = load_model('keras-model.hdf5')



In [0]:

    
train_loss, train_accuracy = model.evaluate(X_train_2_dim, y_train, batch_size=BATCH_SIZE)
train_accuracy









    



900/900 [==============================] - 0s 10us/step






    Out[0]:





0.7144444584846497



In [0]:

    
test_loss, test_accuracy = model.evaluate(X_test_2_dim, y_test, batch_size=BATCH_SIZE)
test_accuracy









    



600/600 [==============================] - 0s 17us/step






    Out[0]:





0.7099999785423279



In [0]:

    
plot_history(history)



In [0]:

    
plot_history(history, init_phase_samples=250, plot_line=True)

Scores around 70% look good now. There might even be a bit more potential here, but we are not chasing the last percent.



In [0]:

    
plot_keras_prediction(model, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data, Dropout, BatchNormalization, Reduced Cap NN",
                fname='nn-reg-test.png')

It is surprising how smooth these decision boundaries are, and how similar they are to the ones created by the SVM

What you should be aware of when you go into PRODUCTION



In [0]:

    
def meshGrid(x_data, y_data, xlim=None, ylim=None):
    h = 1  # step size in the mesh
    # fall back to the range of the data if no explicit limits are given
    if xlim is None:
        xlim = x_data.min(), x_data.max()
    if ylim is None:
        ylim = y_data.min(), y_data.max()
        
    x_min, x_max = xlim
    y_min, y_max = ylim
    xx, yy = np.meshgrid(np.arange(x_min - 1, x_max + 1, h),
                         np.arange(y_min - 1, y_max + 1, h))
    return xx, yy, xlim, ylim
  
def ext_plot_keras_prediction(clf, x_data, y_data, x_label, y_label, ground_truth, title="", 
                              size=(15, 8),
                              xlim=(16, 90), ylim=(70, 170)):
    xx, yy, xlim, ylim = meshGrid(x_data, y_data, xlim, ylim)
    fig, ax = plt.subplots(figsize=size)

    if clf:
        # the model expects (speed, age), which is (y, x) in plot coordinates
        grid_X = np.array(np.c_[yy.ravel(), xx.ravel()])
        Z = clf.predict(grid_X)
        # pick the category with the highest score for each grid point
        Z = np.argmax(Z, axis=1)
        Z = Z.reshape(xx.shape)
        ax.pcolormesh(xx, yy, Z, cmap=cmap_light)
        
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
    ax.scatter(x_data, y_data, c=ground_truth, cmap=cmap_bold, s=100, marker='o', edgecolors='k')
        
    ax.set_xlabel(x_label, fontsize=font_size)
    ax.set_ylabel(y_label, fontsize=font_size)
    ax.set_title(title, fontsize=title_font_size)



In [0]:

    
ext_plot_keras_prediction(model, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                xlim=(6, 400), ylim=(10, 250),
                title="Extended Area - Simulating Production")

Be careful with predictions outside of the area covered by training data

First: Be aware of the reasonable range of values

Options

Add a warning
Reject requests in extended area
Scores in the extended area are often heavily biased towards one category (1.0 score)
Add likelihoods to scores
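The first two options can be sketched as a small guard in front of the model. The ranges below are the minimum and maximum of speed and age reported by `df.describe()` for the full data set; the function names and the stand-in prediction function are hypothetical and only there to keep the example self-contained.

```python
import numpy as np

# min/max of the full data set as reported by df.describe()
FEATURE_RANGES = {'speed': (68.0, 166.0), 'age': (16.0, 100.0)}

def out_of_range_features(speed, age):
    # names of all features that lie outside the range seen in the data
    values = {'speed': speed, 'age': age}
    return [name for name, (low, high) in FEATURE_RANGES.items()
            if not low <= values[name] <= high]

def guarded_predict(predict_fn, speed, age):
    # reject requests the model has never seen comparable data for
    problems = out_of_range_features(speed, age)
    if problems:
        raise ValueError('input outside of known range: ' + ', '.join(problems))
    return predict_fn(np.array([[speed, age]]))

# stand-in for model.predict so that this sketch runs on its own
def dummy_predict(batch):
    return np.full((len(batch), 3), 1.0 / 3.0)

print(guarded_predict(dummy_predict, speed=100, age=100))
print(out_of_range_features(speed=200, age=85))  # prints: ['speed']
```

In the notebook you would pass `model.predict` instead of `dummy_predict`. Whether to raise an error or merely log a warning depends on how costly a wrong prediction is for the application.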



In [0]:

    
# red makes sense here
age = 100
speed = 100
model.predict(np.array([[speed, age]]))









    Out[0]:





array([[0.8450663 , 0.00088674, 0.15404704]], dtype=float32)



In [0]:

    
# does not make any sense, but has 1.0 score for yellow
age = 200
speed = 200
model.predict(np.array([[speed, age]]))









    Out[0]:





array([[2.1591406e-08, 7.0600289e-29, 1.0000000e+00]], dtype=float32)



In [0]:

    
# there could be such a customer, but the prediction should be red, not yellow
age = 85
speed = 200
model.predict(np.array([[speed, age]]))









    Out[0]:





array([[3.3167265e-07, 3.2714050e-13, 9.9999964e-01]], dtype=float32)
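One plausible explanation for scores of practically 1.0 far away from the training data (an assumption, not verified against this particular model) is that the raw outputs feeding the softmax keep growing as the inputs grow, and softmax turns large differences between raw outputs into near-certainty. The raw output values in this sketch are made up.

```python
import numpy as np

def softmax(logits):
    # subtracting the maximum avoids overflow and does not change the result
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

# made-up raw outputs for the three categories
logits = np.array([0.5, -1.0, 2.0])

# the same raw outputs, scaled up as if the input moved further from the data
for scale in (1, 5, 20):
    print(scale, softmax(scale * logits))
```

The ranking of the categories stays the same, but the winning score moves towards 1.0. A score of 1.0 therefore says nothing about how trustworthy a prediction is outside the range covered by training data.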



In [0]:

	speed	age	miles	group
0	98.0	44.0	25.0	1
1	118.0	54.0	24.0	1
2	111.0	26.0	34.0	0
3	97.0	25.0	10.0	2
4	114.0	38.0	22.0	1
