Core Ideas of Supervised Learning


In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
%matplotlib inline
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [3]:
import pandas as pd
print(pd.__version__)


0.22.0

In [4]:
import numpy as np
print(np.__version__)


1.14.3

First, load our data and get an intuition for it


In [8]:
# !curl -O https://raw.githubusercontent.com/DJCordhose/ai/master/notebooks/video/data/insurance-customers-1500.csv
!curl -O https://raw.githubusercontent.com/DJCordhose/ai/master/notebooks/video/data/insurance-customers-300.csv


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  5421  100  5421    0     0  49733      0 --:--:-- --:--:-- --:--:-- 49733

In [9]:
df = pd.read_csv('./insurance-customers-300.csv', sep=';')

In [10]:
df.head()


Out[10]:
   max speed   age  thousand km per year  group
0      167.0  41.0                   5.0      1
1      158.0  18.0                  15.0      1
2      147.0  28.0                  14.0      2
3      161.0  43.0                  62.0      2
4      179.0  36.0                  42.0      0

In [11]:
df.describe()


Out[11]:
        max speed         age  thousand km per year       group
count  300.000000  300.000000            300.000000  300.000000
mean   171.863333   44.006667             31.220000    1.000000
std     18.807545   16.191784             15.411792    0.817861
min    132.000000   18.000000              5.000000    0.000000
25%    159.000000   33.000000             18.000000    0.000000
50%    171.000000   42.000000             30.000000    1.000000
75%    187.000000   52.000000             43.000000    2.000000
max    211.000000   90.000000             99.000000    2.000000

In [12]:
import matplotlib.pyplot as plt
plt.xkcd()

import seaborn as sns
sns.set(style="ticks")

sample_df = df.sample(n=120, random_state=42)

colors_light = {0: '#FFAAAA', 1: '#AAFFAA', 2: '#FFFFDD'}
colors_bold = {0: '#AA4444', 1: '#006000', 2: '#EEEE44'}

sns.pairplot(sample_df, hue="group", palette=colors_bold)


Out[12]:
<seaborn.axisgrid.PairGrid at 0x270edd8e470>

First important concept: You train a machine on your data so that it learns the relationship between some input data and a certain label - this is called supervised learning
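

As a rough sketch of what this looks like in code (toy data only; X_toy and y_toy are made up purely for illustration, the real features are prepared from df in the following cells), scikit-learn boils supervised learning down to fit and predict:


In [ ]:
# illustrative sketch: learn from labeled toy data, then predict the label of an unseen sample
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_toy = np.array([[160.0, 25.0], [175.0, 40.0], [190.0, 55.0]])  # inputs: max speed, age
y_toy = np.array([1, 0, 2])                                      # labels: known group per customer

clf_toy = KNeighborsClassifier(n_neighbors=1)
clf_toy.fit(X_toy, y_toy)          # training: learn the relationship between inputs and labels
clf_toy.predict([[170.0, 30.0]])   # prediction for a previously unseen customer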


In [13]:
y=df['group']

In [14]:
df.drop('group', axis='columns', inplace=True)

In [15]:
X = df.values  # remaining feature columns as a numpy array

In [16]:
corrmat = df.corr()

In [17]:
matplotlib.rcdefaults();
sns.heatmap(corrmat, annot=True)


Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x270f2af99e8>

We restrict ourselves to two dimensions for now


In [18]:
# ignore this, it is just technical code to plot decision boundaries
# Adapted from:
# http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
# http://jponttuset.cat/xkcd-deep-learning/

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

plt.xkcd()

cmap_print = ListedColormap(['#AA8888', '#004000', '#FFFFDD'])
cmap_bold = ListedColormap(['#AA4444', '#006000', '#EEEE44'])
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#FFFFDD'])
font_size=25

def meshGrid(x_data, y_data):
    h = 1  # step size in the mesh
    x_min, x_max = x_data.min() - 1, x_data.max() + 1
    y_min, y_max = y_data.min() - 1, y_data.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return (xx,yy)
    
def plotPrediction(clf, x_data, y_data, x_label, y_label, ground_truth, title="",
                   mesh=True, fname=None, for_print=False):
    xx, yy = meshGrid(x_data, y_data)
    plt.figure(figsize=(20,10))

    if clf and mesh:
        # x_data holds age (second feature) and y_data max speed (first feature),
        # so the columns are swapped back into the order the classifier was trained on
        Z = clf.predict(np.c_[yy.ravel(), xx.ravel()])
        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    if for_print:
        # larger markers and print-friendly colors
        plt.scatter(x_data, y_data, c=ground_truth, cmap=cmap_print, s=200, marker='o', edgecolors='k')
    else:
        plt.scatter(x_data, y_data, c=ground_truth, cmap=cmap_bold, s=80, marker='o', edgecolors='k')
    plt.xlabel(x_label, fontsize=font_size)
    plt.ylabel(y_label, fontsize=font_size)
    plt.title(title, fontsize=font_size)
    if fname:
        plt.savefig(fname)

In [19]:
X_kmh_age = X[:, :2] 
plotPrediction(None, X_kmh_age[:, 1], X_kmh_age[:, 0], 
               'Age', 'Max Speed', y, mesh=False,
                title="All Data Max Speed vs Age")


Second important concept: To get an idea of how well the training worked, we hold back some data so we can try our model on previously unseen samples. How well it performs on this test data compared to the training data tells us quite a bit as well.


In [20]:
from sklearn.model_selection import train_test_split

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)

In [22]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape


Out[22]:
((180, 3), (180,), (120, 3), (120,))

In [23]:
X_train_2_dim = X_train[:, :2]
X_test_2_dim = X_test[:, :2]

In [24]:
plotPrediction(None, X_train_2_dim[:, 1], X_train_2_dim[:, 0], 
               'Age', 'Max Speed', y_train, mesh=False,
                title="Train Data Max Speed vs Age")



In [25]:
plotPrediction(None, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test, mesh=False,
                title="Test Data Max Speed vs Age")


Most basic learning strategy: Look at the neighbors of a yet unknown sample to make a prediction for it
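

Before using scikit-learn's implementation, here is a hand-rolled sketch of the idea (illustrative only, assuming Euclidean distance): the prediction for an unknown sample is simply the label of its closest training sample.


In [ ]:
# illustrative sketch of 1-nearest-neighbor: copy the label of the closest training sample
import numpy as np

def predict_1nn(X_known, y_known, sample):
    distances = np.linalg.norm(X_known - sample, axis=1)  # Euclidean distance to every known sample
    return np.asarray(y_known)[np.argmin(distances)]      # label of the nearest neighbor

# e.g. predict_1nn(X_train_2_dim, y_train, X_test_2_dim[0])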


In [26]:
from sklearn import neighbors
clf = neighbors.KNeighborsClassifier(1)

In [27]:
%time clf.fit(X_train_2_dim, y_train)


Wall time: 2 ms
Out[27]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

In [28]:
plotPrediction(clf, X_train_2_dim[:, 1], X_train_2_dim[:, 0], 
               'Age', 'Max Speed', y_train,
                title="Train Data Max Speed vs Age with Classification")



In [29]:
clf.score(X_train_2_dim, y_train)


Out[29]:
0.9777777777777777

In [30]:
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data Max Speed vs Age with Prediction")



In [31]:
clf.score(X_test_2_dim, y_test)


Out[31]:
0.65

Third important concept: Our objective is to make the best possible predictions for unknown samples; this is called generalization. If we perform well on known data but much worse on unknown data, this is called overfitting, and it is to be avoided.

In k-nearest neighbors (KNN), we reduce overfitting by taking more neighbors into account, as the comparison below sketches
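

A quick, hedged way to see this effect (this loop is not part of the original walkthrough, it just reuses the train/test split from above): compare train and test accuracy for a few values of k.


In [ ]:
# illustrative sketch: more neighbors usually means less overfitting
# (a smaller gap between train and test accuracy)
from sklearn import neighbors

for k in [1, 3, 5, 11, 21]:
    knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    train_acc = knn.fit(X_train_2_dim, y_train).score(X_train_2_dim, y_train)
    test_acc = knn.score(X_test_2_dim, y_test)
    print(k, round(train_acc, 2), round(test_acc, 2))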


In [ ]:
clf = neighbors.KNeighborsClassifier(5)
%time clf.fit(X_train_2_dim, y_train)

In [ ]:
plotPrediction(clf, X_train_2_dim[:, 1], X_train_2_dim[:, 0], 
               'Age', 'Max Speed', y_train,
                title="Train Data Max Speed vs Age with Classification")

A rule of thumb: Smoother decision boundaries imply less overfitting


In [ ]:
clf.score(X_train_2_dim, y_train)

In [ ]:
plotPrediction(clf, X_test_2_dim[:, 1], X_test_2_dim[:, 0], 
               'Age', 'Max Speed', y_test,
                title="Test Data Max Speed vs Age with Prediction")

In [ ]:
clf.score(X_test_2_dim, y_test)

Fourth step: Use the trained model on individual samples and evaluate it with a confusion matrix


In [ ]:
clf = neighbors.KNeighborsClassifier(5)
%time clf.fit(X_train, y_train)

In [ ]:
sample_X = X[:1]
sample_X

In [ ]:
y[:1]

In [ ]:
clf.predict(sample_X)

In [ ]:
clf.predict_proba(sample_X)

In [ ]:
from sklearn.metrics import confusion_matrix

y_pred = clf.predict(X)
y_true = np.array(y)
cm = confusion_matrix(y_true, y_pred)
cm

In [ ]:
# group colors: 0: red, 1: green, 2: yellow

import seaborn as sns
ax = sns.heatmap(cm, annot=True, cmap="YlGnBu")
ax.set_xlabel('Prediction', fontsize=20)
ax.set_ylabel('Ground Truth', fontsize=20)
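

As a follow-up sketch (not part of the original notebook), the same predictions can be summarized per class with scikit-learn's classification_report, which derives precision and recall from the counts in the confusion matrix:


In [ ]:
# illustrative follow-up: per-class precision and recall for the same predictions
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred, target_names=['red (0)', 'green (1)', 'yellow (2)']))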