Introduction to Machine Learning

How manual coding works

In contrast: the basic idea of Supervised Machine Learning

Hope: the system can generalize to previously unknown data and situations

Common Use Case: Classification
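
As a minimal sketch of this contrast (not part of the original notebook; names and values are made up for illustration): with manual coding we write the decision rule ourselves, while with supervised learning we only provide labeled examples and let the algorithm derive a rule it can also apply to unseen inputs.

# Manual coding: we pick the rule and the threshold ourselves
def classify_by_hand(petal_length_cm):
    return 'setosa' if petal_length_cm < 2.5 else 'other'

# Supervised learning: we provide examples (inputs + known outputs),
# the algorithm derives the rule and can answer for unseen inputs
from sklearn.neighbors import KNeighborsClassifier
examples = [[1.4], [1.5], [4.7], [5.1]]          # petal lengths in cm
labels = ['setosa', 'setosa', 'other', 'other']
clf_sketch = KNeighborsClassifier(n_neighbors=1).fit(examples, labels)
clf_sketch.predict([[1.6]])                      # -> ['setosa'], an input it has never seen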


In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
%matplotlib inline
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [3]:
import matplotlib.pylab as plt
import numpy as np

In [4]:
from distutils.version import StrictVersion

In [5]:
import sklearn
print(sklearn.__version__)

assert StrictVersion(sklearn.__version__ ) >= StrictVersion('0.18.1')


0.19.0

In [6]:
# Azure might only have pandas 0.19, but we need 0.20 for the plotting below; if so, run this and restart the notebook
# !conda update pandas -y

In [7]:
import pandas as pd
print(pd.__version__)

assert StrictVersion(pd.__version__) >= StrictVersion('0.20.0')


0.20.3

One of the Classics: Classify the Iris species by the sizes of their flowers

First we load the data set and get a first impression


In [8]:
from sklearn.datasets import load_iris
iris = load_iris()

In [9]:
print(iris.DESCR)


Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...


In [10]:
X = iris.data
y = iris.target

In [11]:
X.shape, y.shape


Out[11]:
((150, 4), (150,))

In [12]:
X[0]


Out[12]:
array([ 5.1,  3.5,  1.4,  0.2])

In [13]:
y[0]


Out[13]:
0

In [14]:
X_sepal_length = X[:, 0]
X_sepal_width =  X[:, 1]
X_petal_length = X[:, 2]
X_petal_width = X[:, 3]

In [15]:
X_petal_width.shape


Out[15]:
(150,)

Scatterplot (red=Setosa, green=Versicolour, blue=Virginica)


In [16]:
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
CMAP = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
pd.plotting.scatter_matrix(iris_df, c=iris.target, edgecolor='black', figsize=(15, 15), cmap=CMAP)
plt.show()


Now for training

Issue: How do we know if we have trained well?

Splitting the data into training (60%) and test (40%) sets

http://scikit-learn.org/stable/modules/cross_validation.html


In [1]:
from sklearn.model_selection import train_test_split

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)

In [19]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape


Out[19]:
((90, 4), (90,), (60, 4), (60,))

We use a KNN classifier with two randomly chosen features

http://scikit-learn.org/stable/modules/neighbors.html#classification
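
Conceptually, KNN stores all training points and classifies a new point by a majority vote among its k nearest stored points. A minimal sketch of that idea (knn_predict is a hypothetical helper for illustration only; scikit-learn's implementation is more sophisticated):

import numpy as np

def knn_predict(X_train, y_train, x_new, k=1):
    # Euclidean distance from the new point to every stored training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among their integer class labels
    return np.bincount(y_train[nearest]).argmax()

# e.g. knn_predict(X_train[:, :2], y_train, np.array([6.9, 3.1]), k=1)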


In [20]:
from sklearn import neighbors

In [21]:
# ignore this, it is just technical plotting code
# it would normally come from a library; consider it to appear magically
# http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
font_size=25

def meshGrid(x_data, y_data):
    h = .02  # step size in the mesh
    x_min, x_max = x_data.min() - 1, x_data.max() + 1
    y_min, y_max = y_data.min() - 1, y_data.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return (xx,yy)
    
def plotPrediction(clf, x_data, y_data, x_label, y_label, colors, title="", mesh=True):
    xx,yy = meshGrid(x_data, y_data)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(20,10))
    if mesh:
        plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.scatter(x_data, y_data, c=colors, cmap=cmap_bold, s=80, marker='o')
    plt.xlabel(x_label, fontsize=font_size)
    plt.ylabel(y_label, fontsize=font_size)
    plt.title(title, fontsize=font_size)

Random Feature Selection: Sepal Features


In [22]:
X_train_sepal_only = X_train[:, :2]
X_test_sepal_only = X_test[:, :2]

In [23]:
X_train_sepal_only[0]


Out[23]:
array([ 7.4,  2.8])

In [24]:
X_train[0]


Out[24]:
array([ 7.4,  2.8,  6.1,  1.9])

Training is very fast, as KNN just records all positions in 2D space; no abstraction takes place


In [25]:
clf_sepal = neighbors.KNeighborsClassifier(1)
%time clf_sepal.fit(X_train_sepal_only, y_train)


CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 1.56 ms
Out[25]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

In [26]:
plotPrediction(clf_sepal, X_train_sepal_only[:, 0], X_train_sepal_only[:, 1], 
               'Sepal length', 'Sepal width', y_train, mesh=False,
                title="Train Data for Sepal Features")


Getting some intuition: what does KNN predict for a previously unknown data point?


In [27]:
# sample 8 is tough because it lies directly between classes 1 and 2
sample_id = 8
# sample_id = 50
sample_feature = X_test_sepal_only[sample_id]
sample_label = y_test[sample_id]

In [28]:
sample_feature


Out[28]:
array([ 6.9,  3.1])

In [29]:
sample_label


Out[29]:
2

In [30]:
clf_sepal.predict([sample_feature])


Out[30]:
array([1])

And we also try a completely made-up data point


In [31]:
clf_sepal.predict([[6.0, 4.5]]) # a made-up point far from most training data; the model still predicts a class (0)


Out[31]:
array([0])

Evaluation: What percentage in each set is correctly predicted?
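
For a classifier, score is simply the accuracy, i.e. the fraction of predictions that match the true labels; it is roughly equivalent to this one-liner:

# accuracy computed by hand
(clf_sepal.predict(X_test_sepal_only) == y_test).mean()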


In [32]:
# clf_sepal.score?

In [33]:
clf_sepal.score(X_train_sepal_only, y_train)


Out[33]:
0.9555555555555556

In [34]:
clf_sepal.score(X_test_sepal_only, y_test)


Out[34]:
0.80000000000000004

Scores are OK for training, but not so good for test: overfitting

To understand what has happened, we draw decision boundaries

For each point in the feature space we shade the area with the prediction our model would give there


In [35]:
plotPrediction(clf_sepal, X_train_sepal_only[:, 0], X_train_sepal_only[:, 1], 
               'Sepal length', 'Sepal width', y_train,
               title="Highly Fragmented Decision Boundaries for Train Data")



In [36]:
plotPrediction(clf_sepal, X_test_sepal_only[:, 0], X_test_sepal_only[:, 1],
               'Sepal length', 'Sepal width', y_test,
               title="Same Decision Boundaries don't work well for Test Data")


We need to smooth our boundaries

k=10 neighbors


In [2]:
# neighbors.KNeighborsClassifier?

In [38]:
clf_sepal_10 = neighbors.KNeighborsClassifier(10)
clf_sepal_10.fit(X_train_sepal_only, y_train)


Out[38]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')

In [39]:
clf_sepal_10.score(X_train_sepal_only, y_train)


Out[39]:
0.80000000000000004

In [40]:
clf_sepal_10.score(X_test_sepal_only, y_test)


Out[40]:
0.76666666666666672

Scores for both sets are equally low: underfitting


In [41]:
plotPrediction(clf_sepal_10, X_train_sepal_only[:, 0], X_train_sepal_only[:, 1], 
               'Sepal length', 'Sepal width', y_train,
               title="Model too simple even for Train Data")



In [60]:
plotPrediction(clf_sepal_10, X_test_sepal_only[:, 0], X_test_sepal_only[:, 1], 
               'Sepal length', 'Sepal width', y_test,
               title="Model also too simple for Test Data")


Relationship between Overfitting and Underfitting

  • Black Dots: Training Data
  • Gray Dots: Test Data
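
A minimal sketch that produces a comparable curve for the sepal-only features from above (train vs. test accuracy over a range of k; assumes the variables defined in the cells above):

ks = list(range(1, 31))
train_scores, test_scores = [], []
for k in ks:
    clf_k = neighbors.KNeighborsClassifier(k)
    clf_k.fit(X_train_sepal_only, y_train)
    train_scores.append(clf_k.score(X_train_sepal_only, y_train))
    test_scores.append(clf_k.score(X_test_sepal_only, y_test))

plt.figure(figsize=(10, 5))
plt.plot(ks, train_scores, color='black', label='train accuracy')
plt.plot(ks, test_scores, color='gray', label='test accuracy')
plt.xlabel('k (number of neighbors)')
plt.ylabel('accuracy')
plt.legend()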

Feature Selection

Sepal Features seem to always either over- or underfit

Let us try Petal Features


In [43]:
X_train_petal_only = X_train[:, 2:]
X_test_petal_only = X_test[:, 2:]

In [44]:
X_train_petal_only[0]


Out[44]:
array([ 6.1,  1.9])

In [45]:
X_train[0]


Out[45]:
array([ 7.4,  2.8,  6.1,  1.9])

In [46]:
clf_petal_10 = neighbors.KNeighborsClassifier(10)
clf_petal_10.fit(X_train_petal_only, y_train)


Out[46]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')

In [47]:
plotPrediction(clf_petal_10, X_train_petal_only[:, 0], X_train_petal_only[:, 1], 
               'Petal length', 'Petal width', y_train,
               title="Simple model looks good for Train Data")



In [48]:
plotPrediction(clf_petal_10, X_test_petal_only[:, 0], X_test_petal_only[:, 1], 
               'Petal length', 'Petal width', y_test,
               title="Simple model looks good even for Test Data")



In [49]:
clf_petal_10.score(X_train_petal_only, y_train)


Out[49]:
0.96666666666666667

In [50]:
clf_petal_10.score(X_test_petal_only, y_test)


Out[50]:
0.94999999999999996

Petal features seem to be a better choice; which features we pick clearly matters

Using all 4 features should give the best results, though


In [51]:
clf = neighbors.KNeighborsClassifier(1)
clf.fit(X_train, y_train)


Out[51]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

In [52]:
clf.score(X_train, y_train)


Out[52]:
1.0

In [53]:
clf.score(X_test, y_test)


Out[53]:
0.94999999999999996

With one neighbor we overfit; let's try more neighbors


In [57]:
clf = neighbors.KNeighborsClassifier(13)
clf.fit(X_train, y_train)


Out[57]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=13, p=2,
           weights='uniform')

In [58]:
clf.score(X_train, y_train)


Out[58]:
0.97777777777777775

In [59]:
clf.score(X_test, y_test)


Out[59]:
0.96666666666666667

13 neighbors hits the sweet spot for our train/test split

This might be different for other splits; 150 samples is generally too small for reliable results
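
One way to reduce the dependence on a single split is cross-validation (see the scikit-learn link above): train and evaluate on several different splits and average the scores. A minimal sketch:

from sklearn.model_selection import cross_val_score

clf_cv = neighbors.KNeighborsClassifier(13)
# 5-fold cross-validation: 5 different train/test splits, 5 accuracy scores
scores = cross_val_score(clf_cv, X, y, cv=5)
scores.mean(), scores.std()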

Wrapup

  • With Supervised Learning we train relationships between inputs and outputs
  • generalization is the objective
  • overfitting (just learning by heart) and underfitting (not learning properly in the first place) need to be avoided
  • finding the sweet spot between the two is the hardest part
  • in a real-life use case we rarely have data as clean as in this example
  • getting and cleaning the data might be the most work-intensive part
  • feature selection can be crucial
  • feature selection and engineering might be the second most work-intensive part in classic machine learning
  • deep learning promises to make feature selection and sometimes even cleaning obsolete
  • 150 samples is generally not enough
  • however, the better the data, the less of it we need
  • tuning of parameters for the best result can be done by automatic search (GridSearch or RandomSearch, see the sketch after this list)
  • real problems with large feature spaces and more complex learning strategies can easily push training times up to hours, days or even weeks
  • using GPUs and even distributed computing might be necessary
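
A minimal sketch of such an automatic search with GridSearchCV (the parameter grid here is just an illustrative choice):

from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': list(range(1, 26))}
search = GridSearchCV(neighbors.KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
search.best_params_, search.best_score_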

Hands-On

Execute this notebook and improve the training results

  • look at the scatter plot and choose two other features that look promising to you
  • select K to give the best result
  • avoid overfitting and underfitting