Introduction to Machine Learning

Online Version: http://bit.ly/nordic-ml

The Basic Idea of Supervised Machine Learning

[Figure: Machine Learning]

Core hope: generalization to previously unseen data and situations

Common use case: classification

[Figure: Machine Learning Classification]



In [1]:

    
import warnings
warnings.filterwarnings('ignore')



In [2]:

    
%matplotlib inline
%pylab inline









    



Populating the interactive namespace from numpy and matplotlib



In [3]:

    
import matplotlib.pyplot as plt
import numpy as np



In [4]:

    
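# distutils was removed in Python 3.12; packaging (assumed to be installed) offers an equivalent comparison
from packaging.version import Version as StrictVersion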



In [5]:

    
import sklearn
print(sklearn.__version__)

assert StrictVersion(sklearn.__version__ ) >= StrictVersion('0.18.1')



In [6]:

    
# Azure may only provide pandas 0.19, but we need 0.20 for the plotting; if so, run the line below and restart the notebook
# !conda update pandas -y



In [7]:

    
import pandas as pd
print(pd.__version__)

assert StrictVersion(pd.__version__) >= StrictVersion('0.20.0')

The classic example: telling iris species apart by their flower measurements

First we load the Iris dataset and get a first impression

https://de.wikipedia.org/wiki/Portal:Statistik/Datensaetze#Iris
Sepal (Latin sepalum): the green part that encloses the colorful part: https://de.wikipedia.org/wiki/Kelchblatt
Petal (Latin petalum): the colorful part that we commonly perceive as the flower: https://de.wikipedia.org/wiki/Kronblatt



In [8]:

    
from sklearn.datasets import load_iris
iris = load_iris()



In [9]:

    
print(iris.DESCR)









    



Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...



In [10]:

    
X = iris.data
y = iris.target



In [11]:

    
X.shape, y.shape









    Out[11]:





((150, 4), (150,))



In [12]:

    
X[0]









    Out[12]:





array([ 5.1,  3.5,  1.4,  0.2])



In [13]:

    
y[0]









    Out[13]:





0



In [14]:

    
X_sepal_length = X[:, 0]
X_sepal_width =  X[:, 1]
X_petal_length = X[:, 2]
X_petal_width = X[:, 3]



In [15]:

    
X_petal_width.shape









    Out[15]:





(150,)

Only one species is linearly separable from the other two



In [16]:

    
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
CMAP = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
pd.plotting.scatter_matrix(iris_df, c=iris.target, edgecolor='black', figsize=(15, 15), cmap=CMAP)
plt.show()
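The separability claim can also be checked numerically. The following is a minimal sketch, assuming that petal length alone (column 2 of `iris.data`) is enough to isolate class 0 (setosa); the variable names are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]    # column 2 holds the petal length in cm
is_setosa = iris.target == 0      # class 0 is setosa

# Largest setosa petal versus smallest petal of the other two species
setosa_max = petal_length[is_setosa].max()
others_min = petal_length[~is_setosa].min()
print("setosa max:", setosa_max, "others min:", others_min)

# Any threshold between the two values separates setosa from the rest
threshold = (setosa_max + others_min) / 2
separable = np.all((petal_length < threshold) == is_setosa)
print("separable with a single threshold:", separable)
```

A single threshold on one feature is the simplest possible linear boundary. No comparable gap exists between the other two species, which is why they overlap in the scatter matrix.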

Problem: how do we know whether we have trained our system well?

Splitting the data into training (60%) and test (40%)

http://scikit-learn.org/stable/modules/cross_validation.html



In [17]:

    
from sklearn.model_selection import train_test_split



In [18]:

    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)



In [19]:

    
X_train.shape, y_train.shape, X_test.shape, y_test.shape









    Out[19]:





((90, 4), (90,), (60, 4), (60,))
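To see what `stratify=y` does, one can count how often each class appears in the two parts. This is a small sketch that repeats the split from above in a self-contained way; since the dataset has 50 samples per class, a stratified 60/40 split should put 30 samples of each class into training and 20 into test.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)

# np.bincount counts the occurrences of each class label 0, 1, 2
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train:", train_counts)
print("test: ", test_counts)
```

Without stratification the class proportions in the two parts could drift apart by chance, which would distort the scores computed later.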

We train a simple k-nearest-neighbors (KNN) classifier on 2 features and check the results

http://scikit-learn.org/stable/modules/neighbors.html#classification



In [20]:

    
from sklearn import neighbors



In [21]:

    
# ignore this, it is just technical code
# should come from a lib, consider it to appear magically 
# http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
font_size=25

def meshGrid(x_data, y_data):
    h = .02  # step size in the mesh
    x_min, x_max = x_data.min() - 1, x_data.max() + 1
    y_min, y_max = y_data.min() - 1, y_data.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return (xx,yy)
    
def plotPrediction(clf, x_data, y_data, x_label, y_label, colors, title="", mesh=True):
    xx,yy = meshGrid(x_data, y_data)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(20,10))
    if mesh:
        plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.scatter(x_data, y_data, c=colors, cmap=cmap_bold, s=80, marker='o')
    plt.xlabel(x_label, fontsize=font_size)
    plt.ylabel(y_label, fontsize=font_size)
    plt.title(title, fontsize=font_size)

First for the sepal features



In [22]:

    
X_train_sepal_only = X_train[:, :2]
X_test_sepal_only = X_test[:, :2]



In [23]:

    
X_train_sepal_only[0]









    Out[23]:





array([ 7.4,  2.8])



In [24]:

    
X_train[0]









    Out[24]:





array([ 7.4,  2.8,  6.1,  1.9])

Training is very fast because we only memorize each of the points



In [25]:

    
clf_sepal = neighbors.KNeighborsClassifier(1)
%time clf_sepal.fit(X_train_sepal_only, y_train)









    



CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 2.55 ms






    Out[25]:





KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
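Why "training" amounts to memorizing can be illustrated with a hand-written 1-nearest-neighbor classifier. The class `TinyOneNN` and the toy data below are hypothetical and only meant as a sketch; scikit-learn's `KNeighborsClassifier` is far more elaborate (for example, it can build search trees).

```python
import numpy as np

class TinyOneNN:
    """Hypothetical minimal 1-nearest-neighbor classifier for illustration."""

    def fit(self, X, y):
        # "Training" only stores the data; nothing is computed here
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Squared Euclidean distance from every query to every stored point
        dist = ((X[:, None, :] - self.X_[None, :, :]) ** 2).sum(axis=2)
        # Each query gets the label of its closest stored point
        return self.y_[dist.argmin(axis=1)]

X_toy = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
y_toy = [0, 0, 1]
model = TinyOneNN().fit(X_toy, y_toy)
pred = model.predict([[0.2, 0.1], [4.0, 4.5]])
print(pred)
```

All the work happens in `predict`, so the cost moves from training time to prediction time.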



In [26]:

    
plotPrediction(clf_sepal, X_train_sepal_only[:, 0], X_train_sepal_only[:, 1], 
               'Sepal length', 'Sepal width', y_train, mesh=False,
                title="Train Data for Sepal Features")

A quick test with a single, previously unseen sample



In [27]:

    
# sample 8 is hard because it lies right between classes 1 and 2
sample_id = 8
# sample_id = 50
sample_feature = X_test_sepal_only[sample_id]
sample_label = y_test[sample_id]



In [28]:

    
sample_feature









    Out[28]:





array([ 6.9,  3.1])



In [29]:

    
sample_label









    Out[29]:





2



In [30]:

    
clf_sepal.predict([sample_feature])









    Out[30]:





array([1])
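To find out why a particular sample gets its label, one can ask the classifier which training point it considers closest. The sketch below rebuilds the split and the sepal-only classifier so that it runs on its own, then uses the `kneighbors` method of `KNeighborsClassifier`; the exact sample at index 8 may differ between scikit-learn versions.

```python
from sklearn import neighbors
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)

clf_sepal = neighbors.KNeighborsClassifier(n_neighbors=1)
clf_sepal.fit(X_train[:, :2], y_train)

sample_feature = X_test[8, :2]
# kneighbors returns distances and row indices into the training data
distances, indices = clf_sepal.kneighbors([sample_feature])
nearest = indices[0, 0]

print("sample:", sample_feature, "true label:", y_test[8])
print("nearest training point:", X_train[nearest, :2],
      "label:", y_train[nearest], "distance:", distances[0, 0])

predicted = clf_sepal.predict([sample_feature])[0]
print("predicted label:", predicted)
```

With one neighbor, the prediction is simply the label of that single closest training point, so one unlucky neighbor is enough for a wrong answer.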

Generalization works in principle; we try it with a made-up value that does not occur in the dataset



In [31]:

    
clf_sepal.predict([[6.0, 4.5]]) # made-up point that is not in the dataset; predicted as class 0









    Out[31]:





array([0])

We now compute what fraction of the data is predicted correctly



In [32]:

    
# clf_sepal.score?



In [33]:

    
clf_sepal.score(X_train_sepal_only, y_train)









    Out[33]:





0.9555555555555556



In [34]:

    
clf_sepal.score(X_test_sepal_only, y_test)









    Out[34]:





0.80000000000000004
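For a classifier, `score` reports the mean accuracy, that is, the fraction of samples whose predicted label equals the true label. The following sketch recomputes that fraction by hand and compares it with `score`; it repeats the setup so that it runs standalone.

```python
import numpy as np
from sklearn import neighbors
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)

clf = neighbors.KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train[:, :2], y_train)

# Fraction of test samples where prediction and true label agree
manual_accuracy = np.mean(clf.predict(X_test[:, :2]) == y_test)
builtin_accuracy = clf.score(X_test[:, :2], y_test)
print(manual_accuracy, builtin_accuracy)
```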

The scores are fine for the training data but not so great for the test data; this mostly indicates overfitting (plus some underfitting)

To understand what that means and what happened, we draw the decision boundaries

For every possible data point, we shade the area with the model's prediction



In [35]:

    
plotPrediction(clf_sepal, X_train_sepal_only[:, 0], X_train_sepal_only[:, 1], 
               'Sepal length', 'Sepal width', y_train,
               title="Highly Fragmented Decision Boundaries for Train Data")



In [36]:

    
plotPrediction(clf_sepal, X_test_sepal_only[:, 0], X_test_sepal_only[:, 1],
               'Sepal length', 'Sepal width', y_test,
               title="Same Decision Boundaries don't work well for Test Data")

We make the model less complex and more general

Now with 10 neighbors



In [37]:

    
# neighbors.KNeighborsClassifier?



In [38]:

    
clf_sepal_10 = neighbors.KNeighborsClassifier(10)
clf_sepal_10.fit(X_train_sepal_only, y_train)









    Out[38]:





KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')



In [39]:

    
clf_sepal_10.score(X_train_sepal_only, y_train)









    Out[39]:





0.80000000000000004



In [40]:

    
clf_sepal_10.score(X_test_sepal_only, y_test)









    Out[40]:





0.76666666666666672

This is now clear underfitting: the model is too simple, even for the training data



In [41]:

    
plotPrediction(clf_sepal_10, X_train_sepal_only[:, 0], X_train_sepal_only[:, 1], 
               'Sepal length', 'Sepal width', y_train,
               title="Model too simple even for Train Data")

The relationship between overfitting and underfitting

Black dots: training data
Gray dots: test data
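The trade-off can be made tangible by sweeping the number of neighbors and comparing training and test accuracy for the sepal features. This is a sketch with an arbitrarily chosen list of `k` values; the exact numbers depend on the split and on the scikit-learn version.

```python
from sklearn import neighbors
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
X_train_sepal, X_test_sepal = X_train[:, :2], X_test[:, :2]

results = []
for k in (1, 3, 5, 10, 20, 40):   # illustrative values, not tuned
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train_sepal, y_train)
    results.append((k,
                    clf.score(X_train_sepal, y_train),
                    clf.score(X_test_sepal, y_test)))

for k, train_acc, test_acc in results:
    print(f"k={k:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

A large gap between training and test accuracy hints at overfitting (small `k`), while low accuracy on both hints at underfitting (large `k`).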

Feature selection

Spoiler alert: with the sepal features we will always either overfit or underfit

We try once more with the petal features



In [42]:

    
X_train_petal_only = X_train[:, 2:]
X_test_petal_only = X_test[:, 2:]



In [43]:

    
X_train_petal_only[0]









    Out[43]:





array([ 6.1,  1.9])



In [44]:

    
X_train[0]









    Out[44]:





array([ 7.4,  2.8,  6.1,  1.9])



In [45]:

    
clf_petal_10 = neighbors.KNeighborsClassifier(10)
clf_petal_10.fit(X_train_petal_only, y_train)









    Out[45]:





KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')



In [46]:

    
plotPrediction(clf_petal_10, X_train_petal_only[:, 0], X_train_petal_only[:, 1], 
               'Petal length', 'Petal width', y_train,
               title="Simple model looks good for Train Data")



In [47]:

    
plotPrediction(clf_petal_10, X_test_petal_only[:, 0], X_test_petal_only[:, 1], 
               'Petal length', 'Petal width', y_test,
               title="Simple model looks good even for Test Data")



In [48]:

    
clf_petal_10.score(X_train_petal_only, y_train)









    Out[48]:





0.96666666666666667



In [49]:

    
clf_petal_10.score(X_test_petal_only, y_test)









    Out[49]:





0.94999999999999996
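To compare feature subsets without relying on a single split, one option is cross-validation. Note that this swaps in a technique the notebook itself does not use: the sketch below runs `cross_val_score` with 5 folds (an arbitrary choice) on the sepal features, the petal features, and all four features.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Column indices of each feature subset in iris.data
feature_sets = {
    "sepal only": [0, 1],
    "petal only": [2, 3],
    "all four": [0, 1, 2, 3],
}

mean_scores = {}
for name, cols in feature_sets.items():
    clf = KNeighborsClassifier(n_neighbors=10)
    scores = cross_val_score(clf, X[:, cols], y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name:10s}  mean accuracy = {scores.mean():.3f}")
```

Averaging over several folds makes the comparison less sensitive to which samples happen to land in the test set.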

This already works surprisingly well, but what can we achieve with all 4 features?



In [50]:

    
clf = neighbors.KNeighborsClassifier(1)
clf.fit(X_train, y_train)









    Out[50]:





KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')



In [51]:

    
clf.score(X_train, y_train)









    Out[51]:





1.0



In [52]:

    
clf.score(X_test, y_test)









    Out[52]:





0.94999999999999996

With just one neighbor we fit the training data perfectly, but we overfit slightly

So let's try more neighbors



In [68]:

    
clf = neighbors.KNeighborsClassifier(13)
clf.fit(X_train, y_train)









    Out[68]:





KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=13, p=2,
           weights='uniform')



In [69]:

    
clf.score(X_train, y_train)









    Out[69]:





0.97777777777777775



In [70]:

    
clf.score(X_test, y_test)









    Out[70]:





0.96666666666666667

Wrap-up

With all 4 features and 13 neighbors we reach the sweet spot
Feature selection is important
Even with such a simple learning strategy there is still plenty to tune
For "real" problems, turnaround times are more like hours, days, or weeks, so experiments are much harder to run and need a great deal of compute power => GPU
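The manual search for a good number of neighbors can also be automated. The following sketch uses scikit-learn's `GridSearchCV` on the training data only; the search range of 1 to 30 and the 5 folds are assumptions made for illustration, and the selected value need not match the one found by hand above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)

# Try every n_neighbors from 1 to 30 with 5-fold cross-validation
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 31))},
    cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
print("best n_neighbors:", best_k)
print("cross-validated accuracy:", search.best_score_)
print("test accuracy:", search.score(X_test, y_test))
```

Keeping the test set out of the search means the final test accuracy remains an honest estimate of how the tuned model generalizes.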