Introduction to Machine Learning:

Examples of Unsupervised and Supervised Machine-Learning Algorithms

Version 0.1

Broadly speaking, machine-learning methods constitute a diverse collection of data-driven algorithms designed to classify, characterize, and analyze sources in multi-dimensional spaces. The topics and studies that fall under the umbrella of machine learning are growing, and there is no good catch-all definition. The number (and variation) of algorithms is vast, and beyond the scope of these exercises. While we will discuss a few specific algorithms today, more importantly, we will explore the scope of the two general methods, unsupervised learning and supervised learning, and introduce the powerful (and dangerous?) Python package scikit-learn.


By AA Miller


In [1]:
import numpy as np
import matplotlib.pyplot as plt

In [8]:
%matplotlib notebook

Problem 1) Introduction to scikit-learn

At the most basic level, scikit-learn makes machine learning extremely easy within Python. By way of example, here is a short piece of code that builds a complex, non-linear model to classify sources in the Iris data set that we learned about yesterday:

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
iris = datasets.load_iris()
RFclf = RandomForestClassifier().fit(iris.data, iris.target)

Those 4 lines of code have constructed a model that is superior to any system of hard cuts that we could have encoded by hand while looking at the multidimensional space. It is also fast: execute the dummy code in the cell below to see how "easy" machine learning is with scikit-learn.


In [2]:
# execute dummy code here

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
iris = datasets.load_iris()
RFclf = RandomForestClassifier().fit(iris.data, iris.target)


/Users/adamamiller/miniconda3/envs/emcee3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

Generally speaking, the procedure for scikit-learn is uniform across all machine-learning algorithms. Models are accessed via the various modules (ensemble, svm, neighbors, etc.), with user-defined tuning parameters. The features (or data) for the models are stored in a 2D array, X, with rows representing individual sources and columns representing the corresponding feature values. [In a minority of cases, X represents a similarity or distance matrix, where each entry is the distance to every other source in the data set.] In cases where there is a known classification or scalar value (typically supervised methods), this information is stored in a 1D array, y.

Unsupervised models are fit by calling .fit(X) and supervised models are fit by calling .fit(X, y). In both cases, predictions for new observations, Xnew, can be obtained by calling .predict(Xnew). Those are the basics and beyond that, the details are algorithm specific, but the documentation for essentially everything within scikit-learn is excellent, so read the docs.
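For example, here is a minimal sketch of this pattern using the RFclf model constructed above (the new source Xnew below is made up for illustration; it must have the same four feature columns, in the same order, as the iris data):

Xnew = np.array([[5.0, 3.5, 1.5, 0.2]])   # a single made-up source
print(RFclf.predict(Xnew))                # predicted class label for the new source
print(RFclf.predict_proba(Xnew))          # per-class probabilities (many classifiers provide this)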

To further develop our intuition, we will now explore the Iris dataset a little further.

Problem 1a What is the pythonic type of iris?


In [3]:
type(iris)


Out[3]:
sklearn.utils.Bunch

You likely haven't encountered a scikit-learn Bunch before. Its functionality is essentially the same as a dictionary.
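For example, a quick check that dictionary-style and attribute-style access give back the same thing:

np.all(iris['data'] == iris.data)   # True -- the two access patterns are equivalent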

Problem 1b What are the keys of iris?


In [4]:
iris.keys()


Out[4]:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

Most importantly, iris contains data and target values. These are all you need for scikit-learn, though the feature names, target names, and description are also useful.

Problem 1c What is the shape and content of the iris data?


In [5]:
print(np.shape(iris.data))
print(iris.data)


(150, 4)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 ...
 [6.5 3.  5.2 2. ]
 [6.2 3.4 5.4 2.3]
 [5.9 3.  5.1 1.8]]

Problem 1d What is the shape and content of the iris target?


In [6]:
print(np.shape(iris.target))
print(iris.target)


(150,)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

Finally, as a baseline for the exercises that follow, we will now make a simple 2D plot showing the separation of the 3 classes in the iris dataset. This plot will serve as a reference for examining the quality of the classifications that follow.

Problem 1e Make a scatter plot showing sepal length vs. sepal width for the iris data set. Color the points according to their respective classes.


In [11]:
print(iris.feature_names)  # shows that sepal length is first feature and sepal width is second feature

fig, ax = plt.subplots()
ax.scatter(iris.data[:,0], iris.data[:,1], c = iris.target, s = 30, edgecolor = "None", cmap = "viridis")
ax.set_xlabel('sepal length', fontsize=14)
ax.set_ylabel('sepal width', fontsize=14)
fig.tight_layout()


['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Problem 2) Supervised Machine Learning

Supervised machine learning aims to predict a target class or produce a regression result based on the location of labelled sources (i.e. the training set) in the multidimensional feature space. The "supervised" comes from the fact that we are specifying the allowed outputs from the model. As there are labels available for the training set, it is possible to estimate the accuracy of the model (though there are generally important caveats about generalization, which we will explore in further detail later).

The details and machinations of supervised learning will be explored further during the following break-out session. Here, we will simply introduce some of the basics as a point of comparison to unsupervised machine learning.

We will begin with a simple, but nevertheless, elegant algorithm for classification and regression: $k$-nearest-neighbors ($k$NN). In brief, the classification or regression output is determined by examining the $k$ nearest neighbors in the training set, where $k$ is a user defined number. Typically, though not always, distances between sources are Euclidean, and the final classification is assigned to whichever class has a plurality within the $k$ nearest neighbors (in the case of regression, the average of the $k$ neighbors is the output from the model). We will experiment with the steps necessary to optimize $k$, and other tuning parameters, in the detailed break-out problem.
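To make the procedure concrete, here is a minimal from-scratch sketch of $k$NN classification (brute-force Euclidean distances; the function knn_predict is our own illustration and is not part of scikit-learn, which can use far more efficient tree-based neighbor searches):

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new source to every training source
    dists = np.sqrt(np.sum((X_train - x_new)**2, axis=1))
    # class labels of the k nearest training sources
    nearest_labels = y_train[np.argsort(dists)[:k]]
    # plurality vote among the k neighbors
    return np.bincount(nearest_labels).argmax()

For example, knn_predict(iris.data, iris.target, iris.data[0]) returns 0 (setosa), since the nearest neighbors of the first iris source are all setosa.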

In scikit-learn the KNeighborsClassifier algorithm is implemented as part of the sklearn.neighbors module.

Problem 2a Fit two different $k$NN models to the iris data, one with 3 neighbors and one with 10 neighbors. Plot the resulting class predictions in the sepal length-sepal width plane (same plot as above). How do the results compare to the true classifications? Is there any reason to be suspicious of this procedure?

Hint - after you have constructed the model, it is possible to obtain model predictions using the .predict() method, which requires a feature array (with the same features, in the same order, as the training set) as input.

Hint that isn't essential, but is worth thinking about - should the features be re-scaled in any way?
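If you do decide to re-scale, one common choice is to standardize each feature to zero mean and unit variance; here is a minimal sketch using sklearn.preprocessing.StandardScaler (whether this is appropriate here is left for you to think about):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(iris.data)  # learn the per-feature mean and standard deviation
X_scaled = scaler.transform(iris.data)    # each column now has mean 0 and std 1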


In [14]:
from sklearn.neighbors import KNeighborsClassifier

# kNN model with 3 neighbors, trained (and evaluated!) on the full iris data set
KNNclf = KNeighborsClassifier(n_neighbors = 3).fit(iris.data, iris.target)
preds = KNNclf.predict(iris.data)
fig, ax = plt.subplots()
ax.scatter(iris.data[:,0], iris.data[:,1], 
           c = preds, cmap = "viridis", s = 30, edgecolor = "None")
ax.set_xlabel('sepal length', fontsize=14)
ax.set_ylabel('sepal width', fontsize=14)
fig.tight_layout()

# same procedure with 10 neighbors
KNNclf = KNeighborsClassifier(n_neighbors = 10).fit(iris.data, iris.target)
preds = KNNclf.predict(iris.data)
fig, ax = plt.subplots()
ax.scatter(iris.data[:,0], iris.data[:,1], 
           c = preds, cmap = "viridis", s = 30, edgecolor = "None")
ax.set_xlabel('sepal length', fontsize=14)
ax.set_ylabel('sepal width', fontsize=14)
fig.tight_layout()


These results are almost identical to the training classifications. However, we have cheated! In this case we are evaluating the accuracy of the model (98% in this case) using the same data that defines the model. Thus, what we have really evaluated here is the training error. The relevant quantity, however, is the generalization error: how accurate are the model predictions on new data?
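One simple way to probe the generalization error is to withhold a random portion of the data from training entirely; here is a minimal sketch using sklearn.model_selection.train_test_split (the 30% test fraction and random_state are arbitrary choices for illustration):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, 
                                                    test_size=0.3, random_state=23)
KNNclf = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
print(KNNclf.score(X_test, y_test))  # accuracy on sources the model never saw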

Without going into too much detail, we will test this using cross validation (CV), which will be explored in more detail later. In brief, CV trains the model on a subset of the data and predicts the classes of the sources that were withheld, repeating the procedure so that every source receives a prediction from a model that did not see it during training. Using cross_val_predict, we can get a better sense of the model accuracy. Predictions from cross_val_predict are produced in the following manner:

from sklearn.model_selection import cross_val_predict
CVpreds = cross_val_predict(sklearn.model(), X, y)

where sklearn.model() is the desired model, X is the feature array, and y is the label array.
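As an aside, if only the overall accuracy is needed (rather than the individual predictions), cross_val_score provides a shortcut; a brief sketch (the choice of cv=3 matches the number of folds used in the cell below):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), 
                         iris.data, iris.target, cv=3)
print(scores.mean())  # mean accuracy across the 3 folds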

Problem 2b Produce cross-validation predictions for the iris dataset and a $k$NN with 5 neighbors. Plot the resulting classifications, as above, and estimate the accuracy of the model as applied to new data. How does this accuracy compare to a $k$NN with 50 neighbors?


In [18]:
from sklearn.model_selection import cross_val_predict

CVpreds = cross_val_predict(KNeighborsClassifier(n_neighbors=5), 
                            iris.data, iris.target, cv=3)
fig, ax = plt.subplots()
ax.scatter(iris.data[:,0], iris.data[:,1], 
           c = CVpreds, s = 30, edgecolor = "None", cmap = "viridis")
ax.set_xlabel('sepal length', fontsize=14)
ax.set_ylabel('sepal width', fontsize=14)
fig.tight_layout()

print("The accuracy of the kNN = 5 model is ~{:.4}".format( sum(CVpreds == iris.target)/len(CVpreds) ))

CVpreds50 = cross_val_predict(KNeighborsClassifier(n_neighbors=50), 
                              iris.data, iris.target, cv=3)

print("The accuracy of the kNN = 50 model is ~{:.4}".format( sum(CVpreds50 == iris.target)/len(CVpreds50) ))


The accuracy of the kNN = 5 model is ~0.9867
The accuracy of the kNN = 50 model is ~0.8867

While it is useful to understand the overall accuracy of the model, it is even more useful to understand the nature of the misclassifications that occur.

Problem 2c Calculate the accuracy for each class in the iris set, as determined via CV for the $k$NN = 50 model.


In [19]:
for iris_type in range(3):
    iris_acc = sum( (CVpreds50 == iris_type) & (iris.target == iris_type)) / sum(iris.target == iris_type)

    print("The accuracy for class {:s} is ~{:.4f}".format(iris.target_names[iris_type], iris_acc))


The accuracy for class setosa is ~1.0000
The accuracy for class versicolor is ~0.9200
The accuracy for class virginica is ~0.7400

We just found that the classifier does a much better job classifying setosa and versicolor than it does virginica. The main reason for this is that some virginica flowers lie far outside the main virginica locus, and within predominantly versicolor "neighborhoods". In addition to knowing the accuracy for the individual classes, it is also useful to know the class predictions for the misclassified sources, or in other words, where the classifier is "confused". The best way to summarize this information is with a confusion matrix. In a confusion matrix, one axis shows the true class and the other shows the predicted class. For a perfect classifier all of the power will be along the diagonal, while confusion is represented by off-diagonal signal.

Like almost everything else we have encountered during this exercise, scikit-learn makes it easy to compute a confusion matrix. This can be accomplished with the following:

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Problem 2d Calculate the confusion matrix for the iris training set and the $k$NN = 50 model.


In [20]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(iris.target, CVpreds50)
print(cm)


[[50  0  0]
 [ 1 46  3]
 [ 0 13 37]]

From this representation, we see right away that most of the virginica that are being misclassified are being scattered into the versicolor class. However, this representation could still be improved in two ways: it would be helpful to normalize each value relative to the total number of sources in each class, and better still, a visual representation of the confusion matrix would make the results readily digestible. We will start by normalizing the confusion matrix.

Problem 2e Calculate the normalized confusion matrix. Be careful, you have to sum along one axis, and then divide along the other.

Anti-hint: This operation is actually straightforward using some array manipulation that we have not covered up to this point. Thus, we have performed the necessary operations for you below. If you have extra time, you should try to develop an alternate way to arrive at the same normalization.


In [21]:
normalized_cm = cm.astype('float')/cm.sum(axis = 1)[:,np.newaxis]

normalized_cm


Out[21]:
array([[1.  , 0.  , 0.  ],
       [0.02, 0.92, 0.06],
       [0.  , 0.26, 0.74]])
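As one alternate route to the same normalization (see the anti-hint above), numpy broadcasting with keepdims=True avoids the explicit np.newaxis. [Newer versions of scikit-learn also accept a normalize='true' keyword in confusion_matrix itself, though you should check your installed version before relying on it.]

normalized_cm_alt = cm.astype('float') / cm.sum(axis = 1, keepdims = True)  # divide each row by its total
np.allclose(normalized_cm, normalized_cm_alt)   # True -- identical result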

The normalization makes it easier to compare the classes, since each class has a different number of sources. Now we can proceed with a visual representation of the confusion matrix. This is best done using imshow() within pyplot. You will also need to plot a colorbar, and labeling the axes will be helpful.

Problem 2f Plot the confusion matrix. Be sure to label each of the axes.

Hint - you might find the sklearn confusion matrix tutorial helpful for making a nice plot.


In [22]:
plt.imshow(normalized_cm, interpolation = 'nearest', cmap = 'bone_r')

tick_marks = np.arange(len(iris.target_names))
plt.xticks(tick_marks, iris.target_names, rotation=45)
plt.yticks(tick_marks, iris.target_names)

plt.ylabel('True')
plt.xlabel('Predicted')
plt.colorbar()
plt.tight_layout()


Now it is straightforward to see that virginica and versicolor flowers are the most likely to be confused. We could have intuited this from the very first plot in this notebook, but this exercise becomes far more important for large data sets with many, many classes.