01 - Introduction to Machine Learning

by Alejandro Correa Bahnsen and Jesus Solano

version 1.3, January 2019

Part of the class Practical Machine Learning

This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Special thanks go to Rick Muller, Sandia National Laboratories.

What is Machine Learning?

In this section we will begin to explore the basic principles of machine learning. Machine Learning is about building programs with tunable parameters (typically an array of floating point values) that are adjusted automatically so as to improve their behavior by adapting to previously seen data.
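To make the idea of "tunable parameters adjusted automatically" concrete, here is a minimal sketch that fits two floating point parameters (a slope and an intercept) by gradient descent on the mean squared error. The toy data, the learning rate and the number of iterations are assumptions chosen for illustration; scikit-learn estimators use their own, more elaborate, optimizers.

```python
import numpy as np

# Toy data generated from y = 2x + 1 plus a little noise (illustrative values)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=x.shape)

# The tunable parameters: an array of two floats [slope, intercept]
params = np.zeros(2)
learning_rate = 0.5  # assumed step size, small enough for this problem

for _ in range(2000):
    residual = params[0] * x + params[1] - y
    # Gradient of the mean squared error with respect to slope and intercept
    grad = np.array([2 * np.mean(residual * x), 2 * np.mean(residual)])
    params -= learning_rate * grad

print(params)  # close to the generating values [2, 1]
```

The program never sees the values 2 and 1 directly; it recovers them approximately by adapting its parameters to the data.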

Machine Learning can be considered a subfield of Artificial Intelligence, since these algorithms can be seen as building blocks that make computers behave more intelligently by generalizing rather than just storing and retrieving data items as a database system would.

We'll take a look at two very simple machine learning tasks here. The first is a classification task: the figure shows a collection of two-dimensional data, colored according to two different class labels.


In [1]:
# Import libraries
%matplotlib inline
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set();
cmap = mpl.colors.ListedColormap(sns.color_palette("hls", 3))

In [2]:
# Create a random set of examples
# (make_blobs is imported from sklearn.datasets; the samples_generator module was removed)
from sklearn.datasets import make_blobs
X, Y = make_blobs(n_samples=50, centers=2, random_state=23, cluster_std=2.90)

plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=cmap)
plt.show()


A classification algorithm may be used to draw a dividing boundary between the two clusters of points:


In [4]:
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss="hinge", alpha=0.01, max_iter=300, tol=0.001, fit_intercept=True)
clf.fit(X, Y)


Out[4]:
SGDClassifier(alpha=0.01, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=300,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=None, shuffle=True, tol=0.001,
       validation_fraction=0.1, verbose=0, warm_start=False)

In [5]:
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .05), np.arange(y_min, y_max, .05))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

In [6]:
plt.contour(xx, yy, Z)
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=cmap)
plt.show()


This may seem like a trivial task, but it is a simple version of a very important concept. By drawing this separating line, we have learned a model which can generalize to new data: if you were to drop another, unlabeled point onto the plane, this algorithm could now predict whether it's a blue or a red point.
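As a small illustration of predicting the class of a new point, the sketch below refits the same kind of classifier on the same blobs and asks for the label of one unlabeled point. The names X_demo, y_demo, model and new_point are hypothetical, and the location of the new point (the mean of the training data) and the random_state are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier

# Same kind of data as above: 50 two-dimensional points in two classes
X_demo, y_demo = make_blobs(n_samples=50, centers=2, random_state=23, cluster_std=2.90)

# random_state is fixed here only to make the sketch repeatable
model = SGDClassifier(loss="hinge", alpha=0.01, max_iter=300, tol=0.001, random_state=0)
model.fit(X_demo, y_demo)

# A hypothetical unlabeled point; predict() expects shape (n_samples, n_features)
new_point = X_demo.mean(axis=0, keepdims=True)  # shape (1, 2)
predicted = model.predict(new_point)
print(predicted)  # an array holding one class label, 0 or 1
```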

The next simple task we'll look at is a regression task: a simple best-fit line to a set of data:


In [7]:
a = 0.5
b = 1.0

# x from 0 to 30
x = 30 * np.random.random(20)

# y = a*x + b with noise
y = a * x + b + np.random.normal(size=x.shape)

plt.scatter(x, y)


Out[7]:
<matplotlib.collections.PathCollection at 0x7f1dac6a84a8>

In [8]:
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf.fit(x[:, None], y)


Out[8]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [9]:
# underscore at the end indicates a fit parameter
print(clf.coef_)
print(clf.intercept_)


[0.47339112]
1.3985254311206043

In [15]:
x_new = np.linspace(0, 30, 100)
y_new = clf.predict(x_new[:, None])
plt.scatter(x, y)
plt.plot(x_new, y_new, 'g-')


Out[15]:
[<matplotlib.lines.Line2D at 0x7f1dac2ba1d0>]

Again, this is an example of fitting a model to data, such that the model can make generalizations about new data. The model has been learned from the training data, and can be used to predict the result of test data: here, we might be given an x-value, and the model would allow us to predict the y value. Again, this might seem like a trivial problem, but it is a basic example of a type of operation that is fundamental to machine learning tasks.
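The prediction step for a single new x value can be sketched as follows. The data are regenerated with a fixed seed, and x_train, y_train, reg and the query value 12.0 are hypothetical names and values used only for illustration. For a linear model, the prediction is simply the learned slope times x plus the learned intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noisy samples from y = 0.5 * x + 1.0, as in the cells above
rng = np.random.default_rng(0)
x_train = 30 * rng.random(20)
y_train = 0.5 * x_train + 1.0 + rng.normal(size=x_train.shape)

reg = LinearRegression()
reg.fit(x_train[:, None], y_train)  # x must be 2D: (n_samples, 1)

# Predict y for one unseen x value
x_query = np.array([[12.0]])
y_query = reg.predict(x_query)

# The same number computed by hand from the fitted parameters
manual = reg.coef_[0] * 12.0 + reg.intercept_
print(y_query[0], manual)
```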

Representation of Data in Scikit-learn

Machine learning is about creating models from data: for that reason, we'll start by discussing how data can be represented in order to be understood by the computer. Along with this, we'll build on our matplotlib examples from the previous section and show some examples of how to visualize data.

Most machine learning algorithms implemented in scikit-learn expect data to be stored in a two-dimensional array or matrix. The arrays can be either numpy arrays or, in some cases, scipy.sparse matrices. The shape of the array is expected to be [n_samples, n_features]:

  • n_samples: The number of samples: each sample is an item to process (e.g. classify). A sample can be a document, a picture, a sound, a video, an astronomical object, a row in a database or CSV file, or whatever you can describe with a fixed set of quantitative traits.
  • n_features: The number of features or distinct traits that can be used to describe each item in a quantitative manner. Features are generally real-valued, but may be boolean or discrete-valued in some cases.

The number of features must be fixed in advance. However, it can be very high-dimensional (e.g. millions of features), with most of them being zeros for a given sample. This is a case where scipy.sparse matrices can be useful, in that they are much more memory-efficient than numpy arrays.
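A rough sketch of the memory argument: the array below has one row per sample and one column per feature, with only a handful of nonzero features per sample. The sizes and the names n_rows, n_cols, X_dense and X_sparse are assumptions made for the example; the byte counts compare the dense buffer with the three arrays that make up a CSR matrix.

```python
import numpy as np
from scipy import sparse

n_rows, n_cols = 1000, 2000  # hypothetical n_samples, n_features
rng = np.random.default_rng(0)

# Dense layout: shape (n_samples, n_features), mostly zeros
X_dense = np.zeros((n_rows, n_cols))
rows = np.repeat(np.arange(n_rows), 5)            # at most 5 nonzeros per sample
cols = rng.integers(0, n_cols, size=rows.size)
X_dense[rows, cols] = 1.0

# Same data in compressed sparse row format
X_sparse = sparse.csr_matrix(X_dense)

dense_bytes = X_dense.nbytes
sparse_bytes = X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes
print(X_sparse.shape, dense_bytes, sparse_bytes)
```

Both objects have the same shape and can be passed to estimators that accept sparse input; only the storage differs.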

A Simple Example: the Iris Dataset

As an example of a simple dataset, we're going to take a look at the iris data stored by scikit-learn. The data consists of measurements of three different species of irises, which we can picture here:


In [16]:
from IPython.display import Image, display
imp_path = 'https://raw.githubusercontent.com/jakevdp/sklearn_pycon2015/master/notebooks/images/'
display(Image(url=imp_path+'iris_setosa.jpg'))
print("Iris Setosa\n")

display(Image(url=imp_path+'iris_versicolor.jpg'))
print("Iris Versicolor\n")

display(Image(url=imp_path+'iris_virginica.jpg'))
print("Iris Virginica")

display(Image(url='https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/6160065e1e574a20edddc47116a0512d20656e26/notebooks/iris_with_length.png'))
print('Iris versicolor and the petal and sepal width and length')
print('From, Python Data Analytics, Apress, 2015.')


Iris Setosa

Iris Versicolor

Iris Virginica
Iris versicolor and the petal and sepal width and length
From, Python Data Analytics, Apress, 2015.

Quick Question:

If we want to design an algorithm to recognize iris species, what might the data be?

Remember: we need a 2D array of size [n_samples x n_features].

  • What would the n_samples refer to?

  • What might the n_features refer to?

Remember that there must be a fixed number of features for each sample, and feature number i must be a similar kind of quantity for each sample.

Loading the Iris Data with Scikit-Learn

Scikit-learn has a very straightforward set of data on these iris species. The data consist of the following:

  • Features in the Iris dataset:

    1. sepal length in cm
    2. sepal width in cm
    3. petal length in cm
    4. petal width in cm
  • Target classes to predict:

    1. Iris Setosa
    2. Iris Versicolor
    3. Iris Virginica

scikit-learn embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:


In [18]:
from sklearn.datasets import load_iris
iris = load_iris()
iris.keys()


Out[18]:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [19]:
n_samples, n_features = iris.data.shape
print((n_samples, n_features))
print(iris.data[0])


(150, 4)
[5.1 3.5 1.4 0.2]

In [20]:
print(iris.data.shape)
print(iris.target.shape)


(150, 4)
(150,)

In [21]:
print(iris.target)
print(iris.target_names)


[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
['setosa' 'versicolor' 'virginica']

This data is four dimensional, but we can visualize two of the dimensions at a time using a simple scatter-plot:


In [22]:
import pandas as pd  # Pandas is the topic of the next session
data_temp = pd.DataFrame(iris.data, columns=iris.feature_names)
data_temp['target'] = iris.target
data_temp['target'] = data_temp['target'].astype('category')
# Replace the integer category labels with the species names
data_temp['target'] = data_temp['target'].cat.rename_categories(iris.target_names)
sns.pairplot(data_temp, hue='target', palette=sns.color_palette("hls", 3))


Out[22]:
<seaborn.axisgrid.PairGrid at 0x7f1de4c965f8>

Dimensionality Reduction: PCA

Principal Component Analysis (PCA) is a dimension reduction technique that can find the combinations of variables that explain the most variance.
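To make "combinations of variables that explain the most variance" concrete, the sketch below fits PCA on the iris measurements and inspects two fitted attributes: explained_variance_ratio_, the fraction of total variance captured by each component, and components_, the weights of each original feature in each combination. The names X_iris and pca_demo are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris = load_iris().data  # shape (150, 4)

pca_demo = PCA(n_components=3).fit(X_iris)

# Fraction of the total variance explained by each component, in decreasing order
ratios = pca_demo.explained_variance_ratio_
print(ratios)

# Each row is one component: a weighted combination of the 4 original features
print(pca_demo.components_.shape)  # (3, 4)
```

For this dataset the first component alone accounts for most of the variance, which is why a plot of the first two components already separates the species reasonably well.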

Consider the iris dataset. It cannot be visualized in a single 2D plot, as it has 4 features. We are going to extract 3 combinations of sepal and petal dimensions and plot the first two to visualize it:


In [23]:
X, y = iris.data, iris.target
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(X)
X_reduced = pca.transform(X)
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap=cmap)


Out[23]:
<matplotlib.collections.PathCollection at 0x7f1daa21d470>

In [24]:
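# For comparison: Isomap, a nonlinear dimensionality reduction method.
# Separate names are used so the PCA result in X_reduced is not overwritten.
X, y = iris.data, iris.target
from sklearn.manifold import Isomap
iso = Isomap(n_components=3)
iso.fit(X)
X_iso = iso.transform(X)
plt.scatter(X_iso[:, 0], X_iso[:, 1], c=y, cmap=cmap)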


Out[24]:
<matplotlib.collections.PathCollection at 0x7f1da8c367b8>

In [25]:
X_reduced.shape


Out[25]:
(150, 3)

In [26]:
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap=cmap)


Out[26]:
<matplotlib.collections.PathCollection at 0x7f1da8b963c8>

In [27]:
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.set_title('Iris Dataset by PCA', size=14)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=y, cmap=cmap)
ax.set_xlabel('First eigenvector')
ax.set_ylabel('Second eigenvector')
ax.set_zlabel('Third eigenvector')
# Hide the tick labels on all three axes
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
plt.show()


Clustering: K-means

Clustering groups together observations that are homogeneous with respect to a given criterion, finding "clusters" in the data.

Note that these clusters will uncover relevant hidden structure of the data only if the criterion used highlights it.


In [28]:
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0) # Fixing the RNG in kmeans
k_means.fit(X)
y_pred = k_means.predict(X)

plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_pred, cmap=cmap);


Let's then evaluate the performance of the clustering against the ground truth. Cluster indices are arbitrary, so the confusion matrix need not be diagonal:


In [29]:
from sklearn.metrics import confusion_matrix

# Compute confusion matrix
cm = confusion_matrix(y, y_pred)
np.set_printoptions(precision=2)
print(cm)


[[ 0 50  0]
 [48  0  2]
 [14  0 36]]
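Because k-means numbers its clusters arbitrarily, a raw confusion matrix can look poor even when the grouping is good. The sketch below shows two common ways to account for this, under the assumption that a one-to-one matching between clusters and species is wanted: the adjusted Rand index, which ignores label names altogether, and a relabeling of the clusters with the Hungarian algorithm (scipy.optimize.linear_sum_assignment). The names iris_demo, labels, cm_demo and cluster_to_class are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, confusion_matrix

iris_demo = load_iris()
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris_demo.data)

# Agreement score that is invariant to how the clusters are numbered
ari = adjusted_rand_score(iris_demo.target, labels)

# Rows are true classes, columns are cluster indices
cm_demo = confusion_matrix(iris_demo.target, labels)

# Pick the class-to-cluster matching with the largest number of agreements
true_idx, cluster_idx = linear_sum_assignment(-cm_demo)
cluster_to_class = np.empty(3, dtype=int)
cluster_to_class[cluster_idx] = true_idx
relabeled = cluster_to_class[labels]

cm_relabeled = confusion_matrix(iris_demo.target, relabeled)
print(ari)
print(cm_relabeled)
```

After relabeling, the agreements sit on the diagonal, so the matrix can be read like the confusion matrix of a classifier.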

In [30]:
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(iris.target_names))
    plt.xticks(tick_marks, iris.target_names, rotation=45)
    plt.yticks(tick_marks, iris.target_names)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [31]:
plt.figure()
plot_confusion_matrix(cm)


Classification: Logistic Regression


In [48]:
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

errors = []
for i in range(1000):
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=i)

    clf = LogisticRegression(multi_class='multinomial',solver='lbfgs',max_iter=1000)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # Error rate: fraction of misclassified samples in the test set
    err = 1 - (y_pred == y_test).mean()
    errors.append(err)

plt.plot(list(range(1000)), errors)

errors = np.array(errors)
print('max=%.3f min=%.3f mean=%.3f std=%.3f' % (errors.max(), errors.min(), errors.mean(), errors.std()))


max=0.117 min=0.000 mean=0.039 std=0.022

In [50]:
from sklearn.ensemble import RandomForestClassifier

errors = []
for i in range(1000):
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=i)

    clf = RandomForestClassifier(n_estimators=10)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # Error rate: fraction of misclassified samples in the test set
    err = 1 - (y_pred == y_test).mean()
    errors.append(err)
plt.plot(list(range(1000)), errors)

errors = np.array(errors)
print('max=%.3f min=%.3f mean=%.3f std=%.3f' % (errors.max(), errors.min(), errors.mean(), errors.std()))


max=0.150 min=0.000 mean=0.052 std=0.023
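The repeated random train/test evaluation can also be written more compactly with scikit-learn's model selection helpers. The sketch below is one possible formulation, not the procedure used in the cells above: ShuffleSplit draws the random splits and cross_val_score returns the test accuracy for each one. The number of splits (20) and the names splitter, scores and errors_demo are choices made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

iris_demo = load_iris()

# 20 random splits, each holding out 40% of the samples for testing
splitter = ShuffleSplit(n_splits=20, test_size=0.4, random_state=0)

# For a classifier, the default score is the accuracy on each test split
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         iris_demo.data, iris_demo.target, cv=splitter)

errors_demo = 1 - scores  # error rate per split
print(errors_demo.mean(), errors_demo.std())
```

Swapping in another estimator, such as RandomForestClassifier, only requires changing the first argument of cross_val_score.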

Recap: Scikit-learn's estimator interface

Scikit-learn strives to have a uniform interface across all methods, and we have seen examples of these above. Given a scikit-learn estimator object named model, the following methods are available:

  • Available in all Estimators
    • model.fit() : fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).
  • Available in supervised estimators
    • model.predict() : given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.
    • model.predict_proba() : For classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().
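    • model.score() : for classification or regression problems, most estimators implement a score method, with a larger score indicating a better fit. For classifiers the default score is the mean accuracy, which lies between 0 and 1. For regressors it is the coefficient of determination R², which is at most 1 and can be negative for a model that fits poorly.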
  • Available in unsupervised estimators
    • model.predict() : predict labels in clustering algorithms.
    • model.transform() : given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.
    • model.fit_transform() : some estimators implement this method, which more efficiently performs a fit and a transform on the same input data.
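The methods listed above can be exercised in a few lines. This is a minimal sketch on the iris data, with LogisticRegression standing in for a supervised estimator and PCA for an unsupervised one; the variable names are hypothetical and the model is scored on its own training data only to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X_all, y_all = load_iris(return_X_y=True)

# Supervised estimator: fit(X, y), then predict / predict_proba / score
clf_demo = LogisticRegression(max_iter=1000)
clf_demo.fit(X_all, y_all)
pred = clf_demo.predict(X_all[:5])          # one label per sample
proba = clf_demo.predict_proba(X_all[:5])   # one row of class probabilities per sample
accuracy = clf_demo.score(X_all, y_all)     # mean accuracy on the given data

# Unsupervised estimator: fit(X), then transform
pca_demo = PCA(n_components=2)
pca_demo.fit(X_all)
X_2d = pca_demo.transform(X_all)

# fit_transform performs both steps in one call
X_2d_once = PCA(n_components=2).fit_transform(X_all)

print(pred, proba.shape, accuracy, X_2d.shape, X_2d_once.shape)
```

Here the label returned by predict() for each sample is the class whose column in predict_proba() is largest.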

Flow Chart: How to Choose your Estimator

This is a flow chart created by scikit-learn super-contributor Andreas Mueller which gives a nice summary of which algorithms to choose in various situations. Keep it around as a handy reference!


In [29]:
from IPython.display import Image
Image(url="http://scikit-learn.org/dev/_static/ml_map.png")


Out[29]:

Original source on the scikit-learn website