G. Richards, 2016 (Drawn in large part from https://github.com/jakevdp/ESAC-stats-2014/blob/master/notebooks/03.1-Scikit-Learn-Intro.ipynb)
Today we will cover the basics of Scikit-Learn, a popular package containing a collection of tools for machine learning written in Python.
For Scikit-Learn's own introduction, see http://scikit-learn.org/stable/tutorial/basic/tutorial.html#.
In general, we are interested in taking a set of $n$ samples of data and then predicting properties of unknown data. In the simplest case the data will be one dimensional, but generally we will be dealing with multi-dimensional data.
As we saw in the first lecture, we can break machine learning into a few distinct categories. Common to all of these is the concept of training and test sets, which are just what they sound like: the training data are used to build a model, which is then used to make predictions about the test data.
The main categories of machine learning are supervised learning (classification and regression) and unsupervised learning (clustering, dimensionality reduction, and density estimation); we will see examples of each below.
To learn how to use Scikit-Learn for machine learning/data mining, we'll start by taking a look at two very simple machine learning tasks.
The first is a classification task: the figure shows a collection of two-dimensional data, colored according to two different class labels. We wish to determine a classification algorithm that can be used to draw a dividing boundary between the two clusters of points:
In [ ]:
# start the inline backend for plotting
%matplotlib inline
%run code/sklearn_ex1.py
This may seem like a trivial task, but it is a simple version of a very important concept. By drawing this separating line, we have "learned" a model which can generalize to new data: if you were to drop another point onto the plane which is unlabeled, this algorithm could now predict whether it's a blue or a red point.
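As a concrete sketch of this idea (this is not the contents of code/sklearn_ex1.py, just an illustration on synthetic clusters; the choice of classifier here is an assumption):
In [ ]:
# A minimal sketch: learn a dividing boundary between two labeled 2-D clusters
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X_demo, y_demo = make_blobs(n_samples=100, centers=2, random_state=0)  # two labeled clusters
clf_demo = SVC(kernel='linear')   # a linear classifier learns a separating line
clf_demo.fit(X_demo, y_demo)
print(clf_demo.predict([[0.0, 5.0]]))  # predict the class of a brand-new, unlabeled point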
The next simple task we'll look at is a regression task: a simple best-fit line to a set of data:
In [ ]:
%run code/sklearn_ex2.py
Again, this is an example of fitting a model to data, such that the model can make generalizations about new data. The model has been learned from the training data and can be used to predict the result of test data: we might be given an $x$-value, and the model would allow us to predict the $y$ value. Again, this might seem like a trivial problem, but it is a basic example of a type of operation that is fundamental to machine learning tasks.
In order to use Scikit-Learn, we need to understand the input and output of its algorithms.
Most machine learning algorithms implemented in Scikit-Learn expect data to be stored in a two-dimensional array or matrix. The arrays can be either numpy arrays or, in some cases, scipy.sparse matrices. The size of the array is expected to be [n_samples, n_features].
The number of features must be fixed in advance. However, it can be very high dimensional (e.g., millions of features) with most of them being zeros for a given sample. This is a case where scipy.sparse matrices can be useful, in that they are much more memory-efficient than numpy arrays.
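For example, a toy feature matrix with 3 samples and 4 features looks like this (a minimal sketch; the sparse version simply re-stores the same values and assumes scipy is installed):
In [ ]:
import numpy as np
from scipy import sparse

X_dense = np.random.random((3, 4))     # shape is [n_samples, n_features] = [3, 4]
X_sparse = sparse.csr_matrix(X_dense)  # the same data in a sparse (CSR) container
print(X_dense.shape)
print(X_sparse.shape)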
In [ ]:
from IPython.core.display import Image, display
display(Image(filename='images/iris_setosa.jpg'))
print("Iris Setosa\n")
display(Image(filename='images/iris_versicolor.jpg'))
print("Iris Versicolor\n")
display(Image(filename='images/iris_virginica.jpg'))
print("Iris Virginica")
Scikit-learn has a very straightforward set of data on these iris species. The data consist of the following:
- features: the sepal length, sepal width, petal length, and petal width of each flower, in cm
- labels: the species of each flower (setosa, versicolor, or virginica)
Scikit-learn refers to the "labels" as "targets". So, every time you see "target", just think "label" and it will make more sense.
Scikit-Learn embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:
In [3]:
# Execute this cell
from sklearn.datasets import load_iris
iris = load_iris()
iris is a dictionary-like object, so we can look at its "keys" as follows:
In [ ]:
print(iris.keys())
- target_names: the label names, as defined above
- data: the [n_samples, n_features] data array
- target: the list of labels for all of the entries in data
- DESCR: a README file with all of the information about the data set
- feature_names: the feature names, as defined above
In [ ]:
print(iris.DESCR)
If you are curious, the data come from measurements made in Quebec by botanist Dr. Edgar Anderson and first used for Sir Ronald Fisher's 1936 classification paper, see https://en.wikipedia.org/wiki/Iris_flower_data_set. (I'm beginning to feel like it is time for me to make a donation to wikipedia again!)
We can query the database to determine the shape of the data and find that there are 150 objects with 4 measurements each:
In [ ]:
print(iris.data.shape)
n_samples, n_features = iris.data.shape
print(n_samples, n_features)
Now let's look at both the first entry in data and the full N$\times$M data array. The first entry shows the values of the 4 features for the first object.
Note the structure of the full data array: it is an array of n_samples rows, each with n_features values. Scikit-learn is very picky about this exact format, and it isn't always what you would generate naturally (particularly in the case where we have only one feature). More on this later.
In [ ]:
print(iris.data[0])
print(iris.data)
We can also inspect the shape of the target (labels) array, which is a 1-D array of length n_samples, print the values of those labels, and also learn how those numerical values relate to the names of the iris species:
In [ ]:
# The shape of the target (labels) array is just (n_samples,)
print(iris.target.shape)
# Here we see that the labels are given numerical values
print(iris.target)
# Use target_names to translate those numerical values to names
print(iris.target_names)
print(iris.target_names[0])
This data is four dimensional, but we can visualize two of the dimensions at a time using a simple scatter-plot.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
x_index = 0
y_index = 1
# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])
plt.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target, cmap=plt.cm.get_cmap('RdYlBu', 3))
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.clim(-0.5, 2.5)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index]);
Scikit-Learn's built-in datasets come in three flavors:
- sklearn.datasets.load_*: small "toy" datasets that are packaged with Scikit-Learn itself
- sklearn.datasets.fetch_*: larger datasets that are downloaded from the web on first use
- sklearn.datasets.make_*: synthetic datasets generated from simple models
You can explore the available dataset loaders, fetchers, and generators using IPython's tab-completion functionality. After importing the datasets submodule from sklearn, type datasets.load_ + TAB, datasets.fetch_ + TAB, or datasets.make_ + TAB to see a list of available functions.
In [141]:
from sklearn import datasets
In [ ]:
# Type datasets.load_<TAB> (or fetch_, make_) in IPython to see all possibilities
datasets.load_
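For example (a minimal sketch; the fetch_* call is commented out here because it would download data):
In [ ]:
# Packaged data: small datasets that ship with Scikit-Learn
iris_demo = datasets.load_iris()
# Downloadable data: fetched from the web on first use
# faces_demo = datasets.fetch_olivetti_faces()
# Generated data: synthetic samples drawn from a simple model
X_demo, y_demo = datasets.make_blobs(n_samples=100, centers=3, random_state=0)
print(iris_demo.data.shape, X_demo.shape)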
A similar interface to datasets is available in astroML:
In [8]:
from astroML import datasets
In [16]:
# Use tab completion to explore datasets
datasets.fetch_
Now we can dive into the basic principles of machine learning, and how to utilize them via Scikit-Learn.
After briefly introducing scikit-learn's Estimator object, we'll get an introduction to supervised learning, including classification and regression problems, and unsupervised learning, including dimensionality reduction and clustering problems.
Every algorithm in scikit-learn is exposed via an "Estimator" object.
All the parameters of an estimator can be set when it is instantiated, and have suitable default values.
Let's take a look at all of the parameters, attributes, and methods of the LinearRegression estimator.
In [16]:
from sklearn.linear_model import LinearRegression
In [ ]:
# Initialize a LinearRegression estimator object, call it `model`
model = LinearRegression(normalize=True)
# Check an individual parameter
print(model.normalize)
# Check all of the parameters
print(model)
With our model initialized, we are ready to input some data. So let's make a simple test case.
In [ ]:
x_train = np.arange(10)
y_train = 2 * x_train + 1
print(x_train)
print(y_train)
In [ ]:
plt.plot(x_train, y_train, 'o');
Now what we'd like to do is to fit
that data to determine the model parameters that we would need to predict
the $y$ values of any new measurement of $x$.
The general code syntax for this sort of thing might be
lsqfit x_train y_train x_test y_out
Scikit-Learn breaks this into two steps:
1. fitting
2. predicting
Where the syntax of the fit looks like:
In [ ]:
# fit the model on our data
model.fit(x_train, y_train)
Did that work for anyone? No, right?
That's because Scikit-Learn wants the training features as a two-dimensional array, even if there is only one feature, as we have here.
Their standard syntax is to call this multi-dimensional training array, $X$, (capital $x$) and the training "labels" (which can be continuous) simply $y$.
If your $x$ is 1-D, then we have to turn it into something that looks like a 2-D array. All of the following yield the same result. I personally like the last one since it is (sort of) readable: it says, take my $x$ array and add another axis that is blank. The reshape syntax used to be widely used, but it has now mostly been replaced by the np.newaxis syntax.
The effect of all three is to turn $x$ into a 2-D array of shape (n_samples, 1) rather than a 1-D array of length n_samples.
In [ ]:
# The input data for sklearn must be 2D: (samples == N x features == 1)
# All of these give the same result. I'll adopt the convention from the last one.
print(x_train.reshape(-1, 1))
print(x_train[:, np.newaxis])
print(x_train[:, None])
X = x_train[:, None]
y = y_train
print(y)
Note that you can produce random $X$s in the correct format with
N = 10
X = np.random.random(size=(N, 1))
And you can go backwards with
X.squeeze()
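A quick sanity check on the shapes involved (a minimal sketch, reusing the x_train array from above):
In [ ]:
# 1-D array of length n_samples versus 2-D array of shape (n_samples, 1)
print(x_train.shape)                      # (10,)
print(x_train[:, None].shape)             # (10, 1)
print(x_train[:, None].squeeze().shape)   # back to (10,)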
In [ ]:
# Now we can fit the model on our data
model.fit(X, y)
Estimated Model parameters
All the estimated parameters are attributes of the estimator object ending with an underscore:
In [ ]:
# Let's look at the fit parameters, which are indicated by an underscore at the end
print(model.coef_)
print(model.intercept_)
The model found a line with slope 2 and intercept 1, as we'd expect.
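With the fitted model in hand we can predict $y$ for new values of $x$ (a minimal sketch; note the required 2-D input shape):
In [ ]:
# With slope 2 and intercept 1, we expect a prediction of ~11 at x = 5
X_new = np.array([[5.0]])   # shape (1 sample, 1 feature)
print(model.predict(X_new))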
In Supervised Learning, we have a dataset consisting of both features and labels. The task is to construct an estimator which is able to predict the label of an object given the set of features. A relatively simple example is predicting the species of iris given a set of measurements of its flower.
Some more complicated examples are:
- given a multicolor image of an object through a telescope, determine whether that object is a star, a galaxy, or a quasar
- given a photograph of a person, identify the person in the photo
- given a list of movies a person has watched and their personal ratings of those movies, recommend a list of movies they would like
What these tasks have in common is that there are one or more unknown quantities associated with the object which need to be determined from other observed quantities.
Supervised learning is further broken down into two categories, classification and regression. In classification, the label is discrete, while in regression, the label is continuous. For example, in astronomy, the task of determining whether an object is a star, a galaxy, or a quasar is a classification problem: the label is from three distinct categories. On the other hand, we might wish to estimate the age of an object based on such observations: this would be a regression problem, because the label (age) is a continuous quantity.
K nearest neighbors (kNN) is one of the simplest learning strategies: given a new, unknown observation, look up in your reference database which ones have the closest features and assign the predominant class.
Let's try it out on our iris classification problem. First see if you can do a kNN fit on the iris data using the 5 nearest neighbors.
In [ ]:
from sklearn import neighbors, datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
# create the model
knn = neighbors.KNeighborsClassifier(# Complete)
# fit the model
# Complete
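One possible completion of the cell above (a sketch, not the only way to do it):
In [ ]:
# create the model using the 5 nearest neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
# fit the model on the full iris data
knn.fit(X, y)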
Now see if you can figure out what kind of iris has a 3cm x 5cm sepal and a 4cm x 2cm petal, using the predict method. Remember to be careful constructing your $X$ array. This is trickier than before: we now have multiple features, but we still need them in an array of shape (n_samples=1, n_features=4).
In [ ]:
Xtest = # Complete
print(Xtest)
Once $X$ is properly formatted, go ahead and do the prediction.
Can you output the name rather than the index of the iris type?
In [ ]:
# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?
# call the "predict" method:
result = knn.predict(Xtest)
print(result)
print # Complete by printing *name* of flower
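One possible completion of the two cells above (a sketch; the measurements 3cm x 5cm sepal and 4cm x 2cm petal are taken here in the feature order sepal length, sepal width, petal length, petal width):
In [ ]:
# A single sample with 4 features, so the shape must be (1, 4)
Xtest = np.array([[3, 5, 4, 2]])
result = knn.predict(Xtest)
print(result)
print(iris.target_names[result])   # the *name* of the flower rather than the index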
In some sense the whole point of classification is that there isn't necessarily going to be a black and white "right" answer. So, we might want to do probabilistic predictions instead.
In [ ]:
knn.predict_proba(Xtest)
We could even visualize the full parameter space so that we could determine the classification by eye. To make things simpler, we have done this in 2-D rather than 4-D. Note that it isn't perfect, but does a really good job of classifying at least the red sources.
In [ ]:
%run code/sklearn_ex3.py
Let's do the same, but now using sklearn.svm.SVC to predict what type of iris has a 3cm x 5cm sepal and a 4cm x 2cm petal. Note that you don't have to know what it is to use it.
Careful -- our plot above overwrote some things, so it is best to start from scratch: reload the iris data set and redefine the test input.
In [30]:
from sklearn.svm import SVC
iris = # Complete
X, y = # Complete
Xtest = # Complete
In [ ]:
model = # Complete
# Fit the model
# Determine and print the result
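One possible completion (a sketch; SVC is used here with its default parameters):
In [ ]:
from sklearn import datasets

# Reload the data and rebuild the test point
iris = datasets.load_iris()
X, y = iris.data, iris.target
Xtest = np.array([[3, 5, 4, 2]])
# Fit an SVC with default settings
model = SVC()
model.fit(X, y)
# Determine and print the result (as a name rather than an index)
result = model.predict(Xtest)
print(iris.target_names[result])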
In [ ]:
# Create some simple data
import numpy as np
np.random.seed(0)
x = np.random.random(size=20)
y = 3 * x + 2 + np.random.randn(20)
X = x[:,None]
# Fit a Linear Model
modelLR = LinearRegression(normalize=True)
modelLR.fit(X,y)
# Plot the data and the model prediction
x_test = np.linspace(0, 1, 100)
X_test = x_test[:,None]
y_testLR = modelLR.predict(X_test)
plt.plot(x, y, 'o')
plt.plot(X_test, y_testLR)
Modify the above code to add a line using the RandomForestRegressor
as well.
In [ ]:
# Fit a Random Forest model
from sklearn.ensemble import RandomForestRegressor
# Complete
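One possible completion, adding a random forest curve to the same plot (a sketch; the default forest settings are used, with only a seed fixed):
In [ ]:
from sklearn.ensemble import RandomForestRegressor

# Fit a Random Forest model on the same training data
modelRF = RandomForestRegressor(random_state=0)
modelRF.fit(X, y)
y_testRF = modelRF.predict(X_test)

# Plot the data together with both model predictions
plt.plot(x, y, 'o')
plt.plot(x_test, y_testLR, label='linear regression')
plt.plot(x_test, y_testRF, label='random forest')
plt.legend()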
Unsupervised Learning addresses a different sort of problem. Here the data has no labels, and we are interested in finding similarities between the objects in question. In a sense, you can think of unsupervised learning as a means of discovering labels from the data itself.
Unsupervised learning comprises tasks such as dimensionality reduction, clustering, and density estimation. For example, in the iris data discussed above, we can use unsupervised methods to determine combinations of the measurements which best display the structure of the data. As we'll see below, such a projection of the data can be used to visualize the four-dimensional dataset in two dimensions. Some more involved unsupervised learning problems are:
- given detailed observations of distant galaxies, determine which features or combinations of features best summarize the information
- given a mixture of two sound sources (for example, a person talking over some music), separate the two (the blind source separation problem)
- given a video, isolate a moving object and categorize it in relation to other moving objects which have been seen
Sometimes the two may even be combined: e.g. Unsupervised learning can be used to find useful features in heterogeneous data, and then these features can be used within a supervised framework.
Principal Component Analysis (PCA) is a dimensionality reduction technique that can find the combinations of variables that explain the most variance.
Consider the iris dataset. It cannot be visualized in a single 2D plot, as it has 4 features. We are going to extract 2 combinations of sepal and petal dimensions to visualize it.
The process starts in the same way as the other Scikit-Learn examples that we have done:
In [ ]:
X, y = iris.data, iris.target
from sklearn.decomposition import PCA
pca = PCA(# Complete with 2 components)
# Do the fitting
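One possible completion (a sketch):
In [ ]:
# Keep the 2 linear combinations of features that capture the most variance
pca = PCA(n_components=2)
pca.fit(X)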
But now we can't do a predict step, since there are no labels to predict. Instead we are going to use the transform() method. But first let's take a look at the parameters, attributes, and methods for PCA.
In [ ]:
X_reduced = pca.transform(X)
print("Reduced dataset shape:", X_reduced.shape)
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='RdYlBu')
print("Projection of the 2 components:")
for component in pca.components_:
    print(" + ".join("%.3f x %s" % (value, name) for value, name in zip(component, iris.feature_names)))
This projection maximizes the variance retained from the data, and here it also separates the classes quite well. We can use this either just for visualization or to actually help define the classes.
Clustering groups together observations that are homogeneous with respect to a given criterion, finding "clusters" in the data.
Note that these clusters will uncover relevant hidden structure in the data only if the criterion used highlights it.
Let's look at KMeans. Note that here the parameters, attributes, and methods allow for either predict or transform.
In [ ]:
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0) # Setting the RNG seed in kmeans
k_means.fit(X)
y_pred = k_means.predict(X)
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_pred, cmap='RdYlBu')
Scikit-learn strives to have a uniform interface across all methods, and we'll see examples of these below. Given a scikit-learn estimator object named model, the following methods are available:
- model.fit(): fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y [e.g., model.fit(X, y)]. For unsupervised learning applications, this accepts only a single argument, the data X [e.g., model.fit(X)].
- model.predict(): given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new [e.g., model.predict(X_new)], and returns the learned label for each object in the array.
- model.predict_proba(): for classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().
- model.score(): for classification or regression problems, most (all?) estimators implement a score method. Scores are between 0 and 1, with a larger score indicating a better fit.
- model.transform(): given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.
- model.fit_transform(): some estimators implement this method, which more efficiently performs a fit and a transform on the same input data.
An important piece of machine learning is model validation: that is, determining how well your model will generalize from the training data to future unlabeled data. Let's look at an example using the nearest neighbor classifier. This is a very simple classifier: it simply stores all training data, and for any unknown quantity, simply returns the label of the closest training point.
With the iris data, it very easily returns the correct prediction for each of the input points:
In [ ]:
from sklearn.neighbors import KNeighborsClassifier
X, y = iris.data, iris.target
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)
y_pred = clf.predict(X)
print (y==y_pred)
print(np.all(y == y_pred)) # To check all at once.
See what happens if we use more than the first nearest neighbor.
A more useful way to look at the results is to view the confusion matrix, or the matrix showing the frequency of inputs and outputs:
In [ ]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y, y_pred))
For each class, all 50 training samples are correctly identified. But this does not mean that our model is perfect! In particular, such a model generalizes extremely poorly to new data. We can simulate this by splitting our data into a training set and a test set. Scikit-learn contains some convenient routines to do this; here we will apply train_test_split:
In [ ]:
from sklearn.cross_validation import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
print(confusion_matrix(ytest, ypred))
This paints a better picture of the true performance of our classifier: apparently there is some confusion between the second and third species, which we might anticipate given what we've seen of the data above.
This is why it's extremely important to use a train/test split when evaluating your models. We'll go into more depth on model evaluation later.
Sometimes you might want to apply multiple machine learning techniques. For example, just because you have labels in your training data doesn't mean that a regression algorithm will work very well. So, you might want to start with a clustering or dimensionality reduction algorithm first.
Let's take a look at a really cool data set/example. I used to worry about my handwriting on letter addresses. No longer!
Another built-in Scikit-Learn data set is the handwritten digits data set.
In [ ]:
from sklearn import datasets
digits = datasets.load_digits()
print(digits.images.shape)
print(digits.keys())
Let's plot a few of the images:
In [ ]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary')
    ax.text(0.05, 0.05, str(digits.target[i]), transform=ax.transAxes, color='green')
    ax.set_xticks([])
    ax.set_yticks([])
Here the data is simply each pixel value within an 8x8 grid. For analysis, we don't care about the 2-D nature of the image, so we can just treat each image as a flat 8x8=64 dimensional array. So our data have 1797 samples in 64 dimensions.
Then each of the 1797 images has a label.
In [ ]:
print(digits.data.shape)
# The images themselves
print(digits.images[0])
# The data for use in our algorithms
print(digits.data[0])
# The labels
print(digits.target)
print(digits.target[0])
We'd like to visualize our points within the 64-dimensional parameter space, but it's difficult to plot points in 64 dimensions! Instead we'll use an unsupervised dimensionality reduction technique called Isomap:
In [ ]:
from sklearn.manifold import Isomap
iso = Isomap(# Complete for 2 components)
data_projected = # Do the fit *and* transform at the same time
data_projected.shape # See what the resulting shape is
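One possible completion (a sketch; fit_transform() does the fit and the transform in a single call):
In [ ]:
iso = Isomap(n_components=2)
data_projected = iso.fit_transform(digits.data)
print(data_projected.shape)   # expect (1797, 2)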
Isomap has turned our 1797x64 dimensional data set into a 1797x2 dimensional data set. This is much easier to visualize! Now let's see if that visualization is at all helpful.
In [ ]:
# Plot all of the data points in the two projected dimensions, colored by their labels
plt.scatter(data_projected[:, 0], data_projected[:, 1], c=digits.target, edgecolor='none', alpha=0.5, cmap=plt.cm.get_cmap('nipy_spectral', 10));
# Add the color bar
plt.colorbar(label='digit label', ticks=range(10))
# Make it clear which color goes with which label
plt.clim(-0.5, 9.5)
We see here that the digits are fairly well-separated in the parameter space; this tells us that a supervised classification algorithm should perform fairly well. Let's give it a try.
In [ ]:
from sklearn.cross_validation import train_test_split
# Complete the line below using 80% of the sample for training
Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target, random_state=2, # Complete)
print(Xtrain.shape, Xtest.shape)
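One possible completion, reserving 80% of the sample for training (a sketch):
In [ ]:
Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target, random_state=2, train_size=0.8)
print(Xtrain.shape, Xtest.shape)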
Let's use a simple Logistic Regression
which (despite its name) is a classification algorithm:
In [ ]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(# Complete)
ypred = # Complete
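One possible completion (a sketch; LogisticRegression is used with its default settings):
In [ ]:
clf = LogisticRegression()
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)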
We can check our classification accuracy by comparing the true values of the test set to the predictions. We'll use one of sklearn's built-in metrics.
In [ ]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, ypred)
This single number doesn't tell us where we've gone wrong: one nice way to do this is to use the confusion matrix as we did above.
In [ ]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(ytest, ypred))
plt.imshow(np.log(confusion_matrix(ytest, ypred)), cmap='Blues', interpolation='nearest')
plt.ylabel('true')
plt.xlabel('predicted');
So, that says that all 32 "0"s were classified as "0", and no non-zero digits were misclassified as "0".
41 "1"s were classified correctly, but one each was misclassified as a "3", an "8", and a "9". Moreover, two "4"s and two "8"s were misclassified as "1"s.
And so on.
We can look at some of the outputs along with their predicted labels, making the bad labels red:
In [ ]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)
for i, ax in enumerate(axes.flat):
    ax.imshow(Xtest[i].reshape(8, 8), cmap='binary')
    ax.text(0.05, 0.05, str(ypred[i]), transform=ax.transAxes, color='green' if (ytest[i] == ypred[i]) else 'red')
    ax.set_xticks([])
    ax.set_yticks([])
The interesting thing is that even with this simple logistic regression algorithm, many of the mislabeled cases are ones that we ourselves might get wrong!
There are many ways to improve this classifier, for example using a more sophisticated model, using cross-validation, etc. We'll get to those in the coming weeks.
In the meantime, I remind you of the Scikit-Learn algorithm cheat sheet, which gives a nice summary of which algorithms to choose in various situations.
In [38]:
from IPython.display import Image
Image("http://scikit-learn.org/dev/_static/ml_map.png")
Out[38]: