This notebook was put together by [Jake Vanderplas](http://www.vanderplas.com) for UW's [Astro 599/AMath 500](http://www.astro.washington.edu/users/vanderplas/Astr599_2014/) course. Source and license info is on [GitHub](https://github.com/jakevdp/2014_fall_ASTR599/).
Note: some examples below require the scripts in the fig_code
directory, which
can be found within the Github repository at http://github.com/jakevdp/2014_fall_Astr599
Main Goal: To introduce the central concepts of machine learning, and how they can be applied in Python using the Scikit-learn Package.
Note that this material is covered in much more depth in my PyCon 2014 Tutorial
Scikit-Learn is a Python package designed to give access to well-known machine learning algorithms within Python code, through a clean, well-thought-out API. It has been built by hundreds of contributors from around the world, and is used across industry and academia.
Scikit-Learn is built upon Python's NumPy (Numerical Python) and SciPy (Scientific Python) libraries, which enable efficient in-core numerical and scientific computation within Python. As such, scikit-learn is not specifically designed for extremely large datasets, though there is some work in this area.
For this short introduction, I'm going to stick to questions of in-core processing of small to medium datasets with Scikit-learn.
In this section we will begin to explore the basic principles of machine learning. Machine Learning is about building programs with tunable parameters (typically an array of floating point values) that are adjusted automatically so as to improve their behavior by adapting to previously seen data.
Machine Learning can be considered a subfield of Artificial Intelligence since those algorithms can be seen as building blocks to make computers learn to behave more intelligently by somehow generalizing rather that just storing and retrieving data items like a database system would do.
We'll take a look at two very simple machine learning tasks here. The first is a classification task: the figure shows a collection of two-dimensional data, colored according to two different class labels. A classification algorithm may be used to draw a dividing boundary between the two clusters of points:
In [1]:
# start the inline backend for plotting
%matplotlib inline
In [2]:
# Import the example plot from the figures directory
from fig_code import plot_sgd_separator
plot_sgd_separator()
This may seem like a trivial task, but it is a simple version of a very important concept. By drawing this separating line, we have learned a model which can generalize to new data: if you were to drop another point onto the plane which is unlabeled, this algorithm could now predict whether it's a blue or a red point.
If you'd like to see the source code used to generate this, you can either open the
code in the figures
directory, or you can load the code using the %load
magic command:
In [3]:
#Uncomment the %load command to load the contents of the file
# %load fig_code/sgd_separator.py
The next simple task we'll look at is a regression task: a simple best-fit line to a set of data:
In [4]:
from fig_code import plot_linear_regression
plot_linear_regression()
Again, this is an example of fitting a model to data, such that the model can make generalizations about new data. The model has been learned from the training data, and can be used to predict the result of test data: here, we might be given an x-value, and the model would allow us to predict the y value. Again, this might seem like a trivial problem, but it is a basic example of a type of operation that is fundamental to machine learning tasks.
Machine learning is about creating models from data: for that reason, we'll start by discussing how data can be represented in order to be understood by the computer. Along with this, we'll build on our matplotlib examples from the previous section and show some examples of how to visualize data.
Most machine learning algorithms implemented in scikit-learn expect data to be stored in a
two-dimensional array or matrix. The arrays can be
either numpy
arrays, or in some cases scipy.sparse
matrices.
The size of the array is expected to be [n_samples, n_features]
The number of features must be fixed in advance. However it can be very high dimensional
(e.g. millions of features) with most of them being zeros for a given sample. This is a case
where scipy.sparse
matrices can be useful, in that they are
much more memory-efficient than numpy arrays.
In [5]:
from IPython.core.display import Image, display
display(Image(filename='images/iris_setosa.jpg'))
print("Iris Setosa\n")
display(Image(filename='images/iris_versicolor.jpg'))
print("Iris Versicolor\n")
display(Image(filename='images/iris_virginica.jpg'))
print("Iris Virginica")
If we want to design an algorithm to recognize iris species, what might the data be?
Remember: we need a 2D array of size [n_samples x n_features]
.
What would the n_samples
refer to?
What might the n_features
refer to?
Remember that there must be a fixed number of features for each sample, and feature
number i
must be a similar kind of quantity for each sample.
Scikit-learn has a very straightforward set of data on these iris species. The data consist of the following:
Features in the Iris dataset:
Target classes to predict:
scikit-learn
embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:
In [6]:
from sklearn.datasets import load_iris
iris = load_iris()
In [7]:
iris.keys()
Out[7]:
In [8]:
n_samples, n_features = iris.data.shape
print((n_samples, n_features))
print(iris.data[0])
In [9]:
print(iris.data.shape)
print(iris.target.shape)
In [10]:
print(iris.target)
In [11]:
print(iris.target_names)
This data is four dimensional, but we can visualize two of the dimensions at a time using a simple scatter-plot:
In [12]:
import numpy as np
import matplotlib.pyplot as plt
x_index = 0
y_index = 1
# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])
plt.scatter(iris.data[:, x_index], iris.data[:, y_index],
c=iris.target, cmap=plt.cm.get_cmap('RdYlBu', 3))
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.clim(-0.5, 2.5)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index]);
They come in three flavors:
sklearn.datasets.load_*
sklearn.datasets.fetch_*
sklearn.datasets.make_*
You can explore the available dataset loaders, fetchers, and generators using IPython's
tab-completion functionality. After importing the datasets
submodule from sklearn
,
type
datasets.load_ + TAB
or
datasets.fetch_ + TAB
or
datasets.make_ + TAB
to see a list of available functions.
In [13]:
from sklearn import datasets
In [14]:
#datasets.fetch_
Here we'll dive into the basic principles of machine learning, and how to utilize them via the Scikit-Learn API.
After briefly introducing scikit-learn's Estimator object, we'll cover supervised learning, including classification and regression problems, and unsupervised learning, including dimensinoality reduction and clustering problems.
In [15]:
from sklearn.linear_model import LinearRegression
Estimator parameters: All the parameters of an estimator can be set when it is instantiated, and have suitable default values:
In [16]:
model = LinearRegression(normalize=True)
print(model.normalize)
In [17]:
print(model)
Estimated Model parameters: When data is fit with an estimator, parameters are estimated from the data at hand. All the estimated parameters are attributes of the estimator object ending by an underscore:
In [18]:
x = np.arange(10)
y = 2 * x + 1
In [19]:
plt.plot(x, y, 'o');
In [20]:
# The input data for sklearn is 2D: (samples == 10 x features == 1)
X = x[:, np.newaxis]
print(X)
print(y)
In [21]:
# fit the model on our data
model.fit(X, y)
Out[21]:
In [22]:
# underscore at the end indicates a fit parameter
print(model.coef_)
print(model.intercept_)
In [23]:
model.residues_
Out[23]:
The model found a line with a slope 2 and intercept 1, as we'd expect.
In Supervised Learning, we have a dataset consisting of both features and labels. The task is to construct an estimator which is able to predict the label of an object given the set of features. A relatively simple example is predicting the species of iris given a set of measurements of its flower. This is a relatively simple task. Some more complicated examples are:
What these tasks have in common is that there is one or more unknown quantities associated with the object which needs to be determined from other observed quantities.
Supervised learning is further broken down into two categories, classification and regression. In classification, the label is discrete, while in regression, the label is continuous. For example, in astronomy, the task of determining whether an object is a star, a galaxy, or a quasar is a classification problem: the label is from three distinct categories. On the other hand, we might wish to estimate the age of an object based on such observations: this would be a regression problem, because the label (age) is a continuous quantity.
In [24]:
from sklearn import neighbors, datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
# create the model
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
# fit the model
knn.fit(X, y)
# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?
# call the "predict" method:
result = knn.predict([[3, 5, 4, 2],])
print(iris.target_names[result])
In [25]:
from fig_code import plot_iris_knn
plot_iris_knn()
Exercise: Now use as an estimator on the same problem: sklearn.svm.SVC
.
Note that you don't have to know what it is do use it.
If you finish early, do the same plot as above.
In [26]:
from sklearn.svm import SVC
In [27]:
model = SVC()
model.fit(X, y)
result = model.predict([[3, 5, 4, 2],])
print(iris.target_names[result])
In [28]:
# Create some simple data
import numpy as np
np.random.seed(0)
X = np.random.random(size=(20, 1))
y = 3 * X.squeeze() + 2 + np.random.randn(20)
# Fit a linear regression to it
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model.fit(X, y)
print ("Model coefficient: %.5f, and intercept: %.5f"
% (model.coef_, model.intercept_))
# Plot the data and the model prediction
X_test = np.linspace(0, 1, 100)[:, np.newaxis]
y_test = model.predict(X_test)
plt.plot(X.squeeze(), y, 'o')
plt.plot(X_test.squeeze(), y_test);
Unsupervised Learning addresses a different sort of problem. Here the data has no labels, and we are interested in finding similarities between the objects in question. In a sense, you can think of unsupervised learning as a means of discovering labels from the data itself. Unsupervised learning comprises tasks such as dimensionality reduction, clustering, and density estimation. For example, in the iris data discussed above, we can used unsupervised methods to determine combinations of the measurements which best display the structure of the data. As we'll see below, such a projection of the data can be used to visualize the four-dimensional dataset in two dimensions. Some more involved unsupervised learning problems are:
Sometimes the two may even be combined: e.g. Unsupervised learning can be used to find useful features in heterogeneous data, and then these features can be used within a supervised framework.
Principle Component Analysis (PCA) is a dimension reduction technique that can find the combinations of variables that explain the most variance.
Consider the iris dataset. It cannot be visualized in a single 2D plot, as it has 4 features. We are going to extract 2 combinations of sepal and petal dimensions to visualize it:
In [29]:
X, y = iris.data, iris.target
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
X_reduced = pca.transform(X)
print("Reduced dataset shape:", X_reduced.shape)
import pylab as pl
pl.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y,
cmap='RdYlBu')
print("Meaning of the 2 components:")
for component in pca.components_:
print(" + ".join("%.3f x %s" % (value, name)
for value, name in zip(component,
iris.feature_names)))
In [30]:
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0) # Fixing the RNG in kmeans
k_means.fit(X)
y_pred = k_means.predict(X)
pl.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_pred,
cmap='RdYlBu');
Scikit-learn strives to have a uniform interface across all methods,
and we'll see examples of these below. Given a scikit-learn estimator
object named model
, the following methods are available:
model.fit()
: fit training data. For supervised learning applications,
this accepts two arguments: the data X
and the labels y
(e.g. model.fit(X, y)
).
For unsupervised learning applications, this accepts only a single argument,
the data X
(e.g. model.fit(X)
).model.predict()
: given a trained model, predict the label of a new set of data.
This method accepts one argument, the new data X_new
(e.g. model.predict(X_new)
),
and returns the learned label for each object in the array.model.predict_proba()
: For classification problems, some estimators also provide
this method, which returns the probability that a new observation has each categorical label.
In this case, the label with the highest probability is returned by model.predict()
.model.score()
: for classification or regression problems, most (all?) estimators implement
a score method. Scores are between 0 and 1, with a larger score indicating a better fit.model.transform()
: given an unsupervised model, transform new data into the new basis.
This also accepts one argument X_new
, and returns the new representation of the data based
on the unsupervised model.model.fit_transform()
: some estimators implement this method,
which more efficiently performs a fit and a transform on the same input data.One interesting application where we might apply machine learning is in exploring handwritten digit data. Scikit-learn has a small test set built-in:
We'll use scikit-learn's data access interface and take a look at this data:
In [31]:
from sklearn import datasets
digits = datasets.load_digits()
digits.images.shape
Out[31]:
Let's plot a few of these:
In [32]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)
for i, ax in enumerate(axes.flat):
ax.imshow(digits.images[i], cmap='binary')
ax.text(0.05, 0.05, str(digits.target[i]),
transform=ax.transAxes, color='green')
ax.set_xticks([])
ax.set_yticks([])
Here the data is simply each pixel value within an 8x8 grid:
In [33]:
# The images themselves
print(digits.images.shape)
print(digits.images[0])
In [34]:
# The data for use in our algorithms
print(digits.data.shape)
print(digits.data[0])
In [35]:
# The target label
print(digits.target)
So our data have 1797 samples in 64 dimensions.
We'd like to visualize our points within the 64-dimensional parameter space, but it's difficult to plot points in 64 dimensions! Instead we'll use a unsupervised dimensionality reduction technique called Isomap:
In [36]:
from sklearn.manifold import Isomap
In [37]:
iso = Isomap(n_components=2)
data_projected = iso.fit_transform(digits.data)
In [38]:
plt.scatter(data_projected[:, 0], data_projected[:, 1], c=digits.target,
edgecolor='none', alpha=0.5, cmap=plt.cm.get_cmap('nipy_spectral', 10));
plt.colorbar(label='digit label', ticks=range(10))
plt.clim(-0.5, 9.5)
We see here that the digits are fairly well-separated in the parameter space; this tells us that a supervised classification algorithm should perform fairly well. Let's give it a try.
Let's try a classification task on the digits. The first thing we'll want to do is split the digits into a training and testing sample:
In [39]:
from sklearn.cross_validation import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target)
print(Xtrain.shape, Xtest.shape)
Let's use a simple logistic regression to classify this data:
In [40]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(penalty='l2')
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
We can check our classification accuracy by comparing the true values of the test set to the predictions:
In [41]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, ypred)
Out[41]:
This single number doesn't tell us where we've gone wrong: one nice way to do this is to use the confusion matrix
In [42]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(ytest, ypred))
plt.imshow(confusion_matrix(ytest, ypred),
cmap='Blues', interpolation='nearest')
plt.ylabel('true')
plt.xlabel('predicted');
We might also take a look at some of the outputs along with their predicted labels. We'll make the bad labels red:
In [43]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)
for i, ax in enumerate(axes.flat):
ax.imshow(Xtest[i].reshape(8, 8), cmap='binary')
ax.text(0.05, 0.05, str(ypred[i]),
transform=ax.transAxes,
color='green' if (ytest[i] == ypred[i]) else 'red')
ax.set_xticks([])
ax.set_yticks([])
The interesting thing is that even with this simple logistic regression algorithm, many of the mislabeled cases are ones that we ourselves might get wrong!
There are many ways to improve this classifier, but we're out of time here. To go further, we could use a more sophisticated model, use cross validation, etc.
There's lots more information out there.
A couple of resources I've worked on:
Scikit-learn documentation: lots of info and examples.
Scipy 2013 Tutorial: An eight hour tutorial on scikit-learn, presented by Olivier Grisel, Gael Varoquaux, and myself. See the videos and the notebooks.
astroML: machine learning for Astronomy in Python.
In [44]:
from astroML.datasets import fetch_rrlyrae_combined
from sklearn.cross_validation import train_test_split
X, y = fetch_rrlyrae_combined()
# For now, we'll only fit the first two colors
X_train, X_test, y_train, y_test = train_test_split(X, y)
In [45]:
N_plot = 5000
plt.scatter(X[-N_plot:, 0], X[-N_plot:, 1], c=y[-N_plot:],
edgecolors='none')
plt.xlabel('u-g color')
plt.ylabel('g-r color');
Red points are RR-Lyrae, Blue points are main sequence stars.
Now we'll do a simple and fast $K$-neighbors classification.
Note that we train on part of the data and test on another part. Otherwise, an estimator which simply remembers the entire dataset would obtain a perfect classification!
In [46]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
In [47]:
print(np.sum(y_pred == y_test))
print(np.sum(y_pred != y_test))
In [48]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred,
target_names=['MS star', 'RR Lyrae']))
The range for precision, recall, and F1 score is 0 to 1, with 1 being perfect. Often in astronomy, we call the recall the completeness, and (1 - precision) the contamination.
Because this is an unbalanced dataset, we see that the RR Lyrae stars are completely overwhelmed by the larger number of normal Main Sequence stars.
Use another classifier and try to improve on this precision/recall score.
There are several you might try:
sklearn.neighbors.GaussianNaiveBayes
(fast but inaccurate)sklearn.lda.LDA
(fast but inaccurate)sklearn.svm.SVC
(slowish, often relatively accurate)sklearn.svm.SVC
with kernel='rbf'
(slow but relatively accurate)sklearn.ensemble.RandomForestClassifier
(fast & potentially accurate with tuning)For the slower algorithms, it might be a good idea to use only part of the dataset as you experiment, i.e. when training do something like the following:
clf.fit(X_train[::5], y_train[::5])
Once you're happy, you can run the training on the full dataset.
What's the best precision/recall you can obtain for the RR-Lyrae data?
In [49]:
from astroML.datasets import fetch_sdss_specgals
In [50]:
data = fetch_sdss_specgals()
# put magnitudes in a matrix
X = np.vstack([data['modelMag_%s' % f] for f in 'ugriz']).T
y = data['z']
# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y)
In [51]:
from sklearn.linear_model import LinearRegression
est = LinearRegression()
est.fit(X_train, y_train)
y_pred = est.predict(X_test)
In [52]:
plt.plot(y_test, y_pred, ',k')
plt.plot([0, 1], [0, 1], ':k')
plt.xlim(0, 0.6)
plt.ylim(0, 0.6)
Out[52]:
Evidently, a simple linear model is not a very good fit!
Let's compute the RMS deviation to see how poor this actually is:
In [53]:
rms = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(rms)
Try to improve on this result using another regression algorithm. Some potential choices:
sklearn.neighbors.KNeighborsRegressor
(fast-ish, but often inaccurate)sklearn.svm.SVR
(potentially accurate, but slow)sklearn.ensemble.RandomForestRegressor
(fast, and accurate with tuning)sklearn.ensemble.GradientBoostingRegressor
(accurate with tuning, can be very slow depending on parameters)Again, sub-sampling the data can help as you explore some of the slower techniques.
What's the best RMS error you can get on the data?
In [54]:
from sklearn.ensemble import RandomForestClassifier
In [55]:
RandomForestClassifier?
In [55]: