Basics of Machine Learning for Astronomy

This notebook was put together by Jake VanderPlas for the Local Group Astrostatistics Workshop, June 2015

Outline:

  • What is Machine Learning?

  • Taxonomy of Machine Learning

  • Basic use of scikit-learn

  • Practical concerns in Astronomy

Starting Stuff:

Before we start, we'll do some preliminary imports & notebook setup:


In [1]:
# setup notebook for inline figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

# set matplotlib figure style
mpl.style.use('ggplot') # only works in matplotlib 1.4+
mpl.rc('figure', figsize=(8, 6))

What is Machine Learning?

In this section we will begin to explore the basic principles of machine learning. Machine learning is about building programs with tunable parameters (typically a collection of floating-point values) that are adjusted automatically so as to improve the model's behavior by adapting to previously seen data.

Machine learning can be considered a subfield of Artificial Intelligence, since these algorithms can be seen as building blocks that make computers learn to behave more intelligently by somehow generalizing, rather than just storing and retrieving data items like a database system would do.

Simple Example #1: Classification

As a concrete example, take a look at this figure.


In [2]:
# Import the example plot from the figures directory
from fig_code import plot_sgd_separator
plot_sgd_separator()


We have data in two dimensions with two different labels (red and blue), and we construct a model which captures the relationship between the data points and their discrete labels; this is known as a classification task.

This may seem incredibly trivial, but it is a simple version of a very important concept. We've used a machine learning algorithm to draw a line separating the two classes, which means that for any new datapoint we can predict its color (red or blue). This is the sense in which a machine learning model makes generalizations that can be applied to new data. While this is fairly simple in two dimensions, classification algorithms become very powerful when acting on higher-dimensional or noise-ridden data where the relationships are harder for us to visualize.

If you'd like to see the source code used to generate this, you can either open the code in the figures directory, or you can load the code using the %load magic command:


In [3]:
#Uncomment the %load command to load the contents of the file
# %load fig_code/sgd_separator.py

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.datasets.samples_generator import make_blobs

def plot_sgd_separator():
    # we create 50 separable points
    X, Y = make_blobs(n_samples=50, centers=2,
                      random_state=0, cluster_std=0.60)

    # fit the model
    clf = SGDClassifier(loss="hinge", alpha=0.01,
                        n_iter=200, fit_intercept=True)
    clf.fit(X, Y)

    # plot the line, the points, and the nearest vectors to the plane
    xx = np.linspace(-1, 5, 10)
    yy = np.linspace(-1, 5, 10)

    X1, X2 = np.meshgrid(xx, yy)
    Z = np.empty(X1.shape)
    for (i, j), val in np.ndenumerate(X1):
        x1 = val
        x2 = X2[i, j]
        p = clf.decision_function([[x1, x2]])  # 2D input: one sample, two features
        Z[i, j] = p[0]
    levels = [-1.0, 0.0, 1.0]
    linestyles = ['dashed', 'solid', 'dashed']
    colors = 'k'

    ax = plt.axes()
    ax.contour(X1, X2, Z, levels, colors=colors, linestyles=linestyles)
    ax.scatter(X[:, 0], X[:, 1], c=Y, s=60)

    ax.axis('tight')


if __name__ == '__main__':
    plot_sgd_separator()
    plt.show()


Simple Example #2: Regression

Let's look at another simple example of a machine learning task:


In [6]:
from fig_code import plot_linear_regression
plot_linear_regression()


We have points in two dimensions, but this time, rather than predicting discrete labels for the points, we construct a model of the relationship between two continuous variables. Predicting a continuous variable in this way is called a regression problem.

As above, this model lets us generalize what we have learned from the input data, and make a prediction for any new point: knowing the x value, we can predict the y value. Again, this might seem like a trivial problem, but as the complexity of the data increases, such an approach can be fruitful: if, instead of a single x measurement for each point, you had 100 measured attributes for each point, the algorithm could still be applied to predict the y value based on this input.
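
To make that concrete, here is a minimal sketch of the many-attributes case, using synthetic data invented for illustration (not the data behind the figure above): a linear regression with 100 input features is fit and used exactly like the one-dimensional version.


In [ ]:
# A hedged sketch: regression with 100 measured attributes per point
# (synthetic data, purely for illustration)
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_many = rng.rand(200, 100)            # 200 points, 100 measured attributes each
true_coef = rng.randn(100)             # made-up "true" relationship
y_many = X_many.dot(true_coef) + 0.1 * rng.randn(200)

model_many = LinearRegression().fit(X_many, y_many)
print(model_many.predict(X_many[:3]))  # predicted y for the first three points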

Again, this is an actual model fit to the data shown above, as you can see by loading the appropriate source code file:


In [7]:
#Uncomment the %load command to load the contents of the file
# %load fig_code/linear_regression.py

Example #3: Clustering

Here is an example of another type of machine learning task: clustering


In [8]:
from fig_code import plot_kmeans
plot_kmeans()


Here we're looking at an unsupervised algorithm, meaning that there are no labels on the input data. This is a clustering algorithm called K-Means, which finds clusters within the unlabeled data and predicts labels for them.

In two dimensions, you might be able to do this "by hand". As the dimensionality of the data increases, this becomes more difficult for a human, but it remains a well-defined machine learning problem.
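
As a rough sketch of the same idea (again with synthetic data, not the exact code behind the figure), scikit-learn's K-Means estimator can be applied in just a few lines:


In [ ]:
# A minimal K-Means sketch on unlabeled synthetic data
from sklearn.cluster import KMeans
from sklearn.datasets.samples_generator import make_blobs

# unlabeled data: we deliberately discard the true labels
X_unlabeled, _ = make_blobs(n_samples=300, centers=4, random_state=0)

kmeans = KMeans(n_clusters=4)
cluster_labels = kmeans.fit_predict(X_unlabeled)  # one inferred label per sample
print(cluster_labels[:10])
print(kmeans.cluster_centers_)                    # coordinates of the cluster centers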

As before, you can see how this was generated by loading the source:


In [9]:
#Uncomment the %load command to load the contents of the file
# %load fig_code/kmeans.py

Taxonomy of Machine Learning

So far we've seen a few simple examples of Machine Learning algorithms. Before we dive in to actually using scikit-learn, I want to briefly mention the basic taxonomy of machine learning.

Broadly, machine learning models are often divided into the categories of Supervised and Unsupervised methods.

Supervised Methods

Supervised methods are ones like the classification and regression we saw above. They are given a labeled training dataset, and the model is later applied to unlabeled data in order to predict the unknown labels.

Examples of supervised methods:

  • Classification, in which the labels are discrete
  • Regression, in which the labels are continuous
  • Recommender Systems, in which sparse labels across a population are used to fill in missing labels (think "if you like this movie, you'll probably like...")

We'll look at a few of these in detail today.

Unsupervised Methods

Unsupervised methods are ones like the clustering example above. They are given an unlabeled training dataset, and make inferences about the structure or nature of the data without any label input.

Examples of unsupervised methods:

  • Clustering, in which objects are grouped and labels are inferred from the structure of the input data
  • Dimensionality Reduction, in which the most "important" features or combination of features are identified, where "important" reflects some well-defined metric
  • Density Estimation, in which the smooth distribution of points is predicted from the input sample.

We will cover some of these in tomorrow's session.

Machine Learning with Scikit-Learn

This afternoon we'll look at performing some machine learning tasks with the scikit-learn package. Here we'll talk about the general API of the package before moving on to applying it to real data.

Scikit-Learn is a Python package designed to give access to well-known machine learning algorithms within Python code, through a clean, well-thought-out API. It has been built by hundreds of contributors from around the world, and is used across industry and academia.

Scikit-Learn is built upon Python's NumPy (Numerical Python) and SciPy (Scientific Python) libraries, which enable efficient in-core numerical and scientific computation within Python. As such, scikit-learn is not specifically designed for extremely large datasets, though there is some work in this area.
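
One example of that work: estimators like SGDClassifier expose a partial_fit method for out-of-core learning, which lets you stream data through the model in manageable chunks. Here is a rough sketch using made-up chunks; in practice each chunk would be read from disk.


In [ ]:
# A hedged sketch of out-of-core learning via partial_fit (synthetic chunks)
import numpy as np
from sklearn.linear_model import SGDClassifier

clf_stream = SGDClassifier(loss='hinge')
classes = np.array([0, 1])                 # all classes must be declared up front
rng = np.random.RandomState(0)
for chunk in range(10):                    # pretend each chunk was read from disk
    X_chunk = rng.rand(1000, 4)
    y_chunk = (X_chunk.sum(axis=1) > 2).astype(int)
    clf_stream.partial_fit(X_chunk, y_chunk, classes=classes)
print(clf_stream.predict(rng.rand(3, 4)))  # predictions for three new points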

For this short introduction, I'm going to stick to questions of in-core processing of small to medium datasets with Scikit-learn.

Data Representation in Scikit-Learn

Fundamentally, data in scikit-learn is represented as vectors and matrices.

Data is generally in the form of 2-dimensional matrices:

  • The rows in the matrix are referred to as samples, and each row corresponds to a particular observed object (a star, galaxy, tweet, etc.)

  • The columns of the matrix are referred to as features, and each column corresponds to a particular type of observation (g-r color, frequency of a particular word, etc.)

Labels are generally in the form of a 1-dimensional vector:

  • Each entry in the vector corresponds to a sample, and gives the (numeric) label of the sample.
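
Concretely, a tiny made-up dataset in this layout might look like the following sketch:


In [ ]:
# A toy illustration of the scikit-learn data layout (made-up numbers)
import numpy as np

# three samples (rows) and two features (columns), e.g. two colors per star
X_toy = np.array([[1.25, 0.39],
                  [1.05, 0.34],
                  [0.97, 0.39]])

# one numeric label per sample
y_toy = np.array([0, 1, 0])

print(X_toy.shape)   # (n_samples, n_features)
print(y_toy.shape)   # (n_samples,)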

Example: RR Lyrae stars

We'll start by downloading some photometry of variable and non-variable stars using the astroML package. This dataset is already cleaned up and arranged for a machine learning task. Note that the first time you run this code, it will download the data and cache it to your home directory. Subsequent runs will use the cached version.


In [10]:
from astroML.datasets import fetch_rrlyrae_combined
X, y = fetch_rrlyrae_combined()

print(X.shape)
print(y.shape)


(93141, 4)
(93141,)

The X matrix represents 93141 stars, each with four features (the u-g, g-r, r-i, and i-z colors):


In [11]:
print(X[:10])


[[ 1.25099945  0.39400005  0.13700008  0.06199932]
 [ 1.04800034  0.3390007   0.15199852  0.02300072]
 [ 1.00800133  0.34199905  0.12899971  0.20300102]
 [ 0.96500015  0.3920002   0.14900017  0.14999962]
 [ 1.04000092  0.33300018  0.12599945  0.10199928]
 [ 1.15400124  0.37399864  0.14500046  0.12100029]
 [ 0.96500015  0.38400078  0.11899948  0.01099968]
 [ 1.0150013   0.37099838  0.15800095  0.09199905]
 [ 1.00300026  0.39100075  0.14500046  0.07499886]
 [ 0.94799995  0.32999992  0.16399956  0.02099991]]

The y vector represents the same 93141 stars; here 0 represents non-variable stars and 1 represents variable stars.


In [12]:
y


Out[12]:
array([ 0.,  0.,  0., ...,  1.,  1.,  1.])

In [13]:
np.unique(y)


Out[13]:
array([ 0.,  1.])

Here, the labels are determined from an additional set of observations (in this case multiple observations in time), while the "data" is what we expect to observe for new stars (in this case, single-time photometry).

Let's plot some of this data to see what we're looking at:


In [14]:
plt.scatter(X[:, 0], X[:, 1], c=y,
            linewidth=0, alpha=0.3)
plt.xlabel('u-g')
plt.ylabel('g-r')
plt.axis('tight');


Whenever you use data within scikit-learn, it will need to be represented in a similar manner: a two-dimensional numerical matrix of samples and features.

Important Notes on Data

  • the entries of y must match the rows of X! If not, you're doing it wrong.
  • the entries must be numerical: for labels, this means you may have to convert strings into numbers (e.g. "RR-Lyrae" -> 0, "Main Sequence" -> 1; see the sketch below)
  • each entry in a given column of X must be the same "type" of measurement! If not, you will get nonsense results. For example, everything in column 1 must be a u-g measurement "corrected" for biases due to observing conditions, etc.
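
Here is a sketch of the string-to-number conversion mentioned above (with made-up labels); scikit-learn's LabelEncoder is one convenient option, though plain numpy works just as well:


In [ ]:
# A hedged sketch: converting string labels to numeric labels
import numpy as np
from sklearn.preprocessing import LabelEncoder

str_labels = np.array(['RR-Lyrae', 'Main Sequence', 'RR-Lyrae'])
y_encoded = LabelEncoder().fit_transform(str_labels)
print(y_encoded)   # [1 0 1]: classes are numbered in sorted (alphabetical) order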

Fitting a Model in Scikit-learn

Next we'll discuss briefly the scikit-learn API for model fitting. Scikit-learn has a very clean and well-defined API. Let's look at an example for a simple classification problem:


In [15]:
from sklearn.datasets.samples_generator import make_blobs
X, y = make_blobs(n_samples=500, centers=2,
                  random_state=0, cluster_std=1)

plt.scatter(X[:, 0], X[:, 1], c=y);


1. Scikit-learn Model families are Python Classes

If we want to use a particular machine learning model, the first thing to do is import the appropriate class from the scikit-learn package. Here we'll use a Stochastic Gradient Descent (SGD) classifier:


In [16]:
from sklearn.linear_model import SGDClassifier

2. Particular model types are instances of these classes

We can define the type of model we want by instantiating this class and setting any relevant hyperparameters, which are parameters that specify the particular model being used:


In [17]:
model = SGDClassifier(loss='hinge', penalty='l2')

3. Models are trained using the fit() method

Notice that up until now the data has not entered the picture. We can fit the model to our particular data by calling the fit() method of the model:


In [18]:
model.fit(X, y)


Out[18]:
SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1, eta0=0.0,
       fit_intercept=True, l1_ratio=0.15, learning_rate='optimal',
       loss='hinge', n_iter=5, n_jobs=1, penalty='l2', power_t=0.5,
       random_state=None, shuffle=False, verbose=0, warm_start=False)

4. Labels for new points can be predicted using the predict() method

Generalizing the model to new data is done through the predict() method, which returns the predicted labels:


In [19]:
# create some new points and predict their labels
X_new = [-2, -3] + [7, 10] * np.random.rand(5000, 2)
y_new = model.predict(X_new)

# plot together with old points
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=1);
plt.scatter(X_new[:, 0], X_new[:, 1], c=y_new, alpha=0.1)
plt.axis('tight');


Here the transparent points are the new sample whose labels are predicted from the model.

5. Model hyper-parameters are usually tuned using cross-validation.

Cross-validation holds out part of the training set and uses it to evaluate how well a particular model can be expected to perform on unseen data. This could be done by hand, but scikit-learn provides some nice convenience routines. For example, the cross_val_score function automatically computes a K-fold cross-validation (here we use five folds):


In [20]:
from sklearn.cross_validation import cross_val_score

for loss in ['hinge', 'log', 'perceptron']:
    for penalty in ['l2', 'l1']:
        print(loss, penalty)
        model = SGDClassifier(loss=loss, penalty=penalty)
        scores = cross_val_score(model, X, y, cv=5)
        print('  score = {0:.3f} +/- {1:.3f}'.format(scores.mean(), scores.std()))


hinge l2
  score = 0.946 +/- 0.022
hinge l1
  score = 0.956 +/- 0.014
log l2
  score = 0.958 +/- 0.012
log l1
  score = 0.950 +/- 0.019
perceptron l2
  score = 0.958 +/- 0.013
perceptron l1
  score = 0.960 +/- 0.013

Here the data is simple enough that the choice of hyperparameters does not drastically affect the fit, but for more complicated data and models this choice can be very important, as we will see later this afternoon.
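
scikit-learn can also automate this kind of grid scan for you. As a sketch, the same search could be written with GridSearchCV; note that it lives in sklearn.grid_search in the scikit-learn of this notebook's vintage, and in sklearn.model_selection from version 0.18 onward.


In [ ]:
# Sketch: the same hyperparameter scan, automated with GridSearchCV
# (in scikit-learn >= 0.18, import it from sklearn.model_selection instead)
from sklearn.grid_search import GridSearchCV

param_grid = {'loss': ['hinge', 'log', 'perceptron'],
              'penalty': ['l2', 'l1']}
grid = GridSearchCV(SGDClassifier(), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)   # best hyperparameter combination found
print(grid.best_score_)    # its cross-validated score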

The Strength of the Scikit-Learn API

The strength of scikit-learn's API is that essentially every estimator shares the same interface. For example, all it takes to fit a different model to our data is to substitute the class name (and associated hyperparameters). So if I want to try a support vector machine classifier, I can just import the class, copy and paste the code from above, and re-run the example:


In [21]:
# import and instantiate a new model
# SVC = support vector classifier
from sklearn.svm import SVC
model = SVC().fit(X, y)

# all this copied from above
y_new = model.predict(X_new)

plt.scatter(X[:, 0], X[:, 1], c=y, alpha=1);
plt.scatter(X_new[:, 0], X_new[:, 1], c=y_new, alpha=0.1)
plt.axis('tight');


This makes it very easy to compare the results of different models to determine what works best for your own data. For an exploration of some available algorithms, see http://scikit-learn.org/stable/

Practical Considerations for Astronomy

As I alluded to above, Astronomy has some practical considerations that make machine learning difficult to apply. Before you run out to apply scikit-learn algorithms to your data, there are a few things to keep in mind that come up often in astronomy:

Inhomogeneity of Observations

In order to use these algorithms, you must construct the X matrix, and its columns must be strictly uniform; that is, if you combine u-band measurements from two different datasets, you had better be sure you've calibrated them correctly so that the measurements can be considered the "same"! Otherwise, your model will fail.

Missing Data and Data Errors

Most machine learning methods cannot natively handle missing data, and are not generally constructed to handle data with reported errors. The machine learning community tends to ignore measurement errors entirely, and for missing data, practitioners tend to do some sort of imputation (e.g. filling in missing values using the mean of the column). Imputing missing values and ignoring errors is probably a bad idea in science. There are ways around this, but few ML resources discuss them. In general, some sort of forward-modeling approach (discussed in previous sessions) is better than classic machine learning when missing data and data errors are present.
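
For reference, here is a sketch of what naive column-mean imputation looks like, on made-up data with plain numpy; treat it as a cautionary illustration rather than a recommendation.


In [ ]:
# Sketch of naive column-mean imputation (made-up data; use with caution!)
import numpy as np

# missing entries are marked as NaN
X_missing = np.array([[1.0, 2.0],
                      [np.nan, 3.0],
                      [2.0, np.nan]])

col_means = np.nanmean(X_missing, axis=0)    # per-column mean, ignoring NaNs
missing = np.isnan(X_missing)
X_missing[missing] = np.take(col_means, np.where(missing)[1])
print(X_missing)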

There are some ML models which do handle errors robustly (take a look at some of the scikit-learn-style models in astroML), but most ML resources will be of no help. The Orange Book has some discussion of these caveats, and of how to work with noisy data in a machine learning context.

Statistical Differences in Training Data and Unknown Data

This is huge: if there is any statistical difference between the distributions of your training data and your unknown data, your algorithm probably will not work! This comes up all the time in astronomy. For example, if you're doing photometric redshifts, you might use a training sample which has spectroscopically-determined redshifts, and use it to predict the redshift of galaxies for which no spectroscopic measurements are available.

Note that signal-to-noise differences count as statistical differences (and this is very often a problem).

If you do this naively, STOP AND BACK AWAY RIGHT NOW!!! Spectroscopic surveys like SDSS have a particular target-selection pipeline which will bias your results: e.g. your training set will tend to have brighter galaxies, while your test set will tend to have fainter galaxies. An unrepresentative training set means you can't (automatically) trust your model. You can certainly correct for this (see e.g. Bloom et al.'s Random Forest work), but it takes some careful thinking about your data.

Keep this in mind: in almost all naive applications, (classic) machine learning is all but useless in astronomy. If you're careful about your training data, it can be useful, but in general I would advocate a forward-modeling approach in almost all circumstances.

Onward and Upward!

With those caveats in mind, let's take a look at applying the classical ML approach to some astro data...