Exploring Supervised Machine Learning: Regression

This notebook explores a regression task on astronomical data with Scikit-Learn. Much of the practice of machine learning relies on searching through documentation to learn (or remind yourself) how things are done. Recall that in IPython, you can see the documentation of any function using the ? character. For example:


In [ ]:
import numpy as np
np.linspace?

In addition, Google is your friend. Do you want to know how to import a Gaussian mixture model in scikit-learn? Search for "scikit-learn Gaussian Mixture" and go from there.

Below we'll explore a galaxy photometry dataset for regression, where the labels give the redshift (based on SDSS spectra for the training set).

(see also the Classification breakout)


In [ ]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

# set matplotlib figure style
mpl.style.use('ggplot') # only works in matplotlib 1.4+
mpl.rc('figure', figsize=(8, 6))

The Data: Photometric Redshifts

The photometric redshift data comes from the spectroscopic galaxy catalog from SDSS DR7. There is a wealth of data in this dataset, but we will focus on photometry and redshift measurements:


In [ ]:
from astroML.datasets import fetch_sdss_specgals
data = fetch_sdss_specgals()

This is a much more involved dataset than the one used in the classification example; it comes in the form of a NumPy structured array.


In [ ]:
data.dtype.names

We'll extract a matrix of (u,g,r,i,z) model magnitudes to tackle the "classic" photometric redshift problem:


In [ ]:
# put magnitudes in a matrix
mags = np.vstack([data['modelMag_%s' % f] for f in 'ugriz']).T
z = data['z']

In [ ]:
print(mags.shape)
print(z.shape)

More often, photometric redshifts are computed from galaxy colors rather than raw magnitudes: differencing adjacent bands removes the overall dependence on galaxy luminosity (and hence mass), which can otherwise cause problems, and it also helps with data calibration issues.


In [ ]:
# compute adjacent-band colors: u-g, g-r, r-i, i-z
colors = mags[:, :-1] - mags[:, 1:]

Nevertheless, this step does discard some potentially useful data, so it may not be the best choice!

Note that we're explicitly ignoring the errors in the data, which is a bad idea. We'll think about how you might address this problem further below.

Let's take a look at color-vs-redshift for a subset of these galaxies:


In [ ]:
fig, ax = plt.subplots()

N = 50000
ax.plot(colors[:N, 1], z[:N], ',', alpha=0.3)

ax.set(xlim=(0, 2),
       ylim=(0, 0.4),
       xlabel='g - r color',
       ylabel='redshift');

From this, it appears that a regression applied to the photometry should give us some information about the redshift.

Regression: Determining Photometric Redshift

Here we will explore a regression task using the above data.

1. Exploring the Data

With any machine learning task, it's a good idea to start by exploring the data. Plot some of the colors vs. redshift. What patterns do you see? If you were to create a rough photometric redshift by hand, what would you do? How accurate would you expect it to be?

Note that plotting the full dataset can be a bit slow; you might try just plotting the first 10,000 points or so.
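
One possible starting point (a sketch, not the only answer): plot each of the four adjacent-band colors against redshift for the first 10,000 points, using the colors and z arrays defined above, and look for trends.


In [ ]:
# one way to explore: each color vs. redshift for a subset of galaxies
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
color_labels = ['u - g', 'g - r', 'r - i', 'i - z']

N = 10000
for i, ax in enumerate(axes.flat):
    ax.plot(colors[:N, i], z[:N], ',', alpha=0.3)
    ax.set(xlabel=color_labels[i], ylabel='redshift', ylim=(0, 0.4))

fig.tight_layout()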


In [ ]:


In [ ]:


In [ ]:

2. Simple Regression

It's always a good idea to start with a simple regression algorithm to get a sense of baseline performance. In Scikit-learn, sklearn.linear_model.LinearRegression is a good candidate. Import and fit the model, and use cross-validation to determine the accuracy of the result.
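
For reference, here is a minimal sketch. Note that cross_val_score lives in sklearn.model_selection in current scikit-learn releases (older releases used sklearn.cross_validation):


In [ ]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

model = LinearRegression()

# the default scoring for regression is the R2 score
scores = cross_val_score(model, colors, z, cv=5)
print(scores)
print("mean R2: {0:.3f}".format(scores.mean()))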


In [ ]:


In [ ]:


In [ ]:

The default scoring for regression cross-validation is the R2 score (the coefficient of determination), for which 1.0 is perfect and lower values are worse. Other options include the mean absolute error, mean squared error, and median absolute error (in recent scikit-learn versions these scorers are named neg_mean_absolute_error, etc., because scorers follow a greater-is-better convention).

For the photometric redshift problem in particular, we're often less interested in average performance and more interested in minimizing things like catastrophic outliers (i.e. predictions which are very far from the true value). Scikit-learn doesn't have a built-in scoring method which accounts for this, but we can define our own scoring method to measure whatever we'd like. For example, here is a scoring function which computes the fraction of inputs within $\pm 0.05$ of the true redshift:


In [ ]:
def outlier_score(model, X, y, thresh=0.05):
    """Return the fraction of predictions within `thresh` of the true redshift."""
    y_pred = model.predict(X)
    diff = abs(y - y_pred)
    # the mean of a boolean array is the fraction of True values
    # (this also avoids integer division under Python 2)
    return np.mean(diff <= thresh)

Pass this scoring function to the scoring parameter of the cross-validation. What fraction does this model attain?
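
A minimal sketch: scikit-learn accepts any callable with the signature (model, X, y) as a scorer, so the function above can be passed directly.


In [ ]:
scores = cross_val_score(LinearRegression(), colors, z, cv=5,
                         scoring=outlier_score)
print("fraction within +/- 0.05: {0:.3f}".format(scores.mean()))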


In [ ]:


In [ ]:


In [ ]:

This illustrates one of the key pieces of doing machine learning with astronomical data: we can't simply rely on the defaults. We have to think about what we're actually after, and customize the scoring to measure it.

3. Moving to a More Complex Model

There are a large number of regression routines available in scikit-learn; one of the most interesting is the Random Forest regressor (for some insight into Random Forests, see the Random Forest Notebook). Repeat the above experiments using the random forest estimator. How do the R2 score and the outlier score compare to those obtained with simple linear regression?
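
One possible setup is sketched below. The forest is much slower to train than the linear model, so this sketch evaluates on a subset; the subset size Nsub and the hyperparameter values here are arbitrary choices.


In [ ]:
from sklearn.ensemble import RandomForestRegressor

Nsub = 20000  # subsample for speed; results on the full set will differ
rf = RandomForestRegressor(n_estimators=10, random_state=0)

print("R2:", cross_val_score(rf, colors[:Nsub], z[:Nsub], cv=5).mean())
print("outlier score:", cross_val_score(rf, colors[:Nsub], z[:Nsub], cv=5,
                                        scoring=outlier_score).mean())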


In [ ]:


In [ ]:


In [ ]:

4. Tuning the Hyperparameters

For random forests in particular, the choice of hyperparameters can be very important in finding the best fit. Adjust the n_estimators and max_depth parameters and see what the effect is. What's the best model for attaining a high R2 score? For attaining a low outlier fraction?
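
Before reaching for the grid-search tools in the next part, here is a sketch of one way to scan a handful of max_depth values by hand (the grid values, and the reuse of Nsub from above, are arbitrary choices):


In [ ]:
for max_depth in [5, 10, 20]:
    rf = RandomForestRegressor(n_estimators=10, max_depth=max_depth,
                               random_state=0)
    r2 = cross_val_score(rf, colors[:Nsub], z[:Nsub], cv=5).mean()
    out = cross_val_score(rf, colors[:Nsub], z[:Nsub], cv=5,
                          scoring=outlier_score).mean()
    print("max_depth={0}: R2={1:.3f}, frac within 0.05={2:.3f}"
          .format(max_depth, r2, out))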


In [ ]:


In [ ]:


In [ ]:

Searching through a grid of hyperparameters for the optimal model is a very common (and useful!) task in a machine learning analysis. Scikit-learn provides a Grid Search interface to enable this to be done quickly. Take a look at the scikit-learn Grid Search documentation and use the grid search tools to find the best combination of n_estimators and max_depth for your particular data. What are the best R2 score and outlier score you can attain with random forests on this dataset?
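
A sketch of how such a search might look; GridSearchCV lives in sklearn.model_selection in current releases, and the grid values below are arbitrary:


In [ ]:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [10, 50, 100],
              'max_depth': [5, 10, 20]}

# optimize the default R2 score first, then the custom outlier score
for scoring in [None, outlier_score]:
    grid = GridSearchCV(RandomForestRegressor(random_state=0),
                        param_grid, cv=5, scoring=scoring)
    grid.fit(colors[:Nsub], z[:Nsub])
    print(grid.best_params_, grid.best_score_)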


In [ ]:


In [ ]:


In [ ]:

5. More Regression Models

If you have more time, read through the scikit-learn documentation and take a look at other regression models that are available. Try some of them on this dataset – keep in mind that some don't scale all that well to large numbers of samples, so it may be beneficial to use only a subset of the training data.

Did you find any model which out-performs the random forest?
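
As one example of how such a comparison might be set up (the model choices and parameter values here are arbitrary, and Nsub is the subset size defined earlier):


In [ ]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor

# some models scale poorly with sample size, so stick to the subset
for model in [KNeighborsRegressor(n_neighbors=10),
              GradientBoostingRegressor(n_estimators=100)]:
    r2 = cross_val_score(model, colors[:Nsub], z[:Nsub], cv=5).mean()
    print("{0}: mean R2 = {1:.3f}".format(type(model).__name__, r2))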


In [ ]:


In [ ]:


In [ ]: