Exercise 2: Photometric Redshift Determination

Here we will take a closer look at the photometric redshift problem discussed in section 5 of the tutorial. Using the decision tree classifier, we'll take a look at the 4-color observations of just over 400,000 galaxies observed by the Sloan Digital Sky Survey.

We're trying to determine photometric redshifts of galaxies, which are related to their distances, and useful in a wide variety of cosmological studies. The point of this exercise is to answer the question: how can we get the rms error down to below 0.1? Would it be a better use of telescope time to observe more objects, or to observe additional features of the objects in the data set? To answer this, we'll make use of the learning curve techniques discussed in section 3 of the tutorial.

Because we're going to be plotting things below, we'll first make sure we're in ipython's pylab mode:



In [ ]:

    
%pylab inline
import pylab as pl

Loading Data

This tutorial assumes the notebook is within the tutorial directory structure, and that the fetch_data.py script has been run to download the data locally. If the data is in a different location, you can change the DATA_HOME variable below.



In [ ]:

    
import os
DATA_HOME = os.path.abspath('../data/sdss_photoz')



In [ ]:

    
import numpy as np
data = np.load(os.path.join(DATA_HOME, 'sdss_photoz.npy'))

Here we'll truncate the data to 50,000 points. This will allow the code below to be run quickly while it's being written. When we're satisfied that the code is ready to go, we can comment out this line and re-run the analysis.



In [ ]:

    
data = data[:50000]

Now as usual, we'll put the data into the matrix form expected by scikit-learn:



In [ ]:

    
print '%i points' % data.shape[0]
u, g, r, i, z = [data[f] for f in 'ugriz']

X = np.zeros((len(data), 4))
X[:, 0] = u - g
X[:, 1] = g - r
X[:, 2] = r - i
X[:, 3] = i - z

y = data['redshift']

Now let's divide our data into the training, cross-validation, and test samples.



In [ ]:

    
Ntot = len(y)

Ncv = Ntot / 5
Ntest = Ntot / 5
Ntrain = Ntot - Ncv - Ntest

X_train = X[:Ntrain]
y_train = y[:Ntrain]

X_cv = X[Ntrain:Ntrain + Ncv]
y_cv = y[Ntrain:Ntrain + Ncv]

X_test = X[Ntrain + Ncv:]
y_test = y[Ntrain + Ncv:]

A Basic Decision Tree

Recall from the tutorial the use of the decision tree for regression:



In [ ]:

    
from sklearn.tree import DecisionTreeRegressor

clf = DecisionTreeRegressor(max_depth=15)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_cv)

pl.scatter(y_cv, y_pred, c='black', s=4, lw=0)
pl.plot([0, 3], [0, 3], '-', color='gray')

pl.xlabel('true redshift')
pl.ylabel('predicted redshift')

pl.xlim(0, 2.5)
pl.ylim(0, 2.5)

With this simple model, the points are generally clustered toward the true values, but there are many catastrophic outliers: i.e. points which have predicted redshifts very far from their true redshifts.

Part 1: Improving our Model

For the first part of this exercise, we'll use the cross-validation techniques discussed earlier to determine the best hyper-parameter choice for the decision tree regressor. Here you'll plot the training error and cross-validation error as a function of the meta-parameter max_depth.

We'll first define a convenience function to compute the rms error:



In [ ]:

    
from sklearn import metrics

def compute_rms_error(y_pred, y_true):
    return np.sqrt(metrics.mean_squared_error(y_pred, y_true))

For the first part of this exercise, you will create three arrays: max_depth_array, train_error, and cv_error. Use at least 10 different values of max_depth, and compute the training and cross-validation error associated with each of them. for power-users, note that this is a natural application for ipython's parallel processing capabilities, but that's beyond the scope of this tutorial.



In [ ]:

    
from sklearn.tree import DecisionTreeRegressor

# we'll explore results for max_depth from 1 to 20
max_depth_array = np.arange(1, 21)
train_error = np.zeros(len(max_depth_array))
cv_error = np.zeros(len(max_depth_array))



In [ ]:

    
# here you'll write the code to compute the training error and
# test error for the Decision Tree Regressor at each value of
# max_depth.  You can type DecisionTreeRegressor? to see its
# documentation.

If the notebook is within the tutorial directory structure, the following command will load the solution:



In [ ]:

    
%load soln/02-01.py

Now we'll plot the results.



In [ ]:

    
pl.plot(max_depth_array, cv_error, label='cross-val error')
pl.plot(max_depth_array, train_error, label='training error')

pl.legend(loc=0)
pl.xlabel('max depth')
pl.ylabel('error')

You should see a minimum of the cross-validation error for max_depth around 8. We can automatically find the optimal value using numpy.argmin:



In [ ]:

    
# select the value of max_depth which led to the best results
max_depth = max_depth_array[np.argmin(cv_error)]
print "max_depth =", max_depth
print "cross-val error = ", cv_error.min()

So we see that after using cross-validation to avoid over-fitting, a simple decision tree got our cross-validation error down to around 0.2.

Part 2: Improving our Data

At this point, we have a simple decision tree estimator which gives a cross-validation error around 0.22. How can we improve on this? We could certainly use a more sophisticated model: that will be explored below. But how could our data be made better?

As noted during the tutorial, there are several options:

get more training data. This involves pointing the telescope at new parts of the sky.
get more features for the existing data. This involves observing the same objects over again, perhaps with telescopes which are sensitive to different regions of the electromagnetic spectrum.

We'll use learning curves to determine this.

We're now going to plot the training error and cross-validation error as a function of the number of training samples. Remember to Make sure that when computing the training error for each number of samples, you use the same samples that the model was trained on.



In [ ]:

    
# we'll explore 20 training set sizes
# sample them evenly between 50 and the number of training samples
n_samples_array = np.linspace(50, Ntrain, 20).astype(int)
train_error_2 = np.zeros(n_samples_array.shape)
cv_error_2 = np.zeros(n_samples_array.shape)



In [ ]:

    
#  Here you will fill-in the train_error_2 array and the cv_error_2 array
#  with values that correspond to the number of samples.
#  Make sure that when computing the training error for each number of
#  samples, you use the same samples that the model was trained on.

If the notebook is within the tutorial directory structure, the following command will load the solution:



In [ ]:

    
%loadpy soln/02-02.py



In [ ]:

    
pl.plot(n_samples_array, cv_error_2, label='cross-val error')
pl.plot(n_samples_array, train_error_2, label='training error')

pl.legend(loc=0)
pl.xlabel('number of samples')
pl.ylabel('error')
pl.title('max_depth = %i' % max_depth)

Based on this, we can guess that the best possible rms error for the Decision Tree Regressor with the given value of max_depth is around 0.2, and that using more samples will not significantly improve the results (the two lines will simply approach each other asymtotically).

Note, though, that if more samples were added, a higher value of max_depth might be a better choice. These lines are really two-dimensional slices of surfaces in three dimensions:

 (number of samples) x (max_depth) x (rms error)

You can re-run this analysis using more points by commenting-out the line

#data = data[:50000]

which is found near the beginning of the notebook, and choosing "Run All" from the ipython menu. Note that this might take a very long time to execute, depending on how fast your computer is. How does this change the results? Is it as you would expect?

Part 3: A different metric

RMS error is not always the best metric to use. In the case of photometric redshifts, the catastrophic errors mentioned above can be much more damaging to scientific results. For that reason, we might want to use a different metric when exploring the above results.

We'll define a convenience function to compute the outlier fraction, and use this as a measure of how well the model performs.



In [ ]:

    
def compute_outlier_fraction(y_pred, y_true, cutoff=0.2):
    return np.sum((abs(y_pred - y_true) > cutoff)) * 1. / len(y_pred)

Here, your task is to re-implement what's above using the outlier fraction rather than the rms error as a cost function.



In [ ]:

    
# repeat the two previous plots using compute_outlier_fraction() rather
# than compute_rms_error().
# feel free to make use of copying and pasting!

If the notebook is within the tutorial directory structure, the following commands will load the solution:



In [ ]:

    
%loadpy soln/02-03a.py



In [ ]:

    
%loadpy soln/02-03b.py

Compared to the rms error, does the outlier fraction prefer more or less complicated models? Would you recommend more data? A more complicated model?

Bonus

One way to improve on Decision Trees is to use Ensemble Methods. Ensemble methods (also known as boosting and bagging) use ensembles of randomized estimators which can prevent over-fitting and lead to very powerful learning algorithms.

It is interesting to see how small an RMS you can attain on the photometric redshift problem using a more sophisticated method. Try one of the following:

sklearn.ensemble.RandomForestRegressor
sklearn.ensemble.GradientBoostingRegressor
sklearn.ensemble.ExtraTreesRegressor

You can read more about each of these methods in the scikit-learn documentation:

http://scikit-learn.org/stable/modules/ensemble.html

Each method has hyperparameters which need to be determined using cross-validation steps like those above. Can you get the rms error for the test set down below 0.1?