Here we will take a closer look at the photometric redshift problem discussed in section 5 of the tutorial. Using the decision tree classifier, we'll take a look at the 4-color observations of just over 400,000 galaxies observed by the Sloan Digital Sky Survey.

We're trying to determine photometric redshifts of galaxies, which are related
to their distances, and useful in a wide variety of cosmological studies.
The point of this exercise is to answer the question: how can we get the rms
error down to below 0.1? Would it be a better use of telescope time to
**observe more objects**, or to **observe additional features** of the objects
in the data set? To answer this, we'll make use of the learning curve
techniques discussed in section 3 of the tutorial.

Because we're going to be plotting things below, we'll first make sure we're in ipython's pylab mode:

```
In [ ]:
```%pylab inline
import pylab as pl

`fetch_data.py`

script has been run
to download the data locally.
If the data is in a different location, you can change the
`DATA_HOME`

variable below.

```
In [ ]:
```import os
DATA_HOME = os.path.abspath('../data/sdss_photoz')

```
In [ ]:
```import numpy as np
data = np.load(os.path.join(DATA_HOME, 'sdss_photoz.npy'))

```
In [ ]:
```data = data[:50000]

Now as usual, we'll put the data into the matrix form expected by scikit-learn:

```
In [ ]:
```print '%i points' % data.shape[0]
u, g, r, i, z = [data[f] for f in 'ugriz']
X = np.zeros((len(data), 4))
X[:, 0] = u - g
X[:, 1] = g - r
X[:, 2] = r - i
X[:, 3] = i - z
y = data['redshift']

Now let's divide our data into the training, cross-validation, and test samples.

```
In [ ]:
```Ntot = len(y)
Ncv = Ntot / 5
Ntest = Ntot / 5
Ntrain = Ntot - Ncv - Ntest
X_train = X[:Ntrain]
y_train = y[:Ntrain]
X_cv = X[Ntrain:Ntrain + Ncv]
y_cv = y[Ntrain:Ntrain + Ncv]
X_test = X[Ntrain + Ncv:]
y_test = y[Ntrain + Ncv:]

Recall from the tutorial the use of the decision tree for regression:

```
In [ ]:
```from sklearn.tree import DecisionTreeRegressor
clf = DecisionTreeRegressor(max_depth=15)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_cv)
pl.scatter(y_cv, y_pred, c='black', s=4, lw=0)
pl.plot([0, 3], [0, 3], '-', color='gray')
pl.xlabel('true redshift')
pl.ylabel('predicted redshift')
pl.xlim(0, 2.5)
pl.ylim(0, 2.5)

*catastrophic outliers*: i.e. points which
have predicted redshifts very far from their true redshifts.

For the first part of this exercise, we'll use the cross-validation
techniques discussed earlier to determine the best hyper-parameter
choice for the decision tree regressor. Here you'll plot
the training error and cross-validation error as a function of the
meta-parameter `max_depth`

.

We'll first define a convenience function to compute the rms error:

```
In [ ]:
```from sklearn import metrics
def compute_rms_error(y_pred, y_true):
return np.sqrt(metrics.mean_squared_error(y_pred, y_true))

`max_depth_array`

, `train_error`

, and `cv_error`

.
Use at least 10 different values of `max_depth`

, and compute the training
and cross-validation error associated with each of them.
for power-users, note that this is a natural application for
ipython's parallel processing capabilities, but that's beyond
the scope of this tutorial.

```
In [ ]:
```from sklearn.tree import DecisionTreeRegressor
# we'll explore results for max_depth from 1 to 20
max_depth_array = np.arange(1, 21)
train_error = np.zeros(len(max_depth_array))
cv_error = np.zeros(len(max_depth_array))

```
In [ ]:
```# here you'll write the code to compute the training error and
# test error for the Decision Tree Regressor at each value of
# max_depth. You can type DecisionTreeRegressor? to see its
# documentation.

```
In [ ]:
```%load soln/02-01.py

Now we'll plot the results.

```
In [ ]:
```pl.plot(max_depth_array, cv_error, label='cross-val error')
pl.plot(max_depth_array, train_error, label='training error')
pl.legend(loc=0)
pl.xlabel('max depth')
pl.ylabel('error')

`max_depth`

around 8. We can automatically find the optimal
value using `numpy.argmin`

:

```
In [ ]:
```# select the value of max_depth which led to the best results
max_depth = max_depth_array[np.argmin(cv_error)]
print "max_depth =", max_depth
print "cross-val error = ", cv_error.min()

At this point, we have a simple decision tree estimator which gives a cross-validation error around 0.22. How can we improve on this? We could certainly use a more sophisticated model: that will be explored below. But how could our data be made better?

As noted during the tutorial, there are several options:

- get more training data. This involves pointing the telescope at new parts of the sky.
- get more features for the existing data. This involves observing the same objects over again, perhaps with telescopes which are sensitive to different regions of the electromagnetic spectrum.

We'll use learning curves to determine this.

We're now going to plot the training error and cross-validation error
as a function of the number of training samples. Remember to
**Make sure** that when computing the training error for each number of
samples, you use the same samples that the model was trained on.

```
In [ ]:
```# we'll explore 20 training set sizes
# sample them evenly between 50 and the number of training samples
n_samples_array = np.linspace(50, Ntrain, 20).astype(int)
train_error_2 = np.zeros(n_samples_array.shape)
cv_error_2 = np.zeros(n_samples_array.shape)

```
In [ ]:
```# Here you will fill-in the train_error_2 array and the cv_error_2 array
# with values that correspond to the number of samples.
# Make sure that when computing the training error for each number of
# samples, you use the same samples that the model was trained on.

```
In [ ]:
```%loadpy soln/02-02.py

```
In [ ]:
```pl.plot(n_samples_array, cv_error_2, label='cross-val error')
pl.plot(n_samples_array, train_error_2, label='training error')
pl.legend(loc=0)
pl.xlabel('number of samples')
pl.ylabel('error')
pl.title('max_depth = %i' % max_depth)

Based on this, we can guess that the best possible rms error for the
Decision Tree Regressor with the given value of `max_depth`

is around 0.2,
and that using more samples will not significantly improve the results
(the two lines will simply approach each other asymtotically).

Note, though, that if more samples were added, a higher value of
`max_depth`

might be a better choice. These lines are really
two-dimensional slices of surfaces in three dimensions:

```
(number of samples) x (max_depth) x (rms error)
```

You can re-run this analysis using more points by commenting-out the line

```
#data = data[:50000]
```

which is found near the beginning of the notebook, and choosing "Run All" from the ipython menu. Note that this might take a very long time to execute, depending on how fast your computer is. How does this change the results? Is it as you would expect?

RMS error is not always the best metric to use. In the case
of photometric redshifts, the *catastrophic errors* mentioned
above can be much more damaging to scientific results. For
that reason, we might want to use a different metric when
exploring the above results.

We'll define a convenience function to compute the outlier fraction, and use this as a measure of how well the model performs.

```
In [ ]:
```def compute_outlier_fraction(y_pred, y_true, cutoff=0.2):
return np.sum((abs(y_pred - y_true) > cutoff)) * 1. / len(y_pred)

```
In [ ]:
```# repeat the two previous plots using compute_outlier_fraction() rather
# than compute_rms_error().
# feel free to make use of copying and pasting!

```
In [ ]:
```%loadpy soln/02-03a.py

```
In [ ]:
```%loadpy soln/02-03b.py

One way to improve on Decision Trees is to use *Ensemble Methods*.
Ensemble methods (also known as *boosting* and *bagging*) use ensembles
of randomized estimators which can prevent over-fitting and lead to very
powerful learning algorithms.

It is interesting to see how small an RMS you can attain on the photometric redshift problem using a more sophisticated method. Try one of the following:

`sklearn.ensemble.RandomForestRegressor`

`sklearn.ensemble.GradientBoostingRegressor`

`sklearn.ensemble.ExtraTreesRegressor`

You can read more about each of these methods in the scikit-learn documentation:

Each method has hyperparameters which need to be determined using cross-validation steps like those above. Can you get the rms error for the test set down below 0.1?