You'll need to modify the `DATA_HOME`

variable to the location of the datasets.

In this tutorial we'll use the the colors of 400,000+ galaxies from the Sloan Digital Sky Survey.

The data was downloaded with the script fetch_data.py. This script also includes a python implementation of the SQL query used to construct this data. This template can be modified to download more features if desired.

```
In [ ]:
```import os
DATA_HOME = os.path.abspath('C:/temp/AstroML/data/sdss_photoz/')

```
In [ ]:
```import numpy as np
data = np.load(os.path.join(DATA_HOME,'sdss_photoz.npy'))

The data is in a record array, as in the classification example

```
In [ ]:
```print(data.dtype.names)

Now we'll set up our data matrix `X`

and redshift `z`

```
In [43]:
```N = len(data)
X = np.zeros((N, 4), dtype=np.float32)
X[:, 0] = data['u'] - data['g']
X[:, 1] = data['g'] - data['r']
X[:, 2] = data['r'] - data['i']
X[:, 3] = data['i'] - data['z']
z = data['redshift']
X = np.nan_to_num(X)

```
In [44]:
```Ntrain = 3 * N / 4
Xtrain = X[:Ntrain]
ztrain = z[:Ntrain]
Xtest = X[Ntrain:]
ztest = z[Ntrain:]

`DecisionTreeRegressor`

method
to train a model and predict redshifts for the test set based
on a 20-level decision tree:

```
In [45]:
```from sklearn.tree import DecisionTreeRegressor
clf = DecisionTreeRegressor(max_depth=20)
clf.fit(Xtrain, ztrain)
zpred = clf.predict(Xtest)

```
In [46]:
```rms = np.sqrt(np.mean((ztest - zpred) ** 2))
print rms

```
```

```
In [47]:
```print "Number of test samples: ", len(ztest)
print "Number of catastrophic errors:", np.sum(abs(ztest - zpred) > 1)

```
```

```
In [50]:
```%matplotlib inline
import matplotlib.pyplot as pl

```
In [51]:
```ax = pl.axes()
pl.scatter(ztest, zpred, c='k', lw=0, s=4)
axis_lim = np.array([0, 2.5])
# plot the true redshift
pl.plot(axis_lim, axis_lim, '--k')
# plot +/- the rms
pl.plot(axis_lim, axis_lim + rms, '--r')
pl.plot(axis_lim, axis_lim - rms, '--r')
pl.xlim(axis_lim)
pl.ylim(axis_lim)
pl.title('Photo-z: Decision Tree Regression')
pl.xlabel(r'$\mathrm{z_{true}}$', fontsize=14)
pl.ylabel(r'$\mathrm{z_{phot}}$', fontsize=14)

```
Out[51]:
```

The true and predicted redshifts of 102,798 SDSS galaxies, using a simple decision tree regressor. Notice the presece of catastrophic outliers: those galaxies whose predicted redshifts are extremely far from the true value.

Later, in Exercise #2, we will attempt to improve on this by optimizing the parameters of the decision tree.

In practice, the solutions to the photometric redshift problem can benefit from approaches that use physical intuition as well as machine learning tools. For example, some solutions involve the use of libraries of synthetic galaxy spectra which are known to be representative of the true galaxy distribution. This extra information can be used either directly, in a physically motivated analysis, or can be used to generate a larger suite of artificial training instances for a pure machine learning approach.