This notebook shows how to estimate the distance to a galaxy from measurements of its optical light. More specifically, we estimate the galaxy's redshift, which is directly related to its distance from us, using 5-band photometry from the Sloan Digital Sky Survey (SDSS).
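To make the redshift-distance link concrete, here is a minimal sketch using the low-redshift Hubble-law approximation, d = c z / H0. The function name `approx_distance_mpc` and the value H0 = 70 km/s/Mpc are assumptions made for illustration and are not part of the original notebook. The approximation is only reasonable for z much smaller than 1, so it becomes rough toward the upper end of this sample (z = 0.5).

```python
# Low-redshift Hubble-law approximation: d is roughly c * z / H0.
# H0 = 70 km/s/Mpc is an assumed round value, not taken from this notebook.
C_KM_S = 299792.458   # speed of light in km/s
H0_KM_S_MPC = 70.0    # assumed Hubble constant in km/s/Mpc


def approx_distance_mpc(z, h0=H0_KM_S_MPC):
    """Approximate distance in Mpc for redshift z (reasonable only for z << 1)."""
    return C_KM_S * z / h0


for z in (0.01, 0.05, 0.1):
    print('z = %.2f  ->  d ~ %.0f Mpc' % (z, approx_distance_mpc(z)))
```

A full treatment would integrate over the expansion history for a chosen cosmology; the linear form is used here only to show why an estimate of redshift is also an estimate of distance.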

Load the SDSS data.



In [1]:

    
# Load the SDSS data: photometry for 100k galaxies
# with spectroscopic redshifts.
# The SQL call is in sdss_SQL.txt.
%pylab inline
from pandas import read_csv
d = read_csv('sdss_specz_r20_z0p01_z0p5.csv')

Populating the interactive namespace from numpy and matplotlib

Plot some photometry data, color-coded by (the logarithm of) spectroscopic redshift.



In [2]:

    
from plotting import plot_photodata
plot_photodata(d)

The main thing to take away from these figures is that redshift is a complex, non-linear function of the photometric data. Below we'll approximate this function using a Random Forest Regressor.

Build the photo-z estimator.



In [3]:

    
# We'll be using the scikit-learn machine learning library.
# Construct the feature array `X` and the target redshifts, `y`.
# Features: four adjacent colors (u-g, g-r, r-i, i-z) plus the r-band magnitude.
X = np.vstack([d.u-d.g, d.g-d.r, d.r-d.i, d.i-d.z, d.r]).T
y = d.redshift

# Split the data into training and testing samples.
np.random.seed(0)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Fit a random forest regressor to the training data.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=10, n_jobs=4)
rf.fit(X_train, y_train);
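
The cell above needs the SDSS CSV file. As a rough, self-contained sketch of the same steps, the example below runs the pipeline on synthetic data. Everything in it is made up for illustration: the magnitudes, the toy "redshift" relation, and the names `X_demo`, `y_demo` and `rf_demo`. It assumes a scikit-learn version that provides `sklearn.model_selection`, and it passes `random_state` explicitly rather than relying on the global NumPy seed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the SDSS table: five made-up magnitudes per object.
rng = np.random.RandomState(0)
n = 2000
mags = rng.normal(loc=18.0, scale=1.0, size=(n, 5))  # columns: u, g, r, i, z
u, g, r, i, z_band = mags.T

# Same feature layout as the notebook: four adjacent colors plus r.
X_demo = np.vstack([u - g, g - r, r - i, i - z_band, r]).T

# Made-up non-linear "redshift" so the forest has something to learn.
y_demo = 0.1 + 0.05 * np.tanh(g - r) + 0.01 * rng.normal(size=n)

# Split, fit, and predict, mirroring the steps in the cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0)
rf_demo = RandomForestRegressor(n_estimators=10, random_state=0)
rf_demo.fit(X_tr, y_tr)
y_hat = rf_demo.predict(X_te)

print('Demo residual scatter: %0.3f' % np.std(y_hat - y_te))
```

The numbers this prints say nothing about real photo-z performance; the point is only that the feature construction, split, and fit can be exercised without the data file.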

How well does it work?



In [4]:

    
# Make predictions for the redshifts of the test data.
y_predict = rf.predict(X_test)

# Calculate the residual error, with and without the worst 1% of outliers.
err = y_predict - y_test
abs_err = np.abs(err)
err99 = np.std(err[abs_err < np.percentile(abs_err, 99)])
print('The residual error is %0.3f.' % np.std(err))
print('The residual error of the best 99%% is %0.3f.' % err99)
print('The median bias is small: %0.3f.' % np.median(err))
print('The mean r-band magnitude of the sample is %0.2f.' % np.mean(d.r))

The residual error is 0.021.
The residual error of the best 99% is 0.015.
The median bias is small: 0.001.
The mean r-band magnitude of the sample is 16.64.
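
The three error statistics printed above can be gathered into one small helper, sketched below. The function name `residual_summary` and the `keep_percent` parameter are hypothetical and introduced only for this example. The toy arrays show how a single large outlier inflates the plain standard deviation while leaving the clipped value and the median untouched.

```python
import numpy as np


def residual_summary(y_true, y_pred, keep_percent=99):
    """Return (std of all residuals, std of the kept residuals, median bias).

    Residuals whose absolute value is at or above the `keep_percent`
    percentile of |residual| are dropped before the second std is computed.
    """
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    abs_err = np.abs(err)
    keep = abs_err < np.percentile(abs_err, keep_percent)
    return np.std(err), np.std(err[keep]), np.median(err)


# Toy data: 99 perfect predictions and one prediction that is off by 1.0.
y_true_toy = np.zeros(100)
y_pred_toy = np.zeros(100)
y_pred_toy[0] = 1.0

full_std, clipped_std, bias = residual_summary(y_true_toy, y_pred_toy)
print('full std = %0.4f, clipped std = %0.4f, median bias = %0.4f'
      % (full_std, clipped_std, bias))
```

With the notebook's variables, a call such as `residual_summary(y_test, y_predict)` would be the equivalent of the three residual statistics in the cell above.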

Summary

We were able to build a reasonably accurate photo-z estimator in only a few lines of Python. The RMS of the redshift error is 0.021, or 0.015 if we ignore the worst 1% of outliers. This is comparable to the RMS=0.018 value reported on the SDSS website and the RMS=0.023 reported in this paper. That said, we may be using slightly different datasets, and the other photo-z estimators have almost certainly been tested much more thoroughly than the one presented here.