Our previous breakout involved classification of RR Lyrae stars, and regression for photometric redshifts. For this session, we will revisit these problems using what we've learned about validation. The overall task is this: use cross-validation to select the best models for these data.
In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# use seaborn plotting defaults
# If this causes an error, you can comment it out.
import seaborn as sns
sns.set()
In [2]:
from astroML.datasets import fetch_rrlyrae_combined
from sklearn.model_selection import train_test_split
X, y = fetch_rrlyrae_combined()
N_plot = 5000
plt.scatter(X[-N_plot:, 0], X[-N_plot:, 1], c=y[-N_plot:],
            edgecolors='none', cmap='RdBu')
plt.xlabel('u-g color')
plt.ylabel('g-r color')
plt.xlim(0.7, 1.4)
plt.ylim(-0.2, 0.4);
Now we want to fit an SVM classifier to this data, but adjust the SVM parameters to find the optimal model.
The Support Vector Classifier, SVC, has several hyperparameters which affect the final fit:
- kernel: can be 'rbf' (radial basis function) or 'linear', among others. This controls whether a linear or kernel fit is used.
- C: the SVC penalty parameter.
- gamma: the kernel coefficient for the 'rbf' kernel.
You can see more using IPython's help feature:
In [3]:
from sklearn.svm import SVC
SVC?
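As a concrete starting point, here is a minimal sketch of fitting a single SVC with explicit (untuned) hyperparameter values; the subsample size and the particular C and gamma below are arbitrary illustrative choices:
In [ ]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# SVC training scales poorly with sample size, so work with a random subsample
rng = np.random.RandomState(0)
idx = rng.choice(len(y), 4000, replace=False)
Xsub, ysub = X[idx], y[idx].astype(int)

Xtrain, Xtest, ytrain, ytest = train_test_split(Xsub, ysub, random_state=0)

clf = SVC(kernel='rbf', C=1.0, gamma=0.1)
clf.fit(Xtrain, ytrain)
# mean accuracy; beware that with so few RR Lyrae, accuracy alone is misleading
print(clf.score(Xtest, ytest))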
Using sklearn.model_selection.cross_val_score, explore various values for these parameters. Recall the discussion from the previous breakout. What is the best completeness you can obtain? What is the best precision?
Use the concept of validation curves and learning curves to determine how this could be improved. Would you expect more training samples to help? More features for the current samples? A more complicated model?
In [4]:
from sklearn.model_selection import cross_val_score
cross_val_score?
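For instance, one possible (untuned) exploration, reusing the subsample defined above: for this binary problem the 'recall' scorer measures completeness (the fraction of true RR Lyrae recovered) and 'precision' measures the fraction of positive identifications that are correct.
In [ ]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# scan a small illustrative grid of hyperparameters
for C in [0.1, 1.0, 10.0]:
    for gamma in [0.01, 0.1, 1.0]:
        clf = SVC(kernel='rbf', C=C, gamma=gamma)
        recall = cross_val_score(clf, Xsub, ysub, cv=3, scoring='recall').mean()
        prec = cross_val_score(clf, Xsub, ysub, cv=3, scoring='precision').mean()
        print('C={:<4} gamma={:<4} completeness={:.3f} precision={:.3f}'
              .format(C, gamma, recall, prec))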
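One way to visualize over- and under-fitting is a validation curve; here is a sketch over gamma, again on the subsample (the gamma grid and the choice of scorer are illustrative):
In [ ]:
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

gamma_range = np.logspace(-2, 2, 10)
train_scores, val_scores = validation_curve(
    SVC(kernel='rbf', C=1.0), Xsub, ysub,
    param_name='gamma', param_range=gamma_range,
    cv=3, scoring='recall')

# training score keeps rising with model complexity; validation score turns over
plt.semilogx(gamma_range, train_scores.mean(axis=1), label='training')
plt.semilogx(gamma_range, val_scores.mean(axis=1), label='validation')
plt.xlabel('gamma')
plt.ylabel('recall (completeness)')
plt.legend();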
We'll now do a similar validation exercise for the photometric redshift problem, using SDSS DR7 spectroscopic galaxies and sklearn.ensemble.RandomForestRegressor. The parameters you should explore are n_estimators, criterion, and max_depth. You can read more about these with IPython's help functionality:
In [5]:
from sklearn.ensemble import RandomForestRegressor
RandomForestRegressor?
Here's the code again to download the data:
In [6]:
from astroML.datasets import fetch_sdss_specgals
data = fetch_sdss_specgals()
# put magnitudes in a matrix
feature_names = ['modelMag_%s' % f for f in 'ugriz']
X = np.vstack([data[f] for f in feature_names]).T
y = data['z']
In [7]:
# Plot some magnitudes for the first two thousand points
i, j = 0, 1
N = 2000
plt.scatter(X[:N, i], X[:N, j], c=y[:N],
            edgecolor='none', cmap='cubehelix')
plt.xlabel(feature_names[i])
plt.ylabel(feature_names[j])
plt.colorbar(label='redshift');
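To get oriented, here is a sketch of a small cross-validation scan; the subsample size and the parameter grid are illustrative, and recent scikit-learn versions expose MSE through the 'neg_mean_squared_error' scorer, which returns negated values:
In [ ]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# the full catalog is large, so explore on a random subsample first
rng = np.random.RandomState(0)
idx = rng.choice(len(y), 20000, replace=False)
Xs, ys = X[idx], y[idx]

for max_depth in [5, 10, 20]:
    rf = RandomForestRegressor(n_estimators=50, max_depth=max_depth,
                               random_state=0)
    scores = cross_val_score(rf, Xs, ys, cv=3, scoring='neg_mean_squared_error')
    print('max_depth={:>2}  MSE={:.5f}'.format(max_depth, -scores.mean()))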
Think about what you know about the random forest.
Think about how over-fitting and under-fitting might affect the results: how do you expect n_estimators and max_depth to affect these? Can you make some learning curve plots which confirm this expectation?
What is the best mean squared error you can find for this data? (Use sklearn.metrics.mean_squared_error.)
Often for photometric redshifts, one is not concerned with mean squared error, but with minimizing catastrophic outliers: that is, points for which the redshift is off by (say) 0.5 or more. Can you find a combination of model parameters which leads to the lowest catastrophic outlier rate? (Note that you can provide a scoring function to sklearn.model_selection.cross_val_score; one possible approach is sketched below.)
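A sketch of one way to do this (the outlier_rate helper and the 0.5 threshold are our own illustrative choices), wrapping a custom metric with sklearn.metrics.make_scorer:
In [ ]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def outlier_rate(y_true, y_pred, threshold=0.5):
    # fraction of points whose predicted redshift is off by more than threshold
    return np.mean(np.abs(y_true - y_pred) > threshold)

# lower is better, so the scorer negates the value internally
outlier_scorer = make_scorer(outlier_rate, greater_is_better=False)

rf = RandomForestRegressor(n_estimators=50, max_depth=15, random_state=0)
rates = cross_val_score(rf, Xs, ys, cv=3, scoring=outlier_scorer)
print('catastrophic outlier rate: {:.4f}'.format(-rates.mean()))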
Create some learning curves for this data (one possible sketch appears after the cells below). If you wanted to improve random forest photometric redshift results, would it be more fruitful to:
A. Gather more training samples (i.e. more galaxies with spectroscopic redshifts), or
B. Gather more features (i.e. more photometric observations for each existing sample)?
In [8]:
from sklearn.metrics import mean_squared_error
In [9]:
from sklearn.model_selection import cross_val_score
cross_val_score?
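Finally, a learning-curve sketch for one (arbitrary) random forest configuration, reusing the subsample from above. If the training and validation curves have already converged, more samples (option A) will help little, and adding features (option B) or model capacity is the more promising route:
In [ ]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(n_estimators=50, max_depth=15, random_state=0),
    Xs, ys, train_sizes=np.linspace(0.1, 1.0, 5),
    cv=3, scoring='neg_mean_squared_error')

# scores are negated MSE; flip the sign so lower curves mean better fits
plt.plot(sizes, -train_scores.mean(axis=1), label='training MSE')
plt.plot(sizes, -val_scores.mean(axis=1), label='validation MSE')
plt.xlabel('training set size')
plt.ylabel('mean squared error')
plt.legend();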