Solutions: Model Validation Breakout

This notebook contains some possible solutions to the Model Validation Breakout.

Preliminaries

Again, we'll start with some boilerplate imports and setup:


In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# use seaborn plotting defaults
# If this causes an error, you can comment it out.
import seaborn as sns
sns.set()

RR Lyrae

Just as we did in the previous breakout, we'll take a look at the RR Lyrae data that we'll be classifying:


In [2]:
from astroML.datasets import fetch_rrlyrae_combined
from sklearn.cross_validation import train_test_split

X, y = fetch_rrlyrae_combined()

N_plot = 5000
plt.scatter(X[-N_plot:, 0], X[-N_plot:, 1], c=y[-N_plot:],
            edgecolors='none', cmap='RdBu')
plt.xlabel('u-g color')
plt.ylabel('g-r color')
plt.xlim(0.7, 1.4)
plt.ylim(-0.2, 0.4);


Now we want to fit an SVM classifier to this data, adjusting the SVM hyperparameters to find the optimal model. The Support Vector Classifier, SVC, has several hyperparameters which affect the final fit:

  • kernel: can be 'rbf' (radial basis function) or 'linear', among others. This controls whether a linear decision boundary or a nonlinear kernelized fit is used
  • C: the SVC penalty parameter
  • gamma: the kernel coefficient for rbf
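
For concreteness, here is a minimal sketch (not an executed cell from this notebook) of how these hyperparameters are passed to the constructor; the particular values of C and gamma are arbitrary illustrations, not recommendations:

from sklearn.svm import SVC

# an rbf-kernel classifier with an explicit penalty parameter C and
# kernel coefficient gamma; these values are arbitrary examples
model = SVC(kernel='rbf', C=1.0, gamma=0.1, class_weight='auto')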

You can see more using IPython's help feature:


In [3]:
from sklearn.svm import SVC
SVC?

Classification breakout questions

  1. Using sklearn.cross_validation.cross_val_score, explore various values for these parameters. Recall the discussion from the previous breakout. What is the best completeness you can obtain? What is the best precision?

  2. Use the concept of validation curves and learning curves to determine how this could be improved. Would you expect more training samples to help? More features for the current samples? A more complicated model?


In [4]:
from sklearn.cross_validation import cross_val_score
cross_val_score?

Let's start by working with just a subset of the data, because these things can be pretty slow:


In [5]:
from sklearn.cross_validation import train_test_split
Xsubset, _, ysubset, _ = train_test_split(X, y, train_size=0.1)

In [6]:
Xsubset.shape, ysubset.shape


Out[6]:
((9314, 4), (9314,))

We can get a cross-validation score for a model as follows, using the f1-score (recall that the f1-score combines the notions of precision and recall):


In [7]:
model = SVC(kernel='linear', class_weight='auto')
cross_val_score(model, Xsubset, ysubset, scoring='f1')


Out[7]:
array([ 0.17610063,  0.24242424,  0.21582734])

But what we really want is the trend of the f1-score with different values of the inputs:


In [8]:
C_values = 10 ** np.linspace(-2, 8, 10)
f1_scores = []

for C in C_values:
    model = SVC(kernel='linear', class_weight='auto', C=C)
    cv = cross_val_score(model, Xsubset, ysubset, scoring='f1')
    f1_scores.append(cv.mean())

In [9]:
plt.semilogx(C_values, f1_scores, '-k')


Out[9]:
[<matplotlib.lines.Line2D at 0x10a994bd0>]

We see that we can get an F1 score of around 0.2, depending on the value of $C$.

This is how hyperparameter optimization with cross-validation usually works: for each setting of the hyperparameters, fit the model and measure how well it generalizes.
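
One convenient way to automate this scan over several hyperparameters at once is GridSearchCV. The sketch below is not part of the original solution: the parameter grid is an arbitrary choice, and depending on your scikit-learn version the class lives in sklearn.grid_search or sklearn.model_selection. On the small subset above this is feasible, but it can be slow on the full dataset.

from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer releases

# search over C and gamma simultaneously, scoring each combination
# with the cross-validated f1-score (grid values are arbitrary)
param_grid = {'C': 10 ** np.linspace(-2, 4, 7),
              'gamma': 10 ** np.linspace(-3, 2, 6)}

grid = GridSearchCV(SVC(kernel='rbf', class_weight='auto'),
                    param_grid, scoring='f1')
grid.fit(Xsubset, ysubset)

print(grid.best_params_)
print(grid.best_score_)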

Photometric Redshifts

We'll now do a similar validation exercise for the photometric redshift problem, applying sklearn.ensemble.RandomForestRegressor to the SDSS DR7 spectroscopic galaxy sample. The parameters you should explore are n_estimators, criterion, and max_depth. You can read more about these with IPython's help functionality:


In [10]:
from sklearn.ensemble import RandomForestRegressor
RandomForestRegressor?

Here's the code again to download the data:


In [11]:
from astroML.datasets import fetch_sdss_specgals

data = fetch_sdss_specgals()

# put magnitudes in a matrix
feature_names = ['modelMag_%s' % f for f in 'ugriz']
X = np.vstack([data[f] for f in feature_names]).T
y = data['z']

In [12]:
# Plot some magnitudes for the first two thousand points
i, j = 0, 1
N = 2000
plt.scatter(X[:N, i], X[:N, j], c=y[:N],
            edgecolor='none', cmap='cubehelix')
plt.xlabel(feature_names[i])
plt.ylabel(feature_names[j])
plt.colorbar(label='redshift');


Photoz Breakout Questions

Think about what you know about the random forest.

  1. Think about how over-fitting and under-fitting might affect the results: how do you expect n_estimators and max_depth to affect these? Can you make some learning curve plots which confirm this expectation?

  2. What is the best mean squared error you can find for this data? (use sklearn.metrics.mean_squared_error)

  3. Often for photometric redshifts, one is not concerned with mean squared error, but with minimizing catastrophic outliers: that is, points for which the redshift is off by (say) 0.5 or more. Can you find a combination of model parameters which leads to the lowest catastrophic outlier rate? (Note that you can provide a scoring function to sklearn.cross_validation.cross_val_score.)

  4. Create some learning curves for this data. If you wanted to improve random forest photometric redshift results, would it be more fruitful to:
     A. Gather more training samples (i.e. more galaxies with spectroscopic redshifts), or
     B. Gather more features (i.e. more photometric observations for each existing sample)?

1. Over-fitting vs. Under-fitting

For random forests, increasing max_depth increases the tendency to over-fit the training data, while increasing n_estimators (adding more randomized trees) tends to average out this over-fitting.
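
To confirm this, here is a sketch of a validation-curve-style scan (not part of the original solution); the subset size, the fixed n_estimators, and the grid of depths are all arbitrary choices.

from sklearn.cross_validation import cross_val_score
from sklearn.ensemble import RandomForestRegressor

Nsub = 20000  # work with a subset of the data to keep the scan fast
depths = [2, 4, 6, 8, 10, 15, 20]
scores = []

for depth in depths:
    model = RandomForestRegressor(n_estimators=10, max_depth=depth)
    cv = cross_val_score(model, X[:Nsub], y[:Nsub])  # default scoring: R^2
    scores.append(cv.mean())

plt.plot(depths, scores, '-k')
plt.xlabel('max_depth')
plt.ylabel('cross-validated R^2 score');

If the cross-validated score keeps improving with depth, the model is still under-fitting; once it flattens or turns over, the extra depth is only fitting noise.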

2. Best mean-squared error

We can use cross-validation just like above to answer this.
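
Here is a sketch of that for a single candidate model; the subset size and the n_estimators/max_depth values are arbitrary, and in practice you would repeat this over a grid of parameters and keep the minimum.

from sklearn.cross_validation import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error

# make_scorer with greater_is_better=False negates the MSE, so that a
# larger cross-validation score still means a better model
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

model = RandomForestRegressor(n_estimators=10, max_depth=15)
scores = cross_val_score(model, X[:20000], y[:20000], scoring=mse_scorer)
print(-scores.mean())  # cross-validated mean squared error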

3. Custom scorer

We can see from the documentation of cross_val_score that we can give it a custom scoring function; we simply need to define the score we want to optimize.
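
For instance, here is a sketch of a scorer measuring the catastrophic outlier fraction; the helper outlier_fraction is ours, the 0.5 threshold comes from the question above, and the model parameters and subset size are arbitrary.

from sklearn.cross_validation import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer

def outlier_fraction(y_true, y_pred, threshold=0.5):
    # fraction of points whose predicted redshift is off by more than `threshold`
    return np.mean(np.abs(y_pred - y_true) > threshold)

# greater_is_better=False because we want to *minimize* the outlier rate
outlier_scorer = make_scorer(outlier_fraction, greater_is_better=False)

model = RandomForestRegressor(n_estimators=10, max_depth=15)
scores = cross_val_score(model, X[:20000], y[:20000], scoring=outlier_scorer)
print(-scores.mean())  # mean catastrophic outlier fraction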

4. Learning Curves

This takes a bit more work, because learning curves are not yet built into the scikit-learn utilities.
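
Here is a minimal hand-rolled sketch (not the original solution): hold out a validation set, train on progressively larger subsets, and compare training and validation error. The 70/30 split, the model parameters, and the grid of training sizes are all arbitrary choices.

from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# hold out a validation set, then train on progressively larger subsets
Xtrain, Xval, ytrain, yval = train_test_split(X, y, test_size=0.3)

sizes = np.logspace(2, 5, 8).astype(int)  # up to 100,000 points (arbitrary cap for speed)
train_err, val_err = [], []

for n in sizes:
    model = RandomForestRegressor(n_estimators=10, max_depth=15)
    model.fit(Xtrain[:n], ytrain[:n])
    train_err.append(mean_squared_error(ytrain[:n], model.predict(Xtrain[:n])))
    val_err.append(mean_squared_error(yval, model.predict(Xval)))

plt.semilogx(sizes, train_err, label='training error')
plt.semilogx(sizes, val_err, label='validation error')
plt.xlabel('training set size')
plt.ylabel('MSE')
plt.legend();

If the two curves converge to a similar error, gathering more training samples is unlikely to help much; adding more informative features (or a more flexible model) is then the more promising route.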