In this exercise, we will explore methods for model selection in a machine learning context, in particular cross-validation and information criteria. At the same time, we'll learn about scikit-learn's class structure and how to build a pipeline.
There are several reasons why you might want to perform model selection. For example, the algorithm you have chosen may have free parameters (such as the number of clusters in KMeans, ...) whose values you would like to determine.

Question: Can you think of other reasons and contexts in which model selection might be important?
Your decisions for how to do model selection depend very strongly (like everything else in machine learning) on the purpose of your machine learning procedure. Is your main purpose to accurately predict outcomes for new samples? Or are you trying to infer something about the system?
Inference generally restricts the number of algorithms you can reasonably use, and also the number of model selection procedures you can apply. In the following, assume that everything applies to prediction problems; I will point out methods suited to inference where appropriate. Similarly, assume that everything applies to supervised machine learning; we will cover unsupervised methods further below.
Let's first import some stuff we're going to need.
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
# comment out this line if you don't have seaborn installed
import seaborn as sns
sns.set_palette("colorblind")
import numpy as np
First, we're going to need some data. We'll work with the star-galaxy data from the first session. This uses the astroquery package to query the top 10000 observations from SDSS (see this exercise for more details):
In [ ]:
# execute this line:
from astroquery.sdss import SDSS
In [ ]:
TSquery = """SELECT TOP 10000
p.psfMag_r, p.fiberMag_r, p.fiber2Mag_r, p.petroMag_r,
p.deVMag_r, p.expMag_r, p.modelMag_r, p.cModelMag_r,
s.class
FROM PhotoObjAll AS p JOIN specObjAll s ON s.bestobjid = p.objid
WHERE p.mode = 1 AND s.sciencePrimary = 1 AND p.clean = 1 AND s.class != 'QSO'
ORDER BY p.objid ASC
"""
SDSSts = SDSS.query_sql(TSquery)
SDSSts
Exercise 1: Visualize this data set. What representation is most appropriate, do you think?
In [ ]:
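Here is one possible approach (a sketch, not the only sensible choice), assuming the query above succeeded and SDSSts is the astropy Table it returned: convert it to pandas and scatter two of the magnitudes against each other, colour-coded by class.

In [ ]:
# One possible visualization: two magnitude columns, coloured by class
df = SDSSts.to_pandas()

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
for cls in np.unique(df["class"]):
    sel = df["class"] == cls
    ax.scatter(df["psfMag_r"][sel], df["cModelMag_r"][sel],
               s=5, alpha=0.3, label=str(cls))
ax.set_xlabel("psfMag_r")
ax.set_ylabel("cModelMag_r")
ax.legend()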
Exercise 2: Let's now do some machine learning. In this exercise, you are going to use a random forest classifier to classify this data set. Here are the steps you'll need to perform:

- Split the data into a training set and a test set. The scikit-learn function train_test_split is useful for this task.
- Instantiate a RandomForestClassifier object from the sklearn.ensemble module. Note that this classifier has three important parameters:
  - n_estimators: the number of decision trees in the random forest
  - max_features: the maximum number of features to use for the decision trees
  - min_samples_leaf: the minimum number of samples that need to end up in a terminal leaf (this effectively limits the number of branchings each tree can make)
- Do a grid search over these free parameters using the scikit-learn class GridSearchCV. This class takes a classifier as input, along with a dictionary of the parameter values to search over.

In the earlier lecture, you learned about four different types of cross-validation.
Exercise 2a: Which of the four algorithms is most appropriate here? And why?
Answer: In this case, k-fold CV or random subset CV seem to be the most appropriate algorithms to use.
Important: One important thing to remember is that cross-validation crucially depends on your samples being independent of each other. Be sure that this is the case before using it. For example, say you want to classify images of galaxies, but your data set is small, and you're not sure whether your algorithm is rotation-invariant. So you might choose to use the same images multiple times in your training data set, rotated by a random angle each time. In this case, you have to make sure that all versions of the same image are included in the same data set (either the training, the validation or the test set), and not split across data sets! If you don't, your algorithm will be unreasonably confident in its accuracy (because you are training and validating essentially on the same data points).
Note that scikit-learn can actually deal with this! The class GroupKFold performs k-fold cross-validation using an array of group indices for your training data: samples that share a group index are never split between the training and validation sets.
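For example, here is a minimal sketch (with made-up toy data, not the SDSS set) of how GroupKFold keeps samples with the same group index in the same fold:

In [ ]:
# Toy example: samples sharing a group index always end up in the same fold.
from sklearn.model_selection import GroupKFold

X_demo = np.arange(12).reshape(6, 2)
y_demo = np.array([0, 0, 1, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])   # e.g. two rotated versions per image

gkf = GroupKFold(n_splits=3)
for train_idx, val_idx in gkf.split(X_demo, y_demo, groups):
    print("train groups:", np.unique(groups[train_idx]),
          "validation groups:", np.unique(groups[val_idx]))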
But this was just an aside. Last time, you used a random forest and used k-fold cross validation to effectively do model selection for the different parameters that the random forest classifier uses.
Exercise 2b: Now follow the instructions above and implement your random forest classifier.
In [ ]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# set the random state
rs = 23
# extract feature names, remove class
# cast astropy table to pandas and then to a numpy array, remove classes
# our classes are the outcomes to classify on
# let's do a split in training and test set:
# we'll leave the test set for later.
# instantiate the random forest classifier:
# do a grid search over the free random forest parameters:
pars =
grid_results =
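A possible way to fill in the cell above (a sketch; your parameter grid may well differ):

In [ ]:
# A possible solution sketch for Exercise 2b.
# extract feature names, remove class
df = SDSSts.to_pandas()
feats = list(df.columns)
feats.remove("class")

# cast to numpy arrays; the classes are the outcomes to classify on
X = np.array(df[feats])
y = np.array(df["class"])

# split into training and test set; we'll leave the test set for later
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=rs)

# instantiate the random forest classifier
rfc = RandomForestClassifier(random_state=rs)

# grid search over some of the free random forest parameters
pars = {"n_estimators": [10, 50, 100],
        "min_samples_leaf": [1, 10]}
grid_results = GridSearchCV(rfc, pars, cv=5)
grid_results.fit(X_train, y_train)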
Exercise 2c: Take a look at the different validation scores for the different parameter combinations. Are they very different or are they similar?
In [ ]:
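One way to look at the scores, assuming grid_results from the sketch above and a scikit-learn version that provides the cv_results_ attribute:

In [ ]:
# Collect the mean and standard deviation of the validation scores
# for each parameter combination.
import pandas as pd

cv_res = pd.DataFrame(grid_results.cv_results_)
print(cv_res[["params", "mean_test_score", "std_test_score"]])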
It looks like the scores are very similar, with very small variance between the different cross-validation folds. This kind of representation can be useful to check, for example, whether there is a large variance in the cross-validation results.
In most machine learning applications, your machine learning algorithm might not be the only component having free parameters. You might not even be sure which machine learning algorithm to use!
For demonstration purposes, imagine you have many features, but many of them might be correlated. A standard dimensionality reduction technique to use is Principal Component Analysis.
Exercise 4: The number of features in our present data set is pretty small, but let's nevertheless attempt to reduce dimensionality with PCA. Run a PCA decomposition in 2 dimensions and plot the results. Colour-code stars versus galaxies. How well do they separate along the principal components?
Hint: Think about whether you can run PCA on the training and test sets separately, or whether you need to run it on both together before doing the train-test split.
In [ ]:
from sklearn.decomposition import PCA
# instantiate the PCA object
pca =
# fit and transform the samples:
X_pca =
# make a plot of the PCA components colour-coded by stars and galaxies
fig, ax = plt.subplots(1, 1, figsize=(12,8))
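A possible completion of the cell above, assuming X_train and y_train from the Exercise 2 sketch and fitting the PCA on the training set only:

In [ ]:
# instantiate the PCA object with two components
pca = PCA(n_components=2)

# fit and transform the training samples
X_pca = pca.fit_transform(X_train)

# plot the two PCA components, colour-coded by class
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
for cls in np.unique(y_train):
    sel = (y_train == cls)
    ax.scatter(X_pca[sel, 0], X_pca[sel, 1], s=5, alpha=0.3, label=str(cls))
ax.set_xlabel("PCA component 1")
ax.set_ylabel("PCA component 2")
ax.legend()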
Exercise 5: Re-do the classification on the PCA components instead of the original features. Does it work better or worse than the classification on the original features?
In [ ]:
# Train PCA on training data set
# apply to test set
# instantiate the random forest classifier:
# do a grid search over the free random forest parameters:
pars =
grid_results =
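A possible sketch, reusing the train/test split and parameter grid from the Exercise 2 sketch:

In [ ]:
# Train PCA on the training data set and apply it to the test set
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# instantiate the random forest classifier
rfc = RandomForestClassifier(random_state=rs)

# grid search over the free random forest parameters, now on the PCA features
pars = {"n_estimators": [10, 50, 100],
        "min_samples_leaf": [1, 10]}
grid_results = GridSearchCV(rfc, pars, cv=5)
grid_results.fit(X_train_pca, y_train)
print("best CV score on PCA features:", grid_results.best_score_)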
In [ ]:
Note: In general, you should (cross-)validate both your data transformations and your classifiers!
But how do we know whether two components was really the right number to choose? Perhaps it should have been three? Or four? Ideally, we would like to include the feature engineering in our cross-validation procedure. In principle, you can do this by running a complicated for-loop. In practice, this is what scikit-learn's Pipeline is for! A Pipeline object takes a list of ("string", ScikitLearnObject) tuples as input and strings them together: your feature vector X is first put through the first object, then the second, and so on sequentially.
Note: scikit-learn distinguishes between transformers (i.e. classes that transform the features into something else, like PCA, t-SNE, StandardScaler, ...) and predictors (i.e. classes that produce predictions, such as random forests, logistic regression, ...). In a pipeline, all but the last object must be transformers; the last object can be either.
Exercise 6: Make a pipeline including (1) a PCA object and (2) a random forest classifier. Cross-validate both the PCA components and the parameters of the random forest classifier. What is the best number of PCA components to use?
Hint: You can also use the convenience function make_pipeline to create your pipeline.
Hint: Check the documentation for the precise notation to use for cross-validating parameters.
In [ ]:
from sklearn.pipeline import Pipeline
# make a list of name-estimator tuples
estimators =
# instantiate the pipeline
pipe =
# make a dictionary of parameters
params =
# perform the grid search
grid_search =
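A possible completion, using the step__parameter (double underscore) notation to cross-validate parameters of named pipeline steps:

In [ ]:
# make a list of name-estimator tuples
estimators = [("pca", PCA()), ("clf", RandomForestClassifier(random_state=rs))]

# instantiate the pipeline
pipe = Pipeline(estimators)

# make a dictionary of parameters: <step name>__<parameter name>
params = {"pca__n_components": [2, 4, 6, 8],
          "clf__n_estimators": [10, 50, 100]}

# perform the grid search
grid_search = GridSearchCV(pipe, params, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)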
In [ ]:
So far, we've just picked PCA because it's common. But what if there's a better algorithm for dimensionality reduction out there for our problem? Or what if you'd want to compare random forests to other classifiers?
In this case, your best option is to split off a separate validation set, perform cross-validation for each algorithm separately, and then compare the results using hold-out cross validation and your validation set (Note: Do not use your test set for this! Your test set is only used for your final error estimate!)
Doing CV across algorithms is awkward, since the cross-validation object needs to know which parameters belong to which algorithm.
Exercise 7: Pick an algorithm from the manifold learning module in scikit-learn, cross-validate a random forest for both it and PCA, and compare the performance of the two.

Important: Do not choose t-SNE. The reason is that t-SNE does not generalize to new samples! This means that while it's useful for data visualization, you cannot train a t-SNE transformation (in the scikit-learn implementation) on one part of your data and apply it to another!
In [ ]:
# First, let's redo the train-test split to split the training data
# into training and hold-out validation set
# make a list of name-estimator tuples
estimators =
# instantiate the pipeline
pipe =
# make a dictionary of parameters
params =
# perform the grid search
grid_search =
# complete the print functions
print("Best score: ")
print("Best parameter set: " )
print("Validation score for model with PCA: ")
# Now repeat the same procedure with the second algorithm you've picked.
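A sketch of one possible approach, with sklearn.manifold.Isomap as the (arbitrarily chosen) second dimensionality-reduction algorithm and a hold-out validation set split off from the training data:

In [ ]:
from sklearn.manifold import Isomap

# split the training data into a smaller training set and a hold-out validation set
X_train_new, X_val, y_train_new, y_val = train_test_split(X_train, y_train,
                                                          test_size=0.3,
                                                          random_state=rs)

for name, transformer in [("pca", PCA()), ("isomap", Isomap())]:
    pipe = Pipeline([(name, transformer),
                     ("clf", RandomForestClassifier(random_state=rs))])
    params = {name + "__n_components": [2, 4],
              "clf__n_estimators": [50, 100]}
    grid_search = GridSearchCV(pipe, params, cv=5)
    grid_search.fit(X_train_new, y_train_new)
    print("Best score (%s): %.4f" % (name, grid_search.best_score_))
    print("Best parameter set (%s): %s" % (name, grid_search.best_params_))
    print("Validation score (%s): %.4f" % (name, grid_search.score(X_val, y_val)))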
Earlier today, we talked about interpreting machine learning models. Let's see how you would go about this in practice.
In [ ]:
from sklearn.linear_model import LogisticRegressionCV
lr =
In [ ]:
In [ ]:
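A possible sketch, assuming X_train, y_train and the feature-name list feats from the Exercise 2 sketch: standardize the features, fit a cross-validated logistic regression, and inspect the coefficients to see which magnitudes drive the star/galaxy decision.

In [ ]:
from sklearn.preprocessing import StandardScaler

# standardize the features so the coefficients are comparable
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# fit a logistic regression with built-in cross-validation of the regularization
lr = LogisticRegressionCV(cv=5)
lr.fit(X_train_scaled, y_train)

# one coefficient per feature (binary problem)
for f, c in zip(feats, lr.coef_[0]):
    print("%s: %.3f" % (f, c))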
Sometimes you might want to use algorithms, for example for feature engineering, that are not implemented in scikit-learn. But perhaps these transformations still have free parameters to estimate. What to do?

scikit-learn classes inherit from certain base classes that make it easy to implement your own objects. Below is an example I wrote for a machine learning model on time series, where I wanted to re-bin the time series in different ways and optimize the rebinning factor with respect to the classification afterwards.
In [ ]:
from sklearn.base import BaseEstimator, TransformerMixin
class RebinTimeseries(BaseEstimator, TransformerMixin):

    def __init__(self, n=4, method="average"):
        """
        Initialize hyperparameters

        :param n: number of samples to bin
        :param method: "average" or "sum" the samples within a bin?
        :return:
        """
        self.n = n  ## save number of bins to average together
        self.method = method
        return

    def fit(self, X, y=None):
        """
        I don't really need a fit method!
        """
        ## set number of light curves (L) and
        ## number of samples per light curve (K)
        return self

    def transform(self, X):
        self.L, self.K = X.shape

        ## set the number of binned samples per light curve
        K_binned = int(self.K / self.n)

        ## if the number of samples in the original light curve
        ## is not divisible by n, then chop off the last few samples of
        ## the light curve to make it divisible
        #print("X shape: " + str(X.shape))
        if K_binned * self.n < self.K:
            X = X[:, :self.n * K_binned]

        ## the array for the new, binned light curves
        X_binned = np.zeros((self.L, K_binned))

        if self.method in ["average", "mean"]:
            method = np.mean
        elif self.method == "sum":
            method = np.sum
        else:
            raise Exception("Method not recognized!")

        #print("X shape: " + str(X.shape))
        #print("L: " + str(self.L))
        for i in range(self.L):
            t_reshape = X[i, :].reshape((K_binned, self.n))
            X_binned[i, :] = method(t_reshape, axis=1)

        return X_binned

    def predict(self, X):
        pass

    def score(self, X):
        pass

    def fit_transform(self, X, y=None):
        self.fit(X)
        X_binned = self.transform(X)
        return X_binned
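A quick usage sketch with made-up data, just to show what the transformer does:

In [ ]:
# 100 fake "light curves" of length 20, rebinned by a factor of 4
X_fake = np.random.rand(100, 20)
rebinner = RebinTimeseries(n=4, method="average")
X_fake_binned = rebinner.fit_transform(X_fake)
print(X_fake_binned.shape)   # (100, 5)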
Here are the important things about writing transformer objects for use in scikit-learn:
- fit: fit your training data
- transform: transform your training data into the new representation
- predict: predict new examples
- score: score predictions
- fit_transform is optional (I think)
- The __init__ method only sets up parameters. Don't put any relevant code in there (this is convention more than anything else, but it's a good one to follow!)
- The fit method is always called in a Pipeline object (either on its own or as part of fit_transform). It usually modifies the internal state of the object, so returning self (i.e. the object itself) is usually fine.
- Methods you don't need can simply pass, as above.

Exercise 8: Last time, you learned that the SDSS photometric classifier uses a single hard cut to separate stars and galaxies in imaging data: $$\mathtt{psfMag} - \mathtt{cmodelMag} \gt 0.145,$$ and sources that satisfy this criterion are considered galaxies. Write this classifier as a scikit-learn transformer with:

- a parameter p that sets the value above which a source is considered a galaxy, and
- a transform method that returns a single binary feature that is one if $$\mathtt{psfMag} - \mathtt{cmodelMag} \gt p$$ and zero otherwise.

Hint: $\mathtt{psfMag}$ and $\mathtt{cmodelMag}$ are the first and the last column in your feature vector, respectively.
Hint: You can use FeatureUnion to combine the outputs of two transformers into a single data set. (Note that putting all three objects into a Pipeline would chain them sequentially, rather than computing the feature union and then applying a classifier.) You can then pass your FeatureUnion object into a Pipeline.
In [ ]:
class PSFMagThreshold(BaseEstimator, TransformerMixin):

    def __init__(self, p=0.145):

    def fit(self, X):

    def transform(self, X):

    def predict(self, X):

    def score(self, X):

    def fit_transform(self, X, y=None):
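One possible completion of the class above (a sketch, not the only valid answer), assuming psfMag_r is the first and cModelMag_r the last column of the feature vector:

In [ ]:
class PSFMagThreshold(BaseEstimator, TransformerMixin):

    def __init__(self, p=0.145):
        # only store the hyperparameter here
        self.p = p

    def fit(self, X, y=None):
        # nothing to learn from the data
        return self

    def transform(self, X):
        # binary feature: 1 if psfMag - cModelMag > p, else 0
        feature = (X[:, 0] - X[:, -1] > self.p).astype(np.float64)
        return feature.reshape(-1, 1)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)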
In [ ]:
Now let's make a feature set that combines this feature with the PCA features:
In [ ]:
from sklearn.pipeline import FeatureUnion
transformers =
feat_union =
X_transformed =
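A possible completion, assuming the PSFMagThreshold implementation sketched above and X_train from the Exercise 2 sketch:

In [ ]:
# combine the two PCA components with the binary threshold feature
transformers = [("pca", PCA(n_components=2)),
                ("psf_thresh", PSFMagThreshold(p=0.145))]
feat_union = FeatureUnion(transformers)
X_transformed = feat_union.fit_transform(X_train)
print(X_transformed.shape)   # (n_training_samples, 3)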
Now we can build the pipeline:
In [ ]:
# combine the transformers
transformers =
# make the feature union
feat_union =
# combine estimators for the pipeline
estimators =
# define the pipeline object
pipe_c =
# make the parameter set
params =
# perform the grid search
grid_search_c =
# complete the print statements:
print("Best score: ")
print("Best parameter set: ")
print("Validation score: ")
By default, the algorithms in scikit-learn use accuracy to score results. The accuracy is the raw fraction of correctly classified samples in your validation or test set.
Question: Is this scoring function always the best method to use? Why (not)? Can you think of alternatives to use?
Let's make a heavily biased data set:
In [ ]:
# all stars
star_ind = np.argwhere(y == b"STAR").T[0]
# all galaxies
galaxy_ind = np.argwhere(y == b"GALAXY").T[0]
np.random.seed(100)
# new array with much fewer stars
star_ind_new = np.random.choice(star_ind, replace=False, size=int(len(star_ind)/80.0))
X_new = np.vstack((X[galaxy_ind], X[star_ind_new]))
y_new = np.hstack((y[galaxy_ind], y[star_ind_new]))
We have now made a really imbalanced data set with many galaxies and only a few stars:
In [ ]:
print(len(y_new[y_new == b"GALAXY"]))
print(len(y_new[y_new == b"STAR"]))
Exercise 10: Run a logistic regression classifier on this data, once with a very small value (0.0001) and once with a very large value (10000) of the regularization parameter C. Print the accuracy and a confusion matrix of the results for each run. How many mis-classified samples are in each? Where do the mis-classifications end up? If you were to run a cross validation on this, could you be sure to get a good model? Why (not)?
As part of this exercise, you should plot a confusion matrix. A confusion matrix takes the true labels and the predicted labels and counts how many samples fall into each combination of true and predicted class. You can use the scikit-learn function confusion_matrix to create one. pyplot.matshow is useful for plotting it, but just printing it on the screen works pretty well, too (at least for the two classes considered here).
In [ ]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_new, y_new,
test_size = 0.3,
random_state = 20)
C_all =
for C in C_all:
lr =
# ... insert code here ...
# make predictions for the validation set
y_pred =
# print accuracy score for this regularization:
# make and print a confusion matrix
cm =
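A possible completion of the cell above (one reasonable choice of C values):

In [ ]:
C_all = [0.0001, 10000]

for C in C_all:
    # fit a logistic regression with this value of C
    lr = LogisticRegression(C=C)
    lr.fit(X_train2, y_train2)

    # make predictions for the validation set
    y_pred = lr.predict(X_test2)

    # print accuracy score for this regularization
    print("C = %s" % C)
    print("accuracy: %.4f" % accuracy_score(y_test2, y_pred))

    # make and print a confusion matrix
    cm = confusion_matrix(y_test2, y_pred)
    print(cm)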
Exercise 11: Take a look at the metrics implemented for model evaluation in scikit-learn, in particular the different versions of the F1 score. Is there a metric that may be more suited to the task above? Which one?

Hint: Our imbalanced class, the one we're interested in, is the STAR class. Make sure you set the keyword pos_label in the f1_score function correctly.
In [ ]:
for C in C_all:
lr =
# ... insert code here ...
# predict the validation set
y_pred = lr.predict(X_test2)
# print both accuracy and F1 score for comparison:
# create and plot a confusion matrix:
cm =
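A possible completion, assuming the class labels are byte strings (b"STAR", b"GALAXY") as above:

In [ ]:
from sklearn.metrics import f1_score

for C in C_all:
    lr = LogisticRegression(C=C)
    lr.fit(X_train2, y_train2)

    # predict the validation set
    y_pred = lr.predict(X_test2)

    # print both accuracy and F1 score for comparison
    print("C = %s" % C)
    print("accuracy: %.4f" % accuracy_score(y_test2, y_pred))
    print("F1 score: %.4f" % f1_score(y_test2, y_pred, pos_label=b"STAR"))

    # create and print a confusion matrix
    cm = confusion_matrix(y_test2, y_pred)
    print(cm)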