In this exercise, we will explore methods to do model selection in a machine learning context, in particular cross-validation and information criteria. At the same time, we'll learn about scikit-learn's class structure and how to build a pipeline.
There are several reasons why you might want to perform model selection. For example, the algorithm you have chosen may have free (hyper)parameters (e.g. the number of clusters in KMeans, ...) whose values you would like to determine.
Question: Can you think of other reasons and contexts in which model selection might be important?
Your decisions for how to do model selection depend very strongly (like everything else in machine learning) on the purpose of your machine learning procedure. Is your main purpose to accurately predict outcomes for new samples? Or are you trying to infer something about the system?
Inference generally restricts the number of algorithms you can reasonably use, and also the number of model selection procedures you can apply. In the following, assume that everything works for prediction problems; I will point out methods for inference where appropriate. Similarly, assume that everything applies to supervised machine learning; we will cover unsupervised methods further below.
Let's first import some stuff we're going to need.
In [13]:
%matplotlib inline
import matplotlib.pyplot as plt
# comment out this line if you don't have seaborn installed
import seaborn as sns
sns.set_palette("colorblind")
import numpy as np
First, we're going to need some data. We'll work with the star-galaxy data from the first session. This uses the astroquery package and then queries the top 10000 observations from SDSS (see this exercise for more details):
In [3]:
# execute this line:
from astroquery.sdss import SDSS
In [4]:
TSquery = """SELECT TOP 10000
p.psfMag_r, p.fiberMag_r, p.fiber2Mag_r, p.petroMag_r,
p.deVMag_r, p.expMag_r, p.modelMag_r, p.cModelMag_r,
s.class
FROM PhotoObjAll AS p JOIN specObjAll s ON s.bestobjid = p.objid
WHERE p.mode = 1 AND s.sciencePrimary = 1 AND p.clean = 1 AND s.class != 'QSO'
ORDER BY p.objid ASC
"""
SDSSts = SDSS.query_sql(TSquery)
SDSSts
Out[4]:
Exercise 1: Visualize this data set. What representation is most appropriate, do you think?
In [ ]:
Exercise 2: Let's now do some machine learning. In this exercise, you are going to use a random forest classifier to classify this data set. Here are the steps you'll need to perform:
- Split the data into a training set and a test set. You can use the scikit-learn function train_test_split for this task.
- Instantiate a RandomForestClassifier object from the sklearn.ensemble module. Note that this class has three important parameters:
  - n_estimators: the number of decision trees in the random forest
  - max_features: the maximum number of features to use for the decision trees
  - min_samples_leaf: the minimum number of samples that need to end up in a terminal leaf (this effectively limits the number of branchings each tree can make)
- Cross-validate over these parameters using the scikit-learn class GridSearchCV. This class takes a classifier as an input, along with a dictionary of the parameter values to search over.
In the earlier lecture, you learned about four different types of cross-validation.
Exercise 2a: Which of the four algorithms is most appropriate here? And why?
Answer: In this case, k-fold CV or random subset CV seem to be the most appropriate algorithms to use.
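For reference, here is a minimal sketch (not part of the original exercise) of what these two strategies look like in scikit-learn's model_selection module; the toy array below is just a placeholder.
In [ ]:
from sklearn.model_selection import KFold, ShuffleSplit
import numpy as np

X_demo = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features

# k-fold CV: partition the data into k folds, each used once for validation
kf = KFold(n_splits=5, shuffle=True, random_state=23)
for train_idx, val_idx in kf.split(X_demo):
    print("k-fold:", train_idx, val_idx)

# random subset CV: draw repeated random train/validation splits
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=23)
for train_idx, val_idx in ss.split(X_demo):
    print("random subset:", train_idx, val_idx)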
Important: One important thing to remember is that cross-validation crucially depends on your samples being independent of each other. Be sure that this is the case before using it. For example, say you want to classify images of galaxies, but your data set is small, and you're not sure whether your algorithm is rotation-invariant. So you might choose to include the same images multiple times in your training data set, rotated by a random angle. In this case, you have to make sure all versions of the same image are included in the same data set (either the training, the validation, or the test set), and not split across data sets! If you don't, your algorithm will be unreasonably confident in its accuracy (because you are training and validating on essentially the same data points).
Note that scikit-learn can actually deal with this! The class GroupKFold allows k-fold cross-validation using an array of group indices for your training data: samples with the same group index will never be split between the training and validation folds.
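As a minimal sketch (not from the original notebook), suppose aug_groups[i] records which original image augmented sample i was derived from; GroupKFold then keeps all copies of the same image on the same side of every split. The data and group array below are made up for illustration.
In [ ]:
from sklearn.model_selection import GroupKFold
import numpy as np

# hypothetical set-up: 12 augmented samples derived from 4 original images
X_aug = np.random.rand(12, 5)               # 12 samples, 5 features
y_aug = np.random.choice([0, 1], size=12)   # fake labels
aug_groups = np.repeat(np.arange(4), 3)     # original image index for each sample

gkf = GroupKFold(n_splits=4)
for train_idx, val_idx in gkf.split(X_aug, y_aug, groups=aug_groups):
    # no group index ever appears in both the training and the validation fold
    print(aug_groups[train_idx], aug_groups[val_idx])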
But this was just an aside. Last time, you used a random forest with k-fold cross-validation to effectively do model selection for the different parameters that the random forest classifier uses.
Exercise 2b: Now follow the instructions above and implement your random forest classifier.
In [10]:
# train_test_split and GridSearchCV live in sklearn.model_selection
# (they used to be in sklearn.cross_validation and sklearn.grid_search)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# set the random state
rs = 23 # we are in Chicago after all
# extract feature names, remove class
feats = list(SDSSts.columns)
feats.remove('class')
# cast astropy table to pandas, remove classes
X = np.array(SDSSts[feats].to_pandas())
# our classes are the outcomes to classify on
y = np.array(SDSSts['class'])
# let's do a split in training and test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = rs)
# we'll leave the test set for later.
# instantiate the random forest classifier:
RFmod = RandomForestClassifier()
# do a grid search over the free random forest parameters:
pars = {"n_estimators": [10, 100, 300],
"max_features": [1, 3, 7],
"min_samples_leaf": [1,10]}
grid_results = GridSearchCV(RandomForestClassifier(),
pars,
cv = 5)
grid_results.fit(X_train, y_train)
Out[10]:
Exercise 2c: Take a look at the different validation scores for the different parameter combinations. Are they very different or are they similar?
In [460]:
# with sklearn.model_selection.GridSearchCV, the per-parameter scores are stored in cv_results_
grid_results.cv_results_
Out[460]:
It looks like the scores are very similar, with very small variance between the different cross-validation folds. Looking at the results this way can be useful, for example, to see whether there is a large variance in the cross-validation results.
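One way to see the spread at a glance, sketched here assuming grid_results comes from sklearn.model_selection.GridSearchCV as above, is to print the mean and standard deviation of the validation score for each parameter combination:
In [ ]:
res = grid_results.cv_results_
for params, mean, std in zip(res["params"],
                             res["mean_test_score"],
                             res["std_test_score"]):
    print("%s: %.4f +/- %.4f" % (params, mean, std))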
In most machine learning applications, your machine learning algorithm might not be the only component having free parameters. You might not even be sure which machine learning algorithm to use!
For demonstration purposes, imagine you have many features, but many of them might be correlated. A standard dimensionality reduction technique to use is Principal Component Analysis.
Exercise 4: The number of features in our present data set is pretty small, but let's nevertheless attempt to reduce dimensionality with PCA. Run a PCA decomposition in 2 dimensions and plot the results. Colour-code stars versus galaxies. How well do they separate along the principal components?
Hint: Think about whether you can run PCA on the training and test sets separately, or whether you need to run it on both together before doing the train-test split.
In [16]:
from sklearn.decomposition import PCA
# instantiate the PCA object
pca = PCA(n_components=2)
# fit and transform the samples:
X_pca = pca.fit_transform(X)
# make a plot object
fig, ax = plt.subplots(1, 1, figsize=(12,8))
# loop over number of classes:
for i, l in enumerate(np.unique(y)):
    members = (y == l)
    plt.scatter(X_pca[members, 0], X_pca[members, 1],
                color=sns.color_palette("colorblind", 8)[i],
                label=l)
ax.set_xlabel("PCA Component 1")
ax.set_ylabel("PCA Component 2")
plt.legend()
Out[16]:
Exercise 5: Re-do the classification on the PCA components instead of the original features.
In [235]:
# Train PCA on training data set
X_pca_train = pca.fit_transform(X_train)
# apply to test set
X_pca_test = pca.transform(X_test)
# we'll leave the test set for later.
# instantiate the random forest classifier:
RFmod = RandomForestClassifier()
# do a grid search over the free random forest parameters:
pars = {"n_estimators": [10, 100, 300],
"max_features": [1, 2],
"min_samples_leaf": [1,10]}
grid_results = GridSearchCV(RandomForestClassifier(),
pars,
cv = 5)
grid_results.fit(X_pca_train, y_train)
Out[235]:
In [236]:
grid_results.best_score_
Out[236]:
Note: In general, you should (cross-)validate both your data transformations and your classifiers!
But how do we know whether two components was really the right number to choose? Perhaps it should have been three? Or four? Ideally, we would like to include the feature engineering in our cross-validation procedure. In principle, you can do this by running a complicated for-loop. In practice, this is what scikit-learn's Pipeline is for! A Pipeline object takes a list of ("name", ScikitLearnObject) tuples as input and strings them together: your feature vector X will be passed first through the first object, then the second, and so on sequentially.
Note: scikit-learn distinguishes between transformers (i.e. classes that transform the features into something else, like PCA, t-SNE, StandardScaler, ...) and predictors (i.e. classes that produce predictions, such as random forests, logistic regression, ...). In a pipeline, all but the last object must be transformers; the last object can be either.
Exercise 6: Make a pipeline including (1) a PCA object and (2) a random forest classifier. Cross-validate both the PCA components and the parameters of the random forest classifier. What is the best number of PCA components to use?
Hint: You can also use the convenience function make_pipeline to create your pipeline.
Hint: Check the documentation for the precise notation to use for cross-validating parameters.
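To make the hint concrete, here is a minimal sketch (not the exercise solution) of the two ways to build a pipeline and of the step-name__parameter notation that GridSearchCV expects; the step names below are illustrative:
In [ ]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# explicit step names ...
pipe_demo = Pipeline([("pca", PCA()), ("clf", RandomForestClassifier())])
# ... or auto-generated ones (lower-cased class names, e.g. "pca", "randomforestclassifier")
pipe_auto = make_pipeline(PCA(), RandomForestClassifier())

# parameters of a pipeline step are addressed as <step name>__<parameter name>
params_demo = {"pca__n_components": [2, 4],
               "clf__n_estimators": [10, 100]}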
In [28]:
from sklearn.pipeline import Pipeline
# make a list of name-estimator tuples
estimators = [('pca', PCA()), ('clf', RandomForestClassifier())]
# instantiate the pipeline
pipe = Pipeline(estimators)
# make a dictionary of parameters
params = dict(pca__n_components=[2, 4, 6, 8],
clf__n_estimators=[10, 100, 300],
clf__min_samples_leaf=[1,10])
# perform the grid search
grid_search = GridSearchCV(pipe, param_grid=params)
grid_search.fit(X_train, y_train)
Out[28]:
In [30]:
print(grid_search.best_score_)
print(grid_search.best_params_)
It looks like n_components=6 works best.
So far, we've just picked PCA because it's common. But what if there's a better algorithm for dimensionality reduction out there for our problem? Or what if you'd want to compare random forests to other classifiers?
In this case, your best option is to split off a separate validation set, perform cross-validation for each algorithm separately, and then compare the results using hold-out cross validation and your validation set (Note: Do not use your test set for this! Your test set is only used for your final error estimate!)
Doing CV across algorithms is difficult, since the KFoldCV object would need to know which parameters belong to which algorithm.
Exercise 7: Pick an algorithm from the manifold learning module in scikit-learn, cross-validate a random forest on top of both it and PCA, and compare the performance of the two.
Important: Do not choose t-SNE. The reason is that t-SNE does not generalize to new samples! This means that while it's useful for data visualization, you cannot train a t-SNE transformation (in the scikit-learn implementation) on one part of your data and apply it to another!
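One quick way to convince yourself of this (an aside, not part of the exercise): in the scikit-learn implementation, TSNE only provides fit_transform, not a separate transform method, so there is no way to map held-out samples into an already-fitted embedding:
In [ ]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

print(hasattr(PCA(), "transform"))   # True: PCA can be applied to new samples
print(hasattr(TSNE(), "transform"))  # False: t-SNE cannot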
In [36]:
# First, let's redo the train-test split to split the training data
# into training and hold-out validation set
X_train_new, X_val, y_train_new, y_val = train_test_split(X_train, y_train,
test_size = 0.2,
random_state = rs)
# Now we have to re-do the PCA pipeline:
from sklearn.pipeline import Pipeline
# make a list of name-estimator tuples
estimators = [('pca', PCA()), ('clf', RandomForestClassifier())]
# instantiate the pipeline
pipe = Pipeline(estimators)
# make a dictionary of parameters
params = dict(pca__n_components=[2, 4, 6, 8],
clf__n_estimators=[10, 100, 300],
clf__min_samples_leaf=[1,10])
# perform the grid search
grid_search = GridSearchCV(pipe, param_grid=params)
grid_search.fit(X_train_new, y_train_new)
print("Best score: " + str(grid_search.best_score_))
print("Best parameter set: " + str(grid_search.best_params_))
print("Validation score for model with PCA: " + str(grid_search.score(X_val, y_val)))
# I'm going to pick locally linear embedding here:
# LLE has two free parameters:
# - the number of neighbours to use, `n_neighbors`
# - the number of components in the output, `n_components`
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.pipeline import Pipeline
# make a list of name-estimator tuples
estimators = [('lle', LocallyLinearEmbedding()), ('clf', RandomForestClassifier())]
# instantiate the pipeline
pipe2 = Pipeline(estimators)
# make a dictionary of parameters
params = dict(lle__n_components=[2, 4, 6, 8],
lle__n_neighbors=[5, 10, 100],
clf__n_estimators=[10, 100, 300],
clf__min_samples_leaf=[1,10])
# perform the grid search
grid_search2 = GridSearchCV(pipe2, param_grid=params)
grid_search2.fit(X_train_new, y_train_new)
print("Best score: " + str(grid_search2.best_score_))
print("Best parameter set: " + str(grid_search2.best_params_))
print("Validation score for model with LLE: " + str(grid_search2.score(X_val, y_val)))
Looks like PCA does slightly better as a dimensionality reduction method.
Earlier today, we talked about interpreting machine learning models. Let's see how you would go about this in practice.
In [239]:
from sklearn.linear_model import LogisticRegressionCV
lr = LogisticRegressionCV(penalty="l2", Cs=10, cv=10)
lr.fit(X_train, y_train)
Out[239]:
In [240]:
lr.coef_
Out[240]:
Answer 1: Whether the model is easier or harder to interpret depends on what type of interpretability is desired. If you are interested in how the features influence the classification, the logistic regression model is easier to interpret: because random forests are an ensemble method, it is very hard to understand in detail how a prediction comes about (since the individual decision trees may have very different structures). However, for very large feature spaces with complicated, engineered features, a linear model such as logistic regression loses just as much interpretability in how the parameters affect the outcome.
Answer 2: The more feature engineering you do, the harder it will be to interpret the results. The PCA features are a linear transformation of your original eight features. But what do they mean in physical terms? Who knows?
In [242]:
# let's leave out the first parameter and see whether the coefficients change:
lr.fit(X_train[:,1:], y_train)
lr.coef_
Out[242]:
Answer 3: Some of the coefficients just changed sign! This is one of the problems with directly interpreting linear models: they are quite sensitive to the structure of the feature space. If you took these parameters and interpreted them in a causal sense, you might reach completely different causal conclusions depending on which features you include, so be careful to check how robust your model is to changes in the feature space!
Sometimes, you might want to use algorithms, for example for feature engineering, that are not implemented in scikit-learn. But perhaps these transformations still have free parameters to estimate. What to do?
scikit-learn classes inherit from certain base classes that make it easy to implement your own objects. Below is an example I wrote for a machine learning model on time series, where I wanted to re-bin the time series in different ways and optimize the rebinning factor with respect to the classification afterwards.
In [233]:
from sklearn.base import BaseEstimator, TransformerMixin

class RebinTimeseries(BaseEstimator, TransformerMixin):

    def __init__(self, n=4, method="average"):
        """
        Initialize hyperparameters

        :param n: number of samples to bin
        :param method: "average" or "sum" the samples within a bin?
        :return:
        """
        self.n = n  ## save number of samples to bin together
        self.method = method
        return

    def fit(self, X, y=None):
        """
        I don't really need a fit method!
        """
        return self

    def transform(self, X):
        ## set number of light curves (L) and
        ## number of samples per light curve (K)
        self.L, self.K = X.shape

        ## set the number of binned samples per light curve
        K_binned = int(self.K / self.n)

        ## if the number of samples in the original light curve
        ## is not divisible by n, then chop off the last few samples of
        ## the light curve to make it divisible
        if K_binned * self.n < self.K:
            X = X[:, :self.n * K_binned]

        ## the array for the new, binned light curves
        X_binned = np.zeros((self.L, K_binned))

        if self.method in ["average", "mean"]:
            method = np.mean
        elif self.method == "sum":
            method = np.sum
        else:
            raise Exception("Method not recognized!")

        for i in range(self.L):
            t_reshape = X[i, :].reshape((K_binned, self.n))
            X_binned[i, :] = method(t_reshape, axis=1)

        return X_binned

    def predict(self, X):
        pass

    def score(self, X):
        pass

    def fit_transform(self, X, y=None):
        self.fit(X)
        X_binned = self.transform(X)
        return X_binned
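To show how such a custom transformer slots into the machinery above, here is a minimal usage sketch on simulated light curves; the data, labels, and parameter grid are made up purely for illustration:
In [ ]:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# simulate 100 fake light curves with 64 time bins each, plus random labels
X_toy = np.random.rand(100, 64)
y_toy = np.random.choice([0, 1], size=100)

# chain the rebinning step with a classifier and cross-validate the
# rebinning factor n together with the forest parameters
pipe_rebin = Pipeline([("rebin", RebinTimeseries()),
                       ("clf", RandomForestClassifier())])
pars_rebin = {"rebin__n": [2, 4, 8],
              "clf__n_estimators": [10, 100]}
grid_rebin = GridSearchCV(pipe_rebin, param_grid=pars_rebin, cv=5)
grid_rebin.fit(X_toy, y_toy)
print(grid_rebin.best_params_)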
Here are the important things about writing transformer objects for use in scikit-learn:
- fit: fit your training data
- transform: transform your training data into the new representation
- predict: predict new examples
- score: score predictions
- fit_transform is optional (I think)
- the __init__ method only sets up parameters; don't put any relevant code in there (this is convention more than anything else, but it's a good one to follow!)
- the fit method is always called in a Pipeline object (either on its own or as part of fit_transform); it usually modifies the internal state of the object, so returning self (i.e. the object itself) is usually fine
- methods you don't need (such as predict and score here) can simply pass, as above
Exercise 8: Last time, you learned that the SDSS photometric classifier uses a single hard cut to separate stars and galaxies in imaging data: $$\mathtt{psfMag} - \mathtt{cmodelMag} \gt 0.145,$$ sources that satisfy this criterion are considered galaxies. Write a transformer with:
- a hyperparameter p that sets the value above which a source is considered a galaxy, and
- a transform method that returns a single binary feature that is one if $$\mathtt{psfMag} - \mathtt{cmodelMag} \gt p$$ and zero otherwise.
Hint: $\mathtt{psfMag}$ and $\mathtt{cmodelMag}$ are the first and the last column in your feature vector, respectively.
Hint: You can use FeatureUnion to combine the outputs of two transformers into a single data set. (Note that putting all three objects into a Pipeline would chain them, rather than computing the feature union followed by a classifier.) You can then put your FeatureUnion object into a Pipeline.
In [259]:
class PSFMagThreshold(BaseEstimator, TransformerMixin):

    def __init__(self, p=1.45):
        """
        Initialize hyperparameters

        Parameters
        ----------
        p : float
            The threshold for the psf magnitude - cmodel magnitude difference
        """
        self.p = p  # store parameter in the object
        return

    def fit(self, X, y=None):
        """
        I don't really need a fit method!
        """
        return self

    def transform(self, X):
        # extract relevant columns
        psfmag = X[:, 0]
        c_psfmag = X[:, -1]
        # compute the difference
        d_psfmag = psfmag - c_psfmag
        # make a 1D array of length N
        X_new = np.zeros(X.shape[0])
        X_new[d_psfmag > self.p] = 1.0
        # IMPORTANT: the output must be a COLUMN vector.
        # You can achieve this with the numpy function atleast_2d()
        # and transposing the result.
        return np.atleast_2d(X_new).T

    def predict(self, X):
        pass

    def score(self, X):
        pass

    def fit_transform(self, X, y=None):
        self.fit(X)
        X_new = self.transform(X)
        return X_new
In [260]:
pt = PSFMagThreshold(p=1.45)
X_pt = pt.fit_transform(X)
Now let's make a feature set that combines this feature with the PCA features:
In [268]:
from sklearn.pipeline import FeatureUnion
transformers = [("pca", PCA(n_components=2)),
("pt", PSFMagThreshold(p=1.45))]
feat_union = FeatureUnion(transformers)
X_transformed = feat_union.fit_transform(X_train)
Now we can build the pipeline:
In [276]:
# combine the two transformers into a FeatureUnion
transformers = [("pca", PCA()),
("pt", PSFMagThreshold(p=1.45))]
feat_union = FeatureUnion(transformers)
estimators = [("feats", feat_union),
("clf", RandomForestClassifier())]
pipe_c = Pipeline(estimators)
# make the parameter set
params = dict(feats__pca__n_components=[2, 4, 6, 8],
feats__pt__p=[0.5, 0.9, 1.45, 2.0],
clf__n_estimators=[10, 100, 300],
clf__min_samples_leaf=[1,10])
# perform the grid search
grid_search_c = GridSearchCV(pipe_c, param_grid=params)
grid_search_c.fit(X_train_new, y_train_new)
# print validation score
print("Best score: " + str(grid_search_c.best_score_))
print("Best parameter set: " + str(grid_search_c.best_params_))
print("Validation score: " + str(grid_search_c.score(X_val, y_val)))
By default, the algorithms in scikit-learn use accuracy to score results. The accuracy is basically the raw fraction of correctly classified samples in your validation or test set.
Question: Is this scoring function always the best method to use? Why (not)? Can you think of alternatives to use?
Let's make a heavily biased data set:
In [456]:
# all stars
star_ind = np.argwhere(y == b"STAR").T[0]
# all galaxies
galaxy_ind = np.argwhere(y == b"GALAXY").T[0]
np.random.seed(100)
# new array with much fewer stars
star_ind_new = np.random.choice(star_ind, replace=False, size=int(len(star_ind)/80.0))
X_new = np.vstack((X[galaxy_ind], X[star_ind_new]))
y_new = np.hstack((y[galaxy_ind], y[star_ind_new]))
We have now made a really imbalanced data set with many galaxies and only a few stars:
In [457]:
print(len(y_new[y_new == b"GALAXY"]))
print(len(y_new[y_new == b"STAR"]))
Exercise 10: Run a logistic regression classifier on this data, for a very low regularization (0.0001) and a very large regularization (10000) parameter. Print the accuracy and a confusion matrix of the results for each run. How many mis-classified samples are in each? Where do the mis-classifications end up? If you were to run a cross validation on this, could you be sure to get a good model? Why (not)?
Hint: Our imbalanced class, the one we're interested in, is the STAR class.
In [458]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_new, y_new,
test_size = 0.3,
random_state = 20)
C_all = [0.0001, 10000]
for C in C_all:
    lr = LogisticRegression(penalty='l2', C=C)
    lr.fit(X_train2, y_train2)
    y_pred = lr.predict(X_test2)
    print("The accuracy score for C = %g is %.4f" % (C, accuracy_score(y_test2, y_pred)))
    cm = confusion_matrix(y_test2, y_pred, labels=np.unique(y))
    print(cm)
In [459]:
from sklearn.metrics import f1_score

for C in C_all:
    lr = LogisticRegression(penalty='l2', C=C)
    lr.fit(X_train2, y_train2)
    y_pred = lr.predict(X_test2)
    print("The accuracy score for C = %g is %.4f" % (C, accuracy_score(y_test2, y_pred)))
    print("The F1 score for C = %g is %.4f" % (C, f1_score(y_test2, y_pred,
                                                           pos_label=b"STAR",
                                                           average="binary")))
    cm = confusion_matrix(y_test2, y_pred, labels=np.unique(y))
    print(cm)
As you can see, the accuracy for the model that mis-classifies all stars ends up being higher than the accuracy for the model that only mis-classifies some of them (admittedly, still more than you'd want). So when doing cross-validation, you're quite likely to end up with a model that mis-classifies all your samples from the rare class into the more common class.
If the rare class is actually one that you care about (say, gravitational wave triggers), then this is very bad behaviour.
The F1 score, conversely, is zero for the small value of C (since no sources end up classified as stars at all), and higher, though still not very high, for the large value of C. This is more what we'd like to see if we're interested in the rare class.
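If you want the cross-validation itself to optimize for the rare class, you can hand GridSearchCV a different scoring function. Here is a minimal sketch, reusing the imbalanced training set from above and wrapping f1_score with make_scorer so that the byte-string STAR label can be passed as the positive class:
In [ ]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, f1_score

# score each parameter combination by the F1 score of the rare STAR class
# rather than by raw accuracy
f1_star = make_scorer(f1_score, pos_label=b"STAR", average="binary")
grid_lr = GridSearchCV(LogisticRegression(penalty="l2"),
                       param_grid={"C": [0.0001, 1, 10000]},
                       scoring=f1_star, cv=5)
grid_lr.fit(X_train2, y_train2)
print(grid_lr.best_params_)
print(grid_lr.score(X_test2, y_test2))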