Exercise 4

Work on this before the next lecture on 1 May. We will talk about questions, comments, and solutions during the exercise after the third lecture.

Please do form study groups! When you do, make sure you can explain everything in your own words; do not simply copy&paste from others.

The solutions to a lot of these problems can probably be found with Google. Please don't look them up. You will not learn much by copy&pasting from the internet.

If you want to get credit/examination for this course, please upload your work to your GitHub repository for this course before the next lecture starts and post a link to your repository in this thread. If you worked on things together with others, please add their names to the notebook so we can see who formed groups.


These are some useful default imports for plotting and NumPy.


In [1]:
%config InlineBackend.figure_format='retina'
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (8, 8)
plt.rcParams["font.size"] = 14
from sklearn.utils import check_random_state

Pitfalls of estimating model performance

This question sets up a classification problem to illustrate a common pitfall in evaluating model performance. To keep things simple, the ys in this classroom problem are picked at random: there is no way for a classifier to learn how to model y from the features provided. This means we know what the true accuracy is: 0.5.
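
As a quick sanity check of this claim (a minimal sketch, not part of the original exercise, using scikit-learn's DummyClassifier on freshly generated random data): when the labels are drawn independently of the features, the usual cross-validation workflow reports an accuracy close to 0.5, as long as no information leaks into the evaluation.

In [ ]:
# Sanity check: random labels, independent features -> cross-validated accuracy ~0.5
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(2000, 10))
y_demo = rng.choice(2, 2000)

baseline = DummyClassifier(strategy="most_frequent")
print(cross_val_score(baseline, X_demo, y_demo, cv=5).mean())  # roughly 0.5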


In [2]:
import numpy as np

import matplotlib.pyplot as plt

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest, f_regression

np.random.seed(6450345)

A common task when building a new model is to select only those variables that are "best" for the problem. This selection procedure can take many different shapes; here we will compute the correlation of each feature with the target, select the 20 features with the highest correlation, and use those in our gradient boosted tree ensemble.

We will then use cross validation to evaluate the performance.


In [3]:
def make_data(N=1000, n_vars=10,
              n_classes=2):
    # Random features and random labels, drawn independently of each other:
    # no classifier can do better than 50% accuracy on average.
    X = np.random.normal(size=(N, n_vars))
    y = np.random.choice(n_classes, N)

    return X, y

X, y = make_data(N=2000, n_vars=50000)

# Select the 20 features most correlated with the target
select = SelectKBest(f_regression, k=20)
X_sel = select.fit_transform(X, y)

# Evaluate a gradient boosted tree ensemble with 5-fold cross-validation
clf = GradientBoostingClassifier()
scores = cross_val_score(clf, X_sel, y, cv=5, n_jobs=8)

print("Scores on each subset:")
print(scores)
avg = (100*np.mean(scores), 100*np.std(scores)/np.sqrt(scores.shape[0]))
print("Average score and uncertainty: (%.2f +- %.3f)%%" % avg)


Scores on each subset:
[ 0.60349127  0.63092269  0.5975      0.63157895  0.65413534]
Average score and uncertainty: (62.35 +- 0.924)%

What just happened? We have a classifier that achieves an accuracy of ~60%, but we know that the features are uncorrelated with the target. How did this happen? What mistake did we make?

What do we need to do to repair this? How do we know it is repaired? We know it is repaired when the predicted performance is close to what we know to be the true performance.

My answer

I guess that the error is that we select the 20 features with the highest correlation to y on the full dataset. Among 50,000 purely random features, some will correlate with the random labels simply by chance, and by keeping only the most correlated ones we hand the GradientBoostingClassifier() exactly those features. Because this selection already saw all of the data, including the samples that later end up in the cross-validation test folds, the chance correlations also hold on those test folds, and the estimated accuracy comes out well above 50%. The features that by chance correlate less do not compensate for this, since they are simply discarded by the selection.
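
A sketch of a possible repair (my attempt, not re-run here): the feature selection has to happen inside the cross-validation, so that SelectKBest is fit on the training fold only and never sees the samples it is later tested on. With scikit-learn this can be done by putting the selection and the classifier into a Pipeline and passing the pipeline to cross_val_score.

In [ ]:
from sklearn.pipeline import make_pipeline

# Make the selection part of the model, so it is re-fit on each training fold
pipe = make_pipeline(SelectKBest(f_regression, k=20),
                     GradientBoostingClassifier())

scores_fixed = cross_val_score(pipe, X, y, cv=5, n_jobs=8)
print("Scores on each subset:")
print(scores_fixed)
print("Average score: %.2f%%" % (100 * np.mean(scores_fixed)))

If the leak is gone, these scores should scatter around 50%, i.e. the predicted performance matches the performance we know to be true, which is the repair criterion from the question above.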