Cross-Validation: The Right and Wrong Way

For a real-world example that showcases the pitfalls of improper cross-validation, see this blog post.

The scenario

You have 20 datapoints, each of which has 1,000,000 attributes. Each observation also has an associated $y$ value, and you are interested in whether a linear combination of a few attributes can be used to predict $y$. That is, you are looking for a model

$$ y_i \sim \sum_j w_j x_{ij} $$

where most of the 1 million $w_j$ values are 0.

The problem

Since there are so many more attributes than datapoints, the chance that a few attributes correlate strongly with $y$ by pure coincidence is very high.
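
To get a feel for how likely such spurious correlations are, here is a quick back-of-the-envelope simulation (just a sketch, with a reduced 100,000 attributes so it runs fast; the exact number printed will vary from run to run):

import numpy as np

rng = np.random.RandomState(0)

# 20 observations, 100,000 completely random attributes, and a y that has
# nothing to do with any of them.
n_obs, n_attr = 20, 100000
X = rng.uniform(0, 3, (n_obs, n_attr))
y = rng.uniform(0, 3, n_obs)

# Pearson correlation of every column with y, in one vectorized pass.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = Xc.T.dot(yc) / (np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))

print("Strongest spurious |correlation|:", np.abs(r).max())

Even though y here is pure noise, the best-correlated column typically looks impressively predictive on just 20 points.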

You kind of remember that cross-validation helps you detect over-fitting, but you're fuzzy on the details.

The wrong way to cross-validate

  • Determine a few attributes of X that correlate well with y
  • Use cross-validation to measure how well a linear fit to these attributes predicts y

In [1]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

Let's make the dataset and compute the y values with a "hidden" model that we are trying to recover:


In [15]:
def hidden_model(x):
    # y is a linear combination of columns 5 and 10...
    result = x[:, 5] + x[:, 10]
    # ... with a little noise
    result += np.random.normal(0, .005, result.shape)
    return result


def make_x(nobs):
    return np.random.uniform(0, 3, (nobs, 10 ** 6))

x = make_x(20)
y = hidden_model(x)

print(x.shape)


(20, 1000000)

Find the 2 attributes in X that best correlate with y


In [17]:
selector = SelectKBest(f_regression, k=2).fit(x, y)
best_features = np.where(selector.get_support())[0]
print(best_features)


[341338 690135]

We know we are already in trouble -- we've selected 2 columns that correlate with y purely by chance, and neither of them is column 5 or 10 (the only 2 columns that actually have anything to do with y). We can look at the correlations between these columns and y, and confirm they are pretty good (again, just a coincidence):


In [18]:
for b in best_features:
    plt.plot(x[:, b], y, 'o')
    plt.title("Column %i" % b)
    plt.xlabel("X")
    plt.ylabel("Y")
    plt.show()
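
To put numbers on those plots, we can also compare the Pearson correlations of the selected columns with those of the truly relevant columns 5 and 10 (a quick check using np.corrcoef):

# Pearson correlation with y of the "lucky" selected columns, compared with
# the correlations of the columns that actually generate y (5 and 10).
for col in list(best_features) + [5, 10]:
    r = np.corrcoef(x[:, col], y)[0, 1]
    print("Column %7i: r = %.3f" % (col, r))

Since SelectKBest with f_regression ranks columns by a score that grows with $r^2$, the two spurious columns necessarily score at least as high as columns 5 and 10 in this sample -- which is exactly why they were picked.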


A linear regression on these two columns, fit to the full dataset, looks good. The "score" here is the $R^2$ score -- scores close to 1 imply a good fit.
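
For reference, the $R^2$ reported by scikit-learn's score method is the coefficient of determination,

$$ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} $$

where $\hat{y}_i$ are the model's predictions and $\bar{y}$ is the mean of the observed values. A model that always predicts $\bar{y}$ scores 0, and a model can score arbitrarily far below 0.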


In [25]:
xt = x[:, best_features]
clf = LinearRegression().fit(xt, y)
print("Score is ", clf.score(xt, y))


Score is  0.839859389027

In [26]:
yp = clf.predict(xt)
plt.plot(yp, y, 'o')
plt.plot(y, y, 'r-')
plt.xlabel("Predicted")
plt.ylabel("Observed")


[Figure: predicted vs. observed y for the two selected columns]

We're worried about overfitting, and remember that cross-validation is supposed to detect this. Let's look at the average $R^2$ score when performing 5-fold cross-validation. It's not as good, but still not bad...


In [27]:
cross_val_score(clf, xt, y, cv=5).mean()


Out[27]:
0.61616795686754722

And even when we plot the predicted against the actual data at each cross-validation iteration (10 folds this time), the model seems to predict the "independent" data pretty well...


In [29]:
for train, test in KFold(n_splits=10).split(xt):
    xtrain, xtest, ytrain, ytest = xt[train], xt[test], y[train], y[test]

    clf.fit(xtrain, ytrain)
    yp = clf.predict(xtest)
    
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')
    

plt.xlabel("Predicted")
plt.ylabel("Observed")


[Figure: predicted vs. observed y across the cross-validation folds]

But -- what if we generated some more data?


In [9]:
x2 = make_x(100)
y2 = hidden_model(x2)
x2 = x2[:, best_features]

y2p = clf.predict(x2)

plt.plot(y2p, y2, 'o')
plt.plot(y2, y2, 'r-')
plt.xlabel("Predicted")
plt.ylabel("Observed")


[Figure: predicted vs. observed y on the new data]

Yikes -- there is no correlation at all! Cross-validation did not detect the overfitting, because we used the entire dataset to select the "good" features before ever splitting the data.

The right way to cross-validate

For cross-validation to give an honest estimate, no information from the held-out data can leak into any step of the fitting procedure -- including feature selection. Thus, we must re-select the best features within each cross-validation fold, using only that fold's training data:


In [31]:
scores = []

for train, test in KFold(n_splits=5).split(x):
    xtrain, xtest, ytrain, ytest = x[train], x[test], y[train], y[test]
    
    b = SelectKBest(f_regression, k=2)
    b.fit(xtrain, ytrain)
    xtrain = xtrain[:, b.get_support()]
    xtest = xtest[:, b.get_support()]
    
    clf.fit(xtrain, ytrain)    
    scores.append(clf.score(xtest, ytest))

    yp = clf.predict(xtest)
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')
    
plt.xlabel("Predicted")
plt.ylabel("Observed")

print("CV Score is ", np.mean(scores))


CV Score is  -1.64839183777

Now cross-validation properly detects the overfitting, by reporting a low average $R^2$ score (a negative $R^2$ means the model does worse than simply predicting the mean of y) and a plot that looks like noise. Of course, it doesn't help us actually discover the fact that columns 5 and 10 determine y (that task is probably hopeless without more data) -- it just lets us know when our fitting approach isn't generalizing to new data.
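
As an aside, scikit-learn's Pipeline gives a more convenient way to get the same leak-free behaviour: when the feature selection step is bundled into the estimator handed to cross_val_score, it is automatically re-fit on the training portion of every fold. A minimal sketch of the idea (assuming a recent scikit-learn, where make_pipeline lives in sklearn.pipeline):

from sklearn.pipeline import make_pipeline

# The selector and the regression form a single estimator, so SelectKBest is
# re-fit inside every training fold and never sees the held-out data.
pipe = make_pipeline(SelectKBest(f_regression, k=2), LinearRegression())
print(cross_val_score(pipe, x, y, cv=5).mean())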