Prediction error, cross-validation, and significance testing

In the previous notebook we built a few different models. We need a way to evaluate our models and to choose the best one.

Let's explore model validation and the various errors of regression models. We will start by creating a new dataset:


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

def make_data(N=90, error=1.0, random_seed=None):
    def test_func(x, err=0.5):
        y = 10 - 1. / (x + 0.1)
        if err > 0:
            y = np.random.normal(y, err)
        return y

    # randomly sample the data
    np.random.seed(random_seed)  # use the requested seed so the dataset is reproducible
    X = np.random.random(N)[:, np.newaxis]
    y = test_func(X.ravel(), error)
    
    return X, y

X, y = make_data(random_seed=1)
plt.scatter(X.ravel(), y)


Out[1]:
<matplotlib.collections.PathCollection at 0x7fa2d0dd4be0>

This time, let's divide our data into two sets that we will call the training and testing sets, and build a simple linear regression model using the training data only:


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

regr = LinearRegression()
regr.fit(X_train, y_train)


Out[2]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Let's plot the results. We will also calculate the model score. The score here is the coefficient of determination R², which is a function of the mean squared error: it equals 1 for a perfect model, 0 for a model that always predicts the mean of y, and can even be negative for models that do worse than that; the full definition you can find here.
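To make the relation between the score and the mean squared error concrete, here is a minimal check (reusing regr, X_train and y_train from the previous cell; the variable names below are just for illustration):

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_pred_tr = regr.predict(X_train)
mse = mean_squared_error(y_train, y_pred_tr)
print(1 - mse / np.var(y_train))       # R^2 reconstructed from the MSE
print(r2_score(y_train, y_pred_tr))    # the same value from sklearn.metrics
print(regr.score(X_train, y_train))    # the same value from model.score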


In [3]:
from sklearn.metrics import mean_squared_error

def plot_regr(X, y, model, color="b"):
    fig, ax = plt.subplots()
    
    X_plot = np.linspace(X[:,0].min(), X[:,0].max(), 100)[:, np.newaxis]
    y_plot = model.predict(X_plot)
    ax.plot(X_plot, y_plot, color='darkgrey', linewidth=2)
    ax.scatter(X[:,0], y, s=100, alpha=0.75, c=color)
    
    y_pr = model.predict(X)
    mse = mean_squared_error(y, y_pr)
    
    ax.set_title("model score = {:03.2f}, mean squared error = {:4.2f}".format(model.score(X,y), mse))
    ax.set_xlabel('X')
    ax.set_ylabel('y')

plot_regr(X_train, y_train, regr)


As you can see, our model is not doing a great job and we might want to try a more complex model, e.g. one that includes polynomial terms. We will create our PolynomialRegression using scikit-learn's PolynomialFeatures and make_pipeline.


In [4]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(**kwargs))

In [5]:
poly2 = PolynomialRegression(2)
poly2.fit(X_train, y_train)
plot_regr(X_train, y_train, poly2)


This is much better, so we might try even higher-degree polynomials.


In [6]:
poly15 = PolynomialRegression(15)
poly15.fit(X_train, y_train)
plot_regr(X_train, y_train, poly15)


That seems to be an even better fit to our data. But in the end we shouldn't care about the score on the training data; we want our model to predict good responses for new data! Now it's time to use the test dataset that we did not use to build our model.


In [7]:
plot_regr(X_test, y_test, poly15, color="r")


Now we can see that for the test set the model doesn't look so great and the score is not impressive. It might even be lower than for the first linear model we built:


In [8]:
plot_regr(X_test, y_test, regr, color="r")


The best results on the test set are obtained with poly2:


In [9]:
plot_regr(X_test, y_test, poly2, color="r")


The problem is related to the bias–variance tradeoff. The bias is the error that comes from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs; this is also called underfitting. The variance is the error that comes from high sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data rather than the underlying relation between input and output; this is also called overfitting.

To evaluate models we use the prediction error. The expected prediction error can be decomposed as:

\begin{aligned}{\mathrm {E}}{\Big [}{\big (}y-{\hat {f}}(x){\big )}^{2}{\Big ]}&={\mathrm {Bias}}{\big [}{\hat {f}}(x){\big ]}^{2}+{\mathrm {Var}}{\big [}{\hat {f}}(x){\big ]}+\sigma ^{2}\\\end{aligned}

where bias:

\begin{aligned}\mathrm {Bias} {\big [}{\hat {f}}(x){\big ]}=\mathrm {E} {\big [}{\hat {f}}(x)-f(x){\big ]}\end{aligned}

and variance:

\begin{aligned}\mathrm {Var} {\big [}{\hat {f}}(x){\big ]}=\mathrm {E} [{\hat {f}}(x)^{2}]-\mathrm {E} [{\hat {f}}(x)]^{2}\end{aligned}

and $\sigma^2$ is irreducible error.

All terms are positive, so to minimize the expected error we have to reduce the bias and/or the variance. When we increase the complexity of the model we decrease the bias but increase the variance, so we should look for the optimal trade-off, shown schematically in the figure:

<img src="../img/bias_var_modcomplex.png", width=500>

Another way of thinking about it is to compare the training error with the testing (prediction) error. As we build more and more complex models the training error keeps decreasing, but at some point the gap between the test and training errors starts to grow faster:

<img src="../img/error_complexity.png", width=500>

Cross-validation

Now we know why we divided our dataset into training and testing sets: it's critical to validate a supervised model on data that was NOT used during training! Unfortunately, the simple split we used in our example has a serious drawback: a big part of our data is never used to build a better model. If you have a large dataset this might not be a problem, but in neuroscientific studies it often is.

However, there are smarter ways of splitting the data, and we will show some popular types of so-called cross-validation (all figures taken from this post):

Holdout

In the simplest split you train the model only once, using just a part of the dataset:

<img src="../img/holdout.png", width=500>

Leave-p-out cross-validation (LpO CV)

Leave-p-out cross-validation uses $p$ observations as the validation set and the remaining observations as the training set. This is an exhaustive method, so all possible combinations of $p$ samples must be used. It requires training and validating the model $C_{p}^{n}$ times, where $C_{p}^{n}$ is the binomial coefficient and $n$ is the number of observations in the original dataset. Even for moderately large $n$, LpO CV can become computationally infeasible (e.g. $C_{30}^{100}\approx 3\times 10^{25}$).

<img src="../img/leave_p_out.png", width=500>

Leave-one-out cross-validation (LOO CV)

Leave-one-out cross-validation is a special case of leave-p-out cross-validation with $p = 1$. LOO cross-validation does not have the same problem of excessive compute time, since $C_{1}^{n}=n$.

<img src="../img/leave_one_out.png", width=500>

K-fold

There are also non-exhaustive cross-validation methods that don't require computing all possible ways of splitting the original dataset; they can be seen as approximations of leave-p-out cross-validation. In k-fold cross-validation, the original sample is randomly partitioned into $k$ equal-sized subsamples. A single subsample is retained as the validation data and the remaining $k - 1$ subsamples are used as training data; the process is then repeated $k$ times, so that each subsample serves as the validation set exactly once. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once.

<img src="../img/kfold.png", width=500>

Random subsamples (without replacement)

This method is similar to the k-fold strategy, but in each iteration we randomly select some samples for testing and some others for training. An advantage over k-fold is that we can freely choose the number of iterations and the size of each train-test split. A drawback is that some samples may never be selected for the test set, whereas others may be selected more than once:

<img src="../img/random_subsampling.png", width=500>

Scikit-learn also provides helper functions that run cross-validation for you. We will run cross_val_score on our previously defined models:


In [10]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(regr, X, y)
print("Scores for regr: {}, mean score = {:03.2f}, std = {:03.2f}".format(scores, scores.mean(), scores.std()))
scores = cross_val_score(poly2, X, y)
print("Scores for poly2: {}, mean score = {:03.2f}, std = {:03.2f}".format(scores, scores.mean(), scores.std()))
scores = cross_val_score(poly15, X, y)
print("Scores for poly16: {}, mean score = {:03.2f}, std = {:03.2f}".format(scores, scores.mean(), scores.std()))


Scores for regr: [ 0.55280768  0.62096582  0.62899577], mean score = 0.60, std = 0.03
Scores for poly2: [ 0.74106042  0.7598275   0.69354111], mean score = 0.73, std = 0.03
Scores for poly15: [ 0.63291851 -3.462925    0.69988794], mean score = -0.71, std = 1.95

As you can see, there are 3 scores for each model; this is because we used the default CV method, which here is KFold with k=3.
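A different splitter or number of folds can be passed explicitly through the cv argument (a small sketch reusing poly2, X and y; note that newer scikit-learn versions default to 5 folds instead of 3):

from sklearn.model_selection import cross_val_score, KFold

scores5 = cross_val_score(poly2, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores5, scores5.mean(), scores5.std())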

Exercise 1

Using the make_data function, generate new datasets with different sample sizes. Calculate the cross-validation score using one of the splitter classes available in scikit-learn. See how the scores change with the sample size.

Let's start with a smaller dataset, $N=50$:


In [11]:
X_new, y_new = make_data(N=50)

X_new_tr, X_new_ts, y_new_tr, y_new_ts = train_test_split(X_new, y_new)

poly2new = PolynomialRegression(2)
poly2new.fit(X_new_tr, y_new_tr)
plot_regr(X_new_tr, y_new_tr, poly2new)



In [12]:
plot_regr(X_new_ts, y_new_ts, poly2new, color="r")



In [13]:
from sklearn.model_selection import ShuffleSplit

scores = cross_val_score(poly2new, X_new, y_new, cv=ShuffleSplit())
print("Scores for regr: {}, mean score = {:03.2f}, std = {:03.2f}".format(scores, scores.mean(), scores.std()))


Scores for poly2new: [-0.73663368 -0.68503726  0.90746139  0.68001119 -0.05776878  0.84479021
  0.77374705  0.87195499  0.54127374  0.21956967], mean score = 0.34, std = 0.60

As you can see, the mean score is not very good. Now try changing to $N=300$.


In [14]:
# write your solution here

# 1. use make_data with different N (e.g. 50 and 300)

# 2. split the data for training and testing and use PolynomialRegression(2) as your model 

# 3. plot results using plot_regr (both for training data and testing data)

# 4. calculate cross-validation score using cross_val_score

Error metrics for classification

To validate binary classification models we use the error matrix (or confusion matrix):

<img src="../img/sensit_specif.png", width=500>

The error matrix describes four possible scenarios: the true response is either Positive or Negative (1/0), and the model prediction can also take these two values for every sample. That gives us four possibilities:

  • True Positive (TP): the model correctly identifies a Positive case, e.g. correctly classifies a person with malignant cancer
  • False Positive (FP): the model incorrectly predicts Positive for a Negative case, e.g. predicts malignant cancer for a person with benign cancer
  • True Negative (TN): the model correctly identifies a Negative case, e.g. correctly classifies a person with benign cancer
  • False Negative (FN): the model incorrectly predicts Negative for a Positive case, e.g. predicts benign cancer for a person with malignant cancer

Of course we would like to have only True Positives and True Negatives, but in practice we need a single value to optimize. From the error matrix we can calculate the following rates (note that Recall and Sensitivity are two names for the same quantity):

  • Precision: $\frac{TP}{TP + FP}$
  • Recall: $\frac{TP}{TP + FN}$
  • Sensitivity: $\frac{TP}{TP + FN}$
  • Specificity: $\frac{TN}{TN + FP}$

There is always a trade-off between Precision and Recall (or Sensitivity and Specificity). Sometimes combined scores are defined, for example the

F1 score: $2\cdot\frac{Precision \cdot Recall}{Precision + Recall}$
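A quick toy check of these formulas (y_true and y_hat below are made up for illustration):

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_hat  = np.array([1, 0, 1, 0, 1, 0, 1, 0])   # TP=3, FP=1, FN=1, TN=3
p, r = precision_score(y_true, y_hat), recall_score(y_true, y_hat)
print(p, r, 2 * p * r / (p + r), f1_score(y_true, y_hat))   # all four equal 0.75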

Let's calculate Precision and Recall for the breast cancer dataset (this time splitting the data into training and testing sets):


In [15]:
from sklearn import datasets
cancer = datasets.load_breast_cancer()

X_can = cancer.data
y_can = cancer.target

X_can_tr, X_can_ts, y_can_tr, y_can_ts = train_test_split(X_can, y_can, random_state=2)

from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()
clf.fit(X_can_tr[:, :2], y_can_tr)
y_pred = clf.predict(X_can_ts[:, :2])

In [16]:
from sklearn.metrics import precision_score, recall_score, accuracy_score
precision_score(y_can_ts, y_pred), recall_score(y_can_ts, y_pred)


Out[16]:
(0.85416666666666663, 0.94252873563218387)

By default, precision_score and recall_score treat label 1, which in this dataset means "benign", as the positive class. So we correctly recognized about 94% of the benign cases, but about 15% of the cases classified as benign were actually malignant, which is exactly the kind of mistake we would like to avoid...
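For completeness, the full error matrix and the F1 score for the same predictions (y_can_ts and y_pred from the cells above); for binary 0/1 labels, scikit-learn's confusion_matrix has true classes as rows and predicted classes as columns, i.e. [[TN, FP], [FN, TP]]:

from sklearn.metrics import confusion_matrix, f1_score

print(confusion_matrix(y_can_ts, y_pred))
print("F1 =", f1_score(y_can_ts, y_pred))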

And what if we take all features into account?


In [17]:
clf_full = KNeighborsClassifier()
clf_full.fit(X_can_tr, y_can_tr)
y_pred = clf_full.predict(X_can_ts)
precision_score(y_can_ts, y_pred), recall_score(y_can_ts, y_pred)


Out[17]:
(0.93181818181818177, 0.94252873563218387)

This looks better! Recall barely changed, but we have far fewer false positives, so the precision is much higher.

Significance testing

When we use any of these metrics to evaluate our model, we shouldn't look only at the single number. 90 or 95% might look very good, but it doesn't necessarily mean that our model is doing a great job. Imagine a situation where we have 100 samples and only 2 of them have cancer. If we build a model that always answers NO, we will get great accuracy and even better specificity, but we probably shouldn't call this a machine learning algorithm.
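A sketch of that "always NO" scenario on a made-up 100-sample dataset (DummyClassifier simply predicts the most frequent class):

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y_imb = np.array([1] * 2 + [0] * 98)             # only 2 positive cases
X_imb = np.random.random((100, 3))               # the features don't matter here
dummy = DummyClassifier(strategy="most_frequent").fit(X_imb, y_imb)
y_dummy = dummy.predict(X_imb)
print("accuracy =", accuracy_score(y_imb, y_dummy),
      "recall =", recall_score(y_imb, y_dummy))  # accuracy 0.98, recall 0.0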

One way of checking whether our algorithm is actually learning, rather than giving a random or trivial answer, is to perform a significance (permutation) test. The main idea is to check whether the model is sensitive to the training labels: we fit many models (hundreds or thousands) with randomly permuted labels. For every run we calculate a metric, e.g. accuracy, and build a histogram of the results, which forms a null distribution. We then count how many permuted label sets give a score at least as good as the original one and turn that count into a p-value. If our score sits in the middle of the distribution, it is a clear sign that our model is useless.
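Written out by hand, the procedure looks roughly like this (a sketch with only 100 permutations, reusing X_can and y_can from above):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
real_score = cross_val_score(knn, X_can, y_can, scoring="accuracy").mean()

rng = np.random.RandomState(0)
perm_scores = np.array([
    cross_val_score(knn, X_can, rng.permutation(y_can), scoring="accuracy").mean()
    for _ in range(100)])
# fraction of permutations that score at least as well as the real labels
p_value = (np.sum(perm_scores >= real_score) + 1.0) / (len(perm_scores) + 1.0)
print(real_score, p_value)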

Scikit-learn has a special function that does this evaluation for you - permutation_test_score:


In [18]:
from sklearn.model_selection import permutation_test_score

In [19]:
clf = KNeighborsClassifier()
score, permutation_scores, pvalue = permutation_test_score(
    clf, X_can, y_can, scoring="accuracy", cv=None, n_permutations=1000, n_jobs=1)
print("Classification score %s (pvalue : %s)" % (score, pvalue))


Classification score 0.922667780563 (pvalue : 0.000999000999001)

In [20]:
def plot_permutation_score(score, permutation_scores, pvalue):
    plt.hist(permutation_scores, 20, label='Permutation scores',
             edgecolor='black', alpha=0.6)
    ylim = plt.ylim()

    plt.plot(2 * [score], ylim, '--g', linewidth=3,
             label='Classification Score')
    plt.title("p_value = {:06.5f}".format(pvalue))
    plt.ylim(ylim)
    plt.legend()
    plt.xlabel('Score')
plot_permutation_score(score, permutation_scores, pvalue)


You can see that our score is much higher than the scores of the models trained with permuted labels.

Exercise 2

Change the number of neighbors in the KNeighborsClassifier model and run permutation_test_score again. Try a very large number, e.g. 300 - can you explain the result?


In [21]:
clf_n300 = KNeighborsClassifier(n_neighbors=300)
score_n300, permutation_scores_n300, pvalue_n300 = permutation_test_score(
        clf_n300, X_can, y_can, scoring="accuracy", cv=None, n_permutations=1000, n_jobs=1)
print("Classification score %s (pvalue : %s)" % (score_n300, pvalue_n300))


Classification score 0.627420402859 (pvalue : 1.0)

In [22]:
plot_permutation_score(score_n300, permutation_scores_n300, pvalue_n300)


As you can see, now our model is not doing a good job. Let's recall the size of our data:


In [23]:
X_can.shape


Out[23]:
(569, 30)

We have fewer than 600 samples and use even fewer for training, so with n_neighbors=300 we build a model with a very high bias.
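With n_neighbors comparable to the number of training samples, the prediction is close to a majority vote over the whole training set, so the accuracy should end up near the fraction of the majority class:

import numpy as np

print("majority class fraction:", np.bincount(y_can).max() / len(y_can))
# roughly 0.63, in line with the classification score printed above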


In [24]:
# write your solution here

# 1. initialize KNeighborsClassifier with number of neighbours equal to 300 

# 2. run permutation_test_score taking accuracy for scoring 

# 3. use plot_permutation_score to see the results

Exercise 3

Run permutation_test_score for a model built on the Iris data. You can use the original data or the data after PCA.

Let's recreate X_pca from our previous notebook:


In [25]:
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X_ir = iris.data
y_ir = iris.target

n_components = 2
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X_ir)

And run permutation_test_score:


In [26]:
clf_ir = KNeighborsClassifier()

score_ir, permutation_scores_ir, pvalue_ir = permutation_test_score(
    clf_ir, X_pca, y_ir, scoring="accuracy", cv=None, n_permutations=1000, n_jobs=1)
print("Classification score %s (pvalue : %s)" % (score, pvalue))



In [27]:
plot_permutation_score(score_ir, permutation_scores_ir, pvalue_ir)



In [28]:
# write your solution here

# 1. load the iris data and run PCA as we did in the previous notebook

# 2. run permutation_test_score

# 3. use plot_permutation_score to see the results