Cross Validation


In [34]:
# import
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold, train_test_split, cross_val_predict, LeaveOneOut, LeavePOut
from sklearn.model_selection import ShuffleSplit, StratifiedKFold, StratifiedShuffleSplit, GroupKFold, LeaveOneGroupOut
from sklearn.model_selection import LeavePGroupsOut, GroupShuffleSplit, TimeSeriesSplit
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

from scipy.stats import sem
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
iris = load_iris()

X, y = iris.data, iris.target

In [3]:
# splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)

print(X_train.shape, X_test.shape, X_train.shape[0])


(112, 4) (38, 4) 112
cross_val_score uses the KFold or StratifiedKFold strategy by default when cv is an integer or left unset (StratifiedKFold for classifiers, KFold otherwise).
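To see that default in action, one can pass an integer as cv instead of building a splitter by hand. The cell below is a minimal sketch (not executed here) that assumes the X_train / y_train split from the cell above; with a classifier such as SVC, cross_val_score then constructs a StratifiedKFold splitter itself.

In [ ]:
# Minimal sketch (assumes X_train / y_train from the split above are in scope).
# Passing an integer as cv lets cross_val_score build the default splitter
# (StratifiedKFold for a classifier like SVC) instead of an explicit KFold.
default_scores = cross_val_score(SVC(), X_train, y_train, cv=5)
print(default_scores, default_scores.mean())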

In [38]:
# define cross_val func

def xVal_score(clf, X, y, K):

    # create K folds using KFold
    cv = KFold(n_splits=K)

    # Shuffling can be used as well:
    # cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

    # run cross validation and report the mean, std and standard error of the fold scores
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores)
    print("Accuracy Mean : %0.3f" % np.mean(scores))
    print("Std : ", np.std(scores))
    print("Standard Err : +/- {0:0.6f} ".format(sem(scores)))

In [39]:
svc1 = SVC()
xVal_score(svc1, X_train, y_train, 2)


[ 0.98214286  0.94642857]
Accuracy Mean : 0.964
Std :  0.0178571428571
Standard Err : +/- 0.017857 

In [8]:
# define cross_val predict
# The function cross_val_predict has a similar interface to cross_val_score, but returns, 
# for each element in the input, the prediction that was obtained for that element when it 
# was in the test set. Only cross-validation strategies that assign all elements to a test 
# set exactly once can be used (otherwise, an exception is raised).

def xVal_predict(clf, X, y, K):
    
    # create K folds using KFold
    cv = KFold(n_splits=K)

    # Shuffling can be used as well:
    # cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
    
    # doing cross validation prediction
    predicted = cross_val_predict(clf, X, y, cv=cv)
    print(predicted)
    print("Accuracy Score : %0.3f" % accuracy_score(y, predicted))

In [9]:
xVal_predict(svc1, X_train, y_train, 10)


[1 0 0 2 2 2 0 1 2 2 1 0 0 0 1 0 1 0 2 0 0 1 2 2 1 2 2 1 2 2 0 2 0 2 2 0 1
 0 0 0 1 1 2 2 0 2 0 1 2 2 1 0 1 2 1 2 1 1 0 1 0 0 0 0 2 2 0 1 1 2 0 1 2 2
 0 0 2 1 0 2 1 0 1 1 2 1 0 1 1 1 1 2 0 1 1 0 1 0 2 0 2 2 0 2 2 0 0 0 1 0 1
 0]
Accuracy Score : 0.973

Cross Validation Iterator

K-Fold - KFold divides all the samples into k groups of samples, called folds (if k = n, this is equivalent to the Leave One Out strategy), of equal size where possible. The prediction function is learned using k - 1 folds, and the fold left out is used for testing.


In [11]:
X = [1,2,3,4,5]
kf = KFold(n_splits=2)
print(kf)
for i in kf.split(X):
    print(i)


KFold(n_splits=2, random_state=None, shuffle=False)
(array([3, 4]), array([0, 1, 2]))
(array([0, 1, 2]), array([3, 4]))

Leave One Out (LOO) - LeaveOneOut (or LOO) is a simple cross-validation. Each learning set is created by taking all the samples except one, the test set being the sample left out. Thus, for n samples, we have n different training sets and n different test sets. This cross-validation procedure does not waste much data, as only one sample is removed from the training set:


In [12]:
X = [1,2,3,4,5]
loo = LeaveOneOut()
print(loo)
for i in loo.split(X):
    print(i)


LeaveOneOut()
(array([1, 2, 3, 4]), array([0]))
(array([0, 2, 3, 4]), array([1]))
(array([0, 1, 3, 4]), array([2]))
(array([0, 1, 2, 4]), array([3]))
(array([0, 1, 2, 3]), array([4]))

Leave P Out (LPO) - LeavePOut is very similar to LeaveOneOut as it creates all the possible training/test sets by removing p samples from the complete set. For n samples, this produces $\binom{n}{p}$ train-test pairs. Unlike LeaveOneOut and KFold, the test sets will overlap for p > 1.


In [13]:
X = [1,2,3,4,5]
lpo = LeavePOut(p=3)
print(lpo)
for i in lpo.split(X):
    print(i)


LeavePOut(p=3)
(array([3, 4]), array([0, 1, 2]))
(array([2, 4]), array([0, 1, 3]))
(array([2, 3]), array([0, 1, 4]))
(array([1, 4]), array([0, 2, 3]))
(array([1, 3]), array([0, 2, 4]))
(array([1, 2]), array([0, 3, 4]))
(array([0, 4]), array([1, 2, 3]))
(array([0, 3]), array([1, 2, 4]))
(array([0, 2]), array([1, 3, 4]))
(array([0, 1]), array([2, 3, 4]))

Random permutations cross-validation a.k.a. Shuffle & Split - The ShuffleSplit iterator will generate a user-defined number of independent train/test dataset splits. Samples are first shuffled and then split into a pair of train and test sets.

It is possible to control the randomness for reproducibility of the results by explicitly seeding the random_state pseudo random number generator.


In [14]:
X = [1,2,3,4,5]
ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
print(ss)
for i in ss.split(X):
    print(i)


ShuffleSplit(n_splits=3, random_state=0, test_size=0.25, train_size=None)
(array([1, 3, 4]), array([2, 0]))
(array([1, 4, 3]), array([0, 2]))
(array([4, 0, 2]), array([1, 3]))

Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance, there could be several times more negative samples than positive samples. In such cases it is recommended to use stratified sampling, as implemented in StratifiedKFold and StratifiedShuffleSplit, to ensure that relative class frequencies are approximately preserved in each train and validation fold.

Stratified k-fold
StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.


In [16]:
X = np.ones(10)
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

skf = StratifiedKFold(n_splits=3)
for i in skf.split(X, y):
    print(i)


(array([2, 3, 6, 7, 8, 9]), array([0, 1, 4, 5]))
(array([0, 1, 3, 4, 5, 8, 9]), array([2, 6, 7]))
(array([0, 1, 2, 4, 5, 6, 7]), array([3, 8, 9]))

Stratified Shuffle Split
StratifiedShuffleSplit is a variation of ShuffleSplit, which returns stratified splits, i.e which creates splits by preserving the same percentage for each target class as in the complete set.


In [19]:
X = np.ones(10)
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=33)
for i in sss.split(X, y):
    print(i)


(array([1, 9, 8, 2, 3, 7, 4], dtype=int64), array([5, 6, 0], dtype=int64))
(array([5, 0, 1, 7, 6, 8, 2], dtype=int64), array([3, 9, 4], dtype=int64))
(array([9, 1, 4, 5, 6, 0, 2], dtype=int64), array([7, 3, 8], dtype=int64))

Cross-validation iterators for grouped data

The i.i.d. assumption is broken if the underlying generative process yields groups of dependent samples.

Such a grouping of data is domain specific. An example would be medical data collected from multiple patients, with multiple samples taken from each patient; such data is likely to be dependent on the individual group. In our example, the patient id for each sample would be its group identifier.

In this case we would like to know if a model trained on a particular set of groups generalizes well to the unseen groups. To measure this, we need to ensure that all the samples in the validation fold come from groups that are not represented at all in the paired training fold.

The following cross-validation splitters can be used to do that. The grouping identifier for the samples is specified via the groups parameter.

Group k-fold
GroupKFold is a variation of k-fold which ensures that the same group is not represented in both the testing and training sets. For example, if the data is obtained from different subjects with several samples per subject, and if the model is flexible enough to learn from highly person-specific features, it could fail to generalize to new subjects. GroupKFold makes it possible to detect this kind of overfitting situation.


In [21]:
X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

gkf = GroupKFold(n_splits=3)
for train, test in gkf.split(X, y, groups=groups):
    print("%s %s" % (train, test))


[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]

LeaveOneGroupOut
LeaveOneGroupOut is a cross-validation scheme which holds out the samples according to a third-party provided array of integer groups. This group information can be used to encode arbitrary domain specific pre-defined cross-validation folds.

Each training set is thus constituted by all the samples except the ones related to a specific group.


In [23]:
X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

gkf = LeaveOneGroupOut()
for train, test in gkf.split(X, y, groups=groups):
    print("%s %s" % (train, test))


[3 4 5 6 7 8 9] [0 1 2]
[0 1 2 6 7 8 9] [3 4 5]
[0 1 2 3 4 5] [6 7 8 9]

Leave P Groups Out
LeavePGroupsOut is similar to LeaveOneGroupOut, but removes the samples related to P groups for each training/test set.


In [30]:
X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

gkf = LeavePGroupsOut(n_groups=2)
for train, test in gkf.split(X, y, groups=groups):
    print("%s %s" % (train, test))


[6 7 8 9] [0 1 2 3 4 5]
[3 4 5] [0 1 2 6 7 8 9]
[0 1 2] [3 4 5 6 7 8 9]

Group Shuffle Split
The GroupShuffleSplit iterator behaves as a combination of ShuffleSplit and LeavePGroupsOut, and generates a sequence of randomized partitions in which a subset of groups is held out for each split.


In [33]:
X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

gkf = GroupShuffleSplit(n_splits=4, test_size=0.5, random_state=33)
for train, test in gkf.split(X, y, groups=groups):
    print("%s %s" % (train, test))


[0 1 2] [3 4 5 6 7 8 9]
[0 1 2] [3 4 5 6 7 8 9]
[6 7 8 9] [0 1 2 3 4 5]
[3 4 5] [0 1 2 6 7 8 9]

Time Series Split
TimeSeriesSplit is a variation of k-fold which returns the first k folds as the train set and the (k+1)-th fold as the test set. Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them. It also adds all surplus data to the first training partition, which is always used to train the model.

This class can be used to cross-validate time series data samples that are observed at fixed time intervals.


In [35]:
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(n_splits=3)
print(tscv)  
for train, test in tscv.split(X):
    print("%s %s" % (train, test))


TimeSeriesSplit(n_splits=3)
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]

Model evaluation: quantifying the quality of predictions

Estimator score method: Estimators have a score method providing a default evaluation criterion for the problem they are designed to solve.
Scoring parameter: Model-evaluation tools using cross-validation (such as model_selection.cross_val_score and model_selection.GridSearchCV) rely on an internal scoring strategy, which can be set via the scoring parameter.
Metric functions: The metrics module implements functions assessing prediction error for specific purposes.
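
As a quick illustration, the cell below is a minimal sketch (not executed here) of the three routes, reusing svc1 and the iris train/test split from the cells above; the f1_macro scorer is an arbitrary choice for the example.

In [ ]:
# Minimal sketch (assumes svc1, X_train, X_test, y_train, y_test from earlier cells;
# the f1_macro metric is an arbitrary choice for illustration).
from sklearn.metrics import f1_score

svc1.fit(X_train, y_train)

# 1. Estimator score method: mean accuracy on the held-out test set
print("score()           :", svc1.score(X_test, y_test))

# 2. Scoring parameter: cross_val_score with an explicit scoring string
print("cv f1_macro scores:", cross_val_score(svc1, X_train, y_train, cv=5, scoring="f1_macro"))

# 3. Metric function: computed directly from predictions
print("test f1_macro     :", f1_score(y_test, svc1.predict(X_test), average="macro"))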


In [ ]: