This notebook explores the main elements of Optunity's cross-validation facilities, including standard cross-validation, using strata and clusters while constructing folds, using different aggregators, and supplying folds yourself (e.g. produced by scikit-learn).
We recommend perusing the related documentation for more details.
Nested cross-validation is available as a separate notebook.
In [2]:
import optunity
import optunity.cross_validation
We start by generating some toy data containing 6 instances which we will partition into folds.
In [2]:
data = list(range(6))
labels = [True] * 3 + [False] * 3
Each function to be decorated with cross-validation functionality must accept the following arguments: x_train and x_test (the training and test data), and, if labels are provided, y_train and y_test (the corresponding labels).
These arguments will be set implicitly by the cross-validation decorator to match the right folds. Any remaining arguments to the decorated function remain as free parameters that must be set later on.
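For illustration, here is a minimal sketch of such a decorated function, reusing data and labels defined above; the regularization argument is a made-up free hyperparameter, not something Optunity requires.

# x_train/y_train/x_test/y_test are supplied by the decorator for each split;
# `regularization` (a hypothetical hyperparameter) remains a free argument
@optunity.cross_validated(x=data, y=labels, num_folds=2)
def score(x_train, y_train, x_test, y_test, regularization):
    # toy score; a real objective would train a model on the train fold
    # and evaluate it on the test fold
    return regularization * len(x_test)

score(0.5)  # runs all splits and returns the aggregated (mean) score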
Let's start with the basics and look at Optunity's cross-validation in action. We use an objective function that simply prints out the train and test data in every split to see what's going on.
In [3]:
def f(x_train, y_train, x_test, y_test):
    print("")
    print("train data:\t" + str(x_train) + "\t train labels:\t" + str(y_train))
    print("test data:\t" + str(x_test) + "\t test labels:\t" + str(y_test))
    return 0.0
We start with 2 folds, which leads to equally sized train and test partitions.
In [4]:
f_2folds = optunity.cross_validated(x=data, y=labels, num_folds=2)(f)
print("using 2 folds")
f_2folds()
Out[4]:
In [5]:
# f_2folds as defined above would typically be written using decorator syntax as follows
# we don't do that in these examples so we can reuse the toy objective function
@optunity.cross_validated(x=data, y=labels, num_folds=2)
def f_2folds(x_train, y_train, x_test, y_test):
    print("")
    print("train data:\t" + str(x_train) + "\t train labels:\t" + str(y_train))
    print("test data:\t" + str(x_test) + "\t test labels:\t" + str(y_test))
    return 0.0
If we use 3 folds instead of 2, we get 3 train/test splits in which the training set is twice the size of the test set.
In [6]:
f_3folds = optunity.cross_validated(x=data, y=labels, num_folds=3)(f)
print("using 3 folds")
f_3folds()
Out[6]:
If we do two iterations of 3-fold cross-validation (denoted by 2x3 fold), two sets of folds are generated and evaluated.
In [7]:
f_2x3folds = optunity.cross_validated(x=data, y=labels, num_folds=3, num_iter=2)(f)
print("using 2x3 folds")
f_2x3folds()
Out[7]:
Strata are defined as sets of instances that should be spread out across folds as much as possible (e.g. stratify patients by age). Clusters are sets of instances that must be put in a single fold (e.g. cluster measurements of the same patient).
Optunity allows you to specify strata and/or clusters that must be accounted for while constructing cross-validation folds. Not all instances have to belong to a stratum or cluster.
We start by illustrating strata. Strata are specified as a list of lists of instance indices; each list defines one stratum. We will reuse the toy data and objective function specified above. We create two strata with two instances each: $\{0, 1\}$ and $\{2, 3\}$. These instances will be spread across folds.
In [8]:
strata = [[0, 1], [2, 3]]
f_stratified = optunity.cross_validated(x=data, y=labels, strata=strata, num_folds=3)(f)
f_stratified()
Out[8]:
Clusters work similarly, except that now instances within a cluster are guaranteed to be placed within a single fold. The way to specify clusters is identical to strata. We create two clusters: $\{0, 1\}$ and $\{2, 3\}$. These pairs will always occur in a single fold.
In [9]:
clusters = [[0, 1], [2, 3]]
f_clustered = optunity.cross_validated(x=data, y=labels, clusters=clusters, num_folds=3)(f)
f_clustered()
Out[9]:
Strata and clusters can be used together. Let's say we have the following configuration: one stratum $\{0, 1, 2\}$ and two clusters $\{0, 3\}$ and $\{4, 5\}$.
In this particular example, instances 1 and 2 will inevitably end up in a single fold, even though they are part of one stratum. This happens because the total data set has size 6, and 4 instances are already in clusters.
In [10]:
strata = [[0, 1, 2]]
clusters = [[0, 3], [4, 5]]
f_strata_clustered = optunity.cross_validated(x=data, y=labels, clusters=clusters, strata=strata, num_folds=3)(f)
f_strata_clustered()
Out[10]:
Aggregators are used to combine the scores per fold into a single result. The default approach used in cross-validation is to take the mean of all scores. In some cases, we might be interested in worst-case or best-case performance, the spread across folds, and so on.
Optunity allows passing a custom callable to be used as aggregator.
The default aggregation in Optunity is to compute the mean across folds.
In [11]:
@optunity.cross_validated(x=data, num_folds=3)
def f(x_train, x_test):
    result = x_test[0]
    print(result)
    return result
f(1)
Out[11]:
This can be replaced by any function, e.g. min or max.
In [12]:
@optunity.cross_validated(x=data, num_folds=3, aggregator=max)
def fmax(x_train, x_test):
    result = x_test[0]
    print(result)
    return result
fmax(1)
Out[12]:
In [13]:
@optunity.cross_validated(x=data, num_folds=3, aggregator=min)
def fmin(x_train, x_test):
    result = x_test[0]
    print(result)
    return result
fmin(1)
Out[13]:
Often, it may be useful to retain all intermediate results, not just the final aggregated data. This is made possible via the optunity.cross_validation.mean_and_list aggregator. This aggregator computes the mean for internal use in cross-validation, but also returns a list of lists containing the full evaluation results.
In [14]:
@optunity.cross_validated(x=data, num_folds=3,
                          aggregator=optunity.cross_validation.mean_and_list)
def f_full(x_train, x_test, coeff):
    return x_test[0] * coeff

# evaluate f_full
mean_score, all_scores = f_full(1.0)
print(mean_score)
print(all_scores)
Note that a cross-validation based on the mean_and_list aggregator essentially returns a tuple of results. If the result is iterable, all solvers in Optunity use the first element as the objective function value. You can let the cross-validation procedure return other useful statistics too, which you can access from the solver trace.
In [15]:
opt_coeff, info, _ = optunity.minimize(f_full, coeff=[0, 1], num_evals=10)
print(opt_coeff)
print("call log")
for args, val in zip(info.call_log['args']['coeff'], info.call_log['values']):
    print(str(args) + '\t\t' + str(val))
In this example we will show how to use cross-validation methods that are provided by scikit-learn in conjunction with Optunity. To do this we provide Optunity with the folds that scikit-learn produces in a specific format.
In supervised learning, datasets often have unbalanced labels. When performing cross-validation with unbalanced data it is good practice to preserve the percentage of samples for each class across folds. To achieve this label balance we will use StratifiedKFold.
In [17]:
data = list(range(20))
labels = [1 if i%4==0 else 0 for i in range(20)]
@optunity.cross_validated(x=data, y=labels, num_folds=5)
def unbalanced_folds(x_train, y_train, x_test, y_test):
    print("")
    print("train data:\t" + str(x_train) + "\ntrain labels:\t" + str(y_train) + '\n')
    print("test data:\t" + str(x_test) + "\ntest labels:\t" + str(y_test) + '\n')
    return 0.0
unbalanced_folds()
Out[17]:
Notice above how the test label sets have a varying number of positive samples: some have none, some have one, and some have two.
In [19]:
from sklearn.cross_validation import StratifiedKFold
stratified_5folds = StratifiedKFold(labels, n_folds=5)
folds = [[list(test) for train, test in stratified_5folds]]
@optunity.cross_validated(x=data, y=labels, folds=folds, num_folds=5)
def balanced_folds(x_train, y_train, x_test, y_test):
    print("")
    print("train data:\t" + str(x_train) + "\ntrain labels:\t" + str(y_train) + '\n')
    print("test data:\t" + str(x_test) + "\ntest labels:\t" + str(y_test) + '\n')
    return 0.0
balanced_folds()
Out[19]:
Now all of our train sets have four positive samples and our test sets have one positive sample.
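As an aside, the sklearn.cross_validation module used above was removed in later scikit-learn releases. A minimal sketch of the same fold construction with sklearn.model_selection (assuming scikit-learn 0.18 or newer):

from sklearn.model_selection import StratifiedKFold

# split(data, labels) yields (train, test) index arrays per fold;
# we keep only the test indices, in the nested-list format Optunity expects
skf = StratifiedKFold(n_splits=5)
folds = [[list(test) for train, test in skf.split(data, labels)]]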
To use predetermined folds, place a list of the test sample indices for each fold into a list, and then insert that list into another list. Why so many nested lists? Because you can perform multiple cross-validation runs by setting num_iter appropriately and appending num_iter lists of test samples to the outermost list. Note that the test samples for a given fold are the indices you provide; the train samples for that fold are all indices from the other test sets joined together. If not done carefully, this may lead to duplicated samples in a train set, and to samples that fall in both the train and test set of a fold if a data point appears in multiple folds' test sets.
In [3]:
data = list(range(6))
labels = [True] * 3 + [False] * 3
fold1 = [[0, 3], [1, 4], [2, 5]]
fold2 = [[0, 5], [1, 4], [0, 3]] # notice what happens when the indices are not unique
folds = [fold1, fold2]
@optunity.cross_validated(x=data, y=labels, folds=folds, num_folds=3, num_iter=2)
def multiple_iters(x_train, y_train, x_test, y_test):
    print("")
    print("train data:\t" + str(x_train) + "\t train labels:\t" + str(y_train))
    print("test data:\t" + str(x_test) + "\t\t test labels:\t" + str(y_test))
    return 0.0
multiple_iters()
Out[3]: