Cross-validation


Training/validation/test data sets

  • Training set: the data set for training your models.
  • Validation set: the data set used to test the performance of the models you have built on the training set. Based on this performance, you choose the best (final) model.
  • Test set: the data set used to test the performance of your final model.

K-fold cross-validation steps (k=4 as an example).

  • step 1: split your data into a training set and a test set (for example, 80% training and 20% test). The test set is never used in model training or selection.
  • step 2: split the training set into k (k=4) equal subsets: 3 subsets for training + 1 subset for validation.
  • step 3: train your models on the 3 training subsets and calculate a performance score on the remaining validation subset.
  • step 4: choose a different subset for validation and repeat step 3 until every subset has been used as the validation subset.
  • step 5: for k=4 fold cross-validation, each model has been validated on 4 subsets and therefore has 4 performance scores. Calculate the average of these 4 performance scores for each model, and use the average score to select the best (final) model.
  • step 6: apply your final model to the untouched test data and see how it performs.
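The steps above can be sketched in plain Python. The fit and score functions here are hypothetical stand-ins (a toy "model" that is just the training mean, scored by negative absolute error), not a real learner; the point is the fold bookkeeping and the averaging in step 5.

```python
# Minimal sketch of k-fold cross-validation in plain Python.
# fit() and score() are hypothetical stand-ins for a real model
# and a real performance metric.

def k_fold_scores(data, k, fit, score):
    """Return the average validation score over k folds."""
    fold_size = len(data) // k
    folds = [data[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    scores = []
    for i in range(k):
        validation = folds[i]                     # held-out fold (step 4 rotates it)
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        model = fit(training)                     # step 3: train
        scores.append(score(model, validation))   # step 3: score on held-out fold
    return sum(scores) / k                        # step 5: average the k scores

# Toy example: the "model" is the mean of the training data,
# scored by negative absolute error against the validation mean.
data = list(range(16))
fit = lambda training: sum(training) / len(training)
score = lambda model, validation: -abs(model - sum(validation) / len(validation))

avg = k_fold_scores(data, k=4, fit=fit, score=score)
print(avg)
```

In practice you would compute this average for every candidate model and keep the one with the best score, as in the PySpark example later in this section.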

Example of k-fold cross-validation

Build parameter grids

  • parameter grid: all combinations of the tunable parameters in your model.
  • example: if I want to train a logistic regression model with 4 different values of regParam and 3 different values of elasticNetParam, I will have 4 x 3 = 12 models to train and validate.
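A quick way to see where the 12 comes from is to enumerate the combinations in plain Python with itertools.product, using the same candidate values as the PySpark grid in this section:

```python
from itertools import product

# Every (regParam, elasticNetParam) combination forms one candidate model.
reg_params = [0, 0.5, 1, 2]          # 4 values
elastic_net_params = [0, 0.5, 1]     # 3 values

param_grid = [
    {"regParam": r, "elasticNetParam": e}
    for r, e in product(reg_params, elastic_net_params)
]

print(len(param_grid))  # 4 x 3 = 12 models to train and validate
```

ParamGridBuilder does the same enumeration for you and returns the grid in the format CrossValidator expects.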

In [ ]:
from pyspark.ml.classification import LogisticRegression
blor = LogisticRegression(featuresCol='indexed_features', labelCol='label', family='binomial')

from pyspark.ml.tuning import ParamGridBuilder
param_grid = ParamGridBuilder().\
    addGrid(blor.regParam, [0, 0.5, 1, 2]).\
    addGrid(blor.elasticNetParam, [0, 0.5, 1]).\
    build()

Split data into training and test sets
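With a PySpark DataFrame the split is a one-liner, training, test = data.randomSplit([0.8, 0.2], seed=42); the plain-Python sketch below shows the same idea of a shuffled 80/20 split (the data here is a made-up stand-in for your data set):

```python
import random

# Plain-Python sketch of an 80/20 train/test split.
data = list(range(100))          # stand-in for your data set
rng = random.Random(42)          # fixed seed for reproducibility
shuffled = data[:]
rng.shuffle(shuffled)

cut = int(len(shuffled) * 0.8)   # 80% training, 20% test
training, test = shuffled[:cut], shuffled[cut:]

print(len(training), len(test))  # 80 20
```

Note that randomSplit gives only approximately 80/20 proportions on a real DataFrame, since each row is assigned independently at random.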

Run k-fold (k=4) cross-validation


In [ ]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()

from pyspark.ml.tuning import CrossValidator
cv = CrossValidator(estimator=blor, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=4)

cvModel = cv.fit(training)
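After fitting, the fitted CrossValidatorModel exposes cvModel.avgMetrics (one average score per parameter-grid entry, in grid order) and cvModel.bestModel. The selection it performs amounts to an argmax over the averages, sketched here with made-up scores standing in for avgMetrics:

```python
# Made-up average CV scores, one per grid entry (stand-ins for
# cvModel.avgMetrics); CrossValidator keeps the grid entry with
# the best average score as bestModel.
avg_metrics = [0.71, 0.74, 0.69, 0.80, 0.78, 0.73]

best_index = max(range(len(avg_metrics)), key=lambda i: avg_metrics[i])
print(best_index, avg_metrics[best_index])
```

The final step is then to evaluate cvModel (which transforms with the best model) on the untouched test set, e.g. evaluator.evaluate(cvModel.transform(test)).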
