Selecting model hyperparameters by cross-validation

The overview is as follows: we will split the dataset into args.num_folds distinct partitions ("folds"). (This description uses args.num_folds=10.) For each set of hyperparameters, eight of the folds are used for training, one for validation, and one for testing. That process is then repeated using each fold in turn as the validation set. In this work, when fold i is the validation set, fold (i+1) % 10 is the test set; the remaining folds form the training set. Other strategies are possible, but this one yields exactly one prediction for each data record in both the validation role and the test role for each set of hyperparameters.
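
As a concrete illustration, the fold rotation described above can be sketched as follows (a minimal, self-contained example):

In [ ]:
num_folds = 10

for val_fold in range(num_folds):
    test_fold = (val_fold + 1) % num_folds
    train_folds = [f for f in range(num_folds)
                       if f not in {val_fold, test_fold}]
    # e.g., val_fold=9 gives test_fold=0 and train_folds=[1, ..., 8]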

The rough purpose of each dataset type is as follows.

  • training. Given the hyperparameters, fit the model parameters.
  • validation. Select the hyperparameters which perform best on an unseen dataset.
  • testing. Evaluate the selected hyperparameters on a second, independent unseen dataset.

Goal

This notebook demonstrates the necessary steps to distribute training across a cluster using dask, select hyperparameters using a validation set, collect the test set predictions, and evaluate the model performance using the pyllars library.


In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

# initialize a logger for ipython
import pyllars.logging_utils as logging_utils
logger = logging_utils.get_ipython_logger()

# create an argparse namespace to hold parameters
from argparse import Namespace
args = Namespace()

# create (or connect to) a dask cluster
import pyllars.dask_utils as dask_utils
cluster_location = "LOCAL"

cluster_restart = False
dask_utils.add_dask_values_to_args(
    args,
    cluster_location=cluster_location,
    num_procs=3,
    num_threads_per_proc=1
)

dask_client, cluster = dask_utils.connect(args)

# machine learning imports
import sklearn.datasets
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing

import xgboost

# other tools and helpers
import itertools
import json


INFO     : [dask_utils]: starting local dask cluster
DEBUG    : Using selector: EpollSelector
DEBUG    : Using selector: EpollSelector
DEBUG    : Using selector: EpollSelector
DEBUG    : Using selector: EpollSelector
DEBUG    : Loaded backend module://ipykernel.pylab.backend_inline version unknown.

In [2]:
import pyllars.ml_utils as ml_utils

In [3]:
import numpy as np
import pandas as pd

Provide "command line" arguments


In [4]:
args.random_state = 8675309 # several steps require a seed; we use the same one throughout for reproducibility
args.num_folds = 10 # use 10-fold cross-validation

args.evaluation_metric = 'mean_squared_error'
args.selection_strategy = np.argmin

In [5]:
# standard pydata imports
import joblib
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import pathlib
import seaborn as sns; sns.set(style='white', color_codes=True)
import tqdm

TODO: The evaluation always uses predict. Currently, there is no way to tell it to use predict_proba (or anything else).
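
Until such an option exists, one possible workaround is a thin wrapper whose predict delegates to predict_proba. The class below is a hypothetical sketch for classification settings; it is not part of ml_utils.

In [ ]:
import sklearn.base

class PredictProbaWrapper(sklearn.base.BaseEstimator):
    """Expose predict_proba through the predict interface (sketch)."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None, **fit_params):
        self.estimator.fit(X, y, **fit_params)
        return self

    def predict(self, X):
        # return the probability of the positive class
        return self.estimator.predict_proba(X)[:, 1]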


In [6]:
from typing import Any, Callable, Container, Dict, Iterable, List, NamedTuple, Optional, Set, Tuple

In [7]:
# this helper function is *not* part of ml_utils, so we are free
# to adapt it to the problem at hand
def evaluate_hyperparameters_helper(
        hv:List,
        args:Namespace,
        estimator_template:sklearn.base.BaseEstimator,
        data:pd.DataFrame,
        collect_metrics:Callable,
        train_folds:Optional[Any]=None,
        split_field:str='fold',
        target_field:str='target',
        target_transform:Optional[Callable]=None,
        target_inverse_transform:Optional[Callable]=None,
        collect_metrics_kwargs:Optional[Dict]=None,
        fields_to_ignore:Optional[Container[str]]=None) -> NamedTuple:
    
    # unpack the (hyperparameters, validation_fold) tuple from our iterator
    hyperparameters, validation_folds = hv
    
    # the fold following the validation fold serves as the test fold
    test_folds = (validation_folds + 1) % args.num_folds
    
    res = ml_utils.evaluate_hyperparameters(
        estimator_template=estimator_template,
        hyperparameters=hyperparameters,
        validation_folds=validation_folds,
        test_folds=test_folds,
        data=data,
        collect_metrics=collect_metrics,
        train_folds=train_folds,
        split_field=split_field,
        target_field=target_field,
        target_transform=target_transform,
        target_inverse_transform=target_inverse_transform,
        collect_metrics_kwargs=collect_metrics_kwargs,
        fields_to_ignore=fields_to_ignore
    )
    
    return res

Load a small regression dataset

We will use the Boston housing dataset made available in sklearn. All of its features are numeric, and the target is also numeric.


In [8]:
def load_dataset():
    data = sklearn.datasets.load_boston()
    df = pd.DataFrame(data['data'], columns=data['feature_names'])
    df['target'] = data['target']
    return df

In [9]:
df = load_dataset()

###
# Determine the fold of each row.
#
# N.B. This could be performed once in an "offline" step
#    and stored as another column in the dataframe
###
folds = ml_utils.get_cv_folds(
    df['target'],
    num_splits=args.num_folds,
    use_stratified=False, # stratified does not work for regression
    shuffle=True,
    random_state=args.random_state # ensure we always shuffle the same way
)

# add a column to indicate the fold of each row
df['fold'] = folds
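
The "offline" fold assignment mentioned in the comment above could also be done directly with scikit-learn. A minimal sketch using sklearn.model_selection.KFold (an alternative to, not necessarily the internals of, ml_utils.get_cv_folds):

In [ ]:
kf = sklearn.model_selection.KFold(
    n_splits=args.num_folds,
    shuffle=True,
    random_state=args.random_state
)

# assign each row the index of the split in which it is held out
offline_folds = np.zeros(len(df), dtype=int)
for fold, (_, held_out_idx) in enumerate(kf.split(df)):
    offline_folds[held_out_idx] = fold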

In [10]:
df.head()


Out[10]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target fold
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0 3
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6 9
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7 5
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4 4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2 6

Create our estimator and hyperparameter grid

We will use a simple scaling followed by extreme gradient boosting for regression.


In [11]:
# We could also include things like PCA here. However, one
# advantage of trees (and forests) is that we can assign
# importances to features; if we use dimensionality reduction
# or other techniques which significantly change the
# interpretation of the features, though, we largely lose
# that advantage.
estimator_template = sklearn.pipeline.Pipeline([
    ('scaler',sklearn.preprocessing.StandardScaler()),
    ('xgb', xgboost.XGBRegressor())
])

# In practice, this is a small hyperparameter grid; xgboost has
# many more hyperparameters that can be worth investigating.
#
# A quick overview of some of the most important hyperparameters and
# their interpretation is available here:
#   https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/ 
hyperparam_grid = sklearn.model_selection.ParameterGrid({
    'xgb__n_estimators': [50, 100, 500],
    'xgb__learning_rate': [0.01, 0.1],
})

Create an iterator over folds and hyperparameter configurations

As described in the introduction, we will evaluate each set of hyperparameters in our grid using each fold in turn as the validation set (training on eight of the remaining folds and testing on the other one). For selecting our final set of hyperparameters, we will use performance on the validation fold. (More details are given on this procedure below.)

Concretely, we will accomplish this by iterating over the cross-product of the sets of hyperparameters and validation set folds. (As described, given the validation fold index, we can determine the training and testing folds.)


In [12]:
# for simplicity, just create lists of everything

# In principle, a lazy generator or something more fancy could
# be used; in practice, this is usually not necessary unless
# the hyperparameter grid contains very large objects.
hyperparam_grid = list(hyperparam_grid)
folds = list(range(args.num_folds))

# an iterator over (hyperparameter, validation_fold) tuples
hp_fold_it = itertools.product(hyperparam_grid, folds)
hp_fold_it = list(hp_fold_it)


In [14]:
hyperparameters, validation_fold = hp_fold_it[51]
test_fold = (validation_fold + 1) % args.num_folds

res = ml_utils.evaluate_hyperparameters(
    estimator_template=estimator_template,
    hyperparameters=hyperparameters,
    validation_folds=validation_fold,
    test_folds=test_fold,
    data=df,
    split_field='fold',
    target_field='target',
    collect_metrics=ml_utils.collect_regression_metrics,
    target_transform=np.log1p,
    target_inverse_transform=np.expm1
)

res.metrics_val


Out[14]:
{'explained_variance': 0.8744051444128544,
 'mean_absolute_error': 1.7798925885967178,
 'mean_squared_error': 7.453449727978154,
 'median_absolute_error': 1.0303699493408196,
 'r2': 0.8743925298651639}

In [15]:
res = evaluate_hyperparameters_helper(
    hp_fold_it[51],
    args=args,
    estimator_template=estimator_template,
    data=df,
    split_field='fold',
    target_field='target',
    collect_metrics=ml_utils.collect_regression_metrics,
    target_transform=np.log1p,
    target_inverse_transform=np.expm1
)

res.metrics_val


Out[15]:
{'explained_variance': 0.8744051444128544,
 'mean_absolute_error': 1.7798925885967178,
 'mean_squared_error': 7.453449727978154,
 'median_absolute_error': 1.0303699493408196,
 'r2': 0.8743925298651639}

In [16]:
res.hyperparameters_str


Out[16]:
'{"xgb__learning_rate": 0.1, "xgb__n_estimators": 500}'

In [17]:
f_res = dask_utils.apply_iter(
    hp_fold_it,
    dask_client,
    evaluate_hyperparameters_helper,
    args=args,
    estimator_template=estimator_template,
    data=df,
    split_field='fold',
    target_field='target',
    collect_metrics=ml_utils.collect_regression_metrics,
    target_transform=np.log1p,
    target_inverse_transform=np.expm1,
    return_futures=True
)


DEBUG    : [dask_utils.apply_iter] submitting jobs to cluster
100%|██████████| 60/60 [00:05<00:00, 10.38it/s]

In [19]:
dask_utils.check_status(f_res)


Out[19]:
Counter({'finished': 60})

In [20]:
all_res = dask_utils.collect_results(f_res)

In [21]:
all_res[51].metrics_val


Out[21]:
{'explained_variance': 0.8744051444128544,
 'mean_absolute_error': 1.7798925885967178,
 'mean_squared_error': 7.453449727978154,
 'median_absolute_error': 1.0303699493408196,
 'r2': 0.8743925298651639}

In [22]:
# flatten a single evaluation result into a dict suitable for
# one row of a results data frame
def _get_res(res):
    ret_val = {
        'validation_{}'.format(k): v
            for k,v in res.metrics_val.items()
    }
    
    ret_test = {
        'test_{}'.format(k): v
            for k,v in res.metrics_test.items()
    }
    
    ret = ret_val
    ret.update(ret_test)
    
    hp_string = json.dumps(res.hyperparameters)
    ret['hyperparameters_str'] = hp_string
    
    ret['hyperparameters'] = res.hyperparameters
    ret['validation_fold'] = res.fold_val
    ret['test_fold'] = res.fold_test
    return ret

In [23]:
###
# Create the results data frame
###
results = [
    _get_res(res) for res in all_res
]

df_results = pd.DataFrame(results)

In [ ]:
df_results.head()

In [24]:
###
# Based on the performance on the validation set, select
# the best hyperparameters.
###
hp_groups = df_results.groupby('hyperparameters_str')

validation_evaluation_metric = "validation_{}".format(args.evaluation_metric)
test_evaluation_metric = "test_{}".format(args.evaluation_metric)

val_performance = hp_groups[validation_evaluation_metric].mean()

# now, select the best. N.B. np.argmin on a pandas Series returns the
# index label here (the hyperparameters string); with newer versions of
# pandas, val_performance.idxmin() is the equivalent operation.
val_best = args.selection_strategy(val_performance)
val_best

In [27]:
m_val_best = (df_results['hyperparameters_str'] == val_best)

In [29]:
df_results[m_val_best]


Out[29]:
hyperparameters hyperparameters_str test_explained_variance test_fold test_mean_absolute_error test_mean_squared_error test_median_absolute_error test_r2 validation_explained_variance validation_fold validation_mean_absolute_error validation_mean_squared_error validation_median_absolute_error validation_r2
50 {'xgb__learning_rate': 0.1, 'xgb__n_estimators... {"xgb__learning_rate": 0.1, "xgb__n_estimators... 0.874402 1 1.753658 7.543395 1.297095 0.872877 0.912643 0 1.822645 6.794249 1.214471 0.912614
51 {'xgb__learning_rate': 0.1, 'xgb__n_estimators... {"xgb__learning_rate": 0.1, "xgb__n_estimators... 0.865814 2 2.511070 12.671853 2.009729 0.864809 0.874405 1 1.779893 7.453450 1.030370 0.874393
52 {'xgb__learning_rate': 0.1, 'xgb__n_estimators... {"xgb__learning_rate": 0.1, "xgb__n_estimators... 0.917257 3 2.157178 8.057555 1.589156 0.917002 0.869113 2 2.693119 12.271260 2.454405 0.869083
53 {'xgb__learning_rate': 0.1, 'xgb__n_estimators... {"xgb__learning_rate": 0.1, "xgb__n_estimators... 0.796827 4 2.660182 20.315691 1.513907 0.796801 0.918609 3 2.091659 7.903636 1.434607 0.918587
54 {'xgb__learning_rate': 0.1, 'xgb__n_estimators... {"xgb__learning_rate": 0.1, "xgb__n_estimators... 0.870987 5 2.297164 14.350152 1.608324 0.854788 0.735580 4 2.908079 26.731201 1.444008 0.732632
55 {'xgb__learning_rate': 0.1, 'xgb__n_estimators... {"xgb__learning_rate": 0.1, "xgb__n_estimators... 0.920622 6 2.289658 7.925646 1.829349 0.919489 0.866730 5 2.308649 14.544421 1.472402 0.852823
56 {'xgb__learning_rate': 0.1, 'xgb__n_estimators... {"xgb__learning_rate": 0.1, "xgb__n_estimators... 0.837550 7 1.758184 9.445142 1.066853 0.837429 0.910525 6 2.302242 8.897191 1.844307 0.909620
57 {'xgb__learning_rate': 0.1, 'xgb__n_estimators... {"xgb__learning_rate": 0.1, "xgb__n_estimators... 0.952020 8 1.740860 4.409847 1.420570 0.951984 0.785040 7 1.876944 12.508143 1.066783 0.784708
58 {'xgb__learning_rate': 0.1, 'xgb__n_estimators... {"xgb__learning_rate": 0.1, "xgb__n_estimators... 0.926487 9 1.618849 4.282926 1.297615 0.926470 0.953309 8 1.675114 4.291643 1.459321 0.953271
59 {'xgb__learning_rate': 0.1, 'xgb__n_estimators... {"xgb__learning_rate": 0.1, "xgb__n_estimators... 0.916742 0 1.847324 6.488884 1.350085 0.916541 0.921541 9 1.653786 4.585898 1.339002 0.921269

In [30]:
# go back and select the predictions for the best hyperparameters
best_res = [
    res for res in all_res
        if res.hyperparameters_str == val_best
]

In [ ]:
len(best_res)

In [31]:
best_res[0].predictions_test


Out[31]:
array([20.654224, 18.821327, 21.639122, 16.754532, 15.321813, 20.07061 ,
       16.127214, 20.023327, 22.61253 , 41.999878, 23.750202, 17.175694,
       14.028635, 13.158178, 18.151676, 17.758194, 21.566748, 27.407398,
       38.46888 , 25.53746 , 24.041073, 21.12991 , 32.521378, 25.545212,
       35.316124, 23.70998 , 31.244513, 47.158398, 24.47955 , 23.735378,
       26.30453 , 18.260584, 20.323818, 20.774757, 23.592388, 13.737642,
       11.497095,  8.396353, 11.563587, 13.739315, 14.433248, 16.893173,
        9.005595, 18.918177, 16.17804 , 19.562008, 20.8405  , 19.49091 ,
       25.620035, 19.670761, 23.72944 ], dtype=float32)

In [32]:
best_res[0].true_test


Out[32]:
array([15. , 18.2, 23.1, 19.6, 14.8, 20. , 14.4, 19.7, 22.9, 43.8, 22.8,
       17.4, 15.6, 13.8, 15.6, 13.1, 17.4, 29.6, 33.3, 22.6, 24.4, 21.5,
       31.5, 25.1, 31.5, 23.7, 32. , 45.4, 22.1, 23.1, 24.5, 18.2, 21.7,
       22.6, 25. , 13.3, 10.2,  7.4, 11.5, 12.5, 27.5, 15. ,  7. , 20.8,
       16.4, 19.5, 19.9, 19.9, 29.8, 17.5, 23.9])
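
Because each record appears exactly once in the test role for the selected hyperparameters, the per-fold test predictions can be pooled and scored together. A minimal sketch, assuming the result attributes shown above:

In [ ]:
y_pred_pooled = np.concatenate([r.predictions_test for r in best_res])
y_true_pooled = np.concatenate([r.true_test for r in best_res])

# overall test-set metrics for the selected hyperparameters
ml_utils.collect_regression_metrics(y_true_pooled, y_pred_pooled)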

In [ ]:
hp_groups['test_mean_absolute_error'].mean()

In [ ]:
df_results.columns

Walk through a single cross-validation step

The remaining cells illustrate, step by step, the operations that ml_utils.evaluate_hyperparameters performs internally for a single (hyperparameters, validation fold) pair.

In [ ]:
###
# Extract masks for the training, validation, and testing
#    sets.
###
# reuse the fold indices from the single-evaluation example above
validation_set = validation_fold
test_set = test_fold

splits = ml_utils.get_train_val_test_splits(
    df,
    validation_splits=validation_set,
    test_splits=test_set,
    split_field='fold'
)

###
# Create the data matrices necessary for the various
#    sklearn operations we will perform later.
###

# we do not want to use metadata fields
fields_to_ignore = [
    'fold'
]

fold_data = ml_utils.get_fold_data(
    df,
    target_field='target',
    m_train=splits.training,
    m_test=splits.test,
    m_validation=splits.validation,
    fields_to_ignore=fields_to_ignore
)

In [ ]:
###
# Based on our estimator pipeline template and the chosen
# hyperparameters, create a concrete estimator with those
# hyperparameters set.
###
estimator = sklearn.clone(estimator_template)
estimator = estimator.set_params(**hyperparameters)

In [ ]:
###
# Transform the target variable by taking the log(1+y).
#
# This operation may not help in all domains.
###
y_train = np.log1p(fold_data.y_train)

In [ ]:
###
# Fit the estimator on the training set
###
estimator_fit = estimator.fit(fold_data.X_train, y_train)

In [ ]:
###
# Use the fit estimator to make predictions on *both* the
# validation and testing set. We will use both of these later.
###
y_pred = estimator_fit.predict(fold_data.X_test)
y_val = estimator_fit.predict(fold_data.X_validation)

# make sure to transform the predictions back to the original
# scale using exp(y) - 1
y_pred = np.expm1(y_pred)
y_val = np.expm1(y_val)

In [ ]:


In [ ]:
###
# Collect various evaluation metrics of the trained model
# on the validation and testing data.
###
metrics_val = ml_utils.collect_regression_metrics(fold_data.y_validation, y_val, prefix="val_")
metrics_test = ml_utils.collect_regression_metrics(fold_data.y_test, y_pred, prefix="test_")

In [ ]:
metrics_test

In [ ]:
metrics_val

In [ ]:
###
# Construct a summary of the hyperparameters, the validation and
# testing sets, and the performance of the trained model on those
# sets. We will use the performance on the validation set across
# all folds in order to select the optimal hyperparameters for
# evaluation on the test set.
###

ret = {
    'val_set': validation_set,
    'test_set': test_set,
    'hyperparameters': str(hyperparameters),
}

ret.update(metrics_test)
ret.update(metrics_val)
