About

This notebook demonstrates several additional tools provided by the Reproducible experiment platform (REP) package for optimizing classification models:

  • grid search for the best classifier hyperparameters

  • different optimization algorithms

  • different scoring models (optimization of an arbitrary figure of merit)


In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

Loading data


In [2]:
!cd toy_datasets; wget -O magic04.data -nc https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data


File `magic04.data' already there; not retrieving.

In [3]:
import numpy, pandas
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score

columns = ['fLength', 'fWidth', 'fSize', 'fConc', 'fConc1', 'fAsym', 'fM3Long', 'fM3Trans', 'fAlpha', 'fDist', 'g']
data = pandas.read_csv('toy_datasets/magic04.data', names=columns)
labels = numpy.array(data['g'] == 'g', dtype=int)
data = data.drop('g', axis=1)

train_data, test_data, train_labels, test_labels = train_test_split(data, labels)

list(data.columns)


Out[3]:
['fLength',
 'fWidth',
 'fSize',
 'fConc',
 'fConc1',
 'fAsym',
 'fM3Long',
 'fM3Trans',
 'fAlpha',
 'fDist']

Variables used in training


In [4]:
features = list(set(columns) - {'g'})

Metric definition

In the Higgs challenge the aim is to maximize the AMS metric.
To measure the quality, one should choose not only a classifier but also the optimal threshold at which the maximal value of AMS is reached.
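
For reference, the AMS (approximate median significance) used in the Higgs Boson Machine Learning Challenge is

$$\mathrm{AMS} = \sqrt{2\left((s + b + b_r)\,\ln\left(1 + \frac{s}{b + b_r}\right) - s\right)}, \qquad b_r = 10,$$

where $s$ and $b$ are the expected numbers of signal and background events passing the selection; the cells below use a simplified variant of the form $S / \sqrt{B + 10}$.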

Such metrics (which require a threshold) are called threshold-based.

rep.report.metrics contains the class OptimalMetric, which computes the maximal value of a threshold-based metric (and may itself be used as a metric).

We use this class to construct the metric and then pass it to the grid search.

Prepare quality metric

First we define the AMS function; metrics.OptimalMetric then turns it into a threshold-optimized metric.


In [5]:
from rep.report import metrics

In [6]:
def AMS(s, b, s_norm=sum(test_labels == 1), b_norm=6 * sum(test_labels == 0)):
    # simplified AMS-like figure of merit: s and b are rescaled by the chosen
    # signal and background normalizations before computing the significance
    return s * s_norm / numpy.sqrt(b * b_norm + 10.)

optimal_AMS = metrics.OptimalMetric(AMS)
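
As a rough illustration of how a threshold-based optimal metric works, the conceptual sketch below (not REP's implementation; the helper best_ams_threshold is hypothetical) scans every candidate threshold, computes the signal and background efficiencies above it, and keeps the best AMS value. It assumes that s and b passed to AMS are efficiencies, consistent with the normalizations used above.

def best_ams_threshold(labels, signal_proba):
    # scan every candidate threshold on the signal probability,
    # evaluate AMS at each one and keep the best value
    thresholds = numpy.sort(numpy.unique(signal_proba))
    qualities = []
    for cut in thresholds:
        passed = signal_proba >= cut
        s_eff = numpy.mean(passed[labels == 1])  # signal efficiency above the cut
        b_eff = numpy.mean(passed[labels == 0])  # background efficiency above the cut
        qualities.append(AMS(s_eff, b_eff))
    best = numpy.argmax(qualities)
    return thresholds[best], qualities[best]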

In [7]:
sum(test_labels == 1), sum(test_labels == 0)


Out[7]:
(3071, 1684)

Compute threshold vs metric quality

Random predictions for signal and background are used here for illustration.


In [8]:
probs_rand = numpy.ndarray((1000, 2))          # column 0: background proba, column 1: signal proba
probs_rand[:, 1] = numpy.random.random(1000)   # random signal probabilities
probs_rand[:, 0] = 1 - probs_rand[:, 1]
labels_rand = numpy.random.randint(0, high=2, size=1000)  # random 0/1 labels

optimal_AMS.plot_vs_cut(labels_rand, probs_rand)


Optimal cut=0.0096, quality=30.5758
Out[8]:
[plot: AMS vs. prediction threshold]
The best quality


In [9]:
optimal_AMS(labels_rand, probs_rand)


Out[9]:
30.575776003303893

Hyperparameters optimization algorithms

AbstractParameterGenerator is an abstract class that generates new points at which the scorer function will be computed. It is used in grid search to obtain the next set of parameters for training the classifier.

Properties:

  • best_params_ - returns the best grid point

  • best_score_ - returns the best quality

  • print_results(self, reorder=True) - prints all evaluated points with their corresponding quality

The following algorithms inherit from AbstractParameterGenerator (a usage sketch follows this list):

  • RandomParameterOptimizer - generates random points in the parameter space

  • RegressionParameterOptimizer - generates the next point using a regression algorithm trained on the previous results

  • SubgridParameterOptimizer - uses subgrids when the grid is huge, plus an annealing-like technique (see the REP documentation for details)
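
Any of these generators can replace the RandomParameterOptimizer used below. A minimal sketch, with the constructor arguments assumed to mirror RandomParameterOptimizer (check the REP documentation for the exact signature):

# assumption: RegressionParameterOptimizer accepts the parameter grid as its first
# argument, like RandomParameterOptimizer; grid_param is the OrderedDict defined below
from rep.metaml.gridsearch import RegressionParameterOptimizer

generator = RegressionParameterOptimizer(grid_param)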

Grid search

GridOptimalSearchCV implements an optimal search over specified parameter values for an estimator. Its parameters are:

  • estimator - an object that implements the "fit" and "predict" methods

  • params_generator - the hyperparameter generator used by the grid search (an AbstractParameterGenerator)

  • scorer - an object whose call accepts the kwargs "base_estimator", "params", "X", "y", "sample_weight"

  • The important members are "fit" and "fit_best_estimator"


In [10]:
from rep.metaml import GridOptimalSearchCV
from rep.metaml.gridsearch import RandomParameterOptimizer, FoldingScorer
from rep.estimators import SklearnClassifier
from sklearn.ensemble import AdaBoostClassifier
from collections import OrderedDict

Grid search with folding scorer

FoldingScorer provides k-fold cross-validation on the training dataset:

  • folds - k, the number of folds (train on k-1 folds, test on the remaining fold)
  • fold_checks - the number of folds on which the model is tested
  • score_function - the function used to calculate the quality, with the interface "function(y_true, proba, sample_weight=None)" (see the example after this note)

NOTE: if fold_checks > 1, the quality is averaged over the checks.
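
For example, a custom score_function matching this interface could be a thin wrapper around ROC AUC (auc_score below is only an illustration; the grid search below uses optimal_AMS instead):

def auc_score(y_true, proba, sample_weight=None):
    # proba has one column per class; column 1 is the signal probability
    return roc_auc_score(y_true, proba[:, 1], sample_weight=sample_weight)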


In [11]:
# define grid parameters
grid_param = OrderedDict()
grid_param['n_estimators'] = [30, 50]
grid_param['learning_rate'] = [0.2, 0.1, 0.05]

# use random hyperparameter optimization algorithm 
generator = RandomParameterOptimizer(grid_param)
# define folding scorer
scorer = FoldingScorer(optimal_AMS, folds=6, fold_checks=4)

grid_sk = GridOptimalSearchCV(SklearnClassifier(AdaBoostClassifier(), features=features), generator, scorer)
grid_sk.fit(data, labels)


Out[11]:
<rep.metaml.gridsearch.GridOptimalSearchCV at 0x10c4e7f50>

In [12]:
grid_sk.generator.best_params_


Out[12]:
OrderedDict([('n_estimators', 50), ('learning_rate', 0.2)])

In [13]:
grid_sk.generator.print_results()


58.674:  n_estimators=50, learning_rate=0.2
56.484:  n_estimators=30, learning_rate=0.2
56.100:  n_estimators=50, learning_rate=0.1
53.704:  n_estimators=30, learning_rate=0.1
53.278:  n_estimators=50, learning_rate=0.05
53.091:  n_estimators=30, learning_rate=0.05

Grid search with user-defined scorer

You can easily define your own scorer with custom logic. The scorer just needs the following call signature:

  • scorer(base_estimator, params, X, y, sample_weight)

Below we define a scorer which trains the model on the whole training dataset and tests it on a pre-defined test dataset.


In [14]:
from sklearn import clone
def generate_scorer(test, test_labels, test_weight=None):
    """ Generate scorer which calculate metric on fixed test dataset """
    def custom(base_estimator, params, X, y, sample_weight=None):
        cl = clone(base_estimator)
        cl.set_params(**params)
        cl.fit(X, y)
        res = optimal_AMS(test_labels, cl.predict_proba(test), test_weight)
        return res
    return custom

In [15]:
# define grid parameters
grid_param = OrderedDict()
grid_param['n_estimators'] = [30, 50]
grid_param['learning_rate'] = [0.2, 0.1, 0.05]
grid_param['features'] = [features[:5], features[:8]]

# define random hyperparameter optimization algorithm 
generator = RandomParameterOptimizer(grid_param)
# define specific scorer
scorer = generate_scorer(test_data, test_labels)

grid = GridOptimalSearchCV(SklearnClassifier(clf=AdaBoostClassifier(), features=features), generator, scorer)
grid.fit(train_data, train_labels)


Out[15]:
<rep.metaml.gridsearch.GridOptimalSearchCV at 0x111badf90>

In [16]:
len(train_data), len(test_data)


Out[16]:
(14265, 4755)

In [17]:
grid.generator.print_results()


60.269:  n_estimators=50, learning_rate=0.2, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
57.120:  n_estimators=30, learning_rate=0.2, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
56.238:  n_estimators=50, learning_rate=0.1, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
55.923:  n_estimators=50, learning_rate=0.05, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
55.677:  n_estimators=30, learning_rate=0.1, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
55.221:  n_estimators=30, learning_rate=0.05, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
37.438:  n_estimators=30, learning_rate=0.2, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long']
37.281:  n_estimators=50, learning_rate=0.05, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long']
37.218:  n_estimators=50, learning_rate=0.1, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long']
37.063:  n_estimators=30, learning_rate=0.05, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long']

Results comparison


In [18]:
from rep.report import ClassificationReport
from rep.data.storage import LabeledDataStorage

lds = LabeledDataStorage(test_data, test_labels)
classifiers = {'grid_fold': grid_sk.fit_best_estimator(train_data[features], train_labels),
               'grid_test_dataset': grid.fit_best_estimator(train_data[features], train_labels) }

report = ClassificationReport(classifiers, lds)

ROCs


In [19]:
report.roc().plot()


Metric


In [20]:
report.metrics_vs_cut(AMS, metric_label='AMS').plot()