About

This notebook demonstrates several additional tools provided by the Reproducible experiment platform (REP) package for optimizing classification models:

  • grid search for the best classifier hyperparameters

  • different optimization algorithms

  • different scoring models (optimization of an arbitrary figure of merit)


In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

Loading data


In [2]:
!cd toy_datasets; wget -O magic04.data -nc https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data


File `magic04.data' already there; not retrieving.

In [3]:
import numpy, pandas
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score

columns = ['fLength', 'fWidth', 'fSize', 'fConc', 'fConc1', 'fAsym', 'fM3Long', 'fM3Trans', 'fAlpha', 'fDist', 'g']
data = pandas.read_csv('toy_datasets/magic04.data', names=columns)
labels = numpy.array(data['g'] == 'g', dtype=int)
data = data.drop('g', axis=1)

train_data, test_data, train_labels, test_labels = train_test_split(data, labels)

list(data.columns)


Out[3]:
['fLength',
 'fWidth',
 'fSize',
 'fConc',
 'fConc1',
 'fAsym',
 'fM3Long',
 'fM3Trans',
 'fAlpha',
 'fDist']

Variables used in training


In [4]:
features = list(set(columns) - {'g'})

Metric definition

In the Higgs challenge the aim is to maximize the AMS metric.
To measure the quality, one should choose not only a classifier but also the optimal threshold at which the maximal value of AMS is reached.
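
For reference, the AMS (approximate median significance) used in the Higgs Boson Machine Learning Challenge is

$$\mathrm{AMS} = \sqrt{2\left((s + b + b_r)\,\ln\left(1 + \frac{s}{b + b_r}\right) - s\right)}, \qquad b_r = 10,$$

where $s$ and $b$ are the expected numbers of signal and background events passing the selection; the cells below use a simplified variant of the form $S / \sqrt{B + 10}$.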

Such metrics (which require a threshold) are called threshold-based.

rep.report.metrics contains the class OptimalMetric, which computes the maximal value of a threshold-based metric (and may itself be used as a metric).

We use this class to construct the metric and then pass it to the grid search.

Prepare quality metric

First we define the AMS function; metrics.OptimalMetric then turns it into a threshold-optimized metric.


In [5]:
from rep.report import metrics

In [6]:
def AMS(s, b, s_norm=sum(test_labels == 1), b_norm=6 * sum(test_labels == 0)):
    # simplified AMS-like figure of merit: s and b are rescaled by the chosen
    # signal and background normalizations before computing the significance
    return s * s_norm / numpy.sqrt(b * b_norm + 10.)

optimal_AMS = metrics.OptimalMetric(AMS)
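
As a rough illustration of how a threshold-based optimal metric works, the conceptual sketch below (not REP's implementation; the helper best_ams_threshold is hypothetical) scans every candidate threshold, computes the signal and background efficiencies above it, and keeps the best AMS value. It assumes that s and b passed to AMS are efficiencies, consistent with the normalizations used above.

def best_ams_threshold(labels, signal_proba):
    # scan every candidate threshold on the signal probability,
    # evaluate AMS at each one and keep the best value
    thresholds = numpy.sort(numpy.unique(signal_proba))
    qualities = []
    for cut in thresholds:
        passed = signal_proba >= cut
        s_eff = numpy.mean(passed[labels == 1])  # signal efficiency above the cut
        b_eff = numpy.mean(passed[labels == 0])  # background efficiency above the cut
        qualities.append(AMS(s_eff, b_eff))
    best = numpy.argmax(qualities)
    return thresholds[best], qualities[best]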

In [7]:
sum(test_labels == 1), sum(test_labels == 0)


Out[7]:
(3071, 1684)

Compute threshold vs metric quality

Random predictions for signal and background are used here for illustration.


In [8]:
probs_rand = numpy.ndarray((1000, 2))          # column 0: background proba, column 1: signal proba
probs_rand[:, 1] = numpy.random.random(1000)   # random signal probabilities
probs_rand[:, 0] = 1 - probs_rand[:, 1]
labels_rand = numpy.random.randint(0, high=2, size=1000)  # random 0/1 labels

optimal_AMS.plot_vs_cut(labels_rand, probs_rand)


Optimal cut=0.0096, quality=30.5758
Out[8]:
[plot: AMS vs. prediction threshold]
The best quality


In [9]:
optimal_AMS(labels_rand, probs_rand)


Out[9]:
30.575776003303893

Hyperparameters optimization algorithms

AbstractParameterGenerator is an abstract class that generates new points at which the scorer function will be computed. It is used in grid search to obtain the next set of parameters for training the classifier.

Properties:

  • best_params_ - returns the best grid point

  • best_score_ - returns the best quality

  • print_results(self, reorder=True) - prints all evaluated points with their corresponding quality

The following algorithms inherit from AbstractParameterGenerator (a usage sketch follows this list):

  • RandomParameterOptimizer - generates random points in the parameter space

  • RegressionParameterOptimizer - generates the next point using a regression algorithm trained on the previous results

  • SubgridParameterOptimizer - uses subgrids when the grid is huge, plus an annealing-like technique (see the REP documentation for details)
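
Any of these generators can replace the RandomParameterOptimizer used below. A minimal sketch, with the constructor arguments assumed to mirror RandomParameterOptimizer (check the REP documentation for the exact signature):

# assumption: RegressionParameterOptimizer accepts the parameter grid as its first
# argument, like RandomParameterOptimizer; grid_param is the OrderedDict defined below
from rep.metaml.gridsearch import RegressionParameterOptimizer

generator = RegressionParameterOptimizer(grid_param)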

Grid search

GridOptimalSearchCV implements an optimal search over specified parameter values for an estimator. Its parameters are:

  • estimator - an object that implements the "fit" and "predict" methods

  • params_generator - the hyperparameter generator used by the grid search (an AbstractParameterGenerator)

  • scorer - an object whose call accepts the kwargs "base_estimator", "params", "X", "y", "sample_weight"

  • The important members are "fit" and "fit_best_estimator"


In [10]:
from rep.metaml import GridOptimalSearchCV
from rep.metaml.gridsearch import RandomParameterOptimizer, FoldingScorer
from rep.estimators import SklearnClassifier
from sklearn.ensemble import AdaBoostClassifier
from collections import OrderedDict

Grid search with folding scorer

FoldingScorer provides k-fold cross-validation on the training dataset:

  • folds - k, the number of folds (train on k-1 folds, test on the remaining fold)
  • fold_checks - the number of folds on which the model is tested
  • score_function - the function used to calculate the quality, with the interface "function(y_true, proba, sample_weight=None)" (see the example after this note)

NOTE: if fold_checks > 1, the quality is averaged over the checks.
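
For example, a custom score_function matching this interface could be a thin wrapper around ROC AUC (auc_score below is only an illustration; the grid search below uses optimal_AMS instead):

def auc_score(y_true, proba, sample_weight=None):
    # proba has one column per class; column 1 is the signal probability
    return roc_auc_score(y_true, proba[:, 1], sample_weight=sample_weight)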


In [11]:
# define grid parameters
grid_param = OrderedDict()
grid_param['n_estimators'] = [30, 50]
grid_param['learning_rate'] = [0.2, 0.1, 0.05]

# use random hyperparameter optimization algorithm 
generator = RandomParameterOptimizer(grid_param)
# define folding scorer
scorer = FoldingScorer(optimal_AMS, folds=6, fold_checks=4)

grid_sk = GridOptimalSearchCV(SklearnClassifier(AdaBoostClassifier(), features=features), generator, scorer)
grid_sk.fit(data, labels)


Out[11]:
<rep.metaml.gridsearch.GridOptimalSearchCV at 0x10c4e7f50>

In [12]:
grid_sk.generator.best_params_


Out[12]:
OrderedDict([('n_estimators', 50), ('learning_rate', 0.2)])

In [13]:
grid_sk.generator.print_results()


58.674:  n_estimators=50, learning_rate=0.2
56.484:  n_estimators=30, learning_rate=0.2
56.100:  n_estimators=50, learning_rate=0.1
53.704:  n_estimators=30, learning_rate=0.1
53.278:  n_estimators=50, learning_rate=0.05
53.091:  n_estimators=30, learning_rate=0.05

Grid search with user-defined scorer

You can easily define your own scorer with custom logic. The scorer just needs the following call signature:

  • scorer(base_estimator, params, X, y, sample_weight)

Below we define a scorer which trains the model on the whole training dataset and tests it on a pre-defined test dataset.


In [14]:
from sklearn import clone
def generate_scorer(test, test_labels, test_weight=None):
    """ Generate scorer which calculate metric on fixed test dataset """
    def custom(base_estimator, params, X, y, sample_weight=None):
        cl = clone(base_estimator)
        cl.set_params(**params)
        cl.fit(X, y)
        res = optimal_AMS(test_labels, cl.predict_proba(test), test_weight)
        return res
    return custom

In [15]:
# define grid parameters
grid_param = OrderedDict()
grid_param['n_estimators'] = [30, 50]
grid_param['learning_rate'] = [0.2, 0.1, 0.05]
grid_param['features'] = [features[:5], features[:8]]

# define random hyperparameter optimization algorithm 
generator = RandomParameterOptimizer(grid_param)
# define specific scorer
scorer = generate_scorer(test_data, test_labels)

grid = GridOptimalSearchCV(SklearnClassifier(clf=AdaBoostClassifier(), features=features), generator, scorer)
grid.fit(train_data, train_labels)


Out[15]:
<rep.metaml.gridsearch.GridOptimalSearchCV at 0x111badf90>

In [16]:
len(train_data), len(test_data)


Out[16]:
(14265, 4755)

In [17]:
grid.generator.print_results()


60.269:  n_estimators=50, learning_rate=0.2, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
57.120:  n_estimators=30, learning_rate=0.2, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
56.238:  n_estimators=50, learning_rate=0.1, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
55.923:  n_estimators=50, learning_rate=0.05, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
55.677:  n_estimators=30, learning_rate=0.1, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
55.221:  n_estimators=30, learning_rate=0.05, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long', 'fSize', 'fConc1', 'fAlpha']
37.438:  n_estimators=30, learning_rate=0.2, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long']
37.281:  n_estimators=50, learning_rate=0.05, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long']
37.218:  n_estimators=50, learning_rate=0.1, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long']
37.063:  n_estimators=30, learning_rate=0.05, features=['fLength', 'fDist', 'fConc', 'fWidth', 'fM3Long']

Results comparison


In [18]:
from rep.report import ClassificationReport
from rep.data.storage import LabeledDataStorage

lds = LabeledDataStorage(test_data, test_labels)
classifiers = {'grid_fold': grid_sk.fit_best_estimator(train_data[features], train_labels),
               'grid_test_dataset': grid.fit_best_estimator(train_data[features], train_labels) }

report = ClassificationReport(classifiers, lds)

ROCs


In [19]:
report.roc().plot()


Metric


In [20]:
report.metrics_vs_cut(AMS, metric_label='AMS').plot()