About

This notebook demonstrates the classifiers provided by the Reproducible Experiment Platform (REP) package.
REP provides wrappers for the following libraries:

  • scikit-learn
  • TMVA
  • XGBoost
  • estimators from hep_ml
  • theanets
  • PyBrain
  • Neurolab

(and any sklearn-compatible classifiers may be used).

Neural network libraries are introduced in a different notebook.

In this notebook we show the simplest way to

  • train a classifier
  • build predictions
  • measure quality
  • combine classifiers with meta-algorithms

Loading data

Download the MiniBooNE particle identification dataset from the UCI repository.


In [1]:
!cd toy_datasets; wget -O MiniBooNE_PID.txt -nc --no-check-certificate https://archive.ics.uci.edu/ml/machine-learning-databases/00199/MiniBooNE_PID.txt


File `MiniBooNE_PID.txt' already there; not retrieving.

In [2]:
import numpy, pandas
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score

# the first line of the file holds the numbers of signal and background events;
# the remaining lines hold the feature values, with signal events listed first
data = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep='\s*', skiprows=[0], header=None, engine='python')
labels = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=' ', nrows=1, header=None)
labels = [1] * labels[1].values[0] + [0] * labels[2].values[0]
data.columns = ['feature_{}'.format(key) for key in data.columns]

First rows of our data


In [3]:
data[:5]


Out[3]:
feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 feature_9 ... feature_40 feature_41 feature_42 feature_43 feature_44 feature_45 feature_46 feature_47 feature_48 feature_49
0 2.59413 0.468803 20.6916 0.322648 0.009682 0.374393 0.803479 0.896592 3.59665 0.249282 ... 101.174 -31.3730 0.442259 5.86453 0.000000 0.090519 0.176909 0.457585 0.071769 0.245996
1 3.86388 0.645781 18.1375 0.233529 0.030733 0.361239 1.069740 0.878714 3.59243 0.200793 ... 186.516 45.9597 -0.478507 6.11126 0.001182 0.091800 -0.465572 0.935523 0.333613 0.230621
2 3.38584 1.197140 36.0807 0.200866 0.017341 0.260841 1.108950 0.884405 3.43159 0.177167 ... 129.931 -11.5608 -0.297008 8.27204 0.003854 0.141721 -0.210559 1.013450 0.255512 0.180901
3 4.28524 0.510155 674.2010 0.281923 0.009174 0.000000 0.998822 0.823390 3.16382 0.171678 ... 163.978 -18.4586 0.453886 2.48112 0.000000 0.180938 0.407968 4.341270 0.473081 0.258990
4 5.93662 0.832993 59.8796 0.232853 0.025066 0.233556 1.370040 0.787424 3.66546 0.174862 ... 229.555 42.9600 -0.975752 2.66109 0.000000 0.170836 -0.814403 4.679490 1.924990 0.253893

5 rows × 50 columns

Splitting into train and test


In [4]:
# Get train and test data
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.25)

Classifiers

All classifiers inherit from sklearn.BaseEstimator and have the following methods:

  • classifier.fit(X, y, sample_weight=None) - train classifier

  • classifier.predict_proba(X) - return probabilities vector for all classes

  • classifier.predict(X) - return predicted labels

  • classifier.staged_predict_proba(X) - return probabilities after each iteration (not supported by TMVA)

  • classifier.get_feature_importances() - return the feature importances (not supported by TMVA)

Here X denotes a matrix with data, of shape [n_samples, n_features]; y is a vector of labels (0 or 1) of shape [n_samples];
sample_weight is a vector of per-sample weights.

Difference from default scikit-learn interface

X should be* a pandas.DataFrame, not a numpy.array.
Provided this, you can choose the features used in training by setting e.g. features=['FlightTime', 'p'] in the constructor.

* it works fine with a numpy.array as well, but in that case all the features will be used.
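
For example, here is a minimal sketch (not part of the original notebook) that restricts training to two columns of the DataFrame created above; the column names are just the feature_* names of this dataset:

from rep.estimators import SklearnClassifier
from sklearn.ensemble import GradientBoostingClassifier

# only the listed DataFrame columns are used in training; all other columns are ignored
clf = SklearnClassifier(GradientBoostingClassifier(), features=['feature_0', 'feature_1'])
clf.fit(train_data, train_labels)    # train_data is a pandas.DataFrame with 50 columns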

Variables used in training


In [5]:
variables = list(data.columns[:15])

Sklearn

SklearnClassifier is a wrapper for scikit-learn classifiers. In this example we use GradientBoostingClassifier with default settings.


In [6]:
from rep.estimators import SklearnClassifier
from sklearn.ensemble import GradientBoostingClassifier
# Using gradient boosting with default settings
sk = SklearnClassifier(GradientBoostingClassifier(), features=variables)
# Training classifier
sk.fit(train_data, train_labels)
print('training complete')


training complete

Predicting probabilities, measuring the quality


In [7]:
# predict probabilities for each class
prob = sk.predict_proba(test_data)
print prob


[[ 0.99242263  0.00757737]
 [ 0.07570713  0.92429287]
 [ 0.9342327   0.0657673 ]
 ..., 
 [ 0.95540457  0.04459543]
 [ 0.16007055  0.83992945]
 [ 0.99436947  0.00563053]]

In [8]:
print 'ROC AUC', roc_auc_score(test_labels, prob[:, 1])


ROC AUC 0.969997858413

Predictions of classes


In [9]:
sk.predict(test_data)


Out[9]:
array([0, 1, 0, ..., 0, 1, 0])

In [10]:
sk.get_feature_importances()


Out[10]:
effect
feature_0 0.201533
feature_1 0.158487
feature_2 0.098564
feature_3 0.076836
feature_4 0.030222
feature_5 0.031973
feature_6 0.032771
feature_7 0.029889
feature_8 0.050288
feature_9 0.021392
feature_10 0.023092
feature_11 0.050503
feature_12 0.125505
feature_13 0.052451
feature_14 0.016493
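
Since gradient boosting is built iteratively, staged_predict_proba (listed in the interface above) lets us track quality after each boosting iteration. A minimal sketch using the classifier trained above:

# ROC AUC after every 20th boosting iteration (not supported by the TMVA wrapper)
for stage, stage_prob in enumerate(sk.staged_predict_proba(test_data)):
    if stage % 20 == 0:
        print stage, roc_auc_score(test_labels, stage_prob[:, 1])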

TMVA


In [11]:
from rep.estimators import TMVAClassifier
print TMVAClassifier.__doc__


    TMVAClassifier wraps classifiers from TMVA (CERN library for machine learning)

    Parameters:
    -----------
    :param str method: algorithm method (default='kBDT')
    :param features: features used in training
    :type features: list[str] or None
    :param str factory_options: options, for example::

        "!V:!Silent:Color:Transformations=I;D;P;G,D"

    :param str sigmoid_function: function which is used to convert TMVA output to probabilities;

        * *identity* (use for svm, mlp) --- the same output, use this for methods returning class probabilities

        * *sigmoid* --- sigmoid transformation, use it if output varies in range [-infinity, +infinity]

        * *bdt* (for bdt algorithms output varies in range [-1, 1])

        * *sig_eff=0.4* --- for rectangular cut optimization methods,
        for instance, here 0.4 will be used as signal efficiency to evaluate MVA,
        (put any float number from [0, 1])

    :param dict method_parameters: estimator options, example: NTrees=100, BoostType='Grad'

    .. warning::
        TMVA doesn't support *staged_predict_proba()* and *feature_importances__*

    .. warning::
        TMVA doesn't support multiclassification, only two-class classification

    `TMVA guide <http://mirror.yandex.ru/gentoo-distfiles/distfiles/TMVAUsersGuide-v4.03.pdf>`_
    

In [12]:
tmva = TMVAClassifier(method='kBDT', NTrees=50, Shrinkage=0.05, features=variables)
tmva.fit(train_data, train_labels)
print('training complete')


training complete

Predict probabilities and estimate quality


In [13]:
# predict probabilities for each class
prob = tmva.predict_proba(test_data)
print prob


[[ 0.88198512  0.11801488]
 [ 0.22006905  0.77993095]
 [ 0.67851749  0.32148251]
 ..., 
 [ 0.66461504  0.33538496]
 [ 0.42547775  0.57452225]
 [ 0.92236464  0.07763536]]

In [14]:
print 'ROC AUC', roc_auc_score(test_labels, prob[:, 1])


ROC AUC 0.956336458008

In [15]:
# predict labels
tmva.predict(test_data)


Out[15]:
array([0, 1, 0, ..., 0, 1, 0])
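
The TMVA output is converted to probabilities according to the sigmoid_function parameter described in the docstring above; for BDT output, which varies in [-1, 1], the docstring points to the bdt transformation. A minimal sketch of setting it explicitly (the parameter value is taken from the docstring, not verified here):

# kBDT output lies in [-1, 1], so convert it to probabilities with the 'bdt' transformation
tmva_bdt = TMVAClassifier(method='kBDT', sigmoid_function='bdt',
                          NTrees=50, Shrinkage=0.05, features=variables)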

XGBoost


In [16]:
from rep.estimators import XGBoostClassifier
print XGBoostClassifier.__doc__


Implements classification (and multiclassification) from XGBoost library. 
    Base class for XGBoostClassifier and XGBoostRegressor. XGBoost tree booster is used.

    Parameters:
    -----------
    :param int n_estimators: the number of trees built.
    :param int nthreads: number of parallel threads used to run xgboost.
    :param num_feature: feature dimension used in boosting, set to maximum dimension of the feature
        (set automatically by xgboost, no need to be set by user).
    :type num_feature: None or int
    :param float gamma: minimum loss reduction required to make a further partition on a leaf node of the tree.
        The larger, the more conservative the algorithm will be.
    :type gamma: None or float
    :param float eta: step size shrinkage used in update to prevent overfitting.
        After each boosting step, we can directly get the weights of new features
        and eta actually shrinkage the feature weights to make the boosting process more conservative.
    :param int max_depth: maximum depth of a tree.
    :param float scale_pos_weight: ration of weights of the class 1 to the weights of the class 0.
    :param float min_child_weight: minimum sum of instance weight(hessian) needed in a child.
        If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight,
        then the building process will give up further partitioning.

        .. note:: weights are normalized so that mean=1 before fitting. Roughly min_child_weight is equal to the number of events.
    :param float subsample: subsample ratio of the training instance.
        Setting it to 0.5 means that XGBoost randomly collected half of the data instances to grow trees
        and this will prevent overfitting.
    :param float colsample: subsample ratio of columns when constructing each tree.
    :param float base_score: the initial prediction score of all instances, global bias.
    :param int random_state: random number seed.
    :param boot verbose: if 1, will print messages during training
    :param float missing: the number considered by xgboost as missing value.
    

In [17]:
# XGBoost with default parameters
xgb = XGBoostClassifier(features=variables)
xgb.fit(train_data, train_labels)
print('training complete')


training complete

Predict probabilities and estimate quality


In [18]:
prob = xgb.predict_proba(test_data)
print 'ROC AUC:', roc_auc_score(test_labels, prob[:, 1])


ROC AUC: 0.975258121592

Predict labels


In [19]:
xgb.predict(test_data)


Out[19]:
array([0, 1, 0, ..., 0, 1, 0])

In [20]:
xgb.get_feature_importances()


Out[20]:
effect
feature_0 656
feature_1 800
feature_2 740
feature_3 608
feature_4 378
feature_5 480
feature_6 462
feature_7 522
feature_8 638
feature_9 556
feature_10 540
feature_11 598
feature_12 772
feature_13 592
feature_14 486
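
XGBoost was trained with default settings above; the docstring parameters map directly to constructor keywords when you want to tune it. A minimal sketch (the values are illustrative, not tuned, and no output is shown since this cell is not part of the original notebook):

# illustrative, untuned values for the parameters described in the docstring
xgb_tuned = XGBoostClassifier(n_estimators=200, eta=0.1, max_depth=6,
                              subsample=0.7, colsample=0.7, features=variables)
xgb_tuned.fit(train_data, train_labels)
print 'ROC AUC:', roc_auc_score(test_labels, xgb_tuned.predict_proba(test_data)[:, 1])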

Advantages of common interface

As one can see above, all the classifiers implement the same interface. This simplifies the work and makes it easy to compare different classifiers, but it is not the only benefit.

Scikit-learn provides various tools for combining classifiers and transformers. One of these tools is AdaBoost, a meta-classifier built on top of some other classifier (usually a decision tree). Bagging is another frequently used ensembling meta-algorithm (a bagging sketch is shown after the AdaBoost examples below).

Let's show that you can now run AdaBoost over classifiers from other libraries!
(isn't boosting over a neural network what you have been dreaming of all your life?)

AdaBoost over XGBoost


In [21]:
from sklearn.ensemble import AdaBoostClassifier

In [22]:
%%time
base_xgb = XGBoostClassifier(n_estimators=20)
ada_xgb = SklearnClassifier(AdaBoostClassifier(base_estimator=base_xgb, n_estimators=5))
ada_xgb.fit(train_data[variables], train_labels)
print('training complete!')

# predict probabilities for each class
prob = ada_xgb.predict_proba(test_data[variables])
print 'AUC', roc_auc_score(test_labels, prob[:, 1])

# predict probabilities for each class
prob = ada_xgb.predict_proba(train_data[variables])
print 'AUC', roc_auc_score(train_labels, prob[:, 1])


training complete!
AUC 0.975709190087
AUC 0.998466758443
CPU times: user 34 s, sys: 266 ms, total: 34.3 s
Wall time: 34.4 s

AdaBoost over TMVA classifier

The following code shows that you can do the same with, e.g., TMVA; uncomment it to try.


In [23]:
# base_tmva = TMVAClassifier(method='kBDT', NTrees=20)
# ada_tmva  = SklearnClassifier(AdaBoostClassifier(base_estimator=base_tmva, n_estimators=5), features=variables)
# ada_tmva.fit(train_data, train_labels)
# print('training complete')

# prob = ada_tmva.predict_proba(test_data)
# print 'AUC', roc_auc_score(test_labels, prob[:, 1])
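
Bagging, mentioned above, works the same way; a minimal sketch over the XGBoost wrapper (not executed in the original notebook, so no output is shown):

from sklearn.ensemble import BaggingClassifier

# each base XGBoost sees a bootstrap sample of the training set
bag_xgb = SklearnClassifier(BaggingClassifier(base_estimator=XGBoostClassifier(n_estimators=20),
                                              n_estimators=5),
                            features=variables)
bag_xgb.fit(train_data, train_labels)
print 'AUC', roc_auc_score(test_labels, bag_xgb.predict_proba(test_data)[:, 1])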

Other advantages of common interface

There are many things you can do with classifiers now:

  • cloning
  • getting / setting parameters as dictionaries
  • automatic hyperparameter optimization
  • building pipelines (sklearn.pipeline)
  • hierarchical training, training on subsets
  • passing classifiers over the network / training them on other machines

And you can replace classifiers at any moment. Some of these operations are sketched below.
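
For instance, cloning, parameter access and grid search work exactly as for native sklearn estimators. A minimal sketch (the grid values are illustrative; sklearn.grid_search moved to sklearn.model_selection in newer sklearn versions):

from sklearn.base import clone
from sklearn.grid_search import GridSearchCV

# an unfitted copy with identical parameters, and the parameters as a dictionary
xgb_clone = clone(xgb)
print xgb_clone.get_params()
xgb_clone.set_params(n_estimators=150)

# automatic hyperparameter optimization over the wrapped classifier
grid = GridSearchCV(XGBoostClassifier(features=variables),
                    param_grid={'eta': [0.05, 0.1], 'max_depth': [3, 6]},
                    scoring='roc_auc', cv=3)
grid.fit(train_data, numpy.array(train_labels))
print grid.best_params_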

Exercises

Exercise 1. Play with the parameters of each type of classifier.

Exercise 2. Add a weight column and train the models with weights (a hint follows below).
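
Hint for Exercise 2: every wrapper's fit accepts sample_weight, as described in the interface section above. A minimal sketch with unit weights as a starting point:

# start from unit weights and modify them, e.g. to emphasise one of the classes
weights = numpy.ones(len(train_labels))
sk_weighted = SklearnClassifier(GradientBoostingClassifier(), features=variables)
sk_weighted.fit(train_data, train_labels, sample_weight=weights)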