About

This notebook demonstrates the classifiers provided by the Reproducible Experiment Platform (REP) package.
REP contains wrappers for classifiers from the following libraries:

  • scikit-learn
  • TMVA
  • XGBoost
  • PyBrain

In this notebook we show the simplest way to

  • train a classifier
  • build predictions
  • measure quality

Loading data


In [1]:
import numpy, pandas
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score

sig_data = pandas.read_csv('toy_datasets/toyMC_sig_mass.csv', sep='\t')
bck_data = pandas.read_csv('toy_datasets/toyMC_bck_mass.csv', sep='\t')

labels = numpy.array([1] * len(sig_data) + [0] * len(bck_data))
data = pandas.concat([sig_data, bck_data])

First rows of our data


In [2]:
data[:5]


Out[2]:
CDF1 CDF2 CDF3 DOCAone DOCAthree DOCAtwo FlightDistance FlightDistanceError Hlt1Dec Hlt2Dec ... p1_IP p1_IPSig p1_Laura_IsoBDT p1_pt p2_IP p2_IPSig p2_Laura_IsoBDT p2_pt peakingbkg pt
0 1.000000 1.000000 1.000000 0.111337 0.012695 0.123426 162.650955 0.870942 0 0 ... 11.314665 83.196968 -0.223668 699.066467 9.799975 64.790207 -0.121159 521.628174 NaN 220.742111
1 0.759755 0.597375 0.389256 0.021781 0.094551 0.088421 4.193265 1.262280 0 0 ... 0.720070 7.237868 -0.256142 587.628935 0.882111 8.834325 -0.203220 532.679950 NaN 661.208843
2 1.000000 0.796142 0.566286 0.011852 0.004400 0.009153 1.580610 0.261697 0 0 ... 0.362181 4.173097 -0.252788 802.746495 0.427290 5.008959 -0.409469 674.122342 NaN 1290.963982
3 0.716397 0.524712 0.279033 0.015171 0.083900 0.069127 7.884569 1.310151 0 0 ... 0.753449 6.615949 -0.253550 564.203857 0.917409 8.695459 -0.192284 537.791687 NaN 692.654175
4 1.000000 0.996479 0.888159 0.005547 0.070438 0.064689 -2.267649 0.139555 0 0 ... 0.589455 21.869143 -0.254778 746.624928 0.388996 8.465344 -0.217319 988.539221 NaN 1328.837840

5 rows × 40 columns

Splitting into train and test


In [3]:
# Get train and test data
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.5)

Classifiers

All classifiers inherit from sklearn.BaseEstimator and have the following methods:

  • classifier.fit(X, y, sample_weight=None) - train classifier

  • classifier.predict_proba(X) - return probabilities vector for all classes

  • classifier.predict(X) - return predicted labels

  • classifier.staged_predict_proba(X) - return probabilities after each iteration (not supported by TMVA)

  • classifier.get_feature_importances() - return estimated feature importances

Here X denotes a matrix of data with shape [n_samples, n_features], y is a vector of labels (0 or 1) of shape [n_samples],
and sample_weight is a vector of per-sample weights.

Difference from default scikit-learn interface

X should be* a pandas.DataFrame, not a numpy.array.
In that case you can choose the features used in training by setting e.g. features=['FlightTime', 'p'] in the constructor.

* it works fine with a numpy.array as well, but then all the features will be used.
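
For illustration, here is a minimal sketch of both options, using the SklearnClassifier wrapper introduced below (column names are taken from the toy dataset loaded above; not run in this notebook):

from rep.estimators import SklearnClassifier
from sklearn.ensemble import GradientBoostingClassifier

# DataFrame + explicit feature list: only the named columns are used in training
clf = SklearnClassifier(GradientBoostingClassifier(), features=['FlightDistance', 'pt'])
clf.fit(data, labels)

# plain numpy array: every column of the passed array is treated as a feature
clf_all = SklearnClassifier(GradientBoostingClassifier())
clf_all.fit(data[['FlightDistance', 'pt']].values, labels)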

Variables used in training


In [4]:
variables = ["FlightDistance", "FlightDistanceError", "IP", "VertexChi2", "pt", "p0_pt", "p1_pt", "p2_pt", 'LifeTime','dira']

Sklearn

SklearnClassifier is a wrapper for scikit-learn classifiers. In this example we use GradientBoostingClassifier with default settings.


In [5]:
from rep.estimators import SklearnClassifier
from sklearn.ensemble import GradientBoostingClassifier
# Using gradient boosting with default settings
sk = SklearnClassifier(GradientBoostingClassifier(), features=variables)
# Training classifier
sk.fit(train_data, train_labels)
print('training complete')


training complete

Predicting probabilities, measuring the quality


In [6]:
# predict probabilities for each class
prob = sk.predict_proba(test_data)
print prob


[[ 0.01925103  0.98074897]
 [ 0.01577953  0.98422047]
 [ 0.01379252  0.98620748]
 ..., 
 [ 0.01624765  0.98375235]
 [ 0.01442193  0.98557807]
 [ 0.03089716  0.96910284]]

In [7]:
print 'ROC AUC', roc_auc_score(test_labels, prob[:, 1])


ROC AUC 0.908278717567
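
Since GradientBoostingClassifier supports staged predictions, the same quality can be tracked after every boosting iteration with staged_predict_proba (a minimal sketch, output not shown here):

# ROC AUC on the test set after each boosting iteration (printed every 20 iterations)
for stage, staged_prob in enumerate(sk.staged_predict_proba(test_data), 1):
    if stage % 20 == 0:
        print('iteration %d: ROC AUC %.4f' % (stage, roc_auc_score(test_labels, staged_prob[:, 1])))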

Predictions of classes


In [8]:
sk.predict(test_data)


Out[8]:
array([1, 1, 1, ..., 1, 1, 1])

In [9]:
sk.get_feature_importances()


Out[9]:
effect
FlightDistance 0.018544
FlightDistanceError 0.075000
IP 0.163375
VertexChi2 0.101842
pt 0.098344
p0_pt 0.061443
p1_pt 0.109733
p2_pt 0.091812
LifeTime 0.127267
dira 0.152638

TMVA

TMVAClassifier wraps classifiers from TMVA (the CERN library for machine learning).

Parameters:
-----------
:param str method: algorithm method (default='kBDT')
:param list(str) | None features: features used in training
:param str factory_options: default="!V:!Silent:Color:Transformations=I;D;P;G,D:AnalysisType=Classification"
:param kwargs: other parameters passed as key=value.

TMVA doesn't support staged_predict_proba


In [10]:
from rep.estimators.tmva import TMVAClassifier, TMVARegressor

In [11]:
from rep.estimators import TMVAClassifier

In [12]:
tmva = TMVAClassifier(method='kBDT', NTrees=50, Shrinkage=0.05, features=variables)
tmva.fit(train_data, train_labels)
print('training complete')


training complete

Predict probabilities and estimate quality


In [13]:
# predict probabilities for each class
prob = tmva.predict_proba(test_data)
print prob


[[ 0.2709105   0.7290895 ]
 [ 0.28991175  0.71008825]
 [ 0.17888081  0.82111919]
 ..., 
 [ 0.27305794  0.72694206]
 [ 0.22307611  0.77692389]
 [ 0.31567514  0.68432486]]

In [14]:
print 'AUC', roc_auc_score(test_labels, prob[:, 1])


AUC 0.901148306027

In [15]:
# predict labels
tmva.predict(test_data)


Out[15]:
array([1, 1, 1, ..., 1, 1, 1])

XGBoost

XGBoost is an open-source library for fast gradient boosting over decision trees.

Parameters:
-----------
:param list(str) features: list of features to train model

:param n_estimators: the number of boosting rounds.

:param nthreads: number of parallel threads used to run xgboost.

:param num_feature: feature dimension used in boosting, set to maximum dimension of the feature
(set automatically by xgboost, no need to be set by user).

:param gamma: minimum loss reduction required to make a further partition on a leaf node of the tree.
 The larger, the more conservative the algorithm will be.

:param eta: step size shrinkage used in updates to prevent overfitting.
After each boosting step we can directly get the weights of new features,
and eta shrinks the feature weights to make the boosting process more conservative.

:param max_depth: maximum depth of a tree.

:param scale_pos_weight: ratio of the weights of class 1 to the weights of class 0.

:param min_child_weight: minimum sum of instance weight (hessian) needed in a child.
If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight,
then the building process will give up further partitioning.
In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node.
The larger, the more conservative the algorithm will be.

:param subsample: subsample ratio of the training instances.
Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees,
    which helps prevent overfitting.

:param colsample: subsample ratio of columns when constructing each tree.

:param objective: specify the learning task and the corresponding learning objective, and the options are below:
"reg:linear" --linear regression
"reg:logistic" --logistic regression
"binary:logistic" --logistic regression for binary classification, output probability
"binary:logitraw" --logistic regression for binary classification, output score before logistic transformation
"multi:softmax" --set XGBoost to do multiclass classification using the softmax objective, you also need to
    set num_class(number of classes)
"multi:softprob" --same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata,
    nclass matrix. The result contains predicted probability of each data point belonging to each class.
"rank:pairwise" --set XGBoost to do ranking task by minimizing the pairwise loss

:param base_score: the initial prediction score of all instances, global bias.

:param random_state: random number seed.

:param verbose: if 1, will print messages during training

:param missing: the number considered by xgboost as missing value.
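
Any of these parameters can be passed to the constructor. Below is a sketch with a few commonly tuned values (parameter names as listed above); the classifier actually trained in this notebook uses the defaults:

from rep.estimators import XGBoostClassifier

# example of a non-default configuration (not used in the cells below)
xgb_tuned = XGBoostClassifier(features=variables, n_estimators=200, eta=0.1,
                              max_depth=4, subsample=0.7, nthreads=4)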

In [16]:
from rep.estimators import XGBoostClassifier
# XGBoost with default parameters
xgb = XGBoostClassifier(features=variables)
xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
print('training complete')


training complete
/Users/antares/code/xgboost/wrapper/xgboost.py:80: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  if label != None:
/Users/antares/code/xgboost/wrapper/xgboost.py:82: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  if weight !=None:

Predict probabilities and estimate quality


In [17]:
prob = xgb.predict_proba(test_data)
print 'ROC AUC:', roc_auc_score(test_labels, prob[:, 1])


ROC AUC: 0.922402718668

Predict labels


In [18]:
xgb.predict(test_data)


Out[18]:
array([1, 1, 1, ..., 1, 1, 1])

In [19]:
xgb.get_feature_importances()


Out[19]:
effect
FlightDistance 1012
FlightDistanceError 940
IP 934
VertexChi2 918
pt 998
p0_pt 866
p1_pt 1070
p2_pt 1040
LifeTime 414
dira 338

Advantages of common interface

As one can see above, all the classifiers implement the same interface. This simplifies work and makes it easy to compare different classifiers, but it is not the only benefit.

Scikit-learn provides various tools for combining classifiers and transformers. One of these tools is AdaBoost, an abstract meta-classifier built on top of some other classifier (usually a decision tree).

Let's show that you can now run AdaBoost over classifiers from other libraries!
(Isn't boosting over a neural network what you have been dreaming of all your life?)

AdaBoost over TMVA classifier


In [20]:
from sklearn.ensemble import AdaBoostClassifier

# Construct AdaBoost with TMVA as base estimator
base_tmva = TMVAClassifier(method='kBDT', NTrees=15, Shrinkage=0.05)
ada_tmva  = SklearnClassifier(AdaBoostClassifier(base_estimator=base_tmva, n_estimators=5), features=variables)
ada_tmva.fit(train_data, train_labels)
print('training complete')


training complete

In [21]:
prob = ada_tmva.predict_proba(test_data)
print 'AUC', roc_auc_score(test_labels, prob[:, 1])


AUC 0.885266565878

AdaBoost over XGBoost


In [22]:
# Construct AdaBoost with xgboost base estimator
base_xgb = XGBoostClassifier(n_estimators=50)
# ada_xgb = SklearnClassifier(AdaBoostClassifier(base_estimator=base_xgb, n_estimators=1), features=variables)
ada_xgb = AdaBoostClassifier(base_estimator=base_xgb, n_estimators=1)
ada_xgb.fit(train_data[variables], train_labels)
print('training complete!')


training complete!

In [23]:
# predict probabilities for each class
prob = ada_xgb.predict_proba(test_data[variables])
print 'AUC', roc_auc_score(test_labels, prob[:, 1])


AUC 0.917898066693

In [24]:
# predict probabilities for each class
prob = ada_xgb.predict_proba(train_data[variables])
print 'AUC', roc_auc_score(train_labels, prob[:, 1])


AUC 0.950702225833

PyBrain

PyBrainClassifier wraps classifiers from PyBrain. PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library.

Parameters:
-----------
:param features: features used in training.
:type features: list[str] or None
:param epochs: number of iterations of training; if < 0 then classifier trains until convergence.
:param scaler: scaler used to transform data; default is StandardScaler.
:type scaler: transformer from sklearn.preprocessing or None
:param boolean use_rprop: flag to indicate whether we should use rprop trainer.

net parameters:
:param list layers: how many neurons are in each hidden layer; default is one hidden layer with 10 neurons.
:param hiddenclass: classes of the hidden layers; default is 'SigmoidLayer'.
:type hiddenclass: list of str
:param params: other net parameters:
    :param boolean bias and outputbias: flags to indicate whether the network should have the corresponding biases;
     both default to True.
    :param boolean peepholes.
    :param boolean recurrent: if the `recurrent` flag is set, a :class:`RecurrentNetwork` will be created,
     otherwise a :class:`FeedForwardNetwork`.
:type params: dict

trainer parameters:
:param float learningrate: the rate at which parameters are changed in the direction of the gradient.
:param float lrdecay: the learning rate is multiplied by lrdecay after each training step.
:param float momentum: the ratio by which the gradient of the last timestep is used.
:param boolean verbose.
:param boolean batchlearning: if batchlearning is set, the parameters are updated only at the end of each epoch. Default is False.
:param float weightdecay: corresponds to the weightdecay rate, where 0 is no weight decay at all.

rprop parameters:
:param float etaminus: factor by which step width is decreased when overstepping (0.5).
:param float etaplus: factor by which step width is increased when following gradient (1.2).
:param float delta: step width for each weight.
:param float deltamin: minimum step width (1e-6).
:param float deltamax: maximum step width (5.0).
:param float delta0: initial step width (0.1).

trainUntilConvergence parameters:
:param int max_epochs: if given, at most that many epochs are trained.
:param int continue_epochs: each time validation error hits a minimum, try for continue_epochs epochs to find a better one.
:param float validation_proportion: the ratio of the dataset that is used for the validation dataset.
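
As with the other wrappers, these parameters are passed to the constructor. A sketch of a non-default configuration based on the parameters listed above (TanhLayer is a standard PyBrain layer class; this variant is not trained in this notebook):

from rep.estimators.pybrain import PyBrainClassifier

# rprop-trained network with two tanh hidden layers
nn_rprop = PyBrainClassifier(layers=[15, 10], hiddenclass=['TanhLayer', 'TanhLayer'],
                             epochs=20, use_rprop=True)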

Loading data


In [2]:
import pandas
import numpy
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score

sig_data = pandas.read_csv('toy_datasets/toyMC_sig_mass.csv', sep='\t')
bck_data = pandas.read_csv('toy_datasets/toyMC_bck_mass.csv', sep='\t')

labels = numpy.array([1] * len(sig_data) + [0] * len(bck_data))
data = pandas.concat([sig_data, bck_data])
variables = ["FlightDistance", "FlightDistanceError", "IP", "VertexChi2", "pt", "p0_pt", "p1_pt", "p2_pt", 'LifeTime','dira']

Splitting into train and test


In [3]:
X_train, X_test, y_train, y_test = train_test_split(data[variables].values, labels, train_size=0.5, random_state = 384)
print X_train.shape, X_test.shape, len(y_train), len(y_test)


(72449, 10) (72450, 10) 72449 72450

Now we create a classifier with two hidden layers of 10 and 20 neurons respectively


In [5]:
from rep.estimators.pybrain import PyBrainClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Binarizer

pb = PyBrainClassifier(layers=[10, 20], epochs=10, verbose=True, scaler=StandardScaler())

Training


In [7]:
%%time
pb.fit(X_train, y_train)


Total error:  0.040186239795
Total error:  0.0368244026758
Total error:  0.0364056124783
Total error:  0.0361849120992
Total error:  0.0359050794801
Total error:  0.0358441554205
Total error:  0.0357131273455
Total error:  0.0357788689297
Total error:  0.0357235802459
Total error:  0.035619950252
CPU times: user 10min 42s, sys: 2.26 s, total: 10min 44s
Wall time: 10min 44s
Out[7]:
PyBrainClassifier(batchlearning=False, delta0=0.1, deltamax=0.5,
         deltamin=1e-06, epochs=10, etaminus=0.5, etaplus=1.2,
         features=['Feature_0', 'Feature_1', 'Feature_2', 'Feature_3', 'Feature_4', 'Feature_5', 'Feature_6', 'Feature_7', 'Feature_8', 'Feature_9'],
         hiddenclass=['SigmoidLayer', 'SigmoidLayer'], layers=[10, 20],
         learningrate=0.01, lrdecay=1.0, momentum=0.0,
         scaler=StandardScaler(copy=True, with_mean=True, with_std=True),
         use_rprop=False, verbose=True, weightdecay=0.0)

Predicting probabilities


In [8]:
prob = pb.predict_proba(X_test)
prob


Out[8]:
array([[ 0.01786076,  0.98213924],
       [ 0.09110269,  0.90889731],
       [ 0.22386682,  0.77613318],
       ..., 
       [ 0.39654057,  0.60345943],
       [ 0.02806454,  0.97193546],
       [ 0.02478111,  0.97521889]])

Predicting labels


In [9]:
y_pred = pb.predict(X_test)

In [10]:
from sklearn.metrics import zero_one_loss
from sklearn.metrics import classification_report

print "Accuracy:", zero_one_loss(y_pred, y_test)
print "Classification report:"
print classification_report(y_pred, y_test)


Error rate: 0.0975569358178
Classification report:
             precision    recall  f1-score   support

          0       0.48      0.74      0.58      6611
          1       0.97      0.92      0.94     65839

avg / total       0.93      0.90      0.91     72450
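
For comparison with the classifiers above, ROC AUC can be computed from the predicted probabilities in the same way (output not shown):

print('ROC AUC %.4f' % roc_auc_score(y_test, prob[:, 1]))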

Other advantages of common interface

There are many things you can do with classifiers now:

  • cloning
  • getting / setting parameters as dictionaries
  • automatic hyperparameter optimization
  • building pipelines (sklearn.pipeline)
  • hierarchical training, training on subsets
  • passing over the internet / training classifiers on other machines

And you can replace classifiers at any moment.
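
For example, since the wrappers follow the sklearn.BaseEstimator protocol, cloning and parameter handling work out of the box (a minimal sketch using the xgb classifier trained above):

from sklearn.base import clone

params = xgb.get_params()               # parameters as a dictionary
xgb_copy = clone(xgb)                   # unfitted copy with the same parameters
xgb_copy.set_params(n_estimators=100)   # change a parameter before retraining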

Exercises

Exercise 1. Play with the parameters of each type of classifier

Exercise 2. Add a weight column and train the models with weights