About

This notebook demonstrates the classifiers provided by the Reproducible Experiment Platform (REP) package.
REP contains wrappers for classifiers from the following libraries:

  • scikit-learn
  • TMVA
  • XGBoost
  • PyBrain

In this notebook we show the simplest way to

  • train a classifier
  • build predictions
  • measure quality

Loading data


In [1]:
import numpy, pandas
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score

sig_data = pandas.read_csv('toy_datasets/toyMC_sig_mass.csv', sep='\t')
bck_data = pandas.read_csv('toy_datasets/toyMC_bck_mass.csv', sep='\t')

labels = numpy.array([1] * len(sig_data) + [0] * len(bck_data))
data = pandas.concat([sig_data, bck_data])

First rows of our data


In [2]:
data[:5]


Out[2]:
CDF1 CDF2 CDF3 DOCAone DOCAthree DOCAtwo FlightDistance FlightDistanceError Hlt1Dec Hlt2Dec ... p1_IP p1_IPSig p1_Laura_IsoBDT p1_pt p2_IP p2_IPSig p2_Laura_IsoBDT p2_pt peakingbkg pt
0 1.000000 1.000000 1.000000 0.111337 0.012695 0.123426 162.650955 0.870942 0 0 ... 11.314665 83.196968 -0.223668 699.066467 9.799975 64.790207 -0.121159 521.628174 NaN 220.742111
1 0.759755 0.597375 0.389256 0.021781 0.094551 0.088421 4.193265 1.262280 0 0 ... 0.720070 7.237868 -0.256142 587.628935 0.882111 8.834325 -0.203220 532.679950 NaN 661.208843
2 1.000000 0.796142 0.566286 0.011852 0.004400 0.009153 1.580610 0.261697 0 0 ... 0.362181 4.173097 -0.252788 802.746495 0.427290 5.008959 -0.409469 674.122342 NaN 1290.963982
3 0.716397 0.524712 0.279033 0.015171 0.083900 0.069127 7.884569 1.310151 0 0 ... 0.753449 6.615949 -0.253550 564.203857 0.917409 8.695459 -0.192284 537.791687 NaN 692.654175
4 1.000000 0.996479 0.888159 0.005547 0.070438 0.064689 -2.267649 0.139555 0 0 ... 0.589455 21.869143 -0.254778 746.624928 0.388996 8.465344 -0.217319 988.539221 NaN 1328.837840

5 rows × 40 columns

Splitting into train and test


In [3]:
# Get train and test data
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.5)

Classifiers

All classifiers inherit from sklearn.BaseEstimator and have the following methods:

  • classifier.fit(X, y, sample_weight=None) - train classifier

  • classifier.predict_proba(X) - return probabilities vector for all classes

  • classifier.predict(X) - return predicted labels

  • classifier.staged_predict_proba(X) - return probabilities after each iteration (not supported by TMVA)

  • classifier.get_feature_importances() - return estimated feature importances

Here X denotes a matrix of data with shape [n_samples, n_features], y is a vector of labels (0 or 1) of shape [n_samples],
and sample_weight is a vector of per-sample weights.

Difference from default scikit-learn interface

X should be* a pandas.DataFrame, not a numpy.array.
In that case you can choose the features used in training by setting e.g. features=['FlightTime', 'p'] in the constructor.

* it works fine with a numpy.array as well, but then all the features will be used.
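
For illustration, here is a minimal sketch of both options, using the SklearnClassifier wrapper introduced below (column names are taken from the toy dataset loaded above; not run in this notebook):

from rep.estimators import SklearnClassifier
from sklearn.ensemble import GradientBoostingClassifier

# DataFrame + explicit feature list: only the named columns are used in training
clf = SklearnClassifier(GradientBoostingClassifier(), features=['FlightDistance', 'pt'])
clf.fit(data, labels)

# plain numpy array: every column of the passed array is treated as a feature
clf_all = SklearnClassifier(GradientBoostingClassifier())
clf_all.fit(data[['FlightDistance', 'pt']].values, labels)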

Variables used in training


In [4]:
variables = ["FlightDistance", "FlightDistanceError", "IP", "VertexChi2", "pt", "p0_pt", "p1_pt", "p2_pt", 'LifeTime','dira']

Sklearn

SklearnClassifier is a wrapper for scikit-learn classifiers. In this example we use GradientBoostingClassifier with default settings.


In [5]:
from rep.estimators import SklearnClassifier
from sklearn.ensemble import GradientBoostingClassifier
# Using gradient boosting with default settings
sk = SklearnClassifier(GradientBoostingClassifier(), features=variables)
# Training classifier
sk.fit(train_data, train_labels)
print('training complete')


training complete

Predicting probabilities, measuring the quality


In [6]:
# predict probabilities for each class
prob = sk.predict_proba(test_data)
print prob


[[ 0.01925103  0.98074897]
 [ 0.01577953  0.98422047]
 [ 0.01379252  0.98620748]
 ..., 
 [ 0.01624765  0.98375235]
 [ 0.01442193  0.98557807]
 [ 0.03089716  0.96910284]]

In [7]:
print 'ROC AUC', roc_auc_score(test_labels, prob[:, 1])


ROC AUC 0.908278717567
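
Since GradientBoostingClassifier supports staged predictions, the same quality can be tracked after every boosting iteration with staged_predict_proba (a minimal sketch, output not shown here):

# ROC AUC on the test set after each boosting iteration (printed every 20 iterations)
for stage, staged_prob in enumerate(sk.staged_predict_proba(test_data), 1):
    if stage % 20 == 0:
        print('iteration %d: ROC AUC %.4f' % (stage, roc_auc_score(test_labels, staged_prob[:, 1])))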

Predictions of classes


In [8]:
sk.predict(test_data)


Out[8]:
array([1, 1, 1, ..., 1, 1, 1])

In [9]:
sk.get_feature_importances()


Out[9]:
effect
FlightDistance 0.018544
FlightDistanceError 0.075000
IP 0.163375
VertexChi2 0.101842
pt 0.098344
p0_pt 0.061443
p1_pt 0.109733
p2_pt 0.091812
LifeTime 0.127267
dira 0.152638

TMVA

TMVAClassifier wraps classifiers from TMVA (the CERN library for machine learning).

Parameters:
-----------
:param str method: algorithm method (default='kBDT')
:param list(str) | None features: features used in training
:param str factory_options: default="!V:!Silent:Color:Transformations=I;D;P;G,D:AnalysisType=Classification"
:param kwargs: other parameters passed as key=value.

TMVA doesn't support staged_predict_proba


In [10]:
from rep.estimators.tmva import TMVAClassifier, TMVARegressor

In [11]:
from rep.estimators import TMVAClassifier

In [12]:
tmva = TMVAClassifier(method='kBDT', NTrees=50, Shrinkage=0.05, features=variables)
tmva.fit(train_data, train_labels)
print('training complete')


training complete

Predict probabilities and estimate quality


In [13]:
# predict probabilities for each class
prob = tmva.predict_proba(test_data)
print prob


[[ 0.2709105   0.7290895 ]
 [ 0.28991175  0.71008825]
 [ 0.17888081  0.82111919]
 ..., 
 [ 0.27305794  0.72694206]
 [ 0.22307611  0.77692389]
 [ 0.31567514  0.68432486]]

In [14]:
print 'AUC', roc_auc_score(test_labels, prob[:, 1])


AUC 0.901148306027

In [15]:
# predict labels
tmva.predict(test_data)


Out[15]:
array([1, 1, 1, ..., 1, 1, 1])

XGBoost

XGBoost is an open-source library for fast gradient boosting over decision trees.

Parameters:
-----------
:param list(str) features: list of features to train model

:param n_estimators: the number of boosting rounds.

:param nthreads: number of parallel threads used to run xgboost.

:param num_feature: feature dimension used in boosting, set to maximum dimension of the feature
(set automatically by xgboost, no need to be set by user).

:param gamma: minimum loss reduction required to make a further partition on a leaf node of the tree.
 The larger, the more conservative the algorithm will be.

:param eta: step size shrinkage used in updates to prevent overfitting.
After each boosting step we can directly get the weights of new features,
and eta shrinks the feature weights to make the boosting process more conservative.

:param max_depth: maximum depth of a tree.

:param scale_pos_weight: ratio of the weights of class 1 to the weights of class 0.

:param min_child_weight: minimum sum of instance weight (hessian) needed in a child.
If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight,
then the building process will give up further partitioning.
In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node.
The larger, the more conservative the algorithm will be.

:param subsample: subsample ratio of the training instances.
Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees,
    which helps prevent overfitting.

:param colsample: subsample ratio of columns when constructing each tree.

:param objective: specify the learning task and the corresponding learning objective, and the options are below:
"reg:linear" --linear regression
"reg:logistic" --logistic regression
"binary:logistic" --logistic regression for binary classification, output probability
"binary:logitraw" --logistic regression for binary classification, output score before logistic transformation
"multi:softmax" --set XGBoost to do multiclass classification using the softmax objective, you also need to
    set num_class(number of classes)
"multi:softprob" --same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata,
    nclass matrix. The result contains predicted probability of each data point belonging to each class.
"rank:pairwise" --set XGBoost to do ranking task by minimizing the pairwise loss

:param base_score: the initial prediction score of all instances, global bias.

:param random_state: random number seed.

:param verbose: if 1, will print messages during training

:param missing: the number considered by xgboost as missing value.
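
Any of these parameters can be passed to the constructor. Below is a sketch with a few commonly tuned values (parameter names as listed above); the classifier actually trained in this notebook uses the defaults:

from rep.estimators import XGBoostClassifier

# example of a non-default configuration (not used in the cells below)
xgb_tuned = XGBoostClassifier(features=variables, n_estimators=200, eta=0.1,
                              max_depth=4, subsample=0.7, nthreads=4)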

In [16]:
from rep.estimators import XGBoostClassifier
# XGBoost with default parameters
xgb = XGBoostClassifier(features=variables)
xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
print('training complete')


training complete
/Users/antares/code/xgboost/wrapper/xgboost.py:80: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  if label != None:
/Users/antares/code/xgboost/wrapper/xgboost.py:82: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  if weight !=None:

Predict probabilities and estimate quality


In [17]:
prob = xgb.predict_proba(test_data)
print 'ROC AUC:', roc_auc_score(test_labels, prob[:, 1])


ROC AUC: 0.922402718668

Predict labels


In [18]:
xgb.predict(test_data)


Out[18]:
array([1, 1, 1, ..., 1, 1, 1])

In [19]:
xgb.get_feature_importances()


Out[19]:
effect
FlightDistance 1012
FlightDistanceError 940
IP 934
VertexChi2 918
pt 998
p0_pt 866
p1_pt 1070
p2_pt 1040
LifeTime 414
dira 338

Advantages of common interface

As one can see above, all the classifiers implement the same interface. This simplifies work and makes it easy to compare different classifiers, but it is not the only benefit.

Scikit-learn provides various tools for combining classifiers and transformers. One of these tools is AdaBoost, an abstract meta-classifier built on top of some other classifier (usually a decision tree).

Let's show that you can now run AdaBoost over classifiers from other libraries!
(Isn't boosting over a neural network what you have been dreaming of all your life?)

AdaBoost over TMVA classifier


In [20]:
from sklearn.ensemble import AdaBoostClassifier

# Construct AdaBoost with TMVA as base estimator
base_tmva = TMVAClassifier(method='kBDT', NTrees=15, Shrinkage=0.05)
ada_tmva  = SklearnClassifier(AdaBoostClassifier(base_estimator=base_tmva, n_estimators=5), features=variables)
ada_tmva.fit(train_data, train_labels)
print('training complete')


training complete

In [21]:
prob = ada_tmva.predict_proba(test_data)
print 'AUC', roc_auc_score(test_labels, prob[:, 1])


AUC 0.885266565878

AdaBoost over XGBoost


In [22]:
# Construct AdaBoost with xgboost base estimator
base_xgb = XGBoostClassifier(n_estimators=50)
# ada_xgb = SklearnClassifier(AdaBoostClassifier(base_estimator=base_xgb, n_estimators=1), features=variables)
ada_xgb = AdaBoostClassifier(base_estimator=base_xgb, n_estimators=1)
ada_xgb.fit(train_data[variables], train_labels)
print('training complete!')


training complete!

In [23]:
# predict probabilities for each class
prob = ada_xgb.predict_proba(test_data[variables])
print 'AUC', roc_auc_score(test_labels, prob[:, 1])


AUC 0.917898066693

In [24]:
# predict probabilities for each class
prob = ada_xgb.predict_proba(train_data[variables])
print 'AUC', roc_auc_score(train_labels, prob[:, 1])


AUC 0.950702225833

PyBrain

PyBrainClassifier wraps classifiers from PyBrain. PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library.

Parameters:
-----------
:param features: features used in training.
:type features: list[str] or None
:param epochs: number of iterations of training; if < 0 then classifier trains until convergence.
:param scaler: scaler used to transform data; default is StandardScaler.
:type scaler: transformer from sklearn.preprocessing or None
:param boolean use_rprop: flag to indicate whether we should use rprop trainer.

net parameters:
:param list layers: how many neurons are in each hidden layer; default is one hidden layer with 10 neurons.
:param hiddenclass: classes of the hidden layers; default is 'SigmoidLayer'.
:type hiddenclass: list of str
:param params: other net parameters:
    :param boolean bias and outputbias: flags to indicate whether the network should have the corresponding biases;
     both default to True.
    :param boolean peepholes.
    :param boolean recurrent: if the `recurrent` flag is set, a :class:`RecurrentNetwork` will be created,
     otherwise a :class:`FeedForwardNetwork`.
:type params: dict

trainer parameters:
:param float learningrate: the rate at which parameters are changed in the direction of the gradient.
:param float lrdecay: the learning rate is multiplied by lrdecay after each training step.
:param float momentum: the ratio by which the gradient of the last timestep is used.
:param boolean verbose.
:param boolean batchlearning: if batchlearning is set, the parameters are updated only at the end of each epoch. Default is False.
:param float weightdecay: corresponds to the weightdecay rate, where 0 is no weight decay at all.

rprop parameters:
:param float etaminus: factor by which step width is decreased when overstepping (0.5).
:param float etaplus: factor by which step width is increased when following gradient (1.2).
:param float delta: step width for each weight.
:param float deltamin: minimum step width (1e-6).
:param float deltamax: maximum step width (5.0).
:param float delta0: initial step width (0.1).

trainUntilConvergence parameters:
:param int max_epochs: if given, at most that many epochs are trained.
:param int continue_epochs: each time validation error hits a minimum, try for continue_epochs epochs to find a better one.
:param float validation_proportion: the ratio of the dataset that is used for the validation dataset.
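
As with the other wrappers, these parameters are passed to the constructor. A sketch of a non-default configuration based on the parameters listed above (TanhLayer is a standard PyBrain layer class; this variant is not trained in this notebook):

from rep.estimators.pybrain import PyBrainClassifier

# rprop-trained network with two tanh hidden layers
nn_rprop = PyBrainClassifier(layers=[15, 10], hiddenclass=['TanhLayer', 'TanhLayer'],
                             epochs=20, use_rprop=True)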

Loading data


In [2]:
import pandas
import numpy
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score

sig_data = pandas.read_csv('toy_datasets/toyMC_sig_mass.csv', sep='\t')
bck_data = pandas.read_csv('toy_datasets/toyMC_bck_mass.csv', sep='\t')

labels = numpy.array([1] * len(sig_data) + [0] * len(bck_data))
data = pandas.concat([sig_data, bck_data])
variables = ["FlightDistance", "FlightDistanceError", "IP", "VertexChi2", "pt", "p0_pt", "p1_pt", "p2_pt", 'LifeTime','dira']

Splitting into train and test


In [3]:
X_train, X_test, y_train, y_test = train_test_split(data[variables].values, labels, train_size=0.5, random_state = 384)
print X_train.shape, X_test.shape, len(y_train), len(y_test)


(72449, 10) (72450, 10) 72449 72450

Now we create a classifier with two hidden layers of 10 and 20 neurons respectively


In [5]:
from rep.estimators.pybrain import PyBrainClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Binarizer

pb = PyBrainClassifier(layers=[10, 20], epochs=10, verbose=True, scaler=StandardScaler())

Training


In [7]:
%%time
pb.fit(X_train, y_train)


Total error:  0.040186239795
Total error:  0.0368244026758
Total error:  0.0364056124783
Total error:  0.0361849120992
Total error:  0.0359050794801
Total error:  0.0358441554205
Total error:  0.0357131273455
Total error:  0.0357788689297
Total error:  0.0357235802459
Total error:  0.035619950252
CPU times: user 10min 42s, sys: 2.26 s, total: 10min 44s
Wall time: 10min 44s
Out[7]:
PyBrainClassifier(batchlearning=False, delta0=0.1, deltamax=0.5,
         deltamin=1e-06, epochs=10, etaminus=0.5, etaplus=1.2,
         features=['Feature_0', 'Feature_1', 'Feature_2', 'Feature_3', 'Feature_4', 'Feature_5', 'Feature_6', 'Feature_7', 'Feature_8', 'Feature_9'],
         hiddenclass=['SigmoidLayer', 'SigmoidLayer'], layers=[10, 20],
         learningrate=0.01, lrdecay=1.0, momentum=0.0,
         scaler=StandardScaler(copy=True, with_mean=True, with_std=True),
         use_rprop=False, verbose=True, weightdecay=0.0)

Predicting probabilities


In [8]:
prob = pb.predict_proba(X_test)
prob


Out[8]:
array([[ 0.01786076,  0.98213924],
       [ 0.09110269,  0.90889731],
       [ 0.22386682,  0.77613318],
       ..., 
       [ 0.39654057,  0.60345943],
       [ 0.02806454,  0.97193546],
       [ 0.02478111,  0.97521889]])

Predicting labels


In [9]:
y_pred = pb.predict(X_test)

In [10]:
from sklearn.metrics import zero_one_loss
from sklearn.metrics import classification_report

print "Accuracy:", zero_one_loss(y_pred, y_test)
print "Classification report:"
print classification_report(y_pred, y_test)


Error rate: 0.0975569358178
Classification report:
             precision    recall  f1-score   support

          0       0.48      0.74      0.58      6611
          1       0.97      0.92      0.94     65839

avg / total       0.93      0.90      0.91     72450
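
For comparison with the classifiers above, ROC AUC can be computed from the predicted probabilities in the same way (output not shown):

print('ROC AUC %.4f' % roc_auc_score(y_test, prob[:, 1]))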

Other advantages of common interface

There are many things you can do with classifiers now:

  • cloning
  • getting / setting parameters as dictionaries
  • automatic hyperparameter optimization
  • building pipelines (sklearn.pipeline)
  • hierarchical training, training on subsets
  • passing over the internet / training classifiers on other machines

And you can replace classifiers at any moment.
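
For example, since the wrappers follow the sklearn.BaseEstimator protocol, cloning and parameter handling work out of the box (a minimal sketch using the xgb classifier trained above):

from sklearn.base import clone

params = xgb.get_params()               # parameters as a dictionary
xgb_copy = clone(xgb)                   # unfitted copy with the same parameters
xgb_copy.set_params(n_estimators=100)   # change a parameter before retraining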

Exercises

Exercise 1. Play with the parameters of each type of classifier

Exercise 2. Add a weight column and train the models with weights