About

This notebook demonstrates the classifiers provided by the Reproducible Experiment Platform (REP) package.
REP contains wrappers for the following libraries:

  • scikit-learn
  • TMVA
  • XGBoost

Classifiers from hep_ml (or any other sklearn-compatible classifiers) may be used as well.

In this notebook we show the simplest way to

  • train a classifier
  • build predictions
  • measure quality
  • combine metaclassifiers

Loading data

Download the MiniBooNE particle identification dataset from the UCI Machine Learning Repository.


In [1]:
!cd toy_datasets; wget -O MiniBooNE_PID.txt -nc https://archive.ics.uci.edu/ml/machine-learning-databases/00199/MiniBooNE_PID.txt


--2015-10-12 18:15:34--  https://archive.ics.uci.edu/ml/machine-learning-databases/00199/MiniBooNE_PID.txt
Resolving archive.ics.uci.edu... 128.195.1.95
Connecting to archive.ics.uci.edu|128.195.1.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 91174877 (87M) [text/plain]
Saving to: 'MiniBooNE_PID.txt'

100%[======================================>] 91,174,877  1.66MB/s   in 64s    

2015-10-12 18:16:41 (1.35 MB/s) - 'MiniBooNE_PID.txt' saved [91174877/91174877]

FINISHED --2015-10-12 18:16:41--
Total wall clock time: 1m 7s
Downloaded: 1 files, 87M in 1m 4s (1.35 MB/s)

In [2]:
import numpy, pandas
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score

# the first line of the file contains the numbers of signal and background events,
# so we skip it when reading the feature matrix
data = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep='\s*', skiprows=[0], header=None, engine='python')
# read that first line to build the labels: signal events (1) come first, then background events (0)
labels = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=' ', nrows=1, header=None)
labels = [1] * labels[1].values[0] + [0] * labels[2].values[0]
data.columns = ['feature_{}'.format(key) for key in data.columns]

First rows of our data


In [3]:
data[:5]


Out[3]:
feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 feature_9 ... feature_40 feature_41 feature_42 feature_43 feature_44 feature_45 feature_46 feature_47 feature_48 feature_49
0 2.59413 0.468803 20.6916 0.322648 0.009682 0.374393 0.803479 0.896592 3.59665 0.249282 ... 101.174 -31.3730 0.442259 5.86453 0.000000 0.090519 0.176909 0.457585 0.071769 0.245996
1 3.86388 0.645781 18.1375 0.233529 0.030733 0.361239 1.069740 0.878714 3.59243 0.200793 ... 186.516 45.9597 -0.478507 6.11126 0.001182 0.091800 -0.465572 0.935523 0.333613 0.230621
2 3.38584 1.197140 36.0807 0.200866 0.017341 0.260841 1.108950 0.884405 3.43159 0.177167 ... 129.931 -11.5608 -0.297008 8.27204 0.003854 0.141721 -0.210559 1.013450 0.255512 0.180901
3 4.28524 0.510155 674.2010 0.281923 0.009174 0.000000 0.998822 0.823390 3.16382 0.171678 ... 163.978 -18.4586 0.453886 2.48112 0.000000 0.180938 0.407968 4.341270 0.473081 0.258990
4 5.93662 0.832993 59.8796 0.232853 0.025066 0.233556 1.370040 0.787424 3.66546 0.174862 ... 229.555 42.9600 -0.975752 2.66109 0.000000 0.170836 -0.814403 4.679490 1.924990 0.253893

5 rows × 50 columns

Splitting into train and test


In [4]:
# Get train and test data
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.5)

Classifiers

All classifiers inherit from sklearn.BaseEstimator and have the following methods:

  • classifier.fit(X, y, sample_weight=None) - train the classifier

  • classifier.predict_proba(X) - return probabilities for all classes, an array of shape [n_samples, n_classes]

  • classifier.predict(X) - return predicted labels

  • classifier.staged_predict_proba(X) - return probabilities after each iteration (not supported by TMVA)

  • classifier.get_feature_importances() - return estimated feature importances

Here X denotes a matrix of data with shape [n_samples, n_features], y is a vector of labels (0 or 1) with shape [n_samples],
and sample_weight is a vector of event weights.

Difference from default scikit-learn interface

X should be* a pandas.DataFrame, not a numpy.array.
Provided this, you will be able to choose the features used in training by setting, e.g., features=['FlightTime', 'p'] in the constructor.

* it works fine with a numpy.array as well, but in that case all the features will be used.
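
A minimal sketch of this behaviour (GradientBoostingClassifier is just an example base estimator; the feature names are those of this dataset):

from rep.estimators import SklearnClassifier
from sklearn.ensemble import GradientBoostingClassifier

# only the listed columns of the DataFrame are used in training
clf = SklearnClassifier(GradientBoostingClassifier(), features=['feature_0', 'feature_1'])
clf.fit(data, labels)            # data is a pandas.DataFrame, labels are 0/1
print clf.predict_proba(data)    # probabilities, shape [n_samples, 2]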

Variables used in training


In [5]:
variables = list(data.columns[:26])

Sklearn

SklearnClassifier is a wrapper for scikit-learn classifiers. In this example we use GradientBoostingClassifier with default settings.


In [6]:
from rep.estimators import SklearnClassifier
from sklearn.ensemble import GradientBoostingClassifier
# Using gradient boosting with default settings
sk = SklearnClassifier(GradientBoostingClassifier(), features=variables)
# Training classifier
sk.fit(train_data, train_labels)
print('training complete')


training complete

Predicting probabilities, measuring the quality


In [7]:
# predict probabilities for each class
prob = sk.predict_proba(test_data)
print prob


[[ 0.5726395   0.4273605 ]
 [ 0.97640407  0.02359593]
 [ 0.60355959  0.39644041]
 ..., 
 [ 0.81321482  0.18678518]
 [ 0.95970606  0.04029394]
 [ 0.99249754  0.00750246]]

In [8]:
print 'ROC AUC', roc_auc_score(test_labels, prob[:, 1])


ROC AUC 0.974373469718
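
staged_predict_proba, mentioned in the method list above, is not exercised in this notebook. A minimal sketch of how it could be used with the trained sk classifier, assuming it yields one probability array per boosting iteration as described above:

# ROC AUC after each boosting iteration (not supported by TMVA)
for stage, stage_proba in enumerate(sk.staged_predict_proba(test_data)):
    if stage % 20 == 0:
        print stage, roc_auc_score(test_labels, stage_proba[:, 1])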

Predictions of classes


In [9]:
sk.predict(test_data)


Out[9]:
array([0, 0, 0, ..., 0, 0, 0])

In [10]:
sk.get_feature_importances()


Out[10]:
effect
feature_0 0.112941
feature_1 0.064934
feature_2 0.074011
feature_3 0.046601
feature_4 0.016968
feature_5 0.033421
feature_6 0.016995
feature_7 0.004705
feature_8 0.022764
feature_9 0.005404
feature_10 0.014556
feature_11 0.025533
feature_12 0.143363
feature_13 0.022579
feature_14 0.002558
feature_15 0.029933
feature_16 0.134153
feature_17 0.045996
feature_18 0.006470
feature_19 0.031310
feature_20 0.040756
feature_21 0.006868
feature_22 0.023590
feature_23 0.028791
feature_24 0.016908
feature_25 0.027892

TMVA


In [11]:
from rep.estimators import TMVAClassifier
print TMVAClassifier.__doc__


    TMVAClassifier wraps classifiers from TMVA (CERN library for machine learning)

    Parameters:
    -----------
    :param str method: algorithm method (default='kBDT')
    :param features: features used in training
    :type features: list[str] or None
    :param str factory_options: options, for example::

        "!V:!Silent:Color:Transformations=I;D;P;G,D"

    :param str sigmoid_function: function which is used to convert TMVA output to probabilities;

        * *identity* (use for svm, mlp) --- the same output, use this for methods returning class probabilities

        * *sigmoid* --- sigmoid transformation, use it if output varies in range [-infinity, +infinity]

        * *bdt* (for bdt algorithms output varies in range [-1, 1])

        * *sig_eff=0.4* --- for rectangular cut optimization methods,
        for instance, here 0.4 will be used as signal efficiency to evaluate MVA,
        (put any float number from [0, 1])

    :param dict method_parameters: estimator options, example: NTrees=100, BoostType='Grad'

    .. warning::
        TMVA doesn't support *staged_predict_proba()* and *feature_importances__*

    .. warning::
        TMVA doesn't support multiclassification, only two-class classification

    `TMVA guide <http://mirror.yandex.ru/gentoo-distfiles/distfiles/TMVAUsersGuide-v4.03.pdf>`_
    

In [12]:
tmva = TMVAClassifier(method='kBDT', NTrees=50, Shrinkage=0.05, features=variables)
tmva.fit(train_data, train_labels)
print('training complete')


training complete
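
The cell above relies on default factory options; the docstring also documents factory_options and sigmoid_function. A hedged sketch of setting them explicitly (example values only; the factory string is taken from the docstring):

# explicit factory options and output transformation for a BDT
tmva_custom = TMVAClassifier(method='kBDT',
                             factory_options="!V:!Silent:Color:Transformations=I;D;P;G,D",
                             sigmoid_function='bdt',  # BDT output varies in [-1, 1]
                             NTrees=50, Shrinkage=0.05,
                             features=variables)
# then fit / predict_proba as with the classifier above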

Predict probabilities and estimate quality


In [13]:
# predict probabilities for each class
prob = tmva.predict_proba(test_data)
print prob


[[ 0.53313813  0.46686187]
 [ 0.71391331  0.28608669]
 [ 0.47546091  0.52453909]
 ..., 
 [ 0.4661644   0.5338356 ]
 [ 0.6196462   0.3803538 ]
 [ 0.87976222  0.12023778]]

In [14]:
print 'ROC AUC', roc_auc_score(test_labels, prob[:, 1])


ROC AUC 0.964830680782

In [15]:
# predict labels
tmva.predict(test_data)


Out[15]:
array([0, 0, 1, ..., 1, 0, 0])

XGBoost


In [16]:
from rep.estimators import XGBoostClassifier
print XGBoostClassifier.__doc__


    Implements classification (and multiclassification) from XGBoost library.

    Parameters:
    -----------
    :param features: list of features to train model
    :type features: None or list(str)
    :param int n_estimators: the number of trees built.
    :param int nthreads: number of parallel threads used to run xgboost.
    :param num_feature: feature dimension used in boosting, set to maximum dimension of the feature
        (set automatically by xgboost, no need to be set by user).
    :type num_feature: None or int
    :param float gamma: minimum loss reduction required to make a further partition on a leaf node of the tree.
        The larger, the more conservative the algorithm will be.
    :type gamma: None or float
    :param float eta: step size shrinkage used in update to prevent overfitting.
        After each boosting step, we can directly get the weights of new features
        and eta actually shrinkage the feature weights to make the boosting process more conservative.
    :param int max_depth: maximum depth of a tree.
    :param float scale_pos_weight: ration of weights of the class 1 to the weights of the class 0.
    :param float min_child_weight: minimum sum of instance weight(hessian) needed in a child.
        If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight,
        then the building process will give up further partitioning.

        .. note:: weights are normalized so that mean=1 before fitting. Roughly min_child_weight is equal to the number of events.
    :param float subsample: subsample ratio of the training instance.
        Setting it to 0.5 means that XGBoost randomly collected half of the data instances to grow trees
        and this will prevent overfitting.
    :param float colsample: subsample ratio of columns when constructing each tree.
    :param float base_score: the initial prediction score of all instances, global bias.
    :param int random_state: random number seed.
    :param boot verbose: if 1, will print messages during training
    :param float missing: the number considered by xgboost as missing value.
    

In [17]:
# XGBoost with default parameters
xgb = XGBoostClassifier(features=variables)
xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
print('training complete')


training complete
/Users/antares/code/xgboost/wrapper/xgboost.py:80: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  if label != None:
/Users/antares/code/xgboost/wrapper/xgboost.py:82: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  if weight !=None:
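
The cell above uses default parameters. For illustration, a sketch of a non-default configuration built only from the parameters documented in the docstring (the values are arbitrary examples, not tuned for this dataset):

# XGBoost with explicitly chosen boosting / sampling parameters
xgb_custom = XGBoostClassifier(n_estimators=200, eta=0.1, max_depth=6,
                               subsample=0.7, colsample=0.7,
                               nthreads=4, features=variables)
# then fit / predict_proba as with the classifier above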

Predict probabilities and estimate quality


In [18]:
prob = xgb.predict_proba(test_data)
print 'ROC AUC:', roc_auc_score(test_labels, prob[:, 1])


ROC AUC: 0.980301325281

Predict labels


In [19]:
xgb.predict(test_data)


Out[19]:
array([0, 0, 0, ..., 0, 0, 0])

In [20]:
xgb.get_feature_importances()


Out[20]:
effect
feature_0 548
feature_1 364
feature_2 466
feature_3 394
feature_4 438
feature_5 438
feature_6 336
feature_7 288
feature_8 364
feature_9 326
feature_10 356
feature_11 394
feature_12 628
feature_13 400
feature_14 292
feature_15 320
feature_16 358
feature_17 380
feature_18 296
feature_19 464
feature_20 310
feature_21 330
feature_22 284
feature_23 456
feature_24 282
feature_25 304

Advantages of common interface

As one can see above, all the classifiers implement the same interface. This simplifies work and makes it easy to compare different classifiers, but that is not the only benefit.

scikit-learn provides various tools to combine classifiers and transformers. One of these tools is AdaBoost, a metaclassifier built on top of some other base classifier (usually a decision tree).

Let's show that you can now run AdaBoost over classifiers from other libraries!
(Isn't boosting over a neural network what you have been dreaming of all your life?)

AdaBoost over TMVA classifier


In [21]:
from sklearn.ensemble import AdaBoostClassifier

# Construct AdaBoost with TMVA as base estimator
base_tmva = TMVAClassifier(method='kBDT', NTrees=15, Shrinkage=0.05)
ada_tmva  = SklearnClassifier(AdaBoostClassifier(base_estimator=base_tmva, n_estimators=5), features=variables)
ada_tmva.fit(train_data, train_labels)
print('training complete')


training complete

In [22]:
prob = ada_tmva.predict_proba(test_data)
print 'AUC', roc_auc_score(test_labels, prob[:, 1])


AUC 0.966183829791

AdaBoost over XGBoost


In [23]:
# Construct AdaBoost with xgboost base estimator
base_xgb = XGBoostClassifier(n_estimators=50)
# ada_xgb = SklearnClassifier(AdaBoostClassifier(base_estimator=base_xgb, n_estimators=1), features=variables)
ada_xgb = AdaBoostClassifier(base_estimator=base_xgb, n_estimators=1)
ada_xgb.fit(train_data[variables], train_labels)
print('training complete!')


training complete!

In [24]:
# predict probabilities for each class
prob = ada_xgb.predict_proba(test_data[variables])
print 'AUC', roc_auc_score(test_labels, prob[:, 1])


AUC 0.980029038596

In [25]:
# predict probabilities for each class
prob = ada_xgb.predict_proba(train_data[variables])
print 'AUC', roc_auc_score(train_labels, prob[:, 1])


AUC 0.992824797295

Other advantages of common interface

There are many things you can do with classifiers now:

  • clone them
  • get / set parameters as dictionaries
  • perform automatic hyperparameter optimization
  • build pipelines (sklearn.pipeline)
  • use hierarchical training, or training on subsets
  • pass them over the network / train classifiers on other machines

And you can replace classifiers at any moment.
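
For example, a minimal sketch of cloning, parameter handling and a pipeline built from the classifiers above (the StandardScaler step is only an illustrative preprocessing example):

from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# cloning: an untrained copy with the same parameters
xgb_copy = clone(xgb)

# getting / setting parameters as dictionaries
print xgb_copy.get_params()
xgb_copy.set_params(n_estimators=20)

# a simple pipeline: preprocessing followed by a classifier
pipeline = Pipeline([('scaler', StandardScaler()),
                     ('xgb', XGBoostClassifier(n_estimators=50))])
pipeline.fit(train_data[variables], train_labels)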

Exercises

Exercise 1. Play with the parameters of each type of classifier.

Exercise 2. Add a weight column and train the models with sample weights.