About

This notebook demonstrates the MatrixNet service wrapper provided by the Reproducible Experiment Platform (REP) package. The service is available to CERN users.

To use MatrixNet, first acquire a token:

Go to https://yandex-apps.cern.ch/ (log in with your CERN account)
Click Add token in the left panel
Choose the MatrixNet service and click Create token
Create the file ~/.rep-matrixnet.config.json with the following content:

{
   "url": "https://ml.cern.yandex.net/v1",
   "token": "<your_token>"
}
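
The last step can also be done from Python. The sketch below is illustrative only: the helper names `write_matrixnet_config` and `read_matrixnet_config` are hypothetical and not part of REP, and restricting the file permissions is an extra precaution assumed here, not a REP requirement.

```python
import json
import os


def write_matrixnet_config(token, path=None,
                           url="https://ml.cern.yandex.net/v1"):
    """Write the MatrixNet API config as JSON and return the file path."""
    if path is None:
        # Default location named in the steps above.
        path = os.path.join(os.path.expanduser("~"),
                            ".rep-matrixnet.config.json")
    with open(path, "w") as config_file:
        json.dump({"url": url, "token": token}, config_file, indent=3)
    # The token is a credential, so keep the file readable by its owner only.
    os.chmod(path, 0o600)
    return path


def read_matrixnet_config(path):
    """Load the config and check that both expected keys are present."""
    with open(path) as config_file:
        config = json.load(config_file)
    missing = {"url", "token"} - set(config)
    if missing:
        raise ValueError("missing keys in config: {}".format(sorted(missing)))
    return config
```

Calling `write_matrixnet_config("<your_token>")` with no `path` writes to the default location in your home directory.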

In this notebook:

training of a classifier
computing predictions
measuring quality

Loading data

Download the MiniBooNE particle identification data set from the UCI repository.



In [1]:

    
!cd toy_datasets; wget -O MiniBooNE_PID.txt -nc --no-check-certificate https://archive.ics.uci.edu/ml/machine-learning-databases/00199/MiniBooNE_PID.txt









    



File `MiniBooNE_PID.txt' already there; not retrieving.



In [2]:

    
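import numpy, pandas
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score

# Feature rows: skip the first line (it holds the event counts) and split on whitespace
data = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=r'\s+', skiprows=[0], header=None, engine='python')
# The first line stores the number of signal events followed by the number of background events
labels = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=' ', nrows=1, header=None)
# Signal rows come first in the file, so label them 1 and the background rows 0
labels = [1] * labels[1].values[0] + [0] * labels[2].values[0]
data.columns = ['feature_{}'.format(key) for key in data.columns]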
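
The label construction above relies on the file layout: a first line with two counts (signal, then background), followed by the signal rows and then the background rows. A minimal pure-Python sketch of the same idea on a tiny made-up sample; the function name `parse_counted_file` is hypothetical:

```python
def parse_counted_file(text):
    """Split text into feature rows and labels.

    Assumes the first line holds two integers (number of signal rows,
    then number of background rows) and that signal rows come first.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    n_signal, n_background = (int(value) for value in lines[0].split())
    rows = [[float(value) for value in line.split()] for line in lines[1:]]
    if len(rows) != n_signal + n_background:
        raise ValueError("header counts do not match the number of rows")
    labels = [1] * n_signal + [0] * n_background
    return rows, labels


# Made-up sample: 2 signal rows followed by 3 background rows.
sample = """ 2 3
 1.0 2.0
 1.5 2.5
 0.1 0.2
 0.3 0.4
 0.5 0.6
"""
rows, labels = parse_counted_file(sample)
print(labels)  # [1, 1, 0, 0, 0]
```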

First rows of our data



In [3]:

    
data.head()









    Out[3]:






  
    
      
    feature_0  feature_1  feature_2  feature_3  feature_4  feature_5  feature_6  feature_7  feature_8  feature_9  ... feature_40 feature_41 feature_42 feature_43 feature_44 feature_45 feature_46 feature_47 feature_48 feature_49
0    2.59413   0.468803    20.6916   0.322648   0.009682   0.374393   0.803479   0.896592    3.59665   0.249282  ...    101.174   -31.3730   0.442259    5.86453   0.000000   0.090519   0.176909   0.457585   0.071769   0.245996
1    3.86388   0.645781    18.1375   0.233529   0.030733   0.361239   1.069740   0.878714    3.59243   0.200793  ...    186.516    45.9597  -0.478507    6.11126   0.001182   0.091800  -0.465572   0.935523   0.333613   0.230621
2    3.38584   1.197140    36.0807   0.200866   0.017341   0.260841   1.108950   0.884405    3.43159   0.177167  ...    129.931   -11.5608  -0.297008    8.27204   0.003854   0.141721  -0.210559   1.013450   0.255512   0.180901
3    4.28524   0.510155   674.2010   0.281923   0.009174   0.000000   0.998822   0.823390    3.16382   0.171678  ...    163.978   -18.4586   0.453886    2.48112   0.000000   0.180938   0.407968   4.341270   0.473081   0.258990
4    5.93662   0.832993    59.8796   0.232853   0.025066   0.233556   1.370040   0.787424    3.66546   0.174862  ...    229.555    42.9600  -0.975752    2.66109   0.000000   0.170836  -0.814403   4.679490   1.924990   0.253893

5 rows × 50 columns

Splitting into train and test



In [4]:

    
# Get train and test data
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.5)
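
For intuition, a 50/50 random split can be sketched in plain Python as below. This is not how `rep.utils.train_test_split` is implemented; the function name `split_rows` and the `seed` argument are assumptions made for the illustration.

```python
import random


def split_rows(rows, labels, train_size=0.5, seed=0):
    """Shuffle row indices once and split rows and labels at the same positions."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_train = int(train_size * len(indices))
    train_idx, test_idx = indices[:n_train], indices[n_train:]

    def pick(values, chosen):
        return [values[i] for i in chosen]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))
```

Because rows and labels are selected with the same shuffled indices, every row keeps its own label on both sides of the split.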









    




Variables used in training



In [5]:

    
variables = list(data.columns)[:10]

MatrixNet wrapper



In [6]:

    
from rep.estimators import MatrixNetClassifier









    



Couldn't import dot_parser, loading of dot files will not be possible.



In [7]:

    
import rep
rep.__file__









    Out[7]:





'/Users/antares/repositories/rep/rep/__init__.pyc'



In [8]:

    
print MatrixNetClassifier.__doc__









    



MatrixNet classification model. 

    This is a wrapper around **MatrixNet (specific BDT)** technology developed at **Yandex**,
    which is available for CERN people using authorization.
    Trained estimator is downloaded and stored at your computer, so you can use it at any time.

    :param train_features: features used in training
    :type train_features: list[str] or None
    :param api_config_file: path to the file with remote api configuration in the json format::

                {"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"}

    :type api_config_file: str

    :param int iterations: number of constructed trees (default=100)
    :param float regularization: regularization number (default=0.01)
    :param intervals: number of bins for features discretization or dict with borders
     list for each feature for its discretisation (default=8)
    :type intervals: int or dict(str, list)
    :param int max_features_per_iteration: depth (default=6, supports 1 <= .. <= 6)
    :param float features_sample_rate_per_iteration: training features sampling (default=1.0)
    :param float training_fraction: training rows bagging (default=0.5)
    :param auto_stop: error value for training prestopping
    :type auto_stop: None or float
    :param bool sync: synchronic or asynchronic training on the server
    :param random_state: state for a pseudo random generator
    :type random_state: None or int or RandomState



In [9]:

    
# configure the classifier (the configuration is read from $HOME/.rep-matrixnet.config.json)
mn = MatrixNetClassifier(features=variables, iterations=300, sync=False)
# train the classifier
mn.fit(train_data, train_labels)
# note: sync=False makes training asynchronous; the dataset has been passed to the server,
# and you can keep working in Python while the classifier is trained there
print('asynchronous training started')









    



asynchronous training started



In [10]:

    
import time
# Check status of training
print 'Is training complete?', mn.training_status()
time.sleep(15)
# get number of iterations
print 'Number of iterations already done', mn.get_iterations()
# Synchronize (wait until the training is complete)
mn.synchronize()
print 'Is training complete?', mn.training_status()









    



Is training complete?<rep.estimators._mnkit.Estimator object at 0x115498b50>
 False
Number of iterations already done None
<rep.estimators._mnkit.Estimator object at 0x115498b50>
<rep.estimators._mnkit.Estimator object at 0x115498b50>
<rep.estimators._mnkit.Estimator object at 0x115498b50>
<rep.estimators._mnkit.Estimator object at 0x115498b50>
<rep.estimators._mnkit.Estimator object at 0x115498b50>
Is training complete?<rep.estimators._mnkit.Estimator object at 0x115498b50>
 True

Note: if training fails, call mn.resubmit()
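
Instead of a fixed `time.sleep`, the status check can be wrapped in a polling loop with a timeout. The sketch below assumes that `training_status()` returns a boolean, as the output above suggests; `wait_for_training` and `FakeJob` are hypothetical names used only for this illustration.

```python
import time


def wait_for_training(is_complete, poll_interval=5.0, timeout=600.0):
    """Poll is_complete() until it returns True or `timeout` seconds elapse.

    Returns True if training finished, False if the timeout was reached.
    """
    deadline = time.time() + timeout
    while not is_complete():
        if time.time() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


class FakeJob(object):
    """Hypothetical stand-in that reports completion on the third check."""

    def __init__(self):
        self.checks = 0

    def training_status(self):
        self.checks += 1
        return self.checks >= 3


job = FakeJob()
print(wait_for_training(job.training_status, poll_interval=0.01))  # True
```

With the classifier from this notebook the call would be `wait_for_training(mn.training_status)`; `mn.synchronize()` remains the simpler choice when no timeout is needed.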

Predict probabilities and estimate quality



In [11]:

    
import rep



In [12]:

    
# predict probabilities for each class
prob = mn.predict_proba(test_data)
print prob









    



[[ 0.40882575  0.59117425]
 [ 0.28054092  0.71945908]
 [ 0.05378584  0.94621416]
 ..., 
 [ 0.70447783  0.29552217]
 [ 0.98245544  0.01754456]
 [ 0.19312067  0.80687933]]



In [13]:

    
# for prob in mn.staged_predict_proba(test_data):
#     print prob



In [14]:

    
print 'AUC', roc_auc_score(test_labels, prob[:, 1])









    



AUC 0.956957198089
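
The AUC can be read as the fraction of (signal, background) pairs in which the signal event gets the higher score. A small pure-Python sketch of that definition follows; it is quadratic in the number of events and meant for intuition only, not as a replacement for `roc_auc_score`.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Three of the four (positive, negative) pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```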



In [15]:

    
%matplotlib inline
from rep.report.metrics import RocAuc
mn.test_on(test_data, test_labels).learning_curve(RocAuc())
mn.predict_proba??

Predictions of classes



In [16]:

    
mn.predict(test_data)









    Out[16]:





array([1, 1, 1, ..., 0, 0, 1])
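
These class predictions are consistent with picking the most probable class in each row of the `predict_proba` output printed earlier. The check below uses the six rows visible in that output; that `predict` works this way internally is an assumption, not something the notebook states.

```python
# The six rows visible in the predict_proba output printed earlier.
prob_rows = [
    [0.40882575, 0.59117425],
    [0.28054092, 0.71945908],
    [0.05378584, 0.94621416],
    [0.70447783, 0.29552217],
    [0.98245544, 0.01754456],
    [0.19312067, 0.80687933],
]


def most_probable_class(row):
    """Return the index of the largest probability in the row."""
    return max(range(len(row)), key=lambda index: row[index])


predicted = [most_probable_class(row) for row in prob_rows]
print(predicted)  # [1, 1, 1, 0, 0, 1]
```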

Feature importances: three different measures are returned



In [17]:

    
mn.get_feature_importances()









    Out[17]:






  
    
      
                effect   efficiency  information
feature_0     0.821777     0.863598     0.951574
feature_1     0.651719     0.684911     0.951538
feature_2     1.000000     1.000000     1.000000
feature_3     0.410177     0.431056     0.951562
feature_4     0.061307     0.075013     0.817289
feature_5     0.177839     0.200786     0.885715
feature_6     0.100303     0.105411     0.951545
feature_7     0.095978     0.100864     0.951558
feature_8     0.163450     0.171770     0.951562
feature_9     0.071552     0.075194     0.951558
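
To read the table, it can help to rank the features by one of the measures. A small sketch using the `effect` values copied from the output above; the dictionary is typed in by hand for the illustration and is not produced by REP.

```python
# "effect" column copied from the feature importance table above.
effect = {
    "feature_0": 0.821777,
    "feature_1": 0.651719,
    "feature_2": 1.000000,
    "feature_3": 0.410177,
    "feature_4": 0.061307,
    "feature_5": 0.177839,
    "feature_6": 0.100303,
    "feature_7": 0.095978,
    "feature_8": 0.163450,
    "feature_9": 0.071552,
}

# Sort feature names from the largest effect to the smallest.
ranked = sorted(effect, key=effect.get, reverse=True)
print(ranked[:3])  # ['feature_2', 'feature_0', 'feature_1']
```

If the returned object is a pandas DataFrame, as the table layout suggests, sorting it by the `effect` column with a recent pandas gives the same ordering.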
