About

This notebook demonstrates the MatrixNet service wrapper provided by the Reproducible Experiment Platform (REP) package. The service is available to CERN users.

To use MatrixNet, first acquire a token:

  • Go to https://yandex-apps.cern.ch/ (log in with your CERN account)
  • Click Add token at the left panel
  • Choose service MatrixNet and click Create token
  • Create the file ~/.rep-matrixnet.config.json with the following content:
{
   "url": "https://ml.cern.yandex.net/v1",
   "token": "<your_token>"
}
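
The last step can also be scripted. The following is a minimal sketch, not part of REP: the helper name write_matrixnet_config is hypothetical, and restricting the file permissions is an added precaution rather than something the service requires. The demonstration writes to a temporary directory so that an existing config is not overwritten.

```python
import json
import os
import tempfile

# Endpoint taken from the config content shown above.
MATRIXNET_URL = "https://ml.cern.yandex.net/v1"


def write_matrixnet_config(token, path, url=MATRIXNET_URL):
    """Write a MatrixNet config file (hypothetical helper, not part of REP)."""
    config = {"url": url, "token": token}
    with open(path, "w") as config_file:
        json.dump(config, config_file, indent=3)
    # The token is a credential, so limit access to the current user
    # (this has its full effect on POSIX systems only).
    os.chmod(path, 0o600)
    return config


# Demonstration in a temporary directory; for real use pass
# os.path.expanduser("~/.rep-matrixnet.config.json") as the path.
demo_path = os.path.join(tempfile.mkdtemp(), ".rep-matrixnet.config.json")
write_matrixnet_config("<your_token>", demo_path)
with open(demo_path) as config_file:
    loaded = json.load(config_file)
```

Replace "<your_token>" with the token created in the web interface before pointing the helper at the real config path.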

In this notebook:

  • training of a classifier
  • computing predictions
  • measuring quality

Loading data

Download the MiniBooNE particle identification data set from UCI.


In [1]:
!cd toy_datasets; wget -O MiniBooNE_PID.txt -nc --no-check-certificate https://archive.ics.uci.edu/ml/machine-learning-databases/00199/MiniBooNE_PID.txt


File `MiniBooNE_PID.txt' already there; not retrieving.

In [2]:
import numpy, pandas
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score

data = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=r'\s+', skiprows=[0], header=None, engine='python')
labels = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=' ', nrows=1, header=None)
labels = [1] * labels[1].values[0] + [0] * labels[2].values[0]
data.columns = ['feature_{}'.format(key) for key in data.columns]
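
The label construction in the cell above relies on the file layout: the first line holds two counts (signal rows, then background rows), and the signal rows precede the background rows in the rest of the file. The sketch below replays that logic in plain Python on a made-up miniature of the file; the sample values are illustrative only and are not taken from the data set.

```python
# Hypothetical miniature of MiniBooNE_PID.txt: the first line holds the number
# of signal and background rows, the remaining lines hold feature values.
sample_lines = [
    " 2 3",
    " 2.59 0.47",
    " 3.86 0.65",
    " 3.39 1.20",
    " 4.29 0.51",
    " 5.94 0.83",
]

# Parse the two counts from the header line.
n_signal, n_background = (int(value) for value in sample_lines[0].split())

# Parse the feature rows, splitting on any run of whitespace.
rows = [[float(value) for value in line.split()] for line in sample_lines[1:]]

# Signal rows come first, so the labels are a run of ones followed by zeros.
labels = [1] * n_signal + [0] * n_background

print(labels)  # [1, 1, 0, 0, 0]
```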

First rows of our data


In [3]:
data.head()


Out[3]:
feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 feature_9 ... feature_40 feature_41 feature_42 feature_43 feature_44 feature_45 feature_46 feature_47 feature_48 feature_49
0 2.59413 0.468803 20.6916 0.322648 0.009682 0.374393 0.803479 0.896592 3.59665 0.249282 ... 101.174 -31.3730 0.442259 5.86453 0.000000 0.090519 0.176909 0.457585 0.071769 0.245996
1 3.86388 0.645781 18.1375 0.233529 0.030733 0.361239 1.069740 0.878714 3.59243 0.200793 ... 186.516 45.9597 -0.478507 6.11126 0.001182 0.091800 -0.465572 0.935523 0.333613 0.230621
2 3.38584 1.197140 36.0807 0.200866 0.017341 0.260841 1.108950 0.884405 3.43159 0.177167 ... 129.931 -11.5608 -0.297008 8.27204 0.003854 0.141721 -0.210559 1.013450 0.255512 0.180901
3 4.28524 0.510155 674.2010 0.281923 0.009174 0.000000 0.998822 0.823390 3.16382 0.171678 ... 163.978 -18.4586 0.453886 2.48112 0.000000 0.180938 0.407968 4.341270 0.473081 0.258990
4 5.93662 0.832993 59.8796 0.232853 0.025066 0.233556 1.370040 0.787424 3.66546 0.174862 ... 229.555 42.9600 -0.975752 2.66109 0.000000 0.170836 -0.814403 4.679490 1.924990 0.253893

5 rows × 50 columns

Splitting into train and test


In [4]:
# Get train and test data
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.5)


/Users/antares/.virtualenvs/rep/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

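A split with train_size=0.5 can be pictured as shuffling the row indices and cutting the shuffled list in half, keeping each row paired with its label. The sketch below is a plain-Python illustration only: split_half is a hypothetical helper, and the shuffling and seed handling are assumptions, not a description of how rep.utils.train_test_split is implemented.

```python
import random


def split_half(rows, labels, train_size=0.5, seed=0):
    """Shuffle row indices and cut them at train_size (illustrative helper)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_size)
    train_idx, test_idx = indices[:cut], indices[cut:]

    def take(values, idx):
        return [values[i] for i in idx]

    # Same return order as in the cell above: data first, then labels.
    return (take(rows, train_idx), take(rows, test_idx),
            take(labels, train_idx), take(labels, test_idx))


toy_rows = [[float(i)] for i in range(10)]
toy_labels = [1] * 4 + [0] * 6
toy_train, toy_test, toy_train_labels, toy_test_labels = split_half(toy_rows, toy_labels)
print(len(toy_train), len(toy_test))
```
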
Variables used in training


In [5]:
variables = list(data.columns)[:10]

MatrixNet wrapper


In [6]:
from rep.estimators import MatrixNetClassifier


Couldn't import dot_parser, loading of dot files will not be possible.

In [7]:
import rep
rep.__file__


Out[7]:
'/Users/antares/repositories/rep/rep/__init__.pyc'

In [8]:
print MatrixNetClassifier.__doc__


MatrixNet classification model. 

    This is a wrapper around **MatrixNet (specific BDT)** technology developed at **Yandex**,
    which is available for CERN people using authorization.
    Trained estimator is downloaded and stored at your computer, so you can use it at any time.

    :param train_features: features used in training
    :type train_features: list[str] or None
    :param api_config_file: path to the file with remote api configuration in the json format::

                {"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"}

    :type api_config_file: str

    :param int iterations: number of constructed trees (default=100)
    :param float regularization: regularization number (default=0.01)
    :param intervals: number of bins for features discretization or dict with borders
     list for each feature for its discretisation (default=8)
    :type intervals: int or dict(str, list)
    :param int max_features_per_iteration: depth (default=6, supports 1 <= .. <= 6)
    :param float features_sample_rate_per_iteration: training features sampling (default=1.0)
    :param float training_fraction: training rows bagging (default=0.5)
    :param auto_stop: error value for training prestopping
    :type auto_stop: None or float
    :param bool sync: synchronic or asynchronic training on the server
    :param random_state: state for a pseudo random generator
    :type random_state: None or int or RandomState
    

In [9]:
# configure the classifier (configuration is read from $HOME/.rep-matrixnet.config.json)
mn = MatrixNetClassifier(features=variables, iterations=300, sync=False)
# train the classifier
mn.fit(train_data, train_labels)
# note: with sync=False training is asynchronous: fit() sends the dataset to the server and returns,
# so you can keep working in Python while the classifier is trained remotely
print('asynchronous training started')


asynchronous training started

In [10]:
import time
# Check status of training
print 'Is training complete?', mn.training_status()
time.sleep(15)
# get number of iterations
print 'Number of iterations already done', mn.get_iterations()
# Synchronize (wait until the training is complete)
mn.synchronize()
print 'Is training complete?', mn.training_status()


Is training complete? False
Number of iterations already done None
Is training complete? True

Note: if training fails, call mn.resubmit()
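
mn.synchronize() blocks until the server finishes. When a bounded wait is preferable, the status can be polled with a timeout instead. The following is a sketch under stated assumptions: wait_until_trained and FakeAsyncEstimator are hypothetical names, and the only method assumed from the real classifier is training_status(), which returned False and then True in the output above.

```python
import time


class FakeAsyncEstimator(object):
    """Hypothetical stand-in for a classifier trained with sync=False."""

    def __init__(self, polls_until_done):
        self._remaining = polls_until_done

    def training_status(self):
        # Mimics mn.training_status(): False while training, True when done.
        self._remaining -= 1
        return self._remaining <= 0


def wait_until_trained(estimator, poll_interval=15.0, timeout=3600.0):
    """Poll training_status() until it returns True or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if estimator.training_status():
            return True
        time.sleep(poll_interval)
    return False


# The fake estimator reports completion on the third poll.
finished = wait_until_trained(FakeAsyncEstimator(polls_until_done=3),
                              poll_interval=0.01, timeout=5.0)
print(finished)  # True
```

With the real classifier the call would be wait_until_trained(mn); a False result means the timeout expired, at which point checking the server or calling mn.resubmit() are the options mentioned above.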

Predict probabilities and estimate quality


In [11]:
import rep

In [12]:
# predict probabilities for each class
prob = mn.predict_proba(test_data)
print prob


[[ 0.40882575  0.59117425]
 [ 0.28054092  0.71945908]
 [ 0.05378584  0.94621416]
 ..., 
 [ 0.70447783  0.29552217]
 [ 0.98245544  0.01754456]
 [ 0.19312067  0.80687933]]

In [13]:
# for prob in mn.staged_predict_proba(test_data):
#     print prob

In [14]:
print 'AUC', roc_auc_score(test_labels, prob[:, 1])


AUC 0.956957198089
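
For intuition about the number above: ROC AUC equals the fraction of (signal, background) pairs in which the signal event receives the higher score, with ties counted as one half. The sketch below computes it that way in plain Python on made-up scores. It is quadratic in the number of events and is meant only as an illustration, not as a replacement for roc_auc_score.

```python
def pairwise_roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


# Made-up example: one of the four (positive, negative) pairs is misordered.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auc = pairwise_roc_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75
```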

In [15]:
%matplotlib inline
from rep.report.metrics import RocAuc
mn.test_on(test_data, test_labels).learning_curve(RocAuc())

Predictions of classes


In [16]:
mn.predict(test_data)


Out[16]:
array([1, 1, 1, ..., 0, 0, 1])
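
These class predictions are consistent with picking the more probable class in each row of the predict_proba output shown earlier. The sketch below assumes that this is how predict relates to predict_proba (the source does not state it) and applies an argmax to the six probability rows that were printed; argmax_classes is a hypothetical helper.

```python
# The six rows printed by predict_proba above: [P(class 0), P(class 1)].
prob_rows = [
    [0.40882575, 0.59117425],
    [0.28054092, 0.71945908],
    [0.05378584, 0.94621416],
    [0.70447783, 0.29552217],
    [0.98245544, 0.01754456],
    [0.19312067, 0.80687933],
]


def argmax_classes(rows):
    """Return the index of the largest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


predicted = argmax_classes(prob_rows)
print(predicted)  # [1, 1, 1, 0, 0, 1]
```

The result matches the leading and trailing entries of the array returned by mn.predict(test_data).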

Feature importances: returns three different measures


In [17]:
mn.get_feature_importances()


Out[17]:
            effect    efficiency  information
feature_0   0.821777  0.863598    0.951574
feature_1   0.651719  0.684911    0.951538
feature_2   1.000000  1.000000    1.000000
feature_3   0.410177  0.431056    0.951562
feature_4   0.061307  0.075013    0.817289
feature_5   0.177839  0.200786    0.885715
feature_6   0.100303  0.105411    0.951545
feature_7   0.095978  0.100864    0.951558
feature_8   0.163450  0.171770    0.951562
feature_9   0.071552  0.075194    0.951558
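
To read the table more easily, the features can be ranked by one of the measures. The sketch below copies the values from the output above into a plain dictionary and sorts by the effect column. It makes no claim about what each measure means or about the type of object that get_feature_importances() returns.

```python
# (effect, efficiency, information) per feature, copied from the table above.
importances = {
    "feature_0": (0.821777, 0.863598, 0.951574),
    "feature_1": (0.651719, 0.684911, 0.951538),
    "feature_2": (1.000000, 1.000000, 1.000000),
    "feature_3": (0.410177, 0.431056, 0.951562),
    "feature_4": (0.061307, 0.075013, 0.817289),
    "feature_5": (0.177839, 0.200786, 0.885715),
    "feature_6": (0.100303, 0.105411, 0.951545),
    "feature_7": (0.095978, 0.100864, 0.951558),
    "feature_8": (0.163450, 0.171770, 0.951562),
    "feature_9": (0.071552, 0.075194, 0.951558),
}

# Sort feature names by the first measure (effect), largest first.
ranked_by_effect = sorted(importances,
                          key=lambda name: importances[name][0],
                          reverse=True)
print(ranked_by_effect[:3])  # ['feature_2', 'feature_0', 'feature_1']
```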