This notebook demonstrates classifiers, which are provided by the Reproducible experiment platform (REP) package.
REP contains the following classifier wrappers: Sklearn, TMVA and XGBoost (all demonstrated below). Classifiers from hep_ml, as well as any other sklearn-compatible classifiers, may be used too.
In [1]:
!cd toy_datasets; wget -O MiniBooNE_PID.txt -nc https://archive.ics.uci.edu/ml/machine-learning-databases/00199/MiniBooNE_PID.txt
In [2]:
import numpy, pandas
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score
# the first line of the file contains the numbers of signal and background events,
# the remaining lines contain the feature values for each event
data = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=r'\s+', skiprows=[0], header=None, engine='python')
labels = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=' ', nrows=1, header=None)
labels = [1] * labels[1].values[0] + [0] * labels[2].values[0]
data.columns = ['feature_{}'.format(key) for key in data.columns]
In [3]:
data[:5]
Out[3]:
In [4]:
# Get train and test data
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.5)
All classifiers inherit from sklearn.BaseEstimator and have the following methods:

classifier.fit(X, y, sample_weight=None) - train the classifier

classifier.predict_proba(X) - return the vector of probabilities for all classes

classifier.predict(X) - return predicted labels

classifier.staged_predict_proba(X) - return probabilities after each iteration (not supported by TMVA)

classifier.get_feature_importances() - return estimates of feature importances

Here X denotes the matrix with data of shape [n_samples, n_features], y is the vector of labels (0 or 1) of shape [n_samples], and sample_weight is the vector of weights.

X should be* a pandas.DataFrame, not a numpy.array. Provided this, you'll be able to choose the features used in training by setting, e.g., features=['FlightTime', 'p'] in the constructor (see the sketch below).

* it works fine with numpy.array as well, but in that case all the features will be used.
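To illustrate the note above, here is a minimal sketch; the feature names 'FlightTime' and 'p' are just the hypothetical names from the text and are not columns of this dataset.

In [ ]:
# minimal sketch: with a pandas.DataFrame, only the features listed in `features`
# are used for training; with a plain numpy.array all columns would be used
from rep.estimators import SklearnClassifier
from sklearn.ensemble import GradientBoostingClassifier

clf = SklearnClassifier(GradientBoostingClassifier(), features=['FlightTime', 'p'])
# clf.fit(some_dataframe, some_labels)  # would train on the 'FlightTime' and 'p' columns only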
In [5]:
variables = list(data.columns[:26])
In [6]:
from rep.estimators import SklearnClassifier
from sklearn.ensemble import GradientBoostingClassifier
# Using gradient boosting with default settings
sk = SklearnClassifier(GradientBoostingClassifier(), features=variables)
# Training classifier
sk.fit(train_data, train_labels)
print('training complete')
In [7]:
# predict probabilities for each class
prob = sk.predict_proba(test_data)
print(prob)
In [8]:
print('ROC AUC', roc_auc_score(test_labels, prob[:, 1]))
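The staged_predict_proba method mentioned above can be used to monitor quality after each boosting iteration. A quick sketch (assuming the wrapped GradientBoostingClassifier exposes staged predictions through the REP wrapper, as stated in the method list above):

In [ ]:
# track ROC AUC on the test sample as the ensemble grows
for stage, stage_proba in enumerate(sk.staged_predict_proba(test_data)):
    if stage % 20 == 0:
        print(stage, roc_auc_score(test_labels, stage_proba[:, 1]))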
In [9]:
sk.predict(test_data)
Out[9]:
In [10]:
sk.get_feature_importances()
Out[10]:
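Feature importances can also be ranked. A small sketch, assuming the wrapped sklearn estimator is accessible through the clf attribute of SklearnClassifier (an assumption about REP internals):

In [ ]:
# rank the training features by the importance assigned by the gradient boosting model;
# `sk.clf` is assumed to be the fitted sklearn GradientBoostingClassifier
importances = pandas.Series(sk.clf.feature_importances_, index=variables)
print(importances.sort_values(ascending=False).head(10))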
In [11]:
from rep.estimators import TMVAClassifier
print(TMVAClassifier.__doc__)
In [12]:
tmva = TMVAClassifier(method='kBDT', NTrees=50, Shrinkage=0.05, features=variables)
tmva.fit(train_data, train_labels)
print('training complete')
In [13]:
# predict probabilities for each class
prob = tmva.predict_proba(test_data)
print(prob)
In [14]:
print('ROC AUC', roc_auc_score(test_labels, prob[:, 1]))
In [15]:
# predict labels
tmva.predict(test_data)
Out[15]:
In [16]:
from rep.estimators import XGBoostClassifier
print(XGBoostClassifier.__doc__)
In [17]:
# XGBoost with default parameters
xgb = XGBoostClassifier(features=variables)
xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
print('training complete')
In [18]:
prob = xgb.predict_proba(test_data)
print('ROC AUC:', roc_auc_score(test_labels, prob[:, 1]))
In [19]:
xgb.predict(test_data)
Out[19]:
In [20]:
xgb.get_feature_importances()
Out[20]:
As one can see above, all the classifiers implement the same interface. This simplifies the work and makes it easy to compare different classifiers, but it is not the only benefit.
Sklearn provides different tools to combine classifiers and transformers.
One of these tools is AdaBoost, an abstract meta-classifier built on top of some other classifier (usually a decision tree).
Let's show that you can now run AdaBoost over classifiers from other libraries!
(isn't boosting over a neural network what you were dreaming of all your life?)
In [21]:
from sklearn.ensemble import AdaBoostClassifier
# Construct AdaBoost with TMVA as base estimator
base_tmva = TMVAClassifier(method='kBDT', NTrees=15, Shrinkage=0.05)
ada_tmva = SklearnClassifier(AdaBoostClassifier(base_estimator=base_tmva, n_estimators=5), features=variables)
ada_tmva.fit(train_data, train_labels)
print('training complete')
In [22]:
prob = ada_tmva.predict_proba(test_data)
print('AUC', roc_auc_score(test_labels, prob[:, 1]))
In [23]:
# Construct AdaBoost with xgboost base estimator
base_xgb = XGBoostClassifier(n_estimators=50)
# ada_xgb = SklearnClassifier(AdaBoostClassifier(base_estimator=base_xgb, n_estimators=1), features=variables)
ada_xgb = AdaBoostClassifier(base_estimator=base_xgb, n_estimators=1)
ada_xgb.fit(train_data[variables], train_labels)
print('training complete!')
In [24]:
# predict probabilities for each class
prob = ada_xgb.predict_proba(test_data[variables])
print('AUC', roc_auc_score(test_labels, prob[:, 1]))
In [25]:
# predict probabilities on the training sample (to compare with the test AUC above)
prob = ada_xgb.predict_proba(train_data[variables])
print('AUC', roc_auc_score(train_labels, prob[:, 1]))
There are many things you can do with classifiers now: for instance, combine them with transformers in pipelines (sklearn.pipeline). And you can replace classifiers at any moment.
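As an illustration, a minimal sketch of such a pipeline (not part of the original notebook): a scaler followed by gradient boosting, trained on the same variables as before.

In [ ]:
# combine a preprocessing step and a classifier in a single sklearn pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('gbdt', GradientBoostingClassifier(n_estimators=50)),
])
pipe.fit(train_data[variables], train_labels)
print('pipeline AUC', roc_auc_score(test_labels, pipe.predict_proba(test_data[variables])[:, 1]))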