This notebook demonstrates the classifiers provided by the Reproducible Experiment Platform (REP) package. REP wraps, among others, scikit-learn, TMVA and XGBoost classifiers behind a common interface (and any sklearn-compatible classifier may be used as well).
Neural network libraries are introduced in a different notebook.
In [1]:
!cd toy_datasets; wget -O MiniBooNE_PID.txt -nc --no-check-certificate https://archive.ics.uci.edu/ml/machine-learning-databases/00199/MiniBooNE_PID.txt
In [2]:
import numpy, pandas
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score

# the first line of the file holds the numbers of signal and background events;
# all following lines contain the feature values
data = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=r'\s*', skiprows=[0], header=None, engine='python')
labels = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=' ', nrows=1, header=None)
labels = [1] * labels[1].values[0] + [0] * labels[2].values[0]
data.columns = ['feature_{}'.format(key) for key in data.columns]
In [3]:
data[:5]
Out[3]:
In [4]:
# Get train and test data
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.25)
All classifiers inherit from sklearn.BaseEstimator and have the following methods:

classifier.fit(X, y, sample_weight=None) - train the classifier
classifier.predict_proba(X) - return the vector of class probabilities for each sample
classifier.predict(X) - return predicted labels
classifier.staged_predict_proba(X) - return probabilities after each iteration (not supported by TMVA)
classifier.get_feature_importances() - return the importance of each feature

Here X denotes the data matrix of shape [n_samples, n_features], y is the vector of labels (0 or 1) of shape [n_samples], and sample_weight is the vector of sample weights.

X should be* a pandas.DataFrame, not a numpy.array. Provided this, you can choose the features used in training by setting, e.g., features=['FlightTime', 'p'] in the constructor.

* it works fine with numpy.array as well, but in that case all features will be used.
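A minimal sketch of this interface is shown below; the choice of GradientBoostingClassifier and the two selected feature columns are purely illustrative.

In [ ]:
from rep.estimators import SklearnClassifier
from sklearn.ensemble import GradientBoostingClassifier

# train only on two named columns of the DataFrame (illustrative choice)
clf = SklearnClassifier(GradientBoostingClassifier(), features=['feature_0', 'feature_1'])
clf.fit(train_data, train_labels)            # X is a pandas.DataFrame, y a vector of 0/1 labels
prob = clf.predict_proba(test_data)          # array of shape [n_samples, n_classes]
predictions = clf.predict(test_data)         # predicted class labels
importances = clf.get_feature_importances()  # importance of each used feature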
In [5]:
variables = list(data.columns[:15])
In [6]:
from rep.estimators import SklearnClassifier
from sklearn.ensemble import GradientBoostingClassifier
# Using gradient boosting with default settings
sk = SklearnClassifier(GradientBoostingClassifier(), features=variables)
# Training classifier
sk.fit(train_data, train_labels)
print('training complete')
In [7]:
# predict probabilities for each class
prob = sk.predict_proba(test_data)
print(prob)
In [8]:
print('ROC AUC', roc_auc_score(test_labels, prob[:, 1]))
In [9]:
sk.predict(test_data)
Out[9]:
In [10]:
sk.get_feature_importances()
Out[10]:
In [11]:
from rep.estimators import TMVAClassifier
print(TMVAClassifier.__doc__)
In [12]:
tmva = TMVAClassifier(method='kBDT', NTrees=50, Shrinkage=0.05, features=variables)
tmva.fit(train_data, train_labels)
print('training complete')
In [13]:
# predict probabilities for each class
prob = tmva.predict_proba(test_data)
print(prob)
In [14]:
print('ROC AUC', roc_auc_score(test_labels, prob[:, 1]))
In [15]:
# predict labels
tmva.predict(test_data)
Out[15]:
In [16]:
from rep.estimators import XGBoostClassifier
print(XGBoostClassifier.__doc__)
In [17]:
# XGBoost with default parameters
xgb = XGBoostClassifier(features=variables)
xgb.fit(train_data, train_labels)
print('training complete')
In [18]:
prob = xgb.predict_proba(test_data)
print('ROC AUC:', roc_auc_score(test_labels, prob[:, 1]))
In [19]:
xgb.predict(test_data)
Out[19]:
In [20]:
xgb.get_feature_importances()
Out[20]:
As you can see above, all the classifiers implement the same interface. This simplifies work and makes it easy to compare different classifiers, but that is not the only benefit.

sklearn provides various tools for combining classifiers and transformers. One of these tools is AdaBoost, a meta-classifier built on top of some other classifier (usually a decision tree). Bagging is another frequently used ensembling meta-algorithm (a sketch is given after the AdaBoost example below).

Let's show that you can now run AdaBoost over classifiers from other libraries!
(Isn't boosting over a neural network what you have been dreaming of all your life?)
In [21]:
from sklearn.ensemble import AdaBoostClassifier
In [22]:
%%time
base_xgb = XGBoostClassifier(n_estimators=20)
ada_xgb = SklearnClassifier(AdaBoostClassifier(base_estimator=base_xgb, n_estimators=5))
ada_xgb.fit(train_data[variables], train_labels)
print('training complete!')
# predict probabilities for each class
prob = ada_xgb.predict_proba(test_data[variables])
print('AUC', roc_auc_score(test_labels, prob[:, 1]))
# predict probabilities for each class
prob = ada_xgb.predict_proba(train_data[variables])
print('AUC', roc_auc_score(train_labels, prob[:, 1]))
In [23]:
# base_tmva = TMVAClassifier(method='kBDT', NTrees=20)
# ada_tmva = SklearnClassifier(AdaBoostClassifier(base_estimator=base_tmva, n_estimators=5), features=variables)
# ada_tmva.fit(train_data, train_labels)
# print('training complete')
# prob = ada_tmva.predict_proba(test_data)
# print('AUC', roc_auc_score(test_labels, prob[:, 1]))
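Bagging works the same way over classifiers from other libraries. Below is a minimal sketch using sklearn's BaggingClassifier over XGBoostClassifier; the parameters are illustrative, not tuned.

In [ ]:
from sklearn.ensemble import BaggingClassifier

# bagging over XGBoost (illustrative parameters)
base_xgb = XGBoostClassifier(n_estimators=20)
bag_xgb = SklearnClassifier(BaggingClassifier(base_estimator=base_xgb, n_estimators=5))
bag_xgb.fit(train_data[variables], train_labels)
prob = bag_xgb.predict_proba(test_data[variables])
print('AUC', roc_auc_score(test_labels, prob[:, 1]))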
There are many things you can do with classifiers now, for example combine them with transformers in sklearn.pipeline. And you can replace classifiers at any moment.
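As an illustration, here is a minimal sklearn.pipeline sketch; the StandardScaler step and the parameters are chosen only for demonstration.

In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

# a simple pipeline: feature scaling followed by gradient boosting (illustrative)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('gb', GradientBoostingClassifier(n_estimators=50)),
])
pipeline.fit(train_data[variables], train_labels)
prob = pipeline.predict_proba(test_data[variables])
print('AUC', roc_auc_score(test_labels, prob[:, 1]))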