In [1]:
import numpy, pandas
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score
sig_data = pandas.read_csv('toy_datasets/toyMC_sig_mass.csv', sep='\t')
bck_data = pandas.read_csv('toy_datasets/toyMC_bck_mass.csv', sep='\t')
labels = numpy.array([1] * len(sig_data) + [0] * len(bck_data))
data = pandas.concat([sig_data, bck_data])
In [2]:
data[:5]
Out[2]:
In [3]:
# Get train and test data
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.5)
All classifiers inherit from sklearn.BaseEstimator and have the following methods:
classifier.fit(X, y, sample_weight=None) - train the classifier
classifier.predict_proba(X) - return a matrix of class probabilities
classifier.predict(X) - return predicted labels
classifier.staged_predict_proba(X) - return probabilities after each boosting iteration (not supported by TMVA)
classifier.get_feature_importances() - return importances of the features used in training
Here X denotes a matrix with data of shape [n_samples, n_features], y is a vector of labels (0 or 1) of shape [n_samples], and sample_weight is a vector of weights.
X should be* a pandas.DataFrame, not a numpy.array. Provided this, you can choose the features used in training by setting e.g. features=['FlightTime', 'p'] in the constructor (see the sketch just below).
* it works fine with a numpy.array as well, but in that case all the features will be used.
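For illustration, a minimal sketch of restricting training to two columns; the column names 'FlightDistance' and 'pt' are taken from the variable list defined in the next cell, and the imports are the same ones used later in this notebook:
# Sketch: only the two listed columns of the DataFrame are used for training.
from rep.estimators import SklearnClassifier
from sklearn.ensemble import GradientBoostingClassifier

clf = SklearnClassifier(GradientBoostingClassifier(), features=['FlightDistance', 'pt'])
clf.fit(train_data, train_labels)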
In [4]:
variables = ["FlightDistance", "FlightDistanceError", "IP", "VertexChi2", "pt", "p0_pt", "p1_pt", "p2_pt", "LifeTime", "dira"]
In [5]:
from rep.estimators import SklearnClassifier
from sklearn.ensemble import GradientBoostingClassifier
# Using gradient boosting with default settings
sk = SklearnClassifier(GradientBoostingClassifier(), features=variables)
# Training classifier
sk.fit(train_data, train_labels)
print('training complete')
In [6]:
# predict probabilities for each class
prob = sk.predict_proba(test_data)
print(prob)
In [7]:
print('ROC AUC', roc_auc_score(test_labels, prob[:, 1]))
In [8]:
sk.predict(test_data)
Out[8]:
In [9]:
sk.get_feature_importances()
Out[9]:
TMVAClassifier wraps classifiers from TMVA (the CERN library for machine learning).
Parameters:
-----------
:param str method: algorithm method (default='kBDT')
:param list(str) | None features: features used in training
:param str factory_options: default="!V:!Silent:Color:Transformations=I;D;P;G,D:AnalysisType=Classification"
:param kwargs: other parameters passed as key=value.
TMVA doesn't support staged_predict_proba
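As a sketch, extra TMVA options are passed as key=value keyword arguments; the BDT option names below (NTrees, Shrinkage, MaxDepth) are standard TMVA options shown only for illustration, with untuned values:
# Illustrative constructor call; factory_options repeats the default documented above.
from rep.estimators import TMVAClassifier

tmva_sketch = TMVAClassifier(
    method='kBDT',
    features=variables,
    factory_options="!V:!Silent:Color:Transformations=I;D;P;G,D:AnalysisType=Classification",
    NTrees=100, Shrinkage=0.1, MaxDepth=3)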
In [10]:
from rep.estimators.tmva import TMVAClassifier, TMVARegressor
In [11]:
from rep.estimators import TMVAClassifier
In [12]:
tmva = TMVAClassifier(method='kBDT', NTrees=50, Shrinkage=0.05, features=variables)
tmva.fit(train_data, train_labels)
print('training complete')
In [13]:
# predict probabilities for each class
prob = tmva.predict_proba(test_data)
print(prob)
In [14]:
print('AUC', roc_auc_score(test_labels, prob[:, 1]))
In [15]:
# predict labels
tmva.predict(test_data)
Out[15]:
XGBoost is an open-source library providing fast gradient boosting over decision trees.
Parameters:
-----------
:param list(str) features: list of features to train model
:param n_estimators: the number of boosting rounds.
:param nthreads: number of parallel threads used to run xgboost.
:param num_feature: feature dimension used in boosting, set to maximum dimension of the feature
(set automatically by xgboost, no need to be set by user).
:param gamma: minimum loss reduction required to make a further partition on a leaf node of the tree.
The larger, the more conservative the algorithm will be.
:param eta: step size shrinkage used in updates to prevent overfitting.
After each boosting step we can directly get the weights of new features,
and eta shrinks the feature weights to make the boosting process more conservative.
:param max_depth: maximum depth of a tree.
:param scale_pos_weight: ratio of the weights of class 1 to the weights of class 0.
:param min_child_weight: minimum sum of instance weight(hessian) needed in a child.
If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight,
then the building process will give up further partitioning.
In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node.
The larger, the more conservative the algorithm will be.
:param subsample: subsample ratio of the training instances.
Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees,
which helps prevent overfitting.
:param colsample: subsample ratio of columns when constructing each tree.
:param objective: specify the learning task and the corresponding learning objective, and the options are below:
"reg:linear" --linear regression
"reg:logistic" --logistic regression
"binary:logistic" --logistic regression for binary classification, output probability
"binary:logitraw" --logistic regression for binary classification, output score before logistic transformation
"multi:softmax" --set XGBoost to do multiclass classification using the softmax objective, you also need to
set num_class(number of classes)
"multi:softprob" --same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata,
nclass matrix. The result contains predicted probability of each data point belonging to each class.
"rank:pairwise" --set XGBoost to do ranking task by minimizing the pairwise loss
:param base_score: the initial prediction score of all instances, global bias.
:param random_state: random number seed.
:param verbose: if 1, will print messages during training
:param missing: the number considered by xgboost as missing value.
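For example, a sketch of a non-default configuration using several of the parameters listed above; the values are illustrative, not tuned for this dataset:
# Illustrative parameter values only.
from rep.estimators import XGBoostClassifier

xgb_sketch = XGBoostClassifier(
    features=variables,
    n_estimators=100,   # boosting rounds
    eta=0.1,            # step size shrinkage
    max_depth=4,
    subsample=0.5,      # each tree sees half of the training instances
    colsample=0.8,      # fraction of columns used per tree
    nthreads=4)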
In [16]:
from rep.estimators import XGBoostClassifier
# XGBoost with default parameters
xgb = XGBoostClassifier(features=variables)
xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
print('training complete')
In [17]:
prob = xgb.predict_proba(test_data)
print('ROC AUC:', roc_auc_score(test_labels, prob[:, 1]))
In [18]:
xgb.predict(test_data)
Out[18]:
In [19]:
xgb.get_feature_importances()
Out[19]:
As one can see above, all the classifiers implement the same interface. This simplifies working with them and comparing different classifiers, but it is not the only benefit.
Sklearn provides different tools to combine classifiers and transformers.
One of these tools is AdaBoost, a meta-classifier built on top of some other classifier (usually a decision tree).
Let's show that now you can run AdaBoost over classifiers from other libraries!
(isn't boosting over a neural network what you have been dreaming of all your life?)
In [20]:
from sklearn.ensemble import AdaBoostClassifier
# Construct AdaBoost with TMVA as base estimator
base_tmva = TMVAClassifier(method='kBDT', NTrees=15, Shrinkage=0.05)
ada_tmva = SklearnClassifier(AdaBoostClassifier(base_estimator=base_tmva, n_estimators=5), features=variables)
ada_tmva.fit(train_data, train_labels)
print('training complete')
In [21]:
prob = ada_tmva.predict_proba(test_data)
print('AUC', roc_auc_score(test_labels, prob[:, 1]))
In [22]:
# Construct AdaBoost with xgboost base estimator
base_xgb = XGBoostClassifier(n_estimators=50)
# ada_xgb = SklearnClassifier(AdaBoostClassifier(base_estimator=base_xgb, n_estimators=1), features=variables)
ada_xgb = AdaBoostClassifier(base_estimator=base_xgb, n_estimators=1)
ada_xgb.fit(train_data[variables], train_labels)
print('training complete!')
In [23]:
# predict probabilities for each class
prob = ada_xgb.predict_proba(test_data[variables])
print('AUC', roc_auc_score(test_labels, prob[:, 1]))
In [24]:
# predict probabilities for each class on the training sample (to compare with the test score above)
prob = ada_xgb.predict_proba(train_data[variables])
print('AUC', roc_auc_score(train_labels, prob[:, 1]))
PyBrainClassifier wraps classifiers from PyBrain. PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library.
Parameters:
-----------
:param features: features used in training.
:type features: list[str] or None
:param epochs: number of iterations of training; if < 0 then classifier trains until convergence.
:param scaler: scaler used to transform data; default is StandardScaler.
:type scaler: transformer from sklearn.preprocessing or None
:param boolean use_rprop: flag to indicate whether we should use rprop trainer.
net parameters:
:param list layers: indicate how many neurons are in each hidden layer; default is one hidden layer with 10 neurons.
:param hiddenclass: classes of the hidden layers; default is 'SigmoidLayer'.
:type hiddenclass: list of str
:param params: other net parameters:
:param boolean bias and outputbias: flags to indicate whether the network should have the corresponding biases;
both default to True.
:param boolean peepholes.
:param boolean recurrent: if the `recurrent` flag is set, a :class:`RecurrentNetwork` will be created,
otherwise a :class:`FeedForwardNetwork`.
:type params: dict
trainer parameters:
:param float learningrate: the rate at which parameters are changed in the direction of the gradient.
:param float lrdecay: the learning rate decay; the learning rate is multiplied by lrdecay after each training step.
:param float momentum: the ratio by which the gradient of the last timestep is used.
:param boolean verbose.
:param boolean batchlearning: if batchlearning is set, the parameters are updated only at the end of each epoch. Default is False.
:param float weightdecay: corresponds to the weightdecay rate, where 0 is no weight decay at all.
rprop parameters:
:param float etaminus: factor by which step width is decreased when overstepping (0.5).
:param float etaplus: factor by which step width is increased when following gradient (1.2).
:param float delta: step width for each weight.
:param float deltamin: minimum step width (1e-6).
:param float deltamax: maximum step width (5.0).
:param float delta0: initial step width (0.1).
trainUntilConvergence parameters:
:param int max_epochs: if given, at most that many epochs are trained.
:param int continue_epochs: each time validation error hits a minimum, try for continue_epochs epochs to find a better one.
:param float validation_proportion: the ratio of the dataset that is used for the validation dataset.
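For example, a sketch combining several of the net and trainer parameters above; the hidden-layer class names and parameter values are illustrative, not tuned:
# Illustrative configuration: two hidden layers trained with rprop.
from rep.estimators import PyBrainClassifier
from sklearn.preprocessing import MinMaxScaler

pb_sketch = PyBrainClassifier(
    layers=[20, 10],                        # two hidden layers
    hiddenclass=['TanhLayer', 'TanhLayer'],
    epochs=20,
    use_rprop=True,                         # use the rprop trainer
    etaminus=0.5, etaplus=1.2,              # rprop step-width factors
    scaler=MinMaxScaler())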
In [2]:
import pandas
import numpy
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score
sig_data = pandas.read_csv('toy_datasets/toyMC_sig_mass.csv', sep='\t')
bck_data = pandas.read_csv('toy_datasets/toyMC_bck_mass.csv', sep='\t')
labels = numpy.array([1] * len(sig_data) + [0] * len(bck_data))
data = pandas.concat([sig_data, bck_data])
variables = ["FlightDistance", "FlightDistanceError", "IP", "VertexChi2", "pt", "p0_pt", "p1_pt", "p2_pt", "LifeTime", "dira"]
In [3]:
X_train, X_test, y_train, y_test = train_test_split(data[variables].values, labels, train_size=0.5, random_state=384)
print(X_train.shape, X_test.shape, len(y_train), len(y_test))
In [5]:
from rep.estimators.pybrain import PyBrainClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Binarizer
pb = PyBrainClassifier(layers=[10, 20], epochs=10, verbose=True, scaler=StandardScaler())
In [7]:
%%time
pb.fit(X_train, y_train)
Out[7]:
In [8]:
prob = pb.predict_proba(X_test)
prob
Out[8]:
In [9]:
y_pred = pb.predict(X_test)
In [10]:
from sklearn.metrics import zero_one_loss
from sklearn.metrics import classification_report
print("Zero-one loss:", zero_one_loss(y_test, y_pred))
print("Classification report:")
print(classification_report(y_test, y_pred))
There are many things you can do with classifiers now: for instance, combine them with transformers into pipelines (sklearn.pipeline). And you can replace classifiers at any moment.
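For instance, a minimal sketch of an sklearn Pipeline chaining a scaler with a classifier; any of the estimators shown above could take the classifier's place:
# Sketch: preprocessing and classification combined into one estimator.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', GradientBoostingClassifier())])
pipe.fit(X_train, y_train)
print('pipeline ROC AUC:', roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1]))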