This notebook is devoted to uniformGradientBoosting, which is gradient boosting over trees with a custom loss function:
$\text{loss} = \sum_i w_i \exp \left[- \sum_j a_{ij} \textrm{score}_j y_j \right] $
Here $y_j \in \{+1, -1\}$ and $\textrm{score}_j \in \mathbb{R}$ are the true class and the score prediction of the $j$-th event in the train set.
The weights $w_i$ are all ones so far; the main problem is to choose an appropriate matrix $a_{ij}$, because there are plenty of variants.
If we take $a_{ij}$ to be the identity matrix, this is simply the AdaBoost loss.
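For reference, a minimal numpy sketch of evaluating this loss for an arbitrary matrix $a_{ij}$ (toy data; this is not the actual uniformgradientboosting implementation):
import numpy
y = numpy.array([+1, -1, +1, +1, -1])              # true classes of 5 toy events
score = numpy.array([0.3, -0.1, 1.2, -0.4, 0.8])   # current real-valued predictions
w = numpy.ones(len(y))                             # event weights, all ones so far
A = numpy.eye(len(y))                              # identity matrix -> plain AdaBoost exponential loss
# loss = sum_i w_i * exp(- sum_j a_ij * score_j * y_j)
loss = numpy.sum(w * numpy.exp(-A.dot(score * y)))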
SimpleKnnLossFunction(knn) is a particular case: in each row we put ones in the columns of the knn closest events of the same class and zeros everywhere else.
The matrix is square; if we take knn=1, this is the same as the AdaBoost loss.
PairwiseKnnLossFunction(knn): we take knn neighbours for each event and create a separate row in the matrix for each pair of neighbouring events, placing ones in the columns corresponding to the two events (thus each row contains only two 1's). This one gives poor uniformity and doesn't seem to have any advantages. If knn=1, it is equivalent to the AdaBoost loss too.
RandomKnnLossFunction(nrows, knn, knnfactor=3): the resulting A matrix will have nrows rows, each of them generated as follows:
we take a random event from the train dataset, pick knn of its knn * knnfactor closest neighbours at random, and place ones in the corresponding columns. Each row thus has knn 1's.
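For illustration, a rough sketch of how a SimpleKnn-style matrix could be built (assumptions: the uniform variables are passed as a plain numpy array, and the real SimpleKnnLossFunction may differ in details such as weighting or sparse storage):
import numpy
from sklearn.neighbors import NearestNeighbors
def build_simple_knn_matrix(uniform_features, labels, knn=20):
    # uniform_features: numpy array (n_events, n_uniform_vars); labels: class of each event
    # row i gets ones at the knn nearest (in the uniform variables) events of the same class
    n = len(labels)
    A = numpy.zeros((n, n))
    for cls in numpy.unique(labels):
        idx = numpy.where(labels == cls)[0]
        k = min(knn, len(idx))
        nn = NearestNeighbors(n_neighbors=k).fit(uniform_features[idx])
        _, neighbours = nn.kneighbors(uniform_features[idx])
        for row, neigh in zip(idx, neighbours):
            A[row, idx[neigh]] = 1.
    return A
With knn=1 every event is its own nearest neighbour, so A degenerates to the identity matrix and we recover the AdaBoost loss.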
To compare uniformity quantitatively, an MSE-based measure is used. I don't rely on it too much (though it seems to be an adequate measure), so plots are printed from time to time for comparison.
For some target efficiency the MSE variation is computed as follows:
given some target_efficiencies, we split all events into bins over the uniform variables and compute
$\text{mse}(\text{eff}) = \cfrac{1}{\text{n_bins} \times \text{particles}} \sum_{\text{bin}} (\text{mean_eff} - \text{bin_eff})^2 \times \text{particles_in_bin} $
To obtain a single measure of nonuniformity, we take the average of mse(eff) over several efficiencies (e.g. [0.6, 0.7, 0.8, 0.9]).
This kind of function is chosen because it is more or less independent of the number of bins and the number of events.
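A minimal sketch of this computation for one target efficiency, assuming the bin assignment over the uniform variables is already done (the function name and signature here are illustrative, not the actual reports API):
import numpy
def mse_for_efficiency(signal_probas, bin_indices, target_efficiency):
    # signal_probas: predicted signal probabilities of the signal events (numpy array)
    # bin_indices:   bin number (over the uniform variables) of each signal event (numpy array)
    # global cut that keeps roughly target_efficiency of all signal events
    cut = numpy.percentile(signal_probas, 100 * (1. - target_efficiency))
    passed = signal_probas > cut
    mean_eff = numpy.mean(passed)
    n_bins = int(bin_indices.max()) + 1
    result = 0.
    for b in range(n_bins):
        in_bin = (bin_indices == b)
        if not in_bin.any():
            continue
        bin_eff = numpy.mean(passed[in_bin])
        result += (mean_eff - bin_eff) ** 2 * in_bin.sum()
    return result / (n_bins * len(signal_probas))
# the reported measure is then the average of mse_for_efficiency(...) over e.g. [0.6, 0.7, 0.8, 0.9]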
In [1]:
import pandas, numpy
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from IPython.display import display_html
from collections import OrderedDict
import uniformgradientboosting as ugb
import commonutils as utils
import reports
from reports import ClassifiersDict
from uboost import uBoostBDT, uBoostClassifier
from supplementaryclassifiers import HidingClassifier
from config import ipc_profile
In [2]:
used_columns = ["Y1", "Y2", "Y3", "M2AB", "M2AC"]
signalDF = pandas.read_csv('datasets/dalitzplot/signal.csv', sep='\t', usecols=used_columns)
signal5e5DF = pandas.read_csv('datasets/dalitzplot/signal5e5.csv', sep='\t', usecols=used_columns)
bgDF = pandas.read_csv('datasets/dalitzplot/bkgd.csv', sep='\t', usecols=used_columns)
answers5e5 = numpy.ones(len(signal5e5DF))
assert set(signalDF.columns) == set(signal5e5DF.columns) == set(bgDF.columns), "columns are different"
In [3]:
def plotDistribution2D(var_name1, var_name2, data_frame, bins=40):
    """The function to plot 2D distribution histograms"""
    H, x, y = pylab.histogram2d(data_frame[var_name1], data_frame[var_name2], bins=bins)
    pylab.xlabel(var_name1)
    pylab.ylabel(var_name2)
    pylab.pcolor(x, y, H.T, cmap=cm.Blues)  # transpose so that var_name1 runs along the x axis
    pylab.colorbar()
pylab.figure(figsize=(18, 6))
subplot(1, 3, 1), pylab.title("signal"), plotDistribution2D("M2AB", "M2AC", signalDF)
subplot(1, 3, 2), pylab.title("background"), plotDistribution2D("M2AB", "M2AC", bgDF)
subplot(1, 3, 3), pylab.title("dense signal"), plotDistribution2D("M2AB", "M2AC", signal5e5DF)
pass
In [13]:
def smallReport(classifiers, roc_stages=[50, 100], mse_stages=[100], parallelize=True):
    used_ipc = ipc_profile if parallelize else None
    test_preds = classifiers.fit(trainX, trainY, ipc_profile=used_ipc).test_on(testX, testY, low_memory=True)
    pylab.figure(figsize=(17, 7))
    pylab.subplot(121), pylab.title('Learning curves'), test_preds.learning_curves()
    pylab.subplot(122), pylab.title('Staged MSE'), test_preds.mse_curves(uniform_variables)
    show()
    test_preds.roc(stages=roc_stages).show()
    classifiers.test_on(signal5e5DF, answers5e5, low_memory=True)\
        .efficiency(uniform_variables, stages=mse_stages, target_efficiencies=[0.7])
In [5]:
trainX, trainY, testX, testY = utils.splitOnTestAndTrain(signalDF, bgDF)
train_variables = ["Y1", "Y2", "Y3"]
uniform_variables = ["M2AB", "M2AC"]
AdaBoost, uBoost, and uniformGradientBoosting
uBoost shows quality that is at least no worse, with substantially better signal flatness, while being a little less uniform in the background.
The latter is no surprise, because uBoost was never designed to be flat in the background.
In [38]:
base_estimator = DecisionTreeClassifier(max_depth=4)
n_estimators = 150 + 1
var_classifiers = ClassifiersDict()
var_classifiers['AdaBoost'] = HidingClassifier(train_variables=train_variables,
base_estimator=AdaBoostClassifier(base_estimator=base_estimator, n_estimators=n_estimators))
knnloss1 = ugb.SimpleKnnLossFunction(uniform_variables, knn=20)
var_classifiers['unifGB20'] = ugb.MyGradientBoostingClassifier(loss=knnloss1, max_depth=4, n_estimators=n_estimators,
learning_rate=.5, train_variables=train_variables)
var_classifiers['uBoost'] = uBoostClassifier(uniform_variables=uniform_variables, base_estimator=base_estimator,
n_estimators=n_estimators, train_variables=train_variables, efficiency_steps=12)
flatness_loss = ugb.FlatnessLossFunction(uniform_variables, ada_coefficient=0.05, bins=13)
var_classifiers['uGB+FL'] = ugb.MyGradientBoostingClassifier(loss=flatness_loss, max_depth=4, n_estimators=n_estimators,
learning_rate=.4, train_variables=train_variables)
var_classifiers.fit(trainX, trainY, ipc_profile=ipc_profile)
pass
In [8]:
var_classifiers.test_on(testX, testY).learning_curves().show().roc(stages=[75, 150]).show()
Out[8]:
In [41]:
var_pred5e5 = var_classifiers.test_on(signal5e5DF, answers5e5)
var_pred5e5.mse_curves(uniform_variables).show().efficiency(uniform_variables, stages=[75, 150], target_efficiencies=[0.7])
Out[41]:
In [43]:
effs = [0.6, 0.7, 0.85]
for eff in effs:
    var_pred5e5.efficiency(uniform_variables, target_efficiencies=[eff]) \
        .print_mse(uniform_variables, stages=[100], efficiencies=[eff])
display_html("<b>After summing over efficiencies {0} </b>".format(effs), raw=True)
var_pred5e5.print_mse(uniform_variables, stages=[100], efficiencies=effs)
# display_html(reports.computeStagedMseVariation(answers5e5, signal5e5DF, uniform_variables, var_sig5e5_probas_dict,
# stages=[100], target_efficiencies=effs) )
Out[43]:
In [15]:
pred = var_classifiers.test_on(testX, testY)
pred.mse_curves(uniform_variables, on_signal=False)
Out[15]:
In [16]:
sknn_classifiers = ClassifiersDict()
for knn in [1, 5, 10, 20, 30, 60]:
    knnloss = ugb.SimpleKnnLossFunction(uniform_variables, knn=knn)
    sknn_classifiers["sknn=%i" % knn] = ugb.MyGradientBoostingClassifier(loss=knnloss, max_depth=4, n_estimators=201,
        learning_rate=.5, train_variables=train_variables)
smallReport(sknn_classifiers, roc_stages=[100, 200], mse_stages=[200])
In [17]:
sknn2_classifiers = ClassifiersDict()
for diagonal in [0, 1, 2]:
    knnloss = ugb.SimpleKnnLossFunction(uniform_variables, knn=25, diagonal=diagonal)
    sknn2_classifiers["diag=%i" % diagonal] = ugb.MyGradientBoostingClassifier(loss=knnloss, max_depth=4, n_estimators=101,
        learning_rate=.5, train_variables=train_variables)
smallReport(sknn2_classifiers)
In [18]:
pw_classifiers = ClassifiersDict()
for knn in [5, 15]:
    pw_loss = ugb.PairwiseKnnLossFunction(uniform_variables, knn=knn)
    pw_classifiers["pw_knn=%i" % knn] = ugb.MyGradientBoostingClassifier(loss=pw_loss, max_depth=4, n_estimators=101,
        learning_rate=.5, train_variables=train_variables)
smallReport(pw_classifiers)
In [19]:
rknn_classifiers = ClassifiersDict()
for knn in [1, 6, 10, 20, 30]:
    rknn_loss = ugb.RandomKnnLossFunction(uniform_variables, knn=knn, n_rows=len(trainX) * 3, large_preds_penalty=0.)
    rknn_classifiers["rknn=%i" % knn] = ugb.MyGradientBoostingClassifier(loss=rknn_loss, max_depth=4, n_estimators=101,
        learning_rate=.5, train_variables=train_variables)
smallReport(rknn_classifiers)
In [20]:
rknn2_classifiers = ClassifiersDict()
for factor in [0.5, 1, 2, 4, 8]:
    n_rows = int(factor * len(trainX))
    rknn2_loss = ugb.RandomKnnLossFunction(uniform_variables, knn=20, n_rows=n_rows)
    rknn2_classifiers["rknn2=%1.1f" % factor] = ugb.MyGradientBoostingClassifier(loss=rknn2_loss, max_depth=4, n_estimators=101,
        learning_rate=.5, train_variables=train_variables)
smallReport(rknn2_classifiers)
In [21]:
ss_classifiers = ClassifiersDict()
for subsample in [0.5, .7, 0.8, 1.]:
    knnloss = ugb.SimpleKnnLossFunction(uniform_variables, knn=20)
    ss_classifiers["subsample=%1.2f" % subsample] = ugb.MyGradientBoostingClassifier(loss=knnloss, max_depth=4, n_estimators=151,
        learning_rate=.5, train_variables=train_variables, subsample=subsample)
smallReport(ss_classifiers)
When we do not distinguish classes, the classifier does not tend towards uniformity: it only tries to make the signal probabilities greater than the background ones in every region (and that is all).
This may serve as a useful feature for later use in some other classifier, because it tries to give an equal difference in predictions over the Dalitz variables.
In [22]:
sknn3_classifiers = ClassifiersDict()
for distinguish_classes in [True, False]:
    knnloss = ugb.SimpleKnnLossFunction(uniform_variables, knn=25, distinguish_classes=distinguish_classes)
    sknn3_classifiers["dist=%s" % str(distinguish_classes)] = \
        ugb.MyGradientBoostingClassifier(loss=knnloss, max_depth=4, n_estimators=101,
            learning_rate=.5, train_variables=train_variables)
smallReport(sknn3_classifiers)
In [23]:
full_used_columns = ["M2AB", "M2AC", "Y1", "Y2", "Y3", "Y4", "XA", "XB", "XC"]
full_signalDF = pandas.read_csv('datasets/dalitzplot/signal.csv', sep='\t', usecols=full_used_columns)
full_signal5e5DF = pandas.read_csv('datasets/dalitzplot/signal5e5.csv', sep='\t', usecols=full_used_columns)
full_bgDF = pandas.read_csv('datasets/dalitzplot/bkgd.csv', sep='\t', usecols=full_used_columns)
# preparation of train/test
full_trainX, full_trainY, full_testX, full_testY = utils.splitOnTestAndTrain(full_signalDF, full_bgDF)
In [24]:
base_estimator = DecisionTreeClassifier(max_depth=4)
uniform_variables = ["M2AB", "M2AC"]
n_estimators = 101
full_train_variables = ["Y1", "Y2", "Y3", "Y4", "XA", "XB", "XC"]
full_ada_classifiers = ClassifiersDict()
for n_features in [3, 4, 5, 6]:
    full_ada_classifiers['Ada_Feat=%i' % n_features] = HidingClassifier(train_variables=full_train_variables[:n_features],
        base_estimator=AdaBoostClassifier(base_estimator=base_estimator, n_estimators=n_estimators))
full_preds = full_ada_classifiers.fit(full_trainX, full_trainY, ipc_profile=ipc_profile).test_on(full_testX, full_testY)
figure(figsize=(17, 7))
subplot(121), full_preds.learning_curves()
subplot(122), full_preds.mse_curves(uniform_variables)
Out[24]:
In [25]:
n_estimators = 101
full_classifiers = ClassifiersDict()
full4_train_vars = full_train_variables[:4]
full_classifiers['AdaBoost'] = HidingClassifier(train_variables=full4_train_vars,
base_estimator=AdaBoostClassifier(base_estimator=base_estimator, n_estimators=n_estimators))
knnloss1 = ugb.SimpleKnnLossFunction(uniform_variables, knn=10)
full_classifiers['unifGB'] = ugb.MyGradientBoostingClassifier(loss=knnloss1, max_depth=4, n_estimators=n_estimators,
learning_rate=.5, train_variables=full4_train_vars)
full_classifiers['uBoost'] = uBoostClassifier(uniform_variables=uniform_variables, base_estimator=base_estimator,
n_estimators=n_estimators, train_variables=full4_train_vars,
efficiency_steps=12, ipc_profile=ipc_profile)
flatness_loss1 = ugb.FlatnessLossFunction(uniform_variables, ada_coefficient=0.05, bins=17)
full_classifiers['uGB+FL'] = ugb.MyGradientBoostingClassifier(loss=flatness_loss1, max_depth=4, n_estimators=n_estimators,
learning_rate=0.4, train_variables=full4_train_vars)
full_preds = full_classifiers.fit(full_trainX, full_trainY, ipc_profile=ipc_profile).test_on(full_testX, full_testY)
figure(figsize=(17, 7))
subplot(121), full_preds.learning_curves()
subplot(122), full_preds.mse_curves(uniform_variables)
Out[25]:
The square A matrix is constructed by taking $$a_{ij} = \begin{cases} 0, & \text{if class}_i \neq \text{class}_j \\ 0, & \text{if the j-th event is not among the knn of the i-th event}\\ f(r), & \text{otherwise, where $r$ is distance(i,j)} \end{cases}$$
$f(r)$ is the function we are free to choose.
The results of these experiments are too unstable and not reliable:
after rebuilding (or reshuffling the datasets), the results may change significantly.
More experiments are needed here (or better, a good theoretical argument for why some function should be preferred); there are too many things to play with.
NB: clip(x, a, b) is a function that bounds x between a and b (to avoid singularities from log, 1/r and so on).
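To make the role of f(r) and clip concrete, here is a tiny illustration of how one row of A could be filled from the distances to an event's knn same-class neighbours (illustrative only; whether DistanceBasedKnnFunction normalizes rows exactly like this is an assumption):
import numpy
# toy distances from one event to its knn nearest same-class neighbours
r = numpy.array([0.002, 0.005, 0.01, 0.05, 0.2])
row = numpy.clip(1. / r, 0, 100)      # f(r) = 1/r, clipped to avoid huge weights at tiny distances
row_normalized = row / row.sum()      # roughly what row_normalize=True should correspond to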
In [26]:
functions = {'exp': lambda r: numpy.exp(-50*r),
'exp2': lambda r: numpy.exp(-1000*r*r),
'exp3': lambda r: numpy.exp(-2000*r*r),
'log': lambda r: numpy.clip(-numpy.log(r), 0, 7),
'1/r': lambda r: numpy.clip(1/r, 0, 100),
'1/sqrt(r)': lambda r: numpy.clip(r ** -0.5, 0, 10)
}
dist_classifiers = ClassifiersDict()
for name, func in functions.iteritems():
    loss = ugb.DistanceBasedKnnFunction(uniform_variables, knn=150, distance_dependence=func, row_normalize=False)
    dist_classifiers[name] = \
        ugb.MyGradientBoostingClassifier(loss=loss, max_depth=4, n_estimators=101,
            learning_rate=.5, train_variables=train_variables)
smallReport(dist_classifiers, parallelize=False)
In [27]:
dist2_classifiers = ClassifiersDict()
for name, func in functions.iteritems():
    loss = ugb.DistanceBasedKnnFunction(uniform_variables, knn=150, distance_dependence=func, row_normalize=True)
    dist2_classifiers[name + "+norm"] = \
        ugb.MyGradientBoostingClassifier(loss=loss, max_depth=4, n_estimators=101,
            learning_rate=.5, train_variables=train_variables)
smallReport(dist2_classifiers, parallelize=False)
In [28]:
functions = {'exp': lambda r: numpy.exp(-50*r),
'exp2': lambda r: numpy.exp(-1000*r*r),
'exp3': lambda r: numpy.exp(-2000*r*r),
'log': lambda r: numpy.clip(-numpy.log(r), 0, 7),
'1/r': lambda r: numpy.clip(1/r, 0, 100),
'1/sqrt(r)': lambda r: numpy.clip(r ** -0.5, 0, 10)
}
dist3_classifiers = ClassifiersDict()
for name, func in functions.iteritems():
    loss = ugb.DistanceBasedKnnFunction(uniform_variables, knn=100, distance_dependence=func, row_normalize=True)
    dist3_classifiers[name] = \
        ugb.MyGradientBoostingClassifier(loss=loss, max_depth=4, n_estimators=101,
            learning_rate=.5, train_variables=train_variables)
smallReport(dist3_classifiers, parallelize=False)
The average distance to the n-th neighbour
just to get a sense of the real scale
In [29]:
from sklearn.neighbors import NearestNeighbors
data = trainX[trainY > 0.5]
knn = 100
r, inds = NearestNeighbors(n_neighbors=knn).fit(data).kneighbors(data)
plot(numpy.arange(knn), r.mean(axis=0))
Out[29]:
In [30]:
dist4_classifiers = ClassifiersDict()
for knn in [1, 10, 20, 30]:
    loss = ugb.DistanceBasedKnnFunction(uniform_variables, knn=knn, distance_dependence=lambda r: (r + 1e5)**0, row_normalize=True)
    dist4_classifiers['knn=%i' % knn] = ugb.MyGradientBoostingClassifier(loss=loss, max_depth=4, n_estimators=101,
        learning_rate=.5, train_variables=train_variables)
smallReport(dist4_classifiers, parallelize=False)
In [31]:
fl_classifiers = ClassifiersDict()
knn_loss = ugb.SimpleKnnLossFunction(uniform_variables, knn=25)
fl_classifiers["sknn25"] = ugb.MyGradientBoostingClassifier(loss=knn_loss, max_depth=4, n_estimators=101,
learning_rate=.5, train_variables = train_variables)
for learning_rate in [0.1, 0.2, 0.4, 0.6]:
    flatness_loss = ugb.FlatnessLossFunction(uniform_variables, ada_coefficient=0.05, bins=13)
    fl_classifiers["fl+lr=%1.3f" % learning_rate] = ugb.MyGradientBoostingClassifier(loss=flatness_loss, max_depth=4,
        n_estimators=101, learning_rate=learning_rate, train_variables=train_variables)
smallReport(fl_classifiers, parallelize=False)
Different 'ada_coefficient'
In general, the loss is FlatnessLoss + AdaCoeff * AdaLoss.
The greater the ada_coefficient, the more we tend to minimize the AdaLoss (quality) rather than the FlatnessLoss;
this coefficient serves as a kind of tradeoff parameter between flatness and quality.
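Written out with the notation from the top of the notebook (a paraphrase of the combined objective, not the exact implementation):
$$\text{loss} = \text{FlatnessLoss} + \text{AdaCoeff} \times \sum_i w_i \exp(-\,\text{score}_i \, y_i)$$
so small values of ada_coefficient emphasize flatness, while large values push the optimization towards the usual exponential (quality) loss.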
In [32]:
fl2_classifiers = ClassifiersDict()
knn_loss = ugb.SimpleKnnLossFunction(uniform_variables, knn=25)
fl2_classifiers["sknn25"] = ugb.MyGradientBoostingClassifier(loss=knn_loss, max_depth=4, n_estimators=151,
learning_rate=.5, train_variables = train_variables)
for ada_coeff in [0.01, 0.02, 0.05, 0.1, 0.2]:
    flatness_loss = ugb.FlatnessLossFunction(uniform_variables, ada_coefficient=ada_coeff, bins=23)
    fl2_classifiers["fl2=%1.2f" % ada_coeff] = ugb.MyGradientBoostingClassifier(loss=flatness_loss, max_depth=4,
        n_estimators=151, learning_rate=0.2, train_variables=train_variables)
smallReport(fl2_classifiers)
In [33]:
fl3_classifiers = ClassifiersDict()
knn_loss = ugb.SimpleKnnLossFunction(uniform_variables, knn=25)
fl3_classifiers["sknn25"] = ugb.MyGradientBoostingClassifier(loss=knn_loss, max_depth=4, n_estimators=151,
learning_rate=.5, train_variables = train_variables)
for bins in [8, 15, 25]:
    flatness_loss = ugb.FlatnessLossFunction(uniform_variables, ada_coefficient=0.1, bins=bins)
    fl3_classifiers["fl3=%i" % bins] = ugb.MyGradientBoostingClassifier(loss=flatness_loss, max_depth=4,
        n_estimators=151, learning_rate=0.2, train_variables=train_variables)
smallReport(fl3_classifiers)
In [34]:
fl4_classifiers = ClassifiersDict()
for power in [1., 1.5, 2., 3.]:
    flatness_loss = ugb.FlatnessLossFunction(uniform_variables, ada_coefficient=0.1, power=power)
    fl4_classifiers["fl4=%.1f" % power] = ugb.MyGradientBoostingClassifier(loss=flatness_loss, max_depth=4,
        n_estimators=151, learning_rate=0.2, train_variables=train_variables)
smallReport(fl4_classifiers)
In [45]:
fl4_classifiers.test_on(testX, testY).mse_curves(uniform_variables, on_signal=False)
Out[45]:
In [ ]: