In [15]:
%matplotlib inline
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Fetch the data and load it into pandas


In [2]:
data = pd.read_csv("training.csv")
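
A quick sanity check on the loaded data (shape and label balance; output not shown here):


In [ ]:
print(data.shape)
print(data['Label'].value_counts())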

Prepare the input for scikit-learn and split into train and test sets


In [3]:
X = data.drop(['EventId', 'Weight', 'Label'], axis=1).values
y = data['Label'].values
w = data['Weight'].values
s_weights = w.sum()              # total weight of the full training set
s_s_weights = w[y == 's'].sum()  # total signal weight
s_b_weights = w[y == 'b'].sum()  # total background weight

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, w, test_size=0.2, random_state=0)

Train model


In [5]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2), n_estimators=100)

In [6]:
# balanced classification: rescale the weights so that the signal and
# background classes carry comparable total weight during training
def balance_weights(w, y):
    w_balanced = w.copy()  # keep the original weights untouched for the AMS
    w_balanced[y == 's'] /= s_s_weights
    w_balanced[y == 'b'] /= s_b_weights
    return w_balanced
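
As an illustrative check, after balancing the two classes should carry comparable total weight in the training split (roughly 0.8 each, since the class sums were computed on the full data set and about 80% of it went into training):


In [ ]:
wb = balance_weights(w_train, y_train)
print(wb[y_train == 's'].sum(), wb[y_train == 'b'].sum())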

In [7]:
%%time
_ = clf.fit(X_train, y_train, balance_weights(w_train, y_train))


CPU times: user 2min 18s, sys: 731 ms, total: 2min 19s
Wall time: 2min 20s

In [8]:
from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred, sample_weight=balance_weights(w_test, y_test))


Out[8]:
0.84946835409303012

Optimizing the AMS on the held-out validation set

The Approximate Median Significance \begin{equation*} \text{AMS} = \sqrt{ 2 \left( (s + b + 10) \ln \left( 1 + \frac{s}{b + 10} \right) - s \right) } \end{equation*} where $s$ and $b$ are the sums of signal and background weights, respectively, in the selection region.


In [9]:
def AMS(s, b):
    # Approximate Median Significance with regularization term bReg = 10
    assert s >= 0
    assert b >= 0
    bReg = 10.
    return math.sqrt(2. * ((s + b + bReg) *
                           math.log(1. + s / (b + bReg)) - s))
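
A quick check of the regularized formula at the boundaries (illustrative numbers): with $s = 0$ the AMS is exactly 0, and the bReg term keeps it finite even when $b = 0$.


In [ ]:
print(AMS(0., 100.), AMS(100., 0.))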

In [10]:
y_pred_proba = clf.predict_proba(X_test)

Sorting the indices in increasing order of the scores (a higher score means more signal-like).


In [11]:
tiis = y_pred_proba[:, 1].argsort()

Weights have to be normalized to the same sum as in the full set.


In [12]:
w_factor = float(len(data)) / len(X_test)
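
As a rough check (the split is random, not stratified on weight), rescaling the held-out weights by w_factor should land close to the full-set total:


In [ ]:
print(w_factor * w_test.sum(), w.sum())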

Initializing $s$ and $b$ to the full sums of signal and background weights, we start with all points in the selection region.


In [13]:
s = w_test[y_test == 's'].sum()
b = w_test[y_test == 'b'].sum()

amss will contain the AMS after each point is moved out of the selection region while scanning the sorted validation set. ams_max will hold the best validation AMS, and threshold the smallest score among the selected points. We do len(tiis) iterations, so amss[-1] is the AMS when only the point with the highest score is selected. (A vectorized version of the same scan is sketched after the loop.)


In [16]:
amss = np.empty([len(tiis)])
ams_max = 0
threshold = 0.0
for ti in range(len(tiis)):
    # don't forget to renormalize the weights to the same sum 
    # as in the complete training set
    amss[ti] = AMS(max(0, s * w_factor), max(0, b * w_factor))
    if amss[ti] > ams_max:
        ams_max = amss[ti]
        threshold = y_pred_proba[tiis[ti], 1]
        # print(ti, threshold)
    if y_test[tiis[ti]] == 's':
        s -= w_test[tiis[ti]]
    else:
        b -= w_test[tiis[ti]]
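
The same scan can be written without the explicit Python loop. This is a sketch of an equivalent vectorized formulation (not part of the original pipeline); it reuses tiis, w_test, y_test and w_factor from above and should reproduce amss up to floating-point noise.


In [ ]:
# recompute the per-class totals (the loop above consumed s and b)
s_tot = w_test[y_test == 's'].sum()
b_tot = w_test[y_test == 'b'].sum()

w_sorted = w_test[tiis]
is_s = (y_test[tiis] == 's')

# signal/background weight removed from the selection region *before* each scan position
s_removed = np.concatenate(([0.], np.cumsum(w_sorted * is_s)))[:-1]
b_removed = np.concatenate(([0.], np.cumsum(w_sorted * ~is_s)))[:-1]

# remaining (selected) weight, rescaled to the full-set totals
s_rem = np.maximum(0., (s_tot - s_removed) * w_factor)
b_rem = np.maximum(0., (b_tot - b_removed) * w_factor)

amss_vec = np.sqrt(2. * ((s_rem + b_rem + 10.) * np.log1p(s_rem / (b_rem + 10.)) - s_rem))
# amss_vec.max() should match ams_max from the loop
print(np.allclose(amss_vec, amss), amss_vec.max())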

In [17]:
ams_max


Out[17]:
3.4132203331173847

In [18]:
threshold


Out[18]:
0.50664804836534294

In [19]:
plt.plot(amss)


Out[19]:
[<matplotlib.lines.Line2D at 0x114f52cd0>]

Produce Kaggle submission


In [20]:
test = pd.read_csv("test.csv")

In [21]:
scores = clf.predict_proba(test.drop('EventId', axis=1).values)

In [22]:
test['RankOrder'] = scores[:, 1].argsort().argsort() + 1  # double argsort: 1-based rank of each score
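
A tiny illustration of the double-argsort trick on made-up scores; the second argsort turns sorting positions into ranks, so rank 1 goes to the lowest score:


In [ ]:
demo = np.array([0.3, 0.9, 0.1])
print(demo.argsort().argsort() + 1)  # -> [2 3 1]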

In [23]:
test['Class'] = ['b' if scores[i, 1] < threshold else 's' for i in range(len(scores))]
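
The same threshold cut can be written in vectorized form with np.where (a stylistic alternative, same result):


In [ ]:
test['Class'] = np.where(scores[:, 1] < threshold, 'b', 's')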

In [24]:
test.loc[:, ['EventId', 'RankOrder', 'Class']].to_csv("submission.csv", index=False)