Starting kit for the Higgs boson machine learning challenge

This notebook contains a starting kit for the Higgs boson machine learning challenge. Download the training set (training.csv) and the test set (test.csv) into the working directory, then execute the cells in order.


In [2]:
%matplotlib inline

import random
import string
import math
import csv

import numpy as np
import matplotlib.pyplot as plt
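
Optionally, fixing the random seeds here makes the feature perturbation and the train/validation split below reproducible. This cell is not part of the original kit and the seed value 42 is an arbitrary choice:


In [ ]:
# Optional: seed both generators used below, for reproducibility.
random.seed(42)
np.random.seed(42)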

Reading and formatting training data


In [3]:
allRows = list(csv.reader(open("training.csv", "rt"), delimiter=','))

Slicing off header row and id, weight, and label columns.


In [4]:
xs = np.array([list(map(float, row[1:-2])) for row in allRows[1:]])
(numPoints,numFeatures) = xs.shape

Perturbing the features with tiny Gaussian noise to avoid ties, which would complicate the equal-frequency binning below. It's far from optimal, but it makes life easier in this simple example.


In [5]:
xs = np.add(xs, np.random.normal(0.0, 0.0001, xs.shape))

Label selectors.


In [6]:
sSelector = np.array([row[-1] == 's' for row in allRows[1:]])
bSelector = np.array([row[-1] == 'b' for row in allRows[1:]])

Weights and weight sums.


In [7]:
weights = np.array([float(row[-2]) for row in allRows[1:]])
sumWeights = np.sum(weights)
sumSWeights = np.sum(weights[sSelector])
sumBWeights = np.sum(weights[bSelector])

Training and validation cuts

We will train a classifier on a random 90% subset of the training points, minimizing the weighted error with balanced weights, then maximize the AMS on the held-out 10% validation set.


In [8]:
randomPermutation = random.sample(range(len(xs)), len(xs))
numPointsTrain = int(numPoints*0.9)
numPointsValidation = numPoints - numPointsTrain

xsTrain = xs[randomPermutation[:numPointsTrain]]
xsValidation = xs[randomPermutation[numPointsTrain:]]

sSelectorTrain = sSelector[randomPermutation[:numPointsTrain]]
bSelectorTrain = bSelector[randomPermutation[:numPointsTrain]]
sSelectorValidation = sSelector[randomPermutation[numPointsTrain:]]
bSelectorValidation = bSelector[randomPermutation[numPointsTrain:]]

weightsTrain = weights[randomPermutation[:numPointsTrain]]
weightsValidation = weights[randomPermutation[numPointsTrain:]]

sumWeightsTrain = np.sum(weightsTrain)
sumSWeightsTrain = np.sum(weightsTrain[sSelectorTrain])
sumBWeightsTrain = np.sum(weightsTrain[bSelectorTrain])

In [9]:
xsTrainTranspose = xsTrain.transpose()

Making signal and background weights sum to $1/2$ each to emulate uniform priors $p(s)=p(b)=1/2$.


In [10]:
weightsBalancedTrain = np.array([0.5 * weightsTrain[i]/sumSWeightsTrain
                                 if sSelectorTrain[i]
                                 else 0.5 * weightsTrain[i]/sumBWeightsTrain
                                 for i in range(numPointsTrain)])
weightsBalancedTrain.sum()


Out[10]:
1.0000000000001312
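
The same balanced weights can be computed in one vectorized step with np.where (equivalent to the list comprehension above):


In [ ]:
# Vectorized equivalent of the cell above.
weightsBalancedTrain = np.where(sSelectorTrain,
                                0.5 * weightsTrain / sumSWeightsTrain,
                                0.5 * weightsTrain / sumBWeightsTrain)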

Training naive Bayes and defining the score function

Number of bins per dimension for binned naive Bayes.


In [45]:
numBins = 10

logPs[fI, bI] will be the log probability that a data point x with binMaxs[fI, bI-1] < x[fI] <= binMaxs[fI, bI] (with binMaxs[fI, -1] = $-\infty$ by convention) is a signal, under uniform priors $p(\text{s}) = p(\text{b}) = 1/2$.


In [46]:
logPs = np.empty([numFeatures, numBins])
binMaxs = np.empty([numFeatures, numBins])
binIndexes = np.array(range(0, numPointsTrain+1, numPointsTrain//numBins))
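
Since the bins are equal-frequency, binIndexes simply marks every (numPointsTrain // numBins)-th position. With the official training.csv (250,000 rows, hence numPointsTrain = 225,000; this assumes an unmodified copy of the file):


In [ ]:
# Expected bin boundaries with the official file:
# array([     0,  22500,  45000,  67500,  90000, 112500, 135000, 157500,
#        180000, 202500, 225000])
binIndexes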

In [13]:
for fI in range(numFeatures):
    # index permutation of sorted feature column
    indexes = xsTrainTranspose[fI].argsort()

    for bI in range(numBins):
        # upper bin limits
        binMaxs[fI, bI] = xsTrainTranspose[fI, indexes[binIndexes[bI+1]-1]]
        # training indices of points in a bin
        indexesInBin = indexes[binIndexes[bI]:binIndexes[bI+1]]
        # sum of signal weights in bin
        wS = np.sum(weightsBalancedTrain[indexesInBin]
                    [sSelectorTrain[indexesInBin]])
        # sum of background weights in bin
        wB = np.sum(weightsBalancedTrain[indexesInBin]
                    [bSelectorTrain[indexesInBin]])
        # log probability of being a signal in the bin
        logPs[fI, bI] = math.log(wS/(wS+wB))

The score function we will use to sort the test examples. For readability it is shifted so that a negative score means likely background and a positive score means likely signal (under the uniform prior). x is an input vector.
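
In symbols, with $d$ features and $\text{bin}_f(x)$ denoting the bin that $x_f$ falls into, the score computed below is \begin{equation*} \text{score}(x) = \sum_{f=1}^{d} \left( \log p\big(\text{s} \mid \text{bin}_f(x)\big) - \log \tfrac{1}{2} \right), \end{equation*} so each term is positive exactly when its bin favors signal over the uniform prior.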


In [14]:
def score(x):
    logP = 0
    for fI in range(numFeatures):
        bI = 0
        # linear search for the bin index of the fIth feature
        # of the point x
        while bI < len(binMaxs[fI]) - 1 and x[fI] > binMaxs[fI, bI]:
            bI += 1
        logP += logPs[fI, bI] - math.log(0.5)
    return logP
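
The per-feature linear search is simple but slow in pure Python, which matters when scoring the 550,000-event test set below. As an optional alternative (a sketch, not part of the original kit), np.searchsorted computes the same bin indices for a whole array at once; scoreAll(xsTest) would then replace the per-point list comprehension used later:


In [ ]:
# Vectorized scorer: same result as score(), but batched over all points.
# np.searchsorted returns, for each value, the first bin whose upper edge
# is >= the value; values beyond the last edge are clipped into the last
# bin, mimicking the linear search above.
def scoreAll(xs):
    logP = np.zeros(len(xs))
    for fI in range(numFeatures):
        bIs = np.searchsorted(binMaxs[fI], xs[:, fI])
        bIs = np.minimum(bIs, numBins - 1)
        logP += logPs[fI, bIs] - math.log(0.5)
    return logP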

Optimizing the AMS on the held out validation set

The Approximate Median Significance is \begin{equation*} \text{AMS} = \sqrt{ 2 \left( (s + b + 10) \ln \left( 1 + \frac{s}{b + 10} \right) - s \right) }, \end{equation*} where $s$ and $b$ are the sums of signal and background weights, respectively, in the selection region; the constant $10$ is the regularization term bReg in the code below.


In [15]:
def AMS(s,b):
    assert s >= 0
    assert b >= 0
    bReg = 10.
    return math.sqrt(2 * ((s + b + bReg) * 
                          math.log(1 + s / (b + bReg)) - s))
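
As a quick sanity check with made-up numbers (not taken from the data): a selection region containing $s = 10$ and $b = 100$ gives $\sqrt{2 \left( 120 \ln(1 + 10/110) - 10 \right)} \approx 0.94$.


In [ ]:
# Illustrative values only, not from the data.
AMS(10., 100.)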

Computing the scores on the validation set


In [16]:
validationScores = np.array([score(x) for x in xsValidation])

Sorting the indices in increasing order of the scores.


In [17]:
tIIs = validationScores.argsort()

Weights have to be normalized to the same sum as in the full set.


In [18]:
wFactor = 1.* numPoints / numPointsValidation

Initializing $s$ and $b$ to the full sum of weights, we start by having all points in the selection region.


In [19]:
s = np.sum(weightsValidation[sSelectorValidation])
b = np.sum(weightsValidation[bSelectorValidation])

amss[tI] will contain the AMS after the tI lowest-scoring points of the sorted validation set have been moved out of the selection region.


In [20]:
amss = np.empty([len(tIIs)])

amsMax will contain the best validation AMS, and threshold will be the smallest score among the selected points.


In [21]:
amsMax = 0
threshold = 0.0

We will do len(tIIs) iterations, which means that amss[-1] is the AMS when only the point with the highest score is selected.


In [22]:
for tI in range(len(tIIs)):
    # don't forget to renormalize the weights to the same sum
    # as in the complete set
    amss[tI] = AMS(max(0, s * wFactor), max(0, b * wFactor))
    if amss[tI] > amsMax:
        amsMax = amss[tI]
        threshold = validationScores[tIIs[tI]]
        # print(tI, threshold)
    if sSelectorValidation[tIIs[tI]]:
        s -= weightsValidation[tIIs[tI]]
    else:
        b -= weightsValidation[tIIs[tI]]
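
For reference, the same curve can be computed without the explicit loop, using cumulative sums over the sorted weights (a sketch; it recomputes $s$ and $b$ from scratch, so it does not depend on the loop above having run):


In [ ]:
# Weights of the sorted validation points, split into signal and background.
sSorted = np.where(sSelectorValidation[tIIs], weightsValidation[tIIs], 0.)
bSorted = np.where(bSelectorValidation[tIIs], weightsValidation[tIIs], 0.)
# Signal and background weight still selected before point tI is removed.
sLeft = np.sum(sSorted) - np.concatenate(([0.], np.cumsum(sSorted)[:-1]))
bLeft = np.sum(bSorted) - np.concatenate(([0.], np.cumsum(bSorted)[:-1]))
amssCheck = np.array([AMS(max(0., si * wFactor), max(0., bi * wFactor))
                      for si, bi in zip(sLeft, bLeft)])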

In [23]:
amsMax


Out[23]:
2.0425679448304743

In [24]:
threshold


Out[24]:
0.014027569078873592

In [25]:
plt.plot(amss)


Out[25]:
[<matplotlib.lines.Line2D at 0x12c447890>]

Computing the permutation on the test set

Reading the test file, slicing off the header row and the id column, and converting the data to floats.


In [26]:
test = list(csv.reader(open("test.csv", "rt"), delimiter=','))
xsTest = np.array([list(map(float, row[1:])) for row in test[1:]])

In [ ]:
testIds = np.array([int(row[0]) for row in test[1:]])

Computing the scores.


In [27]:
testScores = np.array([score(x) for x in xsTest])



Computing the rank order.


In [28]:
testInversePermutation = testScores.argsort()

In [29]:
# invert the permutation: testPermutation[tI] is the rank of test point tI
testPermutation = list(testInversePermutation)
for tI, tII in zip(range(len(testInversePermutation)),
                   testInversePermutation):
    testPermutation[tII] = tI
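
Equivalently, applying argsort twice inverts the permutation in one line (the argsort of a permutation is exactly its inverse); the hypothetical name testPermutationCheck is used here so the loop's result is left untouched:


In [ ]:
# argsort of argsort gives each test point's 0-based rank by score.
testPermutationCheck = testScores.argsort().argsort()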

Computing the submission file with columns EventId, RankOrder, and Class.


In [30]:
submission = np.array([[str(testIds[tI]),str(testPermutation[tI]+1),
                       's' if testScores[tI] >= threshold else 'b'] 
            for tI in range(len(testIds))])

In [31]:
submission = np.append([['EventId','RankOrder','Class']],
                        submission, axis=0)

Saving the file that can be submitted to Kaggle.


In [32]:
np.savetxt("submission.csv", submission, fmt='%s', delimiter=',')
