This document describes the optimal submission to Kaggle's Second Annual Data Science Bowl. The goal of this document is to minimize the competition's evaluation metric, the Continuous Ranked Probability Score (CRPS), assuming you somehow have access to the real distribution of values.

Finding the optimal submission without variational calculus

Suppose the real distribution over the $M$ (600) possible volume values is $p_v$, indexed from 1 rather than from 0, such that

$\sum^M_{v=1}p_v = 1$

you want to make a submission $Q_n$ which can be written as a sum of $q_j$

$Q_n = \sum^n_{j=1} q_j$

in order for the submission to be valid, you are constrained to have $q_j \ge 0$ and

$\sum^M_{j=1}q_j = 1$

the goal is to minimize the cost for the true value $v$

$C_v = {1\over M} \sum_n \left( Q_n - H(n \ge v) \right)^2$

where $H(\mbox{true}) = 1$ and $H(\mbox{false}) = 0$.
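To make the cost concrete, here is a minimal NumPy sketch of $C_v$ for a single case; the helper name crps_single and the uniform example submission are mine, not part of the competition code:

In [ ]:
import numpy as np

M = 600  # number of volume bins in the competition

def crps_single(Q, v):
    # C_v: mean squared difference between the submitted CDF
    # (Q[n-1] stores Q_n for n = 1..M) and the step H(n >= v)
    n = np.arange(1, M + 1)
    H = (n >= v).astype(float)
    return np.mean((Q - H) ** 2)

# example: a uniform guess q_j = 1/M, so Q_n = n/M
Q = np.arange(1, M + 1) / M
print(crps_single(Q, v=250))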

The $v$ values are distributed according to $p_v$, so without knowing the $v$ value we want to minimize

$C = E_{p_v} (C_v)$

we minimize $C$ by changing the $q_j$ values as long as the constraint is met. For example, we can change a specific location, $w$, in $q_j$ and see when the derivative vanishes. You could argue that changing just one value will violate the constraint, but this can be relaxed by adding a Lagrange multiplier, $\lambda$, to the goal:

$C = E_{p_v} (C_v) + \lambda \sum_j q_j$

However, as we will soon see, this addition cancels out, so you can safely ignore it in this problem. So let's compute the derivative and set it to zero

${ \partial C \over \partial q_w } = {1 \over M} E_{p_v} \left( \sum_n 2 (Q_n - H(n \ge v)) H(n \ge w) \right) + \lambda = 0$
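As a sanity check on this derivative (dropping $\lambda$, which only adds the same constant to every component), one can compare it against a finite-difference approximation of $E_{p_v}(C_v)$. A sketch under the definitions above; the helper names are mine:

In [ ]:
import numpy as np

M = 600
rng = np.random.default_rng(0)

# an arbitrary distribution p_v and an arbitrary valid submission q
p = rng.random(M); p /= p.sum()
q = rng.random(M); q /= q.sum()

n = np.arange(1, M + 1)
H = (n[None, :] >= n[:, None]).astype(float)  # H[v-1, n-1] = H(n >= v)

def expected_cost(q):
    Q = np.cumsum(q)  # Q_n = sum of q_j for j <= n
    return p @ np.mean((Q[None, :] - H) ** 2, axis=1)  # E_{p_v}(C_v)

# analytic derivative at position w, from the formula above
w = 100
Q = np.cumsum(q)
analytic = (2.0 / M) * (p @ ((Q[None, :] - H) * H[w - 1]).sum(axis=1))

# central finite difference (normalization need not hold for this check)
eps = 1e-6
dq = np.zeros(M); dq[w - 1] = eps
numeric = (expected_cost(q + dq) - expected_cost(q - dq)) / (2 * eps)
print(analytic, numeric)  # the two values should agree closely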

we can cancel out the factor $2/M$ if we re-write $\lambda$ to include it, and if we use $E_{p_v}(1) = 1$ we get

$\sum_{n=w}^M Q_n - E_{p_v} \left( \sum_n H(n \ge v) H(n \ge w) \right) + \lambda = 0$

using $\sum_{n=1}^M H(n \ge v) H(n \ge w) = M - \max(v,w) + 1$ (the number of integers $n$ in $[1,M]$ with $n \ge \max(v,w)$) we get

$\sum_{n=w}^M Q_n - (M + 1) + \sum_v \left( p_v \max(v,w) \right) + \lambda = 0$

we can repeat this formula for $w+1$

$\sum_{n=w+1}^M Q_n - (M + 1) + \sum_v \left( p_v \max(v,w+1) \right) + \lambda = 0$

and subtract the two equations; the $Q$ sums differ by exactly $Q_w$, and $\max(v,w) - \max(v,w+1)$ is $-1$ for $v \le w$ and $0$ otherwise, so we get

$Q_w - \sum_{v=1}^w p_v = 0$

as promised, $\lambda$ disappeared, and we are left with the result that the submission should be the cumulative distribution function (CDF) of whatever you think the true distribution is ($P_w = \sum_{v=1}^w p_v$)

$Q_w = P_w$
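This claim is easy to spot-check numerically: for a random $p_v$, the CDF submission should never be beaten by any other valid submission. A minimal sketch, using the same conventions as the gradient check above:

In [ ]:
import numpy as np

M = 600
rng = np.random.default_rng(1)
p = rng.random(M); p /= p.sum()  # a random "true" distribution

n = np.arange(1, M + 1)
H = (n[None, :] >= n[:, None]).astype(float)  # H[v-1, n-1] = H(n >= v)

def expected_cost(Q):
    return p @ np.mean((Q[None, :] - H) ** 2, axis=1)

best = expected_cost(np.cumsum(p))  # the claimed optimum, Q_w = P_w
for _ in range(1000):
    q = rng.random(M); q /= q.sum()  # a random valid submission
    assert expected_cost(np.cumsum(q)) >= best
print("no random submission beat the CDF; expected CRPS =", best)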

Code for generating the optimal submission

Without additional information about each example, the distribution can be estimated directly from the values in the training set:


In [ ]:
import pandas as pd
import numpy as np

train = pd.read_csv('train.csv')

def doHist(data):
    # empirical CDF over the 600 volume bins:
    # h[n] = fraction of training values with ceil(value) <= n
    h = np.zeros(600)
    for j in np.ceil(data.values).astype(int):
        h[j:] += 1
    h /= len(data)
    return h

hSystole = doHist(train.Systole)
hDiastole = doHist(train.Diastole)

# give every case the same two empirical CDFs; rows alternate
# Diastole then Systole, matching the Id order of the sample file
sub = pd.read_csv('sample_submission_validate.csv', index_col='Id')
N = len(sub) // 2
X = np.zeros((2 * N, 600))
for i in range(N):
    X[2 * i, :] = hDiastole
    X[2 * i + 1, :] = hSystole
sub[sub.columns] = X
sub.to_csv('submission.csv')

which gave a score of $0.042738$ on the validation set, a result that happens to be very popular on the leaderboard (as of December 16, 2015).
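If you want a rough sanity check before submitting, you can score the empirical CDFs against the training labels themselves (an in-sample estimate, so it will be optimistic relative to the validation set). A sketch reusing train, hSystole, and hDiastole from the cell above:

In [ ]:
import numpy as np

def crps(h, values):
    # mean squared difference between the submitted CDF h and the
    # per-case step functions H(n >= value), n = 0..599
    n = np.arange(600)
    H = (n[None, :] >= np.ceil(values)[:, None]).astype(float)
    return np.mean((h[None, :] - H) ** 2)

print(0.5 * (crps(hSystole, train.Systole.values) +
             crps(hDiastole, train.Diastole.values)))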

