In [1]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.rcParams['figure.figsize'] = 8, 8
plt.rcParams['axes.grid'] = True
It seems likely that we can calibrate our predictions to improve our performance, as described in this blog post. First, though, we should plot a reliability curve for a typical classifier.
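We'll use our own `utils.reliability_plot` helper for this below. Roughly, a reliability curve bins the predictions, then for each bin compares the mean predicted probability with the empirical fraction of positive labels; a perfectly calibrated classifier sits on the diagonal. The following is just an illustrative sketch of the idea, not necessarily what our helper actually does:

import numpy as np

def reliability_curve(predictions, labels, n_bins=10):
    # sketch: binned mean prediction vs. binned fraction positive
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    which = np.digitize(predictions, edges[1:-1])  # bin index for each prediction
    x, y = [], []
    for b in range(n_bins):
        in_bin = which == b
        if in_bin.any():
            x.append(predictions[in_bin].mean())  # binned mean predicted
            y.append(labels[in_bin].mean())       # binned fraction positive
    return np.array(x), np.array(y)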
In [2]:
cd ..
In [3]:
from python import utils
import pickle
In [7]:
import sklearn.linear_model
import sklearn.cross_validation
import sklearn.metrics
In [4]:
with open("fsdetailed.pickle","rb") as f:
    predictions,labels,weights,segments = pickle.load(f)
In [5]:
x,y = utils.reliability_plot(predictions,labels)
In [8]:
sklearn.metrics.roc_auc_score(labels,predictions)
Out[8]:
In [9]:
sklearn.metrics.log_loss(labels,predictions)
Out[9]:
In [10]:
plt.plot(x,y)
plt.xlabel("Binned mean predicted")
plt.ylabel("Binned fraction positive")
plt.plot([0, 1], [0, 1], 'k--')
Out[10]:
The above is not good. But that might be good news for us: if we can improve the calibration, we have a good chance of improving our score.
Worth a try, anyway.
As it says on the blog:
You essentially create a new data set that has the same labels, but with one dimension (the output of the SVM). You then train on this new data set, and feed the output of the SVM as the input to this calibration method, which returns a probability. In Platt’s case, we are essentially just performing logistic regression on the output of the SVM with respect to the true class labels.
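For reference, the sigmoid that Platt scaling fits has the form

$$P(y = 1 \mid s) = \frac{1}{1 + \exp(A s + B)}$$

where $s$ is the classifier's output score and the parameters $A$ and $B$ are fit by maximum likelihood; logistic regression on the single score feature does exactly this.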
Doing this is pretty easy for us.
In [11]:
lr = sklearn.linear_model.LogisticRegression()
In [12]:
# cross validation again: fit the calibrator on out-of-fold
# predictions so it isn't evaluated on data it was fit to
prds_cal = []
labels_cal = []
cv = sklearn.cross_validation.StratifiedShuffleSplit(labels)
for train,test in cv:
    # logistic regression expects a 2D input, so make the
    # predictions a single-column matrix
    lr.fit(predictions[np.newaxis].T[train],labels[train])
    prds_cal.append(lr.predict_proba(predictions[np.newaxis].T[test])[:,1])
    labels_cal.append(labels[test])
# stack the per-fold results
prds_cal = np.hstack(prds_cal)
labels_cal = np.hstack(labels_cal)
AUC has not changed significantly. We'd expect that: logistic regression on a single input is a monotonic transformation, so it leaves the ranking, and hence the AUC, essentially intact:
In [13]:
sklearn.metrics.roc_auc_score(labels_cal,prds_cal)
Out[13]:
In [14]:
sklearn.metrics.log_loss(labels_cal,prds_cal)
Out[14]:
In [15]:
x,y = utils.reliability_plot(prds_cal,labels_cal)
plt.plot(x,y)
plt.xlabel("Binned mean predicted")
plt.ylabel("Binned fraction positive")
plt.plot([0, 1], [0, 1], 'k--')
Out[15]:
The next method is used in much the same way as Platt scaling. From the blog again:
The second popular method of calibrating is isotonic regression. The idea is to fit a piecewise-constant non-decreasing function instead of logistic regression. Piecewise-constant non-decreasing means a stair-like shape...
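To make the stair-like shape concrete, here is a toy sketch on invented data (the scores and labels below are made up purely for illustration):

import numpy as np
import sklearn.isotonic

rng = np.random.RandomState(0)
scores = np.sort(rng.rand(20))                              # fake classifier scores
targets = (scores + 0.3*rng.randn(20) > 0.5).astype(float)  # noisy binary labels

iso_demo = sklearn.isotonic.IsotonicRegression(out_of_bounds='clip')
calibrated = iso_demo.fit_transform(scores, targets)

# the fitted values never decrease; plotted against the scores
# they trace out the stair shape described above
assert np.all(np.diff(calibrated) >= 0)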
As the sketch above shows, this is implemented in scikit-learn, so it's fairly easy to switch in:
In [16]:
import sklearn.isotonic
In [17]:
iso = sklearn.isotonic.IsotonicRegression(out_of_bounds='clip')
In [18]:
prds_cal = []
labels_cal = []
cv = sklearn.cross_validation.StratifiedShuffleSplit(labels)
for train,test in cv:
    iso.fit(predictions[train],labels[train])
    prds_cal.append(iso.transform(predictions[test]))
    labels_cal.append(labels[test])
# stack the per-fold results
prds_cal = np.hstack(prds_cal)
labels_cal = np.hstack(labels_cal)
In [19]:
x,y = utils.reliability_plot(prds_cal,labels_cal)
plt.plot(x,y)
plt.xlabel("Binned mean predicted")
plt.ylabel("Binned fraction positive")
plt.plot([0, 1], [0, 1], 'k--')
Out[19]:
There appears to be a problem with isotonic regression, or the way I'm using it: some of the predictions we're getting out are not finite.
In [20]:
delinds = []
for i, (label, prediction) in enumerate(zip(labels_cal,prds_cal)):
    if not np.isfinite(prediction):
        print(i,prediction)
        delinds.append(i)
In [21]:
# mask out the non-finite predictions (and their labels)
mask = np.ones(len(prds_cal), dtype=bool)
mask[delinds] = False
prds_cal = prds_cal[mask]
labels_cal = labels_cal[mask]
In [22]:
sklearn.metrics.roc_auc_score(labels_cal,prds_cal)
Out[22]:
In [23]:
sklearn.metrics.log_loss(labels_cal,prds_cal)
Out[23]:
So this method reduces the AUC score, but appears to improve the smoothness of the predictions. In the blog post they saw approximately the same thing, except that they improved the log loss, which we don't necessarily care about because we're not being scored on it.
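(For reference, the log loss reported above is the usual binary cross-entropy,

$$\mathrm{log\,loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right],$$

which punishes miscalibrated probabilities $p_i$ directly, whereas AUC only depends on how the predictions rank the examples.)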
We might expect Platt scaling to work better if it can tell which subject is which and bias its output appropriately. So, we can add a 1-of-k subject feature to the (single-element) prediction vector.
We don't have the subject for each prediction stored directly, but we can recover it from the segment names.
In [24]:
import json
In [25]:
with open("forestselection_gavin.json") as f:
settings = json.load(f)
In [26]:
subject_1ofk = []
for segment in segments:
    fvector = np.zeros(len(settings['SUBJECTS']))
    for i,subject in enumerate(settings['SUBJECTS']):
        if subject in segment:
            # set the matching subject's element to 1
            fvector[i]=1
    subject_1ofk.append(fvector[np.newaxis])
subject_1ofk = np.vstack(subject_1ofk)
In [27]:
subject_1ofk.shape
Out[27]:
In [28]:
predictions.shape
Out[28]:
In [29]:
predictions = predictions[np.newaxis].T
predictions.shape
Out[29]:
In [30]:
X = np.hstack([predictions,subject_1ofk])
In [31]:
# cross validation again
prds_cal = []
labels_cal = []
cv = sklearn.cross_validation.StratifiedShuffleSplit(labels)
for train,test in cv:
    lr.fit(X[train],labels[train])
    prds_cal.append(lr.predict_proba(X[test])[:,1])
    labels_cal.append(labels[test])
# stack the per-fold results
prds_cal = np.hstack(prds_cal)
labels_cal = np.hstack(labels_cal)
In [32]:
x,y = utils.reliability_plot(prds_cal,labels_cal)
plt.plot(x,y)
plt.xlabel("Binned mean predicted")
plt.ylabel("Binned fraction positive")
plt.plot([0, 1], [0, 1], 'k--')
Out[32]:
AUC is improved, but log loss suffers:
In [33]:
sklearn.metrics.roc_auc_score(labels_cal,prds_cal)
Out[33]:
In [34]:
sklearn.metrics.log_loss(labels_cal,prds_cal)
Out[34]:
Let's talk turkey.
The above results are from one of the highest-scoring models (I forgot to switch the v2 features back in). We can simply load the predictions it made in its submission CSV and transform them with the fitted logistic regression model we've created.
First, fit the logistic regression model to all the training data:
In [35]:
lr.fit(X,labels)
Out[35]:
Then load the submission csv:
In [36]:
import csv
In [37]:
testsegments = []
testpredictions = []
with open("output/forestselection_gavin_submission_using__v3_feats.csv") as f:
    c = csv.reader(f)
    # keep the header so we can write it back out later
    header = next(c)
    for line in c:
        testsegments.append(line[0])
        testpredictions.append(line[1])
Build the required 1-of-k subject feature:
In [38]:
subject_1ofk = []
for segment in testsegments:
    fvector = np.zeros(len(settings['SUBJECTS']))
    for i,subject in enumerate(settings['SUBJECTS']):
        if subject in segment:
            # set the matching subject's element to 1
            fvector[i]=1
    subject_1ofk.append(fvector[np.newaxis])
subject_1ofk = np.vstack(subject_1ofk)
In [39]:
testpredictions = np.array(testpredictions)
In [40]:
testpredictions = testpredictions[np.newaxis].T
testpredictions.shape
Out[40]:
In [41]:
testpredictions = testpredictions.astype(float)
In [42]:
subject_1ofk.shape
Out[42]:
In [43]:
X = np.hstack([testpredictions,subject_1ofk])
In [44]:
ptest = lr.predict_proba(X)[:,1]
In [45]:
ptest
Out[45]:
And save these new predictions to a file:
In [46]:
with open("output/platt_scaled_1ofk_forestselection_v3.csv","w") as f:
    c = csv.writer(f)
    c.writerow(header)
    for segment, prediction in zip(testsegments,ptest):
        c.writerow([segment,prediction])
In [53]:
!wc -l output/platt_scaled_1ofk_forestselection_v3.csv
In [54]:
!head output/platt_scaled_1ofk_forestselection_v3.csv
Submitted, and it scored 0.69822. Not an improvement; in fact, that's worse than the uncalibrated submission.
Trying with vanilla Platt scaling as above:
In [49]:
lr.fit(predictions,labels)
Out[49]:
In [50]:
ptest = lr.predict_proba(testpredictions)[:,1]
In [51]:
with open("output/platt_scaled_forestselection_v3.csv","w") as f:
    c = csv.writer(f)
    c.writerow(header)
    for segment, prediction in zip(testsegments,ptest):
        c.writerow([segment,prediction])
In [52]:
!head output/platt_scaled_forestselection_v3.csv
Submitted, and it scored exactly the same as without Platt scaling: 0.78169. That makes sense: the score is AUC, and Platt scaling is a monotonic transformation of a single score, so the ranking is unchanged.
Worth checking that it actually changed the probabilities:
In [55]:
!head output/forestselection_gavin_submission_using__v3_feats.csv
Yes, there are differences; the probabilities have just been rescaled monotonically, as expected.
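A more direct check is to confirm that the values differ while the ranking is identical, which would explain the unchanged AUC. A sketch, using a small hypothetical helper and assuming both files list the segments in the same order (they do, since we wrote the scaled file from the rows we read):

import csv
import numpy as np
import scipy.stats

def read_submission_probs(path):
    # hypothetical helper: read the probability column, skipping the header
    with open(path) as f:
        c = csv.reader(f)
        next(c)
        return np.array([float(line[1]) for line in c])

raw = read_submission_probs("output/forestselection_gavin_submission_using__v3_feats.csv")
scaled = read_submission_probs("output/platt_scaled_forestselection_v3.csv")

print(np.allclose(raw, scaled))               # expect False: the values changed
print(scipy.stats.spearmanr(raw, scaled)[0])  # expect 1.0: the ranking is preserved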