In this week's homework assignment we want you to start experimenting with different classifiers and to submit a prediction to the Kaggle competition. Nothing fancy; the goal is just to make sure you aren't doing this for the first time on the day before the due date.
This notebook is designed to help you experiment with those classifiers, though most of the code is already in the homework assignment writeup.
In [1]:
# Import all required libraries
from __future__ import division # For python 2.*
import numpy as np
import matplotlib.pyplot as plt
import mltools as ml
np.random.seed(0)
%matplotlib inline
In [2]:
# Data Loading
X = np.genfromtxt('data/X_train.txt', delimiter=None)
Y = np.genfromtxt('data/Y_train.txt', delimiter=None)
# The test data
Xte = np.genfromtxt('data/X_test.txt', delimiter=None)
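It's worth a quick check that the files loaded with the shapes you expect; in particular, X and Xte should have the same number of features:
In [ ]:
# Sanity check: training features/labels and test features
print(X.shape, Y.shape, Xte.shape)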
All your work should be done on the training data set. To make educated decisions about which classifier to use, you should split it into training and validation data sets.
In [8]:
# Shuffle before splitting so the validation set is a representative sample
X, Y = ml.shuffleData(X, Y)
Xtr, Xva, Ytr, Yva = ml.splitData(X, Y)
# Take a subsample of the data so that training runs faster. You should train on the whole data for the homework and Kaggle.
Xt, Yt = Xtr[:4000], Ytr[:4000]
Time to start doing some classification. We'll run the classifiers required by the assignment on the data, skipping KNN; if you want a reminder on how to use that one, see the previous discussions.
For the Kaggle competition you need to submit probabilities, not just class predictions. Don't worry, you don't need to code that yourself; just use the predictSoft() function.
In [9]:
# The decision tree classifier has minLeaf and maxDepth parameters. You should know what these mean by now.
learner = ml.dtree.treeClassify(Xt, Yt, minLeaf=25, maxDepth=15)
# Prediction
probs = learner.predictSoft(Xte)
The predictSoft method returns an $M \times C$ table in which each row gives the predicted probability of each class for one point.
In [11]:
probs
Out[11]:
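As a quick sanity check, each row of probs is a probability distribution over the classes, so the rows should sum to one; a minimal check (plain NumPy, nothing specific to mltools):
In [ ]:
# Each row of probs is a distribution over classes, so rows should sum to (approximately) 1
print(probs.shape)            # (M, C)
print(probs.sum(axis=1)[:5])
# The most probable class for each point is the argmax of its row
print(probs.argmax(axis=1)[:5])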
We can also compute the AUC for both the training and validation data sets.
In [12]:
print("{0:>15}: {1:.4f}".format('Train AUC', learner.auc(Xt, Yt)))
print("{0:>15}: {1:.4f}".format('Validation AUC', learner.auc(Xva, Yva)))
Play with different parameters to see how AUC changes.
In [13]:
learner = ml.dtree.treeClassify()
learner.train(Xt, Yt, maxDepth=2)
print(learner)
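To make this comparison systematic, you can sweep a single parameter and track both AUCs; a minimal sketch (the particular maxDepth values are an arbitrary choice, not a recommendation):
In [ ]:
# Sweep maxDepth and report train/validation AUC for each setting
depths = [1, 2, 4, 8, 16]
for d in depths:
    dt = ml.dtree.treeClassify(Xt, Yt, minLeaf=25, maxDepth=d)
    print("maxDepth={0:2d}: train AUC {1:.4f}, validation AUC {2:.4f}".format(
        d, dt.auc(Xt, Yt), dt.auc(Xva, Yva)))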
In [14]:
# Rescale the training data, then apply the same transformation to the validation and test data
XtrP, params = ml.rescale(Xt)
XvaP, _ = ml.rescale(Xva, params)
XteP, _ = ml.rescale(Xte, params)
print(XtrP.shape, XvaP.shape, XteP.shape)
Note that we do not need to scale the data for the decision tree, since trees split on one feature at a time and are unaffected by monotonic rescaling.
In [15]:
## Linear models:
learner = ml.linearC.linearClassify()
learner.train(XtrP, Yt, initStep=0.5, stopTol=1e-6, stopIter=100)
probs = learner.predictSoft(XteP)
And the AUC is:
In [16]:
print("{0:>15}: {1:.4f}".format('Train AUC',learner.auc(XtrP, Yt)))
print("{0:>15}: {1:.4f}".format('Validation AUC', learner.auc(Xva, Yva)))
This is why we're using a validation data set. We can already see that for THIS specific configuration the decision tree does much better, which makes it likely that it will also do better on the test data.
In [17]:
nn = ml.nnet.nnetClassify()
After we construct the classifier, we can define the sizes of its layers and initialize their values with "init_weights".
Definition of nn.init_weights:
nn.init_weights(self, sizes, init, X, Y)
From the method description: sizes = [Ninput, N1, N2, ..., Noutput], where Ninput = # of input features and Noutput = # of classes.
Training the model using gradient descent, we can track the surrogate loss (here, MSE loss on the output vector, compared to a 1-of-K representation of the class), as well as the 0/1 classification loss (error rate):
In [18]:
# These layer sizes are hard-coded; if they don't match the data's number of
# features and classes, this cell will fail (see the corrected version below).
nn.init_weights([14, 5, 3], 'random', Xt, Yt)
nn.train(Xt, Yt, stopTol=1e-8, stepsize=.25, stopIter=50)
In [19]:
# Need to specify the right sizes for the input and output layers.
nn.init_weights([Xt.shape[1], 5, len(np.unique(Yt))], 'random', Xt, Yt)
nn.train(Xt, Yt, stopTol=1e-8, stepsize=.25, stopIter=50) # Really small stopIter so it will stop fast :)
In [20]:
print("{0:>15}: {1:.4f}".format('Train AUC',nn.auc(Xt, Yt)))
print("{0:>15}: {1:.4f}".format('Validation AUC', nn.auc(Xva, Yva)))
The AUC results are poor because we used a weak configuration of the network. Neural networks can be engineered endlessly, but some choices should already make sense to you.
One example is the option to change the activation function, the nonlinearity applied in the hidden layers. By default the code uses tanh, but the logistic (sigmoid) function is also built in and you can simply specify it.
In [21]:
nn.setActivation('logistic')
nn.train(Xt, Yt, stopTol=1e-8, stepsize=.25, stopIter=100)
print("{0:>15}: {1:.4f}".format('Train AUC',nn.auc(Xt, Yt)))
print("{0:>15}: {1:.4f}".format('Validation AUC', nn.auc(Xva, Yva)))
Not surprisingly, you can also provide a custom activation function. Note that for the last layer you will probably always want the sigmoid function, so only change the hidden-layer one.
The function definition is this:
setActivation(self, method, sig=None, d_sig=None, sig_0=None, d_sig_0=None)
You can call it with method='custom' and then specify both sig and d_sig (the '_0' versions are for the last layer).
In [23]:
# Here's a dummy activation function (the identity, f(x) = x) and its derivative (f'(x) = 1)
sig = lambda z: np.atleast_2d(z)
dsig = lambda z: np.atleast_2d(1)
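As a slightly more interesting example, here is a sketch of a ReLU activation and its derivative in the same style; whether it actually helps on this data is something you would have to test:
In [ ]:
# ReLU activation, f(z) = max(0, z), and its derivative (1 where z > 0, else 0)
relu = lambda z: np.maximum(np.atleast_2d(z), 0)
d_relu = lambda z: (np.atleast_2d(z) > 0).astype(float)
# Passed the same way as above: nn.setActivation('custom', relu, d_relu)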
In [24]:
nn = ml.nnet.nnetClassify()
nn.init_weights([Xt.shape[1], 5, len(np.unique(Yt))], 'random', Xt, Yt)
nn.setActivation('custom', sig, dsig)
nn.train(Xt, Yt, stopTol=1e-8, stepsize=.25, stopIter=100)
print("{0:>15}: {1:.4f}".format('Train AUC',nn.auc(Xt, Yt)))
print("{0:>15}: {1:.4f}".format('Validation AUC', nn.auc(Xva, Yva)))
We've learned that one way of gauging how well we're doing with different model parameters is to plot the train and validation errors as a function of that parameter (e.g., k in KNN, or degree in the linear classifier and regression).
But what if more than one parameter is involved? One example is the degree together with the regularization strength (see the HW assignment for more examples).
With two parameters you can simply use a heatmap: the X-axis and Y-axis represent the parameters, and the "heat" is the validation/train error as a "third" dimension.
We're going to use a dummy function to show that. Let's assume we have two parameters p1 and p2, and the prediction accuracy is p1 + p2 (yup, that stupid). In the HW assignment it's actually the AUC.
In [25]:
p1 = np.arange(5)
p2 = np.arange(5)
In [26]:
auc = np.zeros([p1.shape[0], p2.shape[0]])
for i in range(p1.shape[0]):
    for j in range(p2.shape[0]):
        auc[i, j] = p1[i] + p2[j]
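As an aside, the same table can be built in one line with NumPy broadcasting:
In [ ]:
# Equivalent construction: entry (i, j) is p1[i] + p2[j]
auc = p1[:, None] + p2[None, :]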
In [27]:
auc
Out[27]:
In [28]:
f, ax = plt.subplots(1, 1, figsize=(8, 5))
cax = ax.matshow(auc)
f.colorbar(cax)
# In matshow, rows of auc (indexed by p1) run along the y-axis and columns (p2) along the x-axis
ax.set_xticks(range(p2.shape[0]))
ax.set_xticklabels(['%d' % p for p in p2])
ax.set_yticks(range(p1.shape[0]))
ax.set_yticklabels(['%d' % p for p in p1])
plt.show()
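Once you have such a table for a real classifier, you'll want the parameter pair with the best validation score; a minimal sketch:
In [ ]:
# Find the (row, column) index of the largest entry and map it back to parameter values
i, j = np.unravel_index(np.argmax(auc), auc.shape)
print("Best value {0} at p1={1}, p2={2}".format(auc[i, j], p1[i], p2[j]))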
Let's assume that the last classifier we ran was the best one (after using everything we know to verify that, including the plot from the previous block). Now let's run it on the test data and create a file that can be submitted.
Each line in the file contains a point ID and the predicted probability P(Y=1), and the file starts with a header line. Here's how you can create it directly from the probs matrix.
In [29]:
probs
Out[29]:
In [30]:
# Create the submission data by taking the P(Y=1) column from probs and adding a running index as the first column.
Y_sub = np.vstack([np.arange(Xte.shape[0]), probs[:, 1]]).T
# We specify the header (ID, Prob1) and also specify the comments as '' so the header won't be commented out with
# the # sign.
np.savetxt('data/Y_sub.txt', Y_sub, '%d, %.5f', header='ID,Prob1', comments='', delimiter=',')
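Before submitting, it's worth printing the first few lines of the file to confirm the header and format look right:
In [ ]:
# Show the header and the first few rows of the submission file
with open('data/Y_sub.txt') as f:
    for line in list(f)[:4]:
        print(line.strip())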