CS 224D Assignment #2

Part [1]: Deep Networks: NER Window Model

For this first part of the assignment, you'll build your first "deep" networks. On problem set 1, you computed the backpropagation gradient $\frac{\partial J}{\partial w}$ for a two-layer network; in this problem set you'll implement a slightly more complex network to perform named entity recognition (NER).

Before beginning the programming section, you should complete parts (a) and (b) of the corresponding section of the handout.



In [1]:

    
import sys, os
from numpy import *
from matplotlib.pyplot import *
%matplotlib inline
matplotlib.rcParams['savefig.dpi'] = 100

%load_ext autoreload
%autoreload 2









    



Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 24 days

(c): Random Initialization Test

Use the cell below to test your code.



In [2]:

    
from misc import random_weight_matrix
random.seed(10)
print random_weight_matrix(3,5)









    



[[ 0.46994114 -0.83008197  0.23148553  0.43094097 -0.00258593]
 [-0.47666619 -0.52297046  0.45125243 -0.57311684 -0.71301636]
 [ 0.32105262  0.78530031 -0.85918681  0.02111762  0.54147539]]

(d): Implementation

We've provided starter code to load in the dataset and convert it to a list of "windows", consisting of indices into the matrix of word vectors.

We pad each sentence with begin and end tokens <s> and </s>, which have their own word vector representations; additionally, we convert all words to lowercase, canonicalize digits (e.g. 1.12 becomes DG.DGDG), and replace unknown words with a special token UUUNKKK.

You don't need to worry about the details of this, but you can inspect the docs variables or look at the raw data (in plaintext) in the ./data/ directory.



In [3]:

    
import data_utils.utils as du
import data_utils.ner as ner



In [4]:

    
# Load the starter word vectors
wv, word_to_num, num_to_word = ner.load_wv('data/ner/vocab.txt', 'data/ner/wordVectors.txt')
tagnames = ["O", "LOC", "MISC", "ORG", "PER"]
num_to_tag = dict(enumerate(tagnames))
tag_to_num = du.invert_dict(num_to_tag)

# Load the training set
docs = du.load_dataset('data/ner/train')
X_train, y_train = du.docs_to_windows(docs, word_to_num, tag_to_num)

# Load the dev set (for tuning hyperparameters)
docs = du.load_dataset('data/ner/dev')
X_dev, y_dev = du.docs_to_windows(docs, word_to_num, tag_to_num)

# Load the test set (dummy labels only)
docs = du.load_dataset('data/ner/test.masked')
X_test, y_test = du.docs_to_windows(docs, word_to_num, tag_to_num)

To avoid re-inventing the wheel, we provide a base class that handles a lot of the drudgery of managing parameters and running gradient descent. It's based on the classifier API used by scikit-learn, so if you're familiar with that library it should be easy to use.

We'll be using this class for the rest of this assignment, so it helps to get acquainted with a simple example that should be familiar from Assignment 1. To keep this notebook uncluttered, we've put the code in the softmax_example.py; take a look at it there, then run the cell below.



In [5]:

    
from softmax_example import SoftmaxRegression
sr = SoftmaxRegression(wv=zeros((10,100)), dims=(100,5))

##
# Automatic gradient checker!
# this checks anything you add to self.grads or self.sgrads
# using the method of Assignment 1
sr.grad_check(x=5, y=4)









    



grad_check: dJ/db error norm = 3.555e-10 [ok]
    b dims: [5] = 5 elem
grad_check: dJ/dW error norm = 2.212e-11 [ok]
    W dims: [5, 100] = 500 elem
grad_check: dJ/dL[5] error norm = 2.693e-11 [ok]
    L[5] dims: [100] = 100 elem

In order to implement a model, you need to subclass NNBase, then implement the following methods:

__init__() (initialize parameters and hyperparameters)
_acc_grads() (compute and accumulate gradients)
compute_loss() (compute loss for a training example)
predict(), predict_proba(), or other prediction method (for evaluation)

NNBase provides you with a few others that will be helpful:

grad_check() (run a gradient check - calls _acc_grads and compute_loss)
train_sgd() (run SGD training; more on this later)

Your task is to implement the window model in nerwindow.py; a scaffold has been provided for you with instructions on what to fill in.

When ready, you can test below:



In [6]:

    
from nerwindow import WindowMLP
clf = WindowMLP(wv, windowsize=3, dims=[None, 100, 5], reg=0.001, alpha=0.01)
clf.grad_check(X_train[0], y_train[0]) # gradient check on single point









    



grad_check: dJ/db2 error norm = 3.234e-10 [ok]
    b2 dims: [5] = 5 elem
grad_check: dJ/dU error norm = 2.844e-10 [ok]
    U dims: [5, 100] = 500 elem
grad_check: dJ/db1 error norm = 2.85e-09 [ok]
    b1 dims: [100] = 100 elem
grad_check: dJ/dW error norm = 1.337e-08 [ok]
    W dims: [100, 150] = 15000 elem
grad_check: dJ/dL[30] error norm = 3.555e-11 [ok]
    L[30] dims: [50] = 50 elem
grad_check: dJ/dL[6659] error norm = 4.283e-11 [ok]
    L[6659] dims: [50] = 50 elem
grad_check: dJ/dL[12637] error norm = 4.409e-11 [ok]
    L[12637] dims: [50] = 50 elem

Now we'll train your model on some data! You can implement your own SGD method, but we recommend that you just call clf.train_sgd. This takes the following arguments:

X, y : training data
idxiter: iterable (list or generator) that gives index (row of X) of training examples in the order they should be visited by SGD
printevery: int, prints progress after this many examples
costevery: int, computes mean loss after this many examples. This is a costly operation, so don't make this too frequent!

The implementation we give you supports minibatch learning; if idxiter is a list-of-lists (or yields lists), then gradients will be computed for all indices in a minibatch before modifying the parameters (this is why we have you write _acc_grad instead of applying them directly!).

Before training, you should generate a training schedule to pass as idxiter. If you know how to use Python generators, we recommend those; otherwise, just make a static list. Make the following in the cell below:

An "epoch" schedule that just iterates through the training set, in order, nepoch times.
A random schedule of N examples sampled with replacement from the training set.
A random schedule of N/k minibatches of size k, sampled with replacement from the training set.



In [7]:

    
nepoch = 5
n_train = len(y_train)
N = nepoch * n_train
k = 5 # minibatch size

random.seed(10) # do not change this!
#### YOUR CODE HERE ####
def epoch_schedule():
    for n in xrange(N):
        yield n % n_train

import numpy as np
def random_schedule():
    for idx in np.random.randint(n_train, size=N):
        yield idx

def minibatch_schedule():
    M = N / k
    for m in xrange(M):
        yield np.random.randint(n_train, size=k)

#clf = WindowMLP(wv, windowsize=3, dims=[None, 100, 5], reg=0.001, alpha=0.01)        
clf.train_sgd(X_train, y_train, idxiter=minibatch_schedule())
#### END YOUR CODE ###









    



Begin SGD...
  Seen 10000 in 22.71 s
  [10000]: mean loss 0.319364
SGD Interrupted: saw 12641 examples in 46.28 seconds.






    Out[7]:





[(10000, 0.31936381069165587)]



In [8]:

    
from nerwindow import compute_f1
yp = clf.predict(X_dev)
print compute_f1(y_dev, yp, tagnames)









    



54.1335962713



In [19]:

    
def minibatch_schedule(N=100, k=5):
    M = N / k
    for m in xrange(M):
        yield np.random.randint(n_train, size=k)
for x in minibatch_schedule(): print x









    



[ 16393 126297  65301 126390 188520]
[  3946  24807 183506  98520   7715]
[122159 111172 109525  78587  51161]
[ 21467  45274  44189 139159    667]
[109097 175437  15723 117046 152589]
[103125  51581 121909  54701 147431]
[ 95628 115075  71134 182335 174295]
[202362 123422  22096  62765 201477]
[130991  76301  86265  85048 103762]
[160140   6513 107232 190367  52320]
[ 94018 157479 116344  87450  72361]
[ 20033 144717  81015  43899 190018]
[ 68288  91842  40131 167156  96725]
[152738  89210 107555   3381 153704]
[ 13838 112150  22442  66444  73232]
[ 40428  79302 179127  22022  30089]
[ 63096 181089   1040 112540  11495]
[139384 190119  74910 104999  91668]
[100500  76078  24811 158579  23633]
[136796 183608  92482 176512  62389]



In [30]:

    
class RunSettings(object):
    def __init__(self, windowsize, n_hidden_units, reg, alpha, n_output_units=5):
        self.windowsize = windowsize
        self.n_hidden_units = n_hidden_units
        self.reg = reg
        self.alpha = alpha
        self.dims=[None, n_hidden_units, n_output_units]

    def __repr__(self):
      return "# ws = %d, # n-hu = %d, reg = %0.5f, alpha = %0.5f" % (self.windowsize, self.n_hidden_units, self.reg, self.alpha)

n_runs = 20
runs = []
for i in xrange(n_runs):    
    windowsize = np.random.choice([3,5,7])
    n_hidden_units = np.random.choice(range(100,400,20)) 
    reg = np.random.uniform(0.001, 0.003)
    alpha = np.random.uniform(0.03, 0.06)
    reg = 0.002
    alpha = 0.056
    windowsize = 3
    runs.append(RunSettings(windowsize, n_hidden_units, reg, alpha))
#for run in runs: print run



In [31]:

    
from nerwindow import compute_f1
n_train = len(y_train)
def minibatch_schedule(N=100, k=5):
    M = N / k
    for m in xrange(M):
        yield np.random.randint(n_train, size=k)
best_f1 = 0.0
best_parameters = None
best_clf = None
for i, run in enumerate(runs): 
    dims=[None, run.n_hidden_units, 5]
    print "%d. %s" % (i, run),
    schedule = minibatch_schedule(N=800000, k=5)
    clf = WindowMLP(wv, windowsize=run.windowsize, dims=run.dims, reg=run.reg, alpha=run.alpha) 
    clf.train_sgd(X_train, y_train, idxiter=schedule, printevery=1000000, costevery=1000000, verbose=False)
    yp = clf.predict(X_dev)
    f1 = compute_f1(y_dev, yp, tagnames)
    print '-->> %0.4f' % f1
    if f1 > best_f1:
        best_f1 = f1
        best_clf = clf
        best_parameters = (windowsize, n_hidden_units, reg, alpha)
print best_f1, best_parameters
from nerwindow import full_report, eval_performance
yp = best_clf.predict(X_dev)
full_report(y_dev, yp, tagnames)
eval_performance(y_dev, yp, tagnames)









    



0. # ws = 3, # n-hu = 140, reg = 0.00200, alpha = 0.05600 -->> 75.7065
1. # ws = 3, # n-hu = 340, reg = 0.00200, alpha = 0.05600 -->> 77.5730
2. # ws = 3, # n-hu = 220, reg = 0.00200, alpha = 0.05600 -->> 75.8447
3. # ws = 3, # n-hu = 220, reg = 0.00200, alpha = 0.05600 -->> 75.7957
4. # ws = 3, # n-hu = 200, reg = 0.00200, alpha = 0.05600 -->> 76.8867
5. # ws = 3, # n-hu = 200, reg = 0.00200, alpha = 0.05600 -->> 76.8465
6. # ws = 3, # n-hu = 340, reg = 0.00200, alpha = 0.05600 -->> 77.5730
7. # ws = 3, # n-hu = 140, reg = 0.00200, alpha = 0.05600 -->> 75.6417
8. # ws = 3, # n-hu = 340, reg = 0.00200, alpha = 0.05600 -->> 77.5730
9. # ws = 3, # n-hu = 360, reg = 0.00200, alpha = 0.05600 -->> 77.8156
10. # ws = 3, # n-hu = 280, reg = 0.00200, alpha = 0.05600 -->> 73.2663
11. # ws = 3, # n-hu = 160, reg = 0.00200, alpha = 0.05600 -->> 76.2252
12. # ws = 3, # n-hu = 300, reg = 0.00200, alpha = 0.05600 -->> 75.0372
13. # ws = 3, # n-hu = 300, reg = 0.00200, alpha = 0.05600 -->> 74.9882
14. # ws = 3, # n-hu = 220, reg = 0.00200, alpha = 0.05600 -->> 75.8378
15. # ws = 3, # n-hu = 260, reg = 0.00200, alpha = 0.05600 -->> 74.6496
16. # ws = 3, # n-hu = 260, reg = 0.00200, alpha = 0.05600 -->> 74.5068
17. # ws = 3, # n-hu = 240, reg = 0.00200, alpha = 0.05600 -->> 76.2371
18. # ws = 3, # n-hu = 100, reg = 0.00200, alpha = 0.05600 -->> 76.3012
19. # ws = 3, # n-hu = 320, reg = 0.00200, alpha = 0.05600 -->> 76.6486
77.8155927168 (3, 320, 0.002, 0.056)
             precision    recall  f1-score   support

          O       0.96      0.98      0.97     42759
        LOC       0.84      0.79      0.81      2094
       MISC       0.88      0.70      0.78      1268
        ORG       0.69      0.64      0.66      2092
        PER       0.88      0.79      0.83      3149

avg / total       0.94      0.94      0.94     51362

=== Performance (omitting 'O' class) ===
Mean precision:  82.17%
Mean recall:     73.99%
Mean F1:         77.82%

Now call train_sgd to train on X_train, y_train. To verify that things work, train on 100,000 examples or so to start (with any of the above schedules). This shouldn't take more than a couple minutes, and you should get a mean cross-entropy loss around 0.4.

Now, if this works well, it's time for production! You have three tasks here:

Train a good model
Plot a learning curve (cost vs. # of iterations)
Use your best model to predict the test set

You should train on the train data and evaluate performance on the dev set. The test data we provided has only dummy labels (everything is O); we'll compare your predictions to the true labels at grading time.

Scroll down to section (f) for the evaluation code.

We don't expect you to spend too much time doing an exhaustive search here; the default parameters should work well, although you can certainly do better. Try to achieve an F1 score of at least 76% on the dev set, as reported by eval_performance.

Feel free to create new cells and write new code here, including new functions (helpers and otherwise) in nerwindow.py. When you have a good model, follow the instructions below to make predictions on the test set.

A strong model may require 10-20 passes (or equivalent number of random samples) through the training set and could take 20 minutes or more to train - but it's also possible to be much, much faster!

Things you may want to tune:

alpha (including using an "annealing" schedule to decrease the learning rate over time)
training schedule and minibatch size
regularization strength
hidden layer dimension
width of context window



In [18]:

    
from nn.base import NNBase

#### YOUR CODE HERE ####
# Sandbox: build a good model by tuning hyperparameters
n_train = len(y_train)
def minibatch_schedule(N=100, k=5):
    M = N / k
    for m in xrange(M):
        yield np.random.randint(n_train, size=k)
def anneal_schedule(a0, epoch=50000):
    ctr = 0
    while True:
        yield a0 * 1.0/((ctr+epoch)/epoch)
        ctr += 1
schedule = minibatch_schedule(N=100000, k=5)
alpha_schedule = anneal_schedule(0.54)
dims=[None, 140, 5]
clf = WindowMLP(wv, windowsize=3, dims=dims, reg=0.02, alpha=0.054) 
#clf.train_sgd(X_train, y_train, idxiter=schedule, printevery=100000, costevery=100000, verbose=True);
clf.train_sgd(X_train, y_train, idxiter=schedule, alphaiter=alpha_schedule, printevery=100000, costevery=100000, verbose=True);

#### END YOUR CODE ####









    



Begin SGD...
  [20000]: mean loss 15.6736
SGD complete: 20000 examples in 74.11 seconds.






    Out[18]:





[(20000, 15.673563976651218)]



In [17]:

    
#### YOUR CODE HERE ####
# Sandbox: build a good model by tuning hyperparameters

yp = clf.predict(X_dev)
full_report(y_dev, yp, tagnames)
eval_performance(y_dev, yp, tagnames)


#### END YOUR CODE ####









    



             precision    recall  f1-score   support

          O       0.94      0.99      0.96     42759
        LOC       0.72      0.75      0.73      2094
       MISC       0.84      0.43      0.57      1268
        ORG       0.65      0.21      0.32      2092
        PER       0.75      0.76      0.75      3149

avg / total       0.91      0.92      0.91     51362

=== Performance (omitting 'O' class) ===
Mean precision:  72.95%
Mean recall:     57.41%
Mean F1:         61.46%



In [37]:

    
#### YOUR CODE HERE ####
# Sandbox: build a good model by tuning hyperparameters
traincurvebest = [(100000, 1.6392305257753381),
 (200000, 0.9713501488847085),
 (300000, 0.92719162923623333),
 (400000, 0.65314617730524283),
 (500000, 0.69434574257551684),
 (600000, 0.67674408587791146),
 (700000, 0.63355849175379098),
 (800000, 0.60947476566166814),
 (900000, 0.60553916271791497),
 (1000000, 0.60838990305892149),
 (1100000, 0.58387614068123839),
 (1200000, 0.56528442563491521),
 (1300000, 0.56772383628962297),
 (1400000, 0.55818029785509871),
 (1500000, 0.56255721129937275),
 (1600000, 0.55059669338120609),
 (1700000, 0.5582590912960208),
 (1800000, 0.56455349952333056),
 (1900000, 0.54554620555117928),
 (2000000, 0.53827092990040815)]
#### END YOUR CODE ####

(e): Plot Learning Curves

The train_sgd function returns a list of points (counter, cost) giving the mean loss after that number of SGD iterations.

If the model is taking too long you can cut it off by going to Kernel->Interrupt in the IPython menu; train_sgd will return the training curve so-far, and you can restart without losing your training progress.

Make two plots:

Learning curve using reg = 0.001, and comparing the effect of changing the learning rate: run with alpha = 0.01 and alpha = 0.1. Use minibatches of size 5, and train for 10,000 minibatches with costevery=200. Be sure to scale up your counts (x-axis) to reflect the batch size. What happens if the model tries to learn too fast? Explain why this occurs, based on the relation of SGD to the true objective.
Learning curve for your best model (print the hyperparameters in the title), as trained using your best schedule. Set costevery so that you get at least 100 points to plot.



In [40]:

    
counts = [x for x, y in traincurvebest]
costs = [y for x, y in traincurvebest]



In [45]:

    
##
# Plot your best learning curve here
#counts, costs = zip(*traincurvebest)
figure(figsize=(6,4))
plot(5*array(counts), costs, color='b', marker='o', linestyle='-')
title(r"Learning Curve ($\alpha$=%g, $\lambda$=%g)" % (clf.alpha, clf.lreg))
xlabel("SGD Iterations"); ylabel(r"Average $J(\theta)$"); 
ylim(ymin=0, ymax=max(1.1*max(costs),3*min(costs)));
#ylim(0,2)

# Don't change this filename!
savefig("ner.learningcurve.best.png")



In [19]:

    
##
# Plot comparison of learning rates here
# feel free to change the code below

figure(figsize=(6,4))
counts, costs = zip(*trainingcurve1)
plot(5*array(counts), costs, color='b', marker='o', linestyle='-', label=r"$\alpha=0.01$")
counts, costs = zip(*trainingcurve2)
plot(5*array(counts), costs, color='g', marker='o', linestyle='-', label=r"$\alpha=0.1$")
title(r"Learning Curve ($\lambda=0.01$, minibatch k=5)")
xlabel("SGD Iterations"); ylabel(r"Average $J(\theta)$"); 
ylim(ymin=0, ymax=max(1.1*max(costs),3*min(costs)));
legend()

# Don't change this filename
savefig("ner.learningcurve.comparison.png")

(f): Evaluating your model

Evaluate the model on the dev set using your predict function, and compute performance metrics below!



In [15]:

    
# Predict labels on the dev set
yp = clf.predict(X_dev)
# Save predictions to a file, one per line
ner.save_predictions(yp, "dev.predicted")



In [11]:

    
from nerwindow import full_report, eval_performance
full_report(y_dev, yp, tagnames) # full report, helpful diagnostics
eval_performance(y_dev, yp, tagnames) # performance: optimize this F1









    



             precision    recall  f1-score   support

          O       0.84      1.00      0.91     42759
        LOC       0.64      0.18      0.28      2094
       MISC       0.00      0.00      0.00      1268
        ORG       0.25      0.01      0.01      2092
        PER       0.00      0.00      0.00      3149

avg / total       0.74      0.84      0.77     51362

=== Performance (omitting 'O' class) ===
Mean precision:  21.64%
Mean recall:     4.49%
Mean F1:         7.09%






    



/home/birksworks/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)



In [22]:

    
# Save your predictions on the test set for us to evaluate
# IMPORTANT: make sure X_test is exactly as loaded 
# from du.docs_to_windows, so that your predictions 
# line up with ours.
yptest = clf.predict(X_test)
ner.save_predictions(yptest, "test.predicted")

Part [1.1]: Probing neuron responses

You might have seen some results from computer vision where the individual neurons learn to detect edges, shapes, or even cat faces. We're going to do the same for language.

Recall that each "neuron" is essentially a logistic regression unit, with weights corresponding to rows of the corresponding matrix. So, if we have a hidden layer of dimension 100, then we can think of our matrix $W \in \mathbb{R}^{100 x 150}$ as representing 100 hidden neurons each with weights W[i,:] and bias b1[i].

(a): Hidden Layer, Center Word

For now, let's just look at the center word, and ignore the rest of the window. This corresponds to columns W[:,50:100], although this could change if you altered the window size for your model. For each neuron, find the top 10 words that it responds to, as measured by the dot product between W[i,50:100] and L[j]. Use the provided code to print these words and their scores for 5 neurons of your choice. In your writeup, briefly describe what you notice here.

The num_to_word dictionary, loaded earlier, may be helpful.



In [23]:

    
# Recommended function to print scores
# scores = list of float
# words = list of str
def print_scores(scores, words):
    for i in range(len(scores)):
        print "[%d]: (%.03f) %s" % (i, scores[i], words[i])

#### YOUR CODE HERE ####

neurons = [1,3,4,6,8] # change this to your chosen neurons
for i in neurons:
    print "Neuron %d" % i
    print_scores(topscores[i], topwords[i])
    
#### END YOUR CODE ####

(b): Model Output, Center Word

Now, let's do the same for the output layer. Here we only have 5 neurons, one for each class. O isn't very interesting, but let's look at the other four.

Here things get a little more complicated: since we take a softmax, we can't just look at the neurons separately. An input could cause several of these neurons to all have a strong response, so we really need to compute the softmax output and find the strongest inputs for each class.

As before, let's consider only the center word (W[:,50:100]). For each class ORG, PER, LOC, and MISC, find the input words that give the highest probability $P(\text{class}\ |\ \text{word})$.

You'll need to do the full feed-forward computation here - for efficiency, try to express this as a matrix operation on $L$. This is the same feed-forward computation as used to predict probabilities, just with $W$ replaced by W[:,50:100].

As with the hidden-layer neurons, print the top 10 words and their corresponding class probabilities for each class.



In [24]:

    
#### YOUR CODE HERE ####


for i in range(1,5):
    print "Output neuron %d: %s" % (i, num_to_tag[i])
    print_scores(topscores[i], topwords[i])
    print ""

#### END YOUR CODE ####

(c): Model Output, Preceding Word

Now for one final task: let's look at the preceding word. Repeat the above analysis for the output layer, but use the first part of $W$, i.e. W[:,:50].

Describe what you see, and include these results in your writeup.



In [25]:

    
#### YOUR CODE HERE ####


for i in range(1,5):
    print "Output neuron %d: %s" % (i, num_to_tag[i])
    print_scores(topscores[i], topwords[i])
    print ""

#### END YOUR CODE ####



In [ ]: