A basic part of speech tagger

This tutorial walks you through writing learning to search code using the VW python interface. Once you've completed this, you can graduate to the C++ version, which will be faster for the computer but more painful for you.

The "learning to search" paradigm solves problems that look like the following. You have a sequence of decisions to make. At the end of making these decisions, the world tells you how bad your decisions were. You want to condition later decisions on earlier decisions. But thankfully, at "training time," you have access to an oracle that will tell you the right answer.

Let's start with a simple example: sequence labeling for Part of Speech tagging. The goal is to take a sequence of words ("the monster ate a big sandwich") and label them with their parts of speech (in this case: Det Noun Verb Det Adj Noun).

We will choose to solve this problem with left-to-right search. I.e., we'll label the first word, then the second then the third and so on.

For any vw project in python, we have to start by importing the pyvw library:


In [ ]:
from __future__ import print_function
from vowpalwabbit import pyvw

Now, let's define our data first. We'll do this first by defining the labels (one annoying thing is that labels in vw have to be integers):


In [ ]:
DET  = 1
NOUN = 2
VERB = 3
ADJ  = 4
my_dataset = [ [(DET , 'the'),
                (NOUN, 'monster'),
                (VERB, 'ate'),
                (DET , 'a'),
                (ADJ , 'big'),
                (NOUN, 'sandwich')],
               [(DET , 'the'),
                (NOUN, 'sandwich'),
                (VERB, 'was'),
                (ADJ , 'tasty')],
               [(NOUN, 'it'),
                (VERB, 'ate'),
                (NOUN, 'it'),
                (ADJ , 'all')] ]
print(my_dataset[2])

Here we have an example of a (correctly) tagged sentence.

Now, we need to write the structured prediction code. To do this, we have to write a new class that derives from the pyvw.SearchTask class.

This class must have two functions: __init__ and _run.

The initialization function takes three arguments (plus self): a vw object (vw), a search object (sch), and the number of actions (num_actions) that this object has been initialized with. Within the initialization function, we must first initialize the parent class, and then we can set whatever options we want via sch.set_options(...). Of course we can also do whatever additional initialization we like.

The _run function executes the sequence of decisions on a given input. The input will be of whatever type our data is (so, in the above example, it will be a list of (label,word) pairs).

Here is a basic implementation of sequence labeling:


In [ ]:
class SequenceLabeler(pyvw.SearchTask):
    def __init__(self, vw, sch, num_actions):
        # you must must must initialize the parent class
        # this will automatically store self.sch <- sch, self.vw <- vw
        pyvw.SearchTask.__init__(self, vw, sch, num_actions)
        
        # set whatever options you want
        sch.set_options( sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES )

    def _run(self, sentence):   # it's called _run to remind you that you shouldn't call it directly!
        output = []
        for n in range(len(sentence)):
            pos,word = sentence[n]
            # use "with...as..." to guarantee that the example is finished properly
            with self.vw.example({'w': [word]}) as ex:
                pred = self.sch.predict(examples=ex, my_tag=n+1, oracle=pos, condition=[(n,'p'), (n-1, 'q')])
                output.append(pred)
        return output

Let's unpack this a bit.

The __init__ function is simple. It first calls the parent initializer and then sets some options. The options it sets are two things designed to make the programmer's life easier. The first is AUTO_HAMMING_LOSS. Remember earlier we said that when the sequence of decision is made, you have to say how bad it was? This says that we want this to be computed automatically by comparing the individual decisions to the oracle decisions, and defining the loss to be the sum of incorrect decisions.

The second is AUTO_CONDITION_FEATURES. This is a bit subtler. Later in the _run function, we will say that the label of the nth word depends on the label of the n-1th word. In order to get the underlying classifier to pay attention to that conditioning, we need to add features. We could do that manually (we'll do this later) or we can ask vw to do it automatically for us. For simplicity, we choose the latter.

The _run function takes a sentence (list of pos/word pairs) as input. We loop over each word position n in the sentence and extract the pos,word pair. We then construct a VW example that consists of a single feature (the word) in the 'w' namespace. Given that example ex, we make a search-based prediction by calling self.sch.predict(...). This is a fairly complicated function that takes a number of arguments. Here, we are calling it with the following:

  • examples=ex: This tells the predictor what features to use.
  • my_tag=n+1: In general, we want to condition the prediction of the nth label on the n-1th label. In order to do this, we have to give each prediction a "name" so that we can refer back to it in the future. This name needs to be an integer >= 1. So we'll call the first word 1, the second word 2, and so on. It has to be n+1 and not n because of the 1-based requirement.
  • oracle=pos: As mentioned before, on training data, we need to tell the system what the "true" (or "best") decision is at this point in time. Here, it is the given part of speech label.
  • condition=(n,'p'): This says that this prediction depends on the output of whichever-prediction-was-called-n, and that the "nature" of that condition is called 'p' (for "predecessor" in this case, though this is entirely up to you)

Now, we're ready to train the model. We do this in three steps. First, we initialize a vw object, telling it that we have a --search task with 4 labels, second that the specific type of --search_task is hook (you will always use the hook task) and finally that we want it to be quiet and use a larger ring_size (you can ignore the ring_size for now).


In [ ]:
vw = pyvw.vw("--search 4 --audit --quiet --search_task hook --ring_size 1024")

Next, we need to initialize the search task. We use the vw.init_search_task function to do this:


In [ ]:
sequenceLabeler = vw.init_search_task(SequenceLabeler)

Finally, we can train on the dataset we defined earlier, using sequenceLabeler.learn (the .learn function is inherited from the pyvw.SearchTask class). The .learn function takes any iterator over data. Whatever type of data it iterates over is what it will feed to your _run function.


In [ ]:
for i in range(10):
    sequenceLabeler.learn(my_dataset)

Of course, we want to see if it's learned anything. So let's create a single test example:


In [ ]:
test_example = [ (0,w) for w in "the sandwich ate a monster".split() ]
print(test_example)

We've used 0 as the labels so you can be sure that vw isn't cheating and it's actually making predictions:


In [ ]:
out = sequenceLabeler.predict(test_example)
print(out)

If we look back at our POS tag definitions, this is DET NOUN VERB DET NOUN, which is indeed correct!

Removing the AUTO features

In the above example we used both AUTO_HAMMING_LOSS and AUTO_CONDITION_FEATURES. To make more explicit what these are doing, let's rewrite our SequenceLabeler class without them! Here's a version that gets rid of both simultaneously. It is only modestly more complex:


In [ ]:
class SequenceLabeler2(pyvw.SearchTask):
    def __init__(self, vw, sch, num_actions):
        pyvw.SearchTask.__init__(self, vw, sch, num_actions)

    def _run(self, sentence):
        output = []
        loss = 0.
        for n in range(len(sentence)):
            pos,word = sentence[n]
            prevPred = output[n-1] if n > 0 else '<s>'
            with self.vw.example({'w': [word], 'p': [prevPred]}) as ex:
                pred = self.sch.predict(examples=ex, my_tag=n+1, oracle=pos, condition=(n,'p'))
                output.append(pred)
                if pred != pos:
                    loss += 1.
        self.sch.loss(loss)
        return output
    
sequenceLabeler2 = vw.init_search_task(SequenceLabeler2)
sequenceLabeler2.learn(my_dataset)
print(sequenceLabeler2.predict( [(0,w) for w in "the sandwich ate a monster".split()] ))

If executed correctly, this should have printed [1, 2, 3, 1, 2].

There are essentially two things that changed here. In order to get rid of AUTO_HAMMING_LOSS, we had to keep track of how many errors the predictor had made. This is done by checking whether pred != pos inside the inner loop, and then at the end calling self.sch.loss(loss) to tell the search procedure how well we did.

In order to get rid of AUTO_CONDITION_FEATURES, we need to explicitly add the previous prediction as features to the example we are predicting with. Here, we've done this by extracting the previous prediction (prevPred) and explicitly adding it as a feature (in the 'p' namespace) during the example construction.

Important Note: even though we're not using AUTO_CONDITION_FEATURES, we still must tell the search process that this prediction depends on the previous prediction. We need to do this because the learning algorithm automatically memoizes certain computations, and so it needs to know that, when memoizing, to remember that this prediction might have been different if a previous decision were different.

Very silly Covington-esqu dependency parsing

Let's also try a variant of dependency parsing to see that this doesn't work just for sequence-labeling list tasks. First we need to define some data:


In [ ]:
# the label for each word is its parent, or -1 for root
my_dataset = [ [("the",      1),   # 0
                ("monster",  2),   # 1
                ("ate",     -1),   # 2
                ("a",        5),   # 3
                ("big",      5),   # 4
                ("sandwich", 2) ]  # 5
                ,
               [("the",      1),   # 0
                ("sandwich", 2),   # 1
                ("is",      -1),   # 2
                ("tasty",    2)]   # 3
                ,
               [("a",        1),   # 0
                ("sandwich", 2),   # 1
                ("ate",     -1),   # 2
                ("itself",   2),   # 3
                ]
                ]

For instance, in the first sentence, the parent of "the" is "monster"; the parent of "monster" is "ate"; and "ate" is the root.

The basic idea of a Covington-style dependency parser is to loop over all O(N^2) word pairs and ask if one is the parent of the other. In a real parser you would want to make sure that you don't have cycles, that you have a unique root and (perhaps) that the resulting graph is projective. I'm not doing that here. Hopefully I'll add a shift-reduce parser example later that does do this. Here's an implementation of this idea:


In [ ]:
class CovingtonDepParser(pyvw.SearchTask):
    def __init__(self, vw, sch, num_actions):
        pyvw.SearchTask.__init__(self, vw, sch, num_actions)
        sch.set_options( sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES )

    def _run(self, sentence):
        N = len(sentence)
        # initialize our output so everything is a root
        output = [-1 for i in range(N)]
        for n in range(N):
            wordN,parN = sentence[n]
            for m in range(-1,N):
                if m == n: continue
                wordM = sentence[m][0] if m > 0 else "*root*"
                # ask the question: is m the parent of n?
                isParent = 2 if m == parN else 1

                # construct an example
                dir = 'l' if m < n else 'r'
                with self.vw.example({'a': [wordN, dir + '_' + wordN], 'b': [wordM, dir + '_' + wordN], 'p': [wordN + '_' + wordM, dir + '_' + wordN + '_' + wordM],
                                      'd': [ str(m-n <= d) + '<=' + str(d) for d in [-8, -4, -2, -1, 1, 2, 4, 8] ] +
                                           [ str(m-n >= d) + '>=' + str(d) for d in [-8, -4, -2, -1, 1, 2, 4, 8] ] }) as ex:
                    pred = self.sch.predict(examples  = ex,
                                            my_tag    = (m+1)*N + n + 1,
                                            oracle    = isParent,
                                            condition = [ (max(0, (m  )*N + n + 1), 'p'),
                                                          (max(0, (m+1)*N + n    ), 'q') ])
                    if pred == 2:
                        output[n] = m
                        break
        return output

In this, output stores the predicted tree and is initialized with every word being a root. We then loop over every word (n) and every possible parent (m, which can be -1, though that's really kind of unnecessary).

The features are basically the words under consideration, the words paired with the direction of the edge, the pair of words, and then a bunch of (binned) distance features.

We can train and run this parser with:


In [ ]:
vw = pyvw.vw("--search 2 --quiet --search_task hook --ring_size 1024")
task = vw.init_search_task(CovingtonDepParser)
for p in range(10): # do ten passes over the training data
    task.learn(my_dataset)
print('testing')
print(task.predict( [(w,-1) for w in "the monster ate a sandwich".split()] ))
print('should have printed [ 1 2 -1 4 2 ]')

One could argue that a more natural way to do this would be with LDF rather than the inner loop over m. We'll do that next.

LDF-based Covington-style dependency parser

One of the weirdnesses in the previous parser implementation is that it makes N-many binary decisions per word ("is word n my parent?") rather than a single N-way decision. The latter makes more sense.

The challenge is that you cannot set this up as a vanilla multiclass classification problem because (a) the number of "classes" is a function of the input (a length N sentence will have N classes) and (b) class "1" and "2" don't mean anything consistently across examples.

The way around this is label-dependent features (LDF). In LDF mode, the class ids are (essentially -- see caveat below) irrelevant. Instead, you simply provide features that depend on the label (hence "LDF"). In particular, for each possible label, you provide a different feature vector, and the goal of learning is to pick one of those as the "correct" one.

Here's a re-implementation of Covington using LDF:


In [ ]:
class CovingtonDepParserLDF(pyvw.SearchTask):
    def __init__(self, vw, sch, num_actions):
        pyvw.SearchTask.__init__(self, vw, sch, num_actions)
        sch.set_options( sch.AUTO_HAMMING_LOSS | sch.IS_LDF | sch.AUTO_CONDITION_FEATURES )

    def makeExample(self, sentence, n, m):
        wordN = sentence[n][0]
        wordM = sentence[m][0] if m >= 0 else '*ROOT*'
        dir   = 'l' if m < n else 'r'
        ex = self.vw.example( { 'a': [wordN, dir + '_' + wordN],
                                'b': [wordM, dir + '_' + wordN],
                                'p': [wordN + '_' + wordM, dir + '_' + wordN + '_' + wordM],
                                'd': [ str(m-n <= d) + '<=' + str(d) for d in [-8, -4, -2, -1, 1, 2, 4, 8] ] +
                                     [ str(m-n >= d) + '>=' + str(d) for d in [-8, -4, -2, -1, 1, 2, 4, 8] ] },
                              labelType=self.vw.lCostSensitive)
        ex.set_label_string(str(m+2) + ":0")
        return ex
            
    def _run(self, sentence):
        N = len(sentence)
        # initialize our output so everything is a root
        output = [-1 for i in range(N)]
        for n in range(N):
            # make LDF examples
            examples = [ self.makeExample(sentence,n,m) for m in range(-1,N) if n != m ]

            # truth
            parN = sentence[n][1]
            oracle = parN+1 if parN < n else parN   # have to -1 because we excluded n==m from list

            # make a prediction
            pred = self.sch.predict(examples  = examples,
                                    my_tag    = n+1,
                                    oracle    = oracle,
                                    condition = [ (n, 'p'), (n-1, 'q') ] )

            output[n] = pred-1 if pred < n else pred # have to +1 because n==m excluded

            for ex in examples: ex.finish()  # clean up
            
        return output

There are a few things going on here. Let's focus first on the __init__ function. The only difference here is that when we call sch.set_options we provide sch.IS_LDF to declare that this is an LDF task.

Let's skip the makeExample function for a minute and look at the _run function. You should recognize most of this from the non-LDF version. We initialize the output (parent) of every word to -1 (meaning that every word is connected to the root).

For each word n, we construct N-many examples: one for every -1..(N-1), except for the current word n because you cannot have self-loops. If we were being more clever, we would only do the ones that won't result in the creation of a cycle, but we're not being clever.

Now, because the "labels" are just examples, it's a bit more complicated to specify the oracle. The oracle is an index into the examples list. So if oracle is the oracle action, then examples[oracle] is the corresponding example. We compute the oracle as follows. parN is the actual parent, which is going to be something in the range -1 .. (N-1). If parN < n (this is a left arrow), then the oracle index is parN+1 because the root (-1) is index 0 and so on. If parN > n (note: it cannot be equal to n) then, beacuse n == m is left out of the examples list, the correct index is just parN. Phew.

We then ask for a prediction. Now, instead of giving a single example, with give the list of examples. The tag works the same way, as does the conditioning.

Once we get a prediction out (called pred) we need to figure out what parent it actually corresponds to. This is simply un-doing the computaiton two paragraphs ago!

Finally -- and this is skippable if you trust the Python garbage collector -- we tell VW that we're done with all the examples we created. We do this just to be pedantic; it's optional.

Okay, now let's go back to the makeExample function. This takes two word ids (n and m) and makes an example that roughly says "what would it look like if I had an edge from n to m?" We construct basically the same feautres as before. There are two major changes, though:

  1. When we run self.vw.example(...) we provide labelType=self.vw.lCostSensitive as an argument. This is because under the hood, vw treats LDF examples as cost-sensitive classification examples. This means they need to have cost-sensitive labels, so that's how we need to create them.

  2. We explicitly set the label of the this example to str(m+2)+":0". What is this? Well, this is optional but recommended. Here's the issue. In LDF mode, recall that labels have no intrinsic meaning. This means that when vw does auto-conditioning, it's not really clear what to use as the "previous prediction." By giving explicit label names (in this case, m+2) we're recording what the position of the last parent, which may be useful for predicting the next parent. We could avoid this necessity if we did our own feature engineering on the history, but for now, this seems to capture the right intuition.

Given all this, we can now train and test our parser:


In [ ]:
vw = pyvw.vw("--search 0 --csoaa_ldf m --search_task hook --ring_size 1024 --quiet")
task = vw.init_search_task(CovingtonDepParserLDF)
for p in range(2): # do two passes over the training data
    task.learn(my_dataset)
print(task.predict( [(w,-1) for w in "the monster ate a sandwich".split()] ))

The correct parse of this sentence is [1, 2, -1, 4, 2] which is what this should have printed.

There are two major things to notice in the initialization of VW. The first is that we say --search 0. The zero labels argument to --search declares that this is going to be an LDF task. We also have to tell VW that we want an LDF-enabled cost-sensitive learner, which is what --csoaa_ldf m does (if you're wondering, m means "multiline" -- just treat it as something you have to do). The rest should be familiar.

A simple word-alignment model

Okay, as a last example we'll do a simple word alignment model in the spirit of the IBM models. Note that this will be a supervised model; doing unsupervised stuff is a bit trickier.

Here's some word alignment data. The dataset is triples of E, A, F where A[i] = list of words E[i] aligned to, or [] for null-aligned:


In [ ]:
my_dataset = [
    ( "the blue house".split(),
      ([0], [2], [1]),
      "la maison bleue".split() ),
    ( "the house".split(),
      ([0], [1]),
      "la maison".split() ),
    ( "the flower".split(),
      ([0], [1]),
      "la fleur".split() )
    ]

It's going to be useful to compute alignment mismatches at the word level between true alignments (like [1,2]) and predicted alignments (like [2,3,4]). We use intersection-over-union error:


In [ ]:
def alignmentError(true, sys):
    t = set(true)
    s = set(sys)
    if len(t | s) == 0: return 0.
    return 1. - float(len(t & s)) / float(len(t | s))

And now we can define our structured prediction task. This is also an LDF problem. Basically for each word on the English side, we'll loop over all possible phrases on the Foreign side to which it could align (maximum phrase length of three). For each of these options we'll create an example to be fed into the LDF classifier. We also ensure that the same foreign word cannot be covered by multiple English words, though this might not be a good idea in general.


In [ ]:
class WordAligner(pyvw.SearchTask):
    def __init__(self, vw, sch, num_actions):
        pyvw.SearchTask.__init__(self, vw, sch, num_actions)
        sch.set_options( sch.AUTO_HAMMING_LOSS | sch.IS_LDF | sch.AUTO_CONDITION_FEATURES )

    def makeExample(self, E, F, i, j0, l):
        f  = 'Null' if j0 is None else [ F[j0+k] for k in range(l+1) ]
        ex = self.vw.example( { 'e': E[i],
                                'f': f,
                                'p': '_'.join(f),
                                'l': str(l),
                                'o': [str(i-j0), str(i-j0-l)] if j0 is not None else [] },
                              labelType = self.vw.lCostSensitive )
        lab = 'Null' if j0 is None else str(j0+l)
        ex.set_label_string(lab + ':0')
        return ex
        
    def _run(self, alignedSentence):
        E,A,F = alignedSentence

        # for each E word, we pick a F span
        covered = {}  # which F words have been covered so far?
        output  = []
        
        for i in range(len(E)):
            examples = []  # contains vw examples
            spans    = []  # contains triples (alignment error, index in examples, [range])
            
            # empty span:
            examples.append( self.makeExample(E, F, i, None, None) )
            spans.append( (alignmentError(A[i], []), 0, []) )

            # non-empty spans
            for j0 in range(len(F)):
                for l in range(3):   # max phrase length of 3
                    if j0+l >= len(F): break
                    if covered.has_key(j0+l): break

                    id = len(examples)
                    examples.append( self.makeExample(E, F, i, j0, l) )
                    spans.append( (alignmentError(A[i], range(j0,j0+l+1)), id, range(j0,j0+l+1)) )

            sortedSpans = []
            for s in spans: sortedSpans.append(s)
            sortedSpans.sort()
            oracle = []
            for id in range(len(sortedSpans)):
                if sortedSpans[id][0] > sortedSpans[0][0]: break
                oracle.append( sortedSpans[id][1] )
                
            pred = self.sch.predict(examples  = examples,
                                    my_tag    = i+1,
                                    oracle    = oracle,
                                    condition = [ (i, 'p'), (i-1, 'q') ] )

            for ex in examples: ex.finish()

            output.append( spans[pred][2] )
            for j in spans[pred][2]:
                covered[j] = True

        return output

The only really complicated thing here is computing the oracle. What we do is, for each possible alignment, compute an intersection-over-union error rate. The oracle is then that alignment that achieves the smallest (local) error rate. This is not perfect, but is good enough. One interesting thing here is that now the oracle could be a list; this is completely supported by the underlying algorithms.

We can train and test this model to make sure it does the right thing:


In [ ]:
vw = pyvw.vw("--search 0 --csoaa_ldf m --search_task hook --ring_size 1024 --quiet -q ef -q ep")
task = vw.init_search_task(WordAligner)
for p in range(10):
    task.learn(my_dataset)
print(task.predict( ("the blue flower".split(), ([],[],[]), "la fleur bleue".split()) ))

If this worked correctly, it should have printed [[0], [2], [1]].


In [ ]: