The Portuguese Dataset

In this exercise you are going to experiment with arc-factored non-projective dependency parsers. The CoNLL-X and CoNLL 2008 shared task datasets (Buchholz and Marsi, 2006; Surdeanu et al., 2008) contain dependency treebanks for 14 languages; in this lab, we are going to experiment with the Portuguese and English datasets. We preprocessed those datasets to exclude all sentences with more than 15 words (a sketch of this kind of filtering is shown after the list); this yielded the files:

  • data/deppars/portuguese_train.conll,
  • data/deppars/portuguese_test.conll,
  • data/deppars/english_train.conll,
  • data/deppars/english_test.conll.
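
The following is a minimal sketch of the kind of length filtering we applied; the helper name is ours, and we assume standard CoNLL format (one token per line, blank lines between sentences):

def filter_conll(in_path, out_path, max_len=15):
    """Keep only the sentences with at most max_len tokens (hypothetical helper)."""
    with open(in_path) as f_in, open(out_path, "w") as f_out:
        sentence = []
        for line in f_in:
            if line.strip():
                sentence.append(line)
                continue
            # Blank line: end of sentence; keep it only if it is short enough.
            if 0 < len(sentence) <= max_len:
                f_out.writelines(sentence)
                f_out.write("\n")
            sentence = []
        if 0 < len(sentence) <= max_len:
            f_out.writelines(sentence)
            f_out.write("\n")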

After importing all the necessary libraries, load the Portuguese dataset:


In [ ]:
%load_ext autoreload
%autoreload 2

In [ ]:
import sys
sys.path.append("../../../")
import lxmls.parsing.dependency_parser as depp
dp = depp.DependencyParser()
dp.read_data("portuguese")

Observe the statistics that are printed. How many features are there in total?
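
The feature count is included in those statistics; if you want to query it programmatically, you can inspect the feature object held by the parser (the attribute name below is an assumption about the toolkit and may differ across versions):

In [ ]:
# Hypothetical attribute; check dir(dp.features) if the name differs.
print(dp.features.n_feats)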

Examining Features

We will now take a closer look at the features that can be used in the parser. Examine the file:

lxmls/parsing/dependency_features.py.

The following method takes a sentence and computes a vector of features for each possible arc $\langle h, m \rangle$:

def create_arc_features(self, instance, h, m, add=False):
    '''Creates features for arc h-->m.'''

We grouped the features into several subsets, so that we can conduct some ablation experiments (a schematic illustration of each group follows the list):

  • Basic features that look only at the parts-of-speech of the words that can be connected by an arc;
  • Lexical features that also look at these words themselves;
  • Distance features that look at the length and direction of the dependency link (i.e., distance between the two words);
  • Contextual features that look at the context (part-of-speech tags) of the words surrounding h and m.
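
To make the grouping concrete, here is a sketch of the kind of feature strings each group could contribute for an arc $\langle h, m \rangle$. These templates are illustrative only; the toolkit's actual ones are in dependency_features.py:

def arc_feature_sketch(words, tags, h, m):
    """Illustrative feature templates for the arc h --> m."""
    feats = []
    feats.append("basic:hpos=%s|mpos=%s" % (tags[h], tags[m]))      # basic
    feats.append("lex:hword=%s|mword=%s" % (words[h], words[m]))    # lexical
    direction = "R" if m > h else "L"
    feats.append("dist:dir=%s|len=%d" % (direction, abs(h - m)))    # distance
    mprev = tags[m - 1] if m > 0 else "_"                           # contextual
    mnext = tags[m + 1] if m + 1 < len(tags) else "_"               # (similarly for h)
    feats.append("ctx:mpos-1=%s|mpos+1=%s" % (mprev, mnext))
    return feats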

In the default configuration, only the basic features are enabled, so the total number of features is the quantity you observed in the previous question. With this configuration, train the parser by running 10 epochs of the structured perceptron algorithm:


In [ ]:
import lxmls.parsing.dependency_parser as depp
dp = depp.DependencyParser()
dp.read_data("portuguese")
dp.train_perceptron(10)

In [ ]:
dp.test()

What is the accuracy obtained in the test set? (Note: the shown accuracy is the fraction of words whose parent was correctly predicted.)
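
In other words, the metric is unlabeled attachment score. A minimal sketch of its computation, assuming one gold and one predicted head index per word:

import numpy as np

def attachment_accuracy(gold_heads, predicted_heads):
    """Fraction of words whose predicted parent matches the gold parent."""
    gold = np.asarray(gold_heads)
    pred = np.asarray(predicted_heads)
    return float(np.mean(gold == pred))

# Example: two of the three words receive the correct parent.
attachment_accuracy([2, 0, 2], [2, 0, 1])  # -> 0.666...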

Adding Features

Repeat the previous exercise, enabling the lexical, distance and contextual features one after the other (each run keeps the previously enabled groups):


In [ ]:
dp.features.use_lexical = True 
dp.read_data("portuguese") 
dp.train_perceptron(10) 
dp.test()

In [ ]:
dp.features.use_distance = True
dp.read_data("portuguese") 
dp.train_perceptron(10) 
dp.test()

In [ ]:
dp.features.use_contextual = True 
dp.read_data("portuguese") 
dp.train_perceptron(10)
dp.test()

For each configuration, write down the number of features and the test set accuracy, and observe the improvements obtained as more features are added. Feel free to engineer new features!
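
For instance, a classic addition are the "in-between" part-of-speech features used by MSTParser (McDonald et al., 2005). A hypothetical sketch of the template (how the strings are hashed into the feature dictionary depends on the toolkit):

def in_between_pos_features(tags, h, m):
    """One feature per POS tag occurring strictly between h and m."""
    lo, hi = min(h, m), max(h, m)
    return ["between:hpos=%s|bpos=%s|mpos=%s" % (tags[h], tags[b], tags[m])
            for b in range(lo + 1, hi)]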

Training with MaxEnt

Which of the three important inference tasks discussed above (computing the most likely tree, computing the partition function, and computing the marginals) need to be performed in the structured perceptron algorithm? What about a maximum entropy classifier trained with stochastic gradient descent? Check your answers by looking at the following two methods in lxmls/parsing/dependency_parser.py:

def train_perceptron(self, n_epochs):
    ...
def train_crf_sgd(self, n_epochs, sigma, eta0 = 0.001):
    ...
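
As a hint: the perceptron update only needs the most likely tree (an argmax), whereas the gradient used in stochastic gradient descent needs the arc marginals, whose computation for non-projective trees also yields the partition function via the matrix-tree theorem. Below is a minimal sketch of that computation for the multi-root model; it is ours, not the toolkit's code:

import numpy as np

def arc_marginals(scores):
    """Arc marginals and partition function via the matrix-tree theorem.

    scores: (N+1)-by-(N+1) matrix of arc log-scores, index 0 being the root.
    """
    A = np.exp(scores)                  # arc potentials
    np.fill_diagonal(A, 0.0)            # no self-arcs
    A[:, 0] = 0.0                       # no arcs point to the root
    L = np.diag(A.sum(axis=0)) - A      # (in-degree) Laplacian
    L_hat = L[1:, 1:]                   # delete the root row and column
    Z = np.linalg.det(L_hat)            # partition function
    L_inv = np.linalg.inv(L_hat)
    N = scores.shape[0] - 1
    marginals = np.zeros_like(A)
    for m in range(1, N + 1):
        marginals[0, m] = A[0, m] * L_inv[m - 1, m - 1]
        for h in range(1, N + 1):
            if h != m:
                marginals[h, m] = A[h, m] * (L_inv[m - 1, m - 1] - L_inv[m - 1, h - 1])
    return marginals, Z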

Repeat the last exercise by training a maximum entropy classifier, with stochastic gradient descent, using a regularization constant $\sigma = 0.01$ and an initial stepsize of $\eta_0 = 0.1$:


In [ ]:
dp.train_crf_sgd(10, 0.01, 0.1)
dp.test()

Compare the results with those obtained by the perceptron algorithm.

Other Languages

Train a parser for English using your favourite learning algorithm:


In [ ]:
dp.read_data("english")
dp.train_perceptron(10)
dp.test()

The predicted trees are placed in the file data/deppars/english_test.conll.pred. To get a sense of which errors are being made, you can check the sentences that differ from the gold standard (see the data in data/deppars/english_test.conll) and visualize those sentences, e.g., in https://brenocon.com/parseviz/.
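
A minimal sketch for locating those sentences programmatically, assuming tab-separated CoNLL-X format with the head index in the seventh column:

In [ ]:
def read_heads(path):
    """Collect the head column of each sentence in a CoNLL file."""
    sentences, current = [], []
    for line in open(path):
        if line.strip():
            current.append(line.split("\t")[6])   # HEAD column
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

gold = read_heads("data/deppars/english_test.conll")
pred = read_heads("data/deppars/english_test.conll.pred")
wrong = [i for i, (g, p) in enumerate(zip(gold, pred)) if g != p]
print("%d of %d sentences contain at least one error" % (len(wrong), len(gold)))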



Eisner's Algorithm (optional)

Implement Eisner's algorithm for projective dependency parsing, whose pseudo-code is shown as Algorithm 13, as the function:

def parse_proj(self, scores):

in file dependency_decoder.py. The input is a matrix of arc scores, whose dimension is $(N + 1)$-by-$(N + 1)$, and whose $(h, m)$ entry contains the score $s_\theta(h, m)$.

In particular, the first row contains the scores for the arcs that depart from the root, and the first column's values, along with the main diagonal, are to be ignored (since no arcs point to the root, and there are no self-pointing arcs). To make your job easier, we provide an implementation of the backtracking part:

def backtrack_eisner(self, incomplete_backtrack, complete_backtrack, s, t, direction, complete, heads):

so you just need to fill in the charts of complete and incomplete spans, together with their backtracking pointers, and then call

heads = -np.ones(N + 1, dtype=int)
self.backtrack_eisner(incomplete_backtrack, complete_backtrack, 0, N, 1, 1, heads)
return heads

to obtain the final parse.
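
For orientation, here is one possible way to fill the charts, following the pseudo-code. The layout (direction 0 = head on the right, 1 = head on the left) is our choice and is assumed to be compatible with backtrack_eisner:

import numpy as np

def parse_proj_sketch(scores):
    """Chart-filling part of Eisner's algorithm (backtracking omitted)."""
    N = scores.shape[0] - 1
    complete = np.zeros((N + 1, N + 1, 2))
    incomplete = np.zeros((N + 1, N + 1, 2))
    complete_backtrack = -np.ones((N + 1, N + 1, 2), dtype=int)
    incomplete_backtrack = -np.ones((N + 1, N + 1, 2), dtype=int)

    for k in range(1, N + 1):           # span length
        for s in range(N + 1 - k):      # span start
            t = s + k                   # span end
            # Incomplete spans: add the arc t --> s or s --> t.
            vals = complete[s, s:t, 1] + complete[(s + 1):(t + 1), t, 0]
            r = s + int(np.argmax(vals))
            incomplete[s, t, 0] = np.max(vals) + scores[t, s]
            incomplete_backtrack[s, t, 0] = r
            incomplete[s, t, 1] = np.max(vals) + scores[s, t]
            incomplete_backtrack[s, t, 1] = r
            # Complete spans: merge an incomplete span with a complete one.
            vals = complete[s, s:t, 0] + incomplete[s:t, t, 0]
            complete[s, t, 0] = np.max(vals)
            complete_backtrack[s, t, 0] = s + int(np.argmax(vals))
            vals = incomplete[s, (s + 1):(t + 1), 1] + complete[(s + 1):(t + 1), t, 1]
            complete[s, t, 1] = np.max(vals)
            complete_backtrack[s, t, 1] = s + 1 + int(np.argmax(vals))

    return incomplete_backtrack, complete_backtrack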

To test the algorithm, retrain the parser on the English data (where the trees are actually all projective) by setting the flag dp.projective to True:


In [ ]:
dp = depp.DependencyParser() 
dp.features.use_lexical = True 
dp.features.use_distance = True 
dp.features.use_contextual = True 
dp.read_data("english") 
dp.projective = True
dp.train_perceptron(10)
dp.test()

You should get the following results:

Number of sentences: 8044
Number of tokens: 80504
Number of words: 12202
Number of pos: 48
Number of features: 338014
Epoch 1
Training accuracy: 0.835637168541
Epoch 2
Training accuracy: 0.922426254687
Epoch 3
Training accuracy: 0.947621628947
Epoch 4
Training accuracy: 0.960326602521
Epoch 5
Training accuracy: 0.967689840538
Epoch 6
Training accuracy: 0.97263631025
Epoch 7
Training accuracy: 0.97619370285
Epoch 8
Training accuracy: 0.979209016579
Epoch 9
Training accuracy: 0.98127569228
Epoch 10
Training accuracy: 0.981320865519
Test accuracy (509 test instances): 0.886732599366