LXMLS 2017 - Day 4

Syntax and Parsing


In [2]:
%matplotlib inline

In [3]:
%load_ext autoreload

In [4]:
%autoreload 2
import sys
import scipy
sys.path.append('../../../')
import lxmls

CKY Algorithm: Step by Step

The CKY algorithm is a dynamic programming algorithm, much like Viterbi. The general idea is to build larger solutions out of the smaller, simpler ones computed in previous iterations.

Viterbi is focused on building sequences (of labels): it computes the highest-scoring sequence of length $n$ from the highest-scoring sequences of length $n-1$. CKY extends this idea to trees, which can be thought of as a higher-dimensional analogue of sequences. In particular, trees spanning $n$ words are built from trees spanning at most $n-1$ words.
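To make the bottom-up construction concrete, here is a minimal sketch of probabilistic CKY for a toy grammar in Chomsky normal form. The grammar, rule probabilities, and sentence are invented for illustration and are not part of the lab code:

import math
from collections import defaultdict

# Toy CNF grammar: binary rules A -> B C and lexical rules A -> w,
# with made-up probabilities.
binary = {('S', ('NP', 'VP')): 1.0,
          ('VP', ('V', 'NP')): 1.0}
lexical = {('NP', 'dogs'): 0.5, ('NP', 'cats'): 0.5,
           ('V', 'chase'): 1.0}

def cky(words):
    n = len(words)
    # chart[(s, t)] maps a nonterminal to the best log-probability of a
    # tree spanning words[s:t].
    chart = defaultdict(dict)
    for i, w in enumerate(words):                  # width-1 spans (lexical rules)
        for (A, word), p in lexical.items():
            if word == w:
                chart[(i, i + 1)][A] = math.log(p)
    for width in range(2, n + 1):                  # larger spans are built...
        for s in range(n - width + 1):
            t = s + width
            for m in range(s + 1, t):              # ...from pairs of smaller ones
                for (A, (B, C)), p in binary.items():
                    if B in chart[(s, m)] and C in chart[(m, t)]:
                        score = math.log(p) + chart[(s, m)][B] + chart[(m, t)][C]
                        if score > chart[(s, t)].get(A, -float('inf')):
                            chart[(s, t)][A] = score
    return chart[(0, n)].get('S')                  # best log-probability of a full S

print(cky(['dogs', 'chase', 'cats']))              # log(0.25)

Note how spans of width $k$ are filled using only spans of width smaller than $k$: this is exactly the Viterbi recipe, applied to spans instead of prefixes.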

In the next section, we will step through a few CKY iterations on a real sentence, to get a more concrete sense of how the algorithm works.

Exercise 4.1 In this simple exercise, you will see the CKY algorithm in action. There is a JavaScript applet that illustrates how CKY works (in its non-probabilistic form). Go to http://lxmls.it.pt/2015/cky.html and carefully observe the steps taken by the algorithm.


Exercise 4.2 This exercise will show you that real-world sentences can have complicated syntactic structures. There is a parse tree visualizer at http://brenocon.com/parseviz/. Go to your local data/treebanks folder and open the file PTB excerpt.txt. Copy a few trees from the file, one at a time, and examine their parse trees in the visualizer.
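If you prefer to stay inside the notebook, a bracketed tree string can also be rendered in plain text with NLTK. This assumes the nltk package is installed; the tree below is a toy example, not one taken from the PTB file:

from nltk import Tree

# Parse a bracketed tree string and draw it as ASCII art.
t = Tree.fromstring("(S (NP (N dogs)) (VP (V chase) (NP (N cats))))")
t.pretty_print()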


Exercise 4.3 In this exercise you are going to experiment with arc-factored non-projective dependency parsers. The CoNLL-X and CoNLL 2008 shared task datasets (Buchholz and Marsi, 2006; Surdeanu et al., 2008) contain dependency treebanks for 14 languages. In this lab, we are going to experiment with the Portuguese and English datasets.
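In an arc-factored model, the score of a dependency tree $y$ for a sentence $x$ decomposes into a sum over its head-modifier arcs, $\mathrm{score}(x, y) = \sum_{(h,m) \in y} \mathbf{w} \cdot \mathbf{f}(x, h, m)$, so decoding amounts to finding the maximum-scoring (spanning) tree under the arc scores. The exercises below train the arc weights $\mathbf{w}$ with the structured perceptron and with a CRF. As a rough illustration of the perceptron case, here is a minimal sketch of one training epoch; decode and features are hypothetical stand-ins for the toolkit's decoder and feature extractor:

# A minimal sketch of one structured-perceptron epoch for an arc-factored
# parser. `decode` and `features` are hypothetical stand-ins for the
# decoder and feature extractor used in lxmls.parsing.
def perceptron_epoch(data, w, features, decode):
    for x, gold_heads in data:
        pred_heads = decode(x, w)              # best tree under current weights
        if list(pred_heads) != list(gold_heads):
            # Reward features of gold arcs, penalize predicted ones.
            w = w + features(x, gold_heads) - features(x, pred_heads)
    return w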


In [7]:
# Exercises for lab day 4: Parsing.
import sys

sys.path.append('.')

import lxmls.parsing.dependency_parser as depp

print("Exercise 4.3.1")

# Load the Portuguese treebank with the default (basic) feature set.
dp = depp.DependencyParser()
dp.read_data("portuguese")

print("Exercise 4.3.2")

# Train an arc-factored parser with the structured perceptron for 10 epochs.
dp.train_perceptron(10)
dp.test()

print("Exercise 4.3.3")

# Progressively richer feature sets: lexical, then distance, then contextual.
dp.features.use_lexical = True
dp.read_data("portuguese")
dp.train_perceptron(10)
dp.test()

dp.features.use_distance = True
dp.read_data("portuguese")
dp.train_perceptron(10)
dp.test()

dp.features.use_contextual = True
dp.read_data("portuguese")
dp.train_perceptron(10)
dp.test()

print("Exercise 4.3.4")

# Train a CRF for 10 epochs with stochastic gradient descent.
dp.train_crf_sgd(10, 0.01, 0.1)
dp.test()

print("Exercise 4.3.5")

# Repeat the perceptron experiment on the English dataset.
dp.read_data("english")
dp.train_perceptron(10)
dp.test()


print("Exercise 4.3.6")

# Same setup, but restricted to projective trees (decoded with Eisner's algorithm).
dp = depp.DependencyParser()
dp.features.use_lexical = True
dp.features.use_distance = True
dp.features.use_contextual = True
dp.read_data("english")
dp.projective = True
dp.train_perceptron(10)
dp.test()


Exercise 4.3.1
Number of sentences: 3029
Number of tokens: 25015
Number of words: 7621
Number of pos: 16
Number of features: 142
Exercise 4.3.2
Epoch 1
Training accuracy: 0.497432605905
Epoch 2
Training accuracy: 0.499144201968
Epoch 3
Training accuracy: 0.498217087434
Epoch 4
Training accuracy: 0.50053487377
Epoch 5
Training accuracy: 0.501818570817
Epoch 6
Training accuracy: 0.498538011696
Epoch 7
Training accuracy: 0.500962772786
Epoch 8
Training accuracy: 0.500285266011
Epoch 9
Training accuracy: 0.499286834974
Epoch 10
Training accuracy: 0.500035658251
Test accuracy (109 test instances): 0.495210727969
Exercise 4.3.3
Number of sentences: 3029
Number of tokens: 25015
Number of words: 7621
Number of pos: 16
Number of features: 46216
Epoch 1
Training accuracy: 0.531914134931
Epoch 2
Training accuracy: 0.641135358722
Epoch 3
Training accuracy: 0.722864070746
Epoch 4
Training accuracy: 0.784695478534
Epoch 5
Training accuracy: 0.820425046356
Epoch 6
Training accuracy: 0.851911282271
Epoch 7
Training accuracy: 0.873876765083
Epoch 8
Training accuracy: 0.890850092711
Epoch 9
Training accuracy: 0.897054628441
Epoch 10
Training accuracy: 0.907466837826
Test accuracy (109 test instances): 0.57662835249
Number of sentences: 3029
Number of tokens: 25015
Number of words: 7621
Number of pos: 16
Number of features: 46224
Epoch 1
Training accuracy: 0.700292397661
Epoch 2
Training accuracy: 0.790757381258
Epoch 3
Training accuracy: 0.844815290258
Epoch 4
Training accuracy: 0.878155755242
Epoch 5
Training accuracy: 0.897232919698
Epoch 6
Training accuracy: 0.915703893881
Epoch 7
Training accuracy: 0.929432320639
Epoch 8
Training accuracy: 0.940379403794
Epoch 9
Training accuracy: 0.948723434603
Epoch 10
Training accuracy: 0.954999286835
Test accuracy (109 test instances): 0.714559386973
Number of sentences: 3029
Number of tokens: 25015
Number of words: 7621
Number of pos: 16
Number of features: 92918
Epoch 1
Training accuracy: 0.825452859792
Epoch 2
Training accuracy: 0.903758379689
Epoch 3
Training accuracy: 0.933497361289
Epoch 4
Training accuracy: 0.950613321923
Epoch 5
Training accuracy: 0.960419341036
Epoch 6
Training accuracy: 0.966445585508
Epoch 7
Training accuracy: 0.973933818286
Epoch 8
Training accuracy: 0.976180288119
Epoch 9
Training accuracy: 0.982634431607
Epoch 10
Training accuracy: 0.985094850949
Test accuracy (109 test instances): 0.874521072797
Exercise 4.3.4
Epoch 1
Training objective: 4.4070880113
Epoch 2
Training objective: 3.7910577601
Epoch 3
Training objective: 3.69859794909
Epoch 4
Training objective: 3.65759775959
Epoch 5
Training objective: 3.63423975151
Epoch 6
Training objective: 3.61910334678
Epoch 7
Training objective: 3.60848000103
Epoch 8
Training objective: 3.6006052875
Epoch 9
Training objective: 3.5945303809
Epoch 10
Training objective: 3.58969896567
Test accuracy (109 test instances): 0.88122605364
Exercise 4.3.5
Number of sentences: 8044
Number of tokens: 80504
Number of words: 12202
Number of pos: 48
Number of features: 338014
Epoch 1
Training accuracy: 0.825484482992
Epoch 2
Training accuracy: 0.918123503636
Epoch 3
Training accuracy: 0.944832181416
Epoch 4
Training accuracy: 0.959840990197
Epoch 5
Training accuracy: 0.96828838596
Epoch 6
Training accuracy: 0.973788227854
Epoch 7
Training accuracy: 0.97763924651
Epoch 8
Training accuracy: 0.981038532773
Epoch 9
Training accuracy: 0.983297194742
Epoch 10
Training accuracy: 0.985025071148
Test accuracy (509 test instances): 0.885612987498
Enter to go on to next exercise:
Exercise 4.3.6
Number of sentences: 8044
Number of tokens: 80504
Number of words: 12202
Number of pos: 48
Number of features: 338014
Epoch 1
Training accuracy: 0.835637168541
Epoch 2
Training accuracy: 0.922426254687
Epoch 3
Training accuracy: 0.947621628947
Epoch 4
Training accuracy: 0.960326602521
Epoch 5
Training accuracy: 0.967689840538
Epoch 6
Training accuracy: 0.97263631025
Epoch 7
Training accuracy: 0.97619370285
Epoch 8
Training accuracy: 0.979209016579
Epoch 9
Training accuracy: 0.98127569228
Epoch 10
Training accuracy: 0.981320865519
Test accuracy (509 test instances): 0.886732599366

Implement Eisner’s algorithm for projective dependency parsing. The pseudo-code is shown as Algorithm 13. Implement it as the function parse_proj shown below, which receives a matrix of arc scores and returns a vector assigning a head to each word. Eisner’s algorithm maintains two kinds of dynamic programming items over spans: complete spans, in which the head sits at one end and the other end can still take dependents, and incomplete spans, which represent a pending head-modifier arc between the two endpoints.


In [5]:
import numpy as np


def parse_proj(self, scores):
    """
    Parse using Eisner's algorithm.

    scores is an (N+1) x (N+1) matrix where scores[h, m] is the score of
    the arc from head h to modifier m (position 0 is the root). Returns a
    vector of heads, with heads[0] = -1 for the root.
    """

    ############################
    # Solution to Exercise 4.3.6
    ############################

    nr, nc = np.shape(scores)
    if nr != nc:
        raise ValueError("scores must be a square matrix with nw+1 rows")

    N = nr - 1  # Number of words (excluding root).

    # Initialize the CKY-style charts, indexed by [s, t, direction]
    # (direction: 0 = left-headed, 1 = right-headed).
    complete = np.zeros([N+1, N+1, 2])
    incomplete = np.zeros([N+1, N+1, 2])
    complete_backtrack = -np.ones([N+1, N+1, 2], dtype=int)
    incomplete_backtrack = -np.ones([N+1, N+1, 2], dtype=int)

    # Disallow arcs pointing into the root: position 0 can never be a modifier.
    incomplete[0, :, 0] -= np.inf

    # Loop from smaller spans to larger spans.
    for k in range(1, N+1):
        for s in range(N-k+1):
            t = s + k

            # First, create incomplete items (a new arc between s and t).
            # Left tree (arc t -> s).
            incomplete_vals0 = complete[s, s:t, 1] + complete[(s+1):(t+1), t, 0] + scores[t, s]
            incomplete[s, t, 0] = np.max(incomplete_vals0)
            incomplete_backtrack[s, t, 0] = s + np.argmax(incomplete_vals0)
            # Right tree (arc s -> t).
            incomplete_vals1 = complete[s, s:t, 1] + complete[(s+1):(t+1), t, 0] + scores[s, t]
            incomplete[s, t, 1] = np.max(incomplete_vals1)
            incomplete_backtrack[s, t, 1] = s + np.argmax(incomplete_vals1)

            # Second, create complete items by absorbing a finished subtree.
            # Left tree.
            complete_vals0 = complete[s, s:t, 0] + incomplete[s:t, t, 0]
            complete[s, t, 0] = np.max(complete_vals0)
            complete_backtrack[s, t, 0] = s + np.argmax(complete_vals0)
            # Right tree.
            complete_vals1 = incomplete[s, (s+1):(t+1), 1] + complete[(s+1):(t+1), t, 1]
            complete[s, t, 1] = np.max(complete_vals1)
            complete_backtrack[s, t, 1] = s + 1 + np.argmax(complete_vals1)

    value = complete[0][N][1]  # Score of the best projective tree.
    heads = -np.ones(N + 1, dtype=int)
    self.backtrack_eisner(incomplete_backtrack, complete_backtrack, 0, N, 1, 1, heads)

    # Sanity check: recompute the tree score from the recovered heads;
    # value_proj should equal value.
    value_proj = 0.0
    for m in range(1, N+1):
        h = heads[m]
        value_proj += scores[h, m]

    return heads

    ###################################
    # End of solution to Exercise 4.3.6
    ###################################
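As a quick sanity check of the implementation, one can decode a small random score matrix and verify that the output is a well-formed head vector. The snippet below is a hypothetical usage sketch: it assumes the decoder object holding parse_proj is exposed as dp.decoder, which may be named differently in your copy of the code:

import numpy as np

np.random.seed(0)
N = 5                                     # number of words in the sentence
scores = np.random.randn(N + 1, N + 1)    # scores[h, m]: arc from head h to modifier m

# `dp` is the DependencyParser built in the exercises above; the `decoder`
# attribute name is an assumption and may differ in your version.
heads = dp.decoder.parse_proj(scores)
print(heads)
# Every word should have exactly one head, and the root (position 0) none.
assert len(heads) == N + 1 and heads[0] == -1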