Exercise Sheet 11

In-Class Exercises

Exercise 1 Grammar Induction

In this exercise, a probabilistic context-free grammar (PCFG) is induced fully automatically from data (syntax trees).

Fill in the gaps and try to parse the following sentences with your automatically induced grammar:



In [1]:

    
test_sentences = [
    "the men saw a car .",
    "the woman gave the man a book .",
    "she gave a book to the man .",
    "yesterday , all my trouble seemed so far away ."
]



In [2]:

    
import nltk
from nltk.corpus import treebank
from nltk.grammar import ProbabilisticProduction, PCFG



In [3]:

    
# Production count: the number of times a given production occurs
pcount = {}

# LHS-count: counts the number of times a given lhs occurs
lcount = {}

for tree in treebank.parsed_sents():
    for prod in tree.productions():
        pcount[prod] = pcount.get(prod, 0) + 1
        lcount[prod.lhs()] = lcount.get(prod.lhs(), 0) + 1
        
productions = [
    ProbabilisticProduction(
        p.lhs(), p.rhs(),
        prob=pcount[p] / lcount[p.lhs()]    
    )
    for p in pcount
]

start = nltk.Nonterminal('S')
grammar = PCFG(start, productions)
parser = nltk.ViterbiParser(grammar)
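
The cell above estimates rule probabilities by relative frequency, i.e. P(A -> beta) = count(A -> beta) / count(A). The following is a minimal sketch of the same idea in plain Python, without NLTK. The tuple-based tree format, the toy trees, and the helper names `productions` and `estimate_pcfg` are assumptions made for illustration only.

```python
from collections import Counter

# Toy trees as nested tuples: (label, child, child, ...); leaves are plain strings.
# These trees are hand-made illustrations, not treebank data.
toy_trees = [
    ("S", ("NP", "she"), ("VP", ("V", "sleeps"))),
    ("S", ("NP", "she"), ("VP", ("V", "sees"), ("NP", "him"))),
]


def productions(tree):
    """Yield (lhs, rhs) pairs for every internal node of a tuple tree."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


def estimate_pcfg(trees):
    """Relative-frequency estimate: count(A -> beta) / count(A)."""
    prod_count = Counter(p for t in trees for p in productions(t))
    lhs_count = Counter()
    for (lhs, _), n in prod_count.items():
        lhs_count[lhs] += n
    return {p: n / lhs_count[p[0]] for p, n in prod_count.items()}


probs = estimate_pcfg(toy_trees)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```

By construction, the probabilities of all rules that share a left-hand side sum to one, which is a useful sanity check for an induced grammar.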



In [4]:

    
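from IPython.display import display
for sent in test_sentences:
    try:
        for res in parser.parse(sent.split()):
            display(res)
    except ValueError as err:  # raised when a word is missing from the grammar
        print(f"No parse for {sent!r}: {err}")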

Exercise 2 Information Extraction via Syntactic Analysis

This exercise deals with a practical way of post-processing the results of a syntactic analysis. From the syntactic dependencies of a text, a semantic representation of the information it contains is to be derived (with the help of a few normalization steps).

The dependency parser of the Stanford CoreNLP suite is to be used for the syntactic analysis. The semantic representation of a sentence is a two-place logical predicate whose arguments are filled by the subject and the object. (If either of the two is missing, None is written in its place.)

The following example illustrates the desired result:

Input:

I shot an elephant in my pajamas.
The elephant was seen by a giraffe in the desert.
The bird I need is a raven.
The man who saw the raven laughed out loud.

Output:

shot(I, elephant)
seen(giraffe, elephant)
need(I, bird)
raven(bird, None)
saw(man, raven)
laughed(man, None)
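
To make the mapping from dependencies to predicates concrete, here is a minimal sketch for single-clause sentences. It works on triples of the form ((head, tag), relation, (dependent, tag)). The triples below are written by hand for illustration and are not actual parser output; the function name `predicate_from_triples` is hypothetical.

```python
def predicate_from_triples(triples):
    """Build 'verb(subject, object)' from (head, relation, dependent) triples."""
    verb = sbj = obj = None
    marker = {}  # noun -> preposition attached to it via 'case'
    nmods = []   # nominal modifiers of a verb, i.e. possible passive agents
    for (head, head_tag), rel, (dep, _dep_tag) in triples:
        if rel == "nsubj":
            verb, sbj = head, dep
        elif rel == "dobj":
            verb, obj = head, dep
        elif rel == "nsubjpass":
            # The grammatical subject of a passive is the logical object.
            verb, obj = head, dep
        elif rel == "nmod" and head_tag.startswith("V"):
            nmods.append(dep)
        elif rel == "case":
            marker[head] = dep
    if sbj is None:
        # In a passive clause, the by-phrase supplies the logical subject.
        for noun in nmods:
            if marker.get(noun) == "by":
                sbj = noun
    return f"{verb}({sbj}, {obj})"


# Hand-written triples for "I shot an elephant in my pajamas."
active = [
    (("shot", "VBD"), "nsubj", ("I", "PRP")),
    (("shot", "VBD"), "dobj", ("elephant", "NN")),
    (("elephant", "NN"), "det", ("an", "DT")),
    (("shot", "VBD"), "nmod", ("pajamas", "NNS")),
    (("pajamas", "NNS"), "case", ("in", "IN")),
]
# Hand-written triples for "The elephant was seen by a giraffe."
passive = [
    (("seen", "VBN"), "nsubjpass", ("elephant", "NN")),
    (("seen", "VBN"), "auxpass", ("was", "VBD")),
    (("seen", "VBN"), "nmod", ("giraffe", "NN")),
    (("giraffe", "NN"), "case", ("by", "IN")),
]

print(predicate_from_triples(active))   # shot(I, elephant)
print(predicate_from_triples(passive))  # seen(giraffe, elephant)
```

The sketch deliberately ignores relative clauses and sentences with more than one predicate.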



In [5]:

    
from nltk.parse.stanford import StanfordDependencyParser

# Adjust this path to your local CoreNLP installation.
PATH_TO_CORE = r"C:\Users\Martin\CoreNLP\stanford-corenlp-full-2017-06-09"
jar = PATH_TO_CORE + r"\stanford-corenlp-3.8.0.jar"
model = PATH_TO_CORE + r"\stanford-corenlp-3.8.0-models.jar"

dep_parser = StanfordDependencyParser(
    jar, model,
    model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz"
)



In [6]:

    
from collections import defaultdict

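def generate_predicates_for_sentence(sentence):
    """Return predicate strings of the form verb(subject, object) for one sentence."""
    # words that become predicates
    verbs = set()
    # sbj (obj) maps a predicate to its first (second) argument
    sbj = {}
    obj = {}
    # possible subjects attached via nmod (e.g. the by-phrase of a passive)
    sbj_candidates = defaultdict(list)
    # maps a noun to the preposition that marks it
    case = {}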
    for result in dep_parser.raw_parse(sentence):
        relcl_trips = []
        for triple in result.triples():

            if triple[1] == "nsubj":
                verbs.add(triple[0][0])
                sbj[triple[0][0]] = triple[2]
            if triple[1] == "dobj":
                verbs.add(triple[0][0])
                obj[triple[0][0]] = triple[2]
            if triple[1] == "nsubjpass":
                verbs.add(triple[0][0])
                obj[triple[0][0]] = triple[2]
            if triple[0][1].startswith("V"):
                verbs.add(triple[0][0])
                if triple[1] == "nmod":
                    sbj_candidates[triple[0][0]].append(triple[2])
            if triple[1] == "acl:relcl":
                verbs.add(triple[2][0])
                relcl_trips.append(triple)
            if triple[1] == "case":
                case[triple[0]] = triple[2]

        for triple in relcl_trips:
            if triple[2][0] in sbj:
                if sbj[triple[2][0]][1] in ["WP", "WDT"]:
                    sbj[triple[2][0]] = triple[0]
                else:
                    obj[triple[2][0]] = triple[0]
            else:
                sbj[triple[2][0]] = triple[0]
                
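        for v in sbj_candidates:
            if v not in sbj:
                for cand in sbj_candidates[v]:
                    # the by-phrase of a passive supplies the logical subject;
                    # .get avoids a KeyError for nmods without a case marker
                    if case.get(cand, ("",))[0] == "by":
                        sbj[v] = cand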

    preds = []
    for v in verbs:
        preds.append(
            v + "(" + sbj.get(v, ("None",))[0] +
            ", " + obj.get(v, ("None",))[0] + ")"
        )
    return preds
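
The relative-clause step in the function above can be looked at in isolation. The sketch below reproduces only that decision: if the verb of the relative clause already has a real subject, the modified noun becomes its object; if the subject is missing or a relative pronoun, the noun becomes the subject. The function name `resolve_relative_clauses` and the hand-filled dictionaries are assumptions for illustration.

```python
def resolve_relative_clauses(sbj, obj, relcl):
    """Fill the open argument of each relative-clause verb with the modified noun.

    sbj and obj map a verb to a (word, tag) pair; relcl is a list of
    (noun, verb) pairs taken from 'acl:relcl' edges. Both dicts are
    updated in place and returned.
    """
    for noun, verb in relcl:
        if verb in sbj and sbj[verb][1] not in ("WP", "WDT"):
            # The clause has its own subject ("the bird I need"),
            # so the modified noun fills the object slot.
            obj[verb] = noun
        else:
            # The subject is missing or a relative pronoun
            # ("the man who saw ..."), so the noun is the subject.
            sbj[verb] = noun
    return sbj, obj


# "The man who saw the raven laughed out loud."
sbj1 = {"laughed": ("man", "NN"), "saw": ("who", "WP")}
obj1 = {"saw": ("raven", "NN")}
resolve_relative_clauses(sbj1, obj1, [(("man", "NN"), "saw")])
print(sbj1["saw"][0], obj1["saw"][0])  # man raven

# "The bird I need is a raven."
sbj2 = {"raven": ("bird", "NN"), "need": ("I", "PRP")}
obj2 = {}
resolve_relative_clauses(sbj2, obj2, [(("bird", "NN"), "need")])
print(sbj2["need"][0], obj2["need"][0])  # I bird
```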



In [7]:

    
for pred in generate_predicates_for_sentence(
    "The man that saw the raven laughed out loud."
):
    print(pred)

saw(man, raven)
laughed(man, None)



In [8]:

    
def generate_predicates_for_text(text):
    predicates = []
    for sent in nltk.tokenize.sent_tokenize(text):
        predicates.extend(generate_predicates_for_sentence(sent))
    return predicates



In [9]:

    
text = """
I shot an elephant in my pajamas.
The elephant was seen by a giraffe.
The bird I need is a raven.
The man who saw the raven laughed out loud.
"""

for pred in generate_predicates_for_text(text):
    print(pred)

shot(I, elephant)
seen(giraffe, elephant)
raven(bird, None)
need(I, bird)
saw(man, raven)
laughed(man, None)

Homework

Exercise 3 Parent Annotation

Parent annotation can substantially improve the performance of a CFG. Write a function that applies this transformation to a given syntax tree. Trees transformed in this way can then in turn be used for grammar induction.

parentHistory is the number of ancestors that are taken into account in addition to the direct parent node. (It may also be ignored when solving the exercise.)

parentChar is a separator that is inserted in the new node labels between the original node label and the list of ancestors.
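
As an illustration of what the two parameters do, here is a small sketch that works on nested tuples instead of nltk.Tree objects and returns a new tree. The tuple format, the function name `annotate`, and the choice to leave the root label unannotated are assumptions of this sketch, not requirements of the exercise.

```python
def annotate(tree, parents=(), parent_history=0, parent_char="^"):
    """Return a copy of a tuple tree whose labels carry their ancestors."""
    if isinstance(tree, str):
        return tree  # leaves (words) are left untouched
    label, children = tree[0], tree[1:]
    if parents:
        new_label = label + parent_char + "<" + "-".join(parents) + ">"
    else:
        new_label = label  # the root has no parent; it keeps its plain label here
    # Children see this node's original label plus up to
    # parent_history older ancestors.
    child_parents = (label,) + parents[:parent_history]
    return (new_label,) + tuple(
        annotate(c, child_parents, parent_history, parent_char) for c in children
    )


toy = (
    "S",
    ("NP", ("DET", "the"), ("N", "dog")),
    ("VP", ("V", "saw"), ("NP", ("DET", "a"), ("N", "cat"))),
)

print(annotate(toy))
print(annotate(toy, parent_history=1))
```

With the default setting, the subject NP becomes NP^<S> and the object NP becomes NP^<VP>, so a grammar induced from such trees can assign different expansion probabilities to the two positions.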



In [10]:

    
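def parent_annotation(tree, parentHistory=0, parentChar="^"):
    """Annotate every node label with its ancestors; modifies tree in place."""
    def pa_rec(node, parents):
        originalNode = node.label()

        # the root has no parents and therefore becomes e.g. S^<>
        parentString = parentChar + '<' + "-".join(parents) + '>'
        node.set_label(originalNode + parentString)

        for child in node:
            # leaves of real treebank trees are plain strings without a label
            if isinstance(child, nltk.Tree):
                pa_rec(child, [originalNode] + parents[:parentHistory])

        return node

    return pa_rec(tree, [])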



In [11]:

    
test_tree = nltk.Tree(
    "S",
    [
        nltk.Tree("NP", [
            nltk.Tree("DET", []),
            nltk.Tree("N", [])
        ]),
        nltk.Tree("VP", [
            nltk.Tree("V", []),
            nltk.Tree("NP", [
                nltk.Tree("DET", []),
                nltk.Tree("N", [])
            ])
        ])
    ]
)

parent_annotation(
   test_tree
)

Exercise 4 More Semantics for IE

In addition to the constructions covered in Exercise 2, negated sentences and complex sentences with conjunctions should now also be handled sensibly.

Input:

I see an elephant.
You didn't see the elephant.
Peter saw the elephant and drank wine.

Desired output:

see(I, elephant)
not_see(You, elephant)
saw(Peter, elephant)
drank(Peter, wine)
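
The two new phenomena can be sketched independently of the parser. In the following minimal example, a neg edge marks its head verb as negated, and a conj edge between verbs lets the later verb inherit the subject of the earlier one. The triples are written by hand for illustration and are not actual parser output; the function name `predicates` is hypothetical.

```python
def predicates(triples):
    """Return predicate strings, handling negation and verb coordination."""
    sbj, obj, conj = {}, {}, {}
    verbs = []        # a list keeps the verbs in order of first mention
    negated = set()

    def note(verb):
        if verb not in verbs:
            verbs.append(verb)

    for (head, _head_tag), rel, (dep, dep_tag) in triples:
        if rel == "nsubj":
            note(head)
            sbj[head] = dep
        elif rel == "dobj":
            note(head)
            obj[head] = dep
        elif rel == "neg":
            negated.add(head)
        elif rel == "conj" and dep_tag.startswith("V"):
            note(head)
            note(dep)
            conj[dep] = head  # later verb -> earlier verb

    for verb in verbs:
        # A coordinated verb without its own subject shares the
        # subject of the verb it is coordinated with.
        if verb not in sbj and conj.get(verb) in sbj:
            sbj[verb] = sbj[conj[verb]]

    return [
        f"{'not_' if v in negated else ''}{v}({sbj.get(v)}, {obj.get(v)})"
        for v in verbs
    ]


# Hand-written triples for "You didn't see the elephant."
negated_sentence = [
    (("see", "VB"), "nsubj", ("You", "PRP")),
    (("see", "VB"), "aux", ("did", "VBD")),
    (("see", "VB"), "neg", ("n't", "RB")),
    (("see", "VB"), "dobj", ("elephant", "NN")),
]
# Hand-written triples for "Peter saw the elephant and drank wine."
coordinated_sentence = [
    (("saw", "VBD"), "nsubj", ("Peter", "NNP")),
    (("saw", "VBD"), "dobj", ("elephant", "NN")),
    (("saw", "VBD"), "cc", ("and", "CC")),
    (("saw", "VBD"), "conj", ("drank", "VBD")),
    (("drank", "VBD"), "dobj", ("wine", "NN")),
]

print(predicates(negated_sentence))      # ['not_see(You, elephant)']
print(predicates(coordinated_sentence))  # ['saw(Peter, elephant)', 'drank(Peter, wine)']
```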

It is best to copy your current version from above and add your extensions here.



In [12]:

    
def generate_predicates_for_sentence(sentence):
    # verbs contains everything that should be treated as verb (i.e. become a predicate)
    verbs = set()
    # sbj (obj) maps from verbs/predicates to its first (second) argument
    sbj = {}
    obj = {}
    # sbj_candidates maps from a verb to potential subjects
    # it is used when the verb-noun relation is not clear (e.g. nmod)
    sbj_candidates = defaultdict(list)
    # in case, we store information about the kind of PP some noun is found in
    case = {}
    # the negated-set should contain everything we find under negation
    negated = set()
    # the conj-dict should connect verbs that are coordinated
    # the mapping goes from later verbs to former ones
    conj = {}
    for result in dep_parser.raw_parse(sentence):
        relcl_trips = []
        for triple in result.triples():

            if triple[1] == "nsubj":
                verbs.add(triple[0][0])
                sbj[triple[0][0]] = triple[2]
            if triple[1] == "dobj":
                verbs.add(triple[0][0])
                obj[triple[0][0]] = triple[2]
            if triple[1] == "nsubjpass":
                verbs.add(triple[0][0])
                obj[triple[0][0]] = triple[2]
            if triple[0][1].startswith("V"):
                verbs.add(triple[0][0])
                if triple[1] == "nmod":
                    sbj_candidates[triple[0][0]].append(triple[2])
            if triple[1] == "acl:relcl":
                verbs.add(triple[2][0])
                relcl_trips.append(triple)
            if triple[1] == "case":
                case[triple[0]] = triple[2]
            if triple[1] == "neg":
                negated.add(triple[0][0])
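            # only coordinated verbs become predicates (not e.g. "Peter and Mary")
            if triple[1] == "conj" and triple[2][1].startswith("V"):
                verbs.add(triple[0][0])
                verbs.add(triple[2][0])
                conj[triple[2][0]] = triple[0][0]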

        for triple in relcl_trips:
            if triple[2][0] in sbj:
                if sbj[triple[2][0]][1] in ["WP", "WDT"]:
                    sbj[triple[2][0]] = triple[0]
                else:
                    obj[triple[2][0]] = triple[0]
            else:
                sbj[triple[2][0]] = triple[0]
                
        for v in sbj_candidates:
            if v not in sbj:
                for cand in sbj_candidates[v]:
                    if case.get(cand, ("",))[0] == "by":
                        sbj[v] = cand
                        
        for v in verbs:
            # if we do not have the subject of a verb which is coordinated with some other verb,
            # they probably share their subject
            if v not in sbj and v in conj and conj[v] in sbj:
                sbj[v] = sbj[conj[v]]

    preds = []
    negator = lambda v: "not_" + v if v in negated else v
    for v in verbs:
        preds.append(
            negator(v) + "(" + sbj.get(v, ("None",))[0] +
            ", " + obj.get(v, ("None",))[0] + ")"
        )
    return preds



In [13]:

    
for pred in generate_predicates_for_sentence(
    "Peter saw the elephant, drank wine and laughed."
):
    print(pred)

saw(Peter, elephant)
drank(Peter, wine)
laughed(Peter, None)



In [14]:

    
text = """
I see an elephant.
You didn't see the elephant.
Peter saw the elephant and drank wine.
"""

for pred in generate_predicates_for_text(text):
    print(pred)

see(I, elephant)
not_see(You, elephant)
saw(Peter, elephant)
drank(Peter, wine)