In this exercise, a probabilistic context-free grammar (PCFG) is to be induced fully automatically from data (syntax trees).
Fill in the gaps and try to parse the following sentences with your automatically induced grammar:
In [1]:
test_sentences = [
    "the men saw a car .",
    "the woman gave the man a book .",
    "she gave a book to the man .",
    "yesterday , all my trouble seemed so far away ."
]
In [2]:
import nltk
from nltk.corpus import treebank
from nltk.grammar import ProbabilisticProduction, PCFG
In [3]:
# Production count: the number of times a given production occurs
pcount = {}
# LHS count: the number of times a given LHS occurs
lcount = {}
for tree in treebank.parsed_sents():
    for prod in tree.productions():
        pcount[prod] = pcount.get(prod, 0) + 1
        lcount[prod.lhs()] = lcount.get(prod.lhs(), 0) + 1
productions = [
    ProbabilisticProduction(
        p.lhs(), p.rhs(),
        prob=pcount[p] / lcount[p.lhs()]
    )
    for p in pcount
]
start = nltk.Nonterminal('S')
grammar = PCFG(start, productions)
parser = nltk.ViterbiParser(grammar)
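The cell above estimates each rule probability as a relative frequency: the count of a production divided by the count of its left-hand side. The following is a minimal sketch of the same estimate on two hand-written toy trees, using plain nested tuples instead of NLTK trees so that it runs without the treebank corpus. The tree encoding and the helper names `productions_of` and `estimate` are assumptions made for this illustration only.

```python
from collections import Counter

# Toy trees as nested tuples: (label, child, child, ...); leaves are plain strings.
toy_trees = [
    ("S", ("NP", "she"), ("VP", ("V", "sleeps"))),
    ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", "him"))),
]

def productions_of(tree):
    """Yield one (lhs, rhs) pair for every internal node of a nested-tuple tree."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from productions_of(child)

def estimate(trees):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    pcount = Counter(p for t in trees for p in productions_of(t))
    lcount = Counter()
    for (lhs, _), n in pcount.items():
        lcount[lhs] += n
    return {p: n / lcount[p[0]] for p, n in pcount.items()}

probs = estimate(toy_trees)
for (lhs, rhs), p in sorted(probs.items()):
    print(lhs, "->", " ".join(rhs), round(p, 3))
```

By construction, the probabilities of all productions that share a left-hand side sum to 1, which is the condition a PCFG has to satisfy.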
In [4]:
from IPython.display import display
for sent in test_sentences:
    for res in parser.parse(sent.split()):
        display(res)
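A grammar induced from a corpus only knows the words that occur in that corpus, and NLTK's `ViterbiParser` should reject an input that contains a token the grammar does not cover (it is expected to raise a `ValueError`). A minimal sketch of a pre-check is given below. The helper `uncovered_words` and the toy vocabulary are hypothetical; with the real grammar, the vocabulary could be collected from the terminal symbols on the right-hand sides of its productions.

```python
def uncovered_words(sentence, vocabulary):
    """Return the tokens of a whitespace-tokenized sentence that are not in vocabulary."""
    return [tok for tok in sentence.split() if tok not in vocabulary]

# Hypothetical toy vocabulary, standing in for the terminals of an induced grammar.
vocabulary = {"the", "men", "saw", "a", "car", "."}

sentences = ["the men saw a car .", "the men saw a unicorn ."]

# Keep only the sentences in which every token is known.
parsable = [s for s in sentences if not uncovered_words(s, vocabulary)]

for s in sentences:
    print(s, "->", uncovered_words(s, vocabulary) or "covered")
```

Filtering in this way only avoids the error; it does not make the unknown words parsable.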
This exercise deals with a practical way of further processing the results of a syntactic analysis. From the syntactic dependencies of a text, a semantic representation of the information it contains is to be derived (with the help of a few normalization steps).
For the syntactic analysis, use the DependencyParser of the Stanford CoreNLP suite. The semantic representation of a sentence is a two-place logical predicate whose arguments are filled by the subject and the object. (If either of the two is missing, write None in its place.)
The following example illustrates the desired result:
Input:
I shot an elephant in my pajamas.
The elephant was seen by a giraffe in the desert.
The bird I need is a raven.
The man who saw the raven laughed out loud.
Output:
shot(I, elephant)
seen(giraffe, elephant)
need(I, bird)
raven(bird, None)
saw(man, raven)
laughed(man, None)
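The passive example shows why a normalization step is needed: the grammatical subject of a passive clause is the logical object, and the agent appears in a "by" phrase. The sketch below works on hand-written triples in the `((head, tag), relation, (dependent, tag))` shape that `triples()` of an NLTK dependency graph yields. The triples are an assumption about what the parser might return for this sentence, not actual parser output, and the function name `passive_predicate` is made up for this illustration.

```python
# Hand-written triples for "The elephant was seen by a giraffe."
# (assumed for illustration; real parser output may differ)
triples = [
    (("seen", "VBN"), "nsubjpass", ("elephant", "NN")),
    (("elephant", "NN"), "det", ("The", "DT")),
    (("seen", "VBN"), "auxpass", ("was", "VBD")),
    (("seen", "VBN"), "nmod", ("giraffe", "NN")),
    (("giraffe", "NN"), "case", ("by", "IN")),
    (("giraffe", "NN"), "det", ("a", "DT")),
]

def passive_predicate(triples):
    """Build verb(agent, patient) for a single passive clause."""
    verb = patient = agent = None
    nmods = []   # (head word, dependent) pairs of nominal modifiers
    case = {}    # dependent -> preposition that introduces it
    for head, rel, dep in triples:
        if rel == "nsubjpass":
            verb, patient = head[0], dep[0]
        elif rel == "nmod":
            nmods.append((head[0], dep))
        elif rel == "case":
            case[head] = dep[0]
    # A modifier of the verb that is introduced by "by" is taken to be the agent.
    for head_word, dep in nmods:
        if head_word == verb and case.get(dep) == "by":
            agent = dep[0]
    return f"{verb}({agent}, {patient})"

print(passive_predicate(triples))
```

If the "by" phrase is absent, the first argument stays `None`, which matches the convention stated above.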
In [5]:
from nltk.parse.stanford import StanfordDependencyParser
PATH_TO_CORE = r"C:\Users\Martin\CoreNLP\stanford-corenlp-full-2017-06-09"
jar = PATH_TO_CORE + r"\stanford-corenlp-3.8.0.jar"
model = PATH_TO_CORE + r"\stanford-corenlp-3.8.0-models.jar"
dep_parser = StanfordDependencyParser(
    jar, model,
    model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz"
)
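The paths above are specific to one Windows machine. A minimal sketch of a more portable setup follows, assuming the CoreNLP directory is named in an environment variable. The variable name `CORENLP_HOME` and the helper `corenlp_jars` are assumptions for this illustration; the file names follow the 3.8.0 release used above.

```python
import os
from pathlib import Path

def corenlp_jars(root, version="3.8.0"):
    """Return the paths of the CoreNLP jar and models jar below root as strings."""
    root = Path(root)
    jar = root / f"stanford-corenlp-{version}.jar"
    models = root / f"stanford-corenlp-{version}-models.jar"
    return str(jar), str(models)

# Fall back to a directory relative to the notebook if the variable is not set.
root = os.environ.get("CORENLP_HOME", "stanford-corenlp-full-2017-06-09")
jar, model = corenlp_jars(root)
print(jar)
print(model)
```

The two resulting strings can be passed to `StanfordDependencyParser` in place of the hard-coded values.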
In [6]:
from collections import defaultdict
def generate_predicates_for_sentence(sentence):
    verbs = set()
    sbj = {}
    obj = {}
    sbj_candidates = defaultdict(list)
    case = {}
    for result in dep_parser.raw_parse(sentence):
        relcl_trips = []
        for triple in result.triples():
            if triple[1] == "nsubj":
                verbs.add(triple[0][0])
                sbj[triple[0][0]] = triple[2]
            if triple[1] == "dobj":
                verbs.add(triple[0][0])
                obj[triple[0][0]] = triple[2]
            if triple[1] == "nsubjpass":
                verbs.add(triple[0][0])
                obj[triple[0][0]] = triple[2]
            if triple[0][1].startswith("V"):
                verbs.add(triple[0][0])
            if triple[1] == "nmod":
                sbj_candidates[triple[0][0]].append(triple[2])
            if triple[1] == "acl:relcl":
                verbs.add(triple[2][0])
                relcl_trips.append(triple)
            if triple[1] == "case":
                case[triple[0]] = triple[2]
        for triple in relcl_trips:
            if triple[2][0] in sbj:
                if sbj[triple[2][0]][1] in ["WP", "WDT"]:
                    sbj[triple[2][0]] = triple[0]
                else:
                    obj[triple[2][0]] = triple[0]
            else:
                sbj[triple[2][0]] = triple[0]
        for v in sbj_candidates:
            if v not in sbj:
                for cand in sbj_candidates[v]:
                    if case[cand][0] == "by":
                        sbj[v] = cand
    preds = []
    for v in verbs:
        preds.append(
            v + "(" + sbj.get(v, ("None",))[0] +
            ", " + obj.get(v, ("None",))[0] + ")"
        )
    return preds
In [7]:
for pred in generate_predicates_for_sentence(
    "The man that saw the raven laughed out loud."
):
    print(pred)
In [8]:
def generate_predicates_for_text(text):
    predicates = []
    for sent in nltk.tokenize.sent_tokenize(text):
        predicates.extend(generate_predicates_for_sentence(sent))
    return predicates
In [9]:
text = """
I shot an elephant in my pajamas.
The elephant was seen by a giraffe.
The bird I need is a raven.
The man who saw the raven laughed out loud.
"""
for pred in generate_predicates_for_text(text):
    print(pred)
Parent annotation can substantially improve the performance of a CFG. Write a function that applies this transformation to a given syntax tree. Trees transformed in this way can then in turn be used for grammar induction.
parentHistory is the number of ancestors that are taken into account in addition to the direct parent node. (It may be ignored when solving the exercise.)
parentChar is a separator that is inserted in the new node labels between the original node label and the list of ancestors.
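The labelling scheme described above can be sketched on plain nested tuples before it is applied to NLTK trees. The tuple encoding `(label, child, ...)` and the function name `annotate` are assumptions for this illustration. The label format (separator, then the ancestors joined by "-" inside angle brackets) is one possible choice; under it, the root receives an empty ancestor list.

```python
def annotate(tree, parent_history=0, parent_char="^", parents=()):
    """Return a copy of a nested-tuple tree with ancestor labels appended to each node."""
    label, children = tree[0], tree[1:]
    new_label = label + parent_char + "<" + "-".join(parents) + ">"
    # Children see the current node plus up to parent_history further ancestors.
    child_parents = (label,) + parents[:parent_history]
    return (new_label,) + tuple(
        c if isinstance(c, str)  # word leaves are left untouched
        else annotate(c, parent_history, parent_char, child_parents)
        for c in children
    )

tree = ("S",
        ("NP", ("DET",), ("N",)),
        ("VP", ("V",), ("NP", ("DET",), ("N",))))

print(annotate(tree))
print(annotate(tree, parent_history=1))
```

With `parent_history=0`, the two NP nodes become `NP^<S>` and `NP^<VP>`, so a grammar induced from the annotated trees can assign different expansions to subject and object noun phrases. With `parent_history=1`, the determiner under the object NP is labelled `DET^<NP-VP>`.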
In [10]:
def parent_annotation(tree, parentHistory=0, parentChar="^"):
    def pa_rec(node, parents):
        originalNode = node.label()
        parentString = parentChar + '<' + "-".join(parents) + '>'
        node.set_label(node.label() + parentString)
        for child in node:
            # Leaves of real treebank trees are plain strings and have no label
            if isinstance(child, nltk.Tree):
                pa_rec(child, [originalNode] + parents[:parentHistory])
        return node
    # Note: the tree is annotated in place and also returned
    return pa_rec(tree, [])
In [11]:
test_tree = nltk.Tree(
    "S",
    [
        nltk.Tree("NP", [
            nltk.Tree("DET", []),
            nltk.Tree("N", [])
        ]),
        nltk.Tree("VP", [
            nltk.Tree("V", []),
            nltk.Tree("NP", [
                nltk.Tree("DET", []),
                nltk.Tree("N", [])
            ])
        ])
    ]
)
parent_annotation(
    test_tree
)
In addition to the constructions covered in Exercise 2, negated sentences and complex sentences with conjunctions should now also be handled sensibly.
Input:
I see an elephant.
You didn't see the elephant.
Peter saw the elephant and drank wine.
Desired output:
see(I, elephant)
not_see(You, elephant)
saw(Peter, elephant)
drank(Peter, wine)
It is best to copy your current solution from above and add your extensions here.
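The two extensions can be tried out in isolation on hand-written triples before they are merged into the full function. The sketch below assumes that negation shows up as a `neg` relation on the verb and coordination as a `conj` relation from the first verb to the later one. The triples are assumptions about the parser output for two of the example sentences, and the function name `predicates` is made up for this illustration.

```python
def predicates(triples):
    """Build predicate strings from ((head, tag), relation, (dependent, tag)) triples."""
    verbs = []                  # a list keeps the verbs in order of first appearance
    sbj, obj, conj = {}, {}, {}
    negated = set()
    for (head, _), rel, (dep, _) in triples:
        if rel in ("nsubj", "dobj", "neg", "conj") and head not in verbs:
            verbs.append(head)
        if rel == "nsubj":
            sbj[head] = dep
        elif rel == "dobj":
            obj[head] = dep
        elif rel == "neg":
            negated.add(head)
        elif rel == "conj":
            if dep not in verbs:
                verbs.append(dep)
            conj[dep] = head    # later verb -> earlier verb
    # A coordinated verb without its own subject inherits the subject of its partner.
    for verb in verbs:
        if verb not in sbj and conj.get(verb) in sbj:
            sbj[verb] = sbj[conj[verb]]
    return [
        ("not_" if v in negated else "") + f"{v}({sbj.get(v)}, {obj.get(v)})"
        for v in verbs
    ]

# "You didn't see the elephant." (hand-written, assumed triples)
negated_triples = [
    (("see", "VB"), "nsubj", ("You", "PRP")),
    (("see", "VB"), "aux", ("did", "VBD")),
    (("see", "VB"), "neg", ("n't", "RB")),
    (("see", "VB"), "dobj", ("elephant", "NN")),
    (("elephant", "NN"), "det", ("the", "DT")),
]
# "Peter saw the elephant and drank wine." (hand-written, assumed triples)
coordinated_triples = [
    (("saw", "VBD"), "nsubj", ("Peter", "NNP")),
    (("saw", "VBD"), "dobj", ("elephant", "NN")),
    (("elephant", "NN"), "det", ("the", "DT")),
    (("saw", "VBD"), "cc", ("and", "CC")),
    (("saw", "VBD"), "conj", ("drank", "VBD")),
    (("drank", "VBD"), "dobj", ("wine", "NN")),
]

print(predicates(negated_triples))
print(predicates(coordinated_triples))
```

Collecting the verbs in a list rather than a set keeps the predicates in sentence order, which makes the output easier to compare with the desired output above.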
In [12]:
def generate_predicates_for_sentence(sentence):
    # verbs contains everything that should be treated as verb (i.e. become a predicate)
    verbs = set()
    # sbj (obj) maps from verbs/predicates to its first (second) argument
    sbj = {}
    obj = {}
    # sbj_candidates maps from a verb to potential subjects
    # it is used when the verb-noun relation is not clear (e.g. nmod)
    sbj_candidates = defaultdict(list)
    # in case, we store information about the kind of PP some noun is found in
    case = {}
    # the negated-set should contain everything we find under negation
    negated = set()
    # the conj-dict should connect verbs that are coordinated
    # the mapping goes from later verbs to former ones
    conj = {}
    for result in dep_parser.raw_parse(sentence):
        relcl_trips = []
        for triple in result.triples():
            if triple[1] == "nsubj":
                verbs.add(triple[0][0])
                sbj[triple[0][0]] = triple[2]
            if triple[1] == "dobj":
                verbs.add(triple[0][0])
                obj[triple[0][0]] = triple[2]
            if triple[1] == "nsubjpass":
                verbs.add(triple[0][0])
                obj[triple[0][0]] = triple[2]
            if triple[0][1].startswith("V"):
                verbs.add(triple[0][0])
            if triple[1] == "nmod":
                sbj_candidates[triple[0][0]].append(triple[2])
            if triple[1] == "acl:relcl":
                verbs.add(triple[2][0])
                relcl_trips.append(triple)
            if triple[1] == "case":
                case[triple[0]] = triple[2]
            if triple[1] == "neg":
                negated.add(triple[0][0])
            if triple[1] == "conj":
                verbs.add(triple[0][0])
                verbs.add(triple[2][0])
                conj[triple[2][0]] = triple[0][0]
        for triple in relcl_trips:
            if triple[2][0] in sbj:
                if sbj[triple[2][0]][1] in ["WP", "WDT"]:
                    sbj[triple[2][0]] = triple[0]
                else:
                    obj[triple[2][0]] = triple[0]
            else:
                sbj[triple[2][0]] = triple[0]
        for v in sbj_candidates:
            if v not in sbj:
                for cand in sbj_candidates[v]:
                    if case[cand][0] == "by":
                        sbj[v] = cand
    for v in verbs:
        # if we do not have the subject of a verb which is coordinated with some other verb,
        # they probably share their subject
        if v not in sbj and v in conj and conj[v] in sbj:
            sbj[v] = sbj[conj[v]]
    preds = []
    negator = lambda v: "not_" + v if v in negated else v
    for v in verbs:
        preds.append(
            negator(v) + "(" + sbj.get(v, ("None",))[0] +
            ", " + obj.get(v, ("None",))[0] + ")"
        )
    return preds
In [13]:
for pred in generate_predicates_for_sentence(
    "Peter saw the elephant, drank wine and laughed."
):
    print(pred)
In [14]:
text = """
I see an elephant.
You didn't see the elephant.
Peter saw the elephant and drank wine.
"""
for pred in generate_predicates_for_text(text):
    print(pred)