An attelo parser converts “documents” (here: EDUs with some metadata) into graphs (with EDUs as nodes and relation labels between them). In API terms, a parser is something that enriches datapacks, progressively adding or stripping away information until we get a full graph.
Parsers follow the scikit-learn estimator and transformer conventions, i.e. they have a fit function to learn a model from training data and a transform function to convert (in our case) datapacks to enriched datapacks.
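To make the convention concrete, here is a minimal, purely illustrative sketch of the fit/transform shape a parser takes. `ToyParser` and its trivial "datapacks" are hypothetical, not part of attelo's API:

```python
class ToyParser(object):
    """Sketch of the scikit-learn-style parser convention."""

    def fit(self, dpacks, targets):
        # learn a "model" from the training data
        # (here we just remember how many packs we saw)
        self._n_train = len(dpacks)
        return self  # convention: fit returns self, so calls can chain

    def transform(self, dpack):
        # return an enriched version of the datapack
        # (here: trivially tag it as parsed)
        return (dpack, 'parsed')

parser = ToyParser().fit(['pack-a', 'pack-b'], [None, None])
print(parser.transform('pack-c'))  # ('pack-c', 'parsed')
```

A real attelo parser does the same dance, but its transform adds genuine information (attachment predictions, labels) to the datapack.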
In [34]:
from __future__ import print_function
from os import path as fp
from attelo.io import (load_multipack)
CORPUS_DIR = 'example-corpus'
PREFIX = fp.join(CORPUS_DIR, 'tiny')
# load the data into a multipack
mpack = load_multipack(PREFIX + '.edus',
                       PREFIX + '.pairings',
                       PREFIX + '.features.sparse',
                       PREFIX + '.features.sparse.vocab',
                       verbose=True)
We'll set aside one of the datapacks to test with, leaving the other two for training. We do this by hand for this simple example, but you may prefer to use the helper functions in attelo.fold when working with real data.
In [35]:
test_dpack = list(mpack.values())[0]
train_mpack = {k: mpack[k] for k in list(mpack.keys())[1:]}
print('multipack entries:', len(mpack))
print('train entries:', len(train_mpack))
In [36]:
def print_results(dpack):
    'summarise parser results'
    for i, (edu1, edu2) in enumerate(dpack.pairings):
        wanted = dpack.get_label(dpack.target[i])
        got = dpack.get_label(dpack.graph.prediction[i])
        print(i, edu1.id, edu2.id, '\t|', got, '\twanted:', wanted)
In [40]:
from attelo.decoding.baseline import (LastBaseline)
from attelo.learning import (SklearnAttachClassifier)
from attelo.parser.attach import (AttachPipeline)
from sklearn.linear_model import (LogisticRegression)
learner = SklearnAttachClassifier(LogisticRegression())
decoder = LastBaseline()
parser1 = AttachPipeline(learner=learner,
                         decoder=decoder)
# train the parser
train_dpacks = train_mpack.values()
train_targets = [x.target for x in train_dpacks]
parser1.fit(train_dpacks, train_targets)
# now run on a test pack
dpack = parser1.transform(test_dpack)
print_results(dpack)
In [41]:
from attelo.learning import (SklearnLabelClassifier)
from attelo.parser.label import (SimpleLabeller)
from sklearn.linear_model import (LogisticRegression)
learner = SklearnLabelClassifier(LogisticRegression())
parser2 = SimpleLabeller(learner=learner)
# train the parser
parser2.fit(train_dpacks, train_targets)
# now run on a test pack
dpack = parser2.transform(test_dpack)
print_results(dpack)
That doesn't quite look right: now we have labels, but none of our edges are UNRELATED. This is because the simple labeller applies labels to all unknown edges. What we need is a way to combine the attach and label parsers in a parsing pipeline.
A parsing pipeline is a parser that combines other parsers in sequence. For purposes of learning/fitting, the individual steps can be thought of as running in parallel (in practice, they are fitted in sequence). For transforming, though, they run in order: a pipeline refines a datapack over the course of multiple parsers.
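The fit-independently, transform-in-sequence behaviour can be sketched in a few lines. `ToyPipeline` and the `Append` steps below are hypothetical stand-ins, not attelo's actual implementation:

```python
class Append(object):
    """Toy step: its transform just appends a tag to a list 'datapack'."""
    def __init__(self, tag):
        self.tag = tag

    def fit(self, dpacks, targets):
        return self

    def transform(self, dpack):
        return dpack + [self.tag]


class ToyPipeline(object):
    """Sketch of a parsing pipeline over (name, parser) steps."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, dpacks, targets):
        for _name, parser in self.steps:
            parser.fit(dpacks, targets)  # each step sees the same training data
        return self

    def transform(self, dpack):
        for _name, parser in self.steps:
            dpack = parser.transform(dpack)  # each step refines the last one's output
        return dpack


pipe = ToyPipeline([('attach', Append('attached')),
                    ('label', Append('labelled'))])
pipe.fit([], [])
print(pipe.transform([]))  # ['attached', 'labelled']
```

The ordering matters: the labelling step only makes sense once the attachment step has decided which edges exist, which is exactly why the attach/label combination below is built as a pipeline.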
In [42]:
from attelo.parser.pipeline import (Pipeline)
# this is actually attelo.parser.full.PostlabelPipeline
parser3 = Pipeline(steps=[('attach', parser1),
                          ('label', parser2)])
parser3.fit(train_dpacks, train_targets)
dpack = parser3.transform(test_dpack)
print_results(dpack)
We have now seen some basic attelo parsers, how they use the scikit-learn fit/transform idiom, and how we can combine them with pipelines. In future tutorials we'll break some of the parsers down into their constituent parts (notice that the attach pipeline is itself a pipeline) and explore the process of writing parsers of our own.