Attelo reads its input files into “datapacks”. Generally speaking, we have one datapack per document, so reading in a corpus means reading multiple datapacks. These are gathered into a multipack, i.e. a dictionary of datapacks (or perhaps a fancier structure in future attelo versions).
In [29]:
from __future__ import print_function
from os import path as fp
from attelo.io import (load_multipack)
CORPUS_DIR = 'example-corpus'
PREFIX = fp.join(CORPUS_DIR, 'tiny')
# load the data into a multipack
mpack = load_multipack(PREFIX + '.edus',
                       PREFIX + '.pairings',
                       PREFIX + '.features.sparse',
                       PREFIX + '.features.sparse.vocab',
                       verbose=True)
As we can see below, multipacks are dictionaries from document names to dpacks.
In [30]:
for dname, dpack in mpack.items():
    about = ("Doc: {name} |"
             " edus: {edus}, pairs: {pairs},"
             " features: {feats}")
    print(about.format(name=dname,
                       edus=len(dpack.edus),
                       pairs=len(dpack),
                       feats=dpack.data.shape))
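To make the shape of a multipack concrete without needing the corpus files, here is a minimal sketch that mimics it with plain Python objects. `ToyPack`, `summarise` and the document contents are hypothetical stand-ins of our own, not attelo classes or data; only the dictionary-of-packs layout comes from the description above.

```python
from __future__ import print_function
from collections import namedtuple

# Hypothetical stand-in for a datapack; NOT the attelo class
ToyPack = namedtuple('ToyPack', ['edus', 'pairings'])


def summarise(mpack):
    """Return one summary line per document, sorted by document name."""
    lines = []
    for dname in sorted(mpack):
        dpack = mpack[dname]
        lines.append("Doc: {name} | edus: {edus}, pairs: {pairs}"
                     .format(name=dname,
                             edus=len(dpack.edus),
                             pairs=len(dpack.pairings)))
    return lines


# made-up documents: a multipack maps document names to packs
toy_mpack = {
    'doc1': ToyPack(edus=['ROOT', 'd1_e1', 'd1_e2'],
                    pairings=[('ROOT', 'd1_e1'), ('d1_e1', 'd1_e2')]),
    'doc2': ToyPack(edus=['ROOT', 'd2_e1'],
                    pairings=[('ROOT', 'd2_e1')]),
}

for line in summarise(toy_mpack):
    print(line)
```

Sorting the document names is only there to make the output order predictable; a real multipack may well be iterated in whatever order the dictionary gives.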
Datapacks store everything we know about a document:
In [52]:
# pick an arbitrary pack; list() keeps this working on Python 3,
# where dict.values() returns a view that cannot be indexed
dpack = list(mpack.values())[0]
print("LABELS ({num}): {lbls}".format(num=len(dpack.labels),
                                      lbls=", ".join(dpack.labels)))
print()
# note that attelo will by convention insert __UNK__ into the list of
# labels, at position 0. It also requires that UNRELATED and ROOT be
# in the list of available labels
for edu in dpack.edus[:3]:
    print(edu)
print("...\n")
for i, (edu1, edu2) in enumerate(dpack.pairings[:3]):
    lnum = dpack.target[i]
    lbl = dpack.get_label(lnum)
    feats = dpack.data[i, :].toarray()[0]
    print('PAIR', i, edu1.id, edu2.id, '\t|', lbl, '\t|', feats)
print("...\n")
for j, vocab in enumerate(dpack.vocab[:3]):
    print('FEATURE', j, vocab)
print("...\n")
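The loop over pairings relies on the pairings, targets and feature rows being aligned by index: entry i of each describes the same EDU pair, and the vocabulary names the feature columns. The sketch below shows that layout with plain lists. All names and values here are made up, and the assumption that a target number indexes directly into the label list is ours, for illustration only; `get_label` in attelo is the authority on the real mapping.

```python
from __future__ import print_function

# made-up label list following the convention noted in the cell above:
# __UNK__ at position 0, with UNRELATED and ROOT also present
labels = ['__UNK__', 'UNRELATED', 'ROOT', 'elaboration']

# parallel sequences: entry i of each describes the same pair
pairings = [('ROOT', 'e1'), ('e1', 'e2'), ('e2', 'e1')]
target = [2, 3, 1]
data = [[1, 0], [0, 1], [1, 1]]      # dense toy rows, one column per feature
vocab = ['is_root', 'same_speaker']  # hypothetical feature names


def toy_get_label(lnum):
    """Assumed lookup: a target number is an index into labels."""
    return labels[lnum]


def describe(i):
    """Gather everything known about pair i from the parallel lists."""
    edu1, edu2 = pairings[i]
    # keep the names of the features that are switched on for this pair
    active = [name for name, val in zip(vocab, data[i]) if val]
    return (edu1, edu2, toy_get_label(target[i]), active)


for i in range(len(pairings)):
    print('PAIR', i, describe(i))
```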
There are a couple of datapack variants to be aware of:
graph entry. We will explore weighted datapacks in the parser tutorial.
This concludes our tour of attelo datapacks. In other tutorials we will explore some of the uses of datapacks, namely as the input and output of our parsers.