This notebook illustrates how LAF-Fabric works. LAF-Fabric is a tool for analyzing the data inside LAF (Linguistic Annotation Framework) resources. We use it on a particular LAF resource: the Hebrew Bible with linguistic annotations. The software that produces the Hebrew Bible in LAF is part of LAF-Fabric (see emdros2laf).
NB 1. This is a static copy of the Gender notebook. You can download it, and if you have IPython and LAF-Fabric installed, you can run it yourself. You can also create many more notebooks like this one, looking for patterns in the Hebrew Bible.
NB 2. All software involved is open source, and the data is Open Access (not for commercial use).
Hebrew words are masculine, feminine, or of unknown gender.
We want to plot the percentage of masculine and feminine words per chapter.
In the Hebrew LAF data, some nodes are annotated as word, and some nodes as chapter
(there are many more kinds of node, of course).
The names of chapters and the genders of words are coded as features inside annotations to these nodes.
The features we need are present in an annotation space named etcbc4 (after the name and version of this LAF resource).
The chapter features are labeled with sft and the other features with ft.
When LAF-Fabric compiles features into binary data, it discards the annotations in which the features occur, but it retains the annotation space and label as a double prefix to the feature name.
LAF-Fabric thus remembers those features by their fully qualified names: etcbc4:ft.gn, etcbc4:sft.chapter, etc.
There may also be annotations without feature contents.
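To see how such qualified names work out in practice: the pattern below is inferred from the example F.etcbc4_db_otype that occurs later in this notebook, so take it as an illustration, not as documentation:

etcbc4:ft.gn        ->  F.etcbc4_ft_gn
etcbc4:sft.chapter  ->  F.etcbc4_sft_chapter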
The next cell loads the required libraries and creates a task processor.
In [1]:
import sys
import collections
from laf.fabric import LafFabric
fabric = LafFabric(verbose='DETAIL')
The processor needs data. Here is where we say what data to load. We do not need the XML identifiers as they appear in the original LAF resource. But we do need a few node features: the ones that give us the gender of the words, the numbers of the chapters, and the books in which those chapters are contained.
The load function actually pulls that data in, and it will take a few seconds.
It needs to know the name of the source. This name corresponds to a subdirectory in your work_dir.
The '--' means that we do not pull in an annox (an extra annotation package). If you want to do that, this is the place to give the name of such a package, which must be the name of a subdirectory inside the annotations directory in your work_dir.
Then gender is just a name we choose to give to this task. This name determines where the log file and output (if any) end up on the file system: a subdirectory gender inside the source directory inside your output_dir.
The last argument to load() is a dictionary of data items to load.
The primary key indicates whether the primary data itself must be loaded. Tasks can then use methods to find the primary data attached to a node. For the Hebrew data this is hardly necessary, because the words carry their textual information as features.
The xmlids are tables mapping nodes and edges to the identifiers they have in the original LAF source. Most tasks do not need them. Only a task that links new annotations to nodes and edges and writes the result as an additional LAF file needs to know the original identifiers.
The features to be loaded are specified by two strings: one for node features and one for edge features. Data will be loaded for all these features; the data of all other features will be unloaded, if still loaded.
Caution: missing feature data.
If you forget to declare a feature in the load call and then use it in your task, LAF-Fabric will stop your task and shout error messages at you. If you declare features that do not exist in the LAF data, you just get a warning; but if you then try to use such a feature, you also get a loud error.
In [2]:
fabric.load('etcbc4b', '--', 'gender',
    {
        "primary": False,
        "xmlids": {"node": False, "edge": False},
        "features": ("otype gn chapter book", ""),
    })
exec(fabric.localnames.format(var='fabric'))
In order to write an efficient task, it is convenient to import the names of the API methods as local variables: name lookup in Python is fastest for local names, and it makes the code much cleaner. The exec() call above, built from fabric.localnames, does exactly this.
See the API reference for full documentation.
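If you are curious which names this puts at your disposal, you can print the statement string before executing it:

print(fabric.localnames.format(var='fabric'))

This shows exactly the assignment statements that the exec() above performs.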
The object F is all you want to know about features and are not afraid to ask. For each feature that you have declared, it has a member with a handy name. For example,
F.etcbc4_db_otype
is a feature object that corresponds to the LAF feature given in annotations in the annotation space etcbc4, with label db and name otype.
It is a node feature; for an edge feature we would have to use FE instead of F.
You do not have to mention the annotation space and label: LAF-Fabric will work out what they should be, given the available features. If there is ambiguity, LAF-Fabric will tell you, and you can supply more fully qualified names.
You can look up the value of this feature for a node n by saying
F.otype.v(n)
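In the same way, the gender of a word node n comes out as follows (a sketch, assuming the features loaded above; the loop below only ever tests for 'm' and 'f'):

gender = F.gn.v(n)   # 'm' for masculine, 'f' for feminine, something else for unknown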
NN() is your method if you want to walk through all the nodes, possibly skipping some.
It is an iterator that yields a new node every time it is called.
The order is the so-called primary data order, which will be explained below.
The test, value and values arguments are optional.
If given, test should be a callable with one argument, returning a string;
value should be a string, and values a list of strings.
test will be called for each passing node,
and if the value returned is not equal to the given value and not a member of values,
the node will be skipped.
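For instance, you could walk only the word nodes (a sketch that assumes the arguments are passed by the names just described):

n_words = 0
for node in NN(test=F.otype.v, value='word'):
    n_words += 1

The task below does not filter this way: it inspects otype inside the loop, because it needs both the word and the chapter nodes.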
msg() issues a timed message to the standard error and to the log file.
infile() creates an open file handle for reading a file in your task output directory.
outfile() creates an open file handle for writing a file in your task output directory.
my_file() gives the full path to a file in your task output directory.
In [6]:
print(F_all)
print(FE_all)
We need to get an output file to write to. The simple method outfile() provides a handle to a file open for writing. The file will be created in your output_dir, under the subdir etcbc4b, under the subdir gender.
In [3]:
table = outfile('table.tsv')
All open files (reading and writing) will be closed with
close()
below.
Here we loop over a bunch of nodes (in fact over all nodes), in a convenient document order.
There is an implicit partial order on nodes. The short story is: nodes that are linked to primary data inherit the order present in the primary data. The long story is a bit more complicated, since nodes may be attached to multiple ranges of primary data.
See node order for details. If you do not want the details, it is enough to know that embedding nodes always come before embedded nodes: if one node is attached to a big stretch of primary data and a second node to a part of that stretch, then the node with the bigger attachment comes first.
When there is no inclusion either way, and the start and end points are the same, the order is left undefined.
This order is what makes the task below work: a chapter node is yielded before the words it contains, so by the time the next chapter node arrives, all words of the previous chapter have been counted.
We initialize the counters in which we store the word counts. We keep track of the chapter we are in and accumulate counts of all words, masculine words, and feminine words. For each chapter we append entries to the ch, m and f lists.
Note also the progress messages after each chapter.
In [4]:
stats = [0, 0, 0]   # counts of all, masculine, feminine words in the current chapter
cur_chapter = None
cur_book = None
ch = []             # chapter labels
m = []              # percentage of masculine words per chapter
f = []              # percentage of feminine words per chapter
In [5]:
for node in NN():
    otype = F.otype.v(node)
    if otype == "word":
        stats[0] += 1
        if F.gn.v(node) == "m":
            stats[1] += 1
        elif F.gn.v(node) == "f":
            stats[2] += 1
    elif otype == "chapter":
        # flush the counts of the previous chapter, if there is one;
        # otherwise write the header row first
        if cur_chapter is not None:
            masc = 0 if not stats[0] else 100 * float(stats[1]) / stats[0]
            fem = 0 if not stats[0] else 100 * float(stats[2]) / stats[0]
            ch.append(cur_chapter)
            m.append(masc)
            f.append(fem)
            table.write("{},{},{}\n".format(cur_chapter, masc, fem))
        else:
            table.write("{},{},{}\n".format('book chapter', 'masculine', 'feminine'))
        this_book = F.book.v(node)
        this_chapnum = F.chapter.v(node)
        this_chapter = "{} {}".format(this_book, this_chapnum)
        if this_book != cur_book:
            sys.stderr.write("\n{}".format(this_book))
            cur_book = this_book
        sys.stderr.write(" {}".format(this_chapnum))
        stats = [0, 0, 0]
        cur_chapter = this_chapter
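One caveat: the loop flushes a chapter's counts only when the next chapter node arrives, so the counts of the very last chapter are never written. If you need that chapter too, a minimal flush after the loop (mirroring the computation inside it) would be the following; it must run before the close() below:

if cur_chapter is not None:
    masc = 0 if not stats[0] else 100 * float(stats[1]) / stats[0]
    fem = 0 if not stats[0] else 100 * float(stats[2]) / stats[0]
    ch.append(cur_chapter)
    m.append(masc)
    f.append(fem)
    table.write("{},{},{}\n".format(cur_chapter, masc, fem))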
We need to close open files. This is exactly what the next statement does.
In [7]:
close()
Everything is still in memory. Now it is time to generate a graphical representation of the data.
The matplotlib package is full of instruments to do that.
But let us first have a look at a few rows of the data itself.
In [8]:
import pandas
import matplotlib.pyplot as plt
from IPython.display import display
pandas.set_option('display.notebook_repr_html', True)
%matplotlib inline
The files that have been generated reside in a subdirectory of your output directory. You can easily refer to them as follows:
In [9]:
table_file = my_file('table.tsv')
df = pandas.read_csv(table_file)
In [10]:
df.head(100)
Out[10]:
Now let's get matplotlib to work. Here we just show a line graph of 20 chapters. If you want to see another series of chapters, just modify the start and end variables below and execute the cell again by pressing Shift+Enter. You can repeat this as often as you like without re-running the earlier steps.
In [11]:
x = range(len(ch))
start = 100   # index of the first chapter to plot
end = 120     # index one past the last chapter to plot
fig = plt.figure()
plt.plot(x[start:end], m[start:end], 'b-', x[start:end], f[start:end], 'r-')
plt.axis([start, end, 0, 50])
plt.xticks(x[start:end], ch[start:end], rotation='vertical')
plt.margins(0.2)
plt.subplots_adjust(bottom=0.15);
plt.title('gender');
Finally, save the chart.
In [12]:
fig.savefig('gender.png')
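This writes gender.png into the notebook's current directory. If you would rather keep it next to table.tsv in the task output directory, you could pass a full path instead, using the my_file() method described above:

fig.savefig(my_file('gender.png'))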