Task: add missing sentence-level annotations to Exmaralda files

AP told me that 20 of the Exmaralda-annotated MAZ176 files don't have a tiger:sentence tier.

$ for i in *.exb ; do grep -cH "category=\"sentence\"" "$i" ; done | grep ":0$"
maz-00001.exb:0
maz-00002.exb:0
maz-1423.exb:0
maz-1453.exb:0
maz-1679.exb:0
maz-1757.exb:0
maz-1818.exb:0
maz-2316.exb:0
maz-2609.exb:0
maz-2611.exb:0
maz-2669.exb:0
maz-3073.exb:0
maz-3080.exb:0
maz-3110.exb:0
maz-3277.exb:0
maz-3367.exb:0
maz-3377.exb:0
maz-3415.exb:0
maz-3547.exb:0
maz-4031.exb:0
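
The same check can be done from Python by parsing the XML instead of matching a string. This is a minimal sketch, assuming that tiers are stored as `<tier category="...">` elements (as the grep pattern above suggests). `count_tiers` and `files_without_tier` are hypothetical helper names, not part of discoursegraphs.

```python
import glob
import os
import xml.etree.ElementTree as ET


def count_tiers(exb_path, category):
    """Count the <tier> elements in an .exb file with the given category."""
    tree = ET.parse(exb_path)
    return sum(1 for tier in tree.iter('tier')
               if tier.get('category') == category)


def files_without_tier(directory, category='sentence'):
    """Return the sorted names of all .exb files lacking such a tier."""
    paths = sorted(glob.glob(os.path.join(directory, '*.exb')))
    return [os.path.basename(path) for path in paths
            if count_tiers(path, category) == 0]
```

Calling `files_without_tier('.')` in the corpus directory should list the same files as the shell loop, provided the assumption about the tier elements holds.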

The sentence annotations are present in the Exmaralda files I gave to the annotator (every file has exactly one matching line):

$ cd /tmp/pcc176_tiger2exmaralda
$ ack-grep -c "category=\"sentence\"" | cut -d : -f 2 | uniq
1

In [1]:
import os
import glob
from operator import itemgetter, attrgetter

from lxml import etree
from discoursegraphs import DiscourseDocumentGraph, EdgeTypes

In [5]:
from discoursegraphs.readwrite.exmaralda import ExmaraldaDocumentGraph

In [2]:
BROKEN_EXMARALDA_DIR = os.path.expanduser('~/repos/pcc-annis-merged/maz176/information-structure/')
exmaralda_files = glob.glob(os.path.join(BROKEN_EXMARALDA_DIR, '*.exb'))

In [9]:
from discoursegraphs.readwrite import MMAXDocumentGraph
from discoursegraphs.readwrite.exmaralda import write_exb

MMAX_DIR = os.path.expanduser('~/repos/pcc-annis-merged/maz176/coreference/')

In [20]:
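for exmaralda_file in exmaralda_files:
    doc_id = os.path.basename(exmaralda_file).split('.')[0]
    # Read the Exmaralda file, but skip the tiers listed below, so that
    # only the remaining (information structure) tiers end up in the graph.
    edg = ExmaraldaDocumentGraph(
        exmaralda_file,
        ignored_tier_categories=['syntax', 'pos', 'sentence', 'chain', 'markable', 'secedge'])
    # Read the MMAX version of the same document, including its sentence annotations.
    mdg = MMAXDocumentGraph(os.path.join(MMAX_DIR, doc_id+'.mmax'),
                            ignore_sentence_annotations=False)
    # Merge the Exmaralda graph into the MMAX graph and write a new .exb file.
    mdg.merge_graphs(edg)
    write_exb(mdg, '/tmp/{}_with_sentences.exb'.format(doc_id))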
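
The loop above relies on every Exmaralda file having an MMAX counterpart named `doc_id + '.mmax'`. A minimal sketch of how that could be checked before merging, using only the standard library; `pair_with_mmax` is a hypothetical helper and not part of discoursegraphs.

```python
import glob
import os


def pair_with_mmax(exmaralda_dir, mmax_dir):
    """Pair each .exb file with its .mmax counterpart.

    Returns (pairs, missing): pairs is a list of (exb_path, mmax_path)
    tuples, missing is a list of document IDs without an .mmax file.
    """
    pairs, missing = [], []
    for exb_path in sorted(glob.glob(os.path.join(exmaralda_dir, '*.exb'))):
        # splitext() strips only the final extension, e.g. 'maz-1423.exb' -> 'maz-1423'
        doc_id = os.path.splitext(os.path.basename(exb_path))[0]
        mmax_path = os.path.join(mmax_dir, doc_id + '.mmax')
        if os.path.isfile(mmax_path):
            pairs.append((exb_path, mmax_path))
        else:
            missing.append(doc_id)
    return pairs, missing
```

With `pairs, missing = pair_with_mmax(BROKEN_EXMARALDA_DIR, MMAX_DIR)`, a non-empty `missing` list would point at documents to look into before running the merge.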

Get all tier categories from the corpus

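TODO: There are at least two odd tier categories in here, each occurring only once:

    * SEGMENNT-1
    * v

SEGMENTT-1 and SEGMENT4 also occur only once and look like typos as well.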

In [21]:
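from collections import Counter

# select_nodes_by_layer is used below but was not imported in any cell shown here
from discoursegraphs import select_nodes_by_layer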

tier_categories = Counter()

for exmaralda_file in exmaralda_files:
    edg = ExmaraldaDocumentGraph(exmaralda_file)
    for tier_id in select_nodes_by_layer(edg, 'exmaralda:tier'):
        tier_categories[edg.node[tier_id]['exmaralda:category']] += 1

tier_categories.most_common()


Out[21]:
[('syntax', 1077),
 ('sentence', 312),
 ('SEGMENT', 183),
 ('SEGMENT-1', 166),
 ('SEGMENT-2', 111),
 ('markable', 55),
 ('chain', 40),
 ('THETICITY', 20),
 ('TOPIC', 20),
 ('THETICITY-1', 19),
 ('SEGMENT-3', 19),
 ('secedge', 11),
 ('THETICITY-2', 9),
 ('THETICITY-3', 2),
 ('SEGMENT-4', 1),
 ('SENTENCE-1', 1),
 ('SEGMENTT-1', 1),
 ('THETIC-1', 1),
 ('THETIC-2', 1),
 ('SEGMENT4', 1),
 ('SEGMENNT-1', 1),
 ('v', 1)]
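
The categories that occur only once are candidates for typos. A rough sketch for flagging them: each rare category is compared to the frequent ones with `difflib.get_close_matches` from the standard library. `suggest_corrections` is a hypothetical helper, and the thresholds (`max_count=1`, `cutoff=0.8`) are assumptions chosen for this output, not tuned values.

```python
import difflib
from collections import Counter


def suggest_corrections(counts, max_count=1, cutoff=0.8):
    """Map each rare category to the most similar frequent one (or None)."""
    frequent = [cat for cat, n in counts.items() if n > max_count]
    rare = [cat for cat, n in counts.items() if n <= max_count]
    suggestions = {}
    for cat in rare:
        # get_close_matches returns the best candidates first
        matches = difflib.get_close_matches(cat, frequent, n=1, cutoff=cutoff)
        suggestions[cat] = matches[0] if matches else None
    return suggestions


# Counts copied from Out[21]
tier_counts = Counter({
    'syntax': 1077, 'sentence': 312, 'SEGMENT': 183, 'SEGMENT-1': 166,
    'SEGMENT-2': 111, 'markable': 55, 'chain': 40, 'THETICITY': 20,
    'TOPIC': 20, 'THETICITY-1': 19, 'SEGMENT-3': 19, 'secedge': 11,
    'THETICITY-2': 9, 'THETICITY-3': 2, 'SEGMENT-4': 1, 'SENTENCE-1': 1,
    'SEGMENTT-1': 1, 'THETIC-1': 1, 'THETIC-2': 1, 'SEGMENT4': 1,
    'SEGMENNT-1': 1, 'v': 1})

suggestions = suggest_corrections(tier_counts)
```

For these counts, `SEGMENNT-1` and `SEGMENTT-1` are both matched to `SEGMENT-1`, while `v` gets no suggestion. The suggestions are only hints: a rare but valid category such as `SEGMENT-4` is flagged too, so each case still needs a manual look at the file.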
