AP told me that 20 of the Exmaralda-annotated MAZ176 files don't have a tiger:sentence
tier.
$ for i in *.exb ; do grep -cH "category=\"sentence\"" "$i" ; done | grep ":0$"
maz-00001.exb:0
maz-00002.exb:0
maz-1423.exb:0
maz-1453.exb:0
maz-1679.exb:0
maz-1757.exb:0
maz-1818.exb:0
maz-2316.exb:0
maz-2609.exb:0
maz-2611.exb:0
maz-2669.exb:0
maz-3073.exb:0
maz-3080.exb:0
maz-3110.exb:0
maz-3277.exb:0
maz-3367.exb:0
maz-3377.exb:0
maz-3415.exb:0
maz-3547.exb:0
maz-4031.exb:0
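The same check can be run from Python, which is handy inside this notebook. The following is a minimal sketch, not part of the original workflow: `files_without_tier` is a hypothetical helper name, and it assumes each annotation layer is stored as a `<tier category="...">` element, as the `grep` pattern above suggests.

```python
import glob
import os
import xml.etree.ElementTree as ET


def files_without_tier(directory, category='sentence'):
    """Return sorted basenames of .exb files lacking a tier of the given category.

    Hypothetical helper; assumes tiers are <tier category="..."> elements.
    """
    missing = []
    for path in sorted(glob.glob(os.path.join(directory, '*.exb'))):
        root = ET.parse(path).getroot()
        # Collect the category attribute of every tier in this file.
        categories = {tier.get('category') for tier in root.iter('tier')}
        if category not in categories:
            missing.append(os.path.basename(path))
    return missing
```

Calling `files_without_tier('.')` in the directory of annotated files should list the same files as the shell loop, provided every file is well-formed XML.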
The sentence tier is present in all of the Exmaralda files I originally gave to the annotator (each file contains exactly one):
$ /tmp/pcc176_tiger2exmaralda $ ack-grep -c "category=\"sentence\"" | cut -d : -f 2 | uniq
1
In [1]:
import os
import glob
from operator import itemgetter, attrgetter
from lxml import etree
from discoursegraphs import DiscourseDocumentGraph, EdgeTypes
In [5]:
from discoursegraphs.readwrite.exmaralda import ExmaraldaDocumentGraph
In [2]:
BROKEN_EXMARALDA_DIR = os.path.expanduser('~/repos/pcc-annis-merged/maz176/information-structure/')
exmaralda_files = glob.glob(os.path.join(BROKEN_EXMARALDA_DIR, '*.exb'))
In [9]:
from discoursegraphs.readwrite import MMAXDocumentGraph
from discoursegraphs.readwrite.exmaralda import write_exb
MMAX_DIR = os.path.expanduser('/home/arne/repos/pcc-annis-merged/maz176/coreference/')
In [20]:
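for exmaralda_file in exmaralda_files:
    doc_id = os.path.basename(exmaralda_file).split('.')[0]
    # Read the annotated Exmaralda file, skipping the tiers listed below.
    edg = ExmaraldaDocumentGraph(
        exmaralda_file,
        ignored_tier_categories=['syntax', 'pos', 'sentence', 'chain', 'markable', 'secedge'])
    # Read the MMAX file of the same document, keeping its sentence annotations.
    mdg = MMAXDocumentGraph(os.path.join(MMAX_DIR, doc_id+'.mmax'),
                            ignore_sentence_annotations=False)
    # Merge the Exmaralda graph into the MMAX graph and write the result.
    mdg.merge_graphs(edg)
    write_exb(mdg, '/tmp/{}_with_sentences.exb'.format(doc_id))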
TODO: There are two odd tier categories in here:
* SEGMENNT-1
* v
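To find out which files the two odd categories come from, one option is to map every tier category to the files that use it. This is only a sketch: `files_by_tier_category` is a hypothetical helper, and it reads the `.exb` files as plain XML (assuming `<tier category="...">` elements) rather than going through `discoursegraphs`.

```python
import glob
import os
import xml.etree.ElementTree as ET
from collections import defaultdict


def files_by_tier_category(directory):
    """Map each tier category to the sorted .exb basenames that contain it.

    Hypothetical helper; assumes tiers are <tier category="..."> elements.
    """
    files = defaultdict(set)
    for path in glob.glob(os.path.join(directory, '*.exb')):
        for tier in ET.parse(path).getroot().iter('tier'):
            files[tier.get('category')].add(os.path.basename(path))
    return {category: sorted(names) for category, names in files.items()}
```

For example, `files_by_tier_category(BROKEN_EXMARALDA_DIR).get('SEGMENNT-1')` would return the files carrying the misspelled tier, or `None` if no file has it.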
In [21]:
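from collections import Counter
# Assumed import: select_nodes_by_layer is not imported in any cell shown here.
from discoursegraphs import select_nodes_by_layer
tier_categories = Counter()
for exmaralda_file in exmaralda_files:
    edg = ExmaraldaDocumentGraph(exmaralda_file)
    # Count how often each tier category occurs across all files.
    for tier_id in select_nodes_by_layer(edg, 'exmaralda:tier'):
        tier_categories[edg.node[tier_id]['exmaralda:category']] += 1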
tier_categories.most_common()
Out[21]: