Picking up where we left off...

Saving Python objects

import pickle

with open('data/DBG_tagged_baseline.pickle','wb') as f:
    pickle.dump(tagged_text_baseline, f)

with open('data/DBG_tagged_clkt.pickle','wb') as f:
    pickle.dump(tagged_text_cltk, f)

with open('data/DBG_tagged_nltk.pickle','wb') as f:
    pickle.dump(tagged_text_nltk, f)

Loading Python objects back into memory

In [2]:
import pickle

with open('data/DBG_tagged_baseline.pickle','rb') as f:
    tagged_text_baseline = pickle.load(f)
with open('data/DBG_tagged_clkt.pickle','rb') as f:
    tagged_text_cltk = pickle.load(f)
with open('data/DBG_tagged_nltk.pickle','rb') as f:
    tagged_text_nltk = pickle.load(f)
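To see that saving and loading really returns the same object, here is a minimal round-trip sketch. It uses a toy list of (token, tag) tuples and a temporary directory, both of which are assumptions made for illustration, so nothing under data/ is touched.

```python
import os
import pickle
import tempfile

# Toy stand-in for one of the tagged texts: a list of (token, tag) tuples.
tagged_sample = [("Gallia", "LOC"), ("est", "O"), ("omnis", "O")]

# Work in a temporary directory that is removed automatically afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tagged_sample.pickle")

    # 'wb' = write in binary mode, which pickle requires
    with open(path, "wb") as f:
        pickle.dump(tagged_sample, f)

    # 'rb' = read in binary mode
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored object compares equal to the original one.
print(restored == tagged_sample)
```

Keep in mind that unpickling can execute arbitrary code, so only load pickle files that you created yourself or that come from a source you trust.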

In [491]:

 ('PRIMUS', 'O'),
 ('Gallia', 'Entity'),
 ('est', 'O'),
 ('omnis', 'O'),
 ('divisa', 'O'),
 ('in', 'O'),
 ('partes', 'O'),
 ('tres,', 'O'),
 ('quarum', 'O')]

In [492]:
for baseline_out, cltk_out, nltk_out in zip(tagged_text_baseline[:20],
                                            tagged_text_cltk[:20],
                                            tagged_text_nltk[:20]):
    print("Baseline: %s\nCLTK: %s\nNLTK: %s\n" % (baseline_out,
                                                  cltk_out,
                                                  nltk_out))

Baseline: ('COMMENTARIUS', 'O')

Baseline: ('PRIMUS', 'O')

Baseline: ('Gallia', 'Entity')
CLTK: ('Gallia', 'Entity')
NLTK: ('Gallia', 'LOC')

Baseline: ('est', 'O')
CLTK: ('est', 'O')
NLTK: ('est', 'O')

Baseline: ('omnis', 'O')
CLTK: ('omnis', 'O')
NLTK: ('omnis', 'O')

Baseline: ('divisa', 'O')
CLTK: ('divisa', 'O')
NLTK: ('divisa', 'O')

Baseline: ('in', 'O')
CLTK: ('in', 'O')
NLTK: ('in', 'O')

Baseline: ('partes', 'O')
CLTK: ('partes', 'O')
NLTK: ('partes', 'O')

Baseline: ('tres,', 'O')
CLTK: ('tres,', 'O')
NLTK: ('tres,', 'O')

Baseline: ('quarum', 'O')
CLTK: ('quarum', 'O')
NLTK: ('quarum', 'O')

Baseline: ('unam', 'O')
CLTK: ('unam', 'O')
NLTK: ('unam', 'O')

Baseline: ('incolunt', 'O')
CLTK: ('incolunt', 'O')
NLTK: ('incolunt', 'O')

Baseline: ('Belgae,', 'Entity')
CLTK: ('Belgae,', 'O')
NLTK: ('Belgae,', 'O')

Baseline: ('aliam', 'O')
CLTK: ('aliam', 'O')
NLTK: ('aliam', 'O')

Baseline: ('Aquitani,', 'Entity')
CLTK: ('Aquitani,', 'O')
NLTK: ('Aquitani,', 'O')

Baseline: ('tertiam', 'O')
CLTK: ('tertiam', 'O')
NLTK: ('tertiam', 'O')

Baseline: ('qui', 'O')
CLTK: ('qui', 'O')
NLTK: ('qui', 'O')

Baseline: ('ipsorum', 'O')
CLTK: ('ipsorum', 'O')
NLTK: ('ipsorum', 'O')

Baseline: ('lingua', 'O')
CLTK: ('lingua', 'O')
NLTK: ('lingua', 'O')

Baseline: ('Celtae,', 'Entity')
CLTK: ('Celtae,', 'O')
NLTK: ('Celtae,', 'O')

Gathering basic statistics about the extracted NEs

Let's say we want to:

  • know how many entities of each type were extracted
  • look at the most frequent NEs

Counting entities by type

Of the three methods we used in the previous session to extract NEs from Caesar's De Bello Gallico, let's take the output of NLTK. It is the only one with more granular entity types, whereas the other two have just a generic Entity type.

In [480]:

 ('PRIMUS', 'LOC'),
 ('Gallia', 'LOC'),
 ('est', 'O'),
 ('omnis', 'O'),
 ('divisa', 'O'),
 ('in', 'O'),
 ('partes', 'O'),
 ('tres,', 'O'),
 ('quarum', 'O'),
 ('unam', 'O'),
 ('incolunt', 'O'),
 ('Belgae,', 'O'),
 ('aliam', 'O'),
 ('Aquitani,', 'O'),
 ('tertiam', 'O'),
 ('qui', 'O'),
 ('ipsorum', 'O'),
 ('lingua', 'O'),
 ('Celtae,', 'O'),
 ('nostra', 'O'),
 ('Galli', 'PER'),
 ('appellantur.', 'PER'),
 ('Hi', 'PER'),
 ('omnes', 'O'),
 ('lingua,', 'O'),
 ('institutis,', 'O'),
 ('legibus', 'O'),
 ('inter', 'O'),
 ('se', 'O'),
 ('differunt.', 'O'),
 ('Gallos', 'PER'),
 ('ab', 'O'),
 ('Aquitanis', 'O'),
 ('Garumna', 'O'),
 ('flumen,', 'O'),
 ('a', 'O'),
 ('Belgis', 'PER'),
 ('Matrona', 'PER'),
 ('et', 'O'),
 ('Sequana', 'O'),
 ('dividit.', 'O'),
 ('Horum', 'O'),
 ('omnium', 'O'),
 ('fortissimi', 'O'),
 ('sunt', 'O'),
 ('Belgae,', 'O'),
 ('propterea', 'O'),
 ('quod', 'O'),
 ('a', 'O'),
 ('cultu', 'O'),
 ('atque', 'O'),
 ('humanitate', 'O'),
 ('provinciae', 'O'),
 ('longissime', 'O'),
 ('absunt,', 'O'),
 ('minimeque', 'O'),
 ('ad', 'O'),
 ('eos', 'O'),
 ('mercatores', 'O'),
 ('saepe', 'O'),
 ('commeant', 'O'),
 ('atque', 'O'),
 ('ea', 'O'),
 ('quae', 'O'),
 ('ad', 'O'),
 ('effeminandos', 'O'),
 ('animos', 'O'),
 ('pertinent', 'O'),
 ('important,', 'O'),
 ('proximique', 'O'),
 ('sunt', 'O'),
 ('Germanis,', 'O'),
 ('qui', 'O'),
 ('trans', 'O'),
 ('Rhenum', 'O'),
 ('incolunt,', 'O'),
 ('quibuscum', 'O'),
 ('continenter', 'O'),
 ('bellum', 'O'),
 ('gerunt.', 'O'),
 ('Qua', 'O'),
 ('de', 'O'),
 ('causa', 'O'),
 ('Helvetii', 'O'),
 ('quoque', 'O'),
 ('reliquos', 'O'),
 ('Gallos', 'PER'),
 ('virtute', 'O'),
 ('praecedunt,', 'O'),
 ('quod', 'O'),
 ('fere', 'O'),
 ('cotidianis', 'O'),
 ('proeliis', 'O'),
 ('cum', 'O'),
 ('Germanis', 'O'),
 ('contendunt,', 'O'),
 ('cum', 'O'),
 ('aut', 'O'),
 ('suis', 'O'),
 ('finibus', 'O'),
 ('eos', 'O'),
 ('prohibent', 'O'),
 ('aut', 'O'),
 ('ipsi', 'O'),
 ('in', 'O'),
 ('eorum', 'O'),
 ('finibus', 'O'),
 ('bellum', 'O'),
 ('gerunt.', 'O'),
 ('[Eorum', 'O'),
 ('una,', 'O'),
 ('pars,', 'O'),
 ('quam', 'O'),
 ('Gallos', 'PER'),
 ('obtinere', 'O'),
 ('dictum', 'O'),
 ('est,', 'O'),
 ('initium', 'O'),
 ('capit', 'O'),
 ('a', 'O'),
 ('flumine', 'O'),
 ('Rhodano,', 'O'),
 ('continetur', 'O'),
 ('Garumna', 'O'),
 ('flumine,', 'O'),
 ('Oceano,', 'O'),
 ('finibus', 'O'),
 ('Belgarum,', 'O'),
 ('attingit', 'O'),
 ('etiam', 'O'),
 ('ab', 'O'),
 ('Sequanis', 'O'),
 ('et', 'O'),
 ('Helvetiis', 'O'),
 ('flumen', 'O'),
 ('Rhenum,', 'O'),
 ('vergit', 'O'),
 ('ad', 'O'),
 ('septentriones.', 'O'),
 ('Belgae', 'O'),
 ('ab', 'O'),
 ('extremis', 'O'),
 ('Galliae', 'O'),
 ('finibus', 'O'),
 ('oriuntur,', 'O'),
 ('pertinent', 'O'),
 ('ad', 'O'),
 ('inferiorem', 'O'),
 ('partem', 'O'),
 ('fluminis', 'O'),
 ('Rheni,', 'O'),
 ('spectant', 'O'),
 ('in', 'O'),
 ('septentrionem', 'O'),
 ('et', 'O'),
 ('orientem', 'O'),
 ('solem.', 'O'),
 ('Aquitania', 'LOC'),
 ('a', 'O'),
 ('Garumna', 'O'),
 ('flumine', 'O'),
 ('ad', 'O'),
 ('Pyrenaeos', 'O'),
 ('montes', 'O'),
 ('et', 'O'),
 ('eam', 'O'),
 ('partem', 'O'),
 ('Oceani', 'O'),
 ('quae', 'O'),
 ('est', 'O'),
 ('ad', 'O'),
 ('Hispaniam', 'O'),
 ('pertinet;', 'O'),
 ('spectat', 'O'),
 ('inter', 'O'),
 ('occasum', 'O'),
 ('solis', 'O'),
 ('et', 'O'),
 ('septentriones.]', 'O'),
 ('Apud', 'O'),
 ('Helvetios', 'O'),
 ('longe', 'O'),
 ('nobilissimus', 'O'),
 ('fuit', 'O'),
 ('et', 'O'),
 ('ditissimus', 'O'),
 ('Orgetorix.', 'O'),
 ('Is', 'O'),
 ('M.', 'O'),
 ('Messala,', 'O'),
 ('[et', 'O'),
 ('P.]', 'O'),
 ('M.', 'PER'),
 ('Pisone', 'PER'),
 ('consulibus', 'O'),
 ('regni', 'O'),
 ('cupiditate', 'O'),
 ('inductus', 'O'),
 ('coniurationem', 'O'),
 ('nobilitatis', 'O'),
 ('fecit', 'O'),
 ('et', 'O'),
 ('civitati', 'O'),
 ('persuasit', 'O'),
 ('ut', 'O'),
 ('de', 'O'),
 ('finibus', 'O'),
 ('suis', 'O'),
 ('cum', 'O'),
 ('omnibus', 'O'),
 ('copiis', 'O'),
 ('exirent:', 'O'),
 ('perfacile', 'O'),
 ('esse,', 'O'),
 ('cum', 'O'),
 ('virtute', 'O'),
 ('omnibus', 'O'),
 ('praestarent,', 'O'),
 ('totius', 'O'),
 ('Galliae', 'O'),
 ('imperio', 'O'),
 ('potiri.', 'O'),
 ('Id', 'O'),
 ('hoc', 'O'),
 ('facilius', 'O'),
 ('iis', 'O'),
 ('persuasit,', 'O'),
 ('quod', 'O'),
 ('undique', 'O'),
 ('loci', 'O'),
 ('natura', 'O'),
 ('Helvetii', 'O'),
 ('continentur:', 'O'),
 ('una', 'O'),
 ('ex', 'O'),
 ('parte', 'O'),
 ('flumine', 'O'),
 ('Rheno', 'O'),
 ('latissimo', 'O'),
 ('atque', 'O'),
 ('altissimo,', 'O'),
 ('qui', 'O'),
 ('agrum', 'O'),
 ('Helvetium', 'O'),
 ('a', 'O'),
 ('Germanis', 'O'),
 ('dividit;', 'O'),
 ('altera', 'O'),
 ('ex', 'O'),
 ('parte', 'O'),
 ('monte', 'O'),
 ('Iura', 'O'),
 ('altissimo,', 'O'),
 ('qui', 'O'),
 ('est', 'O'),
 ('inter', 'O'),
 ('Sequanos', 'O'),
 ('et', 'O'),
 ('Helvetios;', 'O'),
 ('tertia', 'O'),
 ('lacu', 'O'),
 ('Lemanno', 'O'),
 ('et', 'O'),
 ('flumine', 'O'),
 ('Rhodano,', 'O'),
 ('qui', 'O'),
 ('provinciam', 'O'),
 ('nostram', 'O'),
 ('ab', 'O'),
 ('Helvetiis', 'O'),
 ('dividit.', 'O'),
 ('His', 'O'),
 ('rebus', 'O'),
 ('fiebat', 'O'),
 ('ut', 'O'),
 ('et', 'O'),
 ('minus', 'O'),
 ('late', 'O'),
 ('vagarentur', 'O'),
 ('et', 'O'),
 ('minus', 'O'),
 ('facile', 'O'),
 ('finitimis', 'O'),
 ('bellum', 'O'),
 ('inferre', 'O'),
 ('possent;', 'O'),
 ('qua', 'O'),
 ('ex', 'O'),
 ('parte', 'O'),
 ('homines', 'O'),
 ('bellandi', 'O'),
 ('cupidi', 'O'),
 ('magno', 'O'),
 ('dolore', 'O'),
 ('adficiebantur.', 'O'),
 ('Pro', 'O'),
 ('multitudine', 'O'),
 ('autem', 'O'),
 ('hominum', 'O'),
 ('et', 'O'),
 ('pro', 'O'),
 ('gloria', 'O'),
 ('belli', 'O'),
 ('atque', 'O'),
 ('fortitudinis', 'O'),
 ('angustos', 'O'),
 ('se', 'O'),
 ('fines', 'O'),
 ('habere', 'O'),
 ('arbitrabantur,', 'O'),
 ('qui', 'O'),
 ('in', 'O'),
 ('longitudinem', 'O'),
 ('milia', 'O'),
 ('passuum', 'O'),
 ('CCXL,', 'O'),
 ('in', 'O'),
 ('latitudinem', 'O'),
 ('CLXXX', 'O'),
 ('patebant.', 'O'),
 ('His', 'O'),
 ('rebus', 'O'),
 ('adducti', 'O'),
 ('et', 'O'),
 ('auctoritate', 'O'),
 ('Orgetorigis', 'O'),
 ('permoti', 'O'),
 ('constituerunt', 'O'),
 ('ea', 'O'),
 ('quae', 'O'),
 ('ad', 'O'),
 ('proficiscendum', 'O'),
 ('pertinerent', 'O'),
 ('comparare,', 'O'),
 ('iumentorum', 'O'),
 ('et', 'O'),
 ('carrorum', 'O'),
 ('quam', 'O'),
 ('maximum', 'O'),
 ('numerum', 'O'),
 ('coemere,', 'O'),
 ('sementes', 'O'),
 ('quam', 'O'),
 ('maximas', 'O'),
 ('facere,', 'O'),
 ('ut', 'O'),
 ('in', 'O'),
 ('itinere', 'O'),
 ('copia', 'O'),
 ('frumenti', 'O'),
 ('suppeteret,', 'O'),
 ('cum', 'O'),
 ('proximis', 'O'),
 ('civitatibus', 'O'),
 ('pacem', 'O'),
 ('et', 'O'),
 ('amicitiam', 'O'),
 ('confirmare.', 'O'),
 ('Ad', 'O'),
 ('eas', 'O'),
 ('res', 'O'),
 ('conficiendas', 'O'),
 ('biennium', 'O'),
 ('sibi', 'O'),
 ('satis', 'O'),
 ('esse', 'O'),
 ('duxerunt;', 'O'),
 ('in', 'O'),
 ('tertium', 'O'),
 ('annum', 'O'),
 ('profectionem', 'O'),
 ('lege', 'O'),
 ('confirmant.', 'O'),
 ('Ad', 'O'),
 ('eas', 'O'),
 ('res', 'O'),
 ('conficiendas', 'O'),
 ('Orgetorix', 'O'),
 ('deligitur.', 'O'),
 ('Is', 'O'),
 ('sibi', 'O'),
 ('legationem', 'O'),
 ('ad', 'O'),
 ('civitates', 'O'),
 ('suscipit.', 'O'),
 ('In', 'O'),
 ('eo', 'O'),
 ('itinere', 'O'),
 ('persuadet', 'O'),
 ('Castico,', 'O'),
 ('Catamantaloedis', 'O'),
 ('filio,', 'O'),
 ('Sequano,', 'O'),
 ('cuius', 'O'),
 ('pater', 'O'),
 ('regnum', 'O'),
 ('in', 'O'),
 ('Sequanis', 'O'),
 ('multos', 'O'),
 ('annos', 'O'),
 ('obtinuerat', 'O'),
 ('et', 'O'),
 ('a', 'O'),
 ('senatu', 'O'),
 ('populi', 'O'),
 ('Romani', 'O'),
 ('amicus', 'O'),
 ('appellatus', 'O'),
 ('erat,', 'O'),
 ('ut', 'O'),
 ('regnum', 'O'),
 ('in', 'O'),
 ('civitate', 'O'),
 ('sua', 'O'),
 ('occuparet,', 'O'),
 ('quod', 'O'),
 ('pater', 'O'),
 ('ante', 'O'),
 ('habuerit;', 'O'),
 ('itemque', 'O'),
 ('Dumnorigi', 'O'),
 ('Haeduo,', 'O'),
 ('fratri', 'O'),
 ('Diviciaci,', 'O'),
 ('qui', 'O'),
 ('eo', 'O'),
 ('tempore', 'O'),
 ('principatum', 'O'),
 ('in', 'O'),
 ('civitate', 'O'),
 ('obtinebat', 'O'),
 ('ac', 'O'),
 ('maxime', 'O'),
 ('plebi', 'O'),
 ('acceptus', 'O'),
 ('erat,', 'O'),
 ('ut', 'O'),
 ('idem', 'O'),
 ('conaretur', 'O'),
 ('persuadet', 'O'),
 ('eique', 'O'),
 ('filiam', 'O'),
 ('suam', 'O'),
 ('in', 'O'),
 ('matrimonium', 'O'),
 ('dat.', 'O'),
 ('Perfacile', 'O'),
 ('factu', 'O'),
 ('esse', 'O'),
 ('illis', 'O'),
 ('probat', 'O'),
 ('conata', 'O'),
 ('perficere,', 'O'),
 ('propterea', 'O'),
 ('quod', 'O'),
 ('ipse', 'O'),
 ('suae', 'O'),
 ('civitatis', 'O'),
 ('imperium', 'O'),
 ('obtenturus', 'O'),
 ('esset:', 'O'),
 ('non', 'O'),
 ('esse', 'O'),
 ('dubium', 'O'),
 ('quin', 'O'),
 ('totius', 'O'),
 ('Galliae', 'O'),
 ('plurimum', 'O'),
 ('Helvetii', 'O'),
 ('possent;', 'O'),
 ('se', 'O'),
 ('suis', 'O'),
 ('copiis', 'O'),
 ('suoque', 'O'),
 ('exercitu', 'O'),
 ('illis', 'O'),
 ('regna', 'O'),
 ('conciliaturum', 'O'),
 ('confirmat.', 'O'),
 ('Hac', 'O'),
 ('oratione', 'O'),
 ('adducti', 'O'),
 ('inter', 'O'),
 ('se', 'O'),
 ('fidem', 'O'),
 ('et', 'O'),
 ('ius', 'O'),
 ('iurandum', 'O'),
 ('dant', 'O'),
 ('et', 'O'),
 ('regno', 'O'),
 ('occupato', 'O'),
 ('per', 'O'),
 ('tres', 'O'),
 ('potentissimos', 'O'),
 ('ac', 'O'),
 ('firmissimos', 'O'),
 ('populos', 'O'),
 ('totius', 'O'),
 ('Galliae', 'O'),
 ('sese', 'O'),
 ('potiri', 'O'),
 ('posse', 'O'),
 ('sperant.', 'O'),
 ('Ea', 'O'),
 ('res', 'O'),
 ('est', 'O'),
 ('Helvetiis', 'O'),
 ('per', 'O'),
 ('indicium', 'O'),
 ('enuntiata.', 'O'),
 ('Moribus', 'O'),
 ('suis', 'O'),
 ('Orgetoricem', 'O'),
 ('ex', 'O'),
 ('vinculis', 'O'),
 ('causam', 'O'),
 ('dicere', 'O'),
 ('coegerunt;', 'O'),
 ('damnatum', 'O'),
 ('poenam', 'O'),
 ('sequi', 'O'),
 ('oportebat,', 'O'),
 ('ut', 'O'),
 ('igni', 'O'),
 ('cremaretur.', 'O'),
 ('Die', 'O'),
 ('constituta', 'O'),
 ('causae', 'O'),
 ('dictionis', 'O'),
 ('Orgetorix', 'O'),
 ('ad', 'O'),
 ('iudicium', 'O'),
 ('omnem', 'O'),
 ('suam', 'O'),
 ('familiam,', 'O'),
 ('ad', 'O'),
 ('hominum', 'O'),
 ('milia', 'O'),
 ('decem,', 'O'),
 ('undique', 'O'),
 ('coegit,', 'O'),
 ('et', 'O'),
 ('omnes', 'O'),
 ('clientes', 'O'),
 ('obaeratosque', 'O'),
 ('suos,', 'O'),
 ('quorum', 'O'),
 ('magnum', 'O'),
 ('numerum', 'O'),
 ('habebat,', 'O'),
 ('eodem', 'O'),
 ('conduxit;', 'O'),
 ('per', 'O'),
 ('eos', 'O'),
 ('ne', 'O'),
 ('causam', 'O'),
 ('diceret', 'O'),
 ('se', 'O'),
 ('eripuit.', 'O'),
 ('Cum', 'O'),
 ('civitas', 'O'),
 ('ob', 'O'),
 ('eam', 'O'),
 ('rem', 'O'),
 ('incitata', 'O'),
 ('armis', 'O'),
 ('ius', 'O'),
 ('suum', 'O'),
 ('exequi', 'O'),
 ('conaretur', 'O'),
 ('multitudinemque', 'O'),
 ('hominum', 'O'),
 ('ex', 'O'),
 ('agris', 'O'),
 ('magistratus', 'O'),
 ('cogerent,', 'O'),
 ('Orgetorix', 'O'),
 ('mortuus', 'O'),
 ('est;', 'O'),
 ('neque', 'O'),
 ('abest', 'O'),
 ('suspicio,', 'O'),
 ('ut', 'O'),
 ('Helvetii', 'O'),
 ('arbitrantur,', 'O'),
 ('quin', 'O'),
 ('ipse', 'O'),
 ('sibi', 'O'),
 ('mortem', 'O'),
 ('consciverit.', 'O'),
 ('Post', 'O'),
 ('eius', 'O'),
 ('mortem', 'O'),
 ('nihilo', 'O'),
 ('minus', 'O'),
 ('Helvetii', 'O'),
 ('id', 'O'),
 ('quod', 'O'),
 ('constituerant', 'O'),
 ('facere', 'O'),
 ('conantur,', 'O'),
 ('ut', 'O'),
 ('e', 'O'),
 ('finibus', 'O'),
 ('suis', 'O'),
 ('exeant.', 'O'),
 ('Ubi', 'ORG'),
 ('iam', 'O'),
 ('se', 'O'),
 ('ad', 'O'),
 ('eam', 'O'),
 ('rem', 'O'),
 ('paratos', 'O'),
 ('esse', 'O'),
 ('arbitrati', 'O'),
 ('sunt,', 'O'),
 ('oppida', 'O'),
 ('sua', 'O'),
 ('omnia,', 'O'),
 ('numero', 'O'),
 ('ad', 'O'),
 ('duodecim,', 'O'),
 ('vicos', 'O'),
 ('ad', 'O'),
 ('quadringentos,', 'O'),
 ('reliqua', 'O'),
 ('privata', 'O'),
 ('aedificia', 'O'),
 ('incendunt;', 'O'),
 ('frumentum', 'O'),
 ('omne,', 'O'),
 ('praeter', 'O'),
 ('quod', 'O'),
 ('secum', 'O'),
 ('portaturi', 'O'),
 ('erant,', 'O'),
 ('comburunt,', 'O'),
 ('ut', 'O'),
 ('domum', 'O'),
 ('reditionis', 'O'),
 ('spe', 'O'),
 ('sublata', 'O'),
 ('paratiores', 'O'),
 ('ad', 'O'),
 ('omnia', 'O'),
 ('pericula', 'O'),
 ('subeunda', 'O'),
 ('essent;', 'O'),
 ('trium', 'O'),
 ('mensum', 'O'),
 ('molita', 'O'),
 ('cibaria', 'O'),
 ('sibi', 'O'),
 ('quemque', 'O'),
 ('domo', 'O'),
 ('efferre', 'O'),
 ('iubent.', 'O'),
 ('Persuadent', 'O'),
 ('Rauracis', 'O'),
 ('et', 'O'),
 ('Tulingis', 'O'),
 ('et', 'O'),
 ('Latobrigis', 'O'),
 ('finitimis,', 'O'),
 ('uti', 'O'),
 ('eodem', 'O'),
 ('usi', 'O'),
 ('consilio', 'O'),
 ('oppidis', 'O'),
 ('suis', 'O'),
 ('vicisque', 'O'),
 ('exustis', 'O'),
 ('una', 'O'),
 ('cum', 'O'),
 ('iis', 'O'),
 ('proficiscantur,', 'O'),
 ('Boiosque,', 'O'),
 ('qui', 'O'),
 ('trans', 'O'),
 ('Rhenum', 'O'),
 ('incoluerant', 'O'),
 ('et', 'O'),
 ('in', 'O'),
 ('agrum', 'O'),
 ('Noricum', 'O'),
 ('transierant', 'O'),
 ('Noreiamque', 'O'),
 ('oppugnabant,', 'O'),
 ('receptos', 'O'),
 ('ad', 'O'),
 ('se', 'O'),
 ('socios', 'O'),
 ('sibi', 'O'),
 ('adsciscunt.', 'O'),
 ('Erant', 'O'),
 ('omnino', 'O'),
 ('itinera', 'O'),
 ('duo,', 'O'),
 ('quibus', 'O'),
 ('itineribus', 'O'),
 ('domo', 'O'),
 ('exire', 'O'),
 ('possent:', 'O'),
 ('unum', 'O'),
 ('per', 'O'),
 ('Sequanos,', 'O'),
 ('angustum', 'O'),
 ('et', 'O'),
 ('difficile,', 'O'),
 ('inter', 'O'),
 ('montem', 'O'),
 ('Iuram', 'O'),
 ('et', 'O'),
 ('flumen', 'O'),
 ('Rhodanum,', 'O'),
 ('vix', 'O'),
 ('qua', 'O'),
 ('singuli', 'O'),
 ('carri', 'O'),
 ('ducerentur,', 'O'),
 ('mons', 'O'),
 ('autem', 'O'),
 ('altissimus', 'O'),
 ('impendebat,', 'O'),
 ('ut', 'O'),
 ('facile', 'O'),
 ('perpauci', 'O'),
 ('prohibere', 'O'),
 ('possent;', 'O'),
 ('alterum', 'O'),
 ('per', 'O'),
 ('provinciam', 'O'),
 ('nostram,', 'O'),
 ('multo', 'O'),
 ('facilius', 'O'),
 ('atque', 'O'),
 ('expeditius,', 'O'),
 ('propterea', 'O'),
 ('quod', 'O'),
 ('inter', 'O'),
 ('fines', 'O'),
 ('Helvetiorum', 'O'),
 ('et', 'O'),
 ('Allobrogum,', 'O'),
 ('qui', 'O'),
 ('nuper', 'O'),
 ('pacati', 'O'),
 ('erant,', 'O'),
 ('Rhodanus', 'O'),
 ('fluit', 'O'),
 ('isque', 'O'),
 ('non', 'O'),
 ('nullis', 'O'),
 ('locis', 'O'),
 ('vado', 'O'),
 ('transitur.', 'O'),
 ('Extremum', 'O'),
 ('oppidum', 'O'),
 ('Allobrogum', 'O'),
 ('est', 'O'),
 ('proximumque', 'O'),
 ('Helvetiorum', 'O'),
 ('finibus', 'O'),
 ('Genava.', 'O'),
 ('Ex', 'O'),
 ('eo', 'O'),
 ('oppido', 'O'),
 ('pons', 'O'),
 ('ad', 'O'),
 ('Helvetios', 'O'),
 ('pertinet.', 'O'),
 ('Allobrogibus', 'O'),
 ('sese', 'O'),
 ('vel', 'O'),
 ('persuasuros,', 'O'),
 ('quod', 'O'),
 ('nondum', 'O'),
 ('bono', 'O'),
 ('animo', 'O'),
 ('in', 'O'),
 ('populum', 'O'),
 ('Romanum', 'O'),
 ('viderentur,', 'O'),
 ('existimabant', 'O'),
 ('vel', 'O'),
 ('vi', 'O'),
 ('coacturos', 'O'),
 ('ut', 'O'),
 ('per', 'O'),
 ('suos', 'O'),
 ('fines', 'O'),
 ('eos', 'O'),
 ('ire', 'O'),
 ('paterentur.', 'O'),
 ('Omnibus', 'O'),
 ('rebus', 'O'),
 ('ad', 'O'),
 ('profectionem', 'O'),
 ('comparatis', 'O'),
 ('diem', 'O'),
 ('dicunt,', 'O'),
 ('qua', 'O'),
 ('die', 'O'),
 ('ad', 'O'),
 ('ripam', 'O'),
 ('Rhodani', 'O'),
 ('omnes', 'O'),
 ('conveniant.', 'O'),
 ('is', 'O'),
 ('dies', 'O'),
 ('erat', 'O'),
 ('a.', 'O'),
 ('d.', 'PER'),
 ('V.', 'PER'),
 ('Kal.', 'PER'),
 ('Apr.', 'PER'),
 ('L.', 'PER'),
 ('Pisone,', 'PER'),
 ('A.', 'PER'),
 ('Gabinio', 'PER'),
 ('consulibus.', 'PER'),
 ('Caesari', 'PER'),
 ('cum', 'O'),
 ('id', 'O'),
 ('nuntiatum', 'O'),
 ('esset,', 'O'),
 ('eos', 'O'),
 ('per', 'O'),
 ('provinciam', 'O'),
 ('nostram', 'O'),
 ('iter', 'O'),
 ('facere', 'O'),
 ('conari,', 'O'),
 ('maturat', 'O'),
 ('ab', 'O'),
 ('urbe', 'O'),
 ('proficisci', 'O'),
 ('et', 'O'),
 ('quam', 'O'),
 ('maximis', 'O'),
 ('potest', 'O'),
 ('itineribus', 'O'),
 ('in', 'O'),
 ('Galliam', 'O'),
 ('ulteriorem', 'O'),
 ('contendit', 'O'),
 ('et', 'O'),
 ('ad', 'O'),
 ('Genavam', 'O'),
 ('pervenit.', 'O'),
 ('Provinciae', 'O'),
 ('toti', 'O'),
 ('quam', 'O'),
 ('maximum', 'O'),
 ('potest', 'O'),
 ('militum', 'O'),
 ('numerum', 'O'),
 ('imperat', 'O'),
 ('(erat', 'O'),
 ('omnino', 'O'),
 ('in', 'O'),
 ('Gallia', 'LOC'),
 ('ulteriore', 'O'),
 ('legio', 'O'),
 ('una),', 'O'),
 ('pontem,', 'O'),
 ('qui', 'O'),
 ('erat', 'O'),
 ('ad', 'O'),
 ('Genavam,', 'O'),
 ('iubet', 'O'),
 ('rescindi.', 'O'),
 ('Ubi', 'ORG'),
 ('de', 'O'),
 ('eius', 'O'),
 ('aventu', 'O'),
 ('Helvetii', 'O'),
 ('certiores', 'O'),
 ('facti', 'O'),
 ('sunt,', 'O'),
 ('legatos', 'O'),
 ('ad', 'O'),
 ('eum', 'O'),
 ('mittunt', 'O'),
 ('nobilissimos', 'O'),
 ('civitatis,', 'O'),
 ('cuius', 'O'),
 ('legationis', 'O'),
 ('Nammeius', 'O'),
 ('et', 'O'),
 ('Verucloetius', 'O'),
 ('principem', 'O'),
 ('locum', 'O'),
 ('obtinebant,', 'O'),
 ('qui', 'O'),
 ('dicerent', 'O'),
 ('sibi', 'O'),
 ('esse', 'O'),
 ('in', 'O'),
 ('animo', 'O'),
 ('sine', 'O'),
 ('ullo', 'O'),
 ('maleficio', 'O'),
 ('iter', 'O'),
 ('per', 'O'),
 ('provinciam', 'O'),
 ('facere,', 'O'),
 ('propterea', 'O'),
 ('quod', 'O'),
 ('aliud', 'O'),
 ('iter', 'O'),
 ('haberent', 'O'),
 ('nullum:', 'O'),
 ('rogare', 'O'),
 ('ut', 'O'),
 ('eius', 'O'),
 ('voluntate', 'O'),
 ('id', 'O'),
 ('sibi', 'O'),
 ('facere', 'O'),
 ('liceat.', 'O'),
 ('Caesar,', 'O'),
 ('quod', 'O'),
 ('memoria', 'O'),
 ('tenebat', 'O'),
 ('L.', 'O'),
 ('Cassium', 'O'),
 ('consulem', 'O'),
 ('occisum', 'O'),
 ('exercitumque', 'O'),
 ('eius', 'O'),
 ('ab', 'O'),
 ('Helvetiis', 'O'),
 ('pulsum', 'O'),
 ('et', 'O'),
 ('sub', 'O'),
 ('iugum', 'O'),
 ('missum,', 'O'),
 ('concedendum', 'O'),
 ('non', 'O'),
 ('putabat;', 'O'),
 ('neque', 'O'),
 ('homines', 'O'),
 ('inimico', 'O'),
 ('animo,', 'O'),
 ('data', 'O'),
 ('facultate', 'O'),
 ('per', 'O'),
 ('provinciam', 'O'),
 ('itineris', 'O'),
 ('faciundi,', 'O'),
 ('temperaturos', 'O'),
 ('ab', 'O'),
 ('iniuria', 'O'),
 ('et', 'O'),
 ('maleficio', 'O'),
 ('existimabat.', 'O'),
 ('Tamen,', 'O'),
 ('ut', 'O'),
 ('spatium', 'O'),
 ('intercedere', 'O'),
 ('posset', 'O'),
 ('dum', 'O'),
 ('milites', 'O'),
 ('quos', 'O'),
 ('imperaverat', 'O'),
 ('convenirent,', 'O'),
 ('legatis', 'O'),
 ('respondit', 'O'),
 ('diem', 'O'),
 ('se', 'O'),
 ('ad', 'O'),
 ('deliberandum', 'O'),
 ('sumpturum:', 'O'),
 ('si', 'O'),
 ('quid', 'O'),
 ('vellent,', 'O'),
 ('ad', 'O'),
 ('Id.', 'O'),
 ('April.', 'O'),
 ('reverterentur.', 'O'),
 ('Interea', 'O'),
 ('ea', 'O'),
 ('legione', 'O'),
 ('quam', 'O'),
 ('secum', 'O'),
 ('habebat', 'O'),
 ('militibusque,', 'O'),
 ('qui', 'O'),
 ('ex', 'O'),
 ('provincia', 'O'),
 ('convenerant,', 'O'),
 ('a', 'O'),
 ('lacu', 'O'),
 ('Lemanno,', 'O'),
 ('qui', 'O'),
 ('in', 'O'),
 ('flumen', 'O'),

The first thing to do is to create a list of all named entity tags that were extracted by NLTK:

In [493]:
nltk_tags = []
for token, tag in tagged_text_nltk:
    # keep only the tag of each (token, tag) pair
    nltk_tags.append(tag)

In [494]:
nltk_tags[:10]

['LOC', 'LOC', 'LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

A more elegant – that is, Pythonic – way of doing this is to use a list comprehension:

In [32]:
nltk_tags = [tag for token, tag in tagged_text_nltk]

In [495]:
nltk_tags[:10]

['LOC', 'LOC', 'LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

Now, we want to count how many times each tag (i.e. each entity type) appears.

A typical way of doing this is to use a dictionary to store the counts.

Since the keys of a dictionary are unique, we can use them to keep track of whether a given entity type has already been encountered as we go through all extracted entities.

The values in the dictionary are simply numbers (integers) that are incremented by 1 each time a given type is found in the data.

In [496]:

['LOC', 'LOC', 'LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

In [497]:
# we initialize an empty dictionary
counts = {}

# we iterate through all NE tags
for tag in nltk_tags:
    # we check if our dictionary already contains an item
    # for that specific entity type
    if tag in counts:
        # if it does, we just increase the counter by 1
        counts[tag] += 1
    else:
        # otherwise we add it and set it to 1
        counts[tag] = 1

In [501]:

{'LOC': 22, 'O': 8001, 'ORG': 40, 'PER': 113}

Let's look at the result:

In [24]:
counts

{'LOC': 22, 'O': 8001, 'ORG': 40, 'PER': 113}

Now that we have learned how to do this ourselves, it's important to know that the Python library collections already contains an object that does exactly this: Counter.

Let's look at its documentation:

In [29]:

In [502]:
from collections import Counter

In [503]:
nltk_tag_counts = Counter(nltk_tags)

In [504]:

Counter({'LOC': 22, 'O': 8001, 'ORG': 40, 'PER': 113})

As you can see, the output is identical to the one we had previously obtained!
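The equivalence between the hand-written loop and Counter can be checked on a small example. The sketch below uses a toy list of tags (an assumption, not the real NLTK output) and also shows Counter's most_common method, which returns the items sorted by decreasing count.

```python
from collections import Counter

# Toy list of NE tags, standing in for nltk_tags.
tags = ["LOC", "O", "O", "PER", "O", "LOC"]

# Manual counting with a plain dictionary.
manual_counts = {}
for tag in tags:
    # dict.get returns 0 when the tag has not been seen yet
    manual_counts[tag] = manual_counts.get(tag, 0) + 1

# The same counts obtained with Counter.
tag_counts = Counter(tags)

# Converting the Counter to a plain dict makes the comparison explicit.
print(dict(tag_counts) == manual_counts)

# The two most frequent tags, as (tag, count) pairs.
top_two = tag_counts.most_common(2)
print(top_two)  # → [('O', 3), ('LOC', 2)]
```

Calling most_common() without an argument returns all items, which saves a separate sorting step.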

Computing entity frequency

Let's first filter NLTK's output and keep just the entities (i.e. the tokens identified as being part of a NE):

In [505]:
nltk_entities = []
for token, tag in tagged_text_nltk:
    # "O" marks tokens outside of any named entity
    if tag != "O":
        nltk_entities.append(token)

In [506]:


We can now use the Counter object that we just introduced to count the frequencies:

In [507]:
nltk_entity_counts = Counter(nltk_entities)

In [508]:


In [509]:

Counter({'A.': 1,
         'Apr.': 1,
         'Aquitania': 1,
         'Arari': 1,
         'Ariovistus': 2,
         'Belgis': 1,
         'Bello': 1,
         'C': 1,
         'C.': 4,
         'COMMENTARIUS': 1,
         'Caburi': 1,
         'Caesar': 21,
         'Caesare': 2,
         'Caesarem': 4,
         'Caesari': 4,
         'Caesaris': 7,
         'Cassiano': 1,
         'Cimbris': 1,
         'Cognito': 1,
         'Conloquendi': 1,
         'De': 1,
         'Diem': 1,
         'Diu': 1,
         'Diviciaco': 3,
         'Diviciacum': 1,
         'Diviciacus': 1,
         'Divico': 2,
         'Dubis': 1,
         'Dumnorigem': 1,
         'Fabio': 1,
         'Flacco': 1,
         'Gabinio': 1,
         'Galli': 1,
         'Gallia': 12,
         'Galliam': 1,
         'Gallos': 4,
         'Genus': 1,
         'Germanos': 4,
         'Haeduorum': 1,
         'Haeduos': 1,
         'Helvetiis': 1,
         'Hi': 1,
         'His': 1,
         'Huc': 1,
         'Huic': 1,
         'Ipse': 4,
         'Italia': 1,
         'Item': 1,
         'Kal.': 1,
         'L.': 2,
         'Labieno': 2,
         'Locutus': 1,
         'M.': 4,
         'Magnam': 1,
         'Mario': 1,
         'Matrona': 1,
         'Messala,': 1,
         'Metius': 1,
         'Munitis': 1,
         'Nam': 1,
         'PRIMUS': 1,
         'Petit': 1,
         'Pisone': 2,
         'Pisone,': 1,
         'Postero': 1,
         'Praeterea': 1,
         'Q.': 1,
         'Rhodanus': 1,
         'Romanis': 2,
         'Romanos': 1,
         'Romanus': 2,
         'Sequanis': 1,
         'Sequanorum': 1,
         'Sullae': 1,
         'Summa': 1,
         'Teutonis': 1,
         'Treveris': 1,
         'Ubi': 8,
         'V': 1,
         'V.': 1,
         'Valerii': 1,
         'Valerio': 1,
         'Valerius': 1,
         'appellantur.': 1,
         'autem': 1,
         'certiorem': 1,
         'citeriorem': 1,
         'cognovit': 1,
         'consulibus.': 1,
         'd.': 1,
         'esse.': 1,
         'et': 2,
         'exercitu': 1,
         'exploratores': 2,
         'legioni': 1,
         'per': 1,
         'provincia': 1,
         'renuntiatur': 1})

In [510]:

[('Caesar', 21),
 ('Gallia', 12),
 ('Ubi', 8),
 ('Caesaris', 7),
 ('Gallos', 4),
 ('M.', 4),
 ('Caesarem', 4),
 ('Ipse', 4),
 ('Caesari', 4),
 ('Germanos', 4),
 ('C.', 4),
 ('Diviciaco', 3),
 ('Romanis', 2),
 ('Pisone', 2),
 ('Romanus', 2),
 ('L.', 2),
 ('et', 2),
 ('Labieno', 2),
 ('Caesare', 2),
 ('exploratores', 2),
 ('Ariovistus', 2),
 ('Divico', 2),
 ('Rhodanus', 1),
 ('provincia', 1),
 ('Diviciacus', 1),
 ('V', 1),
 ('PRIMUS', 1),
 ('consulibus.', 1),
 ('Diem', 1),
 ('Dumnorigem', 1),
 ('esse.', 1),
 ('Bello', 1),
 ('C', 1),
 ('Metius', 1),
 ('Item', 1),
 ('per', 1),
 ('Matrona', 1),
 ('Valerio', 1),
 ('certiorem', 1),
 ('Fabio', 1),
 ('Galliam', 1),
 ('Sullae', 1),
 ('Haeduos', 1),
 ('Gabinio', 1),
 ('Messala,', 1),
 ('Genus', 1),
 ('V.', 1),
 ('Romanos', 1),
 ('Munitis', 1),
 ('Postero', 1),
 ('cognovit', 1),
 ('Hi', 1),
 ('Diviciacum', 1),
 ('Summa', 1),
 ('Pisone,', 1),
 ('Dubis', 1),
 ('Galli', 1),
 ('Valerii', 1),
 ('autem', 1),
 ('Kal.', 1),
 ('Teutonis', 1),
 ('appellantur.', 1),
 ('Diu', 1),
 ('Huic', 1),
 ('Haeduorum', 1),
 ('citeriorem', 1),
 ('Magnam', 1),
 ('Cassiano', 1),
 ('Treveris', 1),
 ('Caburi', 1),
 ('renuntiatur', 1),
 ('Locutus', 1),
 ('Sequanis', 1),
 ('Belgis', 1),
 ('Sequanorum', 1),
 ('Mario', 1),
 ('His', 1),
 ('d.', 1),
 ('Petit', 1),
 ('exercitu', 1),
 ('Conloquendi', 1),
 ('A.', 1),
 ('legioni', 1),
 ('Aquitania', 1),
 ('Q.', 1),
 ('Flacco', 1),
 ('Valerius', 1),
 ('Italia', 1),
 ('De', 1),
 ('Helvetiis', 1),
 ('Cognito', 1),
 ('Huc', 1),
 ('Cimbris', 1),
 ('Apr.', 1),
 ('Nam', 1),
 ('Arari', 1),
 ('Praeterea', 1)]

In [486]:
nltk_entity_counts = dict(nltk_entity_counts)

In [487]:


In [398]:
sorted(nltk_entity_counts.items(), key=lambda x: x[1], reverse=True)

[('Caesar', 21),
 ('Gallia', 12),
 ('Ubi', 8),
 ('Caesaris', 7),
 ('Caesari', 4),
 ('Gallos', 4),
 ('M.', 4),
 ('Caesarem', 4),
 ('Ipse', 4),
 ('C.', 4),
 ('Germanos', 4),
 ('Diviciaco', 3),
 ('Romanis', 2),
 ('Pisone', 2),
 ('Romanus', 2),
 ('Divico', 2),
 ('Ariovistus', 2),
 ('et', 2),
 ('Labieno', 2),
 ('Caesare', 2),
 ('L.', 2),
 ('exploratores', 2),
 ('Rhodanus', 1),
 ('provincia', 1),
 ('Diviciacus', 1),
 ('V', 1),
 ('PRIMUS', 1),
 ('consulibus.', 1),
 ('Diem', 1),
 ('Diu', 1),
 ('Dumnorigem', 1),
 ('esse.', 1),
 ('Bello', 1),
 ('C', 1),
 ('certiorem', 1),
 ('per', 1),
 ('Matrona', 1),
 ('Cognito', 1),
 ('Valerio', 1),
 ('Item', 1),
 ('Metius', 1),
 ('Sullae', 1),
 ('Haeduos', 1),
 ('Gabinio', 1),
 ('Genus', 1),
 ('V.', 1),
 ('Romanos', 1),
 ('Fabio', 1),
 ('Munitis', 1),
 ('cognovit', 1),
 ('Hi', 1),
 ('Diviciacum', 1),
 ('Valerii', 1),
 ('Postero', 1),
 ('Pisone,', 1),
 ('Dubis', 1),
 ('Galli', 1),
 ('autem', 1),
 ('Kal.', 1),
 ('Teutonis', 1),
 ('appellantur.', 1),
 ('Summa', 1),
 ('Huic', 1),
 ('Haeduorum', 1),
 ('citeriorem', 1),
 ('Magnam', 1),
 ('Cassiano', 1),
 ('Treveris', 1),
 ('Caburi', 1),
 ('renuntiatur', 1),
 ('Locutus', 1),
 ('Sequanis', 1),
 ('Belgis', 1),
 ('Sequanorum', 1),
 ('Mario', 1),
 ('De', 1),
 ('His', 1),
 ('d.', 1),
 ('exercitu', 1),
 ('Arari', 1),
 ('Conloquendi', 1),
 ('A.', 1),
 ('legioni', 1),
 ('Aquitania', 1),
 ('Galliam', 1),
 ('Q.', 1),
 ('Flacco', 1),
 ('Valerius', 1),
 ('Italia', 1),
 ('Petit', 1),
 ('Helvetiis', 1),
 ('Huc', 1),
 ('Cimbris', 1),
 ('Praeterea', 1),
 ('Apr.', 1),
 ('Nam', 1),
 ('Messala,', 1)]

As you will have noticed, one big limitation of counting entities this way is that entities consisting of more than one token are treated as separate entities.

What we'd need, instead, is a data format that allows for stating that two or more consecutive tokens are part of the same entity...
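One simple way to approximate this, sketched below, is to merge runs of consecutive tokens that carry the same non-"O" tag. The function name group_entities and the toy input are assumptions made for illustration; this is not what NLTK or CLTK do internally.

```python
from itertools import groupby

# Toy tagged sequence in the same (token, tag) format used above.
tagged = [
    ("Caesari", "PER"),
    ("cum", "O"),
    ("L.", "PER"),
    ("Pisone,", "PER"),
    ("consulibus", "O"),
    ("Gallia", "LOC"),
]


def group_entities(tagged_tokens):
    """Merge consecutive tokens sharing the same non-'O' tag (hypothetical helper)."""
    entities = []
    # groupby yields one group per run of adjacent pairs with the same tag
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == "O":
            continue  # skip runs of tokens that are not entities
        tokens = [token for token, _ in group]
        entities.append((" ".join(tokens), tag))
    return entities


entities = group_entities(tagged)
print(entities)
# → [('Caesari', 'PER'), ('L. Pisone,', 'PER'), ('Gallia', 'LOC')]
```

This heuristic cannot tell apart two distinct entities of the same type that happen to be adjacent; a tagging scheme that marks the beginning of each entity, such as IOB, avoids that ambiguity.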

Another limitation is that we are actually counting the surface forms of a given entity (cf. "Caesar", "Caesaris", "Caesari", etc.). To address this, we'd need to disambiguate each entity by means of a unique identifier.
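To illustrate what counting by identifier would look like, the sketch below sums the surface-form counts shown earlier under a shared key. The lookup table surface_to_id and its identifiers are hand-made assumptions for this example; in practice such a mapping would come from lemmatisation or entity linking.

```python
from collections import Counter

# A few surface-form counts taken from the output above.
surface_counts = {
    "Caesar": 21,
    "Caesaris": 7,
    "Caesari": 4,
    "Caesarem": 4,
    "Caesare": 2,
    "Gallia": 12,
    "Galliam": 1,
}

# Hypothetical, hand-made mapping from surface form to a unique identifier.
surface_to_id = {
    "Caesar": "caesar",
    "Caesaris": "caesar",
    "Caesari": "caesar",
    "Caesarem": "caesar",
    "Caesare": "caesar",
    "Gallia": "gallia",
    "Galliam": "gallia",
}

entity_id_counts = Counter()
for form, n in surface_counts.items():
    # fall back to the surface form itself when no identifier is known
    entity_id_counts[surface_to_id.get(form, form)] += n

print(entity_id_counts.most_common())
# → [('caesar', 38), ('gallia', 13)]
```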

Counting, sorting, plotting

Two libraries that are very useful when dealing with data are pandas and seaborn.

Pandas is a data analysis library, while seaborn is used to visualise statistical data.

These libraries play nicely together and are often used in combination.

In [511]:
import pandas as pd
import seaborn as sns

In [512]:
%matplotlib inline


A key data structure in pandas is the DataFrame, a tabular data structure that allows for arithmetic operations on its contents.

The functionality provided by pandas' dataframes is very similar to that of spreadsheet software.

First off, we initialise a dataframe representing the named entity counts that we have created previously. Remember?

In [409]:

{'A.': 1,
 'Apr.': 1,
 'Aquitania': 1,
 'Arari': 1,
 'Ariovistus': 2,
 'Belgis': 1,
 'Bello': 1,
 'C': 1,
 'C.': 4,
 'Caburi': 1,
 'Caesar': 21,
 'Caesare': 2,
 'Caesarem': 4,
 'Caesari': 4,
 'Caesaris': 7,
 'Cassiano': 1,
 'Cimbris': 1,
 'Cognito': 1,
 'Conloquendi': 1,
 'De': 1,
 'Diem': 1,
 'Diu': 1,
 'Diviciaco': 3,
 'Diviciacum': 1,
 'Diviciacus': 1,
 'Divico': 2,
 'Dubis': 1,
 'Dumnorigem': 1,
 'Fabio': 1,
 'Flacco': 1,
 'Gabinio': 1,
 'Galli': 1,
 'Gallia': 12,
 'Galliam': 1,
 'Gallos': 4,
 'Genus': 1,
 'Germanos': 4,
 'Haeduorum': 1,
 'Haeduos': 1,
 'Helvetiis': 1,
 'Hi': 1,
 'His': 1,
 'Huc': 1,
 'Huic': 1,
 'Ipse': 4,
 'Italia': 1,
 'Item': 1,
 'Kal.': 1,
 'L.': 2,
 'Labieno': 2,
 'Locutus': 1,
 'M.': 4,
 'Magnam': 1,
 'Mario': 1,
 'Matrona': 1,
 'Messala,': 1,
 'Metius': 1,
 'Munitis': 1,
 'Nam': 1,
 'PRIMUS': 1,
 'Petit': 1,
 'Pisone': 2,
 'Pisone,': 1,
 'Postero': 1,
 'Praeterea': 1,
 'Q.': 1,
 'Rhodanus': 1,
 'Romanis': 2,
 'Romanos': 1,
 'Romanus': 2,
 'Sequanis': 1,
 'Sequanorum': 1,
 'Sullae': 1,
 'Summa': 1,
 'Teutonis': 1,
 'Treveris': 1,
 'Ubi': 8,
 'V': 1,
 'V.': 1,
 'Valerii': 1,
 'Valerio': 1,
 'Valerius': 1,
 'appellantur.': 1,
 'autem': 1,
 'certiorem': 1,
 'citeriorem': 1,
 'cognovit': 1,
 'consulibus.': 1,
 'd.': 1,
 'esse.': 1,
 'et': 2,
 'exercitu': 1,
 'exploratores': 2,
 'legioni': 1,
 'per': 1,
 'provincia': 1,
 'renuntiatur': 1}

We use a utility function provided by the library to create a DataFrame starting from a dictionary.

In [513]:
df = pd.DataFrame.from_dict(dict(nltk_entity_counts), orient="index")

In [514]:

Rhodanus 1
Caesari 4
provincia 1
Diviciacus 1
V 1
Romanis 2
consulibus. 1
Pisone 2
Diem 1
Ubi 8
Diu 1
Dumnorigem 1
esse. 1
Bello 1
C 1
Romanus 2
certiorem 1
per 1
Matrona 1
Cognito 1
Valerio 1
Item 1
Divico 2
Gallia 12
Metius 1
Ariovistus 2
Sullae 1
Haeduos 1
Gabinio 1
... ...
Mario 1
Caesare 2
De 1
His 1
Caesar 21
d. 1
exercitu 1
Arari 1
Conloquendi 1
A. 1
legioni 1
Aquitania 1
Galliam 1
Q. 1
Flacco 1
Valerius 1
Italia 1
Petit 1
Helvetiis 1
L. 2
Huc 1
Cimbris 1
Praeterea 1
Apr. 1
Nam 1
exploratores 2
C. 4
Messala, 1
Germanos 4

98 rows × 1 columns

Let's rename the column that contains the entity counts:

In [515]:
df.columns = ["count"]

In [516]:

<class 'pandas.core.frame.DataFrame'>
Index: 98 entries, Rhodanus to Germanos
Data columns (total 1 columns):
count    98 non-null int64
dtypes: int64(1)
memory usage: 1.5+ KB

In [518]:

Rhodanus 1
Caesari 4
provincia 1
Diviciacus 1
V 1
Romanis 2
consulibus. 1
Pisone 2
Diem 1

In [520]:

L. 2
Huc 1
Cimbris 1
Praeterea 1
Apr. 1
Nam 1
exploratores 2
C. 4
Messala, 1
Germanos 4

We can now sort the entities based on their frequency by using the sort_values method of the dataframe:

In [521]:
df.sort_values(by="count", ascending=False)

Caesar 21
Gallia 12
Ubi 8
Caesaris 7
Germanos 4
C. 4
Caesarem 4
Caesari 4
Ipse 4
M. 4
Gallos 4
Diviciaco 3
Divico 2
Labieno 2
Caesare 2
Ariovistus 2
et 2
Romanis 2
Pisone 2
exploratores 2
Romanus 2
L. 2
Locutus 1
Mario 1
Sequanorum 1
Apr. 1
Belgis 1
Sequanis 1
Italia 1
Praeterea 1
... ...
Dumnorigem 1
Diu 1
Diem 1
consulibus. 1
V 1
Diviciacus 1
provincia 1
Sullae 1
Haeduos 1
Gabinio 1
Genus 1
Huic 1
Summa 1
appellantur. 1
Kal. 1
autem 1
Galli 1
Dubis 1
Pisone, 1
Postero 1
Valerii 1
Diviciacum 1
Hi 1
cognovit 1
Munitis 1
Fabio 1
Romanos 1
V. 1
Teutonis 1

98 rows × 1 columns

In [522]:

Rhodanus 1
Caesari 4
provincia 1
Diviciacus 1
V 1
Romanis 2
consulibus. 1
Pisone 2
Diem 1
Ubi 8
Diu 1
Dumnorigem 1
esse. 1
Bello 1
C 1
Romanus 2
certiorem 1
per 1
Matrona 1
Cognito 1
Valerio 1
Item 1
Divico 2
Gallia 12
Metius 1
Ariovistus 2
Sullae 1
Haeduos 1
Gabinio 1
... ...
Mario 1
Caesare 2
De 1
His 1
Caesar 21
d. 1
exercitu 1
Arari 1
Conloquendi 1
A. 1
legioni 1
Aquitania 1
Galliam 1
Q. 1
Flacco 1
Valerius 1
Italia 1
Petit 1
Helvetiis 1
L. 2
Huc 1
Cimbris 1
Praeterea 1
Apr. 1
Nam 1
exploratores 2
C. 4
Messala, 1
Germanos 4

98 rows × 1 columns

NB: sort_values returns a sorted copy of the input dataframe; it does not modify the dataframe in place.

So, to keep the sorted dataframe, we need to re-assign our variable:

In [523]:
df = df.sort_values(by="count", ascending=False)
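As a minimal sketch of the copy-versus-in-place behaviour described above, the toy dictionary below (three counts borrowed from the entity counts; the names toy_counts, toy_df and sorted_df are illustrative only) shows that the original dataframe keeps its order until the result is re-assigned:

```python
import pandas as pd

# Toy data: three entries borrowed from the entity counts shown earlier.
toy_counts = {"Gallia": 12, "Caesar": 21, "Ubi": 8}

# Keys become the row labels; the single column is then renamed.
toy_df = pd.DataFrame.from_dict(toy_counts, orient="index")
toy_df.columns = ["count"]

# sort_values returns a new, sorted dataframe...
sorted_df = toy_df.sort_values(by="count", ascending=False)

# ...while the original keeps its insertion order.
print(list(toy_df.index))
print(list(sorted_df.index))
```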

In [524]:

Caesar 21
Gallia 12
Ubi 8
Caesaris 7
Germanos 4
C. 4
Caesarem 4
Caesari 4
Ipse 4
M. 4
Gallos 4
Diviciaco 3
Divico 2
Labieno 2
Caesare 2
Ariovistus 2
et 2
Romanis 2
Pisone 2
exploratores 2
Romanus 2
L. 2
Locutus 1
Mario 1
Sequanorum 1
Apr. 1
Belgis 1
Sequanis 1
Italia 1
Praeterea 1
... ...
Dumnorigem 1
Diu 1
Diem 1
consulibus. 1
V 1
Diviciacus 1
provincia 1
Sullae 1
Haeduos 1
Gabinio 1
Genus 1
Huic 1
Summa 1
appellantur. 1
Kal. 1
autem 1
Galli 1
Dubis 1
Pisone, 1
Postero 1
Valerii 1
Diviciacum 1
Hi 1
cognovit 1
Munitis 1
Fabio 1
Romanos 1
V. 1
Teutonis 1

98 rows × 1 columns


Seaborn is built on top of matplotlib, a very powerful Python library for plotting.

Seaborn provides a high-level layer on top of it and makes some guesses about the nature of the data it receives as input.

In [440]:
from IPython.display import IFrame
IFrame('http://seaborn.pydata.org/examples/', width=900, height=4000)


A handy feature is that you can pass a pandas.DataFrame directly to Seaborn.

Here we plot the top 10 entities extracted by NLTK:

In [525]:

Caesar 21
Gallia 12
Ubi 8
Caesaris 7
Germanos 4
C. 4
Caesarem 4
Caesari 4
Ipse 4
M. 4

In [526]:
sns.barplot(x="count", y=df[:10].index, data=df[:10])

<matplotlib.axes._subplots.AxesSubplot at 0x7fd1d19d4eb8>
/home/mromanello/.pyenv/versions/3.5.0/envs/sunoikisis/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))

Some things to note:

  • x is the name of the dataframe's column containing the values for the x axis
  • y holds the labels for the y axis (in this case they come from the index)
  • data is the pandas DataFrame we pass to the function

In [527]:
sns.barplot(x="count", y=df[:20].index, data=df[:20])

<matplotlib.axes._subplots.AxesSubplot at 0x7fd1d1ca32e8>

Let's now plot the distribution of named entity types instead.

In [528]:
# we initialize an empty dictionary
counts = {}

# we iterate through all NE tags
for tag in nltk_tags:
    # we check if our dictionary already contains an item
    # for that specific entity type
    if tag in counts:
        # if it does, we just increase the counter by 1
        counts[tag] += 1
    else:
        # otherwise we add it and set it to 1
        counts[tag] = 1
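The same tally can also be obtained with collections.Counter from the standard library. This is only an alternative sketch, not the approach used in the notebook; toy_tags is a made-up list standing in for nltk_tags:

```python
from collections import Counter

# Hypothetical stand-in for the real list of NE tags.
toy_tags = ["O", "PER", "O", "LOC", "PER", "O"]

# Counter does the "add it or increase it" bookkeeping for us.
counts = Counter(toy_tags)

# A Counter behaves like a dictionary, so it can be turned into one
# (or passed on wherever a dictionary of counts is expected).
print(dict(counts))
```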

In [529]:
df_types = pd.DataFrame.from_dict(counts, orient="index")
df_types.columns = ["count"]

LOC 22
ORG 40
PER 113
O 8001

We can generate a basic pie chart by using the plot method of a dataframe:

In [530]:
df_types.plot(y="count", kind="pie")

<matplotlib.axes._subplots.AxesSubplot at 0x7fd1e9209fd0>

Which is equivalent to:

In [447]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fd1d1b828d0>

Now, this is not very useful or readable: the O labels (8001 tokens) dwarf the actual entity types. Let's try to remove the O labels.

We just modify the code used above and add an if statement:

In [531]:
# we initialize an empty dictionary
counts = {}

# we iterate through all NE tags
for tag in nltk_tags:
    # do something only if `tag` is not "O"
    if tag != "O":
        # we check if our dictionary already contains an item
        # for that specific entity type
        if tag in counts:
            # if it does, we just increase the counter by 1
            counts[tag] += 1
        else:
            # otherwise we add it and set it to 1
            counts[tag] = 1

In [532]:
df_types = pd.DataFrame.from_dict(counts, orient="index")
df_types.columns = ["count"]

LOC 22
ORG 40
PER 113
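To read these counts as proportions (which is what a pie chart displays), a small sketch like the following can help. It hard-codes the three counts printed above; the variable names total and shares are illustrative only:

```python
# Entity type counts as printed above (O labels excluded).
counts = {"LOC": 22, "ORG": 40, "PER": 113}

# Total number of tokens tagged as some entity type.
total = sum(counts.values())

# Share of each entity type over the total.
shares = {tag: n / total for tag, n in counts.items()}

for tag, share in shares.items():
    print("%s: %.1f%%" % (tag, 100 * share))
```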

In [533]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fd1d409eb38>

Evaluating NER

To give a practical example of how to measure the accuracy of a NER system, we will use the dates extracted by Francesco in the first part of the session by using regular expressions.

data preparation

Let's read in the dates that were extracted previously:

In [534]:
import codecs
with codecs.open("data/iob/article_446_date_aut.iob","r","utf-8") as f:
    data = f.read()

In [535]:


We want to convert this into a list of lists, one per token:

In [536]:
iob_data_auto = [line.split("\t") for line in data.split("\n") if line!=""]
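To see what this one-liner does, here is a sketch on a tiny, made-up string in the same tab-separated layout (one token per line, three columns, the last one being the IOB label). The sample string and the names sample, rows, tokens and labels are assumptions for illustration, not part of the real file:

```python
# Made-up sample in the same layout as the .iob file:
# token <TAB> second column <TAB> IOB label, one token per line.
sample = "217\tCD\tB-DATE\nBCE\tDate\tI-DATE\n,\tO\tO\n"

# Split into lines, skip empty ones, then split each line on tabs.
rows = [line.split("\t") for line in sample.split("\n") if line != ""]

# Columns can then be pulled out by position.
tokens = [row[0] for row in rows]
labels = [row[2] for row in rows]

print(rows)
print(labels)
```

The `if line != ""` filter matters because a trailing newline would otherwise produce an empty final row.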

In [537]:

[['The', 'O', 'O'],
 ['Praetorian', 'O', 'O'],
 ['Proconsuls', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['Roman', 'O', 'O'],
 ['Republic', 'O', 'O'],
 ['45', 'CD', 'B-DATE'],
 ['FREDERIK', 'O', 'O'],
 ['JULIAAN', 'O', 'O'],
 ['VERVAET', 'O', 'O'],
 ['The', 'O', 'O'],
 ['Praetorian', 'O', 'O'],
 ['Proconsuls', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['Roman', 'O', 'O'],
 ['Republic', 'O', 'O'],
 ['(', 'O', 'O'],
 ['211–52', 'Date', 'B-DATE'],
 ['BCE', 'Date', 'I-DATE'],
 [')', 'O', 'O'],
 ['.', 'O', 'O'],
 ['A', 'O', 'O'],
 ['Constitutional', 'O', 'O'],
 ['Survey', 'O', 'O'],
 ['1', 'CD', 'B-DATE'],
 ['.', 'O', 'O'],
 ['Introduction', 'O', 'O'],
 ['The', 'O', 'O'],
 ['republican', 'O', 'O'],
 ['administrative', 'O', 'O'],
 ['procedure', 'O', 'O'],
 ['of', 'O', 'O'],
 ['sending', 'O', 'O'],
 ['out', 'O', 'O'],
 ['praetors', 'O', 'O'],
 ['with', 'O', 'O'],
 ['consular', 'O', 'O'],
 ['imperium', 'O', 'O'],
 ['is', 'O', 'O'],
 ['reasonably', 'O', 'O'],
 ['well-known', 'O', 'O'],
 ['but', 'O', 'O'],
 ['little', 'O', 'O'],
 ['understood', 'O', 'O'],
 ['.', 'O', 'O'],
 ['To', 'O', 'O'],
 ['the', 'O', 'O'],
 ['best', 'O', 'O'],
 ['of', 'O', 'O'],
 ['my', 'O', 'O'],
 ['knowledge', 'O', 'O'],
 [',', 'O', 'O'],
 ['not', 'O', 'O'],
 ['a', 'O', 'O'],
 ['single', 'O', 'O'],
 ['study', 'O', 'O'],
 ['or', 'O', 'O'],
 ['book', 'O', 'O'],
 ['chapter', 'O', 'O'],
 ['has', 'O', 'O'],
 ['been', 'O', 'O'],
 ['devoted', 'O', 'O'],
 ['exclusively', 'O', 'O'],
 ['to', 'O', 'O'],
 ['a', 'O', 'O'],
 ['gubernatorial', 'O', 'O'],
 ['practice', 'O', 'O'],
 ['that', 'O', 'O'],
 ['rapidly', 'O', 'O'],
 ['gained', 'O', 'O'],
 ['importance', 'O', 'O'],
 ['from', 'O', 'O'],
 ['the', 'O', 'O'],
 ['Second', 'O', 'O'],
 ['Punic', 'O', 'O'],
 ['War', 'O', 'O'],
 ['.', 'O', 'O'],
 ['This', 'O', 'O'],
 ['bipartite', 'O', 'O'],
 ['study', 'O', 'O'],
 ['aims', 'O', 'O'],
 ['at', 'O', 'O'],
 ['bridging', 'O', 'O'],
 ['this', 'O', 'O'],
 ['remarkable', 'O', 'O'],
 ['gap', 'O', 'O'],
 ['.', 'O', 'O'],
 ['The', 'O', 'O'],
 ['first', 'O', 'O'],
 ['component', 'O', 'O'],
 ['of', 'O', 'O'],
 ['this', 'O', 'O'],
 ['inquiry', 'O', 'O'],
 ['endeavours', 'O', 'O'],
 ['to', 'O', 'O'],
 ['offer', 'O', 'O'],
 ['an', 'O', 'O'],
 ['overall', 'O', 'O'],
 ['constitutional', 'O', 'O'],
 ['survey', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['institutional', 'O', 'O'],
 ['phenomenon', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['praetura', 'O', 'O'],
 ['pro', 'O', 'O'],
 ['consule', 'O', 'O'],
 ['by', 'O', 'O'],
 ['discussing', 'O', 'O'],
 ['its', 'O', 'O'],
 ['origins', 'O', 'O'],
 [',', 'O', 'O'],
 ['nature', 'O', 'O'],
 ['and', 'O', 'O'],
 ['historical', 'O', 'O'],
 ['development', 'O', 'O'],
 ['.', 'O', 'O'],
 ['The', 'O', 'O'],
 ['second', 'O', 'O'],
 ['part', 'O', 'O'],
 ['is', 'O', 'O'],
 ['conducted', 'O', 'O'],
 ['by', 'O', 'O'],
 ['F.', 'O', 'O'],
 ['Hurlet', 'O', 'O'],
 ['and', 'O', 'O'],
 ['scrutinizes', 'O', 'O'],
 ['this', 'O', 'O'],
 ['practice', 'O', 'O'],
 ['as', 'O', 'O'],
 ['recorded', 'O', 'O'],
 ['in', 'O', 'O'],
 ['the', 'O', 'O'],
 ['fasti', 'O', 'O'],
 ['of', 'O', 'O'],
 ['Africa', 'O', 'O'],
 [',', 'O', 'O'],
 ['Sicily', 'O', 'O'],
 ['and', 'O', 'O'],
 ['Corsica-Sardinia', 'O', 'O'],
 ['.', 'O', 'O'],
 ['After', 'O', 'O'],
 ['highlighting', 'O', 'O'],
 ['the', 'O', 'O'],
 ['significance', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['Metilian', 'O', 'O'],
 ['Law', 'O', 'O'],
 ['from', 'O', 'O'],
 ['217', 'CD', 'B-DATE'],
 ['BCE', 'Date', 'I-DATE'],
 ['as', 'O', 'O'],
 ['a', 'O', 'O'],
 ['precedent', 'O', 'O'],
 [',', 'O', 'O'],
 ['the', 'O', 'O'],
 ['next', 'O', 'O'],
 ['section', 'O', 'O'],
 ['of', 'O', 'O'],
 ['this', 'O', 'O'],
 ['study', 'O', 'O'],
 ['provides', 'O', 'O'],
 ['a', 'O', 'O'],
 ['detailed', 'O', 'O'],
 ['discussion', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['(', 'O', 'O'],
 ['historical', 'O', 'O'],
 ['circumstances', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 [')', 'O', 'O'],
 ['creation', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['praetorian', 'O', 'O'],
 ['proconsulship', 'O', 'O'],
 ['.', 'O', 'O'],
 ['This', 'O', 'O'],
 ['is', 'O', 'O'],
 ['followed', 'O', 'O'],
 ['by', 'O', 'O'],
 ['a', 'O', 'O'],
 ['brief', 'O', 'O'],
 ['assessment', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['proliferation', 'O', 'O'],
 ['of', 'O', 'O'],
 ['this', 'O', 'O'],
 ['administrative', 'O', 'O'],
 ['practice', 'O', 'O'],
 ['throughout', 'O', 'O'],
 ['the', 'O', 'O'],
 ['Republic', 'O', 'O'],
 ['’s', 'O', 'O'],
 ['growing', 'O', 'O'],
 ['number', 'O', 'O'],
 ['of', 'O', 'O'],
 ['provinces', 'O', 'O'],
 ['.', 'O', 'O'],
 ['The', 'O', 'O'],
 ['subsequent', 'O', 'O'],
 ['two', 'O', 'O'],
 ['sections', 'O', 'O'],
 ['concern', 'O', 'O'],
 ['the', 'O', 'O'],
 ['official', 'O', 'O'],
 ['authority', 'O', 'O'],
 ['and', 'O', 'O'],
 ['nomenclature', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['praetorian', 'O', 'O'],
 ['proconsuls', 'O', 'O'],
 [',', 'O', 'O'],
 ['followed', 'O', 'O'],
 ['by', 'O', 'O'],
 ['a', 'O', 'O'],
 ['discussion', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['frequency', 'O', 'O'],
 ['and', 'O', 'O'],
 ['rationale', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['practice', 'O', 'O'],
 ['as', 'O', 'O'],
 ['well', 'O', 'O'],
 ['as', 'O', 'O'],
 ['the', 'O', 'O'],
 ['discretion', 'O', 'O'],
 ['exercised', 'O', 'O'],
 ['by', 'O', 'O'],
 ['the', 'O', 'O'],
 ['Senate', 'O', 'O'],
 ['in', 'O', 'O'],
 ['the', 'O', 'O'],
 ['relevant', 'O', 'O'],
 ['decision-making', 'O', 'O'],
 ['process', 'O', 'O'],
 ['.', 'O', 'O'],
 ['The', 'O', 'O'],
 ['final', 'O', 'O'],
 ['section', 'O', 'O'],
 [',', 'O', 'O'],
 ['then', 'O', 'O'],
 [',', 'O', 'O'],
 ['addresses', 'O', 'O'],
 ['the', 'O', 'O'],
 ['temporary', 'O', 'O'],
 ['abolition', 'O', 'O'],
 ['as', 'O', 'O'],
 ['well', 'O', 'O'],
 ['as', 'O', 'O'],
 ['the', 'O', 'O'],
 ['eventual', 'O', 'O'],
 ['generalization', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['praetorian', 'O', 'O'],
 ['proconsulship', 'O', 'O'],
 ['in', 'O', 'O'],
 ['the', 'O', 'O'],
 ['transitional', 'O', 'O'],
 ['period', 'O', 'O'],
 ['53/52–27', 'O', 'O'],
 ['BCE', 'Date', 'B-DATE'],
 ['.', 'O', 'O'],
 ['The', 'O', 'O'],
 ['epilogue', 'O', 'O'],
 ['ponders', 'O', 'O'],
 ['the', 'O', 'O'],
 ['question', 'O', 'O'],
 ['whether', 'O', 'O'],
 ['or', 'O', 'O'],
 ['not', 'O', 'O'],
 ['the', 'O', 'O'],
 ['command', 'O', 'O'],
 ['held', 'O', 'O'],
 ['in', 'O', 'O'],
 ['215', 'CD', 'B-DATE'],
 ['BCE', 'Date', 'I-DATE'],
 ['by', 'O', 'O'],
 ['M.', 'O', 'O'],
 ['Claudius', 'O', 'O'],
 ['Marcellus', 'O', 'O'],
 ['(', 'O', 'O'],
 ['cos', 'O', 'O'],
 ['.', 'O', 'O'],
 ['222', 'CD', 'B-DATE'],
 [',', 'O', 'O'],
 ['pr', 'O', 'O'],
 ['.', 'O', 'O'],
 ['II', 'O', 'O'],
 ['216', 'CD', 'B-DATE'],
 [')', 'O', 'O'],
 ['can', 'O', 'O'],
 ['be', 'O', 'O'],
 ['considered', 'O', 'O'],
 ['as', 'O', 'O'],
 ['the', 'O', 'O'],
 ['first', 'O', 'O'],
 ['historically', 'O', 'O'],
 ['attested', 'O', 'O'],
 ['case', 'O', 'O'],
 ['of', 'O', 'O'],
 ['a', 'O', 'O'],
 ['(', 'O', 'O'],
 ['pro', 'O', 'O'],
 [')', 'O', 'O'],
 ['praetor(e)', 'O', 'O'],
 ['pro', 'O', 'O'],
 ['consule', 'O', 'O'],
 ['.', 'O', 'O'],
 ['I', 'O', 'O'],
 ['would', 'O', 'O'],
 ['like', 'O', 'O'],
 ['to', 'O', 'O'],
 ['express', 'O', 'O'],
 ['my', 'O', 'O'],
 ['profound', 'O', 'O'],
 ['gratitude', 'O', 'O'],
 ['to', 'O', 'O'],
 ['both', 'O', 'O'],
 ['anonymous', 'O', 'O'],
 ['referees', 'O', 'O'],
 ['for', 'O', 'O'],
 ['their', 'O', 'O'],
 ['extensive', 'O', 'O'],
 ['and', 'O', 'O'],
 ['detailed', 'O', 'O'],
 ['comments', 'O', 'O'],
 ['and', 'O', 'O'],
 ['suggestions', 'O', 'O'],
 ['which', 'O', 'O'],
 ['much', 'O', 'O'],
 ['helped', 'O', 'O'],
 ['to', 'O', 'O'],
 ['improve', 'O', 'O'],
 ['this', 'O', 'O'],
 ['study', 'O', 'O'],
 ['.', 'O', 'O'],
 ['Any', 'O', 'O'],
 ['remaining', 'O', 'O'],
 ['flaws', 'O', 'O'],
 ['and', 'O', 'O'],
 ['errors', 'O', 'O'],
 ['are', 'O', 'O'],
 ['my', 'O', 'O'],
 ['own', 'O', 'O'],
 ['.', 'O', 'O'],
 ['I', 'O', 'O'],
 ['also', 'O', 'O'],
 ['wish', 'O', 'O'],
 ['to', 'O', 'O'],
 ['thank', 'O', 'O'],
 ['my', 'O', 'O'],
 ['colleague', 'O', 'O'],
 ['and', 'O', 'O'],
 ['long-standing', 'O', 'O'],
 ['friend', 'O', 'O'],
 [',', 'O', 'O'],
 ['Frédéric', 'O', 'O'],
 ['Hurlet', 'O', 'O'],
 [',', 'O', 'O'],
 ['for', 'O', 'O'],
 ['the', 'O', 'O'],
 ['pleasant', 'O', 'O'],
 ['and', 'O', 'O'],
 ['rewarding', 'O', 'O'],
 ['collaboration', 'O', 'O'],
 ['.', 'O', 'O'],
 ['1', 'CD', 'B-DATE'],
 ['For', 'O', 'O'],
 ['a', 'O', 'O'],
 ['rare', 'O', 'O'],
 ['study', 'O', 'O'],
 ['into', 'O', 'O'],
 ['the', 'O', 'O'],
 ['praetorian', 'O', 'O'],
 ['proconsuls', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['early', 'O', 'O'],
 ['Empire', 'O', 'O'],
 [',', 'O', 'O'],
 ['see', 'O', 'O'],
 ['Eck', 'O', 'O'],
 ['1972/1973', 'O', 'O'],
 [',', 'O', 'O'],
 ['233–260', 'Date', 'B-DATE'],
 ['.', 'O', 'O'],
 ['46', 'CD', 'B-DATE'],
 ['2', 'CD', 'I-DATE'],
 ['.', 'O', 'O'],
 ['The', 'O', 'O'],
 ['Metilian', 'O', 'O'],
 ['Law', 'O', 'O'],
 ['(', 'O', 'O'],
 ['217', 'CD', 'B-DATE'],
 ['BCE', 'Date', 'I-DATE'],
 [')', 'O', 'O'],
 ['as', 'O', 'O'],
 ['a', 'O', 'O'],
 ['precedent', 'O', 'O'],
 ['In', 'O', 'O'],
 ['the', 'O', 'O'],
 ['aftermath', 'O', 'O'],
 ['of', 'O', 'O'],
 ['Hannibal', 'O', 'O'],
 ['’s', 'O', 'O'],
 ['crushing', 'O', 'O'],
 ['victory', 'O', 'O'],
 ['at', 'O', 'O'],
 ['Lake', 'O', 'O'],
 ['Trasimene', 'O', 'O'],
 ['in', 'O', 'O'],
 ['217', 'CD', 'B-DATE'],
 [',', 'O', 'O'],
 ['Senate', 'O', 'O'],
 ['and', 'O', 'O'],
 ['People', 'O', 'O'],
 ['decided', 'O', 'O'],
 ['on', 'O', 'O'],
 ['a', 'O', 'O'],
 ['series', 'O', 'O'],
 ['of', 'O', 'O'],
 ['unprecedented', 'O', 'O'],
 ['measures', 'O', 'O'],
 ['.', 'O', 'O'],
 ['First', 'O', 'O'],
 [',', 'O', 'O'],
 ['circumstances', 'O', 'O'],
 ['led', 'O', 'O'],
 ['the', 'O', 'O'],
 ['Senate', 'O', 'O'],
 ['to', 'O', 'O'],
 ['arrange', 'O', 'O'],
 ['for', 'O', 'O'],
 ['a', 'O', 'O'],
 ['direct', 'O', 'O'],
 ['popular', 'O', 'O'],
 ['election', 'O', 'O'],
 ['of', 'O', 'O'],
 ['a', 'O', 'O'],
 ['dictator', 'O', 'O'],
 ['as', 'O', 'O'],
 ['well', 'O', 'O'],
 ['as', 'O', 'O'],
 ['his', 'O', 'O'],
 ['magister', 'O', 'O'],
 ['equitum', 'O', 'O'],
 [',', 'O', 'O'],
 ['both', 'O', 'O'],
 ['key', 'O', 'O'],
 ['positions', 'O', 'O'],
 ['being', 'O', 'O'],
 ['won', 'O', 'O'],
 ['by', 'O', 'O'],
 ['Q.', 'O', 'O'],
 ['Fabius', 'O', 'O'],
 ['Maximus', 'O', 'O'],
 ['Verrucosus', 'O', 'O'],
 ['(', 'O', 'O'],
 ['cos', 'O', 'O'],
 ['.', 'O', 'O'],
 ['233', 'CD', 'B-DATE'],
 [',', 'O', 'O'],
 ['II', 'O', 'O'],
 ['228', 'CD', 'B-DATE'],
 [',', 'O', 'O'],
 ['III', 'O', 'O'],
 ['215', 'CD', 'B-DATE'],
 [',', 'O', 'O'],
 ['IV', 'O', 'O'],
 ['214', 'CD', 'B-DATE'],
 [',', 'O', 'O'],
 ['V', 'O', 'O'],
 ['209', 'CD', 'B-DATE'],
 [')', 'O', 'O'],
 ['and', 'O', 'O'],
 ['M.', 'O', 'O'],
 ['Minucius', 'O', 'O'],
 ['Rufus', 'O', 'O'],
 ['(', 'O', 'O'],
 ['cos', 'O', 'O'],
 ['.', 'O', 'O'],
 ['221', 'CD', 'B-DATE'],
 [')', 'O', 'O'],
 ['respectively', 'O', 'O'],
 ['.', 'O', 'O'],
 ['Several', 'O', 'O'],
 ['factors', 'O', 'O'],
 ['combined', 'O', 'O'],
 ['to', 'O', 'O'],
 ['set', 'O', 'O'],
 ['the', 'O', 'O'],
 ['stage', 'O', 'O'],
 ['for', 'O', 'O'],
 ['yet', 'O', 'O'],
 ['another', 'O', 'O'],
 ['dramatic', 'O', 'O'],
 ['precedent', 'O', 'O'],
 [':', 'O', 'O'],
 ['widespread', 'O', 'O'],
 ['popular', 'O', 'O'],
 ['discontent', 'O', 'O'],
 ['at', 'O', 'O'],
 ['Hannibal', 'O', 'O'],
 ['’s', 'O', 'O'],
 ['targeted', 'O', 'O'],
 ['destructions', 'O', 'O'],
 ['and', 'O', 'O'],
 ['Fabius', 'O', 'O'],
 ['’', 'O', 'O'],
 ['evasive', 'O', 'O'],
 ['strategy', 'O', 'O'],
 [';', 'O', 'O'],
 ['relentless', 'O', 'O'],
 ['calls', 'O', 'O'],
 ['for', 'O', 'O'],
 ['a', 'O', 'O'],
 ['more', 'O', 'O'],
 ['aggressive', 'O', 'O'],
 ['approach', 'O', 'O'],
 ['on', 'O', 'O'],
 ['the', 'O', 'O'],
 ['part', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['hawkish', 'O', 'O'],
 ['magister', 'O', 'O'],
 ['equitum', 'O', 'O'],
 [';', 'O', 'O'],
 ['and', 'O', 'O'],
 ['profound', 'O', 'O'],
 ['senatorial', 'O', 'O'],
 ['dissatisfaction', 'O', 'O'],
 ['at', 'O', 'O'],
 ['the', 'O', 'O'],
 ['dictator', 'O', 'O'],
 ['’s', 'O', 'O'],
 ['decision', 'O', 'O'],
 ['to', 'O', 'O'],
 ['exchange', 'O', 'O'],
 ['prisoners', 'O', 'O'],
 ['without', 'O', 'O'],
 ['prior', 'O', 'O'],
 ['senatorial', 'O', 'O'],
 ['consent', 'O', 'O'],
 ['.', 'O', 'O'],
 ['Finally', 'O', 'O'],
 [',', 'O', 'O'],
 ['news', 'O', 'O'],
 ['of', 'O', 'O'],
 ['a', 'O', 'O'],
 ['successful', 'O', 'O'],
 ['engagement', 'O', 'O'],
 ['between', 'O', 'O'],
 ['the', 'O', 'O'],
 ['magister', 'O', 'O'],
 ['equitum', 'O', 'O'],
 ['and', 'O', 'O'],
 ['Hannibal', 'O', 'O'],
 ['’s', 'O', 'O'],
 ['forces', 'O', 'O'],
 ['prompted', 'O', 'O'],
 ['the', 'O', 'O'],
 ['Senate', 'O', 'O'],
 ['to', 'O', 'O'],
 ['instruct', 'O', 'O'],
 ['the', 'O', 'O'],
 ['tribune', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['plebs', 'O', 'O'],
 ['M.', 'O', 'O'],
 ['Metilius', 'O', 'O'],
 ['to', 'O', 'O'],
 ['put', 'O', 'O'],
 ['forward', 'O', 'O'],
 ['a', 'O', 'O'],
 ['rogatio', 'O', 'O'],
 ['de', 'O', 'O'],
 ['aequando', 'O', 'O'],
 ['M.', 'O', 'O'],
 ['Minuci', 'O', 'O'],
 ['magistri', 'O', 'O'],
 ['equitum', 'O', 'O'],
 ['et', 'O', 'O'],
 ['Q.', 'O', 'O'],
 ['Fabi', 'O', 'O'],
 ['dictatoris', 'O', 'O'],
 ['iure', 'O', 'O'],
 ['.', 'O', 'O'],
 ['In', 'O', 'O'],
 ['terms', 'O', 'O'],
 ['of', 'O', 'O'],
 ['public', 'O', 'O'],
 ['law', 'O', 'O'],
 [',', 'O', 'O'],
 ['the', 'O', 'O'],
 ['results', 'O', 'O'],
 ['of', 'O', 'O'],
 ['this', 'O', 'O'],
 ['Metilian', 'O', 'O'],
 ['Law', 'O', 'O'],
 ['were', 'O', 'O'],
 ['twofold', 'O', 'O'],
 ['.', 'O', 'O'],
 ['First', 'O', 'O'],
 [',', 'O', 'O'],
 ['it', 'O', 'O'],
 ['upgraded', 'O', 'O'],
 ['the', 'O', 'O'],
 ['consular', 'O', 'O'],
 ['imperium', 'O', 'O'],
 ['of', 'O', 'O'],
 ['M.', 'O', 'O'],
 ['Minucius', 'O', 'O'],
 ['by', 'O', 'O'],
 ['redefining', 'O', 'O'],
 ['it', 'O', 'O'],
 ['as', 'O', 'O'],
 ['dictatorium', 'O', 'O'],
 ['.', 'O', 'O'],
 ['Second', 'O', 'O'],
 [',', 'O', 'O'],
 ['and', 'O', 'O'],
 ['just', 'O', 'O'],
 ['as', 'O', 'O'],
 ['remarkably', 'O', 'O'],
 [',', 'O', 'O'],
 ['it', 'O', 'O'],
 ['provided', 'O', 'O'],
 ['that', 'O', 'O'],
 ['the', 'O', 'O'],
 ['magister', 'O', 'O'],
 ['equitum', 'O', 'O'],
 ['was', 'O', 'O'],
 ['to', 'O', 'O'],
 ['command', 'O', 'O'],
 ['on', 'O', 'O'],
 ['a', 'O', 'O'],
 ['footing', 'O', 'O'],
 ['of', 'O', 'O'],
 ['equality', 'O', 'O'],
 ['with', 'O', 'O'],
 ['the', 'O', 'O'],
 ['dictator', 'O', 'O'],
 [',', 'O', 'O'],
 ['very', 'O', 'O'],
 ['much', 'O', 'O'],
 ['on', 'O', 'O'],
 ['the', 'O', 'O'],
 ['model', 'O', 'O'],
 ['of', 'O', 'O'],
 ['a', 'O', 'O'],
 ['joint', 'O', 'O'],
 ['consular', 'O', 'O'],
 ['command', 'O', 'O'],
 ['.', 'O', 'O'],
 ['By', 'O', 'O'],
 ['virtue', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['Metilian', 'O', 'O'],
 ['Law', 'O', 'O'],
 [',', 'O', 'O'],
 ['each', 'O', 'O'],
 ['man', 'O', 'O'],
 ['now', 'O', 'O'],
 ['commanded', 'O', 'O'],
 ['not', 'O', 'O'],
 ['only', 'O', 'O'],
 ['eodem', 'O', 'O'],
 ['imperio', 'O', 'O'],
 [',', 'O', 'O'],
 ['but', 'O', 'O'],
 ['also', 'O', 'O'],
 ['pari', 'O', 'O'],
 ['imperio', 'O', 'O'],
 ['.', 'O', 'O'],
 ['Unsurprisingly', 'O', 'O'],
 [',', 'O', 'O'],
 ['Minucius', 'O', 'O'],
 ['immediately', 'O', 'O'],
 ['insisted', 'O', 'O'],
 ['on', 'O', 'O'],
 ['dividing', 'O', 'O'],
 ['up', 'O', 'O'],
 ['the', 'O', 'O'],
 ['legions', 'O', 'O'],
 [',', 'O', 'O'],
 ['which', 'O', 'O'],
 ['were', 'O', 'O'],
 ['saved', 'O', 'O'],
 ['from', 'O', 'O'],
 ['utter', 'O', 'O'],
 ['destruction', 'O', 'O'],
 ['only', 'O', 'O'],
 ['by', 'O', 'O'],
 ['the', 'O', 'O'],
 ['dictator', 'O', 'O'],
 ['’s', 'O', 'O'],
 ['timely', 'O', 'O'],
 ['intervention', 'O', 'O'],
 ['.', 'O', 'O'],
 ['After', 'O', 'O'],
 ['Minucius', 'O', 'O'],
 ['himself', 'O', 'O'],
 ['had', 'O', 'O'],
 ['dramatically', 'O', 'O'],
 ['renounced', 'O', 'O'],
 ['the', 'O', 'O'],
 ['Metilian', 'O', 'O'],
 ['Law', 'O', 'O'],
 ['by', 'O', 'O'],
 ['submitting', 'O', 'O'],
 ['himself', 'O', 'O'],
 ['again', 'O', 'O'],
 ['to', 'O', 'O'],
 ['Fabius', 'O', 'O'],
 ['’', 'O', 'O'],
 ['supreme', 'O', 'O'],
 ['command', 'O', 'O'],
 [',', 'O', 'O'],
 ['SPQR', 'O', 'O'],
 ['officially', 'O', 'O'],
 ['repealed', 'O', 'O'],
 ['the', 'O', 'O'],
 ['plebiscite', 'O', 'O'],
 ['.', 'O', 'O'],
 ['Although', 'O', 'O'],
 ['the', 'O', 'O'],
 ['legal', 'O', 'O'],
 ['effect', 'O', 'O'],
 ['of', 'O', 'O'],
 ['this', 'O', 'O'],
 ['ad', 'O', 'O'],
 ['hoc', 'O', 'O'],
 ['statute', 'O', 'O'],
 ['thus', 'O', 'O'],
 ['was', 'O', 'O'],
 ['shortlived', 'O', 'O'],
 [',', 'O', 'O'],
 ['there', 'O', 'O'],
 ['is', 'O', 'O'],
 [',', 'O', 'O'],
 ['nonetheless', 'O', 'O'],
 [',', 'O', 'O'],
 ['every', 'O', 'O'],
 ['indication', 'O', 'O'],
 ['that', 'O', 'O'],
 ['its', 'O', 'O'],
 ['historic', 'O', 'O'],
 ['significance', 'O', 'O'],
 ['was', 'O', 'O'],
 ['tremendous', 'O', 'O'],
 ['.', 'O', 'O'],
 ['Amongst', 'O', 'O'],
 ['other', 'O', 'O'],
 ['things', 'O', 'O'],
 [',', 'O', 'O'],
 ['it', 'O', 'O'],
 ['demonstrated', 'O', 'O'],
 ['that', 'O', 'O'],
 ['it', 'O', 'O'],
 ['was', 'O', 'O'],
 ['perfectly', 'O', 'O'],
 ['possible', 'O', 'O'],
 ['to', 'O', 'O'],
 ['upgrade', 'O', 'O'],
 ['the', 'O', 'O'],
 ['genus', 'O', 'O'],
 ['imperii', 'O', 'O'],
 ['of', 'O', 'O'],
 ['a', 'O', 'O'],
 ['certain', 'O', 'O'],
 ['official', 'O', 'O'],
 ['(', 'O', 'O'],
 ['with', 'O', 'O'],
 ['full', 'O', 'O'],
 ['imperium', 'O', 'O'],
 ['auspiciumque', 'O', 'O'],
 [')', 'O', 'O'],
 ['to', 'O', 'O'],
 ['the', 'O', 'O'],
 ['level', 'O', 'O'],
 ['of', 'O', 'O'],
 ['a', 'O', 'O'],
 ['genus', 'O', 'O'],
 ['imperii', 'O', 'O'],
 ['that', 'O', 'O'],
 ['was', 'O', 'O'],
 [',', 'O', 'O'],
 ['in', 'O', 'O'],
 ['terms', 'O', 'O'],
 ['of', 'O', 'O'],
 ['relative', 'O', 'O'],
 ['strength', 'O', 'O'],
 [',', 'O', 'O'],
 ['maius', 'O', 'O'],
 ['.', 'O', 'O'],
 ['The', 'O', 'O'],
 ['lex', 'O', 'O'],
 ['Metilia', 'O', 'O'],
 ['therefore', 'O', 'O'],
 ['set', 'O', 'O'],
 ['a', 'O', 'O'],
 ['precedent', 'O', 'O'],
 ['for', 'O', 'O'],
 ['the', 'O', 'O'],
 ['procedure', 'O', 'O'],
 ['of', 'O', 'O'],
 ['redefining', 'O', 'O'],
 ['and', 'O', 'O'],
 ['upgrading', 'O', 'O'],
 ['the', 'O', 'O'],
 ['praetorium', 'O', 'O'],
 ['imperium', 'O', 'O'],
 ['of', 'O', 'O'],
 ['certain', 'O', 'O'],
 ['praetors', 'O', 'O'],
 ['to', 'O', 'O'],
 ['consulare', 'O', 'O'],
 ['imperium', 'O', 'O'],
 ['.', 'O', 'O'],
 ['The', 'O', 'O'],
 ['next', 'O', 'O'],
 ['section', 'O', 'O'],
 ['of', 'O', 'O'],
 ['this', 'O', 'O'],
 ['study', 'O', 'O'],
 ['will', 'O', 'O'],
 ['2', 'CD', 'B-DATE'],
 ['The', 'O', 'O'],
 ['traditional', 'O', 'O'],
 ['genera', 'O', 'O'],
 ['imperii', 'O', 'O'],
 ['consisted', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['praetorium', 'O', 'O'],
 ['imperium', 'O', 'O'],
 ['(', 'O', 'O'],
 ['cf.', 'O', 'O'],
 ['Cic', 'O', 'O'],
 ['.', 'O', 'O'],
 ['Pis', 'O', 'O'],
 ['.', 'O', 'O'],
 ['38', 'CD', 'B-DATE'],
 [';', 'O', 'O'],
 ['Verr', 'O', 'O'],
 ['.', 'O', 'O'],
 ['2.', 'O', 'O'],
 ['.', 'O', 'O'],
 [';', 'O', 'O'],
 ['Diu', 'O', 'O'],
 ['.', 'O', 'O'],
 ['1', 'CD', 'B-DATE'],
 ['.', 'O', 'O'],
 [')', 'O', 'O'],
 [',', 'O', 'O'],
 ['the', 'O', 'O'],
 ['consulare', 'O', 'O'],
 ['imperium', 'O', 'O'],
 [',', 'O', 'O'],
 ['and', 'O', 'O'],
 ['the', 'O', 'O'],
 ['dictatorium', 'O', 'O'],
 ['imperium', 'O', 'O'],
 ['(', 'O', 'O'],
 ['cf.', 'O', 'O'],
 ['Livy', 'O', 'O'],
 ['22.', 'O', 'O'],
 ['.', 'O', 'O'],
 [':', 'O', 'O'],
 ['dictatorio', 'O', 'O'],
 ['imperio', 'O', 'O'],
 ['–', 'O', 'O'],
 ['in', 'O', 'O'],
 ['Rep.', 'O', 'O'],
 ['2', 'CD', 'B-DATE'],
 ['.', 'O', 'O'],
 [',', 'O', 'O'],
 ['Cicero', 'O', 'O'],
 ['defines', 'O', 'O'],
 ['the', 'O', 'O'],
 ['imperium', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['dictatorship', 'O', 'O'],
 [',', 'O', 'O'],
 ['which', 'O', 'O'],
 ['was', 'O', 'O'],
 ['supposedly', 'O', 'O'],
 ['created', 'O', 'O'],
 ['ten', 'O', 'O'],
 ['years', 'O', 'O'],
 ['after', 'O', 'O'],
 ['the', 'O', 'O'],
 ['establishment', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['consulate', 'O', 'O'],
 [',', 'O', 'O'],
 ['as', 'O', 'O'],
 ['a', 'O', 'O'],
 ['nouum', 'O', 'O'],
 ['genus', 'O', 'O'],
 ['imperii', 'O', 'O'],
 ['[', 'O', 'O'],
 ['…', 'O', 'O'],
 [']', 'O', 'O'],
 ['proximum', 'O', 'O'],
 ['similitudine', 'O', 'O'],
 ['regiae', 'O', 'O'],
 [')', 'O', 'O'],
 ['.', 'O', 'O'],
 ['3', 'CD', 'B-DATE'],
 ['For', 'O', 'O'],
 ['a', 'O', 'O'],
 ['comprehensive', 'O', 'O'],
 ['analysis', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['Metilian', 'O', 'O'],
 ['Law', 'O', 'O'],
 ['and', 'O', 'O'],
 ['its', 'O', 'O'],
 ['historic', 'O', 'O'],
 ['significance', 'O', 'O'],
 [',', 'O', 'O'],
 ['see', 'O', 'O'],
 ['Vervaet', 'O', 'O'],
 ['2007', 'CD', 'B-DATE'],
 [',', 'O', 'O'],
 ['197–232', 'Date', 'B-DATE'],
 ['.', 'O', 'O'],
 ['See', 'O', 'O'],
 ['esp', 'O', 'O'],
 ['.', 'O', 'O'],
 ['201–215', 'Date', 'B-DATE'],
 ['for', 'O', 'O'],
 ['the', 'O', 'O'],
 ['argument', 'O', 'O'],
 ['that', 'O', 'O'],
 ['Minucius', 'O', 'O'],
 ['’', 'O', 'O'],
 ['imperium', 'O', 'O'],
 ['was', 'O', 'O'],
 ['fully', 'O', 'O'],
 ['equated', 'O', 'O'],
 ['with', 'O', 'O'],
 ['that', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['dictator', 'O', 'O'],
 ['by', 'O', 'O'],
 ['virtue', 'O', 'O'],
 ['of', 'O', 'O'],
 ['a', 'O', 'O'],
 ['plebiscitum', 'O', 'O'],
 ['Metilium', 'O', 'O'],
 ['passed', 'O', 'O'],
 ['ex', 'O', 'O'],
 ['s.c.', 'O', 'O'],
 ['Although', 'O', 'O'],
 ['Livy', 'O', 'O'],
 ['clearly', 'O', 'O'],
 ['alludes', 'O', 'O'],
 ['to', 'O', 'O'],
 ['the', 'O', 'O'],
 ['hostility', 'O', 'O'],
 ['of', 'O', 'O'],
 ['the', 'O', 'O'],
 ['Senate', 'O', 'O'],
 ['against', 'O', 'O'],
 ['the', 'O', 'O'],
 ['dictator', 'O', 'O'],
 [',', 'O', 'O'],
 ['his', 'O', 'O'],
 ['narrative', 'O', 'O'],
 ['suppressed', 'O', 'O'],
 ['any', 'O', 'O'],
 ['direct', 'O', 'O'],
 ['reference', 'O', 'O'],
 ['to', 'O', 'O'],
 ['this', 'O', 'O'],
 ['unprecedented', 'O', 'O'],
 ['s.c.', 'O', 'O'],

Let's now do the same for our ground truth (i.e. the manually corrected data):

In [538]:
with codecs.open("data/iob/article_446_date_GOLD.iob","r","utf-8") as f:
    data = f.read()

In [539]:
iob_data_gold = [line.split("\t") for line in data.split("\n") if line!=""]

In [540]:
list(zip(iob_data_gold[:50], iob_data_auto[:50]))

[(['The', 'O', 'O'], ['The', 'O', 'O']),
 (['Praetorian', 'O', 'O'], ['Praetorian', 'O', 'O']),
 (['Proconsuls', 'O', 'O'], ['Proconsuls', 'O', 'O']),
 (['of', 'O', 'O'], ['of', 'O', 'O']),
 (['the', 'O', 'O'], ['the', 'O', 'O']),
 (['Roman', 'O', 'O'], ['Roman', 'O', 'O']),
 (['Republic', 'O', 'O'], ['Republic', 'O', 'O']),
 (['45', 'CD', 'B-DATE'], ['45', 'CD', 'B-DATE']),
 (['FREDERIK', 'O', 'O'], ['FREDERIK', 'O', 'O']),
 (['JULIAAN', 'O', 'O'], ['JULIAAN', 'O', 'O']),
 (['VERVAET', 'O', 'O'], ['VERVAET', 'O', 'O']),
 (['The', 'O', 'O'], ['The', 'O', 'O']),
 (['Praetorian', 'O', 'O'], ['Praetorian', 'O', 'O']),
 (['Proconsuls', 'O', 'O'], ['Proconsuls', 'O', 'O']),
 (['of', 'O', 'O'], ['of', 'O', 'O']),
 (['the', 'O', 'O'], ['the', 'O', 'O']),
 (['Roman', 'O', 'O'], ['Roman', 'O', 'O']),
 (['Republic', 'O', 'O'], ['Republic', 'O', 'O']),
 (['(', 'O', 'O'], ['(', 'O', 'O']),
 (['211–52', 'Date', 'B-DATE'], ['211–52', 'Date', 'B-DATE']),
 (['BCE', 'Date', 'I-DATE'], ['BCE', 'Date', 'I-DATE']),
 ([')', 'O', 'O'], [')', 'O', 'O']),
 (['.', 'O', 'O'], ['.', 'O', 'O']),
 (['A', 'O', 'O'], ['A', 'O', 'O']),
 (['Constitutional', 'O', 'O'], ['Constitutional', 'O', 'O']),
 (['Survey', 'O', 'O'], ['Survey', 'O', 'O']),
 (['1', 'O', 'O'], ['1', 'CD', 'B-DATE']),
 (['.', 'O', 'O'], ['.', 'O', 'O']),
 (['Introduction', 'O', 'O'], ['Introduction', 'O', 'O']),
 (['The', 'O', 'O'], ['The', 'O', 'O']),
 (['republican', 'O', 'O'], ['republican', 'O', 'O']),
 (['administrative', 'O', 'O'], ['administrative', 'O', 'O']),
 (['procedure', 'O', 'O'], ['procedure', 'O', 'O']),
 (['of', 'O', 'O'], ['of', 'O', 'O']),
 (['sending', 'O', 'O'], ['sending', 'O', 'O']),
 (['out', 'O', 'O'], ['out', 'O', 'O']),
 (['praetors', 'O', 'O'], ['praetors', 'O', 'O']),
 (['with', 'O', 'O'], ['with', 'O', 'O']),
 (['consular', 'O', 'O'], ['consular', 'O', 'O']),
 (['imperium', 'O', 'O'], ['imperium', 'O', 'O']),
 (['is', 'O', 'O'], ['is', 'O', 'O']),
 (['reasonably', 'O', 'O'], ['reasonably', 'O', 'O']),
 (['well-known', 'O', 'O'], ['well-known', 'O', 'O']),
 (['but', 'O', 'O'], ['but', 'O', 'O']),
 (['little', 'O', 'O'], ['little', 'O', 'O']),
 (['understood', 'O', 'O'], ['understood', 'O', 'O']),
 (['.', 'O', 'O'], ['.', 'O', 'O']),
 (['To', 'O', 'O'], ['To', 'O', 'O']),
 (['the', 'O', 'O'], ['the', 'O', 'O']),
 (['best', 'O', 'O'], ['best', 'O', 'O'])]

In [541]:
auto_labels = [line[2] for line in iob_data_auto]

In [543]:
gold_labels = [line[2] for line in iob_data_gold]


Computing error types

The very first step is to classify each automatically assigned label into an error type, by comparing it with the corresponding gold label.

So we create a dictionary to store the counts of True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN).

In [546]:
errors = {
    "tp" : 0
    , "fp": 0
    , "tn" : 0
    , "fn" : 0
}

Remember the zip function to read two (or more) lists at a time? It's exactly what we need, so let's use it!

This is the kind of output it produces:

In [454]:
list(zip(gold_labels, auto_labels))[:10]

[('O', 'O'),
 ('O', 'O'),
 ('O', 'O'),
 ('O', 'O'),
 ('O', 'O'),
 ('O', 'O'),
 ('O', 'O'),
 ('B-DATE', 'B-DATE'),
 ('O', 'O'),
 ('O', 'O')]

In [456]:
assert len(gold_labels) == len(auto_labels)

In [547]:
# we iterate through the two lists of labels
for gold_label, auto_label in zip(gold_labels, auto_labels):
    # label is a negative entity => error type is TN or FP
    if gold_label == "O":
        if gold_label == auto_label:
            errors["tn"] += 1
        else:
            errors["fp"] += 1
    # label is a positive entity => error type is TP or FN
    else:
        if gold_label == auto_label:
            errors["tp"] += 1
        else:
            errors["fn"] += 1

In [548]:
errors

{'fn': 3, 'fp': 25, 'tn': 1725, 'tp': 36}
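
The same tally can also be written more compactly with `collections.Counter`. The following is only a sketch on an invented toy list of labels: `classify_pair`, `toy_gold` and `toy_auto` are hypothetical names that do not appear elsewhere in this notebook.

```python
from collections import Counter

def classify_pair(gold_label, auto_label):
    """Return the error type for one (gold, auto) pair of labels."""
    if gold_label == "O":
        # no entity in the gold standard: correct rejection or false alarm
        return "tn" if auto_label == "O" else "fp"
    # entity in the gold standard: exact match or miss
    return "tp" if auto_label == gold_label else "fn"

# toy labels, invented for illustration only
toy_gold = ["O", "B-DATE", "I-DATE", "O", "B-DATE"]
toy_auto = ["O", "B-DATE", "O", "B-DATE", "B-DATE"]

# count how often each error type occurs
toy_errors = Counter(classify_pair(g, a) for g, a in zip(toy_gold, toy_auto))
print(dict(toy_errors))
```

Note that, as in the loop above, a gold entity that receives the wrong entity label is counted as a FN.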

Computing precision, recall and F-score

Let's create one function to compute each of these measures.


Precision

Precision is the fraction of retrieved entities that are correct.

This measure takes into account the correctly identified entities (TPs) as well as those that were mistakenly tagged as entities (FPs), but does not consider the entities that were missed (FNs).

It's defined by the following formula:

$$p = \frac{tp}{tp+fp}$$

In [549]:
def calc_precision(d_errors):
    """Calculates the precision given the input error dictionary."""
    # avoid a division by zero when nothing was tagged as an entity
    if d_errors["tp"] + d_errors["fp"] == 0:
        return 0
    else:
        return d_errors["tp"] / (d_errors["tp"] + d_errors["fp"])

In [557]:
precision = calc_precision(errors)
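
As a quick sanity check, plugging the counts from the `errors` dictionary shown earlier (36 TPs and 25 FPs) into the formula by hand should give the value the function returns:

$$p = \frac{36}{36+25} = \frac{36}{61} \approx 0.59$$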


Recall

Recall is the fraction of correct entities that are retrieved by the system.

Recall does not consider the FPs but, instead, takes into account the FNs, i.e. the entities that were missed.

It's defined by the following formula:

$$r = \frac{tp}{tp+fn}$$

In [551]:
def calc_recall(d_errors):
    """Calculates the recall given the input error dictionary."""
    # avoid a division by zero when the gold standard contains no entities
    if d_errors["tp"] + d_errors["fn"] == 0:
        return 0
    else:
        return d_errors["tp"] / (d_errors["tp"] + d_errors["fn"])

In [558]:
recall = calc_recall(errors)
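
Again, the result can be checked by hand with the counts from the `errors` dictionary (36 TPs and 3 FNs):

$$r = \frac{36}{36+3} = \frac{36}{39} \approx 0.92$$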

F1 score

Finally, the F1 score is a global metric that combines both precision and recall giving them equal importance.

It's defined by the following formula:

$$f1 = \frac{2*p*r}{p+r}$$

In [553]:
def calc_fscore(d_errors):
    """Calculates the f-score given the input error dictionary."""
    prec = calc_precision(d_errors)
    rec = calc_recall(d_errors)
    # avoid a division by zero when both measures are zero
    if prec == 0 and rec == 0:
        return 0
    else:
        return 2*((prec * rec) / (prec + rec))

In [559]:
fscore = calc_fscore(errors)

In [560]:
print("Precision: {0:.2f}".format(precision))
print("Recall: {0:.2f}".format(recall))
print("F1-score: {0:.2f}".format(fscore))

Precision: 0.59
Recall: 0.92
F1-score: 0.72
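
Substituting the definitions of $p$ and $r$ into the F1 formula gives an equivalent form expressed directly in terms of the error counts. With the counts from the `errors` dictionary (36 TPs, 25 FPs, 3 FNs) it reproduces the score printed above:

$$f1 = \frac{2*p*r}{p+r} = \frac{2*tp}{2*tp+fp+fn} = \frac{72}{72+25+3} = 0.72$$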


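Accuracy

Accuracy is the fraction of all labels, positive and negative, that were predicted correctly.

It's defined by the following formula:

$$acc = \frac{tp+tn}{tp+tn+fp+fn}$$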

In [477]:
def calc_accuracy(d_errors):
    """Calculates the accuracy given the input error dictionary."""
    true_predictions = d_errors["tp"] + d_errors["tn"]
    false_predictions = d_errors["fp"] + d_errors["fn"]
    return true_predictions / (true_predictions + false_predictions)
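
Because the vast majority of tokens are labelled "O", accuracy is dominated by the TNs. The sketch below illustrates this with the counts from the `errors` dictionary shown earlier and with a purely hypothetical tagger that labels every token "O". The helper `accuracy` and both variable names are introduced here for illustration only.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all tokens whose label was predicted correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

# counts taken from the errors dictionary computed earlier
acc_tagger = accuracy(tp=36, fp=25, tn=1725, fn=3)

# hypothetical tagger that labels every token "O":
# the 36 + 3 gold entities all become FNs,
# the 1725 + 25 gold "O" tokens all become TNs
acc_all_o = accuracy(tp=0, fp=0, tn=1750, fn=39)

print("Tagger: {0:.3f}".format(acc_tagger))
print("All-O baseline: {0:.3f}".format(acc_all_o))
```

The two figures come out very close even though the second tagger finds no entity at all, which is why precision, recall and F1 are usually preferred for this kind of task.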

In [479]:
print("Precision: {0:.2f}".format(precision))
print("Recall: {0:.2f}".format(recall))
print("F1-score: {0:.2f}".format(fscore))

Precision: 0.56
Recall: 0.82
F1-score: 0.67

Using sklearn

The good news is that you don't really need to implement all these measures yourself, as there are libraries that can compute precision, recall and F-score for you.

Still, it's important to know what those scores mean and how they are obtained!

In [561]:
from sklearn.metrics import precision_recall_fscore_support

In [562]:
precision, recall, fscore, support = precision_recall_fscore_support(
    gold_labels
    , auto_labels
    , average="micro"
    , labels=["B-DATE","I-DATE"]
)

In [563]:
print("Precision: {0:.2f}".format(precision))
print("Recall: {0:.2f}".format(recall))
print("F1-score: {0:.2f}".format(fscore))

Precision: 0.58
Recall: 0.92
F1-score: 0.71
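
The sklearn scores differ slightly from the hand-computed ones. One plausible explanation, which we have not verified on the data above, is that the two approaches treat a confusion between B-DATE and I-DATE differently: the loop written earlier counts it once, as a FN, whereas micro-averaging over the two labels counts it both as a FN for the gold label and as a FP for the predicted label. The toy example below (labels invented for illustration) is a sketch of that effect.

```python
from sklearn.metrics import precision_recall_fscore_support

# toy labels, invented for illustration: the third token is a gold I-DATE
# that the tagger labelled B-DATE
toy_gold = ["O", "B-DATE", "I-DATE", "O", "B-DATE"]
toy_auto = ["O", "B-DATE", "B-DATE", "O", "B-DATE"]

# hand count, following the logic of the loop used earlier
tp = sum(g != "O" and g == a for g, a in zip(toy_gold, toy_auto))
fp = sum(g == "O" and a != "O" for g, a in zip(toy_gold, toy_auto))
manual_precision = tp / (tp + fp)

# micro-averaged scores over the two entity labels
p, r, f, _ = precision_recall_fscore_support(
    toy_gold, toy_auto, average="micro", labels=["B-DATE", "I-DATE"]
)

print("Manual precision: {0:.2f}".format(manual_precision))
print("sklearn precision: {0:.2f}".format(p))
print("sklearn recall: {0:.2f}".format(r))
```

In this toy case the hand count finds no FP at all, while sklearn counts the mislabelled token as a FP for B-DATE, so its precision is lower.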