This notebook accompanies the Sunokisis Digital Classics common session on Named Entity Extraction, see https://github.com/SunoikisisDC/SunoikisisDC-2016-2017/wiki/Named-Entity-Extraction-I.
In this notebook we are going to experiment with three different methods for extracting named entities from a Latin text.
External modules and libraries can be imported using import statements.
Let's import the Natural Language ToolKit (NLTK), the Classical Language ToolKit (CLTK), MyCapytain, and some local libraries that are used in this notebook.
In [20]:
########
# NLTK #
########
import nltk
from nltk.tag import StanfordNERTagger
########
# CLTK #
########
import cltk
from cltk.tag.ner import tag_ner
##############
# MyCapytain #
##############
import MyCapytain
from MyCapytain.resolvers.cts.api import HttpCTSResolver
from MyCapytain.retrievers.cts5 import CTS
from MyCapytain.common.constants import Mimetypes
#################
# other imports #
#################
import sys
sys.path.append("/opt/nlp/pymodules/")
from idai_journals.nlp import sub_leaves
More precisely, we are using the following versions:
In [4]:
print(nltk.__version__)
In [5]:
print(cltk.__version__)
In [6]:
print(MyCapytain.__version__)
To start with, we need some text from which we'll try to extract named entities using various methods and libraries.
There are several ways of doing this: for example, we could load it from a local file, download it as part of a corpus via cltk (cf. this blog post), or fetch it from a CTS API. Let's go for the CTS API :)
CTS URNs stand for Canonical Text Service Uniform Resource Names.
You can think of a CTS URN like a social security number for texts (or parts of texts).
Here are some examples of CTS URNs with different levels of granularity:
urn:cts:latinLit:phi0448 (Caesar)
urn:cts:latinLit:phi0448.phi001 (Caesar's De Bello Gallico)
urn:cts:latinLit:phi0448.phi001.perseus-lat2 (DBG, Latin edition)
urn:cts:latinLit:phi0448.phi001.perseus-lat2:1 (DBG, Latin edition, book 1)
urn:cts:latinLit:phi0448.phi001.perseus-lat2:1.1.1 (DBG, Latin edition, book 1, chapter 1, section 1)
How do I find out the CTS URN of a given author or text? The Perseus Catalog is your friend! (cf. e.g. http://catalog.perseus.org/catalog/urn:cts:latinLit:phi0448)
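To see the anatomy of a URN concretely, here is a minimal sketch that takes the most granular URN above apart with plain string operations (no library needed; the variable names are ours):
In [ ]:
# a CTS URN is colon-separated; the work component and the passage
# component are further subdivided by dots
urn = "urn:cts:latinLit:phi0448.phi001.perseus-lat2:1.1.1"
prefix, protocol, namespace, work_component, passage = urn.split(":")
text_group, work, version = work_component.split(".")
print(namespace)   # latinLit
print(text_group)  # phi0448 (Caesar)
print(work)        # phi001 (De Bello Gallico)
print(version)     # perseus-lat2 (the Latin edition)
print(passage)     # 1.1.1 (book 1, chapter 1, section 1)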
The URN of the Latin edition of Caesar's De Bello Gallico is urn:cts:latinLit:phi0448.phi001.perseus-lat2.
In [7]:
my_passage = "urn:cts:latinLit:phi0448.phi001.perseus-lat2"
With this information, we can query a CTS API and get some information about this text.
For example, we can "discover" its canonical text structure, an essential piece of information for citing this text.
In [8]:
# We set up a resolver which communicates with an API available in Leipzig
resolver = HttpCTSResolver(CTS("http://cts.dh.uni-leipzig.de/api/cts/"))
In [9]:
# We request some metadata information
textMetadata = resolver.getMetadata("urn:cts:latinLit:phi0448.phi001.perseus-lat2")
# Texts in CTS metadata have one interesting property: their citation scheme.
# Citations are embedded objects that carry information about how a text
# can be cited and how deep its citation scheme goes
print([citation.name for citation in textMetadata.citation])
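If you also want to list the passages available at a given level of the citation scheme (here, the books), the resolver has a getReffs method; a short sketch, assuming the MyCapytain resolver API:
In [ ]:
# retrieve the identifiers of the top-level citation units (the books)
book_references = resolver.getReffs("urn:cts:latinLit:phi0448.phi001.perseus-lat2", level=1)
print(book_references)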
But we can also query the same API and get back the text of a specific text section, for example the entire book 1.
To do so, we need to append the indication of the reference scope (i.e. book 1) to the URN.
In [10]:
my_passage = "urn:cts:latinLit:phi0448.phi001.perseus-lat2:1"
So we retrieve the first book of the De Bello Gallico by passing its CTS URN (which we just stored in the variable my_passage) to the CTS API, via the resolver provided by MyCapytain:
In [11]:
passage = resolver.getTextualNode(my_passage)
At this point the passage is available in various formats: text, but also TEI XML, etc.
Thus, we need to specify that we are interested in getting the text only:
In [12]:
de_bello_gallico_book1 = passage.export(Mimetypes.PLAINTEXT)
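For comparison, the same passage can also be exported as TEI XML; a sketch, assuming MyCapytain's Mimetypes.XML.TEI export returns a serialised string:
In [ ]:
# the same passage as TEI XML instead of plain text
de_bello_gallico_book1_tei = passage.export(Mimetypes.XML.TEI)
# peek at the first 500 characters only, the full document is long
print(de_bello_gallico_book1_tei[:500])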
Let's check that the text is there by printing the beginning of the content of the variable de_bello_gallico_book1 where we stored it:
In [13]:
# print only the first 500 characters, the whole book is rather long
print(de_bello_gallico_book1[:500])
The text that we have just fetched by using a programming interface (API) can also be viewed in the browser.
Or even imported as an iframe into this notebook!
In [14]:
from IPython.display import IFrame
IFrame('http://cts.dh.uni-leipzig.de/read/latinLit/phi0448/phi001/perseus-lat2/1', width=1000, height=350)
Out[14]:
Let's see how many words (more properly, tokens) there are in Caesar's De Bello Gallico I:
In [15]:
len(de_bello_gallico_book1.split(" "))
Out[15]:
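Note that split(" ") is a very naive tokeniser: punctuation stays glued to the preceding word, so e.g. "tres," and "tres" count as distinct tokens. A quick sketch counting the distinct tokens makes this tangible:
In [ ]:
# the same naive white-space tokenisation, this time counting
# distinct tokens; the count is case- and punctuation-sensitive
tokens = de_bello_gallico_book1.split(" ")
print(len(set(tokens)))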
Now let's write what in NLP jargon is called a baseline, that is, a simple method for extracting named entities that can serve as a point of comparison when evaluating the accuracy of other methods.
Baseline method: every token that starts with a capital letter (checked with the string method istitle()) is tagged as a named entity ("Entity"); every other token gets the tag "O" (for "Outside", i.e. not part of an entity).
In [16]:
"T".istitle()
Out[16]:
In [17]:
"t".istitle()
Out[17]:
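A few more istitle() cases that matter for this baseline: sentence-initial words are capitalised too (so they produce false positives), and all-caps tokens such as Roman numerals are missed:
In [ ]:
print("Gallia".istitle())   # True  -- a typical capitalised word
print("IV".istitle())       # False -- all-caps tokens are not titlecase
print("Belgae,".istitle())  # True  -- trailing punctuation does not matter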
In [18]:
# we need a list to store the tagged tokens
tagged_tokens = []
# tokenisation is done by using the string method `split(" ")`
# that splits a string upon white spaces
for n, token in enumerate(de_bello_gallico_book1.split(" ")):
    if token.istitle():
        tagged_tokens.append((token, "Entity"))
    else:
        tagged_tokens.append((token, "O"))
Let's have a look at the first 50 tokens that we just tagged:
In [19]:
tagged_tokens[:50]
Out[19]:
For convenience we can also wrap our baseline code into a function that we call extract_baseline. Let's define it:
In [ ]:
def extract_baseline(input_text):
    """
    :param input_text: the text to tag (string)
    :return: a list of tuples, where tuple[0] is the token and tuple[1] is the named entity tag
    """
    # we need a list to store the tagged tokens
    tagged_tokens = []
    # tokenisation is done by using the string method `split(" ")`
    # that splits a string upon white spaces
    for n, token in enumerate(input_text.split(" ")):
        if token.istitle():
            tagged_tokens.append((token, "Entity"))
        else:
            tagged_tokens.append((token, "O"))
    return tagged_tokens
And now we can call it like this:
In [ ]:
tagged_tokens_baseline = extract_baseline(de_bello_gallico_book1)
In [ ]:
tagged_tokens_baseline[-50:]
We can slightly modify our function so that it also prints the snippet of text where an entity is found:
In [ ]:
def extract_baseline(input_text):
    """
    :param input_text: the text to tag (string)
    :return: a list of tuples, where tuple[0] is the token and tuple[1] is the named entity tag
    """
    # we need a list to store the tagged tokens
    tagged_tokens = []
    # tokenisation is done by using the string method `split(" ")`
    # that splits a string upon white spaces
    tokens = input_text.split(" ")
    for n, token in enumerate(tokens):
        if token.istitle():
            tagged_tokens.append((token, "Entity"))
            # max(0, n - 5) prevents a negative start index from
            # wrapping around to the end of the token list
            context = tokens[max(0, n - 5):n + 5]
            print("Found entity \"%s\" in context \"%s\"" % (token, " ".join(context)))
        else:
            tagged_tokens.append((token, "O"))
    return tagged_tokens
In [ ]:
tagged_text_baseline = extract_baseline(de_bello_gallico_book1)
In [ ]:
tagged_text_baseline[:50]
The CLTK library has some basic support for the extraction of named entities from Latin and Greek texts (see CLTK's documentation).
The current implementation (as of version 0.1.47) uses a lookup-based method.
For each token in a text, the tagger checks whether that token is contained within a predefined list of possible named entities.
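To make the lookup idea concrete, here is a toy sketch of the principle; the mini-gazetteer below is hypothetical and far smaller than CLTK's actual list:
In [ ]:
# a toy illustration of lookup-based tagging: a token counts as an
# entity if, stripped of punctuation, it appears in a fixed gazetteer
known_entities = {"Caesar", "Gallia", "Helvetii", "Rhenus"}

def toy_lookup_tagger(tokens):
    return [(token, "Entity" if token.strip(".,;:") in known_entities else "O")
            for token in tokens]

print(toy_lookup_tagger("Gallia est omnis divisa in partes tres".split(" ")))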
Let's run CLTK's tagger (it takes a moment):
In [ ]:
%%time
tagged_text_cltk = tag_ner('latin', input_text=de_bello_gallico_book1)
Let's have a look at the output, only the first 10 tokens (by using the list slicing notation):
In [ ]:
tagged_text_cltk[:10]
The output looks slightly different from that of our baseline function (the size of the tuples in the list varies).
But we can write a function to fix this; let's call it reshape_cltk_output:
In [ ]:
def reshape_cltk_output(tagged_tokens):
    # normalise CLTK's output to a list of (token, tag) tuples
    reshaped_output = []
    for tagged_token in tagged_tokens:
        if len(tagged_token) == 1:
            # untagged tokens get the tag "O"
            reshaped_output.append((tagged_token[0], "O"))
        else:
            reshaped_output.append((tagged_token[0], tagged_token[1]))
    return reshaped_output
We apply this function to CLTK's output:
In [ ]:
tagged_text_cltk = reshape_cltk_output(tagged_text_cltk)
And the resulting output now looks OK:
In [ ]:
tagged_text_cltk[:20]
Now let's compare the two lists of tagged tokens by using Python's built-in zip function, which allows us to read multiple lists simultaneously:
In [ ]:
list(zip(tagged_text_baseline[:20], tagged_text_cltk[:20]))
But, as you can see, the two lists are not aligned.
This is due to how the CLTK function tokenises the text: the comma after "tres" becomes a token of its own, whereas when we tokenise by white space the comma stays attached to "tres" (i.e. "tres,").
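We can see the mismatch in miniature; here is a sketch using NLTK's wordpunct_tokenize as a stand-in for a punctuation-aware tokeniser (CLTK uses its own tokenisation internally):
In [ ]:
# white-space tokenisation keeps the comma glued to the word ...
print("in partes tres, quarum".split(" "))
# -> ['in', 'partes', 'tres,', 'quarum']
# ... while a punctuation-aware tokeniser splits it off
from nltk.tokenize import wordpunct_tokenize
print(wordpunct_tokenize("in partes tres, quarum"))
# -> ['in', 'partes', 'tres', ',', 'quarum']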
A solution to this is to pass the tag_ner function the text already tokenised by white space.
In [ ]:
tagged_text_cltk = reshape_cltk_output(tag_ner('latin', input_text=de_bello_gallico_book1.split(" ")))
In [ ]:
list(zip(tagged_text_baseline[:20], tagged_text_cltk[:20]))
As a third method, let's use the Stanford NER tagger through NLTK's StanfordNERTagger wrapper. There is no Stanford model trained for Latin, so here we point the tagger to a model trained for Italian:
In [ ]:
stanford_model_italian = "/opt/nlp/stanford-tools/stanford-ner-2015-12-09/classifiers/ner-ita-nogpe-noiob_gaz_wikipedia_sloppy.ser.gz"
In [ ]:
ner_tagger = StanfordNERTagger(stanford_model_italian)
In [ ]:
tagged_text_nltk = ner_tagger.tag(de_bello_gallico_book1.split(" "))
Let's have a look at the output:
In [ ]:
tagged_text_nltk[:20]
At this point we can "compare" the output of the three different methods we used, again by using the zip function.
In [ ]:
list(zip(tagged_text_baseline[:20], tagged_text_cltk[:20], tagged_text_nltk[:20]))
In [ ]:
for baseline_out, cltk_out, nltk_out in zip(tagged_text_baseline[:20], tagged_text_cltk[:20], tagged_text_nltk[:20]):
    print("Baseline: %s\nCLTK: %s\nNLTK: %s\n" % (baseline_out, cltk_out, nltk_out))
Exercise: extract the named entities from the English translation of the De Bello Gallico book 1.
The CTS URN for this translation is urn:cts:latinLit:phi0448.phi001.perseus-eng2:1.
Modify the code above to use the English model of the Stanford tagger instead of the Italian one.
Hint:
In [ ]:
stanford_model_english = "/opt/nlp/stanford-tools/stanford-ner-2015-12-09/classifiers/english.muc.7class.distsim.crf.ser.gz"
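For reference, here is one possible solution sketch that combines the pieces above (fetching the translation through the resolver, then tagging with the English model); the variable names are ours:
In [ ]:
# fetch the English translation of book 1 and export it as plain text
passage_eng = resolver.getTextualNode("urn:cts:latinLit:phi0448.phi001.perseus-eng2:1")
de_bello_gallico_book1_eng = passage_eng.export(Mimetypes.PLAINTEXT)
# tag it with the English Stanford NER model
ner_tagger_eng = StanfordNERTagger(stanford_model_english)
tagged_text_nltk_eng = ner_tagger_eng.tag(de_bello_gallico_book1_eng.split(" "))
print(tagged_text_nltk_eng[:20])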