In our previous joint session, we introduced some fundamental notions of the Python language. Let's review some of them!
In [39]:
from idai_journals import nlp as dainlp
import re
from treetagger import TreeTagger
In [40]:
from nltk.tag import StanfordNERTagger
from nltk.chunk.util import tree2conlltags
from nltk.chunk import RegexpParser
from nltk.tree import Tree
In [ ]:
#integers and floats
3 + 0.5
#strings
"hello"
#Booleans
True
In [41]:
#lists (can also contain multiple different data types)
li = ["Leipzig", "London", "Berlin", "Boston", 4, False]
#tuples (like lists, but immutable)
tu = ("tuple", "list", "dictionary")
#dictionaries (key : value pairs)
di = {"key" : "value", "other-key" : "second value"}
In [42]:
#home assignment: try to figure out what the if statement (line 2) does
for l in li:
    if isinstance(l, str):
        print(l)
In [43]:
def printMe(message):
    print(message)

printMe("Hello, world!")
printMe("goodbye...")
In [44]:
l = ["zero", "one", "two", "three"]
l[10]
In [45]:
try:
    l[10]
except IndexError:
    print("hey, your index is way too high!")
Objects might be a bit complicated, but they're very important if you want to understand code written by other people, since most of the programs you'll find around are written using classes and objects. Oh, and the good news is... you've already met them!
What are "objects" in a programming language like Python? Well, I like to think about them as... magical, animated tools!
Say that you want to fetch water from a well (and maybe clean some of the mess...). Well, the object-oriented approach to this task consists in creating one or more magic brooms that go and fetch the water for you! In order to create them, you have to conceptualize the broom in terms of its features (its name, how fast it is, how many buckets it carries) and the actions it can perform (greeting you, fetching water).
That's it! In programming parlance, the features are called properties of the object; the actions are called methods.
When you want to build your own magic brooms you first create a sort of prototype for each of them (which is called the class of magic brooms); then you can go on and create as many brooms as you want...
Here's how to do it! (very simplified)
In [46]:
class MagicBroom():
    #this is called the "constructor"; it's a special method
    def __init__(self, name, speed=20):
        self.name = name
        self.buckets = 2
        self.speed = speed

    def greet(self):
        print("Hello, my name is %s! What can I do for you?" % self.name)

    def fetchWater(self):
        if self.speed >= 20:
            print("Yes, sir! I'll be back in a sec!")
        else:
            print("Alright, but I am taking my time!")
In [48]:
mickey = MagicBroom("Mickey")
mickey.greet()
In [49]:
peter = MagicBroom("Peter", speed=5)
In [50]:
mickey.speed
Out[50]:
In [51]:
mickey.fetchWater()
In [52]:
peter.fetchWater()
How would you find all the numbers in this sentence?
The set of integers consists of zero (0), the positive natural numbers (1, 2, 3, …), also called whole numbers or counting numbers,[1][2] and their additive inverses (the negative integers, i.e., −1, −2, −3, …). This is often denoted by a boldface Z ("Z") or blackboard bold Z {\displaystyle \mathbb {Z} } \mathbb {Z} (Unicode U+2124 ℤ) standing for the German word Zahlen ([ˈtsaːlən], "numbers").[3][4] ℤ is a subset of the sets of rational and real numbers and, like the natural numbers, is countably infinite.
In [53]:
wiki = 'The set of integers consists of zero (0), the positive natural numbers (1, 2, 3, …), also called whole numbers or counting numbers,[1][2] and their additive inverses (the negative integers, i.e., −1, −2, −3, …). This is often denoted by a boldface Z ("Z") or blackboard bold Z {\displaystyle \mathbb {Z} } \mathbb {Z} (Unicode U+2124 ℤ) standing for the German word Zahlen ([ˈtsaːlən], "numbers").[3][4] ℤ is a subset of the sets of rational and real numbers and, like the natural numbers, is countably infinite.'
We'd need a way to tell our machine not to look for specific strings, but rather for classes of strings, i.e. to use some sort of meta-character to catch a whole group of signs (e.g. the numbers); then we'd need to tell it to optionally include/exclude some other signs, or to catch the numbers only if they're not preceded/followed by other signs...
That's precisely what Regular Expressions do! They allow you to express a query as a string of metacharacters (or groups of metacharacters).
How do we use them in Python? First, we need to import a module from the Standard Library (i.e. it already ships with Python: no need to install external libraries).
In [55]:
import re
A cool feature of RegExp in Python is that you can create your complicated patterns as objects (and assign them to variables)! That's right, RegExp patterns are your magic brooms...
In [56]:
#here is one to catch all numbers
reg = re.compile(r'[0-9]+') #or: r'\d+'
type(reg)
Out[56]:
The Pattern object has a number of useful methods for searching and replacing the pattern. Generally, you call them with the text to be searched as an argument. For instance, findall returns all matches as a list.
In [57]:
reg.findall(wiki)
Out[57]:
Kind of a sloppy job we did! The negative numbers are not captured as negative; the footnote references (e.g. [1], [4]) are also captured and we don't want them... We can do better. Let's improve our pattern so that we include the '−' sign (if present) and get rid of the footnotes.
In [58]:
reg = re.compile(r'(?<!\[)−?\d+(?!\])') # the 'r' is there to make sure that we don't have to "escape the escape" sign (\)
reg.findall(wiki)
Out[58]:
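While we're at it, findall is not the only handy method of the compiled Pattern: here's a quick sketch (not part of the original notebook) of search and sub on the same string:

m = reg.search(wiki)   # search returns a Match object for the first occurrence (or None)
print(m.group(), m.start(), m.end())

print(reg.sub("<NUM>", wiki)[:120])   # sub replaces every match; here we mask all the numbers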
Now it's time to go back to our (Named) Entity recognition and extraction task. We're going to use RegExp patterns and syntax quite a few times from now on...
As Matteo said last time, the concept of "named entity" is domain- and task-specific. While a person's or a place's name will more or less always fall under the definition, in some information-extraction contexts people might be interested in other kinds of real-life "entities", such as time references (months, days, dates) or museum objects, which are not relevant in others.
In this exercise, we are going to expand on what Matteo did last time with proper names in Latin and look at two specific classes of "entities" mentioned in a modern scientific text about ancient history: dates and persons.
First, let's grab a text.
We will be working with an English article on Roman history. The article is: Frederik Juliaan Vervaet, The Praetorian Proconsuls of the Roman Republic (211–52 BCE). A Constitutional Survey, Chiron 42(2012): 45-96.
Let's start by loading the text and inspecting the first 1,000 characters (we'll be working with just the first 10k words).
In [59]:
with open("data/txt/article446_10k.txt") as f:
txt = f.read()
In [60]:
txt[:1000]
Out[60]:
Most of the time, POS tagging is a precondition for any other advanced operation on a text.
As we did with Matteo last time, by "tagging" we mean coupling each word with a tag that describes some property of the word itself. Part-of-speech tags define which word class (e.g. "verb" or "proper noun") a text token belongs to.
There are several tagsets in use for each language, and several programs (POS taggers) that can tag your text automatically. One of the most widely used is TreeTagger, which has pretrained classifiers for many languages.
Let's run it from Python, using one of the few "wrappers" available
In [61]:
#first we load the library
from treetagger import TreeTagger
In [62]:
#That's right! we start by creating a Tagger "magic broom" (a Tagger object)
tt = TreeTagger(language="english")
#then we tag our text
tagged = tt.tag(txt)
In [63]:
tagged[:20]
Out[63]:
Named Entity Recognition (using a tool like the Stanford NER that we saw in our last lecture) is also a way of tagging the text, this time using information not on the word class but on a different level of classification (place, person, organization or none of the above).
Let's do this too
In [64]:
#first, we define the path to the English classifier for Stanford NER
english_classifier = 'english.all.3class.distsim.crf.ser.gz'
twords = [w[0] for w in tagged]
In [65]:
#then... guess what? Yes, we create a NER-tagger Magic Broom ;-)
from nltk.tag import StanfordNERTagger
ner_tagger = StanfordNERTagger(english_classifier)
ners = ner_tagger.tag(twords)
In [66]:
#not very pretty...
ners[:20]
Out[66]:
As we saw, when we analyze a text we proceed word by word (more exactly: token by token). However, Named Entities (now including dates) often span more than one token. The task of subdividing a section of text into phrases and/or meaningful constituents (each of which may include one or more text tokens) is called chunking.
Take the sentence "We saw the yellow dog": the tokens are [We, saw, the, yellow, dog]. Two Noun Phrases (NP) can be chunked: "We" and "the yellow dog".
The IOB notation that Matteo introduced last time is a popular way to store the information about chunks in a word-by-word format. In the case of "the yellow dog", we will have: the B-NP, yellow I-NP, dog I-NP (tokens outside any chunk, like "saw", get O).
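To make this concrete, here's a minimal sketch (not from the original notebook) that chunks that example sentence and prints its IOB triples; it assumes NLTK's standard tokenizer and POS-tagger data are installed (the rule syntax is explained right below):

from nltk import pos_tag, word_tokenize
from nltk.chunk import RegexpParser
from nltk.chunk.util import tree2conlltags

#POS-tag the toy sentence, chunk the noun phrases, then flatten to IOB triples
sent = pos_tag(word_tokenize("We saw the yellow dog"))
np_chunker = RegexpParser(r'''
NP:
{<DT>?<JJ>*<NN.*>+}
{<PRP>}
''')
print(tree2conlltags(np_chunker.parse(sent)))
# expected (tags may vary slightly with the tagger version):
# [('We', 'PRP', 'B-NP'), ('saw', 'VBD', 'O'), ('the', 'DT', 'B-NP'), ('yellow', 'JJ', 'I-NP'), ('dog', 'NN', 'I-NP')]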
The easiest method for chunking a sentence in Python is to use the information in the tags together with a regexp-like syntax.
For example, if we have:
in O
New LOCATION
York LOCATION
City LOCATION
We easily see that the 3 tokens tagged as LOCATION go together. We may thus write a grammar rule that chunks the LOC together:
LOC:
{<LOCATION><LOCATION>*}
This means: group into a chunk named LOC every token tagged as LOCATION, including any token tagged as LOCATION that might optionally come after it. The same goes for PERSONs and ORGANIZATIONs. We may even use RegExp syntax to be more tolerant and make room for annotation errors, in case e.g. the two tokens George Washington are wrongly tagged as PERSON and LOCATION.
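Here's a tiny sketch of that simple LOC rule in action on the toy example above (not a cell from the original notebook):

from nltk.chunk import RegexpParser

#a hand-made list of (token, NER-tag) pairs, as the Stanford tagger would return them
mini = [("in", "O"), ("New", "LOCATION"), ("York", "LOCATION"), ("City", "LOCATION")]
loc_chunker = RegexpParser(r'''
LOC:
{<LOCATION><LOCATION>*}
''')
print(loc_chunker.parse(mini))
# (S in/O (LOC New/LOCATION York/LOCATION City/LOCATION))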
Here's how I'd do it (it's not perfect at all but it should work in most cases)...
In [68]:
from nltk.chunk import RegexpParser
english_chunker = RegexpParser(r'''
LOC:
{<LOCATION><(PERSON|LOCATION|MISC|ORGANIZATION)>*}
''')
Let's see it in action with the first few words
In [69]:
tree = english_chunker.parse(ners[:20])
print(tree)
Well... OK, "Roman Republic" is not a location, but at least the chunking is exactly what we wanted to have, right?
OK, but now how do we convert this to the IOB notation?
Luckily, there's a ready-made function in a module from the NLTK library! Let's load and use it
(just in case, there is also a function that does the reverse: from IOB to tree)
In [70]:
from nltk.chunk.util import tree2conlltags
In [72]:
iobs = tree2conlltags(tree)
In [73]:
iobs
Out[73]:
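And, just to show the reverse direction mentioned above, conlltags2tree turns an IOB list back into a tree; a quick sketch using the iobs list we just produced (not a cell from the original notebook):

from nltk.chunk.util import conlltags2tree

rebuilt = conlltags2tree(iobs)   # round trip: this should match the chunker output above
print(rebuilt)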
Now, to go back to our original task, how do we use all this to annotate the dates and export them to IOB?
Dates are often just numbers (e.g. "2017"); sometimes they come in more complex formats like: "14 September 2017" or "14-09-2017".
One very simple solution to find them and annotate them with a chunking notation might be to tag the tokens of our text with a very simple custom tagset that we design for dates. We assign "O" to all tokens, except the numbers (which we tag "CD") and some selected time formats or expressions, like the months of the year or the number-number sequence. We use the tag "Date" for the latter.
In order to do this, we need: a list of RegExp patterns, each paired with the tag to assign, and a tagger that applies them to every token. A module of NLTK provides exactly such a tagger, which works with RegExp syntax.
In [74]:
from nltk.tag import RegexpTagger
In [87]:
#here is our list of patterns; RegexpTagger tries them in order and the first match wins
patterns = [
    (r'\d+$', 'CD'),        # plain numbers
    (r'\d+[-–]\d+$', "Date"),   # ranges such as 211–52
    (r'\d{1,2}[-\.\/]\d{1,2}[-\.\/]\d{2,4}', "Date"),   # e.g. 14-09-2017
    (r'January|February|March|April|May|June|July|August|September|October|November|December', "Date"),
    (r'\d{4}$', "Date"),    # four-digit years (already caught by the first rule, so in practice they get CD)
    (r'BCE|BC|AD', "Date"),
    (r'.*', "O")            # everything else
]
In [88]:
#Our RegexpTagger magic broom! We initialize it with our pattern list
tagger = RegexpTagger(patterns)
In [77]:
#let's test it with a trivial example
tagger.tag("I was born on September 14 , or 14-09".split(" "))
Out[77]:
Now let's see it in action on the real stuff
In [89]:
reg_tag = tagger.tag(twords)
In [90]:
reg_tag[:50]
Out[90]:
Now we just need to chunk it and export it to IOB. Then we are ready to evaluate it against the manual annotation...
First, we have to define a chunker
In [91]:
date_chunker = RegexpParser(r'''
DATE:
{<CD>*<Date><Date|CD>*}
DATE:
{<CD>+}
''')
In [92]:
t = date_chunker.parse(reg_tag)
#we use that function to make sure that the tree is not too complex to be converted
flat = dainlp.flatten_tree(t)
In [93]:
iob_list = tree2conlltags(flat)
In [94]:
iob_list[:50]
Out[94]:
In [95]:
#then we can write it to an output file
with open("data/iob/article_446_date_aut.iob", "w") as out:
    for i in iob_list:
        out.write("\t".join(i) + "\n")
In the practical exercise, you are asked to extract the person names from the same article we used for dates. You will annotate them using the Stanford NER with the pre-trained classifier for English that comes with the software, extract the Person chunks, and evaluate the results against a gold standard.
Here is a summary of the steps that you will have to execute in order to solve the exercise (a rough sketch of these steps is given after the next cell):
1. open the file data/txt/article446_10k.txt and read its content
2. tokenize/tag the text and run the Stanford NER tagger on the tokens
3. chunk the PERSON entities and convert the resulting tree to IOB
4. evaluate your output, using data/iob/article_446_person_GOLD.iob as gold standard
In [ ]:
#just remember that the path to the English pre-trained classifier for Stanford NER is
english_classifier = 'english.all.3class.distsim.crf.ser.gz'
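If it helps to see the overall shape of the exercise, here is a rough sketch of those steps (just a sketch, not the official solution: it reuses the imports from the top of the notebook and the flatten_tree helper from idai_journals, and the chunk grammar/label below are placeholders you will have to adapt):

# 1. read the text
with open("data/txt/article446_10k.txt") as f:
    txt = f.read()

# 2. tokenize/tag with TreeTagger and keep only the tokens
tt = TreeTagger(language="english")
twords = [w[0] for w in tt.tag(txt)]

# 3. NER-tag the tokens with the English classifier
ner_tagger = StanfordNERTagger(english_classifier)
ners = ner_tagger.tag(twords)

# 4. chunk the PERSON tokens (check the GOLD file for the chunk label it expects!)
person_chunker = RegexpParser(r'''
PERS:
{<PERSON><PERSON>*}
''')
iobs = tree2conlltags(dainlp.flatten_tree(person_chunker.parse(ners)))

# 5. write the IOB lines to a file and evaluate them against data/iob/article_446_person_GOLD.iob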
For the evaluation of the accuracy of your classifier, you can adapt the following lines of code:
from sklearn.metrics import precision_recall_fscore_support

precision, recall, fscore, support = precision_recall_fscore_support(
    gold_labels,
    auto_labels,
    average="micro",
    labels=["B-DATE", "I-DATE"])
print("Precision: {0:.2f}".format(precision))
print("Recall: {0:.2f}".format(recall))
print("F1-score: {0:.2f}".format(fscore))
Things you'll need to change/provide:
- labels: the list of labels you want to evaluate (here, the person labels instead of B-DATE/I-DATE)
- gold_labels: a list with the correct labels
- auto_labels: a similar list with the labels output by your classifier.
NB: make sure that gold_labels and auto_labels are of the same length, i.e. that the labels at position n in both lists refer to the same token.
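As a final hint, here is one way to build those two lists (a sketch, assuming both IOB files use the tab-separated token/tag/label lines we wrote above; the name of the automatic file is just a placeholder for your own output):

def read_labels(path):
    #return the IOB label (third column) of every non-empty line
    with open(path) as f:
        return [line.rstrip("\n").split("\t")[2] for line in f if line.strip()]

gold_labels = read_labels("data/iob/article_446_person_GOLD.iob")
auto_labels = read_labels("data/iob/article_446_person_aut.iob")   # placeholder name

#both lists must be aligned token by token
assert len(gold_labels) == len(auto_labels)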