import pickle
with open('data/DBG_tagged_baseline.pickle', 'wb') as f:
    pickle.dump(tagged_text_baseline, f)
with open('data/DBG_tagged_cltk.pickle', 'wb') as f:
    pickle.dump(tagged_text_cltk, f)
with open('data/DBG_tagged_nltk.pickle', 'wb') as f:
    pickle.dump(tagged_text_nltk, f)
In [2]:
import pickle
with open('data/DBG_tagged_baseline.pickle', 'rb') as f:
    tagged_text_baseline = pickle.load(f)
with open('data/DBG_tagged_cltk.pickle', 'rb') as f:
    tagged_text_cltk = pickle.load(f)
with open('data/DBG_tagged_nltk.pickle', 'rb') as f:
    tagged_text_nltk = pickle.load(f)
In [491]:
tagged_text_baseline[:10]
Out[491]:
In [492]:
for baseline_out, cltk_out, nltk_out in zip(tagged_text_baseline[:20]
                                            , tagged_text_cltk[:20]
                                            , tagged_text_nltk[:20]):
    print("Baseline: %s\nCLTK: %s\nNLTK: %s\n" % (baseline_out
                                                  , cltk_out
                                                  , nltk_out))
Of the three methods we used in the previous session to extract NEs from Caesar's De Bello Gallico, let's take the output of NLTK. In fact, it is the only one with more granular entity types, whereas the other two use just a generic Entity type.
In [480]:
tagged_text_nltk
Out[480]:
The first thing to do is to create a list of all named entity tags that were extracted by NLTK:
In [493]:
nltk_tags = []
for token, tag in tagged_text_nltk:
    nltk_tags.append(tag)
In [494]:
nltk_tags[:10]
Out[494]:
A more elegant – that is, more Pythonic – way of doing this is to use a list comprehension:
In [32]:
nltk_tags = [tag for token, tag in tagged_text_nltk]
In [495]:
nltk_tags[:10]
Out[495]:
Now we want to count how many times each entity type appears.
A typical way of doing this is to use a dictionary to store the counts.
Since the keys of a dictionary are unique, we leverage this property to keep track of whether a given entity type was already encountered as we go through all extracted tags.
The values in the dictionary are simply integers, which are increased by 1 every time a given type is found in the data.
In [497]:
# we initialize an empty dictionary
counts = {}
# we iterate through all NE tags
for tag in nltk_tags:
    # we check if our dictionary already contains an item
    # for that specific entity type
    if tag in counts:
        # if it does, we just increase the counter by 1
        counts[tag] += 1
    else:
        # otherwise we add it and set it to 1
        counts[tag] = 1
Let's look at the result:
In [24]:
counts
Out[24]:
Now that we have learned how to do this ourselves, it's important to know that the Python library collections already contains an object that does exactly this: Counter.
Let's look at its documentation:
In [29]:
Counter?
In [502]:
from collections import Counter
In [503]:
nltk_tag_counts = Counter(nltk_tags)
In [504]:
nltk_tag_counts
Out[504]:
As you can see, the output is identical to the one we had previously obtained!
Let's first filter NLTK's output and keep just the entities (i.e. the tokens identified as being NEs):
In [505]:
nltk_entities = []
for token, tag in tagged_text_nltk:
    if tag != "O":
        nltk_entities.append(token)
In [506]:
nltk_entities[:20]
Out[506]:
We can now use the Counter object that we just introduced to count the frequencies:
In [507]:
nltk_entity_counts = Counter(nltk_entities)
In [508]:
type(nltk_entity_counts)
Out[508]:
In [509]:
nltk_entity_counts
Out[509]:
In [510]:
nltk_entity_counts.most_common()
Out[510]:
In [486]:
nltk_entity_counts = dict(nltk_entity_counts)
In [487]:
type(nltk_entity_counts)
Out[487]:
In [398]:
sorted(nltk_entity_counts.items(), key=lambda x: x[1], reverse=True)
Out[398]:
As you will have noticed, one big limitation of counting entities this way is that entities consisting of more than one token are treated as separate entities.
What we'd need, instead, is a data format that allows for stating that two or more consecutive tokens are part of the same entity, as in the sketch below.
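To see concretely how such a format helps, here is a minimal sketch: the (token, tag) pairs below are made up for illustration, and follow the IOB scheme used later in this session, where a B- prefix marks the beginning of an entity and an I- prefix its continuation, so consecutive tokens can be grouped back into a single entity:

# hypothetical IOB-tagged (token, tag) pairs, made up for illustration
iob_tokens = [("Gaius", "B-PERSON"), ("Iulius", "I-PERSON"), ("Caesar", "I-PERSON"),
              ("in", "O"), ("Gallia", "B-LOCATION")]

entities = []
current = []
for token, tag in iob_tokens:
    if tag.startswith("B-"):
        # a new entity begins: store the previous one, if any
        if current:
            entities.append(" ".join(current))
        current = [token]
    elif tag.startswith("I-") and current:
        # continuation of the current entity
        current.append(token)
    else:
        # outside any entity: close the current one, if any
        if current:
            entities.append(" ".join(current))
        current = []
if current:
    entities.append(" ".join(current))

entities
# ['Gaius Iulius Caesar', 'Gallia']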
Another limitation is that we are actually counting surface forms of a given entity (cf. "Caesar", "Caesaris", "Caesari", etc.). To merge these, we'd need to disambiguate each entity by means of a unique identifier.
Two libraries that are very useful when dealing with data are pandas and seaborn.
Pandas is a data analysis library, while seaborn is used to visualise statistical data.
These libraries play nicely together and are often used in combination.
In [511]:
import pandas as pd
import seaborn as sns
In [512]:
%matplotlib inline
A key data structure in pandas is the DataFrame, a tabular data structure that allows for arithmetic operations on its contents.
The functionalities provided by pandas' dataframes are very similar to those offered by spreadsheet software.
First off, we initialise a dataframe representing the named entity counts that we have created previously. Remember?
In [409]:
nltk_entity_counts
Out[409]:
We use a utility function provided by the library to create a DataFrame starting from a dictionary.
In [513]:
df = pd.DataFrame.from_dict(dict(nltk_entity_counts), orient="index")
In [514]:
df
Out[514]:
Let's rename the column that contains the entity counts:
In [515]:
df.columns = ["count"]
In [516]:
df.info()
In [518]:
df.head(10)
Out[518]:
In [520]:
df.tail(10)
Out[520]:
We can now sort the entities based on their frequency by using the sort_values method of the dataframe:
In [521]:
df.sort_values(by="count", ascending=False)
Out[521]:
In [522]:
df
Out[522]:
NB: sort_values produces a sorted copy of the input dataframe; it does not modify it in place.
So, to obtain a sorted dataframe, we need to re-assign our variable:
In [523]:
df = df.sort_values(by="count", ascending=False)
In [524]:
df
Out[524]:
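As an aside, sort_values also accepts an inplace=True argument, which sorts the dataframe directly (and returns None), so the re-assignment above can be avoided:

df.sort_values(by="count", ascending=False, inplace=True)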
Seaborn is built on top of matplotlib, a very powerful Python library for plotting.
Seaborn provides a high-level layer on top of it, and makes some guesses about the nature of the data it receives as input.
In [440]:
from IPython.display import IFrame
IFrame('http://seaborn.pydata.org/examples/', width=900, height=4000)
Out[440]:
A quite handy characteristic is that you can pass a pandas.DataFrame directly to Seaborn.
Here we plot the top 10 entities extracted by NLTK:
In [525]:
df[:10]
Out[525]:
In [526]:
sns.barplot(x="count", y=df[:10].index, data=df[:10])
Out[526]:
Some things to note:
- x is the name of the dataframe's column containing the values for the x axis
- y are the labels for the y axis (in this case they come from the index)
- data is the pandas DataFrame we pass to the function
In [527]:
sns.barplot(x="count", y=df[:20].index, data=df[:20])
Out[527]:
Let's plot now, instead, the distribution of named entity types.
In [528]:
# we initialize an empty dictionary
counts = {}
# we iterate through all NE tags
for tag in nltk_tags:
# we check if our dictionary already contains an item
# for that specific entity type
if tag in counts:
# if it does, we just increase the counter of 1
counts[tag]+=1
else:
# otherwise we add it and set it to 1
counts[tag] = 1
In [529]:
df_types = pd.DataFrame.from_dict(counts, orient="index")
df_types.columns = ["count"]
df_types
Out[529]:
We can generate a basic pie chart by using the plot method of a dataframe:
In [530]:
df_types.plot(y="count", kind="pie")
Out[530]:
Which is equivalent to:
In [447]:
df_types.plot.pie(y="count")
Out[447]:
Now, this is not very useful/readable. Let's try to remove the O labels.
We just modify the code used above and add an if statement:
In [531]:
# we initialize an empty dictionary
counts = {}
# we iterate through all NE tags
for tag in nltk_tags:
    # do something only if `tag` is not "O"
    if tag != "O":
        # we check if our dictionary already contains an item
        # for that specific entity type
        if tag in counts:
            # if it does, we just increase the counter by 1
            counts[tag] += 1
        else:
            # otherwise we add it and set it to 1
            counts[tag] = 1
In [532]:
df_types = pd.DataFrame.from_dict(counts, orient="index")
df_types.columns = ["count"]
df_types
Out[532]:
In [533]:
df_types.plot.pie(y="count")
Out[533]:
To give a practical example of how to measure the accuracy of a NER system, we will use the dates extracted with regular expressions by Francesco in the first part of the session.
Let's read in the dates that were extracted previously:
In [534]:
import codecs
with codecs.open("data/iob/article_446_date_aut.iob","r","utf-8") as f:
data = f.read()
In [535]:
data
Out[535]:
We want to convert this tab-separated data into a list of lists, one per token:
In [536]:
iob_data_auto = [line.split("\t") for line in data.split("\n") if line!=""]
In [537]:
iob_data_auto
Out[537]:
Let's now do the same for our ground truth (i.e. the manually corrected data):
In [538]:
with codecs.open("data/iob/article_446_date_GOLD.iob","r","utf-8") as f:
data = f.read()
In [539]:
iob_data_gold = [line.split("\t") for line in data.split("\n") if line!=""]
In [540]:
list(zip(iob_data_gold[:50], iob_data_auto[:50]))
Out[540]:
In [541]:
auto_labels = [line[2] for line in iob_data_auto]
In [542]:
len(auto_labels)
Out[542]:
In [543]:
gold_labels = [line[2] for line in iob_data_gold]
In [544]:
len(gold_labels)
Out[544]:
In [545]:
auto_labels[:20]
Out[545]:
The very first step is to classify the automatically assigned labels into error types.
So we create a dictionary to store the counts for True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN). For example, if the gold label is B-DATE and the system also predicted B-DATE, that's a TP; if the system predicted O instead, that's a FN.
In [546]:
errors = {
    "tp": 0
    , "fp": 0
    , "tn": 0
    , "fn": 0
}
Remember the zip function to read two (or more) lists at a time? It's exactly what we need, so let's use it!
This is the kind of output it produces:
In [454]:
list(zip(gold_labels, auto_labels))[:10]
Out[454]:
In [456]:
assert len(gold_labels) == len(auto_labels)
In [547]:
# we iterate through the two lists of labels
for gold_label, auto_label in zip(gold_labels, auto_labels):
    # the gold label is negative => error type is TN or FP
    if gold_label == "O":
        if gold_label == auto_label:
            errors["tn"] += 1
        else:
            errors["fp"] += 1
    # the gold label is positive => error type is TP or FN
    else:
        if gold_label == auto_label:
            errors["tp"] += 1
        else:
            errors["fn"] += 1
In [548]:
errors
Out[548]:
Let's create one function to compute each of these measures.
Precision is the fraction of retrieved entities that are correct.
This measure takes into account the correctly identified entities (TPs) as well as those that were mistakenly tagged as entities (FPs), but does not consider the entities that were missed (FNs).
It's defined by the following formula:
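$$\text{Precision} = \frac{TP}{TP + FP}$$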
In [549]:
def calc_precision(d_errors):
    """
    Calculates the precision given the input error dictionary.
    """
    if d_errors["tp"] + d_errors["fp"] == 0:
        return 0
    else:
        return d_errors["tp"] / (d_errors["tp"] + d_errors["fp"])
In [557]:
precision = calc_precision(errors)
Recall is the fraction of correct entities that are retrieved by the system.
Recall does not consider the FPs but, instead, takes into account the FNs, i.e. the entities that were missed.
It's defined by the following formula:
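$$\text{Recall} = \frac{TP}{TP + FN}$$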
In [551]:
def calc_recall(d_errors):
    """
    Calculates the recall given the input error dictionary.
    """
    if d_errors["tp"] + d_errors["fn"] == 0:
        return 0
    else:
        return d_errors["tp"] / (d_errors["tp"] + d_errors["fn"])
In [558]:
recall = calc_recall(errors)
Finally, the F1 score is a global metric that combines both precision and recall giving them equal importance.
It's defined by the following formula:
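$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$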
In [553]:
def calc_fscore(d_errors):
    """
    Calculates the F-score given the input error dictionary.
    """
    prec = calc_precision(d_errors)
    rec = calc_recall(d_errors)
    if prec == 0 and rec == 0:
        return 0
    else:
        return 2 * ((prec * rec) / (prec + rec))
In [559]:
fscore = calc_fscore(errors)
In [560]:
print("Precision: {0:.2f}".format(precision))
print("Recall: {0:.2f}".format(recall))
print("F1-score: {0:.2f}".format(fscore))
In [477]:
def calc_accuracy(d_errors):
    """
    Calculates the accuracy given the input error dictionary.
    """
    true_predictions = d_errors["tp"] + d_errors["tn"]
    false_predictions = d_errors["fp"] + d_errors["fn"]
    return true_predictions / (true_predictions + false_predictions)
In [478]:
calc_accuracy(errors)
Out[478]:
In [479]:
print("Precision: {0:.2f}".format(precision))
print("Recall: {0:.2f}".format(recall))
print("F1-score: {0:.2f}".format(fscore))
The good news is that you don't really need to implement all these measures yourself, as there are libraries that can compute precision, recall and F-score for you.
Still, it's important to know what those scores mean and how they are obtained!
In [561]:
from sklearn.metrics import precision_recall_fscore_support
In [562]:
precision, recall, fscore, support = precision_recall_fscore_support(gold_labels
                                                                     , auto_labels
                                                                     , average="micro"
                                                                     , labels=["B-DATE", "I-DATE"])
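Note that average="micro" computes the scores globally, by summing the TPs, FPs and FNs over the selected labels (here B-DATE and I-DATE), rather than averaging per-label scores.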
In [563]:
print("Precision: {0:.2f}".format(precision))
print("Recall: {0:.2f}".format(recall))
print("F1-score: {0:.2f}".format(fscore))