In this second part of Assignment 4, you will read a type of XML and convert it to a type of CSV.
NewsReader needs your help! Within the project, an English NLP pipeline has been developed, and the team would like to know how well it performs. However, in order to run the scorer, they must convert their output format, NAF, to CoNLL. In this assignment, you will write the converter!
The NLP task they've chosen is Entity Linking. The goal of Entity Linking is to link an expression to the identity of an entity. For example, in the sentence Ford makes cars, the goal of Entity Linking would be to link the expression Ford to the Wikipedia page of Ford Motor Company. This is a challenging task, since Ford has many meanings; for example, it can also mean the actor Harrison Ford.
The output from the NewsReader pipeline is not a plain text file, nor CSV/TSV; it is a type of XML. Instead of going through a file line by line, you search for specific elements, or for attributes of elements.
In [ ]:
from lxml import etree
Please observe the following element. Try to understand which elements are children/parents of which other elements.
<entity id="e3" type="PERSON">
<references>
<!--Craig Wood-->
<span>
<target id="t42" />
<target id="t43" />
</span>
</references>
<externalReferences>
<externalRef resource="spotlight_v1" reference="http://dbpedia.org/resource/Craig_Wood_(golfer)" confidence="1.0" reftype="en" />
<externalRef resource="spotlight_v1" reference="http://dbpedia.org/resource/Craig_Wood_(NASCAR)" confidence="9.0210754E-36" reftype="en" />
<externalRef resource="spotlight_v1" reference="http://dbpedia.org/resource/Craig_Wood_(film_editor)" confidence="1.739528E-36" reftype="en" />
</externalReferences>
</entity>
Above, you see an example of the NewsReader output in NAF.
- The entity element is the main element.
- The entity element contains information about the entity's id and its type (attributes id and type, respectively).
- The references element provides us with the information that the entity is 'Craig Wood', and that the term 'Craig' is the 42nd term in the document and 'Wood' the 43rd.
- The externalReferences element shows the output from the system 'spotlight_v1', which tries to link the entity 'Craig Wood' to DBpedia (structured Wikipedia). The system has a confidence of 1.0 (the highest possible value) that the entity refers to http://dbpedia.org/resource/Craig_Wood_(golfer).
Our goal is to extract the following information from this element:
- the type of the entity element
- the DBpedia link with the highest confidence (from externalReferences/externalRef)
We want to convert this entity element to a format called CoNLL. Using the entity element as input, it should output the following:
...
41 from _ _
42 Craig (PERSON http://dbpedia.org/resource/Craig_Wood_(golfer)
43 Wood PERSON) http://dbpedia.org/resource/Craig_Wood_(golfer)
44 's _ _
...
Note that it also includes the tokens that are not annotated as an entity.
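Before diving in, it may help to see how lxml navigates an element like the one above. The following sketch (using a shortened copy of the example entity, just for illustration) shows the two operations you will need throughout: accessing an attribute with .get() and finding child elements with .findall():

```python
from lxml import etree

entity_xml = '''
<entity id="e3" type="PERSON">
  <references>
    <span>
      <target id="t42" />
      <target id="t43" />
    </span>
  </references>
</entity>'''

entity_el = etree.fromstring(entity_xml)

# attribute values are accessed with .get() and are always strings
print(entity_el.get('type'))            # PERSON

# child elements are found with a path relative to the current element
for target_el in entity_el.findall('references/span/target'):
    print(target_el.get('id'))          # t42, then t43
```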
The goal of this assignment is to complete the code snippet below, which means that you will convert one NAF file to CoNLL. The assignment can be roughly divided into the following steps:
- complete the helper functions used in the code snippet
- run the code snippet to write the CoNLL file
- move your code into two Python files (converter.py and utils.py)
This is the code snippet that you will have to complete and run:
In [ ]:
doc = etree.parse('../Data/xml_data/naf.xml')

t_id2info = dict()
entity_els = doc.findall('entities/entity')
for entity_el in entity_els:

    # determine entity type (default is _)
    entity_type = type_of_entity(entity_el)

    # extract dbpedia link with highest confidence (default is _)
    chosen_dbpedia_link = dbpedia_link_with_highest_confidence(entity_el)

    # determine the position of t_ids in the entity
    t_ids_positions = t_ids_with_position(entity_el)

    # loop over t_ids and their positions
    for t_id, position in t_ids_positions:

        # get position of t_id
        # HINT: use the indicate_position_of_tid function
        entity_type_with_position = indicate_position_of_tid(entity_type, position)

        # update dictionary
        t_id2info[t_id] = {'entity_type_with_position': entity_type_with_position,
                           'dbpedia_link': chosen_dbpedia_link}
        print(t_id, t_id2info[t_id])

    input('continue?')  # only here for debugging
    # the input here allows you to inspect the output one entity element at a time

# load the mapping of term identifier to lemmas
tid2lemma = load_mapping_tid2token(doc)

# use the information from t_id2info and tid2lemma to create the conll
# T_ID TAB LEMMA TAB ENTITY_TYPE_WITH_POSITION TAB CHOSEN_DBPEDIA_LINK NEWLINE
# HINT: if a t_id does not have annotation, both ENTITY_TYPE_WITH_POSITION and CHOSEN_DBPEDIA_LINK are '_'
with open('../Data/xml_data/naf.conll', 'w') as outfile:
    for t_id, lemma in sorted(tid2lemma.items()):
        # your code here
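To illustrate the writing step at the end of the snippet, here is a minimal sketch with made-up stand-in data (the real t_id2info and tid2lemma come from the helper functions you will write; the lemmas and links below are only for illustration). Term ids without annotation fall back to '_':

```python
# stand-in data, mimicking the structures built in the snippet above
tid2lemma = {42: 'Craig', 43: 'Wood', 44: "'s"}
t_id2info = {42: {'entity_type_with_position': '(PERSON',
                  'dbpedia_link': 'http://dbpedia.org/resource/Craig_Wood_(golfer)'},
             43: {'entity_type_with_position': 'PERSON)',
                  'dbpedia_link': 'http://dbpedia.org/resource/Craig_Wood_(golfer)'}}

lines = []
for t_id, lemma in sorted(tid2lemma.items()):
    # t_ids without an entity annotation get '_' for both columns
    info = t_id2info.get(t_id, {'entity_type_with_position': '_',
                                'dbpedia_link': '_'})
    lines.append(f"{t_id}\t{lemma}\t{info['entity_type_with_position']}\t{info['dbpedia_link']}")

print('\n'.join(lines))
```

In the real snippet you would write each line (plus a newline) to outfile instead of collecting them in a list.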
We will first load one entity element as XML in order to help us develop our program that will run on many entity elements.
In [ ]:
#load the element as XML element.
entity = '''
<entity id="e3" type="PERSON">
<references>
<!--Craig Wood-->
<span>
<target id="t42" />
<target id="t43" />
</span>
</references>
<externalReferences>
<externalRef resource="spotlight_v1" reference="http://dbpedia.org/resource/Craig_Wood_(golfer)" confidence="1.0" reftype="en" />
<externalRef resource="spotlight_v1" reference="http://dbpedia.org/resource/Craig_Wood_(NASCAR)" confidence="9.0210754E-36" reftype="en" />
<externalRef resource="spotlight_v1" reference="http://dbpedia.org/resource/Craig_Wood_(film_editor)" confidence="1.739528E-36" reftype="en" />
</externalReferences>
</entity>'''
entity_el = etree.fromstring(entity)
Create a function type_of_entity() that takes one parameter: entity_el (positional parameter). It should return the entity_type of the entity element (access the value of the attribute type). If the value is an empty string, or the attribute does not exist, return the string '_'.
In [ ]:
def type_of_entity(entity_el):
    '''
    given an entity element, return the entity type
    '''
    # your code here

entity_type = type_of_entity(entity_el)
print(entity_type)
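One possible way to complete this function (a sketch, not the only valid solution) relies on the fact that .get() returns None for a missing attribute, so a single truthiness check covers both required cases:

```python
from lxml import etree

def type_of_entity(entity_el):
    '''given an entity element, return the entity type, or '_' if missing/empty'''
    entity_type = entity_el.get('type')
    if not entity_type:        # covers both None (no attribute) and ''
        entity_type = '_'
    return entity_type

# quick check on a minimal element
entity_el = etree.fromstring('<entity id="e3" type="PERSON"/>')
print(type_of_entity(entity_el))   # PERSON
```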
Create a function dbpedia_link_with_highest_confidence() that takes one parameter: entity_el (positional parameter). It should return the DBpedia link with the highest confidence, or return the string '_' if there are no DBpedia links in the externalReferences element.
To do this, create a list of tuples with dbpedia links with their corresponding confidences:
[(1.0, 'http://dbpedia.org/resource/Craig_Wood_(golfer)'),
(9.0210754E-36, 'http://dbpedia.org/resource/Craig_Wood_(NASCAR)'),
(1.739528E-36, 'http://dbpedia.org/resource/Craig_Wood_(film_editor)')]
HINT: do not forget to change the confidence to float (it's now a string).
In [ ]:
def dbpedia_link_with_highest_confidence(entity_el):
    '''
    given an entity element, return the dbpedia link with the highest confidence
    '''
    # your code here

result = dbpedia_link_with_highest_confidence(entity_el)
print(result)
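A sketch of one possible implementation, following the tuple-list approach described above (max() on tuples compares the first item, i.e. the confidence):

```python
from lxml import etree

def dbpedia_link_with_highest_confidence(entity_el):
    '''given an entity element, return the dbpedia link with the highest confidence'''
    confidences_and_links = []
    for ref_el in entity_el.findall('externalReferences/externalRef'):
        # attribute values are strings, so convert the confidence to float
        confidence = float(ref_el.get('confidence'))
        confidences_and_links.append((confidence, ref_el.get('reference')))
    if not confidences_and_links:
        return '_'
    # max() compares tuples by their first item: the confidence
    return max(confidences_and_links)[1]

# quick check on a minimal element
entity_el = etree.fromstring('''
<entity id="e3" type="PERSON">
  <externalReferences>
    <externalRef resource="spotlight_v1" reference="http://dbpedia.org/resource/Craig_Wood_(golfer)" confidence="1.0" reftype="en" />
    <externalRef resource="spotlight_v1" reference="http://dbpedia.org/resource/Craig_Wood_(NASCAR)" confidence="9.0210754E-36" reftype="en" />
  </externalReferences>
</entity>''')
print(dbpedia_link_with_highest_confidence(entity_el))
```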
Create a function called t_ids_with_position() that takes one parameter: entity_el (positional parameter). It should loop over the references/span/target elements and return a list of tuples (term id, position_in_entity).
Possible values for position_in_entity are: "start", "middle", "end", "start_and_end".
Example of output for the example entity_el:
[(42, 'start'), (43, 'end')]
HINT: return an empty list if there are no target elements.
In [ ]:
def t_ids_with_position(entity_el):
    '''
    given an entity element, return the position of each term id in that entity
    '''
    term_positions = []

    # find all 'references/span/target' elements and determine the number of children
    target_els = entity_el.findall('references/span/target')
    len_target_els = len(target_els)

    # if there is only one element, the position is 'start_and_end'
    if len_target_els == 1:
        # your code here
    # if there are 0 children, or two or more children, loop over the target elements
    else:
        # HINT: use enumerate
        # HINT: use 'len_target_els' to check if it's the last element
        # your code here

    return term_positions

t_ids_with_position(entity_el)
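A sketch of one way to fill in the two branches. Note that the example output uses integers (42, not 't42'), so this sketch assumes ids always start with a single 't' and strips it before converting:

```python
from lxml import etree

def t_ids_with_position(entity_el):
    '''given an entity element, return (term id, position) tuples'''
    term_positions = []
    target_els = entity_el.findall('references/span/target')
    len_target_els = len(target_els)

    if len_target_els == 1:
        # a single target is both the start and the end of the entity
        t_id = int(target_els[0].get('id')[1:])   # 't42' -> 42 (assumes 't' prefix)
        term_positions.append((t_id, 'start_and_end'))
    else:
        # zero targets: the loop body never runs and [] is returned
        for index, target_el in enumerate(target_els):
            t_id = int(target_el.get('id')[1:])
            if index == 0:
                position = 'start'
            elif index == len_target_els - 1:
                position = 'end'
            else:
                position = 'middle'
            term_positions.append((t_id, position))

    return term_positions

# quick check on a two-target element
entity_el = etree.fromstring(
    '<entity id="e3"><references><span>'
    '<target id="t42"/><target id="t43"/>'
    '</span></references></entity>')
print(t_ids_with_position(entity_el))   # [(42, 'start'), (43, 'end')]
```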
Create a function indicate_position_of_tid() that returns the entity_type with parentheses that indicate the position of a term id.
For example, if the entity type is 'ORG', the function should work in the following way:
- position = 'start': '(ORG'
- position = 'middle': 'ORG'
- position = 'end': 'ORG)'
- position = 'start_and_end': '(ORG)'
If there is no entity type (entity_type == '_'), return the string '_'.
In [ ]:
#### return entity type with parentheses indicating position
def indicate_position_of_tid(entity_type, position):
    '''
    this function returns the entity_type with the position.
    for example, if the entity type is 'ORG',
    the function should work in the following way
    -position = 'start': '(ORG'
    -position = 'middle': 'ORG'
    -position = 'end': 'ORG)'
    -position = 'start_and_end': '(ORG)'
    if entity_type == '_':
        return '_'
    '''
    # your code here
    return result
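This one is pure string manipulation; a sketch of one possible implementation:

```python
def indicate_position_of_tid(entity_type, position):
    '''return the entity_type with parentheses indicating the position'''
    if entity_type == '_':
        return '_'
    if position == 'start':
        result = '(' + entity_type
    elif position == 'end':
        result = entity_type + ')'
    elif position == 'start_and_end':
        result = '(' + entity_type + ')'
    else:  # 'middle'
        result = entity_type
    return result

print(indicate_position_of_tid('ORG', 'start'))           # (ORG
print(indicate_position_of_tid('ORG', 'start_and_end'))   # (ORG)
```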
Create a function load_mapping_tid2token() that takes one parameter: doc (positional parameter), which represents a loaded XML file of type lxml.etree._ElementTree. It should return a dictionary mapping all t_ids to their corresponding lemmas.
For example, for this element:
<term id="t1" type="open" lemma="accord" pos="V" morphofeat="VBG">
<span>
<target id="w1" />
</span>
<externalReferences>
<externalRef resource="wn30g.bin64" reference="ili-30-02700104-v" confidence="0.732195" />
<externalRef resource="wn30g.bin64" reference="ili-30-02255268-v" confidence="0.267805" />
<externalRef resource="WordNet-3.0" reference="ili-30-02700104-v" confidence="0.59329313" />
<externalRef resource="WordNet-3.0" reference="ili-30-02255268-v" confidence="0.40670687" />
</externalReferences>
</term>
the dictionary would be updated with: a) KEY: 1 (integer), b) VALUE: 'accord'
In [ ]:
def load_mapping_tid2token(doc):
    """
    given a loaded xml file (doc) of type lxml.etree._ElementTree
    create a dictionary mapping all t_ids to their corresponding lemmas
    """
    # your code here

tid2lemma = load_mapping_tid2token(doc)
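A sketch of one possible implementation. It assumes the term elements live under a top-level terms element (as in the NAF layout), and that ids always start with a single 't'; the demo document below is made up for illustration, not the real naf.xml:

```python
from lxml import etree

def load_mapping_tid2token(doc):
    '''given a loaded xml file (doc), map each t_id (int) to its lemma'''
    tid2lemma = {}
    # assumption: term elements sit under the top-level 'terms' element
    for term_el in doc.findall('terms/term'):
        t_id = int(term_el.get('id')[1:])   # 't1' -> 1 (assumes 't' prefix)
        tid2lemma[t_id] = term_el.get('lemma')
    return tid2lemma

# demo on a minimal made-up document
doc = etree.ElementTree(etree.fromstring(
    '<NAF><terms>'
    '<term id="t1" lemma="accord"/>'
    '<term id="t2" lemma="to"/>'
    '</terms></NAF>'))
print(load_mapping_tid2token(doc))   # {1: 'accord', 2: 'to'}
```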
Now call the helper functions in the for-loop at the top of this notebook. Complete that code by writing the output to a CoNLL file.
Create two Python files:
- utils.py: should contain all your helper functions
- converter.py: should contain the main code (and import the helper functions)
In [ ]: