Assignment 4b: Converting XML to CSV

Due: Tuesday the 4th of December 2018 at 20:00

  • Please name your notebook with the following naming convention: ASSIGNMENT_4b_FIRSTNAME_LASTNAME.ipynb
  • Please submit your complete assignment (4a + 4b) by compressing all your material (notebooks + python files + additional files) into a single .zip file following this naming convention: ASSIGNMENT_4_FIRSTNAME_LASTNAME.zip.
    Use this google form for submission.
  • If you have questions about this assignment, please refer to the forum on the Canvas site.

In this second part of Assignment 4, you will be asked to read a type of XML and to convert it to a type of CSV.

Introduction

NewsReader needs your help! Within the project, an English NLP pipeline has been developed and they would like to know how well it performs. However, in order to run the scorer, they must convert their output format, NAF, to CoNLL. In this assignment, you will write the converter!

The NLP task they've chosen is Entity Linking. The goal of Entity Linking is to link an expression to the identity of an entity. For example, in the sentence Ford makes cars, the goal of Entity Linking would be to link the expression Ford to the Wikipedia page of Ford Motor Company. This is a challenging task, since Ford has many meanings; for example, it can also mean the actor Harrison Ford.

The output from the NewsReader pipeline is not a text file, nor CSV/TSV. No, it's a type of XML. Instead of going through a file line by line, you search for specific elements or attributes of elements.


In [ ]:
from lxml import etree

NE (Named Entities)

Please observe the following element. Try to understand which elements are children/parents of which elements.

<entity id="e3" type="PERSON">
      <references>
        <!--Craig Wood-->
        <span>
          <target id="t42" />
          <target id="t43" />
        </span>
      </references>
      <externalReferences>
        <externalRef resource="spotlight_v1" reference="http://dbpedia.org/resource/Craig_Wood_(golfer)" confidence="1.0" reftype="en" />
        <externalRef resource="spotlight_v1" reference="http://dbpedia.org/resource/Craig_Wood_(NASCAR)" confidence="9.0210754E-36" reftype="en" />
        <externalRef resource="spotlight_v1" reference="http://dbpedia.org/resource/Craig_Wood_(film_editor)" confidence="1.739528E-36" reftype="en" />
      </externalReferences>
    </entity>

Above, you see an example of the NewsReader output in NAF.

  • the entity element is the main element.
  • the entity element contains information about its id and the entity type (attributes id and type, respectively).
  • the first child of the entity element is the references element. This element provides us with the information that the entity is 'Craig Wood' and that the term 'Craig' is the 42nd term in the document and 'Wood' the 43rd.
  • the second child of the entity element is the externalReferences element. This shows the output from the system 'spotlight_v1', which tries to link the entity 'Craig Wood' to Dbpedia (structured Wikipedia). The system has a confidence of 1.0 (the highest possible value) that the entity refers to http://dbpedia.org/resource/Craig_Wood_(golfer).

Our goal is to extract the following information from this element:

  • entity type: 'PERSON', 'ORGANISATION' or 'LOCATION'. This can be found in the attribute type of element entity.
  • the dbpedia link with the highest confidence (see externalReferences/externalRef).
  • finally, for each term (t42 and t43), we want to know its position in the entity: t42 ('Craig') is the first term in the entity, and t43 ('Wood') is the last term in the entity.
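Before extracting anything, it helps to see how lxml navigates such an element. The snippet below (a standalone illustration using a trimmed copy of the entity above, not part of the assignment code) shows how attributes and child elements are accessed:

```python
from lxml import etree

# A trimmed copy of the example entity element.
entity_xml = '''
<entity id="e3" type="PERSON">
  <references>
    <span>
      <target id="t42" />
      <target id="t43" />
    </span>
  </references>
  <externalReferences>
    <externalRef resource="spotlight_v1"
                 reference="http://dbpedia.org/resource/Craig_Wood_(golfer)"
                 confidence="1.0" reftype="en" />
  </externalReferences>
</entity>'''

entity_el = etree.fromstring(entity_xml)

# Attribute values are accessed with .get(); note they are always strings.
print(entity_el.get('type'))              # PERSON

# Child elements are located with a path relative to the current element.
target_els = entity_el.findall('references/span/target')
print([t.get('id') for t in target_els])  # ['t42', 't43']

# Attributes of a child element work the same way.
ref_el = entity_el.find('externalReferences/externalRef')
print(ref_el.get('confidence'))           # 1.0
```

Keep in mind that .get() returns a string (or None if the attribute is missing), which matters for the confidence values later on.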

We want to convert this entity element to a format called CoNLL. Using the entity element as input, it should output the following:

...

41  from    _   _
42  Craig   (PERSON http://dbpedia.org/resource/Craig_Wood_(golfer)
43  Wood    PERSON) http://dbpedia.org/resource/Craig_Wood_(golfer)
44  's  _   _

...

Note that it also includes the tokens that are not annotated as an entity.
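Each CoNLL line is simply the four columns joined by tabs. A quick sanity check, using the column values from the example above:

```python
# Build one CoNLL line: T_ID TAB LEMMA TAB ENTITY_TYPE_WITH_POSITION TAB DBPEDIA_LINK
fields = ['42', 'Craig', '(PERSON', 'http://dbpedia.org/resource/Craig_Wood_(golfer)']
line = '\t'.join(fields)
print(line)

# Tokens that are not part of any entity get '_' in the last two columns.
unannotated = '\t'.join(['41', 'from', '_', '_'])
print(unannotated)
```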

Goal of this assignment

The goal of this assignment is to complete the code snippet below, which means that you will convert one NAF file to CoNLL. The assignment can be roughly divided into the following steps:

  • Step 1: complete the helper functions below
  • Step 2: call the helper functions in the for-loop
  • Step 3: move the code to python files (converter.py and utils.py)

This is the code snippet that you will have to complete and run:


In [ ]:
doc = etree.parse('../Data/xml_data/naf.xml')

t_id2info = dict()

entity_els = doc.findall('entities/entity')

for entity_el in entity_els:
    
    # determine entity type (default is _)
    entity_type = type_of_entity(entity_el)
    
    # extract dbpedia link with highest confidence (default is _)
    chosen_dbpedia_link = dbpedia_link_with_highest_confidence(entity_el)
   
    # determine the position of t_ids in the entity
    t_ids_positions = t_ids_with_position(entity_el)
    
    #loop over t_ids and their positions
    for t_id, position in t_ids_positions:

        #get position of t_id
        #HINT: use the indicate_position_of_tid function
        entity_type_with_position = indicate_position_of_tid(entity_type, position)

        #update dictionary
        t_id2info[t_id] = {'entity_type_with_position': entity_type_with_position,
                           'dbpedia_link': chosen_dbpedia_link}
        
        print(t_id, t_id2info[t_id])
    
    input('continue?') # only here for debugging
    # the input here allows you to inspect the output one entity element at a time
    

# load the mapping of term identifier to lemmas
tid2lemma = load_mapping_tid2token(doc)

# use the information from t_id2info and tid2lemma to create the conll
# T_ID TAB LEMMA  TAB ENTITY_TYPE_WITH_POSITION TAB CHOSEN_DBPEDIA_LINK NEWLINE

# HINT if a t_id does not have annotation both ENTITY_TYPE_WITH_POSITION and CHOSEN_DBPEDIA_LINK are '_'

with open('../Data/xml_data/naf.conll', 'w') as outfile:
    for t_id, lemma in sorted(tid2lemma.items()):
        
      # your code here

We will first load one entity element as XML in order to help us develop our program that will run on many entity elements.


In [ ]:
#load the element as XML element.
entity = '''
<entity id="e3" type="PERSON">
      <references>
        <!--Craig Wood-->
        <span>
          <target id="t42" />
          <target id="t43" />
        </span>
      </references>
      <externalReferences>
        <externalRef resource="spotlight_v1" reference="http://dbpedia.org/resource/Craig_Wood_(golfer)" confidence="1.0" reftype="en" />
        <externalRef resource="spotlight_v1" reference="http://dbpedia.org/resource/Craig_Wood_(NASCAR)" confidence="9.0210754E-36" reftype="en" />
        <externalRef resource="spotlight_v1" reference="http://dbpedia.org/resource/Craig_Wood_(film_editor)" confidence="1.739528E-36" reftype="en" />
      </externalReferences>
    </entity>'''

entity_el = etree.fromstring(entity)

Step 1: Helper functions

In order to extract all the relevant information from the entity element, we are going to write several small helper functions.

1a. Get the entity type

Create a function type_of_entity() that takes one parameter: entity_el (positional parameter). It should return the entity_type of the entity element (access the value of the attribute type). If the value is an empty string, or the attribute does not exist, return the string '_'.


In [ ]:
def type_of_entity(entity_el):
    '''
    given an entity element, return the entity type
    '''
    # your code here

entity_type = type_of_entity(entity_el)
print(entity_type)

1b. Get the dbpedia link with the highest confidence

Create a function dbpedia_link_with_highest_confidence() that takes one parameter: entity_el (positional parameter). It should return the dbpedia link with the highest confidence, or return the string '_' if there are no dbpedia links in the externalReferences element.

To do this, create a list of (confidence, dbpedia link) tuples:

    [(1.0, 'http://dbpedia.org/resource/Craig_Wood_(golfer)'),
     (9.0210754E-36, 'http://dbpedia.org/resource/Craig_Wood_(NASCAR)'),
     (1.739528E-36, 'http://dbpedia.org/resource/Craig_Wood_(film_editor)')]

HINT: do not forget to change the confidence to float (it's now a string).
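The reason for putting the confidence first in each tuple is that Python's max() compares tuples element by element, so the tuple with the highest confidence wins. A small demonstration with the confidences from the example (this is an illustration of the idiom, not necessarily the intended solution):

```python
# Confidence values come out of the XML as strings, so convert them to float first.
confidences_and_links = [
    (float('1.0'), 'http://dbpedia.org/resource/Craig_Wood_(golfer)'),
    (float('9.0210754E-36'), 'http://dbpedia.org/resource/Craig_Wood_(NASCAR)'),
    (float('1.739528E-36'), 'http://dbpedia.org/resource/Craig_Wood_(film_editor)'),
]

# max() on tuples compares the first item (the confidence) first.
highest_confidence, best_link = max(confidences_and_links)
print(best_link)   # http://dbpedia.org/resource/Craig_Wood_(golfer)
```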


In [ ]:
def dbpedia_link_with_highest_confidence(entity_el):
    '''
    given an entity element, return the dbpedia link with the highest confidence
    '''
    # your code here
    
result = dbpedia_link_with_highest_confidence(entity_el)
print(result)

1c. Find the positions of terms in entity

Create a function called t_ids_with_position() that takes one parameter: entity_el (positional parameter). It should loop over the references/span/target elements and return a list of tuples (term id, position_in_entity). Possible values for position_in_entity are: "start", "middle", "end", "start_and_end"

Example of output for the example entity_el:

    [(42, 'start'), (43, 'end')]

HINT: return an empty list if there are no target elements.
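The enumerate hint boils down to comparing each index against 0 and against the last index. Here is the pattern on a plain list of hypothetical term ids, separate from the assignment function:

```python
items = ['t42', 't50', 't51']  # hypothetical term ids, just for illustration
positions = []
n = len(items)

for index, item in enumerate(items):
    if index == 0:
        position = 'start'
    elif index == n - 1:
        position = 'end'
    else:
        position = 'middle'
    positions.append((item, position))

print(positions)  # [('t42', 'start'), ('t50', 'middle'), ('t51', 'end')]
```

Note that a one-element list would hit the 'start' branch here, which is exactly why the template below handles the single-target case separately as 'start_and_end'.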


In [ ]:
def t_ids_with_position(entity_el):
    '''
    given an entity element, return the position of each term id in that entity
    '''
    term_positions = []
    
    # find all 'references/span/target' elements and determine the number of targets
    target_els = entity_el.findall('references/span/target')
    len_target_els = len(target_els)
    
    #if there is only one element, the position is 'start_and_end'
    if len_target_els == 1:
        # your code here
    
    #if there are 0 children or two or more children, loop over the target elements.
    else:
        #HINT: use enumerate
        #HINT: use 'len_target_els' to check if it's the last element
        
        # your code here

    return term_positions

t_ids_with_position(entity_el)

1d. Get the entity_type with the position

Create a function indicate_position_of_tid() that returns the entity_type with parentheses that indicate the position of a term id.

For example, if the entity type is 'ORG', the function should work in the following way:

    -position = 'start': '(ORG'
    -position = 'middle': 'ORG'
    -position = 'end': 'ORG)'
    -position = 'start_and_end': '(ORG)'

If there is no entity type (entity_type == '_'), return the string '_'.
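One way to think about the bracketing is as one format template per position, where '{}' is replaced by the entity type. This is just a sketch of the mapping; your own solution may look different:

```python
# Hypothetical mapping from position to a format template (not required by the assignment).
templates = {
    'start': '({}',
    'middle': '{}',
    'end': '{})',
    'start_and_end': '({})',
}

for position in ('start', 'middle', 'end', 'start_and_end'):
    print(position, '->', templates[position].format('ORG'))
```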


In [ ]:
#### return entity type with parentheses indicating position

def indicate_position_of_tid(entity_type, position):
    '''
    this function returns the entity_type with the position.
    for example, if the entity type is 'ORG', 
    the function should work in the following way
    -position = 'start': '(ORG'
    -position = 'middle': 'ORG'
    -position = 'end': 'ORG)'
    -position = 'start_and_end': '(ORG)'
    
    if entity_type == '_': 
        return '_'
    '''
    # your code here 
    
    return result

1e. Mapping the t_ids to their corresponding lemmas

Create a function load_mapping_tid2token() that takes one parameter: doc (positional parameter), which represents a loaded XML file of type lxml.etree._ElementTree. It should return a dictionary mapping all t_ids to their corresponding lemmas.

For example, for this element:

    <term id="t1" type="open" lemma="accord" pos="V" morphofeat="VBG">
      <span>
        <target id="w1" />
      </span>
      <externalReferences>
        <externalRef resource="wn30g.bin64" reference="ili-30-02700104-v" confidence="0.732195" />
        <externalRef resource="wn30g.bin64" reference="ili-30-02255268-v" confidence="0.267805" />
        <externalRef resource="WordNet-3.0" reference="ili-30-02700104-v" confidence="0.59329313" />
        <externalRef resource="WordNet-3.0" reference="ili-30-02255268-v" confidence="0.40670687" />
      </externalReferences>
    </term>

the dictionary would be updated with: KEY: 1 (integer), VALUE: 'accord'.
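Two details are easy to miss here: the lemma is an attribute of the term element itself, and the key should be the numeric part of the id, so 't1' becomes the integer 1. A quick standalone check on a trimmed copy of the example term (assuming the ids always start with 't'):

```python
from lxml import etree

term_xml = '<term id="t1" type="open" lemma="accord" pos="V" morphofeat="VBG" />'
term_el = etree.fromstring(term_xml)

# Strip the leading 't' from the id and convert the rest to an integer.
t_id = int(term_el.get('id').lstrip('t'))
lemma = term_el.get('lemma')
print(t_id, lemma)  # 1 accord
```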


In [ ]:
def load_mapping_tid2token(doc):
    """
    given a loaded xml file (doc) of type lxml.etree._ElementTree
    create a dictionary mapping all t_ids to their corresponding lemmas
    """
    # your code here

tid2lemma = load_mapping_tid2token(doc)

Step 2: Calling the helper functions

Now call the helper functions in the for-loop at the top of this notebook. Complete this code by writing the output to a CoNLL file.

Step 3: Moving the code to Python files

Create two Python files:

  • utils.py: should contain all your helper functions
  • converter.py: should contain the main code (and import the helper functions)

In [ ]: