Chapter 18 - Data Formats III (XML)

So far, we have seen several data formats (CSV/TSV, JSON). This week, we will learn how to work with one of the most popular structured data formats: XML. XML is used a lot in NLP, so it is important that you know how to work with it.

At the end of this chapter, you will be able to

read an XML file using etree.parse
read XML from string using etree.fromstring
convert an XML element to a string using etree.tostring
use the following methods and attributes of an XML element (of type lxml.etree._Element):
- to access elements: methods find, findall, and getchildren
- to access attributes: method get
- to access element information: attributes tag and text
[not needed for assignment] create your own XML and write it to a file

If you want to learn more about this chapter, you might find the following links useful:

XML
detailed XML introduction
NAF XML
Xpath
Other structured data formats: JSON-LD, MicroData, RDF

If you have questions about this chapter, please refer to the forum on Canvas.

1. Introduction to XML

NLP is all about data. More specifically, we usually want to annotate (manually or automatically) textual data with linguistic information, such as part-of-speech tags and entities.

What would data look like that contains all this information? Let's look at a simple example:



In [ ]:

    
import nltk



In [ ]:

    
text = nltk.word_tokenize("Tom Cruise is an actor.")
print(nltk.pos_tag(text))

In this example, we see that the format is a list of tuples. The first element of each tuple is the word and the second element is the part-of-speech tag. Great, so far this works. However, we also want to indicate that Tom Cruise is an entity. Now we start to run into trouble, because some annotations apply to single words and others to combinations of words. In addition, we sometimes have more than one annotation per token. Flat formats such as CSV and TSV are not great at representing this kind of linguistic information. So is there a format that is better at it? The answer is yes, and the format is XML.

2. Terminology

Let's look at an example (the line numbers are there for explanation purposes). On purpose, we start with a non-linguistic, hopefully intuitive example. In the folder ../Data/xml_data this XML is stored as the file course.xml. You can inspect this file using a text editor (e.g. Atom, BBEdit or Notepad++).

1.  <Course>
2.      <person role="coordinator">Van der Vliet</person>
3.      <person role="instructor">Van Miltenburg</person>
4.      <person role="instructor">Van Son</person>
5.      <person role="instructor">Postma</person>
6.      <person role="instructor">Sommerauer</person>
7.      <person role="student">Baloche</person>
8.      <person role="student">De Boer</person>
9.      <animal role="student">Rubber duck</animal>
10.     <person role="student">Van Doorn</person>
11.     <person role="student">De Jager</person>
12.     <person role="student">King</person>
13.     <person role="student">Kingham</person>
14.     <person role="student">Mózes</person>
15.     <person role="student">Rübsaam</person>
16.     <person role="student">Torsi</person>
17.     <person role="student">Witteman</person>
18.     <person role="student">Wouterse</person>
19.     <person/>
20. </Course>

2.1 Elements

Lines 1 to 20 all show examples of XML elements. Each XML element consists of a start tag (e.g. <person>) and an end tag (e.g. </person>). An element can contain:

text: Van der Vliet on line 2
attributes: role attribute in lines 2 to 18
elements: elements can contain other elements, e.g. person elements inside the Course element. The terminology to talk about this is as follows. In this example, we call person the child of Course and Course the parent of person.

Please note that on line 19 the start tag and end tag are combined. This is possible when an element has neither children nor text. The syntax for such an element is <START_TAG/>.

2.2 Root element

A special element is the root element. In our example, Course is our root element. The element starts at line 1 (<Course>) and ends at line 20 (</Course>). Notice the difference between the start tag (no '/') and the end tag (with '/'). The root element is special in that it is the only element without a parent: every other element in the document is nested inside it.

2.3 Attributes

Elements can contain attributes, which contain information about the element. In this case, this information is the role a person has in the course. All attributes are located in the start tag of an XML element.

3. Working with XML in Python

Now that we know the basics of XML, we want to be able to access it in Python. In order to work with XML, we will use the lxml library.



In [ ]:

    
from lxml import etree

We will focus on the following methods/attributes:

to parse the XML from file or string: the methods etree.parse() and etree.fromstring()
to access the root element: the method getroot()
to access elements: the methods find(), findall(), and getchildren()
to access attributes: the method get()
to access element information: the attributes tag and text

3.1 Parsing XML from file or string

The etree.fromstring() method is used to parse XML from a string:



In [ ]:

    
xml_string = """
<Course>
    <person role="coordinator">Van der Vliet</person>
    <person role="instructor">Van Miltenburg</person>
    <person role="instructor">Van Son</person>
    <person role="instructor">Marten Postma</person>
    <person role="student">Baloche</person>
    <person role="student">De Boer</person>
    <animal role="student">Rubber duck</animal>
    <person role="student">Van Doorn</person>
    <person role="student">De Jager</person>
    <person role="student">King</person>
    <person role="student">Kingham</person>
    <person role="student">Mózes</person>
    <person role="student">Rübsaam</person>
    <person role="student">Torsi</person>
    <person role="student">Witteman</person>
    <person role="student">Wouterse</person>
    <person/>
</Course>
"""

tree = etree.fromstring(xml_string)
print(type(tree))

The etree.parse() method is used to load XML files on your computer:



In [ ]:

    
tree = etree.parse('../Data/xml_data/course.xml')
print(type(tree))

As you can see, etree.parse() returns an ElementTree, whereas etree.fromstring() returns an Element. One of the important differences is that the ElementTree class serialises as a complete document, as opposed to a single Element. This includes top-level processing instructions and comments, as well as a DOCTYPE and other DTD content in the document. For now, it's not too important that you know what these are; just remember that there is a difference between ElementTree and Element.
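The difference between the two return types can be checked in a small, self-contained sketch. It parses the same XML once with etree.fromstring() and once with etree.parse(), using an in-memory io.BytesIO object in place of a file on disk (that substitution is an assumption made to keep the example runnable anywhere). The import falls back to the standard library's xml.etree.ElementTree, which offers the same calls used here, in case lxml is not installed; this fallback is an addition and not part of the chapter.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same calls used here
    import xml.etree.ElementTree as etree

xml_bytes = b'<Course><person role="student">Lee</person></Course>'

# fromstring() hands back the root Element directly
element = etree.fromstring(xml_bytes)

# parse() reads from a file or file-like object and hands back an ElementTree
document = etree.parse(io.BytesIO(xml_bytes))

print(hasattr(element, 'getroot'))    # an Element has no getroot() method
print(hasattr(document, 'getroot'))   # an ElementTree does
print(document.getroot().tag == element.tag)
```

In other words, only the object returned by etree.parse() needs the extra getroot() step before you can work with the elements.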

3.1 Accessing root element

While etree.fromstring() gives you the root element right away, etree.parse() does not. In order to access the root element of an ElementTree, we first need to use the getroot() method. Note that printing the root does not show the XML element itself, but only a reference to it. In order to show the element itself, we can use the etree.dump() method.



In [ ]:

    
root = tree.getroot()
print('root', type(root), root)
print()
print('etree.dump example')
etree.dump(root, pretty_print=True)
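The chapter objectives also mention etree.tostring(). Whereas etree.dump() prints an element, etree.tostring() returns the serialized element so that you can store or process it. The following is a minimal sketch on a one-element example; the import fallback to the standard library is an assumption added so the snippet also runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same calls used here
    import xml.etree.ElementTree as etree

person_el = etree.fromstring('<person role="student">Lee</person>')

# By default, tostring() returns bytes
as_bytes = etree.tostring(person_el)

# With encoding='unicode', it returns a regular Python string
as_text = etree.tostring(person_el, encoding='unicode')

print(type(as_bytes), as_bytes)
print(type(as_text), as_text)
```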

As with any Python object, we can use the built-in function dir() to list all attributes and methods of an element (which has the type lxml.etree._Element), some of which will be illustrated below.



In [ ]:

    
print(type(root))
dir(root)

3.2 Accessing elements

There are several ways of accessing XML elements. The find() method returns the first matching child.



In [ ]:

    
first_person_el = root.find('person')
etree.dump(first_person_el, pretty_print=True)
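It is worth knowing what find() does when nothing matches: it returns None. The sketch below uses a hypothetical tag name (teacher) that does not occur in the XML. Comparing with `is None` is the safer check, since a plain truth test can be misleading for elements without children. The import fallback to the standard library is an assumption added so the snippet also runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same calls used here
    import xml.etree.ElementTree as etree

root = etree.fromstring('<Course><person role="student">Lee</person></Course>')

# 'teacher' is a hypothetical tag that does not occur in this XML
missing_el = root.find('teacher')
print(missing_el)

# Check for None before using the result; calling .text on None raises an AttributeError
if missing_el is None:
    print('no teacher element found')
else:
    print(missing_el.text)
```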

In order to get a list of all person children, we can use the findall() method. Notice that this does not return the animal since we are looking for person elements.



In [ ]:

    
all_person_els = root.findall('person')
all_person_els
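Since findall() returns a list, it combines naturally with loops and list comprehensions. As a sketch, the example below uses a small hypothetical library XML (not one of the course files) and also shows that the path given to findall() may contain a simple attribute test of the form tag[@attribute='value']. The import fallback to the standard library is an assumption added so the snippet also runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same calls used here
    import xml.etree.ElementTree as etree

# Hypothetical example data, invented for this illustration
library = etree.fromstring("""
<library>
    <book lang="en">Emma</book>
    <book lang="nl">Max Havelaar</book>
    <book lang="en">Dracula</book>
    <magazine lang="en">Wired</magazine>
</library>
""")

# All book children; the magazine is skipped because its tag differs
all_titles = [book_el.text for book_el in library.findall('book')]

# Only the book children whose lang attribute equals 'en'
english_titles = [book_el.text for book_el in library.findall("book[@lang='en']")]

print(all_titles)
print(english_titles)
```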

Sometimes, we simply want all the children, regardless of their tags. This can be achieved using the getchildren() method, which returns all children of an element. Now we do get the animal element again.



In [ ]:

    
all_child_els = root.getchildren()
all_child_els
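The lxml documentation marks getchildren() as deprecated and suggests list(element) or iterating over the element directly. The sketch below shows both alternatives on a shortened version of the course XML. The import fallback to the standard library is an assumption added so the snippet also runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same calls used here
    import xml.etree.ElementTree as etree

root = etree.fromstring(
    '<Course>'
    '<person role="student">Lee</person>'
    '<animal role="student">Rubber duck</animal>'
    '<person role="student">Pitt</person>'
    '</Course>'
)

# An element can be turned into a list of its children ...
all_child_els = list(root)

# ... or iterated over directly
child_tags = [child_el.tag for child_el in root]

print(len(all_child_els))
print(child_tags)
```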

3.3 Accessing element information

We will now show how to access the attributes, text, and tag of an element.

The get() method is used to access an attribute of an element. If the attribute does not exist, get() returns None instead of raising an error.



In [ ]:

    
first_person_el = root.find('person')
role_first_person_el = first_person_el.get('role')
attribute_not_found = first_person_el.get('blabla')
print('role first person element:', role_first_person_el)
print('value if not found:', attribute_not_found)
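Two related options may be handy here, shown as a minimal sketch: get() accepts a second argument that is returned instead of None when the attribute is missing, and the attrib property gives access to all attributes of an element at once. The id attribute in the example is hypothetical and only added to have more than one attribute. The import fallback to the standard library is an assumption added so the snippet also runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same calls used here
    import xml.etree.ElementTree as etree

# The id attribute is hypothetical; it is not part of course.xml
person_el = etree.fromstring('<person role="student" id="p1">Lee</person>')

role = person_el.get('role')
fallback_value = person_el.get('blabla', 'unknown')  # second argument: value to use if missing
all_attributes = dict(person_el.attrib)              # all attributes as a dictionary

print(role)
print(fallback_value)
print(all_attributes)
```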

The text of an element is found in the attribute text:



In [ ]:

    
print(first_person_el.text)

The tag of an element is found in the attribute tag:



In [ ]:

    
print(first_person_el.tag)
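One detail to keep in mind when looping over elements: an element without text, such as <person/>, has None as its text attribute. The sketch below shows one possible way to guard against that; the '(no text)' placeholder is an arbitrary choice. The import fallback to the standard library is an assumption added so the snippet also runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same calls used here
    import xml.etree.ElementTree as etree

root = etree.fromstring('<Course><person role="student">Lee</person><person/></Course>')

texts = []
for child_el in root:
    texts.append(child_el.text)
    # Replace a missing text with a placeholder before printing
    shown_text = child_el.text if child_el.text is not None else '(no text)'
    print(child_el.tag, shown_text)
```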

4. How to deal with more than one layer

In our previous example, we had an XML document with only one nested layer (person). However, XML can deal with many more. Let's look at such an example and think about how you would access the first target element, i.e.

<target id="t1" />

<NAF xml:lang="en" version="v3">
    <terms>
        <term id="t1" type="open" lemma="Tom" pos="N" morphofeat="NNP"/>
        <term id="t2" type="open" lemma="Cruise" pos="N" morphofeat="NNP"/>
        <term id="t3" type="open" lemma="be" pos="V" morphofeat="VBZ"/>
        <term id="t4" type="open" lemma="an" pos="R" morphofeat="DT"/>
        <term id="t5" type="open" lemma="actor" pos="N" morphofeat="NN"/>
    </terms>
    <entities>
        <entity id="e3" type="PERSON">
              <references>
                  <span>
                      <target id="t1" />
                      <target id="t2" />
                  </span>
              </references>
        </entity>
    </entities>
</NAF>

Again, we use etree.fromstring() to load XML from a string:



In [ ]:

    
naf_string = """
<NAF xml:lang="en" version="v3">
    <text>
        <wf id="w1" offset="0" length="3" sent="1" para="1">tom</wf>
        <wf id="w2" offset="4" length="6" sent="1" para="1">cruise</wf>
        <wf id="w3" offset="11" length="2" sent="1" para="1">is</wf>
        <wf id="w4" offset="14" length="2" sent="1" para="1">an</wf>
        <wf id="w5" offset="17" length="5" sent="1" para="1">actor</wf>
    </text>
    <terms>
        <term id="t1" type="open" lemma="Tom" pos="N" morphofeat="NNP"/>
        <term id="t2" type="open" lemma="Cruise" pos="N" morphofeat="NNP"/>
        <term id="t3" type="open" lemma="be" pos="V" morphofeat="VBZ"/>
        <term id="t4" type="open" lemma="an" pos="R" morphofeat="DT"/>
        <term id="t5" type="open" lemma="actor" pos="N" morphofeat="NN"/>
    </terms> 
    <entities>
        <entity id="e3" type="PERSON">
              <references>
                  <span>
                      <target id="t1" />
                      <target id="t2" />
                  </span>
              </references>
        </entity>
    </entities>
</NAF>
"""

naf = etree.fromstring(naf_string)
print(type(naf))
etree.dump(naf, pretty_print=True)

Please note that the structure is as follows:

the NAF element is the parent of the elements text, terms, and entities
the wf elements are children of the text element. They provide information about the position of words in the text, e.g. that tom is the first word in the text (id="w1") and in the first sentence (sent="1")
the term elements are children of the terms element. They provide information about lemmatization and part of speech
the entity element is a child of the entities element. We learn from the entity element that the terms t1 and t2 (i.e. Tom Cruise) form an entity of type PERSON.

One way of accessing the first target element is by going one level at a time:



In [ ]:

    
entities_el = naf.find('entities')
entity_el = entities_el.find('entity')
references_el = entity_el.find('references')
span_el = references_el.find('span')
target_el = span_el.find('target')
etree.dump(target_el, pretty_print=True)

Is there a better way? The answer is yes! The find() method also accepts a path of tags separated by slashes, so we can reach our target element in one step:



In [ ]:

    
target_el = naf.find('entities/entity/references/span/target')
etree.dump(target_el, pretty_print=True)
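If you do not want to spell out the full path, the path syntax accepted by find() and findall() also allows './/tag', which searches all descendants of the element rather than only its direct children. The following is a sketch on a reduced version of the NAF example (the text and terms layers are left out). The import fallback to the standard library is an assumption added so the snippet also runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same calls used here
    import xml.etree.ElementTree as etree

# Reduced version of the NAF example: only the entities layer is kept
naf = etree.fromstring("""
<NAF>
    <entities>
        <entity id="e3" type="PERSON">
            <references>
                <span>
                    <target id="t1" />
                    <target id="t2" />
                </span>
            </references>
        </entity>
    </entities>
</NAF>
""")

# Full path, one tag per level
via_full_path = naf.find('entities/entity/references/span/target')

# './/target' looks for target elements at any depth below naf
via_descendant_search = naf.find('.//target')
all_target_ids = [target_el.get('id') for target_el in naf.findall('.//target')]

print(via_full_path.get('id'), via_descendant_search.get('id'))
print(all_target_ids)
```

The full path is more explicit about where the element lives, which helps when the same tag occurs in several layers of a document.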

You can also use findall() to find all target elements:



In [ ]:

    
for target_el in naf.findall('entities/entity/references/span/target'):
    etree.dump(target_el, pretty_print=True)
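To see how the layers relate to each other, the sketch below resolves the target ids of each entity to the lemmas of the corresponding term elements. It uses a reduced version of the NAF example, and the variable names are chosen for this illustration only. The import fallback to the standard library is an assumption added so the snippet also runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same calls used here
    import xml.etree.ElementTree as etree

# Reduced version of the NAF example with only the attributes needed here
naf = etree.fromstring("""
<NAF>
    <terms>
        <term id="t1" lemma="Tom"/>
        <term id="t2" lemma="Cruise"/>
        <term id="t3" lemma="be"/>
    </terms>
    <entities>
        <entity id="e3" type="PERSON">
            <references>
                <span>
                    <target id="t1" />
                    <target id="t2" />
                </span>
            </references>
        </entity>
    </entities>
</NAF>
""")

# Step 1: map each term id to its lemma
lemma_by_term_id = {term_el.get('id'): term_el.get('lemma')
                    for term_el in naf.findall('terms/term')}

# Step 2: for each entity, look up the lemmas of the terms it points to
entity_lemmas = {}
for entity_el in naf.findall('entities/entity'):
    target_ids = [target_el.get('id')
                  for target_el in entity_el.findall('references/span/target')]
    entity_lemmas[entity_el.get('id')] = ' '.join(lemma_by_term_id[term_id]
                                                  for term_id in target_ids)

print(entity_lemmas)
```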

5. EXTRA: Creating your own XML

Please note that this section is optional, meaning that you don't need to understand this section in order to complete the assignment.

There are three main steps:

Step a: Create an XML object with a root element
Step b: Creating child elements and adding them
Step c: Writing to a file

Step a: Create an XML object with a root element

You create a new XML object by:

creating the root element -> using etree.Element
creating the main XML object -> using etree.ElementTree

You do not have to fully understand how this works. Please make sure you can reuse this code snippet when you create your own XML.



In [ ]:

    
our_root = etree.Element('Course')
our_tree = etree.ElementTree(our_root)

We can inspect what we have created by using the etree.dump() method. As you can see, we only have the root node Course currently in our document.



In [ ]:

    
etree.dump(our_root, pretty_print=True)

As you see, we created an XML object, containing only the root element Course.

Step b: Creating child elements and adding them

There are two ways to add child elements to the root element. The first is to create an element using the etree.Element() method and using append() to add it to the root:



In [ ]:

    
# Define tag, attributes and text of the new element
tag = 'person' # what the start and end tag will be 
attributes = {'role': 'student'} # dictionary of attributes, can be more than one
name_student = 'Lee' # the text of the elements

# Create new Element
new_person_element = etree.Element(tag, attrib=attributes)
new_person_element.text = name_student

# Add to root
our_root.append(new_person_element)

# Inspect the current XML
etree.dump(our_root, pretty_print=True)

However, this is so common that there is a shorter and much more efficient way to do this: by using etree.SubElement(). It accepts the same arguments as the etree.Element() method, but additionally requires the parent as first argument:



In [ ]:

    
# Define tag, attributes and text of the new element
tag = 'person' 
attributes = {'role': 'student'} 
name_student = 'Pitt' 

# Add to root
another_person_element = etree.SubElement(our_root, tag, attrib=attributes) # parent is our_root
another_person_element.text = name_student

# Inspect the current XML
etree.dump(our_root, pretty_print=True)
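When you have several elements to add, etree.SubElement() fits well in a loop. The sketch below builds the same kind of Course document from a list of (name, role) tuples; the list itself is example data for this illustration. It also passes the attribute as a keyword argument (role=...), which is an alternative to the attrib dictionary. The import fallback to the standard library is an assumption added so the snippet also runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same calls used here
    import xml.etree.ElementTree as etree

our_root = etree.Element('Course')

# Example data for this illustration: (name, role) pairs
people = [('Lee', 'student'), ('Pitt', 'student')]

for name, role in people:
    # Attributes can also be given as keyword arguments instead of attrib={...}
    person_el = etree.SubElement(our_root, 'person', role=role)
    person_el.text = name

print(etree.tostring(our_root, encoding='unicode'))
```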

As we have seen before, XML can have multiple nested layers. Creating these works the same way as adding child elements to the root, but now we specify one of the other elements as the parent (in this case, new_person_element).



In [ ]:

    
# Define tag, attributes and text of the new element
tag = 'pet'
attributes = {'role': 'joy'}
name_pet = 'Romeo'

# Add to new_person_element
new_pet_element = etree.SubElement(new_person_element, tag, attrib=attributes) # parent is new_person_element
new_pet_element.text = name_pet

# Inspect the current XML
etree.dump(our_root, pretty_print=True)

Step c: Writing to a file

This is how we can write our self-made XML to a file. Please inspect ../Data/xml_data/selfmade.xml using a text editor to check if it worked.



In [ ]:

    
with open('../Data/xml_data/selfmade.xml', 'wb') as outfile:
    our_tree.write(outfile,
                   pretty_print=True,
                   xml_declaration=True,
                   encoding='utf-8')
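Besides opening the file in a text editor, you can check the result in Python by parsing the file again. The following is a minimal round-trip sketch that writes to a temporary directory instead of ../Data/xml_data (an assumption, so that it does not touch the course files). The pretty_print argument is left out here because the standard-library fallback used in the import does not support it; that fallback is itself an assumption added so the snippet also runs without lxml.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same calls used here
    import xml.etree.ElementTree as etree

# Build a small document with a non-ASCII name
our_root = etree.Element('Course')
person_el = etree.SubElement(our_root, 'person', attrib={'role': 'student'})
person_el.text = 'Mózes'
our_tree = etree.ElementTree(our_root)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'selfmade.xml')

    # Write the document as UTF-8 bytes, including the XML declaration
    with open(path, 'wb') as outfile:
        our_tree.write(outfile, xml_declaration=True, encoding='utf-8')

    # Read the file back in and take its root element
    reloaded_root = etree.parse(path).getroot()

reloaded_name = reloaded_root.find('person').text
print(reloaded_root.tag, reloaded_name)
```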

Exercises

Exercise 1:

Have another look at the XML below. Then print the following information:

the names of all students
the names of all instructors whose name starts with 'Van'
all names containing a space
the role of 'Rubber duck'



In [ ]:

    
xml_string = """
<Course>
    <person role="coordinator">Van der Vliet</person>
    <person role="instructor">Van Miltenburg</person>
    <person role="instructor">Van Son</person>
    <person role="instructor">Marten Postma</person>
    <person role="student">Baloche</person>
    <person role="student">De Boer</person>
    <animal role="student">Rubber duck</animal>
    <person role="student">Van Doorn</person>
    <person role="student">De Jager</person>
    <person role="student">King</person>
    <person role="student">Kingham</person>
    <person role="student">Mózes</person>
    <person role="student">Rübsaam</person>
    <person role="student">Torsi</person>
    <person role="student">Witteman</person>
    <person role="student">Wouterse</person>
    <person/>
</Course>
"""

tree = etree.fromstring(xml_string)
print(type(tree))



In [ ]:



In [ ]:



In [ ]:



In [ ]:

Exercise 2:

In the folder ../Data/xml_data there is an XML file called framenet.xml, which is a simplified version of the data provided by the FrameNet project.

FrameNet is a lexical database describing semantic frames, which are representations of events or situations and the participants in them. For example, cooking typically involves a person doing the cooking (Cook), the food that is to be cooked (Food), something to hold the food while cooking (Container) and a source of heat (Heating_instrument). In FrameNet, this is represented as a frame called Apply_heat. The Cook, Food, Heating_instrument and Container are called frame elements (FEs). Words that evoke this frame, such as fry, bake, boil, and broil, are called lexical units (LUs) of the Apply_heat frame. FrameNet also contains relations between frames. For example, Apply_heat has relations with the Absorb_heat, Cooking_creation and Intentionally_affect frames. In FrameNet, frame descriptions are stored in XML format.

framenet.xml contains the information about the frame Waking_up. Parse the XML file and print the following:

the name of the frame
the names of all lexical units
the definitions of all lexical units
the related frames with their type of relation to Waking_up (e.g. Event with the Inherits from relation)



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]: