So far, we have seen several data formats (CSV/TSV, JSON). This week, we will learn how to work with one of the most popular structured data formats: XML. XML is used a lot in NLP, so it is important that you know how to work with it.
etree.parse
etree.fromstring
etree.tostring
use the following methods and attributes of an XML element (of type lxml.etree._Element):
methods find, findall, and getchildren
method get
attributes tag and text
[not needed for assignment] create your own XML and write it to a file
If you have questions about this chapter, please refer to the forum on Canvas.
NLP is all about data. More specifically, we usually want to annotate (manually or automatically) textual data with information about:
What would data look like that contains all this information? Let's look at a simple example:
In [ ]:
import nltk
In [ ]:
text = nltk.word_tokenize("Tom Cruise is an actor.")
print(nltk.pos_tag(text))
In this example, we see that the format is a list of tuples. The first element of each tuple is the word and the second element is the part of speech tag. Great, so far this works. However, we also want to indicate that Tom Cruise is an entity. Now, we start to run into trouble, because some annotations are for single words and some are for combinations of words. In addition, sometimes we have more than one annotation per token. Data structures such as CSV and TSV are not great at representing linguistic information. So is there a format that is better at it? The answer is yes and the format is XML.
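To make the problem concrete, here is a small sketch of what happens when we try to add an entity annotation to the list of tuples. The tagged tokens are typed in by hand for illustration (they are not copied from an nltk run), and the entities structure with its start and end keys is a made-up workaround, not a standard format.

```python
# Token-level annotation: one (word, tag) tuple per token (hand-written example)
tagged = [("Tom", "NNP"), ("Cruise", "NNP"), ("is", "VBZ"),
          ("an", "DT"), ("actor", "NN"), (".", ".")]

# An entity spans several tokens, so it does not fit into a single tuple.
# Hypothetical workaround: a second structure that points at token positions.
entities = [{"type": "PERSON", "start": 0, "end": 2}]  # end is exclusive

for entity in entities:
    words = [word for word, tag in tagged[entity["start"]:entity["end"]]]
    print(entity["type"], " ".join(words))
```

We already need two separate structures that have to be kept in sync, and every additional annotation layer would add another one.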
Let's look at an example (the line numbers are there for explanation purposes). On purpose, we start with a non-linguistic, hopefully intuitive example. In the folder ../Data/xml_data, this XML is stored as the file course.xml. You can inspect this file using a text editor (e.g. Atom, BBEdit or Notepad++).
1. <Course>
2. <person role="coordinator">Van der Vliet</person>
3. <person role="instructor">Van Miltenburg</person>
4. <person role="instructor">Van Son</person>
5. <person role="instructor">Postma</person>
6. <person role="instructor">Sommerauer</person>
7. <person role="student">Baloche</person>
8. <person role="student">De Boer</person>
9. <animal role="student">Rubber duck</animal>
10. <person role="student">Van Doorn</person>
11. <person role="student">De Jager</person>
12. <person role="student">King</person>
13. <person role="student">Kingham</person>
14. <person role="student">Mózes</person>
15. <person role="student">Rübsaam</person>
16. <person role="student">Torsi</person>
17. <person role="student">Witteman</person>
18. <person role="student">Wouterse</person>
19. <person/>
20. </Course>
Lines 1 to 20 show examples of XML elements. Each XML element consists of a start tag (e.g. <person>) and an end tag (e.g. </person>). An element can contain text (e.g. Van der Vliet), attributes (e.g. role="coordinator"), and other elements, which we call its children. For example, person is the child of Course, and Course is the parent of person.
Please note that on line 19 the start tag and end tag are combined. This happens when an element has no children and no text. The syntax for such an element is <START_TAG/>.
A special element is the root element. In our example, Course is our root element. The element starts at line 1 (<Course>) and ends at line 20 (</Course>). Notice the difference between the start tag (no '/') and the end tag (with '/'). The root element is special in that it is the only element without a parent: all other elements are nested inside it.
Elements can contain attributes, which contain information about the element. In this case, this information is the role
a person has in the course. All attributes are located in the start tag of an XML element.
Now that we know the basics of XML, we want to be able to access it in Python. In order to work with XML, we will use the lxml library.
In [ ]:
from lxml import etree
We will focus on the following methods/attributes:
etree.parse() and etree.fromstring()
getroot()
find(), findall(), and getchildren()
get()
tag and text
In [ ]:
xml_string = """
<Course>
<person role="coordinator">Van der Vliet</person>
<person role="instructor">Van Miltenburg</person>
<person role="instructor">Van Son</person>
<person role="instructor">Marten Postma</person>
<person role="student">Baloche</person>
<person role="student">De Boer</person>
<animal role="student">Rubber duck</animal>
<person role="student">Van Doorn</person>
<person role="student">De Jager</person>
<person role="student">King</person>
<person role="student">Kingham</person>
<person role="student">Mózes</person>
<person role="student">Rübsaam</person>
<person role="student">Torsi</person>
<person role="student">Witteman</person>
<person role="student">Wouterse</person>
<person/>
</Course>
"""
tree = etree.fromstring(xml_string)
print(type(tree))
The etree.parse() function is used to load XML files stored on your computer:
In [ ]:
tree = etree.parse('../Data/xml_data/course.xml')
print(type(tree))
As you can see, etree.parse()
returns an ElementTree
, whereas etree.fromstring()
returns an Element
. One of the important differences is that the ElementTree
class serialises as a complete document, as opposed to a single Element
. This includes top-level processing instructions and comments, as well as a DOCTYPE and other DTD content in the document. For now, it's not too important that you know what these are; just remember that there is a difference between ElementTree
and Element
.
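The difference can be sketched as follows. We use io.BytesIO as a stand-in for a file on disk, and the short XML snippet is invented for this illustration. The import falls back to Python's built-in xml.etree.ElementTree, which offers the same calls used here, in case lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library used in this chapter
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# Made-up miniature version of the course XML
xml_bytes = b"<Course><person role='student'>King</person></Course>"

# fromstring() gives us the root element directly
root_from_string = etree.fromstring(xml_bytes)

# parse() reads from a file (here: an in-memory stand-in) and gives us a tree
tree_from_file = etree.parse(io.BytesIO(xml_bytes))
root_from_tree = tree_from_file.getroot()

print(type(root_from_string).__name__)
print(type(tree_from_file).__name__)
print(root_from_string.tag, root_from_tree.tag)
```

Both routes end up at the same Course root element; the tree is only an extra wrapper around the whole document.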
While etree.fromstring()
gives you the root element right away, etree.parse()
does not. In order to access the root element of ElementTree
, we first need to use the getroot()
method. Note that printing the root does not show the XML element itself, but only a reference to the object. In order to show the element itself, we can use the etree.dump()
method.
In [ ]:
root = tree.getroot()
print('root', type(root), root)
print()
print('etree.dump example')
etree.dump(root, pretty_print=True)
As with any Python object, we can use the built-in function dir() to list all attributes and methods of an element (which has the type lxml.etree._Element), some of which will be illustrated below.
In [ ]:
print(type(root))
dir(root)
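The output of dir() is long, because it also contains internal names that start with an underscore. One possible way to keep it readable is to filter those out, as in this sketch. The one-line XML string is made up for the example, and the import falls back to the built-in xml.etree.ElementTree if lxml is not installed.

```python
try:
    from lxml import etree  # the library used in this chapter
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# Made-up one-element document
root = etree.fromstring("<Course><person role='student'>King</person></Course>")

# Keep only the names that do not start with an underscore
public_names = [name for name in dir(root) if not name.startswith('_')]
print(public_names)
```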
In [ ]:
first_person_el = root.find('person')
etree.dump(first_person_el, pretty_print=True)
In order to get a list of all person children, we can use the findall()
method.
Notice that this does not return the animal
since we are looking for person
elements.
In [ ]:
all_person_els = root.findall('person')
all_person_els
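Often we only want the person elements with a particular role. A minimal sketch, assuming a shortened, made-up version of the course XML: we can filter the result of findall() using the get() method (explained further below), or put the condition in the path itself with a predicate in square brackets. The import falls back to the built-in xml.etree.ElementTree if lxml is not installed.

```python
try:
    from lxml import etree  # the library used in this chapter
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# Shortened, made-up version of the course XML
root = etree.fromstring("""
<Course>
<person role="coordinator">Van der Vliet</person>
<person role="instructor">Van Son</person>
<person role="student">King</person>
<animal role="student">Rubber duck</animal>
<person role="student">Torsi</person>
<person/>
</Course>
""")

# Option 1: find all person elements, then filter on the role attribute
students = [el for el in root.findall('person') if el.get('role') == 'student']

# Option 2: let the path expression do the filtering
students_by_path = root.findall("person[@role='student']")

print([el.text for el in students])
print([el.text for el in students_by_path])
```

In both cases the animal is left out, because we only ask for person elements.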
Sometimes, we simply want all the children, regardless of their tags. This can be achieved using the getchildren() method.
Now we do get the animal element again.
In [ ]:
all_child_els = root.getchildren()
all_child_els
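Note that the lxml documentation marks getchildren() as deprecated and suggests using list(element) or iterating over the element directly. The sketch below shows both alternatives on a shortened, made-up version of the course XML. The import falls back to the built-in xml.etree.ElementTree if lxml is not installed.

```python
try:
    from lxml import etree  # the library used in this chapter
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# Shortened, made-up version of the course XML
root = etree.fromstring(
    "<Course>"
    "<person role='student'>King</person>"
    "<animal role='student'>Rubber duck</animal>"
    "<person/>"
    "</Course>"
)

# Alternative 1: turn the element into a list of its children
children = list(root)
print([child.tag for child in children])

# Alternative 2: loop over the children directly
for child in root:
    print(child.tag, child.text)
```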
The get() method is used to access an attribute of an element.
If the attribute does not exist, it returns None instead of raising an error.
In [ ]:
first_person_el = root.find('person')
role_first_person_el = first_person_el.get('role')
attribute_not_found = first_person_el.get('blabla')
print('role first person element:', role_first_person_el)
print('value if not found:', attribute_not_found)
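If None is not a convenient value for a missing attribute, get() also accepts a second argument that is returned instead. The sketch below illustrates this on a single made-up element; the fallback value 'unknown' is just an example. It also shows the attrib attribute, which holds all attributes of an element at once. The import falls back to the built-in xml.etree.ElementTree if lxml is not installed.

```python
try:
    from lxml import etree  # the library used in this chapter
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

person = etree.fromstring('<person role="coordinator">Van der Vliet</person>')

role = person.get('role')                  # attribute exists
missing = person.get('blabla')             # attribute does not exist: None
fallback = person.get('blabla', 'unknown') # our own default instead of None
all_attributes = dict(person.attrib)       # all attributes as a dictionary

print(role, missing, fallback, all_attributes)
```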
The text of an element is found in the attribute text
:
In [ ]:
print(first_person_el.text)
The tag of an element is found in the attribute tag
:
In [ ]:
print(first_person_el.tag)
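Keep in mind that an empty element such as <person/> has no text, so its text attribute is None. A minimal sketch of how we might guard against that when looping over elements, using a two-element, made-up version of the course XML and placeholder strings of our own choosing. The import falls back to the built-in xml.etree.ElementTree if lxml is not installed.

```python
try:
    from lxml import etree  # the library used in this chapter
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# Made-up document with one regular and one empty person element
root = etree.fromstring("<Course><person role='student'>King</person><person/></Course>")

summaries = []
for el in root.findall('person'):
    # Replace missing text and missing attributes by readable placeholders
    name = el.text if el.text is not None else '(no text)'
    role = el.get('role', '(no role)')
    summaries.append((el.tag, role, name))

for summary in summaries:
    print(summary)
```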
<NAF xml:lang="en" version="v3">
<terms>
<term id="t1" type="open" lemma="Tom" pos="N" morphofeat="NNP"/>
<term id="t2" type="open" lemma="Cruise" pos="N" morphofeat="NNP"/>
<term id="t3" type="open" lemma="be" pos="V" morphofeat="VBZ"/>
<term id="t4" type="open" lemma="an" pos="R" morphofeat="DT"/>
<term id="t5" type="open" lemma="actor" pos="N" morphofeat="NN"/>
</terms>
<entities>
<entity id="e3" type="PERSON">
<references>
<span>
<target id="t1" />
<target id="t2" />
</span>
</references>
</entity>
</entities>
</NAF>
Again, we use etree.fromstring()
to load XML from a string:
In [ ]:
naf_string = """
<NAF xml:lang="en" version="v3">
<text>
<wf id="w1" offset="0" length="3" sent="1" para="1">tom</wf>
<wf id="w2" offset="4" length="6" sent="1" para="1">cruise</wf>
<wf id="w3" offset="11" length="2" sent="1" para="1">is</wf>
<wf id="w4" offset="14" length="2" sent="1" para="1">an</wf>
<wf id="w5" offset="17" length="5" sent="1" para="1">actor</wf>
</text>
<terms>
<term id="t1" type="open" lemma="Tom" pos="N" morphofeat="NNP"/>
<term id="t2" type="open" lemma="Cruise" pos="N" morphofeat="NNP"/>
<term id="t3" type="open" lemma="be" pos="V" morphofeat="VBZ"/>
<term id="t4" type="open" lemma="an" pos="R" morphofeat="DT"/>
<term id="t5" type="open" lemma="actor" pos="N" morphofeat="NN"/>
</terms>
<entities>
<entity id="e3" type="PERSON">
<references>
<span>
<target id="t1" />
<target id="t2" />
</span>
</references>
</entity>
</entities>
</NAF>
"""
naf = etree.fromstring(naf_string)
print(type(naf))
etree.dump(naf, pretty_print=True)
Please note that the structure is as follows:
The NAF element is the parent of the elements text, terms, and entities.
The wf elements are children of the text element. They provide information about the position of words in the text, e.g. that tom is the first word in the text (id="w1") and occurs in the first sentence (sent="1").
The term elements are children of the terms element. They provide information about lemmatization and part of speech.
The entity element is a child of the entities element. We learn from the entity element that the terms t1 and t2 (i.e. Tom Cruise) form an entity of type PERSON.
One way of accessing the first target element is by going one level at a time:
In [ ]:
entities_el = naf.find('entities')
entity_el = entities_el.find('entity')
references_el = entity_el.find('references')
span_el = references_el.find('span')
target_el = span_el.find('target')
etree.dump(target_el, pretty_print=True)
Is there a better way? Yes: find() also accepts a path of tags separated by slashes, which takes us to our target element in one step:
In [ ]:
target_el = naf.find('entities/entity/references/span/target')
etree.dump(target_el, pretty_print=True)
You can also use findall()
to find all target
elements:
In [ ]:
for target_el in naf.findall('entities/entity/references/span/target'):
    etree.dump(target_el, pretty_print=True)
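The target elements only contain ids, so to find out which words an entity covers we still have to look up the matching term elements. One possible way to do this is sketched below, using a shortened copy of the NAF example (without the text layer). The helper dictionary lemma_by_id is our own name, not part of NAF or lxml, and the import falls back to the built-in xml.etree.ElementTree if lxml is not installed.

```python
try:
    from lxml import etree  # the library used in this chapter
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# Shortened copy of the NAF example
naf = etree.fromstring("""
<NAF version="v3">
<terms>
<term id="t1" lemma="Tom" pos="N"/>
<term id="t2" lemma="Cruise" pos="N"/>
<term id="t3" lemma="be" pos="V"/>
</terms>
<entities>
<entity id="e3" type="PERSON">
<references>
<span>
<target id="t1"/>
<target id="t2"/>
</span>
</references>
</entity>
</entities>
</NAF>
""")

# Map each term id to its lemma, e.g. 't1' -> 'Tom'
lemma_by_id = {term.get('id'): term.get('lemma') for term in naf.findall('terms/term')}

# For each entity, collect the lemmas of the terms it points to
for entity in naf.findall('entities/entity'):
    target_ids = [target.get('id') for target in entity.findall('references/span/target')]
    lemmas = [lemma_by_id[target_id] for target_id in target_ids]
    print(entity.get('type'), ' '.join(lemmas))
```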
Please note that this section is optional, meaning that you don't need to understand this section in order to complete the assignment.
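There are three main steps: creating an XML object with a root element, adding child elements to it, and writing the XML to a file.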
You create a new XML object by:
creating the root element, using etree.Element
wrapping that root element in a tree, using etree.ElementTree
You do not have to fully understand how this works. Please make sure you can reuse this code snippet when you create your own XML.
In [ ]:
our_root = etree.Element('Course')
our_tree = etree.ElementTree(our_root)
We can inspect what we have created by using the etree.dump()
method. As you can see, we only have the root node Course
currently in our document.
In [ ]:
etree.dump(our_root, pretty_print=True)
As you see, we created an XML object, containing only the root element Course.
In [ ]:
# Define tag, attributes and text of the new element
tag = 'person' # what the start and end tag will be
attributes = {'role': 'student'} # dictionary of attributes, can be more than one
name_student = 'Lee' # the text of the elements
# Create new Element
new_person_element = etree.Element(tag, attrib=attributes)
new_person_element.text = name_student
# Add to root
our_root.append(new_person_element)
# Inspect the current XML
etree.dump(our_root, pretty_print=True)
However, this is so common that there is a shorter and much more efficient way to do this: by using etree.SubElement()
. It accepts the same arguments as the etree.Element()
method, but additionally requires the parent as first argument:
In [ ]:
# Define tag, attributes and text of the new element
tag = 'person'
attributes = {'role': 'student'}
name_student = 'Pitt'
# Add to root
another_person_element = etree.SubElement(our_root, tag, attrib=attributes) # parent is our_root
another_person_element.text = name_student
# Inspect the current XML
etree.dump(our_root, pretty_print=True)
As we have seen before, XML can have multiple nested layers. Creating these works the same way as adding child elements to the root, but now we specify one of the other elements as the parent (in this case, new_person_element
).
In [ ]:
# Define tag, attributes and text of the new element
tag = 'pet'
attributes = {'role': 'joy'}
name_pet = 'Romeo'
# Add to new_person_element
new_pet_element = etree.SubElement(new_person_element, tag, attrib=attributes) # parent is new_person_element
new_pet_element.text = name_pet
# Inspect the current XML
etree.dump(our_root, pretty_print=True)
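etree.dump() only prints the XML. If we want the XML as a value that we can store in a variable, we can use etree.tostring() instead. The sketch below rebuilds a small version of our document and converts it both to bytes (the default) and to a regular string. The import falls back to the built-in xml.etree.ElementTree if lxml is not installed.

```python
try:
    from lxml import etree  # the library used in this chapter
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# Build a small document: a root with one child
root = etree.Element('Course')
person = etree.SubElement(root, 'person', attrib={'role': 'student'})
person.text = 'Lee'

as_bytes = etree.tostring(root)                     # bytes by default
as_text = etree.tostring(root, encoding='unicode')  # a regular Python string

print(type(as_bytes), type(as_text))
print(as_text)
```

With lxml, you can additionally pass pretty_print=True to etree.tostring() to get indented output.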
In [ ]:
with open('../Data/xml_data/selfmade.xml', 'wb') as outfile:
    our_tree.write(outfile,
                   pretty_print=True,
                   xml_declaration=True,
                   encoding='utf-8')
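A simple way to check that writing worked is to parse the file again and inspect the result. The sketch below does this in a temporary folder, so that it does not touch the course data; the file name and the small document are only examples. We leave out pretty_print here because only lxml supports it, and the import falls back to the built-in xml.etree.ElementTree if lxml is not installed.

```python
import os
import tempfile

try:
    from lxml import etree  # the library used in this chapter
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# Build a small document: a root with one child
root = etree.Element('Course')
student = etree.SubElement(root, 'person', attrib={'role': 'student'})
student.text = 'Lee'
tree = etree.ElementTree(root)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'selfmade.xml')

    # Write the tree to a file (opened in binary mode, as above)
    with open(path, 'wb') as outfile:
        tree.write(outfile, xml_declaration=True, encoding='utf-8')

    # Read the file back in and take the root element
    reloaded_root = etree.parse(path).getroot()

reloaded_person = reloaded_root.find('person')
print(reloaded_root.tag, reloaded_person.get('role'), reloaded_person.text)
```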
In [ ]:
xml_string = """
<Course>
<person role="coordinator">Van der Vliet</person>
<person role="instructor">Van Miltenburg</person>
<person role="instructor">Van Son</person>
<person role="instructor">Marten Postma</person>
<person role="student">Baloche</person>
<person role="student">De Boer</person>
<animal role="student">Rubber duck</animal>
<person role="student">Van Doorn</person>
<person role="student">De Jager</person>
<person role="student">King</person>
<person role="student">Kingham</person>
<person role="student">Mózes</person>
<person role="student">Rübsaam</person>
<person role="student">Torsi</person>
<person role="student">Witteman</person>
<person role="student">Wouterse</person>
<person/>
</Course>
"""
tree = etree.fromstring(xml_string)
print(type(tree))
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In the folder ../Data/xml_data
there is an XML file called framenet.xml
, which is a simplified version of the data provided by the FrameNet project.
FrameNet is a lexical database describing semantic frames, which are representations of events or situations and the participants in them. For example, cooking typically involves a person doing the cooking (Cook
), the food that is to be cooked (Food
), something to hold the food while cooking (Container
) and a source of heat (Heating_instrument
). In FrameNet, this is represented as a frame called Apply_heat
. The Cook
, Food
, Heating_instrument
and Container
are called frame elements (FEs). Words that evoke this frame, such as fry, bake, boil, and broil, are called lexical units (LUs) of the Apply_heat
frame. FrameNet also contains relations between frames. For example, Apply_heat
has relations with the Absorb_heat
, Cooking_creation
and Intentionally_affect
frames. In FrameNet, frame descriptions are stored in XML format.
framenet.xml
contains the information about the frame Waking_up
. Parse the XML file and print the following:
Waking_up
(e.g. Event
with the Inherits from
relation)
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]: