In Core Python, we discussed text files. In this chapter, we will discuss XML.
Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. XML is defined by the W3C's XML 1.0 Specification and several related specifications, all of them free open standards. XML is also plain text, so it can be viewed and edited in any text editor.
XML's design emphasizes
- simplicity,
- generality, and
- usability.
Although the design of XML focused on documents, it is widely used for the representation of data structures used in web services and configurations of desktop applications.
Two of the most common document file formats, "Office Open XML" and "OpenDocument", are based on XML.
<?xml version="1.0"?>
<books>
<book title="Ṛg-Veda Khilāni">
<editor>Jost Gippert</editor>
<publication>Frankfurt: TITUS</publication>
<year>2008</year>
<web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/rvkh/rvkh.htm</web_page>
</book>
<book title="Ṛgveda-Saṃhitā">
<editor>Jost Gippert</editor>
<publication>Frankfurt: TITUS</publication>
<year>2000</year>
<web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/mt/rv.htm</web_page>
</book>
</books>
An XML document can be visualized as a tree of nodes. A node can have zero or more child nodes, but a child node always has exactly one parent node. In the example above, each `book` node has `books` as its parent, so both `book` nodes share the same parent. Each `book` node also has several child nodes that describe the book.
The `editor`, `publication`, `year` and `web_page` nodes have the same parent, `book`. Similarly, the `book` nodes have a single parent node, `books`.
Also note that only one node, `books`, is present at the top of the document.
Each node can have attributes; for example, `title` is an attribute of the `book` node.
<xml>
<book title="Ṛg-Veda Khilāni">
Python has rich support for XML, with multiple libraries for parsing XML documents. Let's discuss them in detail. The following sub-modules are supported natively by Python.
We can import `ET` using the following command:
In [14]:
import xml.etree.ElementTree as ET
`ElementTree` can either parse an XML file, using the following code,
In [53]:
old_books = 'code/data/old_books.xml'
nasa_data = 'code/data/nasa.xml'
Opening an XML file is actually quite simple: you open it and you parse it. Who would have guessed?
In [108]:
tree = ET.parse(old_books)
root = tree.getroot()
print(tree)
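As a self-contained illustration of the file-parsing step, the sketch below writes a small, made-up document to a temporary file and parses it. The file name `sample_books.xml` and its contents are assumptions for this example only, not the `old_books.xml` file used in this chapter.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration.
sample = '<?xml version="1.0"?>\n<books><book title="A"/><book title="B"/></books>'

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_books.xml")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(sample)

    tree = ET.parse(sample_path)  # returns an ElementTree object
    root = tree.getroot()         # the top-level Element of the document

print(type(tree).__name__)
print(root.tag)
```

`ET.parse` returns the whole tree, while `getroot()` gives the `Element` that the rest of the chapter works with.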
or read XML from a string, using the following code:
In [18]:
xml_book = """<?xml version="1.0"?>
<books>
<book title="Ṛg-Veda Khilāni">
<editor>Jost Gippert</editor>
<publication>Frankfurt: TITUS</publication>
<year>2008</year>
<web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/rvkh/rvkh.htm</web_page>
</book>
<book title="Ṛgveda-Saṃhitā">
<editor>Jost Gippert</editor>
<publication>Frankfurt: TITUS</publication>
<year>2000</year>
<web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/mt/rv.htm</web_page>
</book>
</books>
"""
root = ET.fromstring(xml_book)
As an `Element`, `root` also has a tag, and the following code can be used to find it:
In [23]:
print(root.tag)
We can use `len` to find the number of direct child nodes. In our example we have two `book` nodes:
In [60]:
print(len(root))
In [68]:
print(ET.tostring(root))
In [71]:
dec_root = ET.tostring(root).decode()
print(dec_root)
print(type(dec_root))
In [66]:
print(dir(root))
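The cells above show that `ET.tostring` returns bytes by default, which is why `.decode()` is needed to get a string. As a minimal sketch on a made-up one-book document, passing `encoding="unicode"` returns a `str` directly and keeps non-ASCII characters as they are:

```python
import xml.etree.ElementTree as ET

# Small illustrative document; the title contains non-ASCII characters.
root = ET.fromstring('<books><book title="Ṛgveda-Saṃhitā"/></books>')

as_bytes = ET.tostring(root)                     # default: ASCII-encoded bytes
as_text = ET.tostring(root, encoding="unicode")  # str, characters kept as-is

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_bytes)
print(as_text)
```

In the bytes form, characters outside ASCII are written as numeric character references, so both forms describe the same document.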
We can use a `for` loop to traverse the direct child nodes.
In [39]:
for ele in root:
    print(ele)
As shown above, the `for` loop gives us the element nodes. Let's get more information from them by enhancing the existing code.
In [38]:
for ele in root:
    print(ele.tag, ele.attrib)
We can also access the nodes by index.
In [34]:
print(root[1])
If more than one attribute is present, individual attributes can be accessed in the same way as dictionary items.
In [35]:
print(root[1].attrib['title'])
In [37]:
print(root[0][1].text)
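Indexing can be chained, so `root[0][1]` is the second child of the first `book`. The sketch below uses a shortened, made-up version of the books document to show index access, dictionary-style attribute access, and `get()` with a default for an attribute that may be missing (the `isbn` attribute is hypothetical).

```python
import xml.etree.ElementTree as ET

# Shortened illustrative document with two children per book.
root = ET.fromstring(
    "<books>"
    '<book title="First"><editor>E1</editor><year>2008</year></book>'
    '<book title="Second"><editor>E2</editor><year>2000</year></book>'
    "</books>"
)

second_book = root[1]                   # second direct child of <books>
print(second_book.attrib["title"])      # dictionary-style access
print(second_book.get("isbn", "n/a"))   # hypothetical attribute, default used

first_book_year = root[0][1]            # second child of the first <book>
print(first_book_year.tag, first_book_year.text)
```

Using `attrib[...]` raises `KeyError` for a missing attribute, whereas `get()` returns the default instead.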
In [78]:
for event, elem in ET.iterparse(old_books):
    print(event, elem)
In [79]:
# ########################## NOTE ##########################
# Please run the commented code on command prompt to appreciate
# its power, the working code has been saved as `read_nasa.py`
# in code folder
# file_name = 'data/nasa.xml'
# for event, elem in ET.iterparse(file_name):
# print(event, elem)
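`ET.iterparse` reports elements while the document is still being read, which is what makes it useful for large files such as the NASA data. The following is a minimal sketch on a tiny in-memory document (an assumption made so the example is self-contained); clearing each finished `book` is a common way to keep memory use low, not something the original notebook does.

```python
import io
import xml.etree.ElementTree as ET

# Tiny in-memory stand-in for a large XML file.
data = io.BytesIO(
    b"<books><book><year>2008</year></book><book><year>2000</year></book></books>"
)

years = []
for event, elem in ET.iterparse(data, events=("start", "end")):
    if event == "end" and elem.tag == "year":
        years.append(elem.text)   # text is complete once the element has ended
    elif event == "end" and elem.tag == "book":
        elem.clear()              # drop children that are no longer needed

print(years)
```

Without the `events` argument, `iterparse` reports only `"end"` events.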
In [50]:
tree = ET.parse(old_books)
root = tree.getroot()
parser = ET.XMLPullParser(['start', 'end'])
print(parser)
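`XMLPullParser` is designed for data that arrives in pieces: we push text in with `feed()` and pull the parsed events out with `read_events()`. A minimal sketch, assuming a document split into two hypothetical chunks, might look like this. How many events are available after each `feed()` can depend on the underlying parser, so the sketch simply collects them all.

```python
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(["start", "end"])

# Hypothetical chunks, as they might arrive from a network connection.
chunks = ["<books><book tit", 'le="A"></book></books>']

events = []
for chunk in chunks:
    parser.feed(chunk)
    # Pull out whatever events the parser has produced so far.
    events.extend((event, elem.tag) for event, elem in parser.read_events())

parser.close()
# Collect any events that became available only after closing the parser.
events.extend((event, elem.tag) for event, elem in parser.read_events())

for event, tag in events:
    print(event, tag)
```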
------------------ END
Say we are only interested in part of the whole XML document. In this section we will discuss techniques that help in this situation.
In [82]:
for editor in root.iter('editor'):
    print(editor)
    print(editor.text)
As you can see, we were able to select the `editor` tags directly.
`findall()` finds only elements with the given tag that are direct children of the current element.
In [92]:
for book in root.findall('book'):
    print(book)
    print(book.tag)
In [95]:
print(root.findall('editor'))
for editor in root.findall('editor'):
    print(editor)
    print(editor.tag)
As you can see, `editor` is not a direct child of the current element `root`, so we get an empty list.
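The difference between `iter()` and `findall()` can be seen side by side in the sketch below, which uses a shortened, made-up books document. It also shows two path forms that let `findall()` reach the nested `editor` elements.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<books>"
    "<book><editor>E1</editor></book>"
    "<book><editor>E2</editor></book>"
    "</books>"
)

direct = root.findall("editor")        # direct children only: nothing matches
nested = root.findall("book/editor")   # editors that are children of a book
anywhere = root.findall(".//editor")   # editors at any depth below root
via_iter = list(root.iter("editor"))   # iter() also searches the whole subtree

print(direct)
print([e.text for e in nested])
print([e.text for e in anywhere])
print([e.text for e in via_iter])
```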
`find()` returns the first child with a particular tag.
In [97]:
print(root.find('book'))
In [98]:
print(root.find('editor'))
In [117]:
ele = root.find('book')
ele.get('title')
Out[117]:
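Because `find()` returns `None` when nothing matches, it is worth checking the result before using it. The sketch below, on a made-up one-book document, shows that check together with `findtext()` and `get()`, which both accept a default value.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<books><book title="First"><year>2008</year></book></books>'
)

book = root.find("book")        # first direct child named "book"
missing = root.find("editor")   # no direct child named "editor", so None

if missing is None:
    print("root has no direct <editor> child")

title = book.get("title")                             # attribute value
year = book.findtext("year")                          # text of first <year> child
editor = book.findtext("editor", default="unknown")   # default when absent

print(title, year, editor)
```

Comparing with `is None` is the safer test here, since an element's truth value is not a reliable check for whether it was found.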
We can build an XML document using the `Element` and `SubElement` functions of `ElementTree`.
In [135]:
a = ET.Element('a')
b = ET.SubElement(a, 'b')
b.attrib["B"] = "TEST"
c = ET.SubElement(a, 'c')
d = ET.SubElement(a, 'd')
e = ET.SubElement(d, 'e')
f = ET.SubElement(e, 'f')
ET.dump(a)
print(ET.tostring(a).decode())
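The same two functions are enough to build something closer to the books document used earlier. This is a minimal sketch: attributes are passed as keyword arguments, and text is set through the `text` attribute. `ET.indent` exists only in newer Python versions (3.9 and later), so the call is guarded.

```python
import xml.etree.ElementTree as ET

books = ET.Element("books")
book = ET.SubElement(books, "book", title="Ṛgveda-Saṃhitā")  # attribute as keyword
ET.SubElement(book, "editor").text = "Jost Gippert"
ET.SubElement(book, "year").text = "2000"

# Pretty-print in place where the function is available.
if hasattr(ET, "indent"):
    ET.indent(books)

print(ET.tostring(books, encoding="unicode"))
```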
<?xml version="1.0"?>
<actors xmlns:fictional="http://characters.example.com"
xmlns="http://people.example.com">
<actor>
<name>John Cleese</name>
<fictional:character>Lancelot</fictional:character>
<fictional:character>Archie Leach</fictional:character>
</actor>
<actor>
<name>Eric Idle</name>
<fictional:character>Sir Robin</fictional:character>
<fictional:character>Gunther</fictional:character>
<fictional:character>Commander Clement</fictional:character>
</actor>
</actors>
In [118]:
xml_text = """<?xml version="1.0"?>
<actors xmlns:fictional="http://characters.example.com"
xmlns="http://people.example.com">
<actor>
<name>John Cleese</name>
<fictional:character>Lancelot</fictional:character>
<fictional:character>Archie Leach</fictional:character>
</actor>
<actor>
<name>Eric Idle</name>
<fictional:character>Sir Robin</fictional:character>
<fictional:character>Gunther</fictional:character>
<fictional:character>Commander Clement</fictional:character>
</actor>
</actors>"""
In [125]:
root = ET.fromstring(xml_text)
for actor in root.findall('{http://people.example.com}actor'):
    name = actor.find('{http://people.example.com}name')
    print(name.text)
    for char in actor.findall('{http://characters.example.com}character'):
        print(' |->', char.text)
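Writing the full namespace URI in every path quickly becomes hard to read. `find()` and `findall()` also accept a dictionary that maps prefixes of our own choosing to namespace URIs. The sketch below uses a shortened version of the actors document; the prefix names `real_person` and `role` are arbitrary choices.

```python
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0"?>
<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
    </actor>
    <actor>
        <name>Eric Idle</name>
        <fictional:character>Sir Robin</fictional:character>
    </actor>
</actors>"""

# The prefixes are local to this dictionary; only the URIs must match the document.
ns = {
    "real_person": "http://people.example.com",
    "role": "http://characters.example.com",
}

root = ET.fromstring(xml_text)
names = []
roles = []
for actor in root.findall("real_person:actor", ns):
    names.append(actor.find("real_person:name", ns).text)
    roles.extend(char.text for char in actor.findall("role:character", ns))

print(names)
print(roles)
```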
Syntax | Meaning |
---|---|
tag | Selects all child elements with the given tag. For example, spam selects all child elements named spam, and spam/egg selects all grandchildren named egg in all children named spam. |
* | Selects all child elements. For example, */egg selects all grandchildren named egg. |
. | Selects the current node. This is mostly useful at the beginning of the path, to indicate that it’s a relative path. |
// | Selects all subelements, on all levels beneath the current element. For example, .//egg selects all egg elements in the entire tree. |
.. | Selects the parent element. Returns None if the path attempts to reach the ancestors of the start element (the element find was called on). |
[@attrib] | Selects all elements that have the given attribute. |
[@attrib='value'] | Selects all elements for which the given attribute has the given value. The value cannot contain quotes. |
[tag] | Selects all elements that have a child named tag. Only immediate children are supported. |
[tag='text'] | Selects all elements that have a child named tag whose complete text content, including descendants, equals the given text. |
[position] | Selects all elements that are located at the given position. The position can be either an integer (1 is the first position), the expression last() (for the last position), or a position relative to the last position (e.g. last()-1). |
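A few rows of the table above can be tried out on a shortened, made-up books document. The sketch covers `.//`, `[@attrib='value']`, `[tag='text']` and `[position]`.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<books>"
    '<book title="First"><editor>E1</editor><year>2008</year></book>'
    '<book title="Second"><editor>E2</editor><year>2000</year></book>'
    "</books>"
)

all_years = [y.text for y in root.findall(".//year")]   # any depth
by_attribute = root.findall("book[@title='Second']")    # attribute value
by_child_text = root.findall("book[year='2008']")       # child text
second_book = root.find("book[2]")                      # positions start at 1
last_book = root.find("book[last()]")                   # last matching child

print(all_years)
print(by_attribute[0].findtext("editor"))
print(by_child_text[0].get("title"))
print(second_book.get("title"), last_book.get("title"))
```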
The `ElementTree.write()` method can be used to save the updated document to a specified file.
In [115]:
ele = root.find('book')
ele.attrib['title'] = "Rig-Veda Khilāni"
updated_xml = 'code/data/updated_old_book.xml'
tree.write(updated_xml)
In [116]:
with open(updated_xml) as f:
    print(f.read())
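By default, `write()` produces ASCII output, so non-ASCII characters such as the `ā` in the title are saved as numeric character references. A minimal sketch of writing UTF-8 with an XML declaration instead, using a temporary file (the file name `updated.xml` is an assumption for this example):

```python
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.fromstring('<books><book title="Ṛg-Veda Khilāni"/></books>')
root.find("book").set("title", "Rig-Veda Khilāni")  # same effect as attrib[...] = ...

tree = ET.ElementTree(root)  # wrap the root element so it can be written

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "updated.xml")  # hypothetical file name
    tree.write(out_path, encoding="utf-8", xml_declaration=True)

    # Read the file back with the same encoding it was written in.
    with open(out_path, encoding="utf-8") as f:
        written = f.read()

print(written)
```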
The XML processing modules are not secure against maliciously constructed data. An attacker can abuse XML features to carry out denial of service attacks, access local files, generate network connections to other machines, or circumvent firewalls.
The following table gives an overview of the known attacks and whether the various modules are vulnerable to them.
kind | sax | etree | minidom | pulldom | xmlrpc |
---|---|---|---|---|---|
billion laughs | Vulnerable | Vulnerable | Vulnerable | Vulnerable | Vulnerable |
quadratic blowup | Vulnerable | Vulnerable | Vulnerable | Vulnerable | Vulnerable |
external entity expansion | Vulnerable | Safe (1) | Safe (2) | Vulnerable | Safe (3) |
DTD retrieval | Vulnerable | Safe | Safe | Vulnerable | Safe |
decompression bomb | Safe | Safe | Safe | Safe | Vulnerable |
In [16]:
xml_book = """
<?xml version="1.0"?>
<books>
<book title="Ṛg-Veda Khilāni">
<editor>Jost Gippert</editor>
<publication>Frankfurt: TITUS</publication>
<year>2008</year>
<web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/rvkh/rvkh.htm</web_page>
</book>
<book title="Ṛgveda-Saṃhitā">
<editor>Jost Gippert</editor>
<publication>Frankfurt: TITUS</publication>
<year>2000</year>
<web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/mt/rv.htm</web_page>
</book>
</books>
"""
root = ET.fromstring(xml_book)
The code above raises a parse error because of the blank first line: the XML declaration must come at the very start of the string. To avoid this error, remove the whitespace from the start of the string, as shown below.
In [17]:
xml_book = """<?xml version="1.0"?>
<books>
<book title="Ṛg-Veda Khilāni">
<editor>Jost Gippert</editor>
<publication>Frankfurt: TITUS</publication>
<year>2008</year>
<web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/rvkh/rvkh.htm</web_page>
</book>
<book title="Ṛgveda-Saṃhitā">
<editor>Jost Gippert</editor>
<publication>Frankfurt: TITUS</publication>
<year>2000</year>
<web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/mt/rv.htm</web_page>
</book>
</books>
"""
root = ET.fromstring(xml_book)
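When the string comes from somewhere we cannot edit by hand, the same fix can be applied in code. The sketch below, on a shortened, made-up document, catches `ET.ParseError` for the string with a leading blank line and then parses a stripped copy.

```python
import xml.etree.ElementTree as ET

# The string starts with a newline, so the XML declaration is not at the start.
xml_with_blank_line = """
<?xml version="1.0"?>
<books><book title="A"/></books>
"""

try:
    ET.fromstring(xml_with_blank_line)
    failed = False
except ET.ParseError as err:
    failed = True
    print("parse failed:", err)

# Removing the surrounding whitespace puts the declaration first again.
root = ET.fromstring(xml_with_blank_line.strip())
print(root.tag)
```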