XML

In Core Python, we discussed text files. In this chapter, we will discuss XML.

What is XML

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It is defined by the W3C's XML 1.0 Specification and several other related specifications, all of them free open standards. XML is also plain text, so it can be viewed and edited in any text editor.

Design Goal

The design of XML emphasizes

- simplicity,
- generality, and 
- usability.

Although the design of XML focused on documents, it is widely used for the representation of data structures used in web services and configurations of desktop applications.

Two of the most common document file formats, "Office Open XML" and "OpenDocument", are based on XML.

XML Examples

<?xml version="1.0"?>
<books>
    <book title="Ṛg-Veda Khilāni">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2008</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/rvkh/rvkh.htm</web_page>
    </book>    
    <book title="Ṛgveda-Saṃhitā">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2000</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/mt/rv.htm</web_page>
    </book>
</books>

Detailed explanation of XML components

XML documents can be visualized as a tree of nodes. A node can have zero or more child nodes, but every node except the topmost one has exactly one parent node. In our example, both book nodes have the same parent node, books, and each book node has several child nodes describing the book.

The editor, publication, year and web_page nodes have a book node as their parent. Similarly, the book nodes have books as their single parent node.

Also note that only one node, books, is present at the top of the document.

Each node can have attributes. For example, title is an attribute of the book node:

<book title="Ṛg-Veda Khilāni">
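The tree structure described above can be made visible with a short recursive walk. This is only a sketch: the helper name `describe` and the trimmed-down `BOOKS_XML` string are assumptions made for illustration, not part of the original notebook.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the example document (web_page nodes left out).
BOOKS_XML = """<books>
    <book title="Ṛg-Veda Khilāni">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2008</year>
    </book>
    <book title="Ṛgveda-Saṃhitā">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2000</year>
    </book>
</books>"""


def describe(node, depth=0):
    """Return one indented line per node: tag, attributes and child count."""
    line = "{}{} {} ({} children)".format("  " * depth, node.tag, node.attrib, len(node))
    lines = [line]
    for child in node:  # iterating an element yields its direct children
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(BOOKS_XML)
outline = describe(root)
print("\n".join(outline))
```

Each level of indentation in the printed outline corresponds to one step down the tree, so the single books node sits at the top and the editor, publication and year nodes sit two levels below it.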

XML support in Python

Python has rich support for XML, with multiple libraries for parsing XML documents. Let's discuss them in detail. The following submodules are supported natively by Python:

- xml.etree.ElementTree: the ElementTree API, a simple and lightweight XML processor
- xml.dom: the DOM API definition
- xml.dom.minidom: a minimal DOM implementation
- xml.dom.pulldom: support for building partial DOM trees
- xml.sax: SAX2 base classes and convenience functions
- xml.parsers.expat: the Expat parser binding
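To get a feel for how these APIs differ, here is a small sketch that reads the same values once with ElementTree and once with minidom. The one-book document is a trimmed-down assumption based on the example above, and the variable names are chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

doc_text = '<books><book title="Ṛg-Veda Khilāni"><year>2008</year></book></books>'

# ElementTree: elements behave like lists of children with a dict of attributes.
root = ET.fromstring(doc_text)
et_year = root.find('book/year').text
et_title = root.find('book').get('title')

# minidom: a W3C-style DOM in which text lives in separate text nodes.
dom = minidom.parseString(doc_text)
year_node = dom.getElementsByTagName('year')[0]
dom_year = year_node.firstChild.data
dom_title = dom.getElementsByTagName('book')[0].getAttribute('title')

print(et_title, et_year)
print(dom_title, dom_year)
```

Both approaches return the same data. The rest of this chapter uses ElementTree.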

xml.etree.ElementTree

We can import ElementTree under the short name ET using the following command



In [14]:

    
import xml.etree.ElementTree as ET

ElementTree can parse XML either from a file or from a string. The file examples in this chapter use the following paths,



In [53]:

    
old_books = 'code/data/old_books.xml'
nasa_data = 'code/data/nasa.xml'

Opening xml file



Opening an XML file is actually quite simple: you open it and you parse it. Who would have guessed?



In [108]:

    
tree = ET.parse(old_books)
root = tree.getroot()
print(tree)









    



<xml.etree.ElementTree.ElementTree object at 0x7f94ab7c4ba8>

Alternatively, we can parse XML from a string using the following code



In [18]:

    
xml_book = """<?xml version="1.0"?>
<books>
    <book title="Ṛg-Veda Khilāni">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2008</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/rvkh/rvkh.htm</web_page>
    </book>    
    <book title="Ṛgveda-Saṃhitā">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2000</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/mt/rv.htm</web_page>
    </book>
</books>
"""
root = ET.fromstring(xml_book)

As an Element, root also has a tag, and the following code can be used to find it



In [23]:

    
print(root.tag)









    



books

We can use len to find the number of direct child nodes. Our example has two book nodes, so the following prints 2,



In [60]:

    
print(len(root))

Reading root as binary text



In [68]:

    
print(ET.tostring(root))









    



b'<books>\n    <book title="&#7770;g-Veda Khil&#257;ni">\n        <editor>Jost Gippert</editor>\n        <publication>Frankfurt: TITUS</publication>\n        <year>2008</year>\n        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/rvkh/rvkh.htm</web_page>\n    </book>    \n    <book title="&#7770;gveda-Sa&#7747;hit&#257;">\n        <editor>Jost Gippert</editor>\n        <publication>Frankfurt: TITUS</publication>\n        <year>2000</year>\n        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/mt/rv.htm</web_page>\n    </book>\t\n</books>'

Reading element as formatted text



In [71]:

    
dec_root = ET.tostring(root).decode()
print(dec_root)
print(type(dec_root))









    



<books>
    <book title="&#7770;g-Veda Khil&#257;ni">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2008</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/rvkh/rvkh.htm</web_page>
    </book>    
    <book title="&#7770;gveda-Sa&#7747;hit&#257;">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2000</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/mt/rv.htm</web_page>
    </book>	
</books>
<class 'str'>
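The output above still shows the non-ASCII letters as numeric character references (for example `&#7770;`), because `ET.tostring` encodes to ASCII by default. As a sketch, passing `encoding='unicode'` returns a `str` directly and keeps the original characters. The one-book document here is a shortened stand-in for the example file.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<books><book title="Ṛg-Veda Khilāni" /></books>')

# Default: ASCII bytes, non-ASCII characters become &#NNNN; references.
as_bytes = ET.tostring(root)

# encoding='unicode' returns a str, so no .decode() call is needed.
as_text = ET.tostring(root, encoding='unicode')

# encoding='utf-8' returns bytes again, this time with the real characters.
as_utf8 = ET.tostring(root, encoding='utf-8')

print(as_bytes)
print(as_text)
```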

All attributes available to an element



In [66]:

    
print(dir(root))









    



['__class__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'extend', 'find', 'findall', 'findtext', 'get', 'getchildren', 'getiterator', 'insert', 'items', 'iter', 'iterfind', 'itertext', 'keys', 'makeelement', 'remove', 'set']



We can use a `for` loop to traverse the direct descendant nodes.



In [39]:

    
for ele in root:
    print(ele)









    



<Element 'book' at 0x7f94c0349db8>
<Element 'book' at 0x7f94c00ba638>

As shown above, the for loop gives us element nodes. Let's get more information from them by enhancing the existing code.



In [38]:

    
for ele in root:
    print(ele.tag, ele.attrib)









    



book {'title': 'Ṛg-Veda Khilāni'}
book {'title': 'Ṛgveda-Saṃhitā'}

We can also access the nodes using indexes.



In [34]:

    
print(root[1])









    



<Element 'book' at 0x7f94c00ba638>

The attrib property is a dictionary, so when more than one attribute is present, individual attributes can be accessed by key.



In [35]:

    
print(root[1].attrib['title'])









    



Ṛgveda-Saṃhitā



In [37]:

    
print(root[0][1].text)









    



Frankfurt: TITUS
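Iteration, attribute access and `.text` can be combined to turn each book node into a plain dictionary. The following is a minimal sketch under the assumption that every child of a book holds simple text. The shortened `BOOKS_XML` string and the name `records` are illustrative only.

```python
import xml.etree.ElementTree as ET

BOOKS_XML = """<books>
    <book title="Ṛg-Veda Khilāni">
        <editor>Jost Gippert</editor>
        <year>2008</year>
    </book>
    <book title="Ṛgveda-Saṃhitā">
        <editor>Jost Gippert</editor>
        <year>2000</year>
    </book>
</books>"""

root = ET.fromstring(BOOKS_XML)

records = []
for book in root:                    # direct children of <books>
    record = {'title': book.attrib['title']}
    for field in book:               # direct children of each <book>
        record[field.tag] = field.text
    records.append(record)

for record in records:
    print(record)
```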

Reading Large XML file using `iterparse`



In [78]:

    
for event, elem in ET.iterparse(old_books):
    print(event, elem)









    



end <Element 'editor' at 0x7f9493b41db8>
end <Element 'publication' at 0x7f9493b41d68>
end <Element 'year' at 0x7f9493b41d18>
end <Element 'web_page' at 0x7f9493b41cc8>
end <Element 'book' at 0x7f9493b41e08>
end <Element 'editor' at 0x7f9493b41c28>
end <Element 'publication' at 0x7f9493b41bd8>
end <Element 'year' at 0x7f9493b41b88>
end <Element 'web_page' at 0x7f9493b41b38>
end <Element 'book' at 0x7f9493b41a48>
end <Element 'books' at 0x7f9493b41e58>
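By default `iterparse` reports only `end` events, as the output above shows. For large files it is common to process each element when it ends and then clear it so that memory does not grow. The sketch below assumes a shortened in-memory document (wrapped in `io.BytesIO` as a stand-in for a real file) and also asks for `start` events to show the `events` parameter.

```python
import io
import xml.etree.ElementTree as ET

BOOKS_XML = """<books>
    <book title="Ṛg-Veda Khilāni">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2008</year>
    </book>
    <book title="Ṛgveda-Saṃhitā">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2000</year>
    </book>
</books>"""

source = io.BytesIO(BOOKS_XML.encode('utf-8'))  # stand-in for a large file

titles = []
start_count = 0
for event, elem in ET.iterparse(source, events=('start', 'end')):
    if event == 'start':
        start_count += 1             # the element may still be incomplete here
    elif elem.tag == 'book':         # 'end': the book and its children are complete
        titles.append(elem.get('title'))
        elem.clear()                 # drop children, text and attributes

print(start_count, titles)
```

Clearing each book after use means that at most one fully populated book is kept in memory at a time.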



In [79]:

    
# ########################## NOTE ##########################
# Please run the commented code on command prompt to appreciate
# its power, the working code has been saved as `read_nasa.py`  
# in code folder

# file_name = 'data/nasa.xml'
# for event, elem in ET.iterparse(file_name):
#     print(event, elem)

!!!TODO!!! : Reading Large XML file using `XMLPullParser`



In [50]:

    
tree = ET.parse(old_books)
root = tree.getroot()
parser = ET.XMLPullParser(['start', 'end'])
print(parser)









    



<xml.etree.ElementTree.XMLPullParser object at 0x7f9493a908d0>
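The parser object created above does nothing until data is fed to it. As a hedged sketch of how `XMLPullParser` is typically driven, the code below feeds a document in arbitrary chunks (a made-up split of a one-book document, standing in for data arriving from a socket or a large file) and collects the events that become available.

```python
import xml.etree.ElementTree as ET

# Hypothetical chunks; in practice these would arrive piece by piece.
chunks = [
    '<books><book title="Ṛg-Veda',
    ' Khilāni"><year>20',
    '08</year></book>',
    '</books>',
]

parser = ET.XMLPullParser(['start', 'end'])
events = []

for chunk in chunks:
    parser.feed(chunk)
    # read_events() yields whatever events are ready so far.
    for event, elem in parser.read_events():
        events.append((event, elem.tag))

parser.close()
# Some parser versions hold events back until the stream is closed,
# so drain the queue once more.
for event, elem in parser.read_events():
    events.append((event, elem.tag))

print(events)
```

Unlike `iterparse`, the pull parser never blocks on reading: the caller decides when to feed data and when to look at events.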

------------------ END

Finding interesting elements

Say we are only interested in part of the whole XML document. In this section we will discuss techniques that help in this situation.

Using iter



In [82]:

    
for editor in root.iter('editor'):
    print(editor)
    print(editor.text)









    



<Element 'editor' at 0x7f9493a91728>
Jost Gippert
<Element 'editor' at 0x7f9493a91ef8>
Jost Gippert

As you can see, we were able to select the editor tags directly.

Using findall

findall finds only elements with the given tag that are direct children of the current element.



In [92]:

    
for book in root.findall('book'):
    print(book)
    print(book.tag)









    



<Element 'book' at 0x7f9493a91688>
book
<Element 'book' at 0x7f9493a91ea8>
book



In [95]:

    
print(root.findall('editor'))
for editor in root.findall('editor'):
    print(editor)
    print(editor.tag)

[]

As you can see, editor is not a direct child of the current element root, so we get an empty list.
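`findall` can still reach the editor nodes if it is given a path instead of a bare tag. The sketch below uses a shortened copy of the example document and shows two path forms that ElementTree supports.

```python
import xml.etree.ElementTree as ET

BOOKS_XML = """<books>
    <book title="Ṛg-Veda Khilāni">
        <editor>Jost Gippert</editor>
    </book>
    <book title="Ṛgveda-Saṃhitā">
        <editor>Jost Gippert</editor>
    </book>
</books>"""

root = ET.fromstring(BOOKS_XML)

direct = root.findall('editor')         # direct children only
via_path = root.findall('book/editor')  # editor children of book children
anywhere = root.findall('.//editor')    # editor nodes at any depth

print(len(direct), len(via_path), len(anywhere))
```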

Using find

find returns the first direct child with a particular tag, or None if there is no such child.



In [97]:

    
print(root.find('book'))









    



<Element 'book' at 0x7f9493a91688>



In [98]:

    
print(root.find('editor'))









    



None
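Because `find` returns `None` when nothing matches, code that goes straight to `.text` can fail with an `AttributeError`. Here is a small sketch of two ways to guard against that, using a shortened one-book document. The fallback string is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<books><book title="Ṛg-Veda Khilāni"><editor>Jost Gippert</editor></book></books>'
)

# Option 1: compare with None explicitly before using the element.
editor = root.find('editor')
if editor is not None:
    direct_editor = editor.text
else:
    direct_editor = '(no direct editor)'

# Option 2: findtext returns the text of the first match, or a default.
same_thing = root.findtext('editor', default='(no direct editor)')
nested_editor = root.findtext('book/editor')

print(direct_editor, same_thing, nested_editor)
```

The explicit `is not None` test is used rather than `if editor:` because an element without children is treated as false.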

Accessing Element Attributes



In [117]:

    
ele = root.find('book')
ele.get('title')









    Out[117]:





'Ṛg-Veda Khilāni'
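Besides `get`, an element offers a few more dictionary-like helpers for its attributes. The sketch below runs them on a single book element. The `isbn` and `lang` attribute names are hypothetical and are not part of the example document.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring('<book title="Ṛg-Veda Khilāni" />')

title = book.get('title')                  # value of an existing attribute
missing = book.get('isbn')                 # None: the attribute is absent
with_default = book.get('isbn', 'unknown') # fallback instead of None

names = book.keys()                        # attribute names
pairs = book.items()                       # (name, value) pairs

book.set('lang', 'sa')                     # add or overwrite an attribute

print(title, missing, with_default)
print(sorted(book.attrib))
```

Unlike `book.attrib['isbn']`, which raises a `KeyError` for a missing attribute, `get` returns `None` or the given default.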

Building XML documents

We can build an XML document using the Element and SubElement functions of ElementTree



In [135]:

    
a = ET.Element('a')
b = ET.SubElement(a, 'b')
b.attrib["B"] = "TEST"
c = ET.SubElement(a, 'c')
d = ET.SubElement(a, 'd')
e = ET.SubElement(d, 'e')
f = ET.SubElement(e, 'f')

ET.dump(a)
print(ET.tostring(a).decode())









    



<a><b B="TEST" /><c /><d><e><f /></e></d></a>
<a><b B="TEST" /><c /><d><e><f /></e></d></a>
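The same two functions are enough to rebuild part of the books document from earlier. This is a sketch rather than the notebook's own code: attributes are passed as keyword arguments to `SubElement`, and element text is assigned through `.text`.

```python
import xml.etree.ElementTree as ET

books = ET.Element('books')

# Keyword arguments to SubElement become attributes of the new element.
book = ET.SubElement(books, 'book', title='Ṛg-Veda Khilāni')

# SubElement returns the new child, so its text can be set right away.
ET.SubElement(book, 'editor').text = 'Jost Gippert'
ET.SubElement(book, 'year').text = '2008'

xml_text = ET.tostring(books, encoding='unicode')
print(xml_text)
```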

Parsing XML with Namespaces

<?xml version="1.0"?>
<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
    <actor>
        <name>Eric Idle</name>
        <fictional:character>Sir Robin</fictional:character>
        <fictional:character>Gunther</fictional:character>
        <fictional:character>Commander Clement</fictional:character>
    </actor>
</actors>



In [118]:

    
xml_text = """<?xml version="1.0"?>
<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
    <actor>
        <name>Eric Idle</name>
        <fictional:character>Sir Robin</fictional:character>
        <fictional:character>Gunther</fictional:character>
        <fictional:character>Commander Clement</fictional:character>
    </actor>
</actors>"""



In [125]:

    
root = ET.fromstring(xml_text)
for actor in root.findall('{http://people.example.com}actor'):
    name = actor.find('{http://people.example.com}name')
    print(name.text)
    for char in actor.findall('{http://characters.example.com}character'):
        print('   |->', char.text)









    



John Cleese
   |-> Lancelot
   |-> Archie Leach
Eric Idle
   |-> Sir Robin
   |-> Gunther
   |-> Commander Clement
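Writing the full namespace URI in braces for every lookup quickly becomes noisy. `find` and `findall` also accept a dictionary that maps prefixes to URIs, as sketched below on a shortened copy of the actors document. The prefix names `people` and `fictional` in the dictionary are a local choice and do not have to match the prefixes used in the XML itself.

```python
import xml.etree.ElementTree as ET

xml_text = """<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
    <actor>
        <name>Eric Idle</name>
        <fictional:character>Sir Robin</fictional:character>
    </actor>
</actors>"""

# Prefix -> namespace URI, used only by the search calls below.
ns = {
    'people': 'http://people.example.com',
    'fictional': 'http://characters.example.com',
}

root = ET.fromstring(xml_text)
cast = {}
for actor in root.findall('people:actor', ns):
    name = actor.find('people:name', ns).text
    cast[name] = [c.text for c in actor.findall('fictional:character', ns)]

print(cast)
```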

XPath support



ElementTree provides limited support for XPath expressions for locating elements in a tree. The supported syntax is summarized below.

Syntax	Meaning
tag	Selects all child elements with the given tag. For example, spam selects all child elements named spam, and spam/egg selects all grandchildren named egg in all children named spam.
*	Selects all child elements. For example, */egg selects all grandchildren named egg.
.	Selects the current node. This is mostly useful at the beginning of the path, to indicate that it’s a relative path.
//	Selects all subelements, on all levels beneath the current element. For example, .//egg selects all egg elements in the entire tree.
..	Selects the parent element. Returns None if the path attempts to reach the ancestors of the start element (the element find was called on).
[@attrib]	Selects all elements that have the given attribute.
[@attrib='value']	Selects all elements for which the given attribute has the given value. The value cannot contain quotes.
[tag]	Selects all elements that have a child named tag. Only immediate children are supported.
[tag='text']	Selects all elements that have a child named tag whose complete text content, including descendants, equals the given text.
[position]	Selects all elements that are located at the given position. The position can be either an integer (1 is the first position), the expression last() (for the last position), or a position relative to the last position (e.g. last()-1).
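A few of the expressions from the table, applied to a shortened copy of the books document, might look like the sketch below. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

BOOKS_XML = """<books>
    <book title="Ṛg-Veda Khilāni">
        <editor>Jost Gippert</editor>
        <year>2008</year>
    </book>
    <book title="Ṛgveda-Saṃhitā">
        <editor>Jost Gippert</editor>
        <year>2000</year>
    </book>
</books>"""

root = ET.fromstring(BOOKS_XML)

# [@attrib='value']: books with a given title attribute
by_attr = root.findall(".//book[@title='Ṛgveda-Saṃhitā']")

# [tag='text']: books whose year child has the text 2008
by_child = root.findall("book[year='2008']")

# [position]: the last book child of the root
last_book = root.find("book[last()]")

# ..: the parents of all year elements
parents = root.findall(".//year/..")

print(by_attr[0].findtext('year'))
print(by_child[0].get('title'))
print(last_book.get('title'))
print([p.tag for p in parents])
```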




Modifying an XML File

The ElementTree.write() method can be used to save the updated document to a specified file.



In [115]:

    
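# root was rebound to the actors document above, so fetch the books root again
root = tree.getroot()
ele = root.find('book')
ele.attrib['title'] = "Rig-Veda Khilāni"
updated_xml = 'code/data/updated_old_book.xml'
tree.write(updated_xml)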



In [116]:

    
with open(updated_xml) as f:
    print(f.read())









    



<books>
    <book title="Rig-Veda Khil&#257;ni">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2008</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/rvkh/rvkh.htm</web_page>
    </book>    
    <book title="&#7770;gveda-Sa&#7747;hit&#257;">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2000</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/mt/rv.htm</web_page>
    </book>	
</books>
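The saved file above has no XML declaration and stores the non-ASCII letters as character references, because `write` defaults to ASCII output. As a sketch, passing `encoding` and `xml_declaration` changes both. An in-memory `io.BytesIO` buffer stands in for a real file here, and the one-book document is a shortened assumption.

```python
import io
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<books><book title="Ṛg-Veda Khilāni"><year>2008</year></book></books>'
)
tree = ET.ElementTree(root)

# Modify the tree in place, then serialize it.
root.find('book').set('title', 'Rig-Veda Khilāni')

buffer = io.BytesIO()  # stand-in for open(path, 'wb')
tree.write(buffer, encoding='utf-8', xml_declaration=True)
written = buffer.getvalue()

print(written.decode('utf-8'))
```

With a real path instead of the buffer, the call would be `tree.write(path, encoding='utf-8', xml_declaration=True)`.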

XML vulnerabilities

The XML processing modules are not secure against maliciously constructed data. An attacker can abuse XML features to carry out denial of service attacks, access local files, generate network connections to other machines, or circumvent firewalls.

The following table gives an overview of the known attacks and whether the various modules are vulnerable to them.

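kind	sax	etree	minidom	pulldom	xmlrpc
billion laughs	Vulnerable	Vulnerable	Vulnerable	Vulnerable	Vulnerable
quadratic blowup	Vulnerable	Vulnerable	Vulnerable	Vulnerable	Vulnerable
external entity expansion	Vulnerable	Safe (1)	Safe (2)	Vulnerable	Safe (3)
DTD retrieval	Vulnerable	Safe	Safe	Vulnerable	Safe
decompression bomb	Safe	Safe	Safe	Safe	Vulnerable

(1) xml.etree.ElementTree doesn't expand external entities and raises a ParseError when an entity occurs.
(2) xml.dom.minidom doesn't expand external entities and simply returns the unexpanded entity verbatim.
(3) xmlrpc doesn't expand external entities and omits them.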
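Several of these attacks rely on entity definitions inside a DTD. One possible mitigation, sketched below, is to refuse any document that declares a DOCTYPE before handing it to ElementTree. The names `DTDForbidden` and `parse_without_dtd` are made up for this example, and the approach is a simplified illustration rather than a complete defence. The Python documentation recommends the third-party defusedxml package for parsing untrusted data.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class DTDForbidden(ValueError):
    """Raised when a document contains a DOCTYPE declaration."""


def _reject_doctype(name, system_id, public_id, has_internal_subset):
    # Called by expat as soon as it sees the start of a DOCTYPE.
    raise DTDForbidden("document declares a DTD: %r" % name)


def parse_without_dtd(data):
    """Parse data with ElementTree, but only if it has no DOCTYPE."""
    checker = expat.ParserCreate()
    checker.StartDoctypeDeclHandler = _reject_doctype
    checker.Parse(data, True)      # first pass: only checks for a DOCTYPE
    return ET.fromstring(data)     # second pass: build the tree


safe_root = parse_without_dtd('<books><book /></books>')

hostile = '<!DOCTYPE books [<!ENTITY a "aaaa">]><books>&a;</books>'
try:
    parse_without_dtd(hostile)
    rejected = False
except DTDForbidden:
    rejected = True

print(safe_root.tag, rejected)
```

The document is parsed twice, which costs time. The benefit is that the check runs before any entity defined in the DTD can be expanded.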

Common Errors and causes



In [16]:

    
xml_book = """
<?xml version="1.0"?>
<books>
    <book title="Ṛg-Veda Khilāni">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2008</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/rvkh/rvkh.htm</web_page>
    </book>    
    <book title="Ṛgveda-Saṃhitā">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2000</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/mt/rv.htm</web_page>
    </book>
</books>
"""
root = ET.fromstring(xml_book)









    



Traceback (most recent call last):

  File "/home/mayank/.local/lib64/python3.4/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-16-dc6bb8d739a9>", line 18, in <module>
    root = ET.fromstring(xml_book)

  File "/usr/lib64/python3.4/xml/etree/ElementTree.py", line 1335, in XML
    parser.feed(text)

  File "<string>", line unknown
ParseError: XML or text declaration not at start of entity: line 2, column 0

This error happens because of the blank first line: the XML declaration must be at the very start of the document. To avoid this error, remove the whitespace from the start of the string, as shown below



In [17]:

    
xml_book = """<?xml version="1.0"?>
<books>
    <book title="Ṛg-Veda Khilāni">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2008</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/rvkh/rvkh.htm</web_page>
    </book>    
    <book title="Ṛgveda-Saṃhitā">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2000</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/mt/rv.htm</web_page>
    </book>
</books>
"""
root = ET.fromstring(xml_book)
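When the string comes from somewhere else and cannot be edited by hand, the leading whitespace can be stripped in code, and `ET.ParseError` can be caught to report other problems. The helper name `parse_lenient` below is hypothetical, and the document is a shortened version of the example.

```python
import xml.etree.ElementTree as ET

raw = """
<?xml version="1.0"?>
<books>
    <book title="Ṛg-Veda Khilāni" />
</books>
"""


def parse_lenient(text):
    """Strip surrounding whitespace so the declaration comes first."""
    return ET.fromstring(text.strip())


# Parsing the raw string fails because of the leading newline.
error_message = None
try:
    ET.fromstring(raw)
except ET.ParseError as err:
    error_message = str(err)

root = parse_lenient(raw)
print(error_message)
print(root.tag, len(root))
```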
