Source: Dive Into Python, Chapter 12 (XML), by Mark Pilgrim

XML overview

XML is a generalized way of describing hierarchical structured data.

An XML document contains one or more elements, which are delimited by start and end tags. Elements can be nested to any depth.

The first element in every XML document is called the root element. An XML document can have only one root element.
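
As a minimal sketch of nesting and the single-root rule, the snippet below parses two made-up documents. It assumes the standard library's xml.etree.ElementTree (imported here as ET) so that it runs without lxml; the element names are only illustrative.

```python
import xml.etree.ElementTree as ET

# One root element (feed) with children nested two levels deep.
doc = "<feed><entry><title>hello</title></entry></feed>"
root = ET.fromstring(doc)
print(root.tag)         # feed
print(root[0][0].tag)   # title

# A second top-level element makes the document ill-formed.
try:
    ET.fromstring("<feed/><feed/>")
    second_root_accepted = True
except ET.ParseError:
    second_root_accepted = False
print(second_root_accepted)   # False
```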

Elements can have attributes, which are name-value pairs. Attributes are listed within the start tag of an element and separated by whitespace. Attribute names cannot be repeated within an element. Attribute values must be quoted, with either single or double quotes.

An element’s attributes form an unordered set of keys and values, like a Python dictionary.
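
A small sketch of the attribute rules, again assuming the standard library's xml.etree.ElementTree rather than lxml. The link element is a hypothetical one written inline for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical element; one attribute uses double quotes, the others single.
link = ET.fromstring(
    "<link rel='alternate' type=\"text/html\" href='http://diveintomark.org/'/>"
)

print(link.attrib["rel"])        # alternate
print(sorted(link.attrib))       # ['href', 'rel', 'type']
print(link.get("title", "n/a"))  # n/a (missing name, default returned)

# Repeating an attribute name inside one start tag is a parse error.
try:
    ET.fromstring("<link rel='alternate' rel='self'/>")
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
print(duplicate_accepted)        # False
```

The attrib mapping supports the usual dictionary operations, so lookups by name and get() with a default behave as they would on a dict.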

Elements can have text content.

Just as Python functions can be declared in different modules, XML elements can be declared in different namespaces. Namespaces usually look like URLs.

You can also use an xmlns:prefix declaration to define a namespace and associate it with a prefix. Then each element in that namespace must be explicitly declared with the prefix.
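
To see that the prefix is only shorthand, the sketch below parses the same made-up feed twice: once with a default namespace declaration (xmlns=...) and once with an xmlns:atom prefix. It assumes the standard library's xml.etree.ElementTree; the prefix name atom is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

# Default namespace: applies to the element and its unprefixed descendants.
default_ns = ET.fromstring(
    f"<feed xmlns='{ATOM}'><title>dive into mark</title></feed>"
)

# Prefixed namespace: every element has to carry the prefix explicitly.
prefixed = ET.fromstring(
    f"<atom:feed xmlns:atom='{ATOM}'>"
    "<atom:title>dive into mark</atom:title></atom:feed>"
)

print(default_ns.tag)                        # {http://www.w3.org/2005/Atom}feed
print(default_ns.tag == prefixed.tag)        # True
print(default_ns[0].tag == prefixed[0].tag)  # True
```

Both forms produce the same {namespace}localname tags once parsed, which is why the cells below search by full namespace rather than by prefix.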

XML documents can contain character encoding information on the first line, before the root element.
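
A rough illustration of why the encoding declaration matters, assuming the standard library's xml.etree.ElementTree. The one-element document is made up, and its bytes are deliberately encoded as ISO-8859-1 rather than UTF-8.

```python
import xml.etree.ElementTree as ET

# Bytes in ISO-8859-1, with the encoding declared on the first line.
declared = (
    "<?xml version='1.0' encoding='iso-8859-1'?><title>caf\u00e9</title>"
).encode("iso-8859-1")
title = ET.fromstring(declared)
print(title.text == "caf\u00e9")   # True

# The same bytes without a declaration are read as UTF-8, where the
# lone 0xE9 byte is invalid, so the parser rejects the document.
undeclared = "<title>caf\u00e9</title>".encode("iso-8859-1")
try:
    ET.fromstring(undeclared)
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False
print(undeclared_accepted)         # False
```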

Parsing XML


In [1]:
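try:
    # Prefer lxml: later cells use xpath() and pretty_print, which it provides.
    from lxml import etree
except ImportError:
    # Standard-library fallback; it has no xpath() or pretty_print support.
    import xml.etree.ElementTree as etree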

In [2]:
tree = etree.parse('feed.xml')
root = tree.getroot()
root


Out[2]:
<Element {http://www.w3.org/2005/Atom}feed at 0x4d26948>

Elements Are Lists


In [3]:
root.tag


Out[3]:
'{http://www.w3.org/2005/Atom}feed'

In [4]:
len(root)


Out[4]:
8

In [5]:
for child in root:
    print(child)


<Element {http://www.w3.org/2005/Atom}title at 0x4cacd08>
<Element {http://www.w3.org/2005/Atom}subtitle at 0x4d30688>
<Element {http://www.w3.org/2005/Atom}id at 0x4d304c8>
<Element {http://www.w3.org/2005/Atom}updated at 0x4cacd08>
<Element {http://www.w3.org/2005/Atom}link at 0x4d30688>
<Element {http://www.w3.org/2005/Atom}entry at 0x4d304c8>
<Element {http://www.w3.org/2005/Atom}entry at 0x4cacd08>
<Element {http://www.w3.org/2005/Atom}entry at 0x4d30688>

Attributes Are Dictionaries


In [6]:
root.attrib


Out[6]:
{'{http://www.w3.org/XML/1998/namespace}lang': 'en'}

In [7]:
c4_att = root[4].attrib
c4_att


Out[7]:
{'rel': 'alternate', 'type': 'text/html', 'href': 'http://diveintomark.org/'}

In [8]:
c4_att['rel'], c4_att['href']


Out[8]:
('alternate', 'http://diveintomark.org/')

Searching


In [9]:
# find the first matching entry ('.//' searches all descendants)
tree.find('.//{http://www.w3.org/2005/Atom}entry')


Out[9]:
<Element {http://www.w3.org/2005/Atom}entry at 0x4d38108>

In [10]:
# find all entry elements
tree.findall('.//{http://www.w3.org/2005/Atom}entry')


Out[10]:
[<Element {http://www.w3.org/2005/Atom}entry at 0x4d38108>,
 <Element {http://www.w3.org/2005/Atom}entry at 0x4d38288>,
 <Element {http://www.w3.org/2005/Atom}entry at 0x4d30688>]

In [11]:
# find all category elements
tree.findall('.//{http://www.w3.org/2005/Atom}category')


Out[11]:
[<Element {http://www.w3.org/2005/Atom}category at 0x4d381c8>,
 <Element {http://www.w3.org/2005/Atom}category at 0x4d38408>,
 <Element {http://www.w3.org/2005/Atom}category at 0x4d38488>,
 <Element {http://www.w3.org/2005/Atom}category at 0x4d38448>,
 <Element {http://www.w3.org/2005/Atom}category at 0x4d38608>,
 <Element {http://www.w3.org/2005/Atom}category at 0x4d38648>,
 <Element {http://www.w3.org/2005/Atom}category at 0x4d38688>,
 <Element {http://www.w3.org/2005/Atom}category at 0x4d386c8>,
 <Element {http://www.w3.org/2005/Atom}category at 0x4d38708>,
 <Element {http://www.w3.org/2005/Atom}category at 0x4d38748>,
 <Element {http://www.w3.org/2005/Atom}category at 0x4d38788>,
 <Element {http://www.w3.org/2005/Atom}category at 0x4d387c8>]

In [12]:
# find all category elements with attribute term="mp4"
tree.findall('.//{http://www.w3.org/2005/Atom}category[@term="mp4"]')


Out[12]:
[<Element {http://www.w3.org/2005/Atom}category at 0x4d38748>]

In [13]:
# find all elements in the Atom namespace that have an href attribute
href_nodes = tree.findall('.//{http://www.w3.org/2005/Atom}*[@href]')
for e in href_nodes:
    print(e.attrib['href'])   # get link url


http://diveintomark.org/
http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition
http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress
http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats

In [14]:
# advanced search with XPath
NSMAP = {'atom': 'http://www.w3.org/2005/Atom'}
entries = tree.xpath("//atom:category[@term='accessibility']/..", namespaces=NSMAP)
entries[0].tag


Out[14]:
'{http://www.w3.org/2005/Atom}entry'

In [15]:
title = entries[0].xpath('./atom:title/text()', namespaces=NSMAP)
title


Out[15]:
['Accessibility is a harsh mistress']
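
The xpath() method is specific to lxml. As a hedged sketch of a similar lookup with only the standard library, findall() accepts a prefix mapping and a limited XPath subset that includes the parent step (..). The FEED string below is a cut-down, made-up stand-in for feed.xml, not the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-entry stand-in for feed.xml.
FEED = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <entry>
    <title>Dive into history, 2009 edition</title>
    <category term='diveintopython'/>
  </entry>
  <entry>
    <title>Accessibility is a harsh mistress</title>
    <category term='accessibility'/>
  </entry>
</feed>"""

NSMAP = {'atom': 'http://www.w3.org/2005/Atom'}
root = ET.fromstring(FEED)

# Find the matching category elements, then step up to their parent entries.
entries = root.findall(".//atom:category[@term='accessibility']/..", NSMAP)
print(entries[0].tag)    # {http://www.w3.org/2005/Atom}entry

# findtext() returns the text of the first matching child element.
title = entries[0].findtext('atom:title', namespaces=NSMAP)
print(title)             # Accessibility is a harsh mistress
```

The standard library supports only part of XPath, so expressions such as text() that work with lxml's xpath() are not available here; findtext() covers the common case.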

Generating XML


In [16]:
new_feed = etree.Element('{http://www.w3.org/2005/Atom}feed',     
    attrib={'{http://www.w3.org/XML/1998/namespace}lang': 'en'}) 
print(etree.tostring(new_feed))


b'<ns0:feed xmlns:ns0="http://www.w3.org/2005/Atom" xml:lang="en"/>'

In [17]:
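# add a child element; 'title' is given without a namespace,
# so it is not placed in the Atom namespace
title = etree.SubElement(new_feed, 'title', attrib={'type':'html'})
print(etree.tostring(new_feed, encoding='unicode'))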


<ns0:feed xmlns:ns0="http://www.w3.org/2005/Atom" xml:lang="en"><title type="html"/></ns0:feed>

In [18]:
title.text = 'Dive into Python!'
print(etree.tostring(new_feed, encoding='unicode'))


<ns0:feed xmlns:ns0="http://www.w3.org/2005/Atom" xml:lang="en"><title type="html">Dive into Python!</title></ns0:feed>

In [19]:
# pretty-print XML (pretty_print is an lxml feature)
print(etree.tostring(new_feed, encoding='unicode', pretty_print=True))


<ns0:feed xmlns:ns0="http://www.w3.org/2005/Atom" xml:lang="en">
  <title type="html">Dive into Python!</title>
</ns0:feed>
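
In the output above, title carries no prefix because it was created without a namespace. A possible variant, sketched here with the standard library's xml.etree.ElementTree as an assumption, puts the child in the Atom namespace and registers a readable prefix instead of ns0. The prefix name atom is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Ask the serializer to use 'atom' instead of an auto-generated ns0 prefix.
ET.register_namespace("atom", ATOM)

feed = ET.Element(f"{{{ATOM}}}feed", {f"{{{XML_NS}}}lang": "en"})

# Qualify the child tag with the namespace so it belongs to Atom as well.
title = ET.SubElement(feed, f"{{{ATOM}}}title", {"type": "html"})
title.text = "Dive into Python!"

xml_text = ET.tostring(feed, encoding="unicode")
print(xml_text)
```

Note that register_namespace() changes a module-wide registry, so the chosen prefix applies to every later serialization in the same process.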

You might also want to check out xmlwitch, another third-party library for generating XML. It makes extensive use of the with statement to make XML generation code more readable.

