Source : Dive Into Python - Chapter 12 XML by Mark Pilgrim
XML is a generalized way of describing hierarchical structured data.
An xml document contains one or more elements, which are delimited by start and end tags. Elements can be nested to any depth.
The first element in every xml document is called the root element. An xml document can only have one root element.
Elements can have attributes, which are name-value pairs. Attributes are listed within the start tag of an element and separated by whitespace. Attribute names can not be repeated within an element. Attribute values must be quoted. You may use either single or double quotes.
An element’s attributes form an unordered set of keys and values, like a Python dictionary.
Elements can have text content.
Like Python functions can be declared in different modules, xml elements can be declared in different namespaces. Namespaces usually look like URLs.
You can also use an xmlns:prefix declaration to define a namespace and associate it with a prefix. Then each element in that namespace must be explicitly declared with the prefix.
xml documents can contain character encoding information on the first line, before the root element.
In [1]:
#import lxml.etree as etree
try:
from lxml import etree as etree
except ImportError:
import xml.etree.ElementTree as etree
In [2]:
tree = etree.parse('feed.xml')
root = tree.getroot()
root
Out[2]:
In [3]:
root.tag
Out[3]:
In [4]:
len(root)
Out[4]:
In [5]:
for child in root:
print(child)
In [6]:
root.attrib
Out[6]:
In [7]:
c4_att = root[4].attrib
c4_att
Out[7]:
In [8]:
c4_att['rel'],c4_att['href']
Out[8]:
In [9]:
# find 1st matching entry
tree.find('//{http://www.w3.org/2005/Atom}entry')
Out[9]:
In [10]:
# find all entry elements
tree.findall('//{http://www.w3.org/2005/Atom}entry')
Out[10]:
In [11]:
# find all category elements
tree.findall('//{http://www.w3.org/2005/Atom}category')
Out[11]:
In [12]:
# find all category element with attribute term="mp4"
tree.findall('//{http://www.w3.org/2005/Atom}category[@term="mp4"]')
Out[12]:
In [13]:
# find all elements with href attribute
href_nodes = tree.findall('//{http://www.w3.org/2005/Atom}*[@href]')
for e in href_nodes:
print(e.attrib['href']) # get link url
In [14]:
# advanced search with XPath
NSMAP = {'atom': 'http://www.w3.org/2005/Atom'}
entries = tree.xpath("//atom:category[@term='accessibility']/..", namespaces=NSMAP)
entries[0].tag
Out[14]:
In [15]:
title = entries[0].xpath('./atom:title/text()', namespaces=NSMAP)
title
Out[15]:
In [16]:
new_feed = etree.Element('{http://www.w3.org/2005/Atom}feed',
attrib={'{http://www.w3.org/XML/1998/namespace}lang': 'en'})
print(etree.tostring(new_feed))
In [17]:
# add more element/text
title = etree.SubElement(new_feed, 'title', attrib={'type':'html'})
print(etree.tounicode(new_feed))
In [18]:
title.text = 'Dive into Python!'
print(etree.tounicode(new_feed))
In [19]:
# pretty print XML
print(etree.tounicode(new_feed, pretty_print=True))
You might also want to check out xmlwitch,
another third-party library for generating xml. It makes extensive use of the with statement to make xml generation code more readable.
lxml
xml on Wikipedia.org http://en.wikipedia.org/wiki/XML
The ElementTree xml API http://docs.python.org/3.1/library/xml.etree.elementtree.html
Elements and Element Trees http://effbot.org/zone/element.htm
XPath Support in ElementTree http://effbot.org/zone/element-xpath.htm
The ElementTree iterparse Function http://effbot.org/zone/element-iterparse.htm
Parsing xml and html with lxml http://codespeak.net/lxml/1.3/parsing.html
XPath and xslt with lxml http://codespeak.net/lxml/1.3/xpathxslt.html
In [ ]:
In [ ]: