In the previous Notebook, we explored different data formats. In this Notebook we explore how to actually pull data of interest out of a formatted data set. To do this we introduce the parsing tool BeautifulSoup, which provides an elegant and simple way to parse and access XML-formatted data. BeautifulSoup was originally designed to simplify the task of scraping data from websites, and thus it can parse any XML-formatted data, including HTML or SVG. Another important tool to learn is regular expressions, which can simplify the task of finding and selecting specific data in a large document. Python provides a native implementation of regular expressions through the re module.
First, we will create an HTML document that we can use in most of this Notebook to demonstrate different parsing concepts.
In [1]:
# Write out a simple HTML file to demonstrate DOM processing
html = '''
<!DOCTYPE html>
<html>
<head id='hid' class='hclass'>
  <title> Test, this is only a test ... </title>
</head>
<body id='bid' class='bclass'>
  <header>
    This is text in the header.
  </header>
  <h2 color='mycolor'>This is a Header Level 2</h2>
  <p align='myalign'>Here is some text in a paragraph.</p>
  <p> Here is a list </p>
  <ul id='ulid'>
    <li> List Item #1 </li>
    <li> List Item #2 </li>
  </ul>
  <p type='caption'> Here is a table </p>
  <table id='tid'>
    <tr>
      <th> Column #1 </th>
      <th> Column #2 </th>
    </tr>
    <tr>
      <td> A value </td>
      <td> Another Value </td>
    </tr>
  </table>
  <p> Some concluding text </p>
  <footer>
    <hr />
    This is text in the footer.
  </footer>
</body>
</html>
'''
# Now save the HTML string to a file
with open('test.html', 'w') as fout:
    fout.write(html)
There are at least two techniques used to parse a structured file like an XML document. The first approach is known as the Simple API for XML (or SAX), which is an event-driven parser that reads and processes each part of an XML document sequentially. The second approach is the Document Object Model (or DOM), which reads and parses the entire document into a tree. While the SAX approach can be fast and has a smaller memory footprint, the DOM approach can more easily be used to extract all or most of the information contained in an XML document.
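To make the event-driven SAX style concrete, the following minimal sketch uses the standard library's xml.sax module to count element tags in a small XML string as the parser streams through it; the TagCounter class and the sample string are our own illustrative additions, not part of this Notebook's data.

```python
import xml.sax

# A minimal SAX handler: the parser calls startElement for each
# opening tag as it streams through the document, so no full tree
# is ever built in memory.
class TagCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1

xml_data = b'<root><item/><item/><note>hi</note></root>'
handler = TagCounter()
xml.sax.parseString(xml_data, handler)
print(handler.counts)  # {'root': 1, 'item': 2, 'note': 1}
```

Note that the handler only ever sees one parsing event at a time, which is exactly why SAX can process documents far larger than available memory.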
To demonstrate using a DOM, we can process our newly minted HTML file, which is rendered rather simply as shown in the following figure:
This HTML document, which is a valid XML document, demonstrates both hierarchical elements, as well as element attributes and values. This can be seen more easily by examining the document object model (or DOM) representation of this document, which is shown in the following figure:
This figure is actually a screenshot from the Safari Web Browser Developer Source View. This representation of the DOM very clearly illustrates the hierarchical nature of the document. At the highest level we have the html element, inside of which are two separate elements: body and head.
Looking at the document tree more closely, we see that the head element has associated id and class attributes, as well as a child element called title, which has a value of Test, this is only a test .... The body element has a number of child elements, including the header, h2, p, ul, table, and footer elements. Some of these elements have child elements, values, and possibly their own attributes. The relationship between the DOM element and the HTML view can be seen in the following two figures, where the ul element is highlighted in the DOM model, and the corresponding element is highlighted in the HTML view.
To parse an XML document, like our example HTML document, we can use the Python Beautiful Soup library. This library uses an XML/HTML parser to build a DOM tree, and Beautiful Soup then provides traversal methods to access and modify the DOM for a specific document. Beautiful Soup has been extremely popular for the ease with which it allows web scraping; for example, you can pull data out of an HTML table. But it is more powerful than this, as it allows you to easily parse and manipulate any XML document, as we will see in the Data Visualization Notebook.
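As a sketch of that table-scraping use case, the following few lines pull the rows out of a small inline HTML table; the table contents here are hypothetical data of our own, and we assume the built-in html.parser backend.

```python
from bs4 import BeautifulSoup

# A small, self-contained HTML table (hypothetical data)
html = '''<table>
<tr><th>Airport</th><th>Code</th></tr>
<tr><td>Willard</td><td>CMI</td></tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')
# Each row becomes a list of its cell strings
rows = [[cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        for tr in soup.table.find_all('tr')]
print(rows)  # [['Airport', 'Code'], ['Willard', 'CMI']]
```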
To use Beautiful Soup, we first need to import the library and then create a BeautifulSoup object that provides access to the parsed data. Document elements, like body or table, are directly accessed from the parsed tree, and element attributes or data can be easily extracted, deleted, or replaced. If required, new data can also be added to an existing document, allowing for the dynamic creation of a new document. These capabilities are demonstrated in the following code cells.
In [2]:
# Let's parse our HTML document
# We use Beautiful Soup version 4
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('test.html'), 'html.parser')
# Now let's print out the start of the HTML file
print(soup.prettify()[:108])
In [3]:
# We can access document elements directly
print('title element: ', soup.title)
print('title value: ', soup.title.string)
# We can access parent data
print('title parent element: ', soup.title.parent.name)
In [4]:
# We can directly access element attributes
print('body class attribute: ', soup.body['class'])
In [5]:
# We can access an entire element's content
print(soup.ul)
In [6]:
# We can find all occurrences of a particular element
for el in soup.find_all('p'):
    print(el)
In [7]:
# We can also change data in the document
soup.title.string = 'This is a new title!'
print(soup.title)
soup.body['class'] = 'newClass'
print("\nBody class attribute = ", soup.body['class'])
In [8]:
# We can delete elements
myTable = soup.table.extract()
print(soup.table)
In [9]:
# We can select elements based on CSS Selectors
target = soup.select('p[type]')
print(target)
In [10]:
# We need to pull out the first element in the list to get the tag
# Now we can insert our table back into the DOM
target[0].insert_after(myTable)
print(soup.table)
In [11]:
# We can also insert entirely new elements.
# First we create a new element (tag)
tag = soup.new_tag('h3', id='h3id')
tag.string = 'A New Header'
# Now we can append (in this case we put the new element at the end of the body)
soup.body.append(tag)
# Show the result
print(soup.h3)
While Beautiful Soup provides a great deal of power and simplicity in DOM parsing and element retrieval, the full power of parsing a document requires the use of regular expressions.
Regular expressions, or REs or regexes, are expressions that can be used to match one or more occurrences of a particular pattern. Regular expressions are not unique to Python; they are used in many programming languages and in many Unix command line tools like sed, grep, or awk. Regular expressions are used in Python through the re module. To build a regular expression, you need to understand the syntax of the RE language. Once a regular expression is developed, it is compiled and executed by an engine written in C in order to provide fast execution.
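For instance, a compiled pattern object can be reused many times, and it exposes the same match, search, and findall operations as the module-level functions; the pattern and sample text below are illustrative assumptions of ours.

```python
import re

# Compile once, then reuse the resulting pattern object many times
pattern = re.compile(r'py\w+')
print(pattern.findall('python pycon ruby'))  # ['python', 'pycon']
```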
To begin, most characters in a regular expression simply match themselves. For example, python would match any occurrence of the six letters python, either alone or embedded in another word. There are several special characters, known as metacharacters, that control the behaviour of the rest of the regular expression. These metacharacters are listed in the following table.
Metacharacter | Meaning | Example |
---|---|---|
. | Matches any character except a newline | 1.3 matches 123 , 1a3 , and 1#3 , among others |
^ | Matches a pattern at the beginning of the line | ^Python matches Python at the beginning of a line |
$ | Matches a pattern at the end of the line | Python$ matches Python at the end of a line |
* | Matches zero or more occurrences of a pattern | 12*3 matches 13 , 123 , 1223 , etc. |
+ | Matches one or more occurrences of a pattern | 12+3 matches 123 , 1223 , etc. |
? | Matches zero or one occurrence of a pattern | 12?3 matches 13 and 123 |
{ } | Repetition qualifier | {m,n} means match at least m and at most n occurrences |
[ ] | Specifies a character class | [a-z] matches any lowercase character |
\ | Escape character | \w matches an alphanumeric character, and \\ matches a backslash |
\| | Or operator | A\|B matches either A or B |
( ) | Grouping operator | (ab)+ matches ab , abab , etc. |
One additional point to remember is that inside a character class (i.e., [ ]) many of these metacharacters lose their special meaning, and thus can be used to match themselves.
To master regular expressions requires a lot of practice, but the investment is well worth it, as they are used in many different contexts and can greatly simplify otherwise complex tasks. Given a regular expression, the first task in Python is to compile the RE, which is done by using the compile function in the re module. This is demonstrated in the following code cell, where we use a regular expression to find the element containing CMI in order to display our local airport.
In [12]:
# We need the re module
import re
# Open and parse our XML document
soup = BeautifulSoup(open('data.xml'))
# Find elements containing the CMI string
for el in soup.find_all(text=re.compile('CMI')):
    # To get the entire airport element, we need to go
    # up two levels in the DOM tree.
    print(el.parent.parent)