Introduction to Data Parsing

Professor Robert J. Brunner



Introduction

In the previous Notebook, we explored different data formats. In this Notebook, we explore how to actually pull data of interest out of a formatted data set. To do this, we introduce the parsing tool BeautifulSoup, which provides an elegant and simple way to parse and access XML formatted data. BeautifulSoup was actually designed to simplify the task of scraping data from Websites, and thus we can use it to parse any XML formatted data, including HTML and SVG. Another important tool to learn is regular expressions, which can simplify the task of finding and selecting specific data in a large document. Python provides a native implementation of regular expressions through the re module.

First, we will create an HTML document that we can use in most of this Notebook to demonstrate different parsing concepts.



In [1]:
# Write out a simple HTML file to demonstrate DOM processing

html = '''
<!DOCTYPE html>
<html>
<head id='hid' class='hclass'>
<title> Test, this is only a test ... </title>
</head>
<body id='bid' class='bclass'>
<header> 
This is text in the header.
</header>

<h2 color='mycolor'>This is a Header Level 2</h2>

<p align='myalign'>Here is some text in a paragraph.</p>

<p> Here is a list </p>
<ul id='ulid'>
<li> List Item #1 </li>
<li> List Item #2 </li>
</ul>

<p type='caption'> Here is a table </p>
<table id='tid'>
<tr>
<th> Column #1 </th>
<th> Column #2 </th>
</tr>
<tr>
<td> A value </td>
<td> Another Value </td>
</tr>
</table>

<p> Some concluding text </p>

<footer>
<hr />
This is text in the footer.
</footer>

</body>
</html>
'''

# Now save the HTML string
with open('test.html', 'w') as fout:
    fout.write(html)

Document Object Model

There are at least two techniques used to parse a structured file like an XML document. The first approach is known as the Simple API for XML (or SAX), which is an event-driven parser that reads and processes each part of an XML document sequentially. The second approach is the Document Object Model (or DOM), which reads and parses the entire document into a tree. While the SAX approach can be fast and uses a smaller memory footprint, the DOM approach makes it easier to extract all or most of the information contained in an XML document.
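
Before turning to Beautiful Soup, the difference between the two approaches can be sketched with Python's standard library: xml.etree.ElementTree supports both a DOM-style parse and, via iterparse, a SAX-like streaming parse. This is a minimal sketch; the tiny XML string is purely illustrative.

import xml.etree.ElementTree as ET
from io import StringIO

data = '<airports><airport>CMI</airport><airport>ORD</airport></airports>'

# DOM-style: build the entire tree in memory, then navigate it
root = ET.fromstring(data)
for ap in root.findall('airport'):
    print('DOM :', ap.text)

# SAX-style (streaming): handle elements one at a time as they complete,
# so the whole document never needs to be held in memory as a tree
for event, elem in ET.iterparse(StringIO(data), events=('end',)):
    if elem.tag == 'airport':
        print('SAX :', elem.text)
        elem.clear()  # discard processed elements to keep memory small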

To demonstrate using a DOM, we can process our newly minted HTML file, which is rendered rather simply as shown in the following figure:

This HTML document, which is a valid XML document, demonstrates hierarchical elements as well as element attributes and values. This can be seen more easily by examining the Document Object Model (or DOM) representation of this document, which is shown in the following figure:

This figure is actually a screenshot from the Safari Web Browser's Developer Source View. This representation of the DOM clearly illustrates the hierarchical nature of the document. At the highest level we have the html element, inside of which are two separate elements: head and body.

Looking at the document tree more closely, we see that the head element has associated id and class attributes, as well as a child element called title, which has a value of Test, this is only a test .... The body element has a number of child elements, including the header, h2, p, ul, table, and footer elements. Some of these elements have their own child elements, values, and possibly attributes. The relationship between the DOM element and the HTML view can be seen in the following two figures, where the ul element is highlighted in the DOM model,

and the corresponding element is highlighted in the HTML view.


Parsing

To parse an XML document, like our example HTML document, we can use the Python Beautiful Soup library. This library uses an XML/HTML parser to build a DOM tree, and Beautiful Soup then provides traversal methods to access and modify the DOM for a specific document. Beautiful Soup has been extremely popular for the ease with which it enables web scraping; for example, you can quickly pull data out of an HTML table. But it is more powerful than this, since it allows you to easily parse and manipulate any XML document, as we will see in the Data Visualization Notebook.
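
As a quick preview of this kind of scraping, here is a minimal sketch (assuming the test.html file created above) that pulls each row out of our sample document's table; the systematic walkthrough follows below.

from bs4 import BeautifulSoup

# Parse the test.html file we wrote earlier
with open('test.html') as fin:
    soup = BeautifulSoup(fin, 'html.parser')

# Each row (tr) holds header (th) or data (td) cells; collect their text
for row in soup.table.find_all('tr'):
    print([cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])])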

To use Beautiful Soup, we first need to import the library, and then create a BeautifulSoup object that provides access to the parsed data. Document elements, like body or table are directly accessed from the parsed tree; and element attributes or data can be easily extracted, deleted, or replaced. If required, new data can also be added to an existing document, allowing for the dynamic creation of a new document. These capabilities are demonstrated in the following code cells.



In [2]:
# Let's parse our HTML document

# We use BeautifulSoup version 4
from bs4 import BeautifulSoup

# Explicitly select Python's built-in HTML parser
soup = BeautifulSoup(open('test.html'), 'html.parser')

# Now let's print out the start of the HTML file
print(soup.prettify()[:108])


<!DOCTYPE html>
<html>
 <head class="hclass" id="hid">
  <title>
   Test, this is only a test ...
  </title>

In [3]:
# We can access document elements directly
print('title element:= ', soup.title)
print('title value:', soup.title.string)

# We can access parent data
print('title parent element: ', soup.title.parent.name)


title element:=  <title> Test, this is only a test ... </title>
title value:  Test, this is only a test ... 
title parent element:  head

In [4]:
# We can directly access element attributes

print('body class attribute: ', soup.body['class'])


body class attribute:  ['bclass']

In [5]:
# We can access an entire element's content
print(soup.ul)


<ul id="ulid">
<li> List Item #1 </li>
<li> List Item #2 </li>
</ul>

In [6]:
# We can find all occurrences of a particular element

for el in soup.find_all('p'):
    print(el)


<p align="myalign">Here is some text in a paragraph.</p>
<p> Here is a list </p>
<p type="caption"> Here is a table </p>
<p> Some concluding text </p>

In [7]:
# We can also change data in the document

soup.title.string = 'This is a new title!'

print(soup.title)

soup.body['class'] = 'newClass'

print("\nBody class attribute = ", soup.body['class'])


<title>This is a new title!</title>

Body class attribute =  newClass

In [8]:
# We can delete elements

myTable = soup.table.extract()

print(soup.table)


None

In [9]:
# We can select elements based on CSS Selectors
target = soup.select('p[type]')
print(target)


[<p type="caption"> Here is a table </p>]

In [10]:
# We need to pull out the first element in the list to get the tag.
# Now we can insert our table back into the DOM.

target[0].insert_after(myTable)
print(soup.table)


<table id="tid">
<tr>
<th> Column #1 </th>
<th> Column #2 </th>
</tr>
<tr>
<td> A value </td>
<td> Another Value </td>
</tr>
</table>

In [11]:
# We can also insert entirely new elements.

# First we create a new element (tag)
tag = soup.new_tag('h3', id='h3id')
tag.string = 'A New Header'

# Now we can append (in this case we put the new element at the end of the body)

soup.body.append(tag)

# Show the result
print(soup.h3)


<h3 id="h3id">A New Header</h3>

While Beautiful Soup provides a great deal of power and simplicity in DOM parsing and element retrieval, the full power of parsing a document requires the use of regular expressions.

Regular Expressions

Regular expressions, also known as REs or regexes, are expressions that can be used to match one or more occurrences of a particular pattern. Regular expressions are not unique to Python; they are used in many programming languages and in many Unix command-line tools like sed, grep, and awk. Regular expressions are used in Python through the re module. To build a regular expression, you first need to understand the syntax of the RE language. Once a regular expression is developed, it is compiled and executed by an engine written in C in order to provide fast execution.

To begin, most characters in a regular expression simply match themselves. For example, python would match any occurrence of the six letters python, either alone or embedded in another word. There are several special characters, known as metacharacters, that control the behavior of the rest of the regular expression. These metacharacters are listed in the following table.

Metacharacter  Meaning                                          Example
.              Matches any character except a newline           1.3 matches 123, 1a3, and 1#3, among others
^              Matches a sequence at the beginning of the line  ^Python matches Python at the beginning of a line
$              Matches a sequence at the end of the line        Python$ matches Python at the end of a line
*              Matches zero or more occurrences of a pattern    12*3 matches 13, 123, 1223, etc.
+              Matches one or more occurrences of a pattern     12+3 matches 123, 1223, etc.
?              Matches zero or one occurrence of a pattern      12?3 matches 13 and 123
{ }            Repetition qualifier                             {m,n} means match at least m and at most n occurrences
[ ]            Specifies a character class                      [a-z] means match any lowercase character
\              Escape character                                 \w matches an alphanumeric character; \\ matches a backslash
|              Or operator                                      A|B matches either A or B
( )            Grouping operator                                (ab)+ matches ab, abab, etc.

One additional point to remember is that inside a character class (i.e., [ ]) many of these metacharacters lose their special meaning, and thus can be used to match themselves.
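
To see a few of these metacharacters in action, here is a minimal sketch using the re module; the sample strings are arbitrary.

import re

# Quantifiers: * (zero or more), + (one or more), ? (zero or one)
print(re.search(r'12*3', '13'))       # match: '2' may appear zero times
print(re.search(r'12+3', '13'))       # None: '2' must appear at least once
print(re.search(r'12?3', '123'))      # match: a single '2' is allowed

# Anchors and a character class
print(re.search(r'^Python', 'Python is fun'))  # match at start of string
print(re.search(r'[a-z]+$', 'ALL lower'))      # match 'lower' at the end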

To master regular expressions requires a lot of practice, but the investment is well worth it, as they are used in many different contexts and can greatly simplify otherwise complex tasks. Given a regular expression, the first task in Python is to compile the RE, which is done by using the compile method in the re module. This is demonstrated in the following code cell, where we use a regular expression to find the element containing CMI in order to display our local airport.



In [12]:
# We need the re module
import re 

# Open and parse our XML document
soup = BeautifulSoup(open('data.xml'))

# Find elements containing the CMI string
for el in soup.find_all(text=re.compile('CMI')):

    # To get the entire airport element, we need to go 
    # up two levels in the DOM tree.
    print(el.parent.parent)

Additional References

  1. Dive Into Python 3 regular expressions chapter.
  2. BeautifulSoup tutorial.