There are two main approaches to parsing data:
SAX (Simple API for XML) - scans elements on the fly, as a stream of events. This approach does not keep the whole document in memory.
DOM (Document Object Model) - builds a model of all elements in memory, which allows higher-level operations such as searching and navigating the tree.
This section introduces HTMLParser, which is a SAX-style parser. The following examples use this sample HTML content:
In [1]:
sample_html = """
<html>
<head>
<title>Test</title>
</head>
<body>
<h1>Heading!</h1>
<p class="major_content">Some content.</p>
<p class="minor_content">Some other content.</p>
</body>
</html>
"""
A simple usage example follows. This parser prints the tags and data it encounters.
In [2]:
# from HTMLParser import HTMLParser # Python 2.7
from html.parser import HTMLParser

class TestHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Tag start:", tag)

    def handle_endtag(self, tag):
        print("Tag end:", tag)

    def handle_data(self, data):
        print("Tag data:", data)

# instantiate the parser and feed in some HTML
parser = TestHTMLParser(convert_charrefs=True)
parser.feed(sample_html)
The goal of the second parser is to get the content of the paragraph with the class major_content.
In [3]:
class Test2HTMLParser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.recording = False

    def handle_starttag(self, tag, attrs):
        # start recording at <p class="major_content">
        if tag == "p" and dict(attrs).get("class") == "major_content":
            self.recording = True

    def handle_endtag(self, tag):
        # stop recording at the next closing tag
        self.recording = False

    def handle_data(self, data):
        if self.recording:
            print(data)

# instantiate the parser and feed in some HTML
parser2 = Test2HTMLParser()
parser2.feed(sample_html)
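Printing inside the handlers is fine for a demonstration, but the text is usually needed later. The following is a minimal sketch, not the only way to do it: the class name `CollectingHTMLParser` and the shortened `html_snippet` are assumptions made for illustration. It collects the text into a list and stops recording only at the closing `p` tag, so nested tags such as `<b>` do not cut the text short.

```python
from html.parser import HTMLParser


class CollectingHTMLParser(HTMLParser):  # hypothetical name, for illustration only
    """Collect the text of <p> elements whose class equals target_class."""

    def __init__(self, target_class):
        super().__init__(convert_charrefs=True)
        self.target_class = target_class
        self.recording = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and dict(attrs).get("class") == self.target_class:
            self.recording = True

    def handle_endtag(self, tag):
        # only the closing </p> ends the recording, not </b> and similar
        if tag == "p":
            self.recording = False

    def handle_data(self, data):
        if self.recording:
            self.chunks.append(data)


# shortened sample with a nested tag (an assumption, not the sample above)
html_snippet = (
    '<body><p class="major_content">Some <b>bold</b> content.</p>'
    '<p class="minor_content">Some other content.</p></body>'
)
collector = CollectingHTMLParser("major_content")
collector.feed(html_snippet)
collector.close()
major_text = "".join(collector.chunks)
print(major_text)
```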
See the ElementTree XML API for more information. This library is designed for XML parsing, but it can also be used for HTML parsing with varying levels of success, because it only accepts well-formed markup. It is a DOM parser. A simple example that iterates over the HTML tree (first and second levels only) follows:
In [4]:
import xml.etree.ElementTree as ET
tree = ET.fromstring(sample_html)
for child1 in tree:
    print(child1.tag)
    for child2 in child1:
        print("\t", child2.tag, "-", child2.text)
The second example prints just the content of the paragraph with the major_content class:
In [5]:
import xml.etree.ElementTree as ET
tree = ET.fromstring(sample_html)
tree.findall("./body/p[@class='major_content']")[0].text
Out[5]:
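ElementTree only accepts well-formed XML, which is why it can fail on ordinary HTML. The sketch below illustrates this with two made-up snippets (both are assumptions, not part of the sample data): one closes every tag, the other uses an unclosed `<br>`, which is common in HTML but not valid XML. The helper `try_parse` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Well-formed markup: every tag is closed, so the XML parser accepts it.
well_formed = "<html><body><p>First line.<br/>Second line.</p></body></html>"
# Typical HTML: <br> has no closing tag, which is not valid XML.
not_well_formed = "<html><body><p>First line.<br>Second line.</p></body></html>"


def try_parse(markup):
    """Return the root element, or None if the markup is not well-formed XML."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError as error:
        print("Parsing failed:", error)
        return None


ok_root = try_parse(well_formed)
bad_root = try_parse(not_well_formed)
print(ok_root.tag, bad_root)
```

For pages that are not well-formed, an HTML-aware parser is usually the safer choice.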
BeautifulSoup is a library dedicated to simplifying the scraping of information from HTML pages. It is a DOM parser. The next example uses the sample data named Example1 from this tutorial repo:
In [6]:
from bs4 import BeautifulSoup

# path to data
path = "data/example1.html"

# template for printing the output
sentence = "{} {} is {} years old."

# load data
with open(path, 'r') as datafile:
    sample_html = datafile.read()

# create tree
soup = BeautifulSoup(sample_html, "html.parser")

# get title and print it
title = soup.find("title")
print(title.text, "\n")

# select all rows in table
table = soup.find("table", {"id": "main_table"})
table_rows = table.find_all("tr")

# iterate over table and print results
for row in table_rows:
    first_name = row.find("td", {"class": "first_name"})
    last_name = row.find("td", {"class": "last_name"})
    age = row.find("td", {"class": "age"})
    if first_name and last_name and age:
        print(sentence.format(first_name.text, last_name.text, age.text))
Element attributes can be accessed as follows:
In [7]:
print(table.attrs)
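Individual attributes can also be read one at a time. The sketch below uses a small inline table (an assumption, so that it does not depend on the Example1 file) to show dictionary-style access and the safer `get()` lookup. Note that BeautifulSoup treats `class` as a multi-valued attribute and returns it as a list.

```python
from bs4 import BeautifulSoup

# inline snippet for illustration only; not the Example1 data
html_snippet = (
    '<table id="main_table" class="wide striped">'
    '<tr><td class="age">30</td></tr></table>'
)
snippet_soup = BeautifulSoup(html_snippet, "html.parser")
snippet_table = snippet_soup.find("table")

# dictionary-style access raises KeyError when the attribute is missing
table_id = snippet_table["id"]
# class can hold several values, so it comes back as a list
table_classes = snippet_table["class"]
# get() returns None instead of raising when the attribute is missing
missing = snippet_table.get("summary")

print(table_id, table_classes, missing)
```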
In [8]:
sample_html = """
<html>
<head>
<title>Test</title>
</head>
<body>
<h1>Heading!</h1>
<p class="major_content">Some content. And even more content.</p>
<p class="minor_content">
Some other content.
Numbers related content.
The important information is, that the key number is 23.
</p>
</body>
</html>
"""
If you need just the key number value from the text, and you are sure that:
the information appears only once in the text
the information will not change its form (words, word order ...)
then you can use the following approach.
In [9]:
# unclean way
marker = "the key number is "
# skip the whole marker, including its trailing space
target_start = sample_html.find(marker) + len(marker)
# first dot after the number
target_end = sample_html.find(".", target_start)
print(sample_html[target_start:target_end])
Or you can do the same thing more reliably with a regular expression.
In [10]:
# much better way (with regex)
import re
# \d+ matches only digits, \. matches a literal dot
print(re.search(r"the key number is (\d+)\.", sample_html).group(1))
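Both approaches rely on the assumptions listed earlier, and `re.search` returns `None` when nothing matches, so calling `.group(1)` on the result would then raise an error. A minimal sketch that checks the "appears only once" assumption explicitly follows. The function name `extract_key_number` and the shortened `text` are hypothetical.

```python
import re

# shortened copy of the relevant paragraph, for illustration
text = """
Some other content.
Numbers related content.
The important information is, that the key number is 23.
"""

# \d+ matches digits only; \. matches a literal dot
KEY_NUMBER = re.compile(r"the key number is (\d+)\.")


def extract_key_number(source):
    """Return the key number as int, or None unless it appears exactly once."""
    matches = KEY_NUMBER.findall(source)
    if len(matches) != 1:
        return None
    return int(matches[0])


print(extract_key_number(text))
print(extract_key_number("no number here"))
```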
The next piece of code shows how to create a JSON-encoded message in Python with the json library.
In [11]:
import json
# sample data
message = [
{"time": 123, "value": 5},
{"time": 124, "value": 6},
{"status": "ok", "finish": [True, False, False]},
]
# pack message as json
js_message = json.dumps(message)
# show result
print(type(js_message))
print(js_message)
Note that the output is a string. In a similar way, you can unpack the message back into a standard Python list/dictionary. An example follows.
In [12]:
# unpack message
message = json.loads(js_message)
# show result
print(type(message))
print(message)
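The round trip does not always return exactly the object that was packed, because JSON has fewer types than Python. A small sketch with made-up data (the dictionary `original` is an assumption for illustration) shows two common cases: tuples come back as lists, and non-string dictionary keys come back as strings.

```python
import json

# made-up data with a tuple value and an integer key
original = {"point": (1, 2), 10: "ten", "nothing": None, "ok": True}

encoded = json.dumps(original)
decoded = json.loads(encoded)

print(encoded)
print(decoded)
# the tuple became a list and the key 10 became the string "10"
print(decoded == original)
```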
In [13]:
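import requests
# the timeout prevents the call from hanging if the server does not respond
r = requests.get("http://api.open-notify.org/iss-now.json", timeout=10)
# raise an exception for HTTP error statuses (4xx, 5xx)
r.raise_for_status()
obj = r.json()
print(obj)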
The Requests method json() converts the JSON response to a Python dictionary. The next code block demonstrates how to get data from the obtained response.
In [14]:
import datetime
# raw data
print("Raw data:")
print(obj)
# important part
print("\nSelected items from data:")
print(obj["timestamp"])
print(obj['iss_position']['latitude'], obj['iss_position']['longitude'])
# unix timestamp to human format
timestamp = datetime.datetime.fromtimestamp(obj["timestamp"]).strftime('%Y-%m-%d %H:%M:%S')
# print of cleaned data
print("\nCleaned data:")
print("Time and date: {}".format(timestamp))
print("Latitude: {}, longitude: {}".format(obj['iss_position']['latitude'], obj['iss_position']['longitude']))
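Note that `datetime.datetime.fromtimestamp` without a time zone argument converts to the local time zone of the machine, so the printed time differs between computers. The sketch below wraps the same extraction in a function that uses UTC explicitly. The payload `sample_response` is hypothetical: it only mirrors the keys read above, with made-up values, so the code runs without network access. The coordinates are passed through `float()` on the assumption that they may arrive as strings.

```python
from datetime import datetime, timezone

# hypothetical payload with the same keys as the response used above;
# the values are made up for illustration
sample_response = {
    "timestamp": 0,
    "iss_position": {"latitude": "10.5", "longitude": "-20.25"},
}


def summarize_position(payload):
    """Return (UTC time string, latitude, longitude) from an ISS-style payload."""
    moment = datetime.fromtimestamp(payload["timestamp"], tz=timezone.utc)
    position = payload["iss_position"]
    return (
        moment.strftime("%Y-%m-%d %H:%M:%S"),
        float(position["latitude"]),
        float(position["longitude"]),
    )


time_text, latitude, longitude = summarize_position(sample_response)
print("Time and date (UTC): {}".format(time_text))
print("Latitude: {}, longitude: {}".format(latitude, longitude))
```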