Processing HTTP responses - JSON and HTML

This tutorial covers basic operations with HTML and JSON.

For more information on related topics, see:

HTML (XML) parsing

There are two main approaches to parsing data:

  • SAX (Simple API for XML) - scans elements on the fly and reports them as events. This approach does not keep the document in memory.

  • DOM (Document Object Model) - creates a model of all elements in memory. This allows higher-level functions such as searching and navigating the tree.
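
To make the difference between the two approaches concrete, here is a minimal sketch using only the standard library. The tiny XML string and the ItemCounter class are hypothetical, made up for this comparison. The SAX handler reacts to elements as they stream past, while ElementTree (a tree-based, DOM-style API) builds the whole document first and is queried afterwards.

```python
import xml.sax
import xml.etree.ElementTree as ET

# Hypothetical well-formed sample used only for this comparison.
xml_text = "<root><item>a</item><item>b</item></root>"

class ItemCounter(xml.sax.ContentHandler):
    """SAX style: react to each element as it streams past."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1

handler = ItemCounter()
xml.sax.parseString(xml_text.encode("utf-8"), handler)
sax_count = handler.count

# Tree (DOM) style: build the whole tree first, then query it.
root = ET.fromstring(xml_text)
dom_items = [item.text for item in root.findall("item")]

print(sax_count)   # 2
print(dom_items)   # ['a', 'b']
```

The SAX version never holds more than the current event, which suits very large inputs. The tree version is usually more convenient when the document fits in memory.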

HTML parsing with the Python HTMLParser class

This section introduces HTMLParser, which is a SAX-style (event-driven) parser. The following examples use this sample HTML content:


In [1]:
sample_html =  """
<html>
    <head>
        <title>Test</title>
    </head>
    <body>
        <h1>Heading!</h1>
        <p class="major_content">Some content.</p>
        <p class="minor_content">Some other content.</p>
    </body>
</html>
"""

A simple usage example follows. The parser below prints the tags and data it encounters.


In [2]:
# from HTMLParser import HTMLParser # Python 2.7
from html.parser import HTMLParser

class TestHTMLParser(HTMLParser):
    
    def handle_starttag(self, tag, attrs):
        print("Tag start:", tag)

    def handle_endtag(self, tag):
        print("Tag end:", tag)

    def handle_data(self, data):
        print("Tag data:", data)
    
# instantiate the parser and feed in some HTML
parser = TestHTMLParser(convert_charrefs=True)
parser.feed(sample_html)


Tag data: 

Tag start: html
Tag data: 
    
Tag start: head
Tag data: 
        
Tag start: title
Tag data: Test
Tag end: title
Tag data: 
    
Tag end: head
Tag data: 
    
Tag start: body
Tag data: 
        
Tag start: h1
Tag data: Heading!
Tag end: h1
Tag data: 
        
Tag start: p
Tag data: Some content.
Tag end: p
Tag data: 
        
Tag start: p
Tag data: Some other content.
Tag end: p
Tag data: 
    
Tag end: body
Tag data: 

Tag end: html
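
Most of the "Tag data:" lines in the output above are just the whitespace between tags. A possible variant, sketched here with a hypothetical CompactHTMLParser class and a short hypothetical snippet, skips whitespace-only data and also records the attributes that arrive with each start tag.

```python
from html.parser import HTMLParser

class CompactHTMLParser(HTMLParser):
    """Record tags, attributes, and non-blank text as a list of events."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only chunks between tags
            self.events.append(("data", text))

# Hypothetical snippet, kept separate from sample_html above.
snippet = '<p class="major_content">\n  Some content.\n</p>'
compact = CompactHTMLParser()
compact.feed(snippet)
compact.close()

for event in compact.events:
    print(event)
```

Collecting events in a list instead of printing them makes the parser easier to reuse and to test.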

The goal of the second parser is to get the content of the paragraph with the class major_content.


In [3]:
class Test2HTMLParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self, convert_charrefs=True)
        self.recording = False
    
    def handle_starttag(self, tag, attrs):
        # check the class attribute itself, not the value of any attribute
        if tag == "p" and dict(attrs).get("class") == "major_content":
            self.recording = True

    def handle_endtag(self, tag):
        self.recording = False

    def handle_data(self, data):
        if self.recording:
            print(data)

# instantiate the parser and feed in some HTML
parser2 = Test2HTMLParser()
parser2.feed(sample_html)


Some content.
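
The parser above stops recording at the first end tag it meets, so text after a nested tag such as <b> inside the paragraph would be lost. One way around this, sketched below with a hypothetical ParagraphCollector class and a hypothetical nested_html string, is to track nesting depth and collect the text into a list. This is an illustration under simple assumptions: void elements such as <br>, which have no end tag, would need extra handling.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> whose class attribute matches."""

    def __init__(self, wanted_class):
        super().__init__(convert_charrefs=True)
        self.wanted_class = wanted_class
        self.depth = 0        # > 0 while inside a matching paragraph
        self.chunks = []      # text pieces of the paragraph being read
        self.results = []     # finished paragraphs

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1   # nested tag inside the paragraph
        elif tag == "p":
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:   # the matching paragraph has just closed
                self.results.append("".join(self.chunks))
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

# Hypothetical input with a nested tag inside the wanted paragraph.
nested_html = ('<p class="major_content">Some <b>bold</b> content.</p>'
               '<p class="minor_content">Other.</p>')
collector = ParagraphCollector("major_content")
collector.feed(nested_html)
collector.close()
print(collector.results)   # ['Some bold content.']
```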

Examples with the ElementTree XML API

See the ElementTree XML API documentation for more information. This library is designed for XML parsing, but it can also parse HTML with varying levels of success: the input must be well-formed XML, which real-world HTML often is not. It is a DOM-style parser that builds the whole tree in memory. A simple example that iterates over the first and second levels of the HTML tree follows:


In [4]:
import xml.etree.ElementTree as ET

tree = ET.fromstring(sample_html)

for child1 in tree:
    print(child1.tag)
    for child2 in child1:
        print("\t", child2.tag, "-", child2.text)


head
	 title - Test
body
	 h1 - Heading!
	 p - Some content.
	 p - Some other content.

The second example prints just the content of the paragraph with the major_content class:


In [5]:
import xml.etree.ElementTree as ET

tree = ET.fromstring(sample_html)
        
tree.findall("./body/p[@class='major_content']")[0].text


Out[5]:
'Some content.'
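
The sample HTML used above happens to be well-formed XML, which is why ElementTree accepts it. The short sketch below uses a hypothetical fragment to show what happens otherwise: an unclosed <br> is fine in HTML, but the XML parser rejects it with a ParseError.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: acceptable HTML, but <br> is never closed,
# so it is not well-formed XML.
loose_html = "<p>first line<br>second line</p>"

try:
    ET.fromstring(loose_html)
    parsed_ok = True
except ET.ParseError as error:
    parsed_ok = False
    print("Not well-formed XML:", error)
```

For pages like this, an HTML-aware parser such as HTMLParser or BeautifulSoup is the safer choice.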

Examples with the BeautifulSoup library

BeautifulSoup is a library dedicated to simplifying the scraping of information from HTML pages. It is a DOM parser. The next example uses the sample data named Example1 from this tutorial's repository:


In [6]:
from bs4 import BeautifulSoup

# path to data
path = "data/example1.html"

# template for printing the output
sentence = "{} {} is {} years old."

# load data
with open(path, 'r') as datafile:
    sample_html = datafile.read()

# create tree
soup = BeautifulSoup(sample_html, "html.parser")

# get title and print it
title = soup.find("title")
print(title.text, "\n")

# select all rows in table (find_all is the current name of findAll)
table = soup.find("table", {"id": "main_table"})
table_rows = table.find_all("tr")

# iterate over table and print results
for row in table_rows:
    first_name = row.find("td", {"class": "first_name"})
    last_name = row.find("td", {"class": "last_name"})
    age = row.find("td", {"class": "age"})
    if first_name and last_name and age:
        print(sentence.format(first_name.text, last_name.text, age.text))


Example webpage! 

Alice Smith is 31 years old.
Bob Stone is 38 years old.
Narcissus Hyacinth is 34 years old.
Adelmar Egino is 50 years old.

Element attributes are available through the attrs dictionary:


In [7]:
print(table.attrs)


{'id': 'main_table'}
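
The same kind of table can also be queried with CSS selectors. The sketch below does not read data/example1.html. It uses a small hypothetical table that only mirrors the id and class names seen above, so the real file may differ. It assumes a BeautifulSoup version that provides select() and select_one().

```python
from bs4 import BeautifulSoup

# Hypothetical table mirroring the ids and classes used above;
# the real data/example1.html may differ.
table_html = """
<table id="main_table">
  <tr><th>First name</th><th>Last name</th><th>Age</th></tr>
  <tr><td class="first_name">Alice</td><td class="last_name">Smith</td><td class="age">31</td></tr>
  <tr><td class="first_name">Bob</td><td class="last_name">Stone</td><td class="age">38</td></tr>
</table>
"""

soup_demo = BeautifulSoup(table_html, "html.parser")

people = []
for row in soup_demo.select("table#main_table tr"):
    cells = [row.select_one("td." + name)
             for name in ("first_name", "last_name", "age")]
    # the header row has no matching <td> cells, so it is skipped
    if all(cell is not None for cell in cells):
        first, last, age = (cell.get_text(strip=True) for cell in cells)
        people.append((first, last, int(age)))

print(people)

# A single attribute can be read like a dictionary key;
# .get() returns None when the attribute is missing.
print(soup_demo.table["id"], soup_demo.table.get("summary"))
```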

Getting a specific string from HTML (or other text)

In some cases it can be beneficial to get a particular piece of information from the source without parsing it. The following examples use this source:


In [8]:
sample_html =  """
<html>
    <head>
        <title>Test</title>
    </head>
    <body>
        <h1>Heading!</h1>
        <p class="major_content">Some content. And even more content.</p>
        <p class="minor_content">
            Some other content.
            Numbers related content.
            The important information is, that the key number is 23.
        </p>
    </body>
</html>
"""

If you need just the value of the key number from the text, and you are sure that:

  • the information appears only once in the text

  • the information will not change its form (words, word order, ...)

then you can use the following approach.


In [9]:
# unclean way: locate the marker text and slice up to the next period
marker = "the key number is "
target_start = sample_html.find(marker) + len(marker)
target_end = sample_html.find(".", target_start)
print(sample_html[target_start:target_end])


23

Alternatively, you can do the same thing more robustly with a regular expression.


In [10]:
# much better way (with regex): capture the digits that follow the phrase
import re
print(re.search(r'the key number is (\d+)', sample_html).group(1))


23
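
The one-line search above raises AttributeError if the phrase is missing, because re.search returns None when nothing matches. A slightly more defensive sketch follows. The helper name extract_key_number and the test strings are hypothetical.

```python
import re

# Compile once; \d+ matches digits only, so the trailing period
# is never part of the captured group.
KEY_NUMBER = re.compile(r"the key number is (\d+)")

def extract_key_number(text):
    """Return the key number as an int, or None if the phrase is missing."""
    match = KEY_NUMBER.search(text)
    return int(match.group(1)) if match else None

print(extract_key_number("... the key number is 23."))   # 23
print(extract_key_number("no number here"))              # None
```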

Working with JSON

The next piece of code shows how to create a JSON-encoded message in Python with the json library.

Simple example


In [11]:
import json

# sample data
message = [
    {"time": 123, "value": 5},
    {"time": 124, "value": 6},
    {"status": "ok", "finish": [True, False, False]}, 
]

# pack message as json
js_message = json.dumps(message)

# show result
print(type(js_message))
print(js_message)


<class 'str'>
[{"time": 123, "value": 5}, {"time": 124, "value": 6}, {"status": "ok", "finish": [true, false, false]}]

Note that the output is a string. In a similar way, you can unpack the message back into a standard Python list or dictionary, as the following example shows.


In [12]:
# unpack message
message = json.loads(js_message)

# show result
print(type(message))
print(message)


<class 'list'>
[{'time': 123, 'value': 5}, {'time': 124, 'value': 6}, {'status': 'ok', 'finish': [True, False, False]}]
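
Two details are often useful in practice: making the encoded text stable and readable, and handling input that is not valid JSON. The sketch below uses a hypothetical record to illustrate both with the standard json module.

```python
import json

# Hypothetical record used only for this illustration.
record = {"value": 5, "time": 123, "finish": [True, False]}

# sort_keys gives a stable key order; indent makes the text readable.
pretty = json.dumps(record, sort_keys=True, indent=2)
print(pretty)

# Invalid input raises json.JSONDecodeError (a subclass of ValueError).
try:
    json.loads("{'time': 123}")   # single quotes are not valid JSON
    decoded_ok = True
except json.JSONDecodeError as error:
    decoded_ok = False
    print("Invalid JSON:", error.msg)
```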

JSON support in the Requests library

The Requests library can convert an HTTP JSON response directly to standard Python types (dictionary or list). See the following example.


In [13]:
import requests

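# a timeout keeps the call from hanging if the server does not answer
r = requests.get("http://api.open-notify.org/iss-now.json", timeout=10)
r.raise_for_status()  # raise an error for 4xx/5xx responses
obj = r.json()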

print(obj)


{'timestamp': 1490553547, 'message': 'success', 'iss_position': {'longitude': '-67.1912', 'latitude': '-44.8940'}}

The json() method of the Requests response converts the JSON response to a Python dictionary. The next code block demonstrates how to get data from the obtained response.


In [14]:
import datetime

# raw data
print("Raw data:")
print(obj)

# important part
print("\nSelected items from data:")
print(obj["timestamp"])
print(obj['iss_position']['latitude'], obj['iss_position']['longitude'])

# convert the Unix timestamp to a human-readable local time
timestamp = datetime.datetime.fromtimestamp(obj["timestamp"]).strftime('%Y-%m-%d %H:%M:%S')

# print the cleaned data
print("\nCleaned data:")
print("Time and date: {}".format(timestamp))
print("Latitude: {}, longitude: {}".format(obj['iss_position']['latitude'], obj['iss_position']['longitude']))


Raw data:
{'timestamp': 1490553547, 'message': 'success', 'iss_position': {'longitude': '-67.1912', 'latitude': '-44.8940'}}

Selected items from data:
1490553547
-44.8940 -67.1912

Cleaned data:
Time and date: 2017-03-26 20:39:07
Latitude: -44.8940, longitude: -67.1912
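
Note that datetime.fromtimestamp without a tz argument, as used above, returns local time, so the printed time depends on where the code runs. The sketch below reuses the response shown above as a literal string, so it needs no network access. It converts the timestamp to UTC and the coordinates, which arrive as strings, to floats. The helper name parse_iss_position is hypothetical.

```python
import json
from datetime import datetime, timezone

# Payload copied from the response shown above, so no network access is needed.
payload = ('{"timestamp": 1490553547, "message": "success", '
           '"iss_position": {"longitude": "-67.1912", "latitude": "-44.8940"}}')

def parse_iss_position(data):
    """Return (UTC datetime, latitude, longitude) from a decoded response."""
    if data.get("message") != "success":
        raise ValueError("unexpected response: {!r}".format(data.get("message")))
    position = data["iss_position"]
    when = datetime.fromtimestamp(data["timestamp"], tz=timezone.utc)
    # The coordinates arrive as strings, so convert them before doing arithmetic.
    return when, float(position["latitude"]), float(position["longitude"])

when, latitude, longitude = parse_iss_position(json.loads(payload))
print(when.isoformat(), latitude, longitude)
# 2017-03-26T18:39:07+00:00 -44.894 -67.1912
```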
