Lesson 40:

Parsing HTML with the BeautifulSoup Module

HTML stands for 'HyperText Markup Language', the plaintext format that describes the structure and content of a webpage.

Parsing HTML in Python can be done with the Beautiful Soup module. It is a third-party module and must be installed via pip (the package is named beautifulsoup4, but it is imported as bs4).
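Assuming pip is available on the command line, the install looks like this (note that the package name on PyPI differs from the import name):

```shell
pip install beautifulsoup4
```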


In [2]:
import bs4

This module can be used in conjunction with requests to download and parse webpages.

For example, we can download a product page from Amazon and find the price information on it.


In [12]:
import bs4
import requests

res = requests.get('http://www.amazon.ca/Automate-Boring-Stuff-Python-Programming/dp/1593275994')
res.raise_for_status()
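If the request had failed, raise_for_status() would raise requests.exceptions.HTTPError. A quick offline check (constructing a bare Response object by hand, which is not how it would normally be created, just to fake a failed request):

```python
import requests

# Build a bare Response and fake a 404 status code, so that
# raise_for_status() has something to complain about
# (normally requests.get() fills this object in for us)
res = requests.Response()
res.status_code = 404

try:
    res.raise_for_status()
except requests.exceptions.HTTPError as err:
    print('Request failed:', err)
```

This is why calling raise_for_status() right after requests.get() is a good habit: a failed download crashes early instead of silently passing an error page to the parser.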

No errors were raised, so we can now parse the webpage text:


In [13]:
soup = bs4.BeautifulSoup(res.text)


/usr/local/lib/python3.5/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

  markup_type=markup_type))

This raises a warning, but not an exception. It can be resolved by calling bs4.BeautifulSoup(res.text, 'html.parser') as described in the warning message.

We can now find elements in the page via CSS selectors and the .select() method.

In Chrome, we can copy the CSS Path via 'Inspect > Copy > Copy selector'.
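Before running .select() against a live page, it can be tried on a small hand-written HTML string (the markup below is made up for illustration; only the id matches Amazon's):

```python
import bs4

# A made-up snippet of product markup for illustration
html = '''
<div id="buyNewSection">
  <span class="offer-price">CDN$ 27.22</span>
  <span class="list-price">CDN$ 39.99</span>
</div>
'''

soup = bs4.BeautifulSoup(html, 'html.parser')

# .select() returns a list of all elements matching the CSS selector
elems = soup.select('#buyNewSection > .offer-price')
print(len(elems))     # number of matching elements
print(elems[0].text)  # the text inside the first match
```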


In [14]:
# .select() returns a list of all matching elements; ours is at index 0
elems = soup.select('#buyNewSection > div > div > span > span')

print(elems[0])
print(elems[0].text)


<span class="a-size-medium a-color-price offer-price a-text-normal">CDN$ 27.22</span>
CDN$ 27.22

The element was successfully found and stored.

Amazon Price Scraper Program:

If we tie this all together, we can create a simple program to perform these steps.


In [23]:
import bs4, requests

def getAmazonPrice(productUrl):
    # Use requests to download a URL
    res = requests.get(productUrl)
    # Raise for status to check for errors and crash if there are issues
    res.raise_for_status()
    
    # Pass the HTML to Beautiful Soup
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    # Pass the price CSS selector, and store into a list of all matching elements
    elems = soup.select('#buyNewSection > div > div > span > span')
    # Examine and return the text of the first (and only) element in the list
    return elems[0].text.strip()
    
    
    
price = getAmazonPrice('http://www.amazon.ca/Automate-Boring-Stuff-Python-Programming/dp/1593275994')
print('The current price of \'Automate the Boring Stuff with Python\' on Amazon is ' + price)


The current price of 'Automate the Boring Stuff with Python' on Amazon is CDN$ 27.22

This method can be used to automate web scraping without ever opening the browser. However, it may require try and except statements to handle special conditions; the CSS selector may not apply to every page, in which case .select() returns an empty list and elems[0] raises an IndexError.
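As a sketch of that defensive style (using a hypothetical selectPrice helper on local markup, rather than a live request), an empty .select() result can be caught with an IndexError handler:

```python
import bs4

def selectPrice(html, selector):
    # A hypothetical helper: parse the markup and return the first
    # matching element's text, or None when nothing matches
    soup = bs4.BeautifulSoup(html, 'html.parser')
    try:
        return soup.select(selector)[0].text.strip()
    except IndexError:
        # The selector matched nothing on this page
        return None

html = '<span class="offer-price">CDN$ 27.22</span>'
print(selectPrice(html, '.offer-price'))   # selector matches
print(selectPrice(html, '.kindle-price'))  # no match, returns None
```

Returning None (or raising a custom exception) lets the caller decide how to react when a page's layout differs, instead of crashing inside the scraper.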

Recap

  • Web pages are plaintext files formatted as HTML.
  • HTML can be parsed with the BeautifulSoup module.
  • BeautifulSoup is imported as bs4.
  • Pass the string containing the HTML to the bs4.BeautifulSoup() function to get a Soup object.
  • The Soup object has a .select() method that can be passed a string of the CSS Selector for an HTML tag.
  • You can get a CSS Selector string from the browser's developer tools.
  • The .select() method will return a list of matching element objects.