HTML stands for 'HyperText Markup Language'. It is the markup text that describes the structure and content of a webpage.
Parsing HTML in Python can be done with the Beautiful Soup module. It is a third-party module and must be installed via pip (the package name is beautifulsoup4).
In [2]:
import bs4
This module can be used in conjunction with requests to download and parse webpages.
For example, we can download a product page from Amazon and find the price information on it.
In [12]:
import bs4
import requests
res = requests.get('http://www.amazon.ca/Automate-Boring-Stuff-Python-Programming/dp/1593275994')
res.raise_for_status()
No errors were raised, so we can now parse the webpage text:
In [13]:
soup = bs4.BeautifulSoup(res.text)
This typically raises a warning, but it is not an exception. It can be resolved by naming the parser explicitly with bs4.BeautifulSoup(res.text, 'html.parser'), as the warning message describes.
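As a minimal sketch of the explicit-parser form, the snippet below parses a small hard-coded HTML string rather than a downloaded page. The `sample_html` string and its contents are made up for illustration; `'html.parser'` is the parser that ships with Python.

```python
import bs4

# Hypothetical stand-in for res.text, so the example runs without a network request.
sample_html = '<html><head><title>Example page</title></head><body><p>Hello</p></body></html>'

# Naming the parser explicitly avoids the parser warning.
soup = bs4.BeautifulSoup(sample_html, 'html.parser')

print(type(soup))       # the Soup object
print(soup.title.text)  # text inside the <title> tag
```

The same call works on `res.text` from a real response; only the input string changes.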
We can now find elements in the page using CSS selectors and the .select method.
In Chrome, we can copy the CSS Path via 'Inspect > Copy > Copy selector'.
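To see what .select returns without depending on a live page, here is a small sketch run against made-up markup. The `sample_html` string, its nesting, and the price in it are assumptions for illustration only and are not taken from Amazon's actual page.

```python
import bs4

# Hypothetical markup loosely mirroring the nesting a copied selector might target.
sample_html = '''
<div id="buyNewSection">
  <div><div><span><span class="price"> CDN$ 27.07 </span></span></div></div>
</div>
'''
soup = bs4.BeautifulSoup(sample_html, 'html.parser')

# '>' matches direct children only, so each level of nesting is spelled out.
elems = soup.select('#buyNewSection > div > div > span > span')
print(len(elems))             # number of matching elements
print(elems[0].text.strip())  # text of the first match, surrounding whitespace removed

# A selector that matches nothing gives an empty list rather than raising an error.
missing = soup.select('#noSuchSection > span')
print(missing)
```

Because a selector that matches nothing yields an empty list, indexing with `[0]` is the step that can fail when a page's layout differs.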
In [14]:
# .select() returns a list of matching elements; the one we want is at index 0
elems = soup.select('#buyNewSection > div > div > span > span')
print(elems[0])
print(elems[0].text)
The element was successfully selected and stored.
If we tie this all together, we can create a simple program to perform these steps.
In [23]:
import bs4, requests

def getAmazonPrice(productUrl):
    # Use requests to download a URL
    res = requests.get(productUrl)
    # Raise for status to check for errors and crash if there are issues
    res.raise_for_status()
    # Pass the HTML to Beautiful Soup
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    # Pass the price CSS selector, and store into a list of all matching elements
    elems = soup.select('#buyNewSection > div > div > span > span')
    # Examine and return the text of the first (and only) element in the list
    return elems[0].text.strip()

price = getAmazonPrice('http://www.amazon.ca/Automate-Boring-Stuff-Python-Programming/dp/1593275994')
print('The current price of \'Automate the Boring Stuff with Python\' on Amazon is ' + price)
This method can be used to automate web scraping without ever opening the browser. However, it may require try and except statements to handle special conditions, because the CSS selector may not apply to every page.
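One possible way to add that error handling is sketched below. The function names `extract_text` and `get_price_or_none`, the 10-second timeout, and the sample markup are all hypothetical choices for this example, not part of the original program.

```python
import bs4
import requests


def extract_text(html, selector):
    """Return the stripped text of the first element matching selector, or None."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    elems = soup.select(selector)
    try:
        return elems[0].text.strip()
    except IndexError:
        # The selector matched nothing, e.g. the page layout differs.
        return None


def get_price_or_none(product_url, selector):
    """Download product_url and extract the price text; return None on any failure."""
    try:
        res = requests.get(product_url, timeout=10)
        res.raise_for_status()
    except requests.exceptions.RequestException:
        # Covers connection errors, timeouts, and HTTP error statuses.
        return None
    return extract_text(res.text, selector)


# Offline check with made-up HTML: one selector matches, the other does not.
page = '<div id="price"><span> $10.00 </span></div>'
print(extract_text(page, '#price > span'))
print(extract_text(page, '#buyNewSection > span'))
```

Keeping the parsing step separate from the download means the selector logic can be tried on saved or hand-written HTML before it is pointed at a live site.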
HTML can be parsed with the BeautifulSoup module.
BeautifulSoup is imported as bs4.
Pass the HTML text to the bs4.BeautifulSoup() function to get a Soup object.
The Soup object has a .select() method that can be passed a string of the CSS selector for an HTML tag.
The .select() method will return a list of matching element objects.