Sometimes we need data that lives on a web page. Ideally we can find it in a format we can work with directly, like a CSV or even an Excel file. But if it really only exists on a web page, we have to use web scraping to get it. This is usually fiddly and frustrating, which is why it's a last resort.

There are three steps to web scraping:

  1. Fetch the page
  2. Parse the HTML
  3. Select the data you want

For this demo, we'll use a list of country areas from the CIA World Factbook. This is available in better formats, but it's a good illustration.
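As a preview, the three steps can be sketched end to end. Step 1 would normally be a network fetch with requests; here a small inline HTML string (made-up data) stands in for the fetched page so the sketch runs offline:

```python
import lxml.html

# Step 1 (fetch) would normally be `requests.get(url).text`; this inline
# string stands in for the fetched page so the sketch needs no network.
page = """
<html><body><table>
  <tr><td class="region">Russia</td><td class="category_data">17,098,242</td></tr>
  <tr><td class="region">Canada</td><td class="category_data">9,984,670</td></tr>
</table></body></html>
"""

# Step 2: parse the string into a tree of elements
tree = lxml.html.fromstring(page)

# Step 3: select the data we want with an XPath expression
names = [td.text_content() for td in tree.xpath('//td[@class="region"]')]
print(names)  # ['Russia', 'Canada']
```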


In [2]:
import requests
import lxml.html

In [3]:
# Fetch the data
response = requests.get('https://www.cia.gov/library/publications/the-world-factbook/rankorder/2147rank.html')
response.text[:200]


Out[3]:
'<!doctype html>\n<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]-->\n<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8" lang="en"> <![endif]-->\n<!--[if IE 8]>    <html c'

At the moment, this is just one long string. We need to parse it so the computer knows about the structure.


In [4]:
# Parse the HTML
# Note that lxml can do this in one step, but we're keeping the steps separate for now
html = lxml.html.fromstring(response.text)

To work out how to select the bit of data we want, open the page in your browser, right-click on one of the items we're after, and choose 'Inspect element' to examine the HTML structure around it.

We get the data we want using XPath, a small query language for selecting elements in HTML and XML documents. Here, //td[@class="region"] matches every <td> (table data) element anywhere in the page whose class attribute is "region". More information about XPath is available on Wikipedia (as with many technical things).
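A few common XPath patterns, tried out on a tiny made-up table (illustrative only, not the real Factbook page):

```python
import lxml.html

# A miniature table to experiment on
tree = lxml.html.fromstring(
    '<table>'
    '<tr><td class="region">Russia</td><td class="category_data">17,098,242</td></tr>'
    '<tr><td class="region">Canada</td><td class="category_data">9,984,670</td></tr>'
    '</table>'
)

all_cells = tree.xpath('//td')                      # every <td>, anywhere in the page
regions = tree.xpath('//td[@class="region"]')       # only those with class="region"
texts = tree.xpath('//td[@class="region"]/text()')  # jump straight to their text

print(len(all_cells), len(regions), texts)  # 4 2 ['Russia', 'Canada']
```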

If you already know how to use CSS selectors, you can use those instead, but we won't cover them here (because I don't know them as well)! If you don't know what CSS selectors are, save them for later.


In [5]:
countries = []
for td in html.xpath("""//td[@class="region"]"""):
    countries.append(td.text_content())

In [7]:
print(len(countries))
countries[:10]


253
Out[7]:
['Country Comparison\xa0::\xa0Area',
 'Russia',
 'Canada',
 'United States',
 'China',
 'Brazil',
 'Australia',
 'India',
 'Argentina',
 'Kazakhstan']

In [8]:
areas = []
for td in html.xpath("""//td[@class="category_data"]"""):
    areas.append(td.text_content())

In [10]:
print(len(areas))
areas[:10]


252
Out[10]:
['       17,098,242',
 '        9,984,670',
 '        9,826,675',
 '        9,596,960',
 '        8,514,877',
 '        7,741,220',
 '        3,287,263',
 '        2,780,400',
 '        2,724,900',
 '        2,381,741']

In [11]:
areas_num = []
for a in areas:
    areas_num.append(int(a.strip().replace(',', '')))
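The same cleanup can be written as a list comprehension; a quick sketch with two of the raw strings:

```python
# Strip whitespace, drop the thousands separators, then convert to int
areas = ['       17,098,242', '        9,984,670']
areas_num = [int(a.strip().replace(',', '')) for a in areas]
print(areas_num)  # [17098242, 9984670]
```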

Now we'll match the two lists back up using Python's zip() function, which joins lists element by element. Remember that the first entry in the countries list is the spurious table header, so we need to throw it away.
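A quick sketch of how zip() behaves, with toy lists standing in for the scraped data:

```python
# zip() pairs up items by position
names = ['Russia', 'Canada', 'United States']
sizes = [17098242, 9984670, 9826675]

pairs = list(zip(names, sizes))
print(pairs[0])  # ('Russia', 17098242)

# zip() stops at the end of the shorter list, so after slicing off a
# spurious first entry the leftover item quietly disappears:
print(len(list(zip(names[1:], sizes))))  # 2
```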


In [13]:
list(zip(countries[1:], areas_num))[:10]


Out[13]:
[('Russia', 17098242),
 ('Canada', 9984670),
 ('United States', 9826675),
 ('China', 9596960),
 ('Brazil', 8514877),
 ('Australia', 7741220),
 ('India', 3287263),
 ('Argentina', 2780400),
 ('Kazakhstan', 2724900),
 ('Algeria', 2381741)]
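If we want to look areas up by country name rather than iterate over pairs, the same zip() drops straight into a dict. (Toy data here stands in for the scraped countries and areas_num lists.)

```python
countries = ['Country Comparison :: Area', 'Russia', 'Canada']
areas_num = [17098242, 9984670]

# Skip the spurious header entry, then build a lookup table
area_by_country = dict(zip(countries[1:], areas_num))
print(area_by_country['Canada'])  # 9984670
```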

Another introduction to scraping using these same tools can be found in The Hitchhiker's Guide to Python.

Exercise

Pick another page from the CIA world factbook, or if you're feeling adventurous, another page with data from elsewhere on the internet. Scrape data from it using these tools.


In [ ]: