Lab 2: Web Scraping

TFs: Ray Jones, Johanna Breyer, Nicolas Bonneel



In [2]:

    
import cs109style
cs109style.customize_mpl()
cs109style.customize_css()

# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 

from collections import defaultdict

import pandas as pd
import matplotlib.pyplot as plt
import requests
from pattern import web

Setting custom matplotlib visual style
Setting custom CSS for the IPython Notebook

Fetching population data from Wikipedia

In this example we will fetch data about countries and their population from Wikipedia.

The page http://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population contains several tables, covering individual countries and (sub)continents over different ranges of years. We will combine the data for all countries and all years into a single pandas DataFrame and visualize the change in population for different countries.

We will go through the following steps:

1. fetching HTML with embedded data
2. parsing the HTML to extract the data
3. collecting the data in a pandas DataFrame
4. displaying the data

To give you some starting points for your homework, we will also show the different sub-steps that can be taken to reach the presented solution.

Fetching the Wikipedia site



In [1]:

    
url = 'http://en.wikipedia.org/wiki/List_of_countries_by_past_and_future_population'
website_html = requests.get(url).text
#print website_html


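A slightly more defensive way to fetch the page can be sketched as follows. This is only an illustration, not part of the original lab: `fetch_html` and its `get` parameter are hypothetical names, and the 10-second timeout is an arbitrary assumption. The sketch checks the HTTP status before returning the text, so a failed download raises an error instead of silently handing an error page to the parser.

```python
def fetch_html(url, get=None, timeout=10):
    """Return the decoded body of `url`, raising an exception on HTTP errors.

    `get` is a hypothetical hook for substituting the HTTP call (handy for
    testing without network access); by default it is `requests.get`.
    """
    if get is None:
        import requests  # imported lazily so a stand-in can be used instead
        get = requests.get
    response = get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text
```

With network access, `website_html = fetch_html(url)` would play the same role as the `requests.get(url).text` call in the cell above.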
Parsing HTML data



In [12]:

    
def get_population_html_tables(html):
    """Parse html and return html tables of wikipedia population data."""

    dom = web.Element(html)
    
    # Step 0: look at the HTML source!

    # Step 1: get all tables
    # tbls = dom('table')
    # Step 2: keep only the tables we care about
    tbls = dom.by_class('wikitable sortable')
    return tbls

tables = get_population_html_tables(website_html)
print "table length: %d" %len(tables)
for t in tables:
    print t.attributes

table length: 6
{u'style': u'text-align: right', u'border': u'1', u'class': u'wikitable sortable'}
{u'style': u'text-align: right', u'border': u'1', u'class': u'wikitable sortable'}
{u'style': u'text-align: right', u'border': u'1', u'class': u'wikitable sortable'}
{u'style': u'text-align: right', u'border': u'1', u'class': u'wikitable sortable'}
{u'style': u'text-align: right', u'border': u'1', u'class': u'wikitable sortable'}
{u'style': u'text-align: right', u'border': u'1', u'class': u'wikitable sortable'}

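If the `pattern` library is not available, the same selection idea (keep only the tables that carry a given class) can be sketched with Python 3's standard-library `html.parser`. This is a swapped-in technique rather than the method used in this lab: `TableClassCollector` and `sample_html` are made up for illustration, and the sketch matches on class tokens in any order, which may differ from how `by_class` matches.

```python
from html.parser import HTMLParser


class TableClassCollector(HTMLParser):
    """Collect the attributes of every <table> that has all wanted classes."""

    def __init__(self, wanted_classes):
        HTMLParser.__init__(self)
        self.wanted = set(wanted_classes.split())
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != 'table':
            return
        attributes = dict(attrs)
        # a class attribute without a value is reported as None
        classes = set((attributes.get('class') or '').split())
        if self.wanted <= classes:
            self.matches.append(attributes)


# Hypothetical stand-in for the downloaded page.
sample_html = """
<table class="wikitable sortable" border="1"><tr><th>Rank</th></tr></table>
<table class="infobox"><tr><td>ignored</td></tr></table>
<table class="sortable wikitable"><tr><th>Rank</th></tr></table>
"""

collector = TableClassCollector('wikitable sortable')
collector.feed(sample_html)
print(len(collector.matches))  # number of matching tables
```

Only the two tables that carry both classes are kept; the `infobox` table is skipped.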


In [22]:

    
def table_type(tbl):
    """Return the table type, taken from the second header cell."""
    return tbl('th')[1].content

table_by_type = {}
for tbl in tables:
    typ = table_type(tbl)
    if typ not in table_by_type:
        table_by_type[typ] = list()  # equivalent to []
    table_by_type[typ].append(tbl)

# Equivalent code below

# group the tables by type
tables_by_type = defaultdict(list)  
# defaultdicts have a default value that is inserted when a new key is accessed
for tbl in tables:
    tables_by_type[table_type(tbl)].append(tbl)

print tables_by_type

defaultdict(<type 'list'>, {u'Country or territory': [Element(tag='table'), Element(tag='table'), Element(tag='table')], u'(Sub)continent': [Element(tag='table'), Element(tag='table'), Element(tag='table')]})

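The difference between the two grouping variants can be seen on a small, made-up list of `(type, table)` pairs. The strings below are placeholders for the real table elements. The sketch also shows one thing to watch out for: merely reading a missing key from a `defaultdict` inserts it.

```python
from collections import defaultdict

# Placeholder data: (table type, table) pairs.
items = [
    ('Country or territory', 'table 1'),
    ('(Sub)continent', 'table 2'),
    ('Country or territory', 'table 3'),
]

# Plain dict: the empty list has to be created explicitly.
plain = {}
for kind, tbl in items:
    plain.setdefault(kind, []).append(tbl)

# defaultdict: the empty list is created on first access.
grouped = defaultdict(list)
for kind, tbl in items:
    grouped[kind].append(tbl)

same_result = dict(grouped) == plain

# A membership test does not insert anything ...
has_missing = 'missing' in grouped
size_before = len(grouped)
# ... but indexing a missing key does.
grouped['missing']
size_after = len(grouped)
```

Here `same_result` is `True`, and the number of keys grows from 2 to 3 after the lookup of `'missing'`.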
Extracting data and filling it into a dictionary



In [34]:

    
def get_countries_population(tables):
    """Extract population data for countries from all tables and store it in dictionary."""
    
    result = defaultdict(dict)

    # Step 1: for each table, find the year headings and their column indices
    for tbl in tables:
        headers = tbl('tr')
        first_header = headers[0]
        th_s = first_header('th')
    
        years = [int(val.content) for val in th_s if val.content.isnumeric()]
        year_indices = [idx for idx, val in enumerate(th_s) if val.content.isnumeric()]
        print years
        print year_indices
        # Step 2: extract the data rows and merge them into a single dict
        rows = tbl('tr')[1:]
        for row in rows:
            tds = row('td')
            country_name = tds[1]('a')[0].content
            population_by_year = [int(tds[colidx].content.replace(',', '')) 
                                  for colidx in year_indices]
            subdict = dict(zip(years, population_by_year))
            result[country_name].update(subdict)
    
    
    return result


result = get_countries_population(tables_by_type[u'Country or territory'])
print len(result), "Countries extracted"
print result[u'Canada']


[1950, 1955, 1960, 1965, 1970, 1975, 1980]
[2, 3, 5, 7, 9, 11, 13]
[1985, 1990, 1995, 2000, 2005, 2010, 2015]
[2, 4, 6, 8, 10, 12, 14]
[2020, 2025, 2030, 2035, 2040, 2045, 2050]
[2, 4, 6, 8, 10, 12, 14]
227 Countries extracted
{1985: 25942, 2050: 41136, 1955: 16050, 2020: 36387, 1990: 27791, 1960: 18267, 2025: 37559, 1995: 29691, 1965: 20071, 2030: 38565, 2000: 31100, 1970: 21750, 2035: 39396, 2005: 32386, 1975: 23209, 2040: 40070, 2010: 33760, 1980: 24593, 2045: 40635, 1950: 14011, 2015: 35100}

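The two key ideas in this step, converting strings such as `'25,942'` to integers and merging the years from several tables into one entry per country, can be tried in isolation. The helper names `parse_population` and `merge_rows` are hypothetical; the Canada figures are taken from the output above.

```python
from collections import defaultdict


def parse_population(text):
    """Convert a string such as '25,942' to the integer 25942."""
    return int(text.replace(',', '').strip())


def merge_rows(result, years, rows):
    """Add the (year, population) pairs of each row to result[country]."""
    for name, cells in rows:
        values = (parse_population(cell) for cell in cells)
        result[name].update(zip(years, values))


result = defaultdict(dict)
# Each call stands for one table covering a different range of years.
merge_rows(result, [1950, 1955], [('Canada', ['14,011', '16,050'])])
merge_rows(result, [1985, 1990], [('Canada', ['25,942', '27,791'])])
print(sorted(result['Canada'].items()))
```

Because `update` only adds or overwrites keys, the years from the second table end up next to those from the first instead of replacing them.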
Creating a dataframe from a dictionary



In [33]:

    
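# create the DataFrame: one row per country, one column per year
df = pd.DataFrame.from_dict(result, orient='index')
# sort the columns by year (DataFrame.sort has been removed from recent pandas)
df = df.sort_index(axis=1)
print df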


<class 'pandas.core.frame.DataFrame'>
Index: 227 entries, Afghanistan to Zimbabwe
Data columns (total 21 columns):
1950    227  non-null values
1955    227  non-null values
1960    227  non-null values
1965    227  non-null values
1970    227  non-null values
1975    227  non-null values
1980    227  non-null values
1985    227  non-null values
1990    227  non-null values
1995    227  non-null values
2000    227  non-null values
2005    227  non-null values
2010    227  non-null values
2015    227  non-null values
2020    227  non-null values
2025    227  non-null values
2030    227  non-null values
2035    227  non-null values
2040    227  non-null values
2045    227  non-null values
2050    227  non-null values
dtypes: int64(21)

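To see what `orient='index'` does, it helps to try it on a tiny nested dictionary. The following sketch assumes a reasonably recent pandas. It uses two countries and two years taken from the data above, with the years deliberately out of order: the outer keys become the row labels, the inner keys become the columns, and `sort_index(axis=1)` then orders the columns.

```python
import pandas as pd

# Outer keys: countries (rows). Inner keys: years (columns), out of order.
data = {
    'Austria': {1955: 6947, 1950: 6935},
    'Canada': {1955: 16050, 1950: 14011},
}

frame = pd.DataFrame.from_dict(data, orient='index')
frame = frame.sort_index(axis=1)  # order the columns by year

print(list(frame.columns))
print(frame.loc['Canada', 1950])
```

Without `orient='index'`, the outer keys would be used as columns instead, giving one column per country.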
Some data access functions for a pandas DataFrame



In [35]:

    
subtable = df.iloc[0:2, 0:2]
print "subtable"
print subtable
print ""

column = df[1955]
print "column"
print column
print ""

row = df.iloc[0]  # row 0, selected by position
print "row"
print row
print ""

rows = df.iloc[:2]  # rows 0 and 1
print "rows"
print rows
print ""

element = df.iloc[0][1955]  # element in row 0, column 1955
print "element"
print element
print ""

# max along column
print "max"
print df[1950].max()
print ""

# axes
print "axes"
print df.axes
print ""

row = df.iloc[0]
print "row info"
print row.name
print row.index
print ""

countries =  df.index
print "countries"
print countries
print ""

print "Austria"
print df.loc['Austria']


subtable
             1950  1955
Afghanistan  8150  8891
Albania      1227  1392

column
Afghanistan             8891
Albania                 1392
Algeria                 9842
American Samoa            20
Andorra                    6
Angola                  4423
Anguilla                   5
Antigua and Barbuda       51
Argentina              18928
Armenia                 1565
Aruba                     54
Australia               9277
Austria                 6947
Azerbaijan              3314
Bahamas                   87
...
United Arab Emirates                83
United Kingdom                   50946
United States                   165069
United States Virgin Islands        28
Uruguay                           2353
Uzbekistan                        7232
Vanuatu                             59
Venezuela                         6170
Vietnam                          27738
Wallis and Futuna                    7
West Bank                          788
Western Sahara                      16
Yemen                             5265
Zambia                            2869
Zimbabwe                          3409
Name: 1955, Length: 227, dtype: int64

row
1950     8150
1955     8891
1960     9829
1965    10998
1970    12431
1975    14132
1980    15044
1985    13120
1990    13568
1995    19445
2000    22461
2005    26335
2010    29121
2015    32564
2020    36644
2025    41117
2030    45665
2035    50195
2040    54717
2045    59255
2050    63795
Name: Afghanistan, dtype: int64

rows
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, Afghanistan to Albania
Data columns (total 21 columns):
1950    2  non-null values
1955    2  non-null values
1960    2  non-null values
1965    2  non-null values
1970    2  non-null values
1975    2  non-null values
1980    2  non-null values
1985    2  non-null values
1990    2  non-null values
1995    2  non-null values
2000    2  non-null values
2005    2  non-null values
2010    2  non-null values
2015    2  non-null values
2020    2  non-null values
2025    2  non-null values
2030    2  non-null values
2035    2  non-null values
2040    2  non-null values
2045    2  non-null values
2050    2  non-null values
dtypes: int64(21)

element
8891

max
562580

axes
[Index([u'Afghanistan', u'Albania', u'Algeria', u'American Samoa', u'Andorra', u'Angola', u'Anguilla', u'Antigua and Barbuda', u'Argentina', u'Armenia', u'Aruba', u'Australia', u'Austria', u'Azerbaijan', u'Bahamas', u'Bahrain', u'Bangladesh', u'Barbados', u'Belarus', u'Belgium', u'Belize', u'Benin', u'Bermuda', u'Bhutan', u'Bolivia', u'Bosnia and Herzegovina', u'Botswana', u'Brazil', u'British Virgin Islands', u'Brunei', u'Bulgaria', u'Burkina Faso', u'Burundi', u'Cambodia', u'Cameroon', u'Canada', u'Cape Verde', u'Cayman Islands', u'Central African Republic', u'Chad', u'Chile', u'China', u'Colombia', u'Comoros', u'Congo (Brazzaville)', u'Congo (Kinshasa)', u'Cook Islands', u'Costa Rica', u'Croatia', u'Cuba', u'Curaçao', u'Cyprus', u'Czech Republic', u'Denmark', u'Djibouti', u'Dominica', u'Dominican Republic', u'Ecuador', u'Egypt', u'El Salvador', u'Equatorial Guinea', u'Eritrea', u'Estonia', u'Ethiopia', u'Faroe Islands', u'Federated States of Micronesia', u'Fiji', u'Finland', u'France', u'French Polynesia', u'Gabon', u'Gambia', u'Gaza Strip', u'Georgia', u'Germany', u'Ghana', u'Gibraltar', u'Greece', u'Greenland', u'Grenada', u'Guam', u'Guatemala', u'Guernsey', u'Guinea', u'Guinea-Bissau', u'Guyana', u'Haiti', u'Honduras', u'Hong Kong', u'Hungary', u'Iceland', u'India', u'Indonesia', u'Iran', u'Iraq', u'Ireland', u'Isle of Man', u'Israel', u'Italy', u'Ivory Coast', u'Jamaica', u'Japan', u'Jersey', u'Jordan', u'Kazakhstan', u'Kenya', u'Kiribati', u'Kuwait', u'Kyrgyzstan', u'Laos', u'Latvia', u'Lebanon', u'Lesotho', u'Liberia', u'Libya', u'Liechtenstein', u'Lithuania', u'Luxembourg', u'Macau', u'Macedonia', u'Madagascar', u'Malawi', u'Malaysia', u'Maldives', u'Mali', u'Malta', u'Marshall Islands', u'Mauritania', u'Mauritius', u'Mayotte', u'Mexico', u'Moldova', u'Monaco', u'Mongolia', u'Montenegro', u'Montserrat', u'Morocco', u'Mozambique', u'Myanmar', u'Namibia', u'Nauru', u'Nepal', u'Netherlands', u'New Caledonia', u'New Zealand', u'Nicaragua', u'Niger', 
u'Nigeria', u'North Korea', u'Northern Mariana Islands', u'Norway', u'Oman', u'Pakistan', u'Palau', u'Panama', u'Papua New Guinea', u'Paraguay', u'Peru', u'Philippines', u'Poland', u'Portugal', u'Puerto Rico', u'Qatar', u'Romania', u'Russia', u'Rwanda', u'Saint Barthélemy', u'Saint Helena, Ascension and Tristan da Cunha', u'Saint Kitts and Nevis', u'Saint Lucia', u'Saint Martin', u'Saint Pierre and Miquelon', u'Saint Vincent and the Grenadines', u'Samoa', u'San Marino', u'Saudi Arabia', u'Senegal', u'Serbia', u'Seychelles', u'Sierra Leone', u'Singapore', u'Sint Maarten', u'Slovakia', u'Slovenia', u'Solomon Islands', u'Somalia', u'South Africa', u'South Korea', u'Spain', u'Sri Lanka', u'Sudan', u'Suriname', u'Swaziland', u'Sweden', u'Switzerland', u'Syria', u'São Tomé and Príncipe', u'Taiwan', u'Tajikistan', u'Tanzania', u'Thailand', u'Timor-Leste', u'Togo', u'Tonga', u'Trinidad and Tobago', u'Tunisia', u'Turkey', u'Turkmenistan', u'Turks and Caicos Islands', u'Tuvalu', u'Uganda', u'Ukraine', u'United Arab Emirates', u'United Kingdom', u'United States', u'United States Virgin Islands', u'Uruguay', u'Uzbekistan', u'Vanuatu', u'Venezuela', u'Vietnam', u'Wallis and Futuna', u'West Bank', u'Western Sahara', u'Yemen', u'Zambia', u'Zimbabwe'], dtype=object), Int64Index([1950, 1955, 1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020, 2025, 2030, 2035, 2040, 2045, 2050], dtype=int64)]

row info
Afghanistan
Int64Index([1950, 1955, 1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020, 2025, 2030, 2035, 2040, 2045, 2050], dtype=int64)

countries
Index([u'Afghanistan', u'Albania', u'Algeria', u'American Samoa', u'Andorra', u'Angola', u'Anguilla', u'Antigua and Barbuda', u'Argentina', u'Armenia', u'Aruba', u'Australia', u'Austria', u'Azerbaijan', u'Bahamas', u'Bahrain', u'Bangladesh', u'Barbados', u'Belarus', u'Belgium', u'Belize', u'Benin', u'Bermuda', u'Bhutan', u'Bolivia', u'Bosnia and Herzegovina', u'Botswana', u'Brazil', u'British Virgin Islands', u'Brunei', u'Bulgaria', u'Burkina Faso', u'Burundi', u'Cambodia', u'Cameroon', u'Canada', u'Cape Verde', u'Cayman Islands', u'Central African Republic', u'Chad', u'Chile', u'China', u'Colombia', u'Comoros', u'Congo (Brazzaville)', u'Congo (Kinshasa)', u'Cook Islands', u'Costa Rica', u'Croatia', u'Cuba', u'Curaçao', u'Cyprus', u'Czech Republic', u'Denmark', u'Djibouti', u'Dominica', u'Dominican Republic', u'Ecuador', u'Egypt', u'El Salvador', u'Equatorial Guinea', u'Eritrea', u'Estonia', u'Ethiopia', u'Faroe Islands', u'Federated States of Micronesia', u'Fiji', u'Finland', u'France', u'French Polynesia', u'Gabon', u'Gambia', u'Gaza Strip', u'Georgia', u'Germany', u'Ghana', u'Gibraltar', u'Greece', u'Greenland', u'Grenada', u'Guam', u'Guatemala', u'Guernsey', u'Guinea', u'Guinea-Bissau', u'Guyana', u'Haiti', u'Honduras', u'Hong Kong', u'Hungary', u'Iceland', u'India', u'Indonesia', u'Iran', u'Iraq', u'Ireland', u'Isle of Man', u'Israel', u'Italy', u'Ivory Coast', u'Jamaica', u'Japan', u'Jersey', u'Jordan', u'Kazakhstan', u'Kenya', u'Kiribati', u'Kuwait', u'Kyrgyzstan', u'Laos', u'Latvia', u'Lebanon', u'Lesotho', u'Liberia', u'Libya', u'Liechtenstein', u'Lithuania', u'Luxembourg', u'Macau', u'Macedonia', u'Madagascar', u'Malawi', u'Malaysia', u'Maldives', u'Mali', u'Malta', u'Marshall Islands', u'Mauritania', u'Mauritius', u'Mayotte', u'Mexico', u'Moldova', u'Monaco', u'Mongolia', u'Montenegro', u'Montserrat', u'Morocco', u'Mozambique', u'Myanmar', u'Namibia', u'Nauru', u'Nepal', u'Netherlands', u'New Caledonia', u'New Zealand', u'Nicaragua', u'Niger', 
u'Nigeria', u'North Korea', u'Northern Mariana Islands', u'Norway', u'Oman', u'Pakistan', u'Palau', u'Panama', u'Papua New Guinea', u'Paraguay', u'Peru', u'Philippines', u'Poland', u'Portugal', u'Puerto Rico', u'Qatar', u'Romania', u'Russia', u'Rwanda', u'Saint Barthélemy', u'Saint Helena, Ascension and Tristan da Cunha', u'Saint Kitts and Nevis', u'Saint Lucia', u'Saint Martin', u'Saint Pierre and Miquelon', u'Saint Vincent and the Grenadines', u'Samoa', u'San Marino', u'Saudi Arabia', u'Senegal', u'Serbia', u'Seychelles', u'Sierra Leone', u'Singapore', u'Sint Maarten', u'Slovakia', u'Slovenia', u'Solomon Islands', u'Somalia', u'South Africa', u'South Korea', u'Spain', u'Sri Lanka', u'Sudan', u'Suriname', u'Swaziland', u'Sweden', u'Switzerland', u'Syria', u'São Tomé and Príncipe', u'Taiwan', u'Tajikistan', u'Tanzania', u'Thailand', u'Timor-Leste', u'Togo', u'Tonga', u'Trinidad and Tobago', u'Tunisia', u'Turkey', u'Turkmenistan', u'Turks and Caicos Islands', u'Tuvalu', u'Uganda', u'Ukraine', u'United Arab Emirates', u'United Kingdom', u'United States', u'United States Virgin Islands', u'Uruguay', u'Uzbekistan', u'Vanuatu', u'Venezuela', u'Vietnam', u'Wallis and Futuna', u'West Bank', u'Western Sahara', u'Yemen', u'Zambia', u'Zimbabwe'], dtype=object)

Austria
1950    6935
1955    6947
1960    7047
1965    7271
1970    7467
1975    7579
1980    7549
1985    7560
1990    7723
1995    8047
2000    8113
2005    8185
2010    8214
2015    8224
2020    8220
2025    8190
2030    8120
2035    8009
2040    7867
2045    7702
2050    7521
Name: Austria, dtype: int64

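The distinction between selecting by label and selecting by position is easy to mix up here, because the column labels are themselves integers. A minimal sketch, assuming a pandas version that provides `.loc` and `.iloc`, on the 2x2 subtable shown above:

```python
import pandas as pd

frame = pd.DataFrame(
    {1950: [8150, 1227], 1955: [8891, 1392]},
    index=['Afghanistan', 'Albania'],
)

first_row = frame.iloc[0]                   # by position: first row
by_label = frame.loc['Albania']             # by label: the row named 'Albania'
element = frame.loc['Afghanistan', 1955]    # row label, column label
same_element = frame.iloc[0, 1]             # row position, column position

print(first_row.name)
print(element == same_element)
```

Note that `1955` is a column label for `.loc`, whereas `.iloc` needs the column position `1`.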
Plotting population of 4 countries



In [36]:

    
plotCountries = ['Austria', 'Germany', 'United States', 'France']

for country in plotCountries:
    row = df.loc[country]
    plt.plot(row.index, row, label=row.name)

plt.ylim(bottom=0)  # start y axis at 0

plt.xticks(rotation=70)
plt.legend(loc='best')
plt.xlabel("Year")
plt.ylabel("# people (thousands)")  # the table values are in thousands
plt.title("Population of countries")

    Out[36]:





<matplotlib.text.Text at 0x10d1c5bd0>

Plot the 5 most populous countries in 2010 and 2050



In [37]:

    
def plot_populous(df, year):
    """Plot the population over time of the 5 most populous countries in `year`."""
    # sort the table by the values in the `year` column, largest first
    df_by_year = df.sort_values(by=year, ascending=False)

    plt.figure()
    for i in range(5):
        row = df_by_year.iloc[i]
        plt.plot(row.index, row, label=row.name)

    plt.ylim(bottom=0)  # start y axis at 0

    plt.xticks(rotation=70)
    plt.legend(loc='best')
    plt.xlabel("Year")
    plt.ylabel("# people (thousands)")  # the table values are in thousands
    plt.title("Most populous countries in %d" % year)

plot_populous(df, 2010)
plot_populous(df, 2050)

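As an alternative to sorting the whole table, recent pandas versions offer `DataFrame.nlargest`, which returns only the top rows for a given column. The sketch below assumes such a version and uses three countries with their 2010 and 2050 values from the data above; it picks the top 2 rather than the top 5 to keep the example small.

```python
import pandas as pd

frame = pd.DataFrame(
    {2010: [29121, 8214, 33760], 2050: [63795, 7521, 41136]},
    index=['Afghanistan', 'Austria', 'Canada'],
)

top_2010 = list(frame.nlargest(2, 2010).index)
top_2050 = list(frame.nlargest(2, 2050).index)

print(top_2010)
print(top_2050)
```

The ranking changes between the two years: Canada leads this small sample in 2010, Afghanistan in 2050.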


In [ ]: