Getting Started with Web Scraping

Workshop on Web Scraping and Text Processing with Python

by Radhika Saksena, Princeton University, saksena@princeton.edu, radhika.saksena@gmail.com

Disclaimer: The code examples presented in this and subsequent handouts are for educational purposes only. Please seek advice from a legal expert about the legal implications of using this code for web scraping.

1. Automating web downloads with Python

1.1 Python's urllib2 & urllib modules

Example: Downloading GDELT datasets

Our first example demonstrates Python's urllib2 module, which provides methods to download data and interact with web content and protocols, all from within a Python script. urllib2 can be used to retrieve files in a variety of formats from the web, such as HTML, XML, JSON, plain text, and PDF. We will be using this module quite a lot in this workshop.

  • The Global Data on Events, Location and Tone (GDELT) is a big data resource comprising hundreds of millions of global events that have been geotagged and coded according to hundreds of event categories of conflict and cooperation.

  • Here is an example where urllib2 is used to download the GDELT dataset published on May 23, 2014, available as a compressed archive in CSV format (20140523.export.CSV.zip), from the GDELT website at http://data.gdeltproject.org/events/index.html.


In [1]:
import urllib2

resp = urllib2.urlopen("http://data.gdeltproject.org/events/20140523.export.CSV.zip")

with open("20140523.export.CSV.zip","wb") as fout:                                          
        fout.write(resp.read())
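  • The response object returned by urlopen() also exposes some useful metadata about the request. The short sketch below, reusing the URL from the cell above, prints the HTTP status code, the URL that was actually fetched, and the response headers.

In [ ]:
import urllib2

resp = urllib2.urlopen("http://data.gdeltproject.org/events/20140523.export.CSV.zip")

print(resp.getcode())    # HTTP status code, e.g. 200 on success
print(resp.geturl())     # URL that was actually fetched (after any redirects)
print(resp.info())       # HTTP response headers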
  • Instead of a single file, let's fetch a sequence of compressed archives from the GDELT website and store them locally. Specifically, the script getGDELT.py downloads the GDELT daily event files for March 2014 (http://gdeltproject.org/data.html#dailyupdates); to keep the demonstration short, only the first six days are fetched. Note the "wb" (write binary) mode when opening the destination file, and the use of a Python for loop to construct the URL for each file in the sequence and retrieve it.

In [1]:
%load getGDELT.py

In [2]:
# Copyright 2014, Radhika S. Saksena (radhika dot saksena at gmail dot com)
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0

# Web scraping and text processing with Python workshop

import urllib2
import sys
import os
import time

'''Download the full resolution GDELT event dataset for March 2014.'''

month = 3

# make a gdelt/ directory if it doesn't already exist
if not os.path.exists(os.getcwd() + "/gdelt"):
    os.mkdir(os.getcwd() + "/gdelt")

# try downloading the files comprising events data for March 2014
#for day in range(1,32):    # uncomment to fetch the whole of March 2014
for day in range(1,7):      # first six days only, to keep the demo short

    # construct the URL from which we retrieve the (compressed) events files
    fileName = "2014%02d%02d.export.CSV.zip" % (month,day)
    fileURL = "http://data.gdeltproject.org/events/%s"  % (fileName)
    localFile = os.getcwd() + "/gdelt/" + fileName

    print("Downloading file " + fileURL + " ...")

    # use the urllib2 module to fetch the events data file
    resp = urllib2.urlopen(fileURL)

    # write the retrieved events file to a local file
    with open(localFile,"wb") as fout:
        fout.write(resp.read())

    print("Downloaded file %s." % fileURL)

    # inject delay between consecutive fetches
    time.sleep(5)


Downloading file http://data.gdeltproject.org/events/20140301.export.CSV.zip ...
Downloaded file http://data.gdeltproject.org/events/20140301.export.CSV.zip.
Downloading file http://data.gdeltproject.org/events/20140302.export.CSV.zip ...
Downloaded file http://data.gdeltproject.org/events/20140302.export.CSV.zip.
Downloading file http://data.gdeltproject.org/events/20140303.export.CSV.zip ...
Downloaded file http://data.gdeltproject.org/events/20140303.export.CSV.zip.
Downloading file http://data.gdeltproject.org/events/20140304.export.CSV.zip ...
Downloaded file http://data.gdeltproject.org/events/20140304.export.CSV.zip.
Downloading file http://data.gdeltproject.org/events/20140305.export.CSV.zip ...
Downloaded file http://data.gdeltproject.org/events/20140305.export.CSV.zip.
Downloading file http://data.gdeltproject.org/events/20140306.export.CSV.zip ...
Downloaded file http://data.gdeltproject.org/events/20140306.export.CSV.zip.
  • When downloading binary files, the urllib module's urlretrieve() method can be handy too.

In [ ]:
import urllib

urllib.urlretrieve("http://data.gdeltproject.org/events/20140519.export.CSV.zip","20140519.CSV.zip")
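  • urlretrieve() also accepts an optional reporthook callback that is invoked as successive blocks of data arrive, which can be used to report download progress. A minimal sketch, reusing the same GDELT file as above (the progress function and the 100-block reporting interval are just illustrative):

In [ ]:
import urllib

def progress(block_count, block_size, total_size):
    # total_size may be -1 if the server does not report a Content-Length
    if total_size > 0 and block_count % 100 == 0:
        percent = min(100.0, 100.0*block_count*block_size/total_size)
        print("Downloaded %.0f%% so far ..." % percent)

urllib.urlretrieve("http://data.gdeltproject.org/events/20140519.export.CSV.zip",
                   "20140519.CSV.zip", reporthook=progress)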

1.2 Automating OS commands from a Python script

Example: Unzipping downloaded GDELT datasets

  • Python provides tools to execute operating system commands from within a Python script. One way of doing this is with the os module's system() method. For example, the code snippet below shows how to unzip the file gdelt/20140519.export.CSV.zip. Note that you need an unzip utility on your machine, or you can replace unzip in the code below with an equivalent command available on your system. Alternatively, Python's built-in zipfile module can extract ZIP archives without any external utility, as shown below.

In [ ]:
import os
filename = "gdelt/20140519.export.CSV.zip"
os.system("unzip " + filename)

In the script getGDELT2.py below, each archive is downloaded and then unzipped into the corresponding CSV file from within the Python script using os.system().


In [3]:
%load getGDELT2.py

In [5]:
# Copyright 2014, Radhika S. Saksena (radhika dot saksena at gmail dot com)
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0

# Web scraping and text processing with Python workshop

import urllib2
import sys
import os
import time

'''Download and uncompress the full resolution GDELT event dataset for March 2014.'''

month = 3

# make a gdelt/ directory if it doesn't already exist
if not os.path.exists(os.getcwd() + "/gdelt"):
    os.mkdir(os.getcwd() + "/gdelt")

# try downloading the files comprising events data for March 2014
#for day in range(1,32):    # uncomment to fetch the whole of March 2014
for day in range(1,7):      # first six days only, to keep the demo short

    try:
        # construct the URL from which we retrieve the (compressed) events file 
        fileName = "2014%02d%02d.export.CSV.zip" % (month,day)
        fileURL = "http://data.gdeltproject.org/events/%s"  % (fileName)
        localFile = os.getcwd() + "/gdelt/" + fileName

        print("Downloading file " + fileURL + " ...")

        # use the urllib2 module to fetch the events data file
        resp = urllib2.urlopen(fileURL)

        # write the retrieved events file to a local file
        with open(localFile,"wb") as fout:
            fout.write(resp.read())

        print("Downloaded file %s." % fileURL)

        # uncompress the downloaded events file
        os.chdir("gdelt")
        os.system("unzip {0}".format(localFile))
        os.chdir("..")
    except:
        # if the download or unzip fails for this day, skip it and move on
        # (exception handling is covered in more detail in Section 2)
        pass

    # inject delay between consecutive fetches
    time.sleep(5)


Downloading file http://data.gdeltproject.org/events/20140301.export.CSV.zip ...
Downloaded file http://data.gdeltproject.org/events/20140301.export.CSV.zip.
Downloading file http://data.gdeltproject.org/events/20140302.export.CSV.zip ...
Downloaded file http://data.gdeltproject.org/events/20140302.export.CSV.zip.
Downloading file http://data.gdeltproject.org/events/20140303.export.CSV.zip ...
Downloaded file http://data.gdeltproject.org/events/20140303.export.CSV.zip.
Downloading file http://data.gdeltproject.org/events/20140304.export.CSV.zip ...
Downloaded file http://data.gdeltproject.org/events/20140304.export.CSV.zip.
Downloading file http://data.gdeltproject.org/events/20140305.export.CSV.zip ...
Downloaded file http://data.gdeltproject.org/events/20140305.export.CSV.zip.
Downloading file http://data.gdeltproject.org/events/20140306.export.CSV.zip ...
Downloaded file http://data.gdeltproject.org/events/20140306.export.CSV.zip.
  • To check that things have progressed correctly, run the file command (on Linux/Mac OS X) to determine the type of files that have been downloaded and uncompressed.

In [6]:
!file gdelt/20140306.export.CSV


gdelt/20140306.export.CSV: UTF-8 Unicode English text, with very long lines
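  • On systems without the file utility (e.g. Windows), a quick Python-side check works too: read the first line of one of the unzipped files and confirm that it is tab-separated. A minimal sketch:

In [ ]:
# read the first line of one of the unzipped GDELT files and count its fields
with open("gdelt/20140306.export.CSV") as fin:
    firstLine = fin.readline()

print("Number of tab-separated fields: %d" % len(firstLine.split("\t")))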

1.3 Visualizing scraped data with Python (or R)

The codebook for these GDELT datasets is available at http://data.gdeltproject.org/documentation/GDELT-Data_Format_Codebook.pdf and a text file containing the column header labels is available at http://gdeltproject.org/data/lookups/CSV.header.historical.txt.

Example: News Frequency Distribution

  • The CSV events files can be loaded into a pandas DataFrame for further processing. In the code below, plotGDELTHist.py, we plot the frequency with which each country appears in the events data for March 2014. We also apply a cut-off so that only countries whose frequency exceeds a minimum threshold are plotted.
  • Please note that this example is an initial rough analysis. More pre-processing and analytical machinery might need to be employed to extract substantive insights.

In [7]:
%matplotlib inline
%load plotGDELTHist.py

In [8]:
# Copyright 2014, Radhika S. Saksena (radhika dot saksena at gmail dot com)
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0

# Web scraping and text processing with Python workshop

import sys
import os
import time
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import urllib2

warnings.filterwarnings('ignore')

'''Bar chart of event counts per country March 2014.'''

month = 3 # set month to March

# exit if the gdelt/ data directory doesn't exist
if not os.path.exists(os.getcwd() + "/gdelt"):
    sys.exit("Data directory gdelt/ does not exist.")

# get the column names for the GDELT dataset
headerURL = "http://gdeltproject.org/data/lookups/CSV.header.historical.txt"
headerStr = urllib2.urlopen(headerURL).read()
# strip the trailing newline (if any) before splitting on tabs
colNames = headerStr.strip().split("\t")
#colNames = open("CSV.header.historical.txt","r").readline().split("\t")
# append SOURCEURL, which the daily update files include but the historical header does not
colNames.append("SOURCEURL")

# initialize the per-country frequency table as an empty Series
allCnt = pd.Series()

#for day in range(1,32):    # uncomment to process the whole of March 2014
for day in range(1,7):      # first six days only, to keep the demo short
    # construct the URL from which we will try to retrieve the record
    fileName = "2014%02d%02d.export.CSV" % (month,day)
    localFile = os.getcwd() + "/gdelt/" + fileName

    # check if the dataset for this date exists
    if(not os.path.exists(localFile)):
        continue

    frame = pd.read_csv(localFile,sep="\t",names=colNames)

    # count the event towards a country if that country appears as either actor
    actor1Cnt = frame['Actor1CountryCode'].value_counts()
    actor2Cnt = frame['Actor2CountryCode'].value_counts()

    # add with fill_value=0 so that countries missing from one of the columns
    # (or from an earlier day) are not turned into NaN
    allCnt = allCnt.add(actor1Cnt, fill_value=0).add(actor2Cnt, fill_value=0)

# convert the frequency table in to a form suitable for plotting
allCnt = allCnt.order(ascending=True)

maxCountry,maxCount = max(allCnt.iteritems(),key=lambda x: x[1])
print("Maximum frequency encountered: " + str(maxCount))
print("Maximum frequency encountered for country: " + maxCountry)


# subset the frequency table to save only high frequency entries
allCnt = allCnt[allCnt > 0.05*maxCount]

# take the log of the frequencies
allCnt = np.log(allCnt)



# now plot the frequency table as a bar graph
plt.figure(figsize=(8,8))
pos = np.arange(len(allCnt.values))

plt.title('Frequency distribution of countries in the news database.')
plt.barh(pos,allCnt.values,color=["#006633"]*len(pos))

annotations = ["{0:4.2e}".format(np.exp(v)) for v in allCnt.values]
for p,c,val in zip(pos,annotations,allCnt.values):
    plt.annotate(str(c), xy=(val,p+.5),va='center')

ticks = plt.yticks(pos + .5,allCnt.keys())

plt.grid(axis='x',color='white',linestyle='-')

plt.show()


Maximum frequency encountered: 230700.0
Maximum frequency encountered for country: USA

2. Handling (HTTP and other) Exceptions Gracefully

2.1 HTTP Errors

Example: U.S. Congressional Record

  • Now let's download the Congressional Record from the U.S. Congress's open data website, http://beta.congress.gov/congressional-record/browse-by-date/. The task is to use urllib2 to download the daily record for the month of May 2014 and write the downloaded PDF files to a local directory named archive/.
  • The code in getCongressBreaks.py constructs the URLs of the Congressional Record documents for May 2014. (In general, it is a good idea to first open the website in a browser, inspect the HTML source and devise a strategy for constructing the URLs that need to be fetched.) It then downloads each URL in turn using a for loop and the urllib2 module.

In [9]:
%load getCongressBreaks.py

In [10]:
# Copyright 2014, Radhika S. Saksena (radhika dot saksena at gmail dot com)
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0

# Web scraping and text processing with Python workshop

import urllib2
import sys
import os
import time

'''Download "Entire Issue" of the Congressional Record for the period May 1 - 31 2014.'''

month = 5 # set month to May

# make an archive/ directory if it doesn't already exist
if not os.path.exists(os.getcwd() + "/archive"):
    os.mkdir(os.getcwd() + "/archive")

# try downloading the Congressional Record for May 2014
#for day in range(1,32):    # uncomment to fetch the whole of May 2014
for day in range(1,6):      # first five days only, to keep the demo short

    # construct the URL from which we will try to retrieve the record
    fileName = "CREC-2014-%02d-%02d.pdf" % (month,day)
    fileURL = "http://beta.congress.gov/crec/2014/%02d/%02d/%s"  % (month,day,fileName)
    localFile = os.getcwd() + "/archive/" + fileName

    # use the urllib2 module to fetch the record
    resp = urllib2.urlopen(fileURL)

    # write the record (PDF file) to a local file
    with open(localFile,"wb") as fout:
        fout.write(resp.read())

    print("Downloaded file %s." % fileURL)

    # inject interval between consecutive requests
    time.sleep(5)


Downloaded file http://beta.congress.gov/crec/2014/05/01/CREC-2014-05-01.pdf.
Downloaded file http://beta.congress.gov/crec/2014/05/02/CREC-2014-05-02.pdf.
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-10-2715a3604285> in <module>()
     28 
     29     # use the urllib2 module to fetch the record
---> 30     resp = urllib2.urlopen(fileURL)
     31 
     32     # write the record (PDF file) to a local file

//anaconda/python.app/Contents/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout)
    125     if _opener is None:
    126         _opener = build_opener()
--> 127     return _opener.open(url, data, timeout)
    128 
    129 def install_opener(opener):

//anaconda/python.app/Contents/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
    408         for processor in self.process_response.get(protocol, []):
    409             meth = getattr(processor, meth_name)
--> 410             response = meth(req, response)
    411 
    412         return response

//anaconda/python.app/Contents/lib/python2.7/urllib2.pyc in http_response(self, request, response)
    521         if not (200 <= code < 300):
    522             response = self.parent.error(
--> 523                 'http', request, response, code, msg, hdrs)
    524 
    525         return response

//anaconda/python.app/Contents/lib/python2.7/urllib2.pyc in error(self, proto, *args)
    446         if http_err:
    447             args = (dict, 'default', 'http_error_default') + orig_args
--> 448             return self._call_chain(*args)
    449 
    450 # XXX probably also want an abstract factory that knows when it makes

//anaconda/python.app/Contents/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
    380             func = getattr(handler, meth_name)
    381 
--> 382             result = func(*args)
    383             if result is not None:
    384                 return result

//anaconda/python.app/Contents/lib/python2.7/urllib2.pyc in http_error_default(self, req, fp, code, msg, hdrs)
    529 class HTTPDefaultErrorHandler(BaseHandler):
    530     def http_error_default(self, req, fp, code, msg, hdrs):
--> 531         raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    532 
    533 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 404: Not Found

  • The code in the U.S. Congressional Record example above crashes when one of the URLs that we constructed cannot be found. This is an HTTP error with status code 404. Some of the HTTP status codes commonly encountered when accessing websites are:
    • 200 (OK)
    • 301 (Moved Permanently)
    • 400 (Bad Request)
    • 403 (Forbidden)
    • 404 (File Not Found)
    • 503 (Service Unavailable).
  • A full list of HTTP status codes is available on Wikipedia at http://en.wikipedia.org/wiki/List_of_HTTP_status_codes.
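  • Redirect responses such as 301 are followed automatically by urllib2, so a successful urlopen() call reports the status code of the final request. In the sketch below, geturl() shows the URL that was ultimately fetched; whether a redirect actually occurs depends on the site (http://princeton.edu is just an illustrative URL).

In [ ]:
import urllib2

# urllib2 follows redirects (e.g. 301/302) transparently
resp = urllib2.urlopen("http://princeton.edu")
print("Final URL: {0}".format(resp.geturl()))
print("Status code: {0}".format(resp.getcode()))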

2.2 Handling Exceptions

  • When we are automating web content extraction, it's generally a good idea to do some error handling so that when such HTTP errors are encountered, our code does not exit abruptly. Python has built-in exception handling capabilities, namely the try and except blocks, to gracefully handle such errors rather than exiting.
  • The first example below crashes with an IndexError when the list is indexed past its length. The second version wraps the list access in try and except blocks: instead of crashing, the code prints some information about the exception in the except block and continues with the remaining processing.

In [ ]:
# This code raises an IndexError exception and crashes - why?

twitterUsers = ["@pmharper","@Kathleen_Wynne","@JustinTrudeau","@timhudak","@DenisCoderre"]

# do some intermediate processing

# print out list of Twitter users
print("Twitter users in list: ")
for i in range(0,6):
     print(twitterUsers[i])
        
# more processing - for example, mine tweets of these users
print("Continuing with more Twitter processing.")

In [ ]:
# This script does not crash due to proper exception handling.

twitterUsers = ["@pmharper","@Kathleen_Wynne","@JustinTrudeau","@timhudak","@DenisCoderre"]

# do some intermediate processing

# print out list of Twitter users
print("Twitter users in list: ")
for i in range(0,6):
    try:
        print(twitterUsers[i])
    except Exception as e:
        print("Encountered exception: {0}.".format(e))
        pass
        
# do some more processing - for example, mine tweets of these users
print("Continuing with more Twitter processing.")

Example: U.S. Congressional Record (cont.)

  • Now let's revisit the U.S. Congressional Record example, this time incorporating Python exception handling to respond gracefully to HTTP errors.

In [11]:
%load getCongressPDFs.py

In [13]:
# Copyright 2014, Radhika S. Saksena (radhika dot saksena at gmail dot com)
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0

# Web scraping and text processing with Python workshop

import urllib2
import sys
import os
import time

'''Download "Entire Issue" of the Congressional Record for the period May 1 - 31 2014.'''

month = 5 # set month to May

# make an archive/ directory if it doesn't already exist
if not os.path.exists(os.getcwd() + "/archive"):
    os.mkdir(os.getcwd() + "/archive")

# try downloading the Congressional Record for May 2014
#for day in range(1,32):    # uncomment to fetch the whole of May 2014
for day in range(1,6):      # first five days only, to keep the demo short

    # construct the URL from which we will try to retrieve the record
    fileName = "CREC-2014-%02d-%02d.pdf" % (month,day)
    fileURL = "http://beta.congress.gov/crec/2014/%02d/%02d/%s"  % (month,day,fileName)
    localFile = os.getcwd() + "/archive/" + fileName

    try:
        # use the urllib2 module to fetch the record
        resp = urllib2.urlopen(fileURL)

        # write the record (PDF file) to a local file
        with open(localFile,"wb") as fout:
            fout.write(resp.read())

        print("Downloaded file %s." % fileURL)

    except:
        # if the record is unavailable, ignore and try the next day
        print("Encountered exception while trying to download file {0}.".format(fileName))
        pass

    # inject interval between consecutive requests
    time.sleep(5)


Downloaded file http://beta.congress.gov/crec/2014/05/01/CREC-2014-05-01.pdf.
Downloaded file http://beta.congress.gov/crec/2014/05/02/CREC-2014-05-02.pdf.
Encountered exception while trying to download file CREC-2014-05-03.pdf.
Encountered exception while trying to download file CREC-2014-05-04.pdf.
Downloaded file http://beta.congress.gov/crec/2014/05/05/CREC-2014-05-05.pdf.
  • You can also check the status code of an HTTP request made with urllib2.urlopen(). For example, the status code of the request shown below is 200, which indicates that urllib2.urlopen() succeeded in accessing the URL passed to it.

In [1]:
import urllib2

resp = urllib2.urlopen("http://www.princeton.edu")
print("Response code is: {0}.".format(resp.code))

# take some action if resp.code == 200

# example of a 404 (File Not Found) error: for a missing page, urlopen()
# raises an HTTPError instead of returning a response (see the sketch below)
# resp = urllib2.urlopen("http://www.princeton.edu/urlerror")
# print("Response code is: {0}.".format(resp.code))


Response code is: 200.
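  • Note that for error responses such as 404, urllib2.urlopen() does not return a response object at all: it raises an HTTPError, and the status code has to be read from the exception itself. A minimal sketch, reusing the non-existent URL from the commented-out lines above:

In [ ]:
import urllib2

try:
    resp = urllib2.urlopen("http://www.princeton.edu/urlerror")
    print("Response code is: {0}.".format(resp.code))
except urllib2.HTTPError as e:
    # HTTPError carries the status code of the failed request (e.g. 404)
    print("Request failed with status code: {0}.".format(e.code))
except urllib2.URLError as e:
    # URLError covers lower-level problems such as an unreachable host
    print("Request failed: {0}.".format(e.reason))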
