Workshop on Web Scraping and Text Processing with Python
by Radhika Saksena, Princeton University, saksena@princeton.edu, radhika.saksena@gmail.com
Disclaimer: The code examples presented in this and subsequent handouts are for educational purposes only. Please seek advice from a legal expert about the legal implications of using this code for web scraping.
Our first example demonstrates Python's urllib2 module. The urllib2 module provides various methods to download data and interact with web content and protocols, all from within a Python script. urllib2 can be used to retrieve files in many different formats from the web, such as HTML, XML, JSON, plain text, and PDF. We will be using this module quite a lot in this workshop.
The Global Database of Events, Language, and Tone (GDELT) is a big data resource comprising hundreds of millions of global events that have been geotagged and coded according to hundreds of event categories of conflict and co-operation.
Here is an example where urllib2 is used to download the GDELT dataset published on May 23, 2014, available as a compressed archive in CSV format (20140523.export.CSV.zip), from the GDELT website at http://data.gdeltproject.org/events/index.html.
In [1]:
import urllib2
resp = urllib2.urlopen("http://data.gdeltproject.org/events/20140523.export.CSV.zip")
with open("20140523.export.CSV.zip","wb") as fout:
    fout.write(resp.read())
The script getGDELT.py below downloads the daily update files released in March 2014 (http://gdeltproject.org/data.html#dailyupdates). Note the wb mode when opening the destination file. Also, note the use of the Python for loop to construct the URLs for a sequence of files and retrieve them.
In [1]:
%load getGDELT.py
In [2]:
# Copyright 2014, Radhika S. Saksena (radhika dot saksena at gmail dot com)
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0
# Web scraping and text processing with Python workshop
import urllib2
import sys
import os
import time
'''Download the full resolution GDELT event dataset for March 2014.'''
month = 3
# make a gdelt/ directory if it doesn't already exist
if not os.path.exists(os.getcwd() + "/gdelt"):
    os.mkdir(os.getcwd() + "/gdelt")
# try downloading the files comprising events data for March 2014
#for day in range(1,31):
for day in range(1,7):
    # construct the URL from which we retrieve the (compressed) events files
    fileName = "2014%02d%02d.export.CSV.zip" % (month,day)
    fileURL = "http://data.gdeltproject.org/events/%s" % (fileName)
    localFile = os.getcwd() + "/gdelt/" + fileName
    print("Downloading file " + fileURL + " ...")
    # use the urllib2 module to fetch the events data file
    resp = urllib2.urlopen(fileURL)
    # write the retrieved events file to a local file
    with open(localFile,"wb") as fout:
        fout.write(resp.read())
    print("Downloaded file %s." % fileURL)
    # inject delay between consecutive fetches
    time.sleep(5)
Alternatively, the urllib module's urlretrieve() function can download a URL and save it directly to a local file in a single call:
In [ ]:
import urllib
urllib.urlretrieve("http://data.gdeltproject.org/events/20140519.export.CSV.zip","20140519.CSV.zip")
The downloaded zip archives can be uncompressed from within Python using the os module's system() function. For example, the code snippet below shows how to unzip the file gdelt/20140519.export.CSV.zip. Note that you need a utility called unzip on your machine; if it is not available, replace unzip in the code below with an equivalent archive utility installed on your system.
In [ ]:
import os
filename = "gdelt/20140519.export.CSV.zip"
os.system("unzip " + filename)
In the script getGDELT2.py below, the files are downloaded and then unzipped into the corresponding CSV file from within the Python script using Python's os.system() function.
In [3]:
%load getGDELT2.py
In [5]:
# Copyright 2014, Radhika S. Saksena (radhika dot saksena at gmail dot com)
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0
# Web scraping and text processing with Python workshop
import urllib2
import sys
import os
import time
'''Download and uncompress the full resolution GDELT event dataset for March 2014.'''
month = 3
# make a gdelt/ directory if it doesn't already exist
if not os.path.exists(os.getcwd() + "/gdelt"):
    os.mkdir(os.getcwd() + "/gdelt")
# try downloading the files comprising events data for March 2014
#for day in range(1,31):
for day in range(1,7):
    try:
        # construct the URL from which we retrieve the (compressed) events file
        fileName = "2014%02d%02d.export.CSV.zip" % (month,day)
        fileURL = "http://data.gdeltproject.org/events/%s" % (fileName)
        localFile = os.getcwd() + "/gdelt/" + fileName
        print("Downloading file " + fileURL + " ...")
        # use the urllib2 module to fetch the events data file
        resp = urllib2.urlopen(fileURL)
        # write the retrieved events file to a local file
        with open(localFile,"wb") as fout:
            fout.write(resp.read())
        print("Downloaded file %s." % fileURL)
        # uncompress the downloaded events file
        os.chdir("gdelt")
        os.system("unzip {0}".format(localFile))
        os.chdir("..")
    except:
        # if the download or unzip fails, skip to the next day
        pass
    # inject delay between consecutive fetches
    time.sleep(5)
We can check the type of one of the uncompressed files with the shell's file utility:
In [6]:
!file gdelt/20140306.export.CSV
The codebook for these GDELT datasets is available at http://data.gdeltproject.org/documentation/GDELT-Data_Format_Codebook.pdf and a text file containing the column header labels is available at http://gdeltproject.org/data/lookups/CSV.header.historical.txt.
The downloaded CSV files can be loaded and analyzed with the pandas module. In the script below, plotGDELTHist.py, we'll plot the frequencies of country occurrences in the events database for the month of March 2014. We'll also use a cut-off to visualize only those countries whose frequency of occurrence exceeds a minimum threshold.
In [7]:
%matplotlib inline
%load plotGDELTHist.py
In [8]:
# Copyright 2014, Radhika S. Saksena (radhika dot saksena at gmail dot com)
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0
# Web scraping and text processing with Python workshop
import sys
import os
import time
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import urllib2
warnings.filterwarnings('ignore')
'''Bar chart of event counts per country March 2014.'''
month = 3 # set month to March
# exit if the gdelt/ data directory doesn't exist
if not os.path.exists(os.getcwd() + "/gdelt"):
    sys.exit("Data directory does not exist.")
# get the column names for the GDELT dataset
headerURL = "http://gdeltproject.org/data/lookups/CSV.header.historical.txt"
headerStr = urllib2.urlopen(headerURL).read()
colNames = headerStr.strip().split("\t")
#colNames = open("CSV.header.historical.txt","r").readline().split("\t")
colNames.append("SOURCEURL")
# initialize frequency table
allCnt = pd.Series()
#for day in range(1,31):
for day in range(1,7):
    # construct the path of the local file that holds this day's events
    fileName = "2014%02d%02d.export.CSV" % (month,day)
    localFile = os.getcwd() + "/gdelt/" + fileName
    # check if the dataset for this date exists
    if(not os.path.exists(localFile)):
        continue
    frame = pd.read_csv(localFile,sep="\t",names=colNames)
    # assign the event to a country if it's either one of the actors
    actor1Cnt = frame['Actor1CountryCode'].value_counts()
    actor2Cnt = frame['Actor2CountryCode'].value_counts()
    # accumulate the per-day counts, treating absent countries as zero
    allCnt = allCnt.add(actor1Cnt, fill_value=0).add(actor2Cnt, fill_value=0)
# convert the frequency table into a form suitable for plotting
allCnt = allCnt.order(ascending=True)
maxCountry,maxCount = max(allCnt.iteritems(),key=lambda x: x[1])
print("Maximum frequency encountered: " + str(maxCount))
print("Maximum frequency encountered for country: " + maxCountry)
# subset the frequency table to save only high frequency entries
allCnt = allCnt[allCnt > 0.05*maxCount]
# take the log of the frequencies
allCnt = np.log(allCnt)
# now plot the frequency table as a bar graph
plt.figure(figsize=(8,8))
pos = np.arange(len(allCnt.values))
plt.title('Frequency distribution of countries in the news database.')
plt.barh(pos,allCnt.values,color=["#006633"]*len(pos))
annotations = ["{0:4.2e}".format(np.exp(v)) for v in allCnt.values]
for p,c,val in zip(pos,annotations,allCnt.values):
    plt.annotate(str(c), xy=(val,p+.5),va='center')
ticks = plt.yticks(pos + .5,allCnt.keys())
plt.grid(axis='x',color='white',linestyle='-')
plt.show()
The next example uses urllib2 to download the daily Congressional Record for the month of May 2014. The downloaded PDF files are written to a local directory named archive/. The script getCongressBreaks.py first constructs the URLs of the Congressional Record documents for May 2014. (In general, it is a good idea to first open the website in a browser, inspect the HTML source and devise a strategy for constructing the URLs that need to be fetched.) The code then downloads each of the URLs using a for loop and the urllib2 module.
In [9]:
%load getCongressBreaks.py
In [10]:
# Copyright 2014, Radhika S. Saksena (radhika dot saksena at gmail dot com)
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0
# Web scraping and text processing with Python workshop
import urllib2
import sys
import os
import time
'''Download "Entire Issue" of the Congressional Record for the period May 1 - 31 2014.'''
month = 5 # set month to May
# make an archive/ directory if it doesn't already exist
if not os.path.exists(os.getcwd() + "/archive"):
    os.mkdir(os.getcwd() + "/archive")
# try downloading the Congressional Record for May 2014
#for day in range(1,31):
for day in range(1,6):
    # construct the URL from which we will try to retrieve the record
    fileName = "CREC-2014-%02d-%02d.pdf" % (month,day)
    fileURL = "http://beta.congress.gov/crec/2014/%02d/%02d/%s" % (month,day,fileName)
    localFile = os.getcwd() + "/archive/" + fileName
    # use the urllib2 module to fetch the record
    resp = urllib2.urlopen(fileURL)
    # write the record (PDF file) to a local file
    with open(localFile,"wb") as fout:
        fout.write(resp.read())
    print("Downloaded file %s." % fileURL)
    # inject interval between consecutive requests
    time.sleep(5)
If the record for a particular day is not available on the server, urllib2.urlopen() will raise an exception and the script above will crash. Python provides try and except blocks to gracefully handle such errors rather than exiting. The two code cells below illustrate try and except blocks by catching the exception raised when a list is indexed past its length. Instead of crashing, the second version prints a message and continues with the rest of the processing. Furthermore, in the except block, some information about the exception is printed out.
In [ ]:
# This code generates an IndexError exception and crashes - Why?
twitterUsers = ["@pmharper","@Kathleen_Wynne","@JustinTrudeau","@timhudak","@DenisCoderre"]
# do some intermediate processing
# print out list of Twitter users
print("Twitter users in list: ")
for i in range(0,6):
    print(twitterUsers[i])
# more processing - for example, mine tweets of these users
print("Continuing with more Twitter processing.")
In [ ]:
# This script does not crash due to proper exception handling.
twitterUsers = ["@pmharper","@Kathleen_Wynne","@JustinTrudeau","@timhudak","@DenisCoderre"]
# do some intermediate processing
# print out list of Twitter users
print("Twitter users in list: ")
for i in range(0,6):
    try:
        print(twitterUsers[i])
    except Exception as e:
        print("Encountered exception: {0}.".format(e))
        pass
# do some more processing - for example, mine tweets of these users
print("Continuing with more Twitter processing.")
The script getCongressPDFs.py below wraps the download in a try and except block, so that a day with no available record is reported and skipped instead of crashing the whole run.
In [11]:
%load getCongressPDFs.py
In [13]:
# Copyright 2014, Radhika S. Saksena (radhika dot saksena at gmail dot com)
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0
# Web scraping and text processing with Python workshop
import urllib2
import sys
import os
import time
'''Download "Entire Issue" of the Congressional Record for the period May 1 - 31 2014.'''
month = 5 # set month to May
# make an archive/ directory if it doesn't already exist
if not os.path.exists(os.getcwd() + "/archive"):
    os.mkdir(os.getcwd() + "/archive")
# try downloading the Congressional Record for May 2014
#for day in range(1,31):
for day in range(1,6):
    # construct the URL from which we will try to retrieve the record
    fileName = "CREC-2014-%02d-%02d.pdf" % (month,day)
    fileURL = "http://beta.congress.gov/crec/2014/%02d/%02d/%s" % (month,day,fileName)
    localFile = os.getcwd() + "/archive/" + fileName
    try:
        # use the urllib2 module to fetch the record
        resp = urllib2.urlopen(fileURL)
        # write the record (PDF file) to a local file
        with open(localFile,"wb") as fout:
            fout.write(resp.read())
        print("Downloaded file %s." % fileURL)
    except:
        # if the record is unavailable, ignore and try the next day
        print("Encountered exception while trying to download file {0}.".format(fileName))
        pass
    # inject interval between consecutive requests
    time.sleep(5)
We can also inspect the HTTP status code returned by urllib2.urlopen(). For example, the status code of the request shown below is 200, which indicates that urllib2.urlopen() was successful in accessing the URL passed to it.
In [1]:
import urllib2
resp = urllib2.urlopen("http://www.princeton.edu")
print("Response code is: {0}.".format(resp.code))
# take some action if resp.code == 200
# example of a 404 (File Not Found error)
# resp = urllib2.urlopen("http://www.princeton.edu/urlerror")
# print("Response code is: {0}.".format(resp.code))
In [ ]: