CSE 6040, Fall 2015 [05, Part B]: Web services 101

The second part of today's lab considers another rich source of data: the web! You will need some of these ideas to do the first homework assignment.

References for today's topics:

- Requests module documentation: http://docs.python-requests.org

The Requests module

A simple way to download a web page in Python is to use the Requests module.

The following example downloads the Georgia Tech home page, storing the raw HTML returned as a string named webpage.


In [1]:
# Download the Georgia Tech home page

import requests
response = requests.get ('http://www.gatech.edu')
webpage = response.text  # or response.content for raw bytes

print (webpage[0:100]) # Prints the first hundred characters only


<!DOCTYPE html>
<html lang="en" dir="ltr" 
  xmlns:fb="http://www.facebook.com/2008/fbml"
  xmlns:co

Exercise: Write some Python code that (a) downloads the class home page, and (b) prints a list of all the "base filenames" of the IPython notebooks that the page references. The base filename is the name of the file, ignoring the preceding path. For instance, the base filename of the notebook you are reading now is 05b--www.


In [ ]:
# (Enter your code for the preceding exercise in this code box)
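
# One possible sketch (not the only solution!). The page URL below is an
# assumption -- adjust it to wherever the class home page actually lives.
import re

response = requests.get ('http://cse6040.gatech.edu/fa15/')

# Find links to IPython notebooks, then strip the path and extension
# to recover each base filename
for link in re.findall (r'href="([^"]*\.ipynb)"', response.text):
    base = link.split ('/')[-1]            # drop the preceding path
    print (base.replace ('.ipynb', ''))    # drop the extension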

Example: Yelp! search. Here's a more complex example, motivated by a Yelp! search for ramen in Atlanta. Take note of the URL that such a search produces.

The URL encodes what is known as an HTTP "get" method (or request). Such a URL has two parts: a command, followed by one or more arguments. In this case, the command is everything up to and including the word search; the arguments are the rest, which appear after the ? character and are separated from one another by & characters.
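For instance, such a search might produce a URL like the following (a made-up but representative example, matching the arguments used in the code below):

http://www.yelp.com/search?find_desc=ramen&find_loc=atlanta%2C+ga&ns=1

Here, http://www.yelp.com/search is the command; find_desc=ramen, find_loc=atlanta%2C+ga, and ns=1 are the arguments. (The %2C and + are URL encodings of a comma and a space, respectively.)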

"HTTP" stands for "HyperText Transport Protocol," which is a standardized set of communication protocols that allow web clients, like your web browser or your Python program, to communicate with web servers.

In this next example, let's see how to build a "get request" with the requests module. It's pretty easy!


In [ ]:
url_command = 'http://yelp.com/search'
url_args = {'find_desc': "ramen"
            , 'find_loc': "atlanta, ga"
            , 'ns': 1
            , 'start': 0}
response = requests.get (url_command, params=url_args)

print ("==> Downloading from: '%s'" % response.url) # confirm URL
print ("\n==> Excerpt from this URL:\n\n%s\n" % response.text[0:100])

Exercise. Try modifying and extending the above code to retrieve the 13th entry in the search results.


In [ ]:
# (Enter your code for the preceding exercise in this code box)
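
# A sketch of one approach. It assumes (from inspecting Yelp!'s URLs) that
# 'start' is the zero-based offset of the first result on a page and that
# each page shows 10 results -- verify these against the live site!
url_args = {'find_desc': "ramen"
            , 'find_loc': "atlanta, ga"
            , 'ns': 1
            , 'start': 10}   # results 11-20, so the 13th entry is on this page
response = requests.get (url_command, params=url_args)

print ("==> Downloading from: '%s'" % response.url)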

Interacting with a web API

We hope the preceding exercise was painful: downloading raw HTML and trying to extract information from it is rough!

Luckily, many websites provide an application programming interface (API) for querying their data or otherwise accessing their services from your programs. For instance, Twitter provides a web API for gathering tweets, Flickr provides one for gathering image data, and GitHub provides one for accessing information about repository histories.

These kinds of web APIs are much easier to use than the preceding technique, which scrapes raw web pages and then parses the resulting HTML. Moreover, they are more scalable, in the sense that the web servers can transmit structured data in a less verbose form than raw HTML. In Homework 1, you will apply the techniques below, as well as others, to write some Python scripts to interact with the Yelp! web API.

As a starting example, here is some code to look at all the activity on Github related to our course's IPython notebook repository.

Inspect this code and try running it. See if you can figure out what it does. Note that it is split into two parts, so you can try to digest one before moving on to the second.


In [ ]:
# Ask GitHub's web API for recent events on our course's notebook repository
response = requests.get ('https://api.github.com/repos/rvuduc/cse6040-ipynbs/events')

# Gather the unique profile URLs of the users ("actors") behind these events
urls = set ()
for event in response.json ():
    urls.add (event['actor']['url'])

In [ ]:
# Blank cell, for you to debug or print program state, as needed

In [ ]:
peeps = {}

# Fetch each user's profile, mapping their login (username) to their full name
for url in urls:
    response = requests.get (url)
    user_info = response.json ()  # parse the JSON response just once
    peeps[user_info['login']] = user_info['name']

for key, value in peeps.items ():
    print ("%s: '%s'" % (key, str (value)))

In [ ]:
# Blank cell, for you to debug or print program state, as needed

A more advanced example: Unpacking a zip file

In Labs 4 and 5-A, you worked with an email repository that you had to manually download and unpack.

As it happens, you can do that from within your Python program as well!


In [ ]:
import zipfile
import io

URL_ZIPPED = "http://cse6040.gatech.edu/fa15/skilling-j.zip"

r = requests.get (URL_ZIPPED)

# Wrap the downloaded bytes in an in-memory file-like object, which zipfile
# can read as though it were a file on disk. (io.BytesIO works in both
# Python 2 and 3, unlike the legacy StringIO module.)
zipped_maildir = zipfile.ZipFile (io.BytesIO (r.content), 'r')

print ("==> Downloaded: %s" % URL_ZIPPED)

You can inspect the contents of this archive.


In [ ]:
# For the first COUNT items in the archive,
# print the original and compressed file sizes.

COUNT = 10
print ("Contents (first %d items):" % COUNT)
for zi in zipped_maildir.infolist ()[0:COUNT]:
    print ("  %s: %d -> %d bytes"
          % (zi.filename, zi.file_size, zi.compress_size))

Exercise: Count messages. Write a Python program to count the number of messages in the archive.

Hint: How can you tell a folder from a file?


In [ ]:
def count_zipped_messages (zipped_maildir):
    """Returns the number of email messages in a zipped maildir."""
    pass # Replace with your implementation

msg_count = count_zipped_messages (zipped_maildir)
print ("==> Found %d messages." % msg_count)
assert msg_count == 4139
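
If you get stuck, here is one possible sketch. It assumes that folder entries in this archive have names ending in '/', a common convention that you can verify against the infolist() output above.

In [ ]:
# A sketch: count every entry that is not a folder. The trailing-'/' test
# for folders is an assumption -- check it against infolist() if in doubt
def count_zipped_messages (zipped_maildir):
    """Returns the number of email messages in a zipped maildir."""
    count = 0
    for zi in zipped_maildir.infolist ():
        if not zi.filename.endswith ('/'):
            count += 1
    return count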

Exercise: A-Priori. Can you adapt your implementation of the a-priori algorithm to work on a zipped email archive?


In [ ]:
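# A scaffolding sketch only, not a full solution: iterate over the messages
# in the zip archive and hand each message body to your existing a-priori
# routines. The helper name `process_message` is hypothetical -- substitute
# whatever your Lab 4/5-A implementation provides.
for zi in zipped_maildir.infolist ():
    if zi.filename.endswith ('/'):  # skip folder entries
        continue
    message = zipped_maildir.read (zi.filename)  # the raw message contents
    # process_message (message)  # e.g., update your pair counts here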