(In order to load the stylesheet of this notebook, execute the last code cell in this notebook)
In this assignment, we will compile a list of books that were best-sellers during the summer of 2014. To get this list, we will use the Books API, from the New York Times.
Like most APIs, the NYT API requires each developer to have a private (secret) key in order to use its services. This lets the NYT throttle the number of requests that each developer issues. According to their website, the limit is 5,000 requests per day. You can register here. When asked, request a key for the Best Sellers API. Your key will appear on the next web page. Save it in a text file named key.txt, in the same folder as this notebook.
Do not add this key to your repo (git add) because it includes sensitive information that only you should know. To make sure that you do not add this by mistake, you will need to tell git which files to always ignore. We can do this by creating a .gitignore file under the root directory of the repository (spring-2015-homeworks/) and inside it add the path to the file that you want to ignore:
submissions/Homework-1/key.txt
If you did this right, the file should not come up when you type git status.
Now that we have downloaded our key, we should read it into a variable to use it for the rest of this notebook.
In [1]:
key = ""
with open('key.txt','r') as f:
    key = f.readline().strip()
if len(key) > 0:
    print "Successfully retrieved API key"
Let's make a sample request to check that everything works fine. We will retrieve the names of NYT best-seller lists.
In [2]:
import requests
response = requests.get("http://api.nytimes.com/svc/books/v3/lists/names.json?api-key=%s"%(key))
print response
Before we generate the request string, we need to read the guidelines from here. According to the documentation under the BEST-SELLER LIST NAMES section, the request must follow this URI structure:
http://api.nytimes.com/svc/books/{version}/lists/names[.response_format]?api-key={your-API-key}
We need to replace {version} with v3, replace response_format with json, and include our secret API key.
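As a minimal sketch of how the template maps onto a concrete request string, the helper below fills in the three placeholders. The name build_names_url and the placeholder key value are invented for illustration; they are not part of the Books API or of this assignment.

```python
# Hypothetical helper, for illustration only: fills in the URI template above.
TEMPLATE = "http://api.nytimes.com/svc/books/{version}/lists/names.{fmt}?api-key={key}"

def build_names_url(key, version="v3", fmt="json"):
    """Return the request URL for the best-seller list names."""
    return TEMPLATE.format(version=version, fmt=fmt, key=key)

# "YOUR-KEY" is a placeholder, not a working key.
print(build_names_url("YOUR-KEY"))
print(build_names_url("YOUR-KEY", fmt="xml"))
```

The resulting string can be passed to requests.get() in the same way as the hand-written URL in the cell above.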
We can try to print the response that NYT returns.
In [4]:
print response.text
The raw response is hard to read. Instead, we can decode it as JSON and use the json library to print it in a readable way.
In [5]:
import json
print json.dumps(response.json(), indent=2)
Now, this is much better! We can easily see that the response consists of a response status, the number of results and a list of the best-seller lists. For each of these lists, we get information about its name, its update frequency, its lifetime and its codename.
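To see what the indent argument does in isolation, here is a small sketch on a hand-made dictionary. The keys and values are invented for illustration and are not the fields of the Books API.

```python
import json

# Hand-made data, invented for illustration; not an actual API response.
sample = {"status": "OK", "items": [{"label": "first"}, {"label": "second"}]}

compact = json.dumps(sample)                            # everything on one line
pretty = json.dumps(sample, indent=2, sort_keys=True)   # one entry per line, keys sorted

print(compact)
print(pretty)
```

Both strings describe the same data; only the layout differs.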
Instead of JSON, we can also set the response type to be XML.
In [9]:
response = requests.get("http://api.nytimes.com/svc/books/v3/lists/names.xml?api-key=%s"%(key))
print response.text[:500]
Again, as before, we can use a library to print the XML in a readable way.
In [8]:
import xml.dom.minidom
xml_parser = xml.dom.minidom.parseString(response.text)
pretty_response = xml_parser.toprettyxml()
print pretty_response
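The same pretty-printing step can be tried offline on a small hand-made XML string. This is only a sketch: the tag names are invented for illustration and do not come from the Books API.

```python
import xml.dom.minidom

# Hand-made XML, invented for illustration; not an actual API response.
raw = "<shelf><book><title>First</title></book><book><title>Second</title></book></shelf>"

dom = xml.dom.minidom.parseString(raw)
pretty = dom.toprettyxml(indent="  ")   # indent each nesting level by two spaces

print(raw)      # one long line
print(pretty)   # spread over several indented lines
```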
In this section, we practice some of the basic Python tools that we learned in class and the powerful string handling methods that Python offers. Our goal is to be able to pick the interesting parts of the response and transform them in a format that will be useful to us.
Our first task will be to isolate the names of all the best-seller lists of the NYT. Fill in the rest of the print_names_from_XML() function that reads the XML response and prints all these names. (5 pts)
Hint: Our pretty formatter puts each tag on a separate line. You may want to read the documentation of the split(), strip() and startswith() functions.
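As a quick illustration of the three string methods named in the hint, here is a sketch on a hand-made piece of text. The tags are invented for illustration and are unrelated to the API response.

```python
# Hand-made text, invented for illustration.
text = "  <color>red</color>\n  <size>large</size>\n  <color>blue</color>\n"

color_lines = []
for line in text.split("\n"):        # split() cuts the string at every newline
    line = line.strip()              # strip() drops leading and trailing whitespace
    if line.startswith("<color>"):   # startswith() tests how the string begins
        color_lines.append(line)

print(color_lines)
```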
In [15]:
def print_names_from_XML(response):
    """Prints the names of all the best-seller lists that are in the response.
    Parameters:
        response: Response object
            The response object that is a result of a get request for the names of the
            best-selling lists from the Books API.
    """
    xml_parser = xml.dom.minidom.parseString(response.text)
    pretty_response = xml_parser.toprettyxml()
    # Fill-in the code that prints the list names
In [16]:
response = requests.get("http://api.nytimes.com/svc/books/v3/lists/names.xml?api-key=%s"%(key))
print_names_from_XML(response)
Can you do the same thing for the JSON response? Notice that a JSON object is basically a dictionary. (5 pts)
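To make the dictionary analogy concrete, here is a sketch of how nested dictionaries and lists are navigated. The data and key names are invented for illustration; they are not the keys used by the Books API.

```python
# Hand-made data, invented for illustration; the keys are not the Books API's.
data = {"count": 2, "shelves": [{"label": "Top"}, {"label": "Bottom"}]}

print(data["count"])            # look up a value by its key

labels = []
for shelf in data["shelves"]:   # the value under "shelves" is a list of dictionaries
    labels.append(shelf["label"])

print(labels)
```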
In [17]:
def print_names_from_JSON(response):
    """Prints the names of all the best-seller lists that are in the response.
    Parameters:
        response: Response object
            The response object that is a result of a get request for the names of the
            best-selling lists from the Books API.
    """
    # Fill-in the code that prints the list names
In [18]:
response = requests.get("http://api.nytimes.com/svc/books/v3/lists/names.json?api-key=%s"%(key))
print_names_from_JSON(response)
Let's try something more complicated. Pick your favorite list. Your task is to print the titles of the books that were best-sellers for the list you picked, on the week of July 1st, 2014. (20 pts)
Notice: If you read the API documentation carefully, you will see that the service returns 20 results at a time. Use the offset parameter to page through the results. The total number of books that you should be expecting is returned as num_results. It is easier to handle the response if you are working with JSON, so prefer it over the XML.
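The paging arithmetic can be sketched without issuing any requests. The helper name page_offsets is invented for illustration; it only computes which offset values would be needed for a given num_results, assuming 20 results per request as stated above.

```python
PAGE_SIZE = 20  # results per request, as stated above

def page_offsets(num_results, page_size=PAGE_SIZE):
    """Return the offset values needed to fetch num_results items."""
    return list(range(0, num_results, page_size))

# 45 results need three requests: items 0-19, 20-39 and 40-44.
print(page_offsets(45))
```

Each offset would then be sent as the offset parameter of one request.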
In [13]:
# Write your code here
Perfect! By now you should know how to navigate the responses of the API.
We are now ready to tackle our original problem: to compile a summary of the best-sellers over a period of 2 months.
First, we need to become confident working with dates. Since we want to issue requests that span a period of 2 months, we need to be able to advance by a day automatically, without keeping track of how many days each month has. To this end, we will use the datetime library. Here is an example:
In [19]:
import datetime
now = datetime.datetime.now()
print "Now:", now
print "Now (only date):", now.date()
print "Tomorrow:", now + datetime.timedelta(days=1)
print "Now (formatted):", now.strftime("%d:%m:%Y")
new_year = "01-01-2015"
new_year_date = datetime.datetime.strptime(new_year, "%m-%d-%Y")
print "Parsed", new_year_date.date()
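Building on the example above, the sketch below steps through a few consecutive days that cross a month boundary. The start and end dates are arbitrary values chosen for illustration.

```python
import datetime

# Arbitrary dates chosen for illustration; they straddle the end of June.
day = datetime.date(2014, 6, 28)
end = datetime.date(2014, 7, 2)
step = datetime.timedelta(days=1)

days = []
while day <= end:
    days.append(day.strftime("%Y-%m-%d"))
    day += step   # rolls over from June 30 to July 1 automatically

print(days)
```

Changing step to datetime.timedelta(days=7) would advance a week at a time instead.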
The basic component of our project will be a function that takes as input a date and a list name, executes as many requests to the Books API as needed to get the list of books for that day and returns the list together with the date of its publication.
To return more than one element from a function (a tuple of elements) we write
def foo():
    return "foo", 42
and then
r = foo()
print r[0] # "foo"
print r[1] # 42
or
txt, num = foo()
print txt # "foo"
print num # 42
Write a function that, given a list name and a date, returns a tuple with the books that were best-sellers for that date and the date on which the list was published by the NYT. (40 pts)
In [20]:
import datetime
import time
def get_books(date, list_name):
    """Returns a tuple containing the list of books and the publication date of the list
    Parameters:
        date: datetime
            The day for which we want to check the best-selling list.
        list_name: string
            The name of the best-selling list that we want to check. This needs to follow
            the Books API guidelines, e.g. 'hardcover-fiction'.
    Returns:
        books_set: set
            The set of books that were best-sellers according to NYT.
        published_date: datetime
            The date on which the list was published.
    """
Notice that our free API key has a limit of 8 QPS (queries per second). If we send queries faster than this limit, we will get back an error instead of the answer. A naive way to avoid this is to wait $1/8=0.125$ seconds after each query. The command for this is
time.sleep(0.125)
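As a sketch of where the sleep call could sit in a loop of queries, the example below pauses after every call. The names call_throttled and fake_query are invented for illustration, and fake_query is a stand-in that does not contact the API, so the sketch runs offline.

```python
import time

MIN_INTERVAL = 0.125  # seconds between queries, i.e. 1/8 s for a limit of 8 QPS

def call_throttled(func, args_list, min_interval=MIN_INTERVAL):
    """Call func once per argument, sleeping min_interval after each call."""
    results = []
    for arg in args_list:
        results.append(func(arg))
        time.sleep(min_interval)
    return results

# Stand-in for a real request function (hypothetical), so no network is needed.
def fake_query(offset):
    return "page starting at %d" % offset

start = time.time()
pages = call_throttled(fake_query, [0, 20, 40])
elapsed = time.time() - start

print(pages)
print(elapsed)   # roughly three sleeps' worth of time
```

This is deliberately naive: it always waits the full interval, even when the query itself already took longer than that.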
Let's now test our function:
In [21]:
date = datetime.date(2014,7,1)
list_name = "hardcover-fiction"
book_list, book_date = get_books(date, list_name)
print book_list
print
print "Published on", book_date
Great! The final step is to write a function that takes a time window and a list name, and returns a dictionary with the books that were best-sellers as keys and the number of weeks that they were in the list as values. (30 pts)
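The counting pattern behind such a dictionary can be sketched on hand-made data. The titles and weekly sets below are invented for illustration and do not come from the API.

```python
# Hand-made data, invented for illustration: one set of titles per published list.
weekly_lists = [
    {"Book A", "Book B"},
    {"Book A", "Book C"},
    {"Book A", "Book B"},
]

weeks_on_list = {}
for titles in weekly_lists:
    for title in titles:
        # get() returns 0 the first time a title is seen
        weeks_on_list[title] = weeks_on_list.get(title, 0) + 1

print(weeks_on_list)
```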
In [22]:
import datetime
def most_popular(start_date, end_date, list_name):
    """Returns the best-selling books and the number of weeks each spent on the list during the given time window
    Parameters:
        start_date: datetime
            The first day to check.
        end_date: datetime
            The last day to check.
        list_name: string
            The name of the best-selling list that we want to check. This needs to follow
            the Books API guidelines, e.g. 'hardcover-fiction'.
    Returns:
        books_dict: dictionary
            Dictionary of book titles with the number of weeks on the requested NYT list.
    """
Again, let's test our function. It might take a while to run (because of the QPS limit).
In [23]:
start_date = datetime.date(2014,6,1)
end_date = datetime.date(2014,8,31)
list_name = "hardcover-fiction"
books_dict = most_popular(start_date, end_date, list_name)
for book in books_dict:
    print book, ":", books_dict[book]
In [1]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    styles = open("../../theme/custom.css", "r").read()
    return HTML(styles)
css_styling()