Accessing Databases via Web APIs


In this lesson we'll learn what an API (Application Programming Interface) is, how it's normally used, and how we can collect data from it. We'll then look at how Python can help us quickly gather data from APIs, parse the data, and write to a CSV. There are four sections:

  1. Constructing an API GET request
  2. Parsing the JSON response
  3. Looping through result pages
  4. Exporting to CSV

First we'll import the required Python libraries:


In [ ]:
import requests  # to make the GET request 
import json  # to parse the JSON response to a Python dictionary
import time  # to pause after each API call
import csv  # to write our data to a CSV
import pandas  # to see our CSV

The json, time, and csv modules are part of Python's standard library, but requests and pandas are third-party packages. They ship with distributions like Anaconda; otherwise you'll need to install them.
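
If requests or pandas is missing from your environment, you can install them from inside the notebook using Jupyter's ! shell-escape syntax (shown commented out so it only runs if you need it):


In [ ]:
# uncomment and run if requests or pandas is not installed
# !pip install requests pandas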

1. Constructing an API GET request


We're going to use the New York Times Article Search API. You'll first need to sign up for an API key.

Almost every API call requires us to provide:

  1. a base URL for the API,
  2. (usually) some authorization code or key, and
  3. a format for the response.

Let's write this information to some variables:


In [ ]:
# set API key var
key = ""

# set base url var
base_url = "http://api.nytimes.com/svc/search/v2/articlesearch"

# set response format var
response_format = ".json"

Notice we assign each variable as a string. While the requests library will convert integers for us, it's better to be consistent and use strings for all parameters of a GET request. We choose JSON as the response format because it's quick and easy to parse with Python, though XML is another frequently offered alternative. JSON stands for "JavaScript Object Notation." It has a structure very similar to a Python dictionary -- both are built on key/value pairs.
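
To make that concrete, here is a tiny sketch using a made-up JSON string, showing how the json module converts between JSON text and a Python dictionary:


In [ ]:
# a small, made-up JSON string
example_json = '{"artist": "Duke Ellington", "born": 1899, "instruments": ["piano"]}'

# json.loads turns JSON text into a Python dictionary
example_dict = json.loads(example_json)
print(example_dict['artist'])

# json.dumps goes the other way, from dictionary back to JSON text
print(json.dumps(example_dict))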

You often want to send some sort of data in the URL’s query string. This data tells the API what information you want. In our case, we're going to look for articles about Duke Ellington. Requests allows you to provide these arguments as a dictionary, using the params keyword argument. In addition to the search term q, we have to put in the api-key term. We know these key names from the NYT API documentation.


In [ ]:
# set search parameters
search_params = {"q": "Duke Ellington",
                 "api-key": key}

Now we're ready to make the request. We use the .get method from the requests library to make an HTTP GET Request.


In [ ]:
# make request
response = requests.get(base_url + response_format, params=search_params)

Now we have a response object called response. We can get all the information we need from this object. For instance, we can see that the URL has been correctly encoded by printing it.


In [ ]:
print(response.url)

Click on that link to see what it returns! Notice that all Python is doing here is helping us construct a complicated URL built with & and = signs. We could just as well copy and paste this URL into a browser and save the response by hand, but the requests library is much easier and more scalable when making many queries in succession.
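
If you're curious what requests does under the hood, the standard library's urllib.parse.urlencode builds the same kind of query string. This is just an illustration; you don't need it for the rest of the lesson, and the exact percent-encoding may differ slightly from what requests produces:


In [ ]:
from urllib.parse import urlencode

# build the query string by hand, roughly the way requests does for us
query_string = urlencode(search_params)
print(base_url + response_format + '?' + query_string)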

Challenge 1: Adding a date range

What if we only want to search within a particular date range? The NYT Article API allows us to specify start and end dates.

Alter the search_params code above so that the request only searches for articles in the year 2015.

You're going to need to look at the documentation to see how to do this.


In [ ]:
# set date parameters here

In [ ]:
# # uncomment to test
# r = requests.get(base_url + response_format, params=search_params)
# print(r.url)

Challenge 2: Specifying a results page

The above will return the first 10 results. To get the next ten, you need to add a "page" parameter. Change the search parameters above to get the second 10 results.


In [ ]:
# set page parameters here

In [ ]:
## uncomment to test
# r = requests.get(base_url+response_format, params=search_params)
# print(r.url)

2. Parsing the JSON response


We can read the content of the server’s response using the .text attribute.


In [ ]:
# inspect the content of the response, parsing the result as text
response_text = response.text
print(response_text[:1000])

What you see here is JSON text, encoded as Unicode. As mentioned, JSON has essentially the same structure as a Python dictionary, and we can convert this string into an actual dictionary using json.loads (which loads JSON from a string).


In [ ]:
# convert JSON response to a dictionary
data = json.loads(response_text)
print(data)

That looks intimidating! But it's really just a big dictionary. The most time-consuming part of using APIs is traversing the various key-value trees to see where the information you want resides. Let's see what keys we got in there.


In [ ]:
print(data.keys())
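
One way to speed up that exploration is a small helper that walks nested dictionaries and prints their keys at each level. print_keys is just a convenience sketch written for this lesson, not part of any library:


In [ ]:
def print_keys(d, indent=0):
    """Recursively print the keys of nested dictionaries."""
    for key, value in d.items():
        print(' ' * indent + str(key))
        if isinstance(value, dict):
            print_keys(value, indent + 2)

print_keys(data)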

In [ ]:
# this is boring
print(data['status'])

In [ ]:
# so is this
print(data['copyright'])

In [ ]:
# this is what we want!
print(data['response'])

In [ ]:
print(data['response'].keys())

In [ ]:
print(data['response']['meta'].keys())

In [ ]:
print(data['response']['meta']['hits'])

Looks like there were 93 hits total for our query. Let's take a look:


In [ ]:
print(data['response']['docs'])

It starts with a square bracket, so it's a list, and at a glance it appears to contain the articles we're interested in.


In [ ]:
print(type(data['response']['docs']))

Let's just save this list to a new variable. Often when using web APIs, you'll spend the majority of your time restructuring the response data into the shape you want.


In [ ]:
docs = data['response']['docs']

In [ ]:
print(docs[0])

Wow! That's a lot of information about just one article! But wait...


In [ ]:
print(len(docs))

3. Looping through result pages


We're making progress, but we only have 10 items. The original response said we had 93 hits! That means we have to make 93 / 10, rounded up, or 10 requests to get them all. Sounds like a job for a loop!


In [ ]:
# get number of hits total (in any page we request)
hits = data['response']['meta']['hits']
print("number of hits: ", str(hits))

# get number of pages
pages = hits // 10 + 1

# make an empty list where we'll hold all of our docs for every page
all_docs = []

# now we're ready to loop through the pages
for i in range(pages):

    print("collecting page", str(i))

    # set the page parameter
    search_params['page'] = i

    # make request
    r = requests.get(base_url + response_format, params=search_params)

    # get text and convert to a dictionary
    data = json.loads(r.text)

    # get just the docs
    docs = data['response']['docs']

    # add those docs to the big list
    all_docs = all_docs + docs

    # IMPORTANT pause between calls
    time.sleep(5)
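
The fixed five-second pause is there to respect the API's rate limit. As a sketch of a more defensive variant (not required for this lesson), you could check the HTTP status code and retry with a longer wait when a request fails; treating any non-200 status, such as 429, as "slow down and try again" is an assumption about how the NYT API signals problems:


In [ ]:
def get_page(page, max_retries=3, wait=5):
    """Request one page of results, retrying on failure (a sketch)."""
    search_params['page'] = page
    for attempt in range(max_retries):
        r = requests.get(base_url + response_format, params=search_params)
        if r.status_code == 200:
            time.sleep(wait)  # still pause between successful calls
            return json.loads(r.text)['response']['docs']
        # assumed: a non-200 status (e.g. 429) means wait longer and retry
        time.sleep(wait * (attempt + 2))
    return []  # give up on this page after max_retries attempts

# hypothetical usage: docs_for_page_3 = get_page(3)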

In [ ]:
print(len(all_docs))

4. Exporting to CSV


Great, now we have all the articles. Let's just take out some bits of information and write to a CSV.


In [ ]:
final_docs = []

for d in all_docs:
    
    # create empty dict for each doc to collect info
    targeted_info = {}
    targeted_info['id'] = d['_id']
    targeted_info['headline'] = d['headline']['main']
    targeted_info['date'] = d['pub_date'][0:10]  # cutting time of day.
    targeted_info['word_count'] = d['word_count']
    targeted_info['keywords'] = [keyword['value'] for keyword in d['keywords']]
    try:  # some docs don't have this info
        targeted_info['lead_paragraph'] = d['lead_paragraph']
    except KeyError:
        targeted_info['lead_paragraph'] = None  # keep the key so every row has the same fields

    # append final doc info to list
    final_docs.append(targeted_info)

We can write our sifted information to a CSV now:


In [ ]:
# use the keys of the first doc as the CSV header
header = final_docs[0].keys()

with open('all-docs.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, header)
    dict_writer.writeheader()
    dict_writer.writerows(final_docs)
    
pandas.read_csv('all-docs.csv')
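
As a side note, since pandas is already imported, an alternative route to the same result is to build a DataFrame straight from the list of dictionaries and let pandas write the CSV (the file name all-docs-pandas.csv is just for illustration):


In [ ]:
# build a DataFrame directly from the list of dictionaries
df = pandas.DataFrame(final_docs)

# write it out and preview the first few rows
df.to_csv('all-docs-pandas.csv', index=False)
df.head()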