Interacting with web APIs


Overview. We introduce the basics of interacting with web APIs using the requests package. We discuss the basics of how web APIs are usually constructed and show how to interact with the BEA and Open Data Network APIs as illustrations of the concepts.

Outline

  • Web APIs: We describe how APIs are usually accessed via URLs with a special format
  • BEA: We use the Bureau of Economic Analysis (BEA)'s API as an in-depth example of how this works
  • Open Data Network: We use the Open Data Network API as another, simpler example of getting data from the web

Note: requires internet access to run.

This Jupyter notebook was created by Chase Coleman and Spencer Lyon for the NYU Stern course Data Bootcamp.


Preliminaries

Import the usual suspects


In [ ]:
import pandas as pd             # data package
import matplotlib.pyplot as plt # graphics 
import datetime as dt           # date tools, used to note current date  
import sys

# these are new 
import requests

%matplotlib inline 

print('\nPython version: ', sys.version) 
print('Pandas version: ', pd.__version__)
print('Requests version: ', requests.__version__)
print("Today's date:", dt.date.today())

Web API basics

Many websites make data available through the use of their API (examples: Airbnb, quandl, FRED, BEA, ESPN, and many others)

Most of the time you interact with the API by making http (or https) requests. To do this you direct your browser to a special URL for the website. Usually this URL takes the following form:

https://my_website.com/api?FirstParam=first_value&SecondParam=second_value

Notice that this URL breaks into three pieces:

  • The first part (https://my_website.com/api) is called the root url for the API. This is the starting point for all API interactions with this website
  • Next is the question mark ?. This separates the root url from a list of parameters
  • Finally, we have a list of parameters that take the form key=value. Each key, value pair is separated by a &.

Because we are lazy and use Python, instead of directing our browser to these special urls, we will use the function requests.get (that is, the get function from the requests package). Here's how the example above looks when using that function

root_url = "https://my_website.com/api"
params = {"FirstParam": "first_value", "SecondParam": "second_value"}
requests.get(root_url, params=params)
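
Under the hood, requests simply URL-encodes the params dict and appends it to the root url after a ?. We can sketch that encoding ourselves with the standard library (using the same made-up example URL as above):


In [ ]:
```python
from urllib.parse import urlencode

# the same made-up example as above
root_url = "https://my_website.com/api"
params = {"FirstParam": "first_value", "SecondParam": "second_value"}

# urlencode joins key=value pairs with & -- exactly the format described above
full_url = root_url + "?" + urlencode(params)
print(full_url)   # → https://my_website.com/api?FirstParam=first_value&SecondParam=second_value
```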

BEA API

In this section we will look at how to use the requests package to interact with the API provided by the Bureau of Economic Analysis (BEA).

The API itself is documented on their website at this link.

Some key takeaways from that document:

  • The root url is https://bea.gov/api/data
  • There are two required parameters to every API call:
    1. UserID: This is a special "password" you obtain when you register to use the API. I registered with the email address nyu.databootcamp@gmail.com. The UserID they gave me is in the next code cell
    2. Method: This is one of 5 possible methods the BEA has defined: GetDataSetList, GetParameterList, GetParameterValues, GetParameterValuesFiltered, GetData.
  • Any additional parameters will depend on the Method that is used

Let's use what we know already and prepare some tools for interacting with their API


In [ ]:
import requests

def bea_request(method, **kwargs):
    # this is the UserID they gave me
    BEA_ID = "2A629F24-EF8D-4043-BC1F-8CB6A331A2F3"

    # root url for bea API
    API_URL = "https://bea.gov/api/data"
    
    # start constructing params dict
    params = dict(UserID=BEA_ID, method=method)
    
    # bring in any additional keyword arguments to the dict
    params.update(kwargs)
        
    # Make request
    r = requests.get(API_URL, params=params)
    return r

Notice that we have used a new syntax, **kwargs, in that function. What this does is collect, at the time the function is called, all extra parameters passed by name into a dict called kwargs. Here's a simpler example that illustrates the point:


In [ ]:
# NOTE: the name kwargs isn't special -- here I use some_params
def my_func(**some_params):
    return some_params

In [ ]:
my_func(b=10)

In [ ]:
my_func(a=1, b=2)

Exercise (2 min): Experiment with my_func to make sure you understand how it works. You might try these things out:

  • Why doesn't my_func(1) work?
  • What is the type of x in x = my_func(a=1, b=2)?
  • What is the type of and len of x in x = my_func()?

Let's test out our bea_request function by calling the GetDataSetList method.

First, we need to check the methods page of the documentation to make sure we don't need any additional parameters. Looks like this one doesn't. Let's call it and see what we get


In [ ]:
datasets_raw = bea_request("GetDataSetList")
type(datasets_raw)

In [ ]:
# did the request succeed?
datasets_raw.ok

In [ ]:
# status code 200 means success!
datasets_raw.status_code

In [ ]:
datasets_raw.content

The actual data returned from the BEA website is contained in datasets_raw.content. This will be a JSON object (remember the plotly notebook), but it can be converted into a Python dict by calling the json method:
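
The .json() method is essentially a call to the standard library's json.loads on the response body. Here's a sketch; the bytes below are a made-up miniature of the real BEA response, not actual API output:


In [ ]:
```python
import json

# hypothetical response body, shaped like (but much smaller than) the real one
raw_content = b'{"BEAAPI": {"Request": {}, "Results": {"Dataset": []}}}'

parsed = json.loads(raw_content)
print(list(parsed.keys()))   # → ['BEAAPI']
```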


In [ ]:
datasets_raw_dict = datasets_raw.json()
print("length of datasets_raw_dict:", len(datasets_raw_dict))
datasets_raw_dict

Notice that this dict has one item. The key is BEAAPI. The value is another dict. Let's take a look inside this one


In [ ]:
datasets_dict = datasets_raw_dict["BEAAPI"]
print("length of datasets_dict:", len(datasets_dict))
datasets_dict

The value here is another dict, this time with two keys:

  • Request: gives details regarding the API request we made -- we'll throw this one away
  • Results: The actual data.

Let's pull the data into a dataframe so we can see what we are working with


In [ ]:
datasets = pd.DataFrame(datasets_dict["Results"]["Dataset"])
datasets

What we have here is a mapping from a DatasetName to a description of that dataset. This is helpful as we'll use it later on when we actually want to get our data.

Exercise (4 min): Read the documentation for the GetData API method (here) and determine the following:

  • What are the required parameters?
  • What are optional parameters?
  • How can we determine what optional parameters are available? (Hint 1: it varies by dataset. Hint 2: check out the GetParameterList method)

Let's put this to practice and actually get some data.

Suppose I wanted to get data on the expenditure formula for GDP. You might remember from econ 101 that this is:

$$GDP = C + G + I + NX$$

where $GDP$ is GDP , $C$ is personal consumption, $G$ is government spending, $I$ is investment, and $NX$ is net exports.

All of these variables are available from the BEA in the national income and product accounts (NIPA) table. Let's see what parameters are required to use the GetData method when DataSetName=NIPA (NOTE: I'm not walking us through what the response looks like this time -- I'll just write the code that gets us to the result)


In [ ]:
nipa_params_raw = bea_request("GetParameterList", DataSetName="NIPA")
nipa_params = pd.DataFrame(nipa_params_raw.json()["BEAAPI"]["Results"]["Parameter"])
nipa_params

The ParameterName column above tells us the name of all additional parameters we can send to GetData.

The ParameterIsRequiredFlag has a 1 if that parameter is required and a 0 if it is optional

Finally, the ParameterDataType tells us what type the value of each parameter should be.
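
Once we know which flag marks a parameter as required, pulling out just the required ones is a one-liner. A minimal sketch using plain dicts (the records below are made up to mimic the structure of the real response, not actual BEA output):


In [ ]:
```python
# hypothetical parameter records shaped like the GetParameterList results
parameters = [
    {"ParameterName": "TableID", "ParameterIsRequiredFlag": "1", "ParameterDataType": "integer"},
    {"ParameterName": "Frequency", "ParameterIsRequiredFlag": "1", "ParameterDataType": "string"},
    {"ParameterName": "ShowMillions", "ParameterIsRequiredFlag": "0", "ParameterDataType": "string"},
]

# keep only the parameters flagged as required
required = [p["ParameterName"] for p in parameters if p["ParameterIsRequiredFlag"] == "1"]
print(required)   # → ['TableID', 'Frequency']
```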

I did a bit of digging and found that the GDP data we are after lives in table 6. Let's get quarterly data for 1990 to 2016


In [ ]:
gdp_data = bea_request("GetData", DataSetName="NIPA",
                       TableId=6,
                       Frequency="Q",
                       Year=list(range(1990, 2017)))

In [ ]:
# check to make sure we have a 200, meaning success
gdp_data.status_code

In [ ]:
# extract the results and read into a DataFrame
gdp = pd.DataFrame(gdp_data.json()["BEAAPI"]["Results"]["Data"])
print("The shape of gdp is", gdp.shape)
gdp.head()

The important columns for us are going to be DataValue, SeriesCode, and TimePeriod. I did a bit more digging and found that the series codes map into our variables as follows


In [ ]:
gdp_names = {"DPCERX": "C",
             "A191RX": "GDP",
             "A019RX": "NX",
             "A006RX": "I",
             "A822RX": "G"}

Let's insert the names we know into the SeriesCode column using the replace method:


In [ ]:
gdp.iloc[[0, 107, 498, 1102, 1672], :]

In [ ]:
gdp["SeriesCode"] = gdp["SeriesCode"].replace(gdp_names)
gdp.iloc[[0, 107, 498, 1102, 1672], :]

Exercise (10 min) WARNING: this is a long exercise, but should make you use tools from almost every lecture of the last 6 weeks.

Our want is:

  • A DataFrame with one column for each of those 5 variables
  • The index should be the time period and should have type DatetimeIndex
  • The dtype for all columns should be float64

Here's an outline of how I would do this:

  • Remove all rows where SeriesCode isn't one of our 5 variables (now named GDP, C, G, etc.)
  • drop all columns we don't need
  • Convert the TimePeriod column to a datetime (HINT: use pd.to_datetime)
  • convert the DataValue column to have the correct dtype (HINT: you'll need to use the .str methods here)
  • At this point you have 3 columns, all with the right dtype. Now use some combination of set_index and unstack to get the correct row and column labels (HINT: You might have ended up with 2 levels on your column index (I did) -- drop the one for DataValue if necessary)

Test out how well this went by plotting the DataFrame


In [ ]:

Open Data Network API

The Open Data Network is a collection of cities, states, and federal government agencies that have all opened access to their data using the same tools. If you follow the link to the Open Data Network, there is a list of all participating cities at the bottom. It includes New York City, Chicago, Boston, Houston, and many more.

The tool all of these cities use to open up their data is called Socrata. One of the benefits of everyone using the same tool is that the various datasets can all be accessed through the same API.

The general API documentation can be found here. Let's open this up and see whether we can extract some of the important pieces of information that we'd like. We need to find two things:

  • A "root url" that we put at the beginning of all of our requests
  • The set of parameters that we want to define for any request (information like what dataset, how many observations, or what time frame).

This API has some nice features that you won't necessarily get with other APIs. One of these is that it returns data as a JSON file. Lucky for us, pandas knows how to read this type of file, so when we interact with the Open Data Network (or any other Socrata-based dataset) we can just use pd.read_json instead of what we showed in our previous example.

Root URL

The documentation starts by discussing "API Endpoints." An API endpoint is just the thing that we are referring to as the root url -- The website that we use to make our requests. Each dataset will have a different API endpoint because they are hosted by different organizations (different cities/states/agencies).

One example of an API endpoint is https://data.cityofchicago.org/. We could find this by going to the Open Data Network site and searching "Chicago crime."

Parameters

The types of parameters that we need to pass will depend on the dataset that we will be using. The only way you'll understand all of these parameters is by carefully reading the docs -- If you ask too many questions without having read the documentation, some people online may tell you RTFD. I will describe a few of them here though.

Socrata has created a system that allows you to use parameters to limit the data you get back. Many of these act like SQL queries and, in a nod to this, they call this functionality SoQL queries. It allows you to do things like:

  • Choose a specific subset of columns from the data
  • Choose how many observations you want (useful if you are just playing with data for the first time and don't need the full dataset -- much like using df.head())
  • Choose observations based on some type of a requirement
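
Each of those operations maps to a SoQL parameter ($select, $limit, $where, and so on), which gets tacked onto the endpoint URL just like the parameters we saw earlier. As a sketch, here is how such a query URL could be assembled with the standard library (the endpoint is the Chicago one used below; the column names and limit are just examples):


In [ ]:
```python
from urllib.parse import urlencode

# hypothetical SoQL query: a few columns, capped at 100 observations
endpoint = "https://data.cityofchicago.org/resource/6zsd-86xi.json"
soql = {"$select": "case_number,date,arrest", "$limit": 100}

query_url = endpoint + "?" + urlencode(soql)
print(query_url)
```

Note that urlencode percent-encodes the $ and , characters; Socrata accepts these just as well as the literal versions.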

You also have access to some more parameters that give authorization like an app_token.

Example

We read in the data on all crimes in Chicago since 2001.


In [ ]:
chi_api = "https://data.cityofchicago.org/"
chi_crime_url = chi_api + "resource/6zsd-86xi.json?$limit=25000"
chi_df = pd.read_json(chi_crime_url)

chi_df.head()[["arrest", "case_number", "community_area", "date"]]

Exercise: Find the API endpoint for Boston crime (use the Crime Incident Reports July 2012-August 2015 data).

Exercise: Read in the first 50 observations of the Boston crime dataset into a dataframe named bos_df


In [ ]:

We can now look at the types of everything in these two datasets and see what information they contain.


In [ ]:
bos_df.dtypes

In [ ]:
chi_df.dtypes

Plot Chicago crime over time

Recall that we only have the first 25,000 observations of the dataset, so the results are likely to be nonsense. We do it anyway because it gives us a chance to use the time series tools we talked about previously.


In [ ]:
chi_df = chi_df.set_index("date")

In [ ]:
cases_per_month = chi_df.resample("M").count()["case_number"]

In [ ]:
cases_per_month.plot()

In [ ]: