Overview. We introduce the basics of interacting with web APIs using the requests package. We discuss how web APIs are usually constructed and show how to interact with the BEA and Socrata APIs as illustrations of the concepts.
Note: requires internet access to run.
This Jupyter notebook was created by Chase Coleman and Spencer Lyon for the NYU Stern course Data Bootcamp.
In [ ]:
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics
import datetime as dt # date tools, used to note current date
import sys
# these are new
import requests
%matplotlib inline
print('\nPython version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Requests version: ', requests.__version__)
print("Today's date:", dt.date.today())
Many websites make data available through an API (examples: Airbnb, quandl, FRED, BEA, ESPN, and many others).
Most of the time you interact with the API by making http (or https) requests. To do this you direct your browser to a special URL for the website. Usually this URL takes the following form:
https://my_website.com/api?FirstParam=first_value&SecondParam=second_value
Notice that the URL is made up of two distinct pieces:

- The first part, https://my_website.com/api, is called the root url for the API. This is the starting point for all API interactions with this website.
- Everything after the ? is a list of parameters of the form key=value. Each key, value pair is separated by a &.
Because we are lazy and use Python, instead of directing our browser to these special urls we will use the function requests.get (that is, the get function from the requests package). Here's how the example above looks when using that function:
root_url = "https://my_website.com/api"
params = {"FirstParam": "first_value", "SecondParam": "second_value"}
requests.get(root_url, params=params)
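Because the root url above is made up, actually sending this request would fail. Still, we can see the URL that requests assembles from the root url and the params dict by preparing the request without sending it (a small sketch using requests.Request, which is part of the requests package):

In [ ]:
import requests

# build -- but do not send -- the request so we can inspect the URL
req = requests.Request("GET", "https://my_website.com/api",
                       params={"FirstParam": "first_value",
                               "SecondParam": "second_value"})
print(req.prepare().url)
# https://my_website.com/api?FirstParam=first_value&SecondParam=second_value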
In this section we will look at how to use the requests
package to interact with the API provided by the Bureau of Economic Analysis (BEA).
The API itself is documented on their website at this link.
Some key takeaways from that document:

- The root url for all requests is https://bea.gov/api/data
- UserID: a special "password" you obtain when you register to use the API. I registered with the email address nyu.databootcamp@gmail.com. The UserID they gave me is in the next code cell.
- Method: one of 5 possible methods the BEA has defined: GetDataSetList, GetParameterList, GetParameterValues, GetParameterValuesFiltered, GetData.
- Additional parameters, which vary based on the Method that is used.

Let's use what we know already and prepare some tools for interacting with their API
In [ ]:
import requests

def bea_request(method, **kwargs):
    # this is the UserID they gave me
    BEA_ID = "2A629F24-EF8D-4043-BC1F-8CB6A331A2F3"

    # root url for bea API
    API_URL = "https://bea.gov/api/data"

    # start constructing params dict
    params = dict(UserID=BEA_ID, method=method)

    # bring in any additional keyword arguments to the dict
    params.update(kwargs)

    # Make request
    r = requests.get(API_URL, params=params)

    return r
Notice that we have used a new syntax, **kwargs, in that function. What this does is collect, at the time the function is called, all extra parameters passed by name into a dict called kwargs. Here's a simpler example that illustrates the point:
In [ ]:
# NOTE: the name kwargs isn't special -- here I use some_params instead
def my_func(**some_params):
    return some_params
In [ ]:
my_func(b=10)
In [ ]:
my_func(a=1, b=2)
Exercise (2 min): Experiment with my_func to make sure you understand how it works. You might try these things out:

- Does my_func(1) work?
- What is x in x = my_func(a=1, b=2)?
- What is x in x = my_func()?

Let's test out our bea_request function by calling the GetDataSetList method.
First, we need to check the methods page of the documentation to make sure we don't need any additional parameters. Looks like this one doesn't. Let's call it and see what we get
In [ ]:
datasets_raw = bea_request("GetDataSetList")
type(datasets_raw)
In [ ]:
# did the request succeed?
datasets_raw.ok
In [ ]:
# status code 200 means success!
datasets_raw.status_code
In [ ]:
datasets_raw.content
The actual data returned from the BEA website is contained in datasets_raw.content. This will be a JSON object (remember the plotly notebook), but it can be converted into a python dict by calling the json method:
In [ ]:
datasets_raw_dict = datasets_raw.json()
print("length of datasets_raw_dict:", len(datasets_raw_dict))
datasets_raw_dict
Notice that this dict has one item. The key is BEAAPI
. The value is another dict. Let's take a look inside this one
In [ ]:
datasets_dict = datasets_raw_dict["BEAAPI"]
print("length of datasets_dict:", len(datasets_dict))
datasets_dict
The value here is another dict, this time with two keys:
Request
: gives details regarding the API request we made -- we'll throw this one awayResults
: The actual data. Let's pull the data into a dataframe so we can see what we are working with
In [ ]:
datasets = pd.DataFrame(datasets_dict["Results"]["Dataset"])
datasets
What we have here is a mapping from a DatasetName
to a description of that dataset. This is helpful as we'll use it later on when we actually want to get our data.
Exercise (4 min): Read the documentation for the GetData API method (here) and determine which additional parameters it accepts and which of them are required (HINT: the GetParameterList method can also help here).

Let's put this to practice and actually get some data.
Suppose I wanted to get data on the expenditure formula for GDP. You might remember from econ 101 that this is:

$$GDP = C + G + I + NX$$

where $GDP$ is GDP, $C$ is personal consumption, $G$ is government spending, $I$ is investment, and $NX$ is net exports.
All of these variables are available from the BEA in the national income and product accounts (NIPA) tables. Let's see what parameters are required to use the GetData method when DataSetName=NIPA (NOTE: I'm not walking us through what the response looks like this time -- I'll just write the code that gets us to the result)
In [ ]:
nipa_params_raw = bea_request("GetParameterList", DataSetName="NIPA")
nipa_params = pd.DataFrame(nipa_params_raw.json()["BEAAPI"]["Results"]["Parameter"])
nipa_params
The ParameterName column above tells us the name of all additional parameters we can send to GetData.

The ParameterIsRequiredFlag has a 1 if that parameter is required and a 0 if it is optional.

Finally, the ParameterDataType tells us what type the value of each parameter should be.
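As a quick sanity check, we can filter that table down to just the required parameters. (A small sketch -- I'm assuming the flag comes back as the string "1", since JSON payloads often encode numbers as strings; if it is an integer, compare against 1 instead.)

In [ ]:
# hypothetical check: keep only the parameters flagged as required
required = nipa_params[nipa_params["ParameterIsRequiredFlag"] == "1"]
required[["ParameterName", "ParameterDataType"]]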
I did a bit of digging and found that the GDP data we are after lives in table 6. Let's get quarterly data for 1990 to 2016.
In [ ]:
gdp_data = bea_request("GetData", DataSetName="NIPA",
                       TableId=6,
                       Frequency="Q",
                       Year=list(range(1990, 2017)))
In [ ]:
# check to make sure we have a 200, meaning success
gdp_data.status_code
In [ ]:
# extract the results and read into a DataFrame
gdp = pd.DataFrame(gdp_data.json()["BEAAPI"]["Results"]["Data"])
print("The shape of gdp is", gdp.shape)
gdp.head()
The important columns for us are going to be DataValue, SeriesCode, and TimePeriod. I did a bit more digging and found that the series codes map into our variables as follows:
In [ ]:
gdp_names = {"DPCERX": "C",
             "A191RX": "GDP",
             "A019RX": "NX",
             "A006RX": "I",
             "A822RX": "G"}
Let's insert the names we know into the SeriesCode column using the replace method:
In [ ]:
gdp.iloc[[0, 107, 498, 1102, 1672], :]
In [ ]:
gdp["SeriesCode"] = gdp["SeriesCode"].replace(gdp_names)
gdp.iloc[[0, 107, 498, 1102, 1672], :]
Exercise (10 min) WARNING: this is a long exercise, but it should make you use tools from almost every lecture of the last 6 weeks.

Our want is a DataFrame where:

- the row labels are the time periods, stored in a DatetimeIndex
- there is one column per variable, with all data stored as float64

Here's an outline of how I would do this:

- Keep only the rows whose SeriesCode is one of our variables (GDP, C, G, etc.)
- drop all columns we don't need
- Convert the TimePeriod column into dates (using pd.to_datetime -- you may need some str methods here)
- Use set_index and unstack to get the correct row and column labels (HINT: You might have ended up with 2 levels on your column index (I did) -- drop the one for DataValue if necessary)

Test out how well this went by plotting the DataFrame. A solution sketch follows the empty cell below.
In [ ]:
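If you get stuck, here is one possible solution sketch. It assumes TimePeriod strings look like "1990Q1" and that DataValue comes back as strings with thousands separators (e.g. "5,974.7") -- check gdp.head() to verify before leaning on it.

In [ ]:
# one possible solution -- see the assumptions above
gdp2 = gdp[gdp["SeriesCode"].isin(gdp_names.values())].copy()  # keep our 5 series
gdp2 = gdp2[["SeriesCode", "TimePeriod", "DataValue"]]         # drop unneeded columns

# turn "1990Q1" into a date: quarter q starts in month 3*q - 2
year = gdp2["TimePeriod"].str[:4]
month = (gdp2["TimePeriod"].str[-1].astype(int) * 3 - 2).astype(str)
gdp2["TimePeriod"] = pd.to_datetime(year + "-" + month + "-1")

# strip thousands separators so the data can be float64
gdp2["DataValue"] = gdp2["DataValue"].str.replace(",", "").astype("float64")

# one row per date, one column per variable
gdp2 = gdp2.set_index(["TimePeriod", "SeriesCode"]).unstack()
gdp2.columns = gdp2.columns.droplevel(0)  # drop the DataValue level
gdp2.plot()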
The Open Data Network is a collection of cities, states, and federal government agencies that have all opened access to their data using the same tools. If you follow the link to the Open Data Network, there is a list of all participating cities at the bottom. It includes New York City, Chicago, Boston, Houston, and many more.

The tool all of these cities are using to open up their data is called Socrata. One of the benefits of everyone using the same tool is that we can access the various datasets through the same API.
The general API documentation can be found here. Let's open this up and see whether we can extract the important pieces of information that we'd like. We need to find two things: the root url (what this API calls an endpoint) and the parameters we can pass.
This API has some nice features that you won't necessarily get with other APIs. One of these is that it returns data as a json file. Lucky for us, pandas knows how to read this type of file, so when we interact with the Open Data Network (or any other Socrata-based dataset) we can just use pd.read_json instead of what we showed in the previous example.
The documentation starts by discussing "API Endpoints." An API endpoint is just the thing that we have been calling the root url -- the address that we use to make our requests. Each dataset will have a different API endpoint because the datasets are hosted by different organizations (different cities/states/agencies).

One example of an API endpoint is https://data.cityofchicago.org/. We could find this by going to the Open Data Network site and searching "Chicago crime."
The types of parameters that we need to pass will depend on the dataset that we are using. The only way you'll understand all of these parameters is by carefully reading the docs -- if you ask too many questions without having read the documentation, some people online may tell you to RTFD. I will describe a few of them here, though.
Socrata has created a system that allows you to use parameters to limit the data you get back. Many of these act like SQL queries and, in a nod to this, the functionality is called SoQL queries. It allows you to do things like:

- select only certain columns
- filter rows based on a condition
- limit the number of observations returned (much like df.head())

You also have access to some more parameters that handle authorization, like an app_token.
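For instance, two of the SoQL parameters are $select and $limit. Here's a small sketch using them against the Chicago endpoint that appears in the next example (the column choices are just for illustration):

In [ ]:
# ask Socrata for only two columns and at most 5 rows
url = ("https://data.cityofchicago.org/resource/6zsd-86xi.json"
       "?$select=case_number,date&$limit=5")
pd.read_json(url)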
Below, we read in the first 25,000 rows of the data on all crimes in Chicago since 2001.
In [ ]:
chi_api = "https://data.cityofchicago.org/"  # Chicago's API endpoint
chi_crime_url = chi_api + "resource/6zsd-86xi.json?$limit=25000"  # first 25,000 rows
chi_df = pd.read_json(chi_crime_url)
chi_df.head()[["arrest", "case_number", "community_area", "date"]]
Exercise: Find the API endpoint for Boston crime (use the Crime Incident Reports July 2012-August 2015 data).
Exercise: Read in the first 50 observations of the Boston crime dataset into a dataframe named bos_df
In [ ]:
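If you get stuck, here is a sketch of what the answer should look like. NOTE: the endpoint below is Boston's Socrata site, but the dataset id is an assumption on my part -- replace it with the id you find by searching the Open Data Network.

In [ ]:
bos_api = "https://data.cityofboston.gov/"
# NOTE: the dataset id here is assumed -- swap in the one you found
bos_crime_url = bos_api + "resource/7cdf-6fgx.json?$limit=50"
bos_df = pd.read_json(bos_crime_url)
bos_df.head()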
We can now look at the dtypes of everything in these two datasets and see what information is contained in them.
In [ ]:
bos_df.dtypes
In [ ]:
chi_df.dtypes
In [ ]:
# use the date of each incident as the index
chi_df = chi_df.set_index("date")
In [ ]:
# resample to monthly frequency and count the reported cases in each month
cases_per_month = chi_df.resample("M").count()["case_number"]
In [ ]:
cases_per_month.plot()
In [ ]: