Accessing Databases via Web APIs


In this lesson we'll learn what an API (Application Programming Interface) is, how it's normally used, and how we can collect data from it. We'll then look at how R can help us quickly gather data from APIs, parse the data, and write to a CSV. There are four sections:

  1. Constructing an API GET request
  2. Parsing the JSON response
  3. Looping through result pages
  4. Exporting to CSV

First we'll import the required R libraries:


In [ ]:
library(data.table)  # for rbindlist to merge list of data frames

1. Constructing an API GET request


We're going to use the New York Times API. You'll need to first sign up for an API key.

We know that every call to any API will require us to provide:

  1. a base URL for the API,
  2. (usually) some authorization code or key, and
  3. a format for the response.

Let's write this information to some variables:


In [ ]:
# set API key var
key <- ""  # paste your NYT API key here

# set base url var
base.url <- "http://api.nytimes.com/svc/search/v2/articlesearch"

# set response format var
response.format <- ".json"

These variables hold the basic building blocks of our GET request.

You often want to send some sort of data in the URL’s query string. This data tells the API what information you want.


In [ ]:
# specify a main search term (q)
search.term <- 'Duke Ellington'
search.term <- URLencode(URL = search.term, reserved = TRUE)
print(search.term)

Because URLs cannot contain spaces, the space between our search words has to be encoded as %20. You can look up the percent-encodings for other special characters, but luckily R's URLencode does all of this for us.
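
For example, here is how URLencode handles a couple of made-up strings (a quick illustration of the difference the reserved argument makes):


In [ ]:
# spaces become %20; with reserved = TRUE, characters such as & are encoded too
print(URLencode("Duke Ellington", reserved = TRUE))   # "Duke%20Ellington"
print(URLencode("rhythm & blues", reserved = TRUE))   # "rhythm%20%26%20blues"
print(URLencode("rhythm & blues", reserved = FALSE))  # & is left alone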


In [ ]:
# paste components together to create URL for get request
get.request <- paste0(base.url, response.format, "?", "q=", search.term, "&api-key=", key)
print(get.request)

Try copying and pasting this URL in your browser and see what you get. It looks like a mess! This is JSON. While it may look scary, it's just a tree structure of key and value pairs.

NOTE: We know the parameters above are q and api-key from the NYT API Documentation. You can add all the parameters you want from there!
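
Stripped down, the JSON you saw in the browser has roughly this tree shape (only the fields we use later are shown; everything else is elided):

  {
    "response": {
      "docs": [ { ...fields for one article... }, ... ],
      "meta": { "hits": ... }
    },
    ...
  }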

Now let's actually make the GET request in R:


In [ ]:
# send the GET request using httr package
response <- httr::GET(url = get.request)
print(response)
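
Printing the response shows the HTTP status code, among other things. If you want to check programmatically that the request succeeded before parsing it, httr has helpers for that (an optional sanity check):


In [ ]:
# 200 means success; codes in the 400s/500s usually mean a bad key or malformed request
print(httr::status_code(response))
print(httr::http_error(response))  # TRUE if the status code indicates an error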

Then we use httr's content function to extract the body of the response as text:


In [ ]:
# inspect the content of the response, parsing the result as text
response <- httr::content(x = response, as = "text")
substr(x = response, start = 1, stop = 1000)

2. Parsing the JSON response

We can then use the jsonlite package to parse the JSON into a dataframe:


In [ ]:
# convert JSON response to a dataframe
response.df <- jsonlite::fromJSON(txt = response, simplifyDataFrame = TRUE, flatten = TRUE)

# inspect the dataframe
str(response.df, max.level = 3)

# get number of hits
print(response.df$response$meta$hits)

If we're going to make multiple requests, we can put all of this into a function:


In [ ]:
nytapi <- function(search.terms=NULL, begin.date=NULL, end.date=NULL, page=NULL,
                     base.url="http://api.nytimes.com/svc/search/v2/articlesearch",
                     response.format=".json",
                     key=""){

  # combine parameters
  params <- list(
    c("q", search.terms),
    c("begin_date", begin.date),
    c("end_date", end.date),
    c("page", page)
  )
  params <- params[sapply(X = params, length) > 1]  # drop parameters that were left NULL
  params <- sapply(X = params, FUN = paste0, collapse="=")
  params <- paste0(params, collapse="&")
  
  # URL-encode the query portion (reserved = FALSE leaves the & and = separators intact)
  query <- URLencode(URL = params, reserved = FALSE)

  # combine with base url and other options
  get.request <- paste0(base.url, response.format, "?", query, "&api-key=", key)
  
  # send GET request
  response <- httr::GET(url = get.request)
  
  # parse response to JSON
  response <- httr::content(response, "text")  
  response <- jsonlite::fromJSON(txt = response, simplifyDataFrame = TRUE, flatten = TRUE)
  
  return(response)
}

To cycle through pages, we'll need to figure out how many hits there are in total and divide by the 10 results returned per page (rounding up) to get the number of calls.


In [ ]:
# get number of hits, number of page queries
duke <- nytapi(search.terms = "Duke Ellington", begin.date = 20150101, end.date = 20160101)

hits <- duke$response$meta$hits
print(hits)

calls <- ceiling(hits / 10)  # 10 results per page, so round up
print(calls)

3. Looping through result pages

So there are 97 hits, which means we'll need to make 10 calls (for pages 0 through 9).


In [ ]:
# get all articles   
duke.articles <- lapply(X = 0:(calls - 1), FUN = function(page){  # pages are 0-indexed
  response <- tryCatch(expr = {
    r <- nytapi(search.terms = "Duke Ellington", begin.date = 20150101, end.date = 20160101, page = page)
    r$response$docs
  }, error=function(e) NULL)
  Sys.sleep(5)  # sleep between each call!
  return(response)
})

We can combine the list of dataframes with data.table's rbindlist, which fills in missing columns with NA:


In [ ]:
duke.all <- rbindlist(duke.articles, fill = TRUE)
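
Before exporting, it's worth a quick sanity check on the merged table (the exact columns depend on what the API returned):


In [ ]:
# how many articles did we retrieve, and which fields do we have?
print(nrow(duke.all))
print(names(duke.all))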

4. Exporting to CSV

Some columns in the merged dataframe may be lists, so we convert every column to character before writing it to a CSV:


In [ ]:
write.csv(apply(duke.all, 2, as.character), file = "all-articles.csv")
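
If you want to confirm the export worked, read the file back in and check its dimensions:


In [ ]:
# read the CSV back in; it should have the same number of rows as duke.all
articles <- read.csv("all-articles.csv")
dim(articles)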