In this lesson we'll learn what an API (Application Programming Interface) is, how it's normally used, and how we can collect data from it. We'll then look at how R can help us quickly gather data from APIs, parse the data, and write to a CSV. There are four sections:
First we'll import the required R libraries:
In [ ]:
library(data.table) # for rbindlist to merge a list of data frames
# httr and jsonlite are also used below (via the :: operator), so they should be installed
We're going to use the New York Times API. You'll need to first sign up for an API key.
We know that every call to this API will require us to provide an API key, a base URL, and a response format.
Let's write this information to some variables:
In [ ]:
# set API key var
key <- ""
# set base url var
base.url <- "http://api.nytimes.com/svc/search/v2/articlesearch"
# set response format var
response.format <- ".json"
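As a side note, rather than pasting your key directly into the script, you could read it from an environment variable instead (a common practice; the variable name NYT_KEY below is just an example):
In [ ]:
# optional: keep the key out of the script by reading it from an environment
# variable (NYT_KEY is an arbitrary example name)
key <- Sys.getenv("NYT_KEY")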
These variables hold the basic building blocks of our GET request.
You often want to send some sort of data in the URL’s query string. This data tells the API what information you want.
In [ ]:
# specify a main search term (q)
search.term <- 'Duke Ellington'
search.term <- URLencode(URL = search.term, reserved = TRUE)
print(search.term)
Because URLs cannot contain spaces, the space between our search words has to be encoded as %20. You can look up the other symbols used in URL percent-encoding, but luckily R's URLencode does all of this for us.
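To see what the encoding looks like with other reserved characters, here is a quick illustration (this snippet is just for demonstration and isn't part of the request we're building):
In [ ]:
# reserved = TRUE encodes reserved characters such as "&" as well as spaces
URLencode("Duke Ellington & Count Basie", reserved = TRUE)   # "Duke%20Ellington%20%26%20Count%20Basie"
URLencode("Duke Ellington & Count Basie", reserved = FALSE)  # "&" is left alone; spaces still become %20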
In [ ]:
# paste components together to create URL for get request
get.request <- paste0(base.url, response.format, "?", "q=", search.term, "&api-key=", key)
print(get.request)
Try copying and pasting this URL into your browser and see what you get. It looks like a mess! This is JSON. While it may look scary, it's just a tree structure of key-value pairs.
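To see the idea, here is a tiny made-up JSON string (a hypothetical sketch, not actual NYT output) parsed into nested R lists with jsonlite, which we'll use for real below:
In [ ]:
# a simplified, hypothetical JSON string just to illustrate the key-value tree
toy.json <- '{"response": {"meta": {"hits": 2}, "docs": [{"headline": "A"}, {"headline": "B"}]}}'
str(jsonlite::fromJSON(txt = toy.json))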
NOTE: We know the parameters above are q and api-key from the NYT API documentation. You can add all the parameters you want from there!
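For example, parameters like begin_date and page (both documented by the NYT API and used later in this lesson) would simply be appended to the query string. A rough sketch:
In [ ]:
# hypothetical example of appending extra documented parameters to the query string
get.request.extra <- paste0(base.url, response.format, "?",
                            "q=", search.term,
                            "&begin_date=20150101",
                            "&page=0",
                            "&api-key=", key)
print(get.request.extra)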
Now let's actually make the GET request in R:
In [ ]:
# send the GET request using httr package
response <- httr::GET(url = get.request)
print(response)
Then use httr's content function to read the text of the response:
In [ ]:
# inspect the content of the response, parsing the result as text
response <- httr::content(x = response, as = "text")
substr(x = response, start = 1, stop = 1000)
We can then use the jsonlite package to store the JSON in a dataframe:
In [ ]:
# convert JSON response to a dataframe
response.df <- jsonlite::fromJSON(txt = response, simplifyDataFrame = TRUE, flatten = TRUE)
# inspect the dataframe
str(response.df, max.level = 3)
# get number of hits
print(response.df$response$meta$hits)
If we're going to make multiple requests, we can put all of this into a function:
In [ ]:
nytapi <- function(search.terms = NULL, begin.date = NULL, end.date = NULL, page = NULL,
                   base.url = "http://api.nytimes.com/svc/search/v2/articlesearch",
                   response.format = ".json",
                   key = "") {
  # combine parameters, keeping only those that were supplied
  params <- list(
    c("q", search.terms),
    c("begin_date", begin.date),
    c("end_date", end.date),
    c("page", page)
  )
  params <- params[sapply(X = params, FUN = length) > 1]
  params <- sapply(X = params, FUN = paste0, collapse = "=")
  params <- paste0(params, collapse = "&")
  # URL encode the query portion
  query <- URLencode(URL = params, reserved = FALSE)
  # combine with the base url and other options
  get.request <- paste0(base.url, response.format, "?", query, "&api-key=", key)
  # send GET request
  response <- httr::GET(url = get.request)
  # parse the response as JSON
  response <- httr::content(x = response, as = "text")
  response <- jsonlite::fromJSON(txt = response, simplifyDataFrame = TRUE, flatten = TRUE)
  return(response)
}
To cycle through the pages, we'll need to figure out how many hits there are in total and divide that by 10, the number of results the API returns per page.
In [ ]:
# get number of hits, number of page queries
duke <- nytapi(search.terms = "Duke Ellington", begin.date = 20150101, end.date = 20160101)
hits <- duke$response$meta$hits
print(hits)
calls <- ceiling(hits/10)
print(calls)
So there are 97 hits, which means we'll need to make 10 calls.
In [ ]:
# get all articles
duke.articles <- sapply(X = 0:(calls - 1), FUN = function(page){
  response <- tryCatch(expr = {
    r <- nytapi(search.terms = "Duke Ellington", begin.date = 20150101, end.date = 20160101, page = page)
    r$response$docs
  }, error = function(e) NULL)
  Sys.sleep(5) # sleep between each call!
  return(response)
})
We can combine the list of dataframes and fill in missing values with data.table's rbindlist:
In [ ]:
duke.all <- rbindlist(duke.articles, fill = TRUE)
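As a quick check before writing out, we can see how many articles we collected and which fields came back (the exact field names depend on the NYT response):
In [ ]:
# quick sanity check on the merged result
nrow(duke.all)
names(duke.all)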
Now we can convert each column to character (so any list columns are written out cleanly) and save the merged dataframe to a CSV:
In [ ]:
write.csv(apply(duke.all, 2, as.character), file = "all-articles.csv")
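To confirm everything was saved, we can read the CSV back in (a quick check, assuming the file was written to the current working directory):
In [ ]:
# read the CSV back in as a sanity check
articles <- read.csv("all-articles.csv", stringsAsFactors = FALSE)
dim(articles)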