In [1]:
# Import required libraries
from __future__ import division  # must come before the other imports
import requests
import json
import math
import csv
import matplotlib.pyplot as plt
First, we know that every call will require us to provide an API key, a base URL, and a response format. So let's store those in some variables.
Use the following demonstration keys for now, but in the future, get your own!
In [26]:
# set key
key="be8992a420bfd16cf65e8757f77a5403:8:44644296"
# set base url
base_url="http://api.nytimes.com/svc/search/v2/articlesearch"
# set response format
response_format=".json"
You often want to send some sort of data in the URL's query string. This data tells the API what information you want. In our case, we want articles about Duke Ellington. Requests allows you to provide these arguments as a dictionary, using the params keyword argument. In addition to the search term q, we have to put in the api-key term.
In [23]:
# set search parameters
search_params = {"q":"Duke Ellington",
"api-key":key}
Now we're ready to make the request. We use the .get method from the requests library to make an HTTP GET request.
In [24]:
# make request
r = requests.get(base_url+response_format, params=search_params)
Now we have a response object called r. We can get all the information we need from this object. For instance, we can see that the URL has been correctly encoded by printing it. Click on the link to see what happens.
In [25]:
print(r.url)
Click on that link to see what it returns!
What if we only want to search within a particular date range? The NYT Article Search API allows us to specify start and end dates.
Alter the search_params code above so that the request only searches for articles in the year 2015.
You'll need to look at the documentation here to see how to do this.
In [ ]:
# set date parameters here
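If you get stuck, here is one possible answer as a sketch. The begin_date and end_date parameter names (with dates in YYYYMMDD format) are the same ones used in the review code later in this notebook, so treat this as a hint rather than the only way to do it.
In [ ]:
# a sketch: restrict the search to articles from 2015
# using the begin_date / end_date parameters (YYYYMMDD format)
search_params = {"q": "Duke Ellington",
                 "api-key": key,
                 "begin_date": "20150101",
                 "end_date": "20151231"}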
In [8]:
# Uncomment to test
# r = requests.get(base_url+response_format, params=search_params)
# print(r.url)
In [ ]:
# set page parameters here
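The API returns only 10 results at a time; the page parameter (the same one used in the loop further down) selects which batch of 10 you get. If you need a hint, here is a possible sketch that asks for the second page of results:
In [ ]:
# a sketch: ask for the second page of results
# (pages are numbered from 0, so page 1 is the second batch of 10)
search_params["page"] = 1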
In [9]:
# Uncomment to test
# r = requests.get(base_url+response_format, params=search_params)
# print(r.url)
We can read the content of the server's response using .text.
In [10]:
# Inspect the content of the response, parsing the result as text
response_text= r.text
print(response_text[:1000])
What you see here is JSON text, encoded as Unicode text. JSON stands for "JavaScript Object Notation." It has a very similar structure to a Python dictionary -- both are built on key/value pairs. This makes it easy to convert a JSON response to a Python dictionary.
In [11]:
# Convert JSON response to a dictionary
data = json.loads(response_text)
# data
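If you do want to look at the whole thing, one way to make it a little more readable is to pretty-print it with json.dumps from the standard library (a quick sketch; the slice just keeps the output short):
In [ ]:
# pretty-print the first part of the dictionary to get a feel for its structure
print(json.dumps(data, indent=2)[:1000])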
That looks intimidating! But it's really just a big dictionary. Let's see what keys we got in there.
In [12]:
print(data.keys())
In [13]:
# this is boring
data['status']
Out[13]:
In [14]:
# so is this
data['copyright']
Out[14]:
In [15]:
# this is what we want!
# data['response']
In [19]:
data['response'].keys()
Out[19]:
In [20]:
data['response']['meta']['hits']
Out[20]:
In [21]:
# data['response']['docs']
type(data['response']['docs'])
Out[21]:
That looks like what we want! Let's put that in its own variable.
In [22]:
docs = data['response']['docs']
In [23]:
docs[0]
Out[23]:
That's great. But we only have 10 items. The original response said we had 93 hits! Which means we have to make 93/10 (rounded up to 10) requests to get them all. Sounds like a job for a loop!
But first, let's review what we've done so far.
In [24]:
# set key
key="ef9055ba947dd842effe0ecf5e338af9:15:72340235"
# set base url
base_url="http://api.nytimes.com/svc/search/v2/articlesearch"
# set response format
response_format=".json"
# set search parameters
search_params = {"q":"Duke Ellington",
"api-key":key,
"begin_date":"20150101", # date must be in YYYYMMDD format
"end_date":"20151231"}
# make request
r = requests.get(base_url+response_format, params=search_params)
# convert to a dictionary
data=json.loads(r.text)
# get number of hits
hits = data['response']['meta']['hits']
print("number of hits: ", str(hits))
# get number of pages
pages = int(math.ceil(hits/10))
# make an empty list where we'll hold all of our docs for every page
all_docs = []
# now we're ready to loop through the pages
for i in range(pages):
    print("collecting page", str(i))
    # set the page parameter
    search_params['page'] = i
    # make request
    r = requests.get(base_url+response_format, params=search_params)
    # get text and convert to a dictionary
    data = json.loads(r.text)
    # get just the docs
    docs = data['response']['docs']
    # add those docs to the big list
    all_docs = all_docs + docs
In [25]:
len(all_docs)
Out[25]:
In [27]:
# DEFINE YOUR FUNCTION HERE
In [29]:
# uncomment to test
# get_api_data("Duke Ellington", 2014)
In [30]:
all_docs[0]
Out[30]:
This is all great, but it's pretty messy. What we'd really like to have, eventually, is a CSV, with each row representing an article, and each column representing something about that article (headline, date, etc.). As we saw before, the best way to do this is to make a list of dictionaries, with each dictionary representing an article and each key representing a field of metadata from that article (e.g. headline, date, etc.). We can do this with a custom function:
In [31]:
def format_articles(unformatted_docs):
    '''
    This function takes in a list of documents returned by the NYT API
    and parses them into a list of dictionaries,
    with 'id', 'headline', and 'date' keys.
    '''
    formatted = []
    for i in unformatted_docs:
        dic = {}
        dic['id'] = i['_id']
        dic['headline'] = i['headline']['main']
        dic['date'] = i['pub_date'][0:10]  # cutting off the time of day
        formatted.append(dic)
    return formatted
In [32]:
all_formatted = format_articles(all_docs)
In [33]:
all_formatted[:5]
Out[33]:
Edit the function above so that we include the lead_paragraph and word_count fields.
HINT: Some articles may not contain a lead_paragraph, in which case trying to access that value (which doesn't exist) will throw an error. You need to add a conditional statement that takes this into consideration.
Advanced: Add another key that returns a list of keywords associated with the article.
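If the hint isn't enough, here is one possible shape for the missing lines (a sketch only; exactly how the API reports a missing lead paragraph is an assumption, so adjust the check as needed):
# a sketch: only add lead_paragraph if the article actually has one
if 'lead_paragraph' in i and i['lead_paragraph'] is not None:
    dic['lead_paragraph'] = i['lead_paragraph']
else:
    dic['lead_paragraph'] = None
dic['word_count'] = i['word_count']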
In [34]:
def format_articles(unformatted_docs):
    '''
    This function takes in a list of documents returned by the NYT API
    and parses them into a list of dictionaries,
    with 'id', 'headline', 'date', 'lead_paragraph' and 'word_count' keys.
    '''
    formatted = []
    for i in unformatted_docs:
        dic = {}
        dic['id'] = i['_id']
        dic['headline'] = i['headline']['main']
        dic['date'] = i['pub_date'][0:10]  # cutting off the time of day
        # YOUR CODE HERE
        formatted.append(dic)
    return formatted
In [41]:
# uncomment to test
# all_formatted = format_articles(all_docs)
# all_formatted[:5]
In [40]:
# use the keys of the first article as the CSV column names
keys = all_formatted[0].keys()
# write every article as a row
with open('all-formatted.csv', 'w') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_formatted)
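To check that the file looks the way we want (one row per article, one column per field), we can read it back in with csv.DictReader. This is just a quick sanity check, assuming the filename matches the one used above.
In [ ]:
# read the CSV back in and inspect the first row
with open('all-formatted.csv', 'r') as input_file:
    reader = csv.DictReader(input_file)
    rows = list(reader)
print(len(rows))
print(rows[0])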
In [42]:
# YOUR CODE HERE