All APIs: http://developer.nytimes.com/
Article Search API: http://developer.nytimes.com/article_search_v2.json
Best Sellers API: http://developer.nytimes.com/books_api.json#/Documentation
Test/build queries: http://developer.nytimes.com/
Tip: Remember to include your API key in all requests! You'll need to register to get one, and note that the interactive query-builder on the site is fairly unreliable.
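One way to keep the key out of hand-built query strings is to let `requests` assemble the URL from a `params` dict (the `api-key` parameter name comes from the NYT docs; the key value below is a placeholder):

```python
import requests

# Passing parameters as a dict lets requests handle the URL encoding,
# instead of concatenating strings by hand. 'YOUR_API_KEY' is a
# placeholder -- substitute your own key.
params = {
    'q': 'libya',
    'api-key': 'YOUR_API_KEY',
}
url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'
# Preparing the request (without sending it) shows the encoded URL
prepared = requests.Request('GET', url, params=params).prepare()
print(prepared.url)
```

In the real calls below you would use `requests.get(url, params=params)` directly; preparing the request is just a way to inspect the URL.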
In [2]:
import requests
1) What books topped the Hardcover Fiction NYT best-sellers list on Mother's Day in 2009 and 2010? How about Father's Day?
In [72]:
dates = ['2009-05-10', '2010-05-09', '2009-06-21', '2010-06-20']
for date in dates:
    response = requests.get('https://api.nytimes.com/svc/books/v3/lists.json?list=hardcover-fiction&published-date=' + date + '&api-key=1a25289d587a49b7ba8128badd7088a2')
    data = response.json()
    print('On', date, 'this was the hardcover fiction NYT best-sellers list:')
    for item in data['results']:
        for book in item['book_details']:
            print(book['title'])
    print('')
2) What are all the different book categories the NYT ranked in June 6, 2009? How about June 6, 2015?
In [90]:
cat_dates = ['2009-06-06', '2015-06-06']
for date in cat_dates:
    cat_response = requests.get('https://api.nytimes.com/svc/books/v3/lists/names.json?published-date=' + date + '&api-key=1a25289d587a49b7ba8128badd7088a2')
    cat_data = cat_response.json()
    print('On', date + ', these were the different book categories the NYT ranked:')
    categories = []
    for result in cat_data['results']:
        categories.append(result['list_name'])
    print(', '.join(set(categories)))
    print('')
3) Muammar Gaddafi's name can be transliterated many, many ways. His last name alone is the source of a million and one versions - Gadafi, Gaddafi, Kadafi, and Qaddafi, to name a few. How many times has the New York Times referred to him by each of those names?
Tip: Add "Libya" to your search to make sure (-ish) you're talking about the right guy.
In [195]:
gaddafis = ['Gadafi', 'Gaddafi', 'Kadafi', 'Qaddafi']
for gaddafi in gaddafis:
    g_response = requests.get('https://api.nytimes.com/svc/search/v2/articlesearch.json?q=' + gaddafi + '+libya&api-key=1a25289d587a49b7ba8128badd7088a2')
    g_data = g_response.json()
    print('There are', g_data['response']['meta']['hits'], 'instances of the spelling', gaddafi + '.')
In [1]:
# TA-COMMENT: As per usual, your commented code is excellent! I love how you're thinking through what might work.
In [205]:
# #HELP try 1.
# #Doesn't show next pages.
# gaddafis = ['Gadafi', 'Gaddafi', 'Kadafi', 'Qaddafi']
# for gaddafi in gaddafis:
#     g_response = requests.get('https://api.nytimes.com/svc/search/v2/articlesearch.json?q=' + gaddafi + '+libya&page=0&api-key=1a25289d587a49b7ba8128badd7088a2')
#     g_data = g_response.json()
#     print('There are', len(g_data['response']['docs']), 'instances of the spelling', gaddafi)
In [206]:
# #HELP try 2. What I want to do next is
# #if the number of articles != 10 , stop
# #else, add 1 to the page number
# #Tell it to loop until the end result is not 10
# #but right now it keeps crashing
# #Maybe try by powers of 2.
# import time, sys
# pages = range(400)
# total_articles = 0
# for page in pages:
#     g_response = requests.get('https://api.nytimes.com/svc/search/v2/articlesearch.json?q=gaddafi+libya&page=' + str(page) + '&api-key=1a25289d587a49b7ba8128badd7088a2')
#     g_data = g_response.json()
#     articles_on_pg = len(g_data['response']['docs'])
#     total_articles = total_articles + articles_on_pg
#     print(total_articles)
#     time.sleep(0.6)
In [207]:
#HELP try 3. Trying by powers of 2.
#OMG does 'hits' mean the number of articles matching this text?? If so, where could I find that in the README??
# numbers = range(10)
# pages = []
# for number in numbers:
#     pages.append(2 ** number)
# #temp
# print(pages)
# import time, sys
# total_articles = 0
# for page in pages:
#     g_response = requests.get('https://api.nytimes.com/svc/search/v2/articlesearch.json?q=gaddafi+libya&page=' + str(page) + '&api-key=1a25289d587a49b7ba8128badd7088a2')
#     g_data = g_response.json()
#     articles_on_pg = len(g_data['response']['docs'])
#     #temp
#     meta_on_pg = g_data['response']['meta']
#     print(page, articles_on_pg, meta_on_pg)
#     time.sleep(1)
In [208]:
# #HELP (troubleshooting the page number that returns a keyerror)
# #By trial and error, it seems like "101" breaks it. 100 is fine.
# g_response = requests.get('https://api.nytimes.com/svc/search/v2/articlesearch.json?q=gadafi+libya&page=101&api-key=1a25289d587a49b7ba8128badd7088a2')
# g_data = g_response.json()
# articles_on_pg = len(g_data['response']['docs'])
# print(articles_on_pg)
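The experiments above suggest a cleaner approach: since `meta['hits']` reports the total match count and each page carries 10 docs, the number of pages can be computed up front instead of probing page by page. A small sketch (the 100-page cap matches the trial-and-error finding above that page 101 breaks):

```python
import math

def pages_needed(hits, page_size=10, max_page=100):
    # Number of result pages to fetch, given the total hit count
    # reported in response['meta']['hits']. The article search API
    # returns 10 docs per page and (as found above by trial and
    # error) rejects page numbers beyond 100, so we cap there.
    pages = math.ceil(hits / page_size)
    return min(pages, max_page + 1)  # pages 0..100 inclusive

print(pages_needed(312))   # 32 pages cover 312 hits
print(pages_needed(5000))  # capped at 101 fetchable pages
```

With this, the loop from "try 2" could run `for page in range(pages_needed(hits)):` instead of a fixed `range(400)` - though for a pure count, `hits` alone already answers the question without paging at all.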
4) What's the title of the first story to mention the word 'hipster' in 1995? What's the first paragraph?
In [161]:
hip_response = requests.get('https://api.nytimes.com/svc/search/v2/articlesearch.json?q=hipster&begin_date=19950101&sort=oldest&api-key=1a25289d587a49b7ba8128badd7088a2')
hip_data = hip_response.json()
first_hipster = hip_data['response']['docs'][0]
print('The first hipster article of 1995 was titled', first_hipster['headline']['main'] + '.\nCheck it out:\n' + first_hipster['lead_paragraph'])
5) How many times was gay marriage mentioned in the NYT between 1950-1959, 1960-1969, 1970-1979, 1980-1989, 1990-1999, 2000-2009, and 2010-present?
Tip: You'll want to put quotes around the search term so it isn't just looking for "gay" and "marriage" in the same article.
Tip: Write code to find the number of mentions between Jan 1, 1950 and Dec 31, 1959.
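Per the first tip, the double quotes (and the space between the words) need to be percent-encoded to survive inside a URL; `urllib.parse.quote` handles this:

```python
from urllib.parse import quote

# Wrapping the phrase in double quotes asks the API to match the
# exact phrase rather than the two words anywhere in the article;
# quote() percent-encodes the quotes and the space so the URL
# stays valid.
term = '"gay marriage"'
encoded = quote(term)
print(encoded)  # %22gay%20marriage%22
```

(Building the URL with `requests.get(url, params={'q': term, ...})` does this encoding automatically.)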
In [204]:
decade_range = range(6)
date_attributes = []
for decade in decade_range:
    date_attributes.append('begin_date=' + str(1950 + decade*10) + '0101&end_date=' + str(1959 + decade*10) + '1231')
date_attributes.append('begin_date=20100101')
for date in date_attributes:
    gm_response = requests.get('https://api.nytimes.com/svc/search/v2/articlesearch.json?q="gay+marriage"&' + date + '&api-key=1a25289d587a49b7ba8128badd7088a2')
    gm_data = gm_response.json()
    hits = gm_data['response']['meta']['hits']
    print(hits)
6) What section talks about motorcycles the most?
Tip: You'll be using facets
In [9]:
#I searched for motorcycle OR motorcycles
# for motorcycles:
# {'count': 10, 'term': 'New York and Region'}
# {'count': 7, 'term': 'World'}
# {'count': 6, 'term': 'Arts'}
# {'count': 6, 'term': 'Business'}
# {'count': 5, 'term': 'U.S.'}
# for motorcycle:
# {'count': 24, 'term': 'Sports'}
# {'count': 20, 'term': 'New York and Region'}
# {'count': 16, 'term': 'U.S.'}
# {'count': 14, 'term': 'Arts'}
# {'count': 8, 'term': 'Business'}
moto_response = requests.get('https://api.nytimes.com/svc/search/v2/articlesearch.json?q=motorcycle+OR+motorcycles&facet_field=section_name&api-key=1a25289d587a49b7ba8128badd7088a2')
moto_data = moto_response.json()
# #temp. Answer: dict
# print(type(moto_data))
# #temp. Answer: ['status', 'copyright', 'response']
# print(moto_data.keys())
# #temp. Answer: dict
# print(type(moto_data['response']))
# #temp. Answer: ['docs', 'meta', 'facets']
# print(moto_data['response'].keys())
# #temp. Answer: dict
# print(type(moto_data['response']['facets']))
# #temp. Answer: 'section_name'
# print(moto_data['response']['facets'].keys())
# #temp. Answer: dict
# print(type(moto_data['response']['facets']['section_name']))
# #temp. Answer:'terms'
# print(moto_data['response']['facets']['section_name'].keys())
# #temp. Answer: list
# print(type(moto_data['response']['facets']['section_name']['terms']))
# #temp. It's a list of dictionaries, with a count and a section name for each one.
# print(moto_data['response']['facets']['section_name']['terms'][0])
sections = moto_data['response']['facets']['section_name']['terms']
the_most = 0
for section in sections:
    if section['count'] > the_most:
        the_most = section['count']
        the_most_name = section['term']
print(the_most_name, 'talks about motorcycles the most, with', the_most, 'articles.')
# #Q: WHY DO SO FEW ARTICLES MENTION MOTORCYCLES?
# #A: MAYBE BECAUSE MANY ARTICLES AREN'T IN SECTIONS?
# #temp. Answer: {'hits': 312, 'offset': 0, 'time': 24}
# print(moto_data['response']['meta'])
# #temp. Answer: ['document_type', 'blog', 'multimedia', 'pub_date',
# #'news_desk', 'keywords', 'byline', '_id', 'headline', 'snippet',
# #'source', 'lead_paragraph', 'web_url', 'print_page', 'slideshow_credits',
# #'abstract', 'section_name', 'word_count', 'subsection_name', 'type_of_material']
# print(moto_data['response']['docs'][0].keys())
# #temp. Answer: Sports
# #print(moto_data['response']['docs'][0]['section_name'])
# #temp.
# # Sports
# # Sports
# # Sports
# # None
# # Multimedia/Photos
# # Multimedia/Photos
# # Multimedia/Photos
# # New York and Region
# # None
# # New York and Region
# # New York and Region
# for article in moto_data['response']['docs']:
# print(article['section_name'])
# #temp. 10. There are only 10 because only 10 show up in search results.
# print(len(moto_data['response']['docs']))
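The find-the-max loop above can also be written with `max()` and a key function, which avoids the accumulator variables. A sketch, using dummy facet entries in place of the live response:

```python
# Equivalent of the loop above, assuming `sections` is the list of
# {'count': ..., 'term': ...} dicts from the facet response.
# Dummy data stands in for the live API call here.
sections = [
    {'count': 24, 'term': 'Sports'},
    {'count': 20, 'term': 'New York and Region'},
    {'count': 16, 'term': 'U.S.'},
]
top = max(sections, key=lambda s: s['count'])
print(top['term'], 'talks about motorcycles the most, with', top['count'], 'articles.')
```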
7) How many of the last 20 movies reviewed by the NYT were Critics' Picks? How about the last 40? The last 60?
Tip: You really don't want to do this 3 separate times (1-20, 21-40 and 41-60) and add them together. What if, perhaps, you were able to figure out how to combine two lists? Then you could have a 1-20 list, a 1-40 list, and a 1-60 list, and then just run similar code for each of them.
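The tip can be sketched like this: build one running list across the three API calls, then slice it into the 1-20, 1-40, and 1-60 views. Dummy review dicts stand in for the API results (`critics_pick` is truthy for a Critics' Pick, as in the real response):

```python
# Combine the three 20-review batches into one list, then slice.
all_reviews = []
for offset in (0, 20, 40):
    # in the real code, each batch comes from one API call at this offset;
    # here every third dummy review is a pick
    batch = [{'critics_pick': i % 3 == 0} for i in range(offset, offset + 20)]
    all_reviews = all_reviews + batch   # list concatenation combines batches

for n in (20, 40, 60):
    picks = sum(1 for r in all_reviews[:n] if r['critics_pick'])
    print('Of the last', n, 'reviews,', picks, "were Critics' Picks.")
```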
In [286]:
offsets = range(3)
picks_by_group = []
for offset in offsets:
    picks_response = requests.get('https://api.nytimes.com/svc/movies/v2/reviews/search.json?offset=' + str(offset * 20) + '&api-key=1a25289d587a49b7ba8128badd7088a2')
    picks_data = picks_response.json()
    results = picks_data['results']
    picks = 0
    for result in results:
        if result['critics_pick'] == 1:
            picks = picks + 1
    picks_by_group.append(picks)
    print('In the most recent', offset * 20, 'to', offset * 20 + 20, 'movies, the critics liked', picks, 'movies.')
    print('In the past', (offset + 1) * 20, 'reviews, the critics liked', sum(picks_by_group), 'movies.')
    print('')
In [ ]:
# #temp. Answer: ['has_more', 'status', 'results', 'copyright', 'num_results']
# print(picks_data.keys())
# #temp. 20
# #not what we're looking for
# print(picks_data['num_results'])
# #temp. Answer: list
# print(type(picks_data['results']))
# #temp.
# print(picks_data['results'][0])
# #temp. Answer: ['display_title', 'headline', 'mpaa_rating', 'critics_pick',
# #'publication_date', 'link', 'summary_short', 'byline', 'opening_date', 'multimedia', 'date_updated']
# print(picks_data['results'][0].keys())
8) Out of the last 40 movie reviews from the NYT, which critic has written the most reviews?
In [287]:
offsets = range(2)
bylines = []
for offset in offsets:
    picks_response = requests.get('https://api.nytimes.com/svc/movies/v2/reviews/search.json?offset=' + str(offset * 20) + '&api-key=1a25289d587a49b7ba8128badd7088a2')
    picks_data = picks_response.json()
    for result in picks_data['results']:
        bylines.append(result['byline'])
print(bylines)
In [316]:
# I tried Counter, but there were two most common results, and it only gave me one.
# from collections import Counter
# print(Counter(bylines))
# print(Counter(bylines).most_common(1))
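`Counter` can still handle the tie: `most_common(1)` gives the top count, and a comparison against that count recovers every name that reached it. A sketch with dummy bylines standing in for the real list:

```python
from collections import Counter

# most_common(1) returns only one entry even when several names tie;
# comparing every count against the top count recovers all the ties.
# (Dummy bylines stand in for the real 40-review list.)
bylines = ['A. O. SCOTT', 'MANOHLA DARGIS', 'A. O. SCOTT',
           'MANOHLA DARGIS', 'BEN KENIGSBERG']
counts = Counter(bylines)
top = counts.most_common(1)[0][1]          # the highest count
winners = [name for name, n in counts.items() if n == top]
print(winners)  # every critic tied at the top, not just one
```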
In [326]:
sorted_bylines = sorted(bylines)
numbers = range(40)
most_bylines = 0
for number in numbers:
    if most_bylines < sorted_bylines.count(sorted_bylines[number]):
        most_bylines = sorted_bylines.count(sorted_bylines[number])
for number in numbers:
    if most_bylines == sorted_bylines.count(sorted_bylines[number]) and (number == 0 or sorted_bylines[number] != sorted_bylines[number - 1]):
        print(sorted_bylines[number], sorted_bylines.count(sorted_bylines[number]))