In [ ]:
# In case you haven't installed the API
! pip install nytimesarticle

In [1]:
from nytimesarticle import articleAPI

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
import datetime
import csv
import math
import time
from ProgressBar import ProgressBar


/Users/kshain/anaconda/envs/AC209/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')


Obtaining the Data


Consumer Confidence Index


The consumer confidence index (CCI) is based on survey results from real consumers. They are asked their opinions of current and future economic conditions, as well as about their personal economic situation. The survey responses are encoded and normalized so that the 1985 results serve as a baseline of 100. These results are collected monthly by the Organisation for Economic Co-operation and Development (OECD) and can be downloaded directly as a CSV from https://data.oecd.org/leadind/consumer-confidence-index-cci.htm.
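As a quick sketch of loading that download with pandas (the file name 'CCI_OECD.csv' and the LOCATION/TIME/Value column names are assumptions about the OECD export, not verified here):

In [ ]:
# Sketch: load the OECD CCI export with pandas.
# 'CCI_OECD.csv' and the LOCATION/TIME/Value columns are assumptions about the export.
cci = pd.read_csv('CCI_OECD.csv')
cci = cci[cci['LOCATION'] == 'USA']        # keep a single country if several are included
cci['TIME'] = pd.to_datetime(cci['TIME'])  # monthly timestamps, e.g. 1990-01
cci = cci[['TIME', 'Value']].set_index('TIME')
cci.head()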

New York Times Articles


Article Search API

The New York Times Article Search API allows for searching and obtaining headlines and lead paragraphs of articles dating back to 1851. Along with each article comes metadata such as the publication date and the section in which it appeared. It is possible that not every article makes it into the database, but an inspection of recent articles finds on the order of tens of articles per day, which seems reasonable. The API returns the data as JSON, which can be used directly or transformed into CSV.

To access the API, one needs to obtain an API key from https://developer.nytimes.com/signup and install the client library using:

! pip install nytimesarticle

In [2]:
from nytimesarticle import articleAPI
api = articleAPI('ca372b5c9318406780fe9ebef28e96a1')
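To get a feel for the response structure, a single call (mirroring the searches used later in this notebook; this cell is just a sketch, not part of the pipeline) returns a nested dictionary parsed from the JSON:

In [ ]:
# Sketch: one search call with the same parameters used later in this notebook.
resp = api.search(fl=['_id', 'headline', 'lead_paragraph', 'pub_date'],
                  begin_date=19900101, end_date=19900108,
                  fq={'section_name': 'Business'}, page=1)
print resp['response']['meta']['hits']  # total number of matching articles
print resp['response']['docs'][0]       # first article: '_id', 'headline', 'pub_date', 'lead_paragraph'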

Peculiarities of the API

The first thing to note is the usage limits for the API. Calls are limited to 1,000 per day and 5 per second, so our function needs to sleep between calls. The trickier issue is that the API will only return 100 pages of results for any given search. This means that searching a year-long window yields too many results: you would only get the first few weeks before the 100 pages fill up. For this reason, we iterate through search windows of one week and monitor the number of pages found to make sure it never exceeds 100 for any given search.
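As a rough sanity check before committing to a window size, the hit count from a single search tells us how many 10-result pages it would take (a sketch; windowIsSafe is a hypothetical helper, not part of the download function below):

In [ ]:
# Sketch: verify that a candidate window stays within the 100-page limit
# (10 results per page), sleeping between calls to respect the rate limit.
def windowIsSafe(begin, end):
    hits = api.search(fl=['_id'], begin_date=begin, end_date=end,
                      fq={'section_name': 'Business'}, page=1)['response']['meta']['hits']
    time.sleep(1)  # well under the 5-calls-per-second limit
    return int(math.ceil(hits / 10.0)) <= 100

windowIsSafe(19900101, 19900108)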

Downloading the Data

We will save each year of data as a separate CSV. The steps for downloading the data to a CSV are as follows.

  1. Denote the first week-long interval to search
  2. Make an API call to search for articles from the business section in that week
  3. Check how many pages are returned by the search
  4. Iterate through the pages and the articles on each page
  5. Extract the data from the JSON and put it into CSV format
  6. After getting one week of data as a CSV, append it to the file

In [3]:
def downloadToFile(startdate, enddate, filename):
    """
    Makes API calls to extract id, publication date, headline, and lead paragraph from NY Times articles in the date range.
    Then, saves the data to a local file in csv format.
    startdate: start of date range to extract (yyyymmdd)
    enddate: end of date range to extract (yyyymmdd)
    filename: csv file to create and append to
    """
    
    startdate = datetime.datetime.strptime(str(startdate), '%Y%m%d')
    enddate = datetime.datetime.strptime(str(enddate), '%Y%m%d')

    sliceStart = startdate

    while (sliceStart<enddate):
        leads = []
        ids = []
        dates = []
        headlines = []
        
        sliceEnd = min(sliceStart + datetime.timedelta(weeks=1), enddate)

        sliceStartInt = int(sliceStart.strftime('%Y%m%d'))
        sliceEndInt = int(sliceEnd.strftime('%Y%m%d'))
        print 'Downloading from {} to {}'.format(sliceStartInt, sliceEndInt)
        while True:
            try:
                numhits = api.search(fl = ['_id'],begin_date = sliceStartInt, end_date=sliceEndInt,fq = {'section_name':'Business'}, page=1)['response']['meta']['hits']
                time.sleep(1)
                break
            except:
                print 'JSON error avoided'
                time.sleep(1) # back off briefly before retrying to respect the rate limit
        pages = int(math.ceil(float(numhits)/10))
        time.sleep(1)
        pbar2 = ProgressBar(pages)
        print '{} pages to download'.format(pages) # Note that you can't download past page number 100
        for page in range(1, min(pages, 100) + 1): # pages beyond 100 are not retrievable
            while True:
                try:
                    articles = api.search(fl= ['_id','headline','lead_paragraph','pub_date'], begin_date = sliceStartInt, end_date=sliceEndInt,fq = {'section_name':'Business'}, page=page)
                    time.sleep(1)
                    break
                except:
                    print 'JSON error avoided'
                    time.sleep(1) # back off briefly before retrying to respect the rate limit
            
            pbar2.increment()
            for i in articles['response']['docs']:
                if (i['lead_paragraph'] is not None) and (i['headline'] != []):
                    headlines.append(i['headline']['main'])
                    leads.append(i['lead_paragraph'])
                    ids.append(i['_id'])
                    dates.append(i['pub_date'])

        pbar2.finish()
        sliceStart = sliceEnd

        zipped = zip(ids, dates, headlines, leads)
        if zipped:
            with open(filename, "a") as f:
                writer = csv.writer(f)
                for line in zipped: 
                    writer.writerow([unicode(s).encode("utf-8") for s in line])

In [4]:
downloadToFile(19900101, 19900115, 'Sample_Output.csv')


Downloading from 19900101 to 19900108
39 pages to download
Complete! Total Elapsed time: 54.0 seconds                        
Downloading from 19900108 to 19900115
61 pages to download
Complete! Total Elapsed time: 82.5 seconds                        
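The sample above covers only two weeks. The yearly files read in the next section (e.g. 1990_Output.csv) would be produced with one call per year along these lines (a sketch; the exact ranges used to build those files are an assumption):

In [ ]:
# Sketch: one downloadToFile call per year to build the yearly CSVs used below.
for year in range(1990, 1992):
    downloadToFile(int('{}0101'.format(year)), int('{}1231'.format(year)),
                   '{}_Output.csv'.format(year))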

Working with the Files

Let's just check what we have in the files now. We can iterate over the yearly CSV files to make a dataframe with all of the data.


In [6]:
all_data_list = []
for year in range(1990,1992):
    data = pd.read_csv('{}_Output.csv'.format(year), header=None)
    all_data_list.append(data) # list of dataframes
data = pd.concat(all_data_list, axis=0)
data.columns = ['id','date','headline', 'lead']
data.head()


Out[6]:
id date headline lead
0 4fd1aa888eb7c8105d6c860e 1990-01-03T00:00:00Z Tandem Expected To Show Computer LEAD: Tandem Computers Inc. is expected to int...
1 52b85b7738f0d8094087c782 1990-01-03T00:00:00Z Chrysler Shows Van Concept LEAD: The Chrysler Corporation today introduce...
2 52b85b7638f0d8094087c780 1990-01-02T00:00:00Z Loan Pact Seen For Hungary LEAD: Hungary expects to complete a deal with ...
3 52b85b7538f0d8094087c77f 1990-01-02T00:00:00Z Counterattack Planned By Lawyers for Lincoln LEAD: Lawyers for Charles H. Keating Jr., who ...
4 4fd18d4c8eb7c8105d691815 1990-01-08T00:00:00Z Intermetrics Inc reports earnings for Qtr to N... LEAD: *3*** COMPANY REPORTS ** *3* Intermetric...
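
Since the CCI is monthly, a natural next step (an assumption about the downstream workflow, not shown here) is to parse the publication dates so articles can be grouped by month:

In [ ]:
# Sketch: parse publication dates and count articles per month,
# so the text data can later be lined up with the monthly CCI series.
data['date'] = pd.to_datetime(data['date'])
data['month'] = data['date'].map(lambda d: datetime.date(d.year, d.month, 1))
print data.groupby('month').size().head()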