In [1]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
import datetime
import csv
import math
import time
from ProgressBar import ProgressBar
import cPickle
import scipy
import nltk
import string
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
We will load all of our data files from the API calls and put them into a single dataframe. This covers 1990 through 2016.
In [2]:
all_data_list = []
for year in range(1990,2017):
    data = pd.read_csv('{}_Output.csv'.format(year), header=None)
    all_data_list.append(data) # list of dataframes
data = pd.concat(all_data_list, axis=0)
data.columns = ['id','date','headline', 'lead']
data.head()
Out[2]:
By inspection, we see that the data is remarkably clean. There are almost no changes that need to be made, except for removing a few articles that do not have lead paragraphs. This is only ~500 articles out of ~500,000.
In [3]:
data.shape
Out[3]:
In [4]:
data.dropna(inplace=True)
data.reset_index(drop=True, inplace=True) # re-index 0..N-1 so row labels line up with row positions in the term-document matrix built later
In [5]:
data.shape
Out[5]:
We add a column that uniquely identifies just the month and year of each article. This is useful because we will be aggregating by month several times later on.
In [6]:
data['yearmonth'] = map(lambda x: x[:7], data.date) # keep just the 'YYYY-MM' prefix of each date
The lead paragraphs are then tokenized and stemmed. Tokenizing produces an ordered list of words from each lead paragraph and removes punctuation. Stemming removes suffixes like the plural “s” or the past-tense “ed,” which we hypothesize carry no semantic importance. Lastly, we add the stemmed lead paragraphs to the dataframe.
In [7]:
def getStems(lead):
    stemmer = nltk.stem.SnowballStemmer("english")
    tokens = nltk.word_tokenize(''.join(ch for ch in lead if ch not in set(string.punctuation)))
    return map(lambda x: stemmer.stem(x.decode('utf-8')), tokens)
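As a quick check on a made-up sentence (not from the corpus), we can see the stemmer collapsing plural and inflected forms onto a common root:
In [ ]:
# Hypothetical example; the exact stems depend on the Snowball rules.
print getStems('Stocks rallied as markets opened.')
# expect something like: [u'stock', u'ralli', u'as', u'market', u'open']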
In [8]:
def joinstems(stemlist):
    return ' '.join(stem for stem in stemlist)
In [9]:
stems = map(getStems, data.lead)
In [10]:
data['leadstems'] = map(joinstems, stems)
In [11]:
data.head()
Out[11]:
Unigrams are single words without context. Using CountVectorizer, it is straightforward to extract the unigrams that occur in at least 0.5% of articles and fewer than 80% of articles.
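To make the effect of those document-frequency cutoffs concrete, here is a toy example (a made-up four-document corpus with looser thresholds, not our data): terms that appear in nearly every document or in almost none get dropped from the vocabulary.
In [ ]:
# 'the' appears in all four documents (df = 1.0 > max_df) and the rare words in
# only one each (df = 0.25 < min_df), so only the mid-frequency terms survive.
toyVec = CountVectorizer(min_df=0.5, max_df=0.9)
toyVec.fit(['the market fell', 'the market rose', 'the bank fell', 'the zebra escaped'])
print toyVec.get_feature_names()  # roughly: [u'fell', u'market']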
Beyond unigrams, we also speculate that co-occurrence of words is relevant in business articles. For example, “unemployment fell” contains two words that are independently negative but together positive. Therefore, we also want to extract bigrams: pairs of words that appear consecutively. However, there are far more bigrams than unigrams, since the number of unigrams is limited to the size of the dictionary while the number of bigrams is limited to that number squared. To reduce the dimension, we only keep bigrams whose two words are much more likely to occur together than separately, as measured by the Dunning Log-Likelihood Ratio (DLLR).
The essence of the DLLR statistic is that it measures how much more often two words occur consecutively than would be expected by chance. As expected, this statistic takes as inputs the number of times the constituent words occur together and separately.
The first step is to extract all of the bigram stems from the entire dataset using the appropriate function from nltk. Then, we put all bigrams, regardless of article, into a single list and count the number of times that each unique bigram occurs.
In [12]:
def getBigrams(stems):
    stems = stems.split()
    finder = nltk.collocations.BigramCollocationFinder.from_words(stems)
    return finder.ngram_fd.items()
In [15]:
bigramList = map(getBigrams, data.leadstems)
In [16]:
flatBigrams = [item for sublist in bigramList for item in sublist]
In [17]:
def reduceByKey(toIterate): # For us, the bigram is the key
    d = defaultdict(int)
    for key, value in toIterate:
        d[key] += value
    return list(d.items())
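A toy call (made-up counts, not our data) shows the counts for a repeated bigram key being summed:
In [ ]:
print reduceByKey([(('new', 'york'), 2), (('wall', 'street'), 1), (('new', 'york'), 3)])
# the two ('new', 'york') entries collapse to a single count of 5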
In [18]:
summedBigrams = reduceByKey(flatBigrams)
It's useful to be able to save and load this result since it takes a few minutes to run.
In [212]:
with open("summedBigrams.pkl", "wb") as output_file:
pickle.dump(summedBigrams, output_file)
In [213]:
with open("summedBigrams.pkl", "rb") as input_file:
summedBigrams2 = pickle.load(input_file)
It is not important to go into the details of the calculation, but we use a few nested functions to keep all of the logs from getting messy. In the literature, the notation used is $$k_{11}=N(A,B)$$ $$k_{12}=N(A,\,\text{not } B)$$ $$k_{21}=N(\text{not } A,\,B)$$ $$k_{22}=N(\text{not } A,\,\text{not } B)$$ where $N(A,B)$ is the number of times the bigram "A B" occurs.
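For reference, the quantity computed by the nested functions below is algebraically equivalent to the standard G-statistic form of the ratio, written in terms of the cell counts and the total count $N$: $$\mathrm{DLLR}=2\sum_{i,j}k_{ij}\log\frac{N\,k_{ij}}{(k_{i1}+k_{i2})(k_{1j}+k_{2j})},\qquad N=k_{11}+k_{12}+k_{21}+k_{22},$$ with the convention that a term with $k_{ij}=0$ contributes zero, which is exactly what the xLogX helper in the next cell enforces.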
In [14]:
def xLogX(x):
    if x == 0:
        return 0
    else:
        return x * np.log(x)

def entropy(things):
    theSum = 0
    theResult = 0
    for thing in things:
        theResult += xLogX(thing)
        theSum += thing
    return xLogX(theSum) - theResult

def logLikelihoodRatio(k11, k12, k21, k22):
    rowEntropy = entropy([k11 + k12, k21 + k22])
    columnEntropy = entropy([k11 + k21, k12 + k22])
    matrixEntropy = entropy([k11, k12, k21, k22])
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy)

def getLLR(summedBigram):
    # summedBigram is ((stem1, stem2), count)
    k11 = summedBigram[1]
    stem1 = summedBigram[0][0]
    stem2 = summedBigram[0][1]
    k12 = bigrams1Dict[stem1] - k11
    k21 = bigrams2Dict[stem2] - k11
    k22 = totalBigrams - k11 - k12 - k21
    return logLikelihoodRatio(k11, k12, k21, k22)
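As a quick sanity check on made-up contingency counts (not from the NYT data), a pair that co-occurs far more often than its marginal counts predict gets a large ratio, while a pair that never co-occurs despite similar marginals gets a ratio near zero:
In [ ]:
# Hypothetical counts out of one million total bigrams.
print logLikelihoodRatio(100, 20, 10, 999870)   # tightly associated pair: large DLLR
print logLikelihoodRatio(0, 120, 110, 999770)   # never co-occur: DLLR essentially zero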
It is also useful to have a dictionary of the number of times each word appears in any bigram, so that we can find $N(A,\,\text{not } B)$ and like terms.
In [19]:
bigrams1 = map(lambda x: (x[0][0], x[1]), summedBigrams)
summedBigrams1 = reduceByKey(bigrams1)
In [20]:
bigrams2 = map(lambda x: (x[0][1], x[1]), summedBigrams)
summedBigrams2 = reduceByKey(bigrams2)
In [21]:
bigrams1Dict = dict(summedBigrams1)
bigrams2Dict = dict(summedBigrams2)
In [22]:
onlyValues = np.array(map(lambda x: x[1], summedBigrams))
totalBigrams = onlyValues.sum()
Finally, we can get the DLLR statistic for each bigram.
In [23]:
LLRs = map(getLLR, summedBigrams)
In [26]:
plt.hist(LLRs, bins=200000);
plt.xlim((0,100))
plt.xlabel('Dunning Log-likelihood Ratio')
plt.ylabel('Counts')
plt.title('Distribution of Bigram Log-Likelihood Ratios')
Out[26]:
In [28]:
print 'Most likely bigram: {}'.format(np.array(summedBigrams)[np.argmax(np.array(LLRs))][0])
Just as a sanity check, it is good to note that "New York" has the highest DLLR according to our metric. This makes sense, as "New" and "York" are far more likely to occur together than separately, especially in the New York Times.
Lastly, since we have done a lot of computation, we save the results.
In [29]:
justBigrams = map(lambda x: x[0], summedBigrams)
bigramLLR = zip(justBigrams,LLRs)
In [235]:
with open("bigramLLR.pkl", "wb") as output_file:
cPickle.dump(bigramLLR, output_file)
In [65]:
with open("bigramLLR.pkl", "rb") as input_file:
bigramLLR = cPickle.load(input_file)
Choosing the minimum DLLR threshold for inclusion in our bigram vocabulary is a bit arbitrary, but we can set it such that we have a reasonable number of bigrams in our final vocab. With some testing, 1500 is a good threshold.
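One way to do that testing is to scan a few candidate cutoffs (values chosen arbitrarily here) and see how large the resulting bigram vocabulary would be:
In [ ]:
llrArray = np.array(LLRs)
for threshold in [500, 1000, 1500, 2000]:
    print 'DLLR > {}: {} bigrams'.format(threshold, (llrArray > threshold).sum())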
In [30]:
bigramVocab = np.array(summedBigrams)[list(np.where(np.array(LLRs)>1500)[0])]
In [31]:
len(bigramVocab)
Out[31]:
Much like with unigrams, it makes sense to eliminate bigrams that contain stopwords as those are not likely to have any meaning in our future analysis.
In [32]:
stopwords = ['or','the','a','an','and','as','at','it','this','that','he','i','she','of','is','was','are','be','did','do','has','had','have','with','for','from', 'to', 'on', 'in', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday', 'yesterday', 'tomorrow','year']
In [33]:
bigramVocab = [bigram for bigram in bigramVocab if (set(bigram[0]).isdisjoint(stopwords))]
Lastly, we will stitch together just the stems from the bigram vocabulary list so that they are exactly how they will appear in the stemmed lead paragraphs.
In [34]:
bigramVocab = map(lambda x: x[0][0]+' '+x[0][1], list(bigramVocab))
We seek to extract the number of times each unigram and bigram of interest appears in each article. We already have a bigram vocabulary, but we need to create a unigram vocabulary by doing a first pass with the count vectorizer only looking for unigrams.
In [35]:
countVec = CountVectorizer(stop_words='english', max_df=0.8, min_df=.005, strip_accents='unicode')
wordMatrix = countVec.fit_transform(data.leadstems)
unigramVocab = countVec.get_feature_names()
In [36]:
totalVocab = unigramVocab + bigramVocab
Then, we run a CountVectorizer that takes the particular vocabulary we are looking for as an argument.
In [37]:
countVecAll = CountVectorizer(ngram_range = (1,2), vocabulary = totalVocab)
wordMatrix = countVecAll.fit_transform(data.leadstems)
We also made a couple of helper functions to save and load the sparse arrays that come out of the count vectorizer.
In [38]:
def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return scipy.sparse.csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                                   shape=loader['shape'])
In [149]:
save_sparse_csr('bigramWordMatrix.npz',wordMatrix)
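The load helper is not exercised above, but restoring the matrix in a later session (assuming the file written above is in the working directory) is just:
In [ ]:
wordMatrix = load_sparse_csr('bigramWordMatrix.npz')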
As the final step, we make monthly aggregates of the unigram and bigram counts output by the count vectorizer. We may still want to use the granular article-level data, but the monthly aggregates will turn out to be very useful.
In [39]:
grouped = data.groupby('yearmonth')
In [40]:
wordsByMonth = np.zeros((len(grouped.groups.keys()), wordMatrix.shape[1]))
for i, month in enumerate(np.sort(grouped.groups.keys())):
    wordsByMonth[i] = np.sum(wordMatrix[grouped.get_group(month).index], axis=0)
As an example of the monthly counts by word, we can see how often "America" appears.
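The column index 103 used below was found by looking the stem up in the combined vocabulary; assuming the stem "america" survived the document-frequency cutoffs, something like this recovers it:
In [ ]:
# Hypothetical lookup; the index depends on the fitted vocabulary.
print totalVocab.index('america')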
In [98]:
fig = plt.figure(figsize=(11,5))
ax = fig.add_subplot(111)
ax.plot(range(len(grouped)),wordsByMonth[:,103])
ax.set_title('Occurrences of "{}"'.format(totalVocab[103]))
ax.set_ylabel('Occurrences')
ax.set_xlabel('Time')
ax.set_xlim((0,len(grouped)))
ax.set_xticks(np.arange(0, len(grouped), 48))  # one tick every four years (48 months)
ax.set_xticklabels([m[:4] for m in np.sort(grouped.groups.keys())[::48]]);