In [15]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
import datetime
import csv
import math
import time
from ProgressBar import ProgressBar
import scipy.sparse
import pickle
import cPickle


import nltk
import string
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import Lasso

from sklearn.externals import joblib

Latent Dirichlet Allocation

Theory

The raw output of the count vectorizer is too high-dimensional (~10,000 features) to be particularly useful, especially because we find that no individual n-gram has a high signal-to-noise ratio. Therefore, we would like to perform some sort of dimensionality reduction. One approach is to compute the average sentiment of all words using the SentiWordNet dictionary; it is easy to argue that the sentiment of words in the business section reflects feelings about the economy. Another approach is topic modeling with Latent Dirichlet Allocation (LDA). While topics are not obviously correlated with the CCI, LDA extracts interpretable information from the documents. The fact that the topics generated by LDA are likely not very correlated with SentiWordNet scores makes it even more compelling to use both means of dimensionality reduction.
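As a rough illustration of the sentiment approach, the per-document score could be computed by averaging SentiWordNet scores over a document's tokens. The sketch below takes the first synset for each token and uses an avg_sentiment helper; both are illustrative choices, not the exact pipeline used here.

from nltk.corpus import sentiwordnet as swn   # requires nltk.download('sentiwordnet') and 'wordnet'

def avg_sentiment(tokens):
    # Average (positive - negative) SentiWordNet score over tokens that have a synset.
    scores = []
    for tok in tokens:
        synsets = list(swn.senti_synsets(tok))
        if synsets:
            # Naively use the first (most common) sense of the word.
            scores.append(synsets[0].pos_score() - synsets[0].neg_score())
    return np.mean(scores) if scores else 0.0

# e.g. avg_sentiment(['growth', 'profit', 'loss']) gives a crude document-level score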

For LDA, we posit that each article is generated by randomly drawing words from a mixture of topics, where each topic is defined by a distribution over the unigrams and bigrams it contains. After random initialization, the algorithm iterates over the words and reassigns each word to a topic based on how often it occurs in articles dominated by that topic. Over the iterations the allocation becomes more self-consistent: words move to topics where they are more common, and each article becomes concentrated in a few topics. Unlike hard clustering algorithms, documents are assigned to a mixture of topics; since the same words tend to occur across many articles, the articles are not fully separable.
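The generative story can be made concrete with a toy example. The two topics, the four-stem vocabulary, and the probabilities below are made up purely for illustration.

rng = np.random.RandomState(0)
vocab = ['earn', 'share', 'euro', 'reserv']
topic_word = np.array([[0.60, 0.30, 0.05, 0.05],    # topic 0: earnings-flavored word distribution
                       [0.05, 0.05, 0.50, 0.40]])   # topic 1: monetary-policy-flavored
doc_topics = rng.dirichlet([0.5, 0.5])              # this document's mixture over the two topics
# Each token is generated by first drawing a topic, then drawing a word from that topic.
doc = [vocab[rng.choice(4, p=topic_word[rng.choice(2, p=doc_topics)])] for _ in range(10)]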

Implementation

Implementing LDA from scratch is fairly involved because of the heavy iteration and sampling from distributions, but it is already nicely implemented in sklearn. Still, it is not immediately clear how many topics to use. To get an idea, we can inspect the topics that are generated; however, topics that we recognize may not be the best for predicting the CCI, so we also use cross-validation. We found that 8 topics performs best and remains quite interpretable.
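The selection itself is not shown in this notebook, but the loop amounts to refitting LDA for each candidate number of topics and scoring how well the resulting monthly topic averages predict the CCI. The sketch below assumes a Lasso downstream model and a cci target series; these are placeholders consistent with the imports above, not the exact code used.

from sklearn.model_selection import cross_val_score

def monthly_topic_scores(k, word_matrix, grouped):
    # Fit LDA with k topics and average the document-topic scores within each month.
    lda = LatentDirichletAllocation(n_topics=k)   # n_components in newer sklearn
    doc_scores = lda.fit_transform(word_matrix)
    return np.vstack([doc_scores[grouped.get_group(m).index].mean(axis=0)
                      for m in sorted(grouped.groups.keys())])

# Using the word matrix and monthly grouping built later in this notebook:
# for k in [4, 6, 8, 10, 12]:
#     X = monthly_topic_scores(k, wordMatrixBigrams, grouped)
#     print(k, cross_val_score(Lasso(alpha=0.1), X, cci).mean())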


In [3]:
def save_sparse_csr(filename, array):
    # Store a CSR matrix as a .npz archive of its raw data/indices/indptr arrays.
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # Rebuild the CSR matrix from the arrays written by save_sparse_csr.
    loader = np.load(filename)
    return scipy.sparse.csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                                   shape=loader['shape'])
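For example, a matrix written with save_sparse_csr round-trips through load_sparse_csr (the file name here is just an example):

m = scipy.sparse.csr_matrix(np.eye(3))
save_sparse_csr('example.npz', m)
assert (m != load_sparse_csr('example.npz')).nnz == 0   # identical values and sparsity pattern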

In [110]:
wordMatrixBigrams = load_sparse_csr('bigramWordMatrix4.npz')

In [109]:
num_topics = 8
ldaBigrams = LatentDirichletAllocation(n_topics=num_topics)  # n_topics was renamed n_components in later sklearn releases

In [114]:
ldaDocsBigrams = ldaBigrams.fit_transform(wordMatrixBigrams)

In [115]:
np.save('ldaDocBigramScores4_8', ldaDocsBigrams)

Inspection

A nice way to check whether the generated topics make sense is to look at the top 15 words in each topic. LDA exposes the word weights of each component, so we can simply argsort each component and match the indices against the vocabulary used to generate the word matrix.


In [116]:
with open("totalVocab4.pkl", "rb") as input_file:
    totalVocab = pickle.load(input_file)

In [117]:
num_top_words = 15
topic_words = []

for topic in ldaBigrams.components_:
    word_idx = np.argsort(topic)[::-1][0:num_top_words]
    topic_words.append([totalVocab[i] for i in word_idx])

In [121]:
topicDF = pd.DataFrame(topic_words)
topicDF.index = ['Topic {}'.format(i) for i in range(1,(num_topics+1))]
topicDF.columns = ['Stem {}'.format(i) for i in range(1,(num_top_words+1))]
topicDF


Out[121]:
Stem 1 Stem 2 Stem 3 Stem 4 Stem 5 Stem 6 Stem 7 Stem 8 Stem 9 Stem 10 Stem 11 Stem 12 Stem 13 Stem 14 Stem 15
Topic 1 quarter million earn share report compani loss cent sale revenu net year incom result fiscal
Topic 2 global tax oil corp energi polici follow annual gas bankruptci judg 20 million file 11
Topic 3 percent price rate stock year bank month rose market profit said fell economi increas central
Topic 4 said compani billion bank execut plan chief busi financi deal year corpor mr new chief execut
Topic 5 million new york new york firm agre fund group stock offer exchang secur unit insur yesterday
Topic 6 share onlin health compani care lead 18 canadian pharmaceut 1989 medic net inc share earn 31 health care
Topic 7 state unit market said unit state expect say year china growth thursday street debt govern sale
Topic 8 european feder euro mani trade court reserv minist rule time money feder reserv new pound union

As we can see, the topics seem to make sense. For example, Topic 6 clearly concerns "healthcare", "pharmaceuticals", and "medicine", and it is interesting that "Canadian" also comes up often in those articles. This is not surprising, as Canada's system of socialized medicine is often referenced in the US. Topic 8 deals more with monetary policy, as it includes "Euro", "Pound", "money", and "Federal Reserve". It is particularly interesting that "bank" does not appear in Topic 8: while "Federal", "Reserve", and "Bank" presumably appear together frequently in that order, "bank" on its own occurs throughout the business section, not just in articles about monetary policy. This illustrates that LDA does not simply group co-occurring words; it also weights against words that occur frequently across all documents.

Finally, we can group by month and save the topic scores.


In [43]:
data = pd.read_csv('allDataWithStems.csv', index_col=0)

In [89]:
grouped = data.groupby('yearmonth')

In [119]:
# Average the per-document topic scores within each calendar month, in chronological order.
topicsByMonthBigrams = np.zeros((len(grouped.groups.keys()), ldaDocsBigrams.shape[1]))
for i, month in enumerate(sorted(grouped.groups.keys())):
    topicsByMonthBigrams[i] = np.mean(ldaDocsBigrams[grouped.get_group(month).index], axis=0)

In [120]:
np.save('topicsByMonthBigrams4_8', topicsByMonthBigrams)