In [15]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
import datetime
import csv
import math
import time
from ProgressBar import ProgressBar
import scipy.sparse
import pickle
import cPickle


import nltk
import string
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import Lasso

from sklearn.externals import joblib

Latent Dirichlet Allocation

Theory

The raw output of the count vectorizer is too high-dimensional (~10,000 features) to be particularly useful, especially because we find that no individual n-gram has a high signal-to-noise ratio. Therefore, we would like to perform some sort of dimensionality reduction. One approach is to compute the average sentiment of all words using the SentiWordNet dictionary; it is easy to argue that the sentiment of words in the business section reflects feelings about the economy. Another approach is topic modeling with Latent Dirichlet Allocation (LDA). While topics are not obviously correlated with the CCI, LDA extracts interpretable information from the documents. The fact that the topics generated by LDA are likely not very correlated with SentiWordNet scores makes it even more compelling to use both means of dimensionality reduction.
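As a rough illustration of the sentiment approach, the per-document score could be computed by averaging SentiWordNet scores over a document's tokens. The sketch below takes the first synset for each token and uses an avg_sentiment helper; both are illustrative choices, not the exact pipeline used here.

from nltk.corpus import sentiwordnet as swn   # requires nltk.download('sentiwordnet') and 'wordnet'

def avg_sentiment(tokens):
    # Average (positive - negative) SentiWordNet score over tokens that have a synset.
    scores = []
    for tok in tokens:
        synsets = list(swn.senti_synsets(tok))
        if synsets:
            # Naively use the first (most common) sense of the word.
            scores.append(synsets[0].pos_score() - synsets[0].neg_score())
    return np.mean(scores) if scores else 0.0

# e.g. avg_sentiment(['growth', 'profit', 'loss']) gives a crude document-level score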

For LDA, we posit that each article is generated by randomly drawing words from a mixture of topics, where each topic is defined by a distribution over the unigrams and bigrams it contains. After random initialization, the algorithm iterates over the words and reassigns each word to a topic based on how often it occurs in articles dominated by that topic. Over the iterations the allocation becomes more self-consistent: words move to topics where they are more common, and each article becomes concentrated in a few topics. Unlike hard clustering algorithms, documents are assigned to a mixture of topics; since the same words tend to occur across many articles, the articles are not fully separable.
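The generative story can be made concrete with a toy example. The two topics, the four-stem vocabulary, and the probabilities below are made up purely for illustration.

rng = np.random.RandomState(0)
vocab = ['earn', 'share', 'euro', 'reserv']
topic_word = np.array([[0.60, 0.30, 0.05, 0.05],    # topic 0: earnings-flavored word distribution
                       [0.05, 0.05, 0.50, 0.40]])   # topic 1: monetary-policy-flavored
doc_topics = rng.dirichlet([0.5, 0.5])              # this document's mixture over the two topics
# Each token is generated by first drawing a topic, then drawing a word from that topic.
doc = [vocab[rng.choice(4, p=topic_word[rng.choice(2, p=doc_topics)])] for _ in range(10)]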

Implementation

Implementing LDA from scratch is fairly involved because of the heavy iteration and sampling from distributions, but it is already nicely implemented in sklearn. Still, it is not immediately clear how many topics to use. To get an idea, we can inspect the topics that are generated; however, topics that we recognize may not be the best for predicting the CCI, so we also use cross-validation. We found that 8 topics performs best and remains quite interpretable.
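The selection itself is not shown in this notebook, but the loop amounts to refitting LDA for each candidate number of topics and scoring how well the resulting monthly topic averages predict the CCI. The sketch below assumes a Lasso downstream model and a cci target series; these are placeholders consistent with the imports above, not the exact code used.

from sklearn.model_selection import cross_val_score

def monthly_topic_scores(k, word_matrix, grouped):
    # Fit LDA with k topics and average the document-topic scores within each month.
    lda = LatentDirichletAllocation(n_topics=k)   # n_components in newer sklearn
    doc_scores = lda.fit_transform(word_matrix)
    return np.vstack([doc_scores[grouped.get_group(m).index].mean(axis=0)
                      for m in sorted(grouped.groups.keys())])

# Using the word matrix and monthly grouping built later in this notebook:
# for k in [4, 6, 8, 10, 12]:
#     X = monthly_topic_scores(k, wordMatrixBigrams, grouped)
#     print(k, cross_val_score(Lasso(alpha=0.1), X, cci).mean())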


In [3]:
def save_sparse_csr(filename, array):
    # Store a CSR matrix as a .npz archive of its raw data/indices/indptr arrays.
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # Rebuild the CSR matrix from the arrays written by save_sparse_csr.
    loader = np.load(filename)
    return scipy.sparse.csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                                   shape=loader['shape'])
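For example, a matrix written with save_sparse_csr round-trips through load_sparse_csr (the file name here is just an example):

m = scipy.sparse.csr_matrix(np.eye(3))
save_sparse_csr('example.npz', m)
assert (m != load_sparse_csr('example.npz')).nnz == 0   # identical values and sparsity pattern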

In [110]:
wordMatrixBigrams = load_sparse_csr('bigramWordMatrix4.npz')

In [109]:
num_topics = 8
ldaBigrams = LatentDirichletAllocation(n_topics=num_topics)  # n_topics was renamed n_components in later sklearn releases

In [114]:
ldaDocsBigrams = ldaBigrams.fit_transform(wordMatrixBigrams)

In [115]:
np.save('ldaDocBigramScores4_8', ldaDocsBigrams)

Inspection

A nice way to check whether the generated topics make sense is to look at the top 15 words in each topic. LDA exposes the word weights of each component, so we can simply argsort each component and match the indices against the vocabulary used to generate the word matrix.


In [116]:
with open("totalVocab4.pkl", "rb") as input_file:
    totalVocab = pickle.load(input_file)

In [117]:
num_top_words = 15
topic_words = []

for topic in ldaBigrams.components_:
    word_idx = np.argsort(topic)[::-1][0:num_top_words]
    topic_words.append([totalVocab[i] for i in word_idx])

In [121]:
topicDF = pd.DataFrame(topic_words)
topicDF.index = ['Topic {}'.format(i) for i in range(1,(num_topics+1))]
topicDF.columns = ['Stem {}'.format(i) for i in range(1,(num_top_words+1))]
topicDF


Out[121]:
Stem 1 Stem 2 Stem 3 Stem 4 Stem 5 Stem 6 Stem 7 Stem 8 Stem 9 Stem 10 Stem 11 Stem 12 Stem 13 Stem 14 Stem 15
Topic 1 quarter million earn share report compani loss cent sale revenu net year incom result fiscal
Topic 2 global tax oil corp energi polici follow annual gas bankruptci judg 20 million file 11
Topic 3 percent price rate stock year bank month rose market profit said fell economi increas central
Topic 4 said compani billion bank execut plan chief busi financi deal year corpor mr new chief execut
Topic 5 million new york new york firm agre fund group stock offer exchang secur unit insur yesterday
Topic 6 share onlin health compani care lead 18 canadian pharmaceut 1989 medic net inc share earn 31 health care
Topic 7 state unit market said unit state expect say year china growth thursday street debt govern sale
Topic 8 european feder euro mani trade court reserv minist rule time money feder reserv new pound union

As we can see, the topics seem to make sense. For example, Topic 6 clearly concerns "healthcare", "pharmaceuticals", and "medicine", and it is interesting that "Canadian" also comes up often in those articles. This is not surprising, as Canada's system of socialized medicine is often referenced in the US. Topic 8 deals more with monetary policy, as it includes "Euro", "Pound", "money", and "Federal Reserve". It is particularly interesting that "bank" does not appear in Topic 8: while "Federal", "Reserve", and "Bank" presumably appear together frequently in that order, "bank" on its own occurs throughout the business section, not just in articles about monetary policy. This illustrates that LDA does not simply group co-occurring words; it also weights against words that occur frequently across all documents.

Finally, we can group by month and save the topic scores.


In [43]:
data = pd.read_csv('allDataWithStems.csv', index_col=0)

In [89]:
grouped = data.groupby('yearmonth')

In [119]:
# Average the per-document topic scores within each calendar month, in chronological order.
topicsByMonthBigrams = np.zeros((len(grouped.groups.keys()), ldaDocsBigrams.shape[1]))
for i, month in enumerate(sorted(grouped.groups.keys())):
    topicsByMonthBigrams[i] = np.mean(ldaDocsBigrams[grouped.get_group(month).index], axis=0)

In [120]:
np.save('topicsByMonthBigrams4_8', topicsByMonthBigrams)