Latent Dirichlet Allocation (LDA)

LDA is a well-known probabilistic model for discovering mixtures of topics in a corpus in an unsupervised way. It has been applied to a large number of problems (Blei, 2012), and the original paper (Blei et al., 2003) has over 10,000 citations.

NOTE: use a Python3 kernel to run this notebook

Model formalism

In statistical terms, a generative model is a model for randomly generating observable data: it specifies a joint distribution over the observed and latent variables. For LDA this joint distribution can be drawn as a graphical model (the plate diagram below). There is also a generative story (see below) that is often more intuitive than the plate diagram.
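
Written out with the notation introduced below, the joint distribution of the smoothed LDA model factorizes as

$$p(\theta, \phi, z, w \mid \alpha, \beta) = \prod_{k=1}^{K} p(\phi_{k} \mid \beta) \prod_{m=1}^{M} p(\theta_{m} \mid \alpha) \prod_{n=1}^{N} p(z_{m,n} \mid \theta_{m})\, p(w_{m,n} \mid \phi_{z_{m,n}})$$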

Picturing a generative story

Recall that the Dirichlet Process (DP) (Ferguson, 1973) is essentially a distribution over distributions: each draw from a DP is itself a distribution, and, importantly for clustering applications, it serves as a natural prior that lets the number of clusters grow as the data grows. In (smoothed) LDA the Dirichlet priors play a similar role: $\alpha$ and $\beta$ are the concentration parameters of the Dirichlet priors on the per-document topic distributions and the per-topic word distributions, respectively. This is visualized below in a plate diagram.

  • $\alpha$ is the parameter of the Dirichlet prior on the per-document topic distributions
  • $\beta$ is the parameter of the Dirichlet prior on the per-topic word distributions
  • $\theta_{m}$ is the topic distribution for document $m$
  • $\phi_{k}$ is the word distribution for topic $k$
  • $z_{m,n}$ is the topic for the $n$th word in document $m$
  • $w_{m,n}$ is the observed word at position $n$ in document $m$

The generative process

  1. Choose $\theta_{m} \sim \textrm{Dir}(\alpha)$, where $m \in \{1,...M\}$ and $\textrm{Dir}(\alpha)$ is the Dirichlet distribution with parameter $\alpha$
  2. Choose $\phi_{k} \sim \textrm{Dir}(\beta)$, where $k \in \{1,...K\}$
  3. For each of the word positions ($m$,$n$), where $n \in \{1,...N\}$, and $m \in \{1,...M\}$

    • Choose a topic $z_{m,n} \sim \textrm{Multinomial}(\theta_{m})$
    • Choose a word $w_{m,n} \sim \textrm{Multinomial}(\phi_{z_{m,n}})$

$\phi$ is a $K \times V$ row-stochastic (Markov) matrix, where $V$ is the vocabulary size; each row is the word distribution of a topic.
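
To make the generative story concrete, below is a minimal simulation of it with NumPy. The corpus size, vocabulary size, and symmetric hyperparameter values are arbitrary illustrative choices, not values used elsewhere in this notebook.

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 8, 5, 10   # topics, vocabulary size, documents, words per document
alpha, beta = 0.5, 0.1     # symmetric Dirichlet hyperparameters (illustrative values)

# step 2: per-topic word distributions phi_k ~ Dir(beta), stacked into a K x V matrix
phi = rng.dirichlet(np.full(V, beta), size=K)

corpus = []
for m in range(M):
    # step 1: per-document topic distribution theta_m ~ Dir(alpha)
    theta_m = rng.dirichlet(np.full(K, alpha))
    doc = []
    for n in range(N):
        # step 3a: choose a topic z_{m,n} from theta_m
        z = rng.choice(K, p=theta_m)
        # step 3b: choose a word w_{m,n} from phi_{z_{m,n}}
        w = rng.choice(V, p=phi[z])
        doc.append(w)
    corpus.append(doc)

print(corpus[0])  # word indices of the first simulated document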

Smoothing

In the smoothed version of LDA, $\beta$ is not connected directly to $w_{m,n}$: because word counts in documents are sparse, placing a Dirichlet prior on the per-topic word distributions smooths the estimates and helps deal with the large number of words that would otherwise receive zero probability.
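
Concretely, a symmetric Dirichlet prior acts like a pseudo-count of $\beta$ added to every word in every topic, so the posterior mean estimate of a topic's word probabilities never assigns exactly zero probability:

$$\mathbb{E}[\phi_{k,v}] = \frac{n_{k,v} + \beta}{\sum_{v'=1}^{V}\left(n_{k,v'} + \beta\right)}$$

where $n_{k,v}$ is the number of times word $v$ is assigned to topic $k$.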

There are many variants of LDA today, and covering them all is beyond the scope of this introduction.


In [9]:
%matplotlib inline
from IPython.display import Image
Image(filename='lda_plate.png')


Out[9]: (plate diagram for the smoothed LDA model)

Model the NPR articles with Latent Dirichlet Allocation

  1. Run the LDA model with sklearn (http://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation)
  2. Visualize it with pyldavis (https://pyldavis.readthedocs.io/en/latest)

In [2]:
# USE Python3
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning, module='.*/IPython/.*')

import pyLDAvis
import pyLDAvis.sklearn

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

pyLDAvis.enable_notebook()

df = pd.read_csv('npr_articles.csv', parse_dates=['date_published'])
text = df['processed_text'].values.tolist()

Vectorize the words

Essentially, create a numeric representation of the documents based on word frequencies (a document-term count matrix)


In [3]:
max_features = 1000
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=max_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(text)
print("ready")


ready
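
If you want to sanity-check the vectorizer, the resulting document-term matrix and the learned vocabulary can be inspected as below. This is an optional, minimal sketch; the printed values depend on the corpus.

In [ ]:
# tf is a sparse documents x tokens matrix of raw term counts
print(tf.shape)
# the first few tokens kept by the vectorizer (alphabetical order)
print(tf_vectorizer.get_feature_names()[:10])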

Run LDA


In [4]:
n_topics = 20
# NOTE: in newer versions of scikit-learn the number of topics is passed as n_components
lda_model = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                      learning_method='online',
                                      learning_offset=50.,
                                      random_state=0)

lda_model.fit(tf)
pyLDAvis.sklearn.prepare(lda_model,tf, tf_vectorizer, R=20)


Out[4]:
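
Once the model is fit, the per-document topic mixtures can be obtained with the model's transform method. The sketch below shows how to find the dominant topic of the first article; it is an illustrative addition rather than part of the pyLDAvis workflow above.

In [ ]:
# rows are documents, columns are topics (normalized topic proportions)
doc_topic = lda_model.transform(tf)
print(doc_topic.shape)
# index and weight of the dominant topic for the first article
print(doc_topic[0].argmax(), doc_topic[0].max())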

In [6]:
def get_top_words(model, feature_names, n_top_words):
    """return a dict mapping each topic index to its n_top_words highest-weight tokens"""
    top_words = {}
    for topic_idx, topic in enumerate(model.components_):
        # argsort()[:-n_top_words - 1:-1] gives the indices of the largest weights, in descending order
        _top_words = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
        top_words[str(topic_idx)] = _top_words
    return top_words

In [7]:
## build the token-to-topic matrix (components_ is topics x tokens, so fill it column by column)
word_topic = np.zeros((max_features, n_topics))
print(n_topics)
for topic_idx, topic in enumerate(lda_model.components_):
    word_topic[:, topic_idx] = topic

print("token-topic matrix", word_topic.shape)

## collect the top words used to define each topic
n_top_words = 15
tf_feature_names = np.array(tf_vectorizer.get_feature_names())
top_words = get_top_words(lda_model, tf_feature_names, n_top_words)
all_top_words = np.array(list(set().union(*[v for v in top_words.values()])))

for key, vals in top_words.items():
    print(key, " ".join(vals))
print("total words: %s" % len(all_top_words))

## indices of the top words in the vectorizer vocabulary
top_word_inds = [np.where(tf_feature_names == tw)[0][0] for tw in all_top_words]


20
token-topic matrix (1000, 20)
0 vote power party political government say opposition voter win change movement election anti conservative result
1 say city water people work company home make use resident community house local building npr
2 sessions hearing attorney alabama general confirmation civil senator right sen record the_united_states justice sexual tuesday
3 rate park area report increase population say high accord rise number people california bear rural
4 producer happy this_week station favorite twitter follow new television local executive facebook movie review close
5 say science human scientist change new use world way space time make earth universe look
6 job think president say make people come tax company trump thing secretary want write new
7 say trump president obama election npr elect state campaign news make people time report donald_trump
8 say report people police npr attack tell officer city kill man force case death accord
9 christmas tree car holiday word drive answer npr letter challenge copyright listener tradition sign year
10 china say chinese company market price product ship trade consumer new make industry npr accord
11 new rule year administration say pay time feel obama npr make way world report issue
12 npr copyright now8216s 8212 talk robin_young speak join new news host discuss editor reporter hear
13 job tax worker refugee economy growth income americans economic wage work pay labor cut obama
14 trump president russia say intelligence russian elect business company think conflict deal administration secretary country
15 say make just know time think people come way life work story thing want feel
16 food say make eat israel cook sugar add small dish plant coal recipe farm restaurant
17 health care insurance patient repeal say plan people obamacare law doctor hospital coverage medicaid medical
18 say school student study state people use child help work high program make parent research
19 film day game tv series movie episode plane culture team small flight star copyright podcast
total words: 212

In [6]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn import preprocessing
import matplotlib.pyplot as plt

def make_scatter(fit,ax,pcX=0,pcY=1,font_size=10,font_name='sans serif',ms=20,leg=True,title=None):
    colors = ['k','cyan','r','orange','g','b','magenta']
    lines = []
    indices = np.arange(fit.shape[0])
    s = ax.scatter(fit[indices,pcX],fit[indices,pcY],s=ms,alpha=0.9)
    lines.append(s)

    for tick in ax.xaxis.get_major_ticks():
        tick.label.set_fontsize(font_size-2)
    for tick in ax.yaxis.get_major_ticks():
        tick.label.set_fontsize(font_size-2)

    buff = 0.02
    bufferX = buff * (fit[:,pcX].max() - fit[:,pcX].min())
    bufferY = buff * (fit[:,pcY].max() - fit[:,pcY].min())
    ax.set_xlim([fit[:,pcX].min()-bufferX,fit[:,pcX].max()+bufferX])
    ax.set_ylim([fit[:,pcY].min()-bufferY,fit[:,pcY].max()+bufferY])
    ax.set_xlabel("D-%s"%str(pcX+1),fontsize=font_size,fontname=font_name)
    ax.set_ylabel("D-%s"%str(pcY+1),fontsize=font_size,fontname=font_name)
    plt.locator_params(axis='x',nbins=5)

fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)

## each topic becomes a point described by its (standardized) weights over the tokens
mat = word_topic
matScaled = preprocessing.scale(mat.T)
## project the topics onto the first two principal components
pca_fit = PCA(n_components=2).fit_transform(matScaled)

make_scatter(pca_fit,ax)
plt.show()
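
Since only the 20 topic points are being embedded, PCA is usually sufficient, but the already-imported TSNE can be swapped in for a nonlinear projection. This is a minimal sketch; the perplexity value is an illustrative choice and must be smaller than the number of topics.

In [ ]:
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)

# t-SNE requires perplexity < number of samples (here the 20 topics)
tsne_fit = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(matScaled)
make_scatter(tsne_fit, ax)
plt.show()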


