Latent Dirichlet Allocation (LDA) is a well-known probabilistic model for handling mixtures of topics in an unsupervised way. It has been applied to a large number of problems (Blei, 2012). The original paper (Blei et al., 2003) has over 10,000 citations.
NOTE: use a Python3 kernel to run this notebook
In statistical terms a generative model is a model for randomly generating observable data. The model specifies a joint distribution over observed and latent variables. The joint distribution for the LDA model can be shown as a graphical model. There is also a generative story (see below) that can sometimes be more intuitive than the plate diagram.
Recall that the Dirichlet Process (DP) (Ferguson, 1973) is essentially a distribution over distributions: each draw from a DP is itself a distribution, and, importantly for clustering applications, it serves as a natural prior that lets the number of clusters grow with the data. The DP has a base distribution parameter $\beta$ and a strength, or concentration, parameter $\alpha$. This is visualized below in a plate diagram.
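To make the DP concrete, here is a minimal sketch (not part of the original notebook) of a single draw $G \sim \mathrm{DP}(\alpha, H)$ via a truncated stick-breaking construction; the standard normal base distribution and the truncation level are arbitrary choices for illustration.

In [ ]:
import numpy as np

def dp_stick_breaking(alpha, base_draw, n_atoms=100, seed=None):
    """Truncated stick-breaking approximation of one draw G ~ DP(alpha, H).

    alpha     : concentration parameter (larger alpha -> many small clusters)
    base_draw : callable returning one sample from the base distribution H
    n_atoms   : truncation level of the stick-breaking construction
    """
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=n_atoms)                 # stick-breaking proportions
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining                                # atom weights (sum to ~1 under truncation)
    atoms = np.array([base_draw() for _ in range(n_atoms)])    # atom locations drawn from H
    return atoms, weights

## one draw from DP(alpha=2, H=N(0,1)); the result is itself a discrete distribution
atoms, weights = dp_stick_breaking(2.0, base_draw=np.random.standard_normal, n_atoms=50)
print(weights[:5], weights.sum())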
For each of the word positions ($m$, $n$), where $m \in \{1,\dots,M\}$ and $n \in \{1,\dots,N\}$: draw a topic $z_{m,n} \sim \mathrm{Multinomial}(\theta_m)$ and then draw a word $w_{m,n} \sim \mathrm{Multinomial}(\phi_{z_{m,n}})$.
$\phi$ is a $K \times V$ row-stochastic (Markov) matrix, each row of which is the word distribution of a topic.
In the smoothed version of LDA, $\beta$ is not connected directly to $w_{n}$: because word counts in documents tend to be sparse, placing a Dirichlet prior on the topic-word distributions smooths the model and helps deal with the large number of zero probabilities.
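To make the generative story concrete, the following sketch (not part of the original notebook) simulates a tiny corpus from the LDA model with NumPy; the number of topics, vocabulary size, document count, document length, and hyperparameters are made-up values for illustration.

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 10, 5, 8        # topics, vocabulary size, documents, words per document
alpha, beta = 0.5, 0.1          # symmetric Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=K)     # K x V topic-word distributions
theta = rng.dirichlet(np.full(K, alpha), size=M)  # M x K document-topic distributions

corpus = []
for m in range(M):
    doc = []
    for n in range(N):
        z = rng.choice(K, p=theta[m])   # draw a topic for position (m, n)
        w = rng.choice(V, p=phi[z])     # draw a word from that topic's distribution
        doc.append(w)
    corpus.append(doc)
print(corpus[0])

Each document is simply a bag of word indices; fitting LDA amounts to inverting this process to recover $\theta$ and $\phi$ from the observed words.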
There are many variants of LDA today, and covering them all is beyond the scope of this introduction.
In [9]:
%matplotlib inline
from IPython.display import Image
Image(filename='lda_plate.png')
Out[9]:
In [2]:
# USE Python3
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning, module='.*/IPython/.*')
import pyLDAvis
import pyLDAvis.sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
pyLDAvis.enable_notebook()
df = pd.read_csv('npr_articles.csv', parse_dates=['date_published'])
text = df['processed_text'].values.tolist()
In [3]:
max_features = 1000
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=max_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(text)
print("ready")
In [4]:
n_topics = 20
lda_model = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                      learning_method='online',
                                      learning_offset=50.,
                                      random_state=0)
lda_model.fit(tf)
pyLDAvis.sklearn.prepare(lda_model,tf, tf_vectorizer, R=20)
Out[4]:
In [6]:
def get_top_words(model, feature_names, n_top_words):
    top_words = {}
    for topic_idx, topic in enumerate(model.components_):
        _top_words = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
        top_words[str(topic_idx)] = _top_words
    return top_words
In [7]:
## get the token to topic matrix
word_topic = np.zeros((max_features, n_topics))
print(n_topics)
for topic_idx, topic in enumerate(lda_model.components_):
    word_topic[:, topic_idx] = topic
print("token-topic matrix", word_topic.shape)

## create a matrix of the top words used to define each topic
n_top_words = 15
tf_feature_names = np.array(tf_vectorizer.get_feature_names())
top_words = get_top_words(lda_model, tf_feature_names, n_top_words)
all_top_words = np.array(list(set().union(*[v for v in top_words.values()])))
for key, vals in top_words.items():
    print(key, " ".join(vals))
print("total words: %s" % len(all_top_words))
top_word_inds = [np.where(tf_feature_names == tw)[0][0] for tw in all_top_words]
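Note that scikit-learn stores `components_` as unnormalized pseudo-counts, so the `word_topic` matrix above is not yet the row-stochastic $\phi$ described earlier. A short sketch of the row normalization, assuming the objects defined above:

In [ ]:
## row-normalize the topic-word pseudo-counts so each row sums to 1 (an estimate of phi)
phi_hat = lda_model.components_ / lda_model.components_.sum(axis=1)[:, np.newaxis]
print(phi_hat.shape, phi_hat.sum(axis=1)[:3])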
In [6]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn import preprocessing
import matplotlib.pyplot as plt
def make_scatter(fit, ax, pcX=0, pcY=1, font_size=10, font_name='sans serif', ms=20, leg=True, title=None):
    colors = ['k', 'cyan', 'r', 'orange', 'g', 'b', 'magenta']
    lines = []
    indices = np.arange(fit.shape[0])
    s = ax.scatter(fit[indices, pcX], fit[indices, pcY], s=ms, alpha=0.9)
    lines.append(s)
    for tick in ax.xaxis.get_major_ticks():
        tick.label.set_fontsize(font_size - 2)
    for tick in ax.yaxis.get_major_ticks():
        tick.label.set_fontsize(font_size - 2)
    buff = 0.02
    bufferX = buff * (fit[:, pcX].max() - fit[:, pcX].min())
    bufferY = buff * (fit[:, pcY].max() - fit[:, pcY].min())
    ax.set_xlim([fit[:, pcX].min() - bufferX, fit[:, pcX].max() + bufferX])
    ax.set_ylim([fit[:, pcY].min() - bufferY, fit[:, pcY].max() + bufferY])
    ax.set_xlabel("D-%s" % str(pcX + 1), fontsize=font_size, fontname=font_name)
    ax.set_ylabel("D-%s" % str(pcY + 1), fontsize=font_size, fontname=font_name)
    if title:
        ax.set_title(title, fontsize=font_size, fontname=font_name)
    plt.locator_params(axis='x', nbins=5)
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
mat = word_topic
matScaled = preprocessing.scale(mat.T)
pca_fit = PCA(n_components=2).fit_transform(matScaled)
make_scatter(pca_fit,ax)
plt.show()
In [ ]: