The simplest topic model, and the basis of many others, is latent Dirichlet allocation (LDA).

- LDA is a generative model that infers unobserved meanings from a large set of observations.

- Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003; 3: 993–1022.
- Blei DM, Lafferty JD. Correction: a correlated topic model of science. Ann Appl Stat. 2007; 1: 634.
- Blei DM. Probabilistic topic models. Commun ACM. 2012; 55: 55–65.
- Chandra Y, Jiang LC, Wang C-J (2016) Mining Social Entrepreneurship Strategies Using Topic Modeling. PLoS ONE 11(3): e0151342. doi:10.1371/journal.pone.0151342

- Topic models assume that each document contains a mixture of topics
- Topics are considered latent/unobserved variables that stand between the documents and terms

It is impossible to directly assess the relationships between topics and documents and between topics and terms.

- What can be directly observed is the distribution of terms over documents, which is known as the document-term matrix (DTM).
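For a small corpus, the DTM can be built directly with numpy. The toy documents and vocabulary below are made up purely for illustration:

```python
import numpy as np

# Toy corpus (made up for illustration)
docs = [['apple', 'banana', 'apple'],
        ['banana', 'cherry'],
        ['apple', 'cherry', 'cherry']]

# Build the vocabulary and the document-term matrix:
# dtm[d, v] counts how often term v occurs in document d
vocab = sorted({w for doc in docs for w in doc})
dtm = np.zeros((len(docs), len(vocab)), dtype=int)
for d, doc in enumerate(docs):
    for w in doc:
        dtm[d, vocab.index(w)] += 1

print(vocab)  # ['apple', 'banana', 'cherry']
print(dtm)
```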

Topic models algorithmically identify the set of latent variables (topics) that best explains the observed distribution of terms in the documents.

The DTM is further decomposed into two matrices:

- a term-topic matrix (TTM)
- a topic-document matrix (TDM)

Each document can be assigned to a primary topic that demonstrates the highest topic-document probability and can then be linked to other topics with declining probabilities.
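The decomposition can be sketched numerically: multiplying the term-topic matrix by the topic-document matrix recovers each document's expected term distribution, and the column-wise maximum of the topic-document matrix gives each document's primary topic. All numbers below are made up for illustration:

```python
import numpy as np

# Term-topic matrix (TTM): each column is one topic's distribution over 4 terms
ttm = np.array([[0.5, 0.0],
                [0.4, 0.1],
                [0.1, 0.3],
                [0.0, 0.6]])

# Topic-document matrix (TDM): each column is one document's topic proportions
tdm = np.array([[0.8, 0.2],
                [0.2, 0.8]])

# Expected term distribution per document: TTM @ TDM
doc_term = ttm @ tdm
print(doc_term.sum(axis=0))   # each column sums to 1

# The primary topic of each document is the one with the highest
# topic-document probability
print(tdm.argmax(axis=0))     # [0 1]
```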

Assume there are K topics across D documents; the topics are denoted $\phi_{1:K}$.

Each topic $\phi_k$ is a distribution over the fixed vocabulary of the given documents.

The topic proportions in document d are denoted $\theta_d$.

- e.g., the kth topic's proportion in document d is $\theta_{d, k}$.

Let $w_{d,n}$ denote the nth term in document d.

Further, topic models assign topics to a document and to its terms.

- For example, the topic assignments for document d are denoted as $z_d$,
- and the topic assigned to the nth term in document d is denoted as $z_{d,n}$.

According to Blei et al., the joint distribution of $\phi_{1:K}$, $\theta_{1:D}$, $z_{1:D}$, and $w_{1:D}$, which encodes the generative process for LDA, can be expressed as:

$ p(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\phi_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d) \times p(w_{d,n} \mid \phi_{1:K}, z_{d,n}) \right) $
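The generative process behind this joint distribution can be simulated directly with numpy. The sizes K, V, D, N and the Dirichlet hyperparameters below are arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, D, N = 3, 8, 2, 5  # topics, vocabulary size, documents, words per doc

# Draw each topic phi_k: a distribution over the V vocabulary terms
phi = rng.dirichlet(np.ones(V) * 0.1, size=K)
# Draw each document's topic proportions theta_d
theta = rng.dirichlet(np.ones(K) * 0.5, size=D)

docs = []
for d in range(D):
    z = rng.choice(K, size=N, p=theta[d])     # z_{d,n} ~ Mult(theta_d)
    w = [rng.choice(V, p=phi[t]) for t in z]  # w_{d,n} ~ Mult(phi_{z_{d,n}})
    docs.append(w)

print(docs)  # D lists of N word ids
```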

Note that $\phi_{1:K}$, $\theta_{1:D}$, and $z_{1:D}$ are latent, unobservable variables. Thus, the computational challenge of LDA is to compute their conditional distribution given the observed words in the documents, $w_{1:D}$.

Accordingly, the posterior distribution of LDA can be expressed as:

$ p(\phi_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \dfrac{p(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})} $

Because the number of possible topic structures is exponentially large, the denominator $p(w_{1:D})$, and hence the exact posterior, cannot be computed. Topic modeling research therefore aims to develop efficient algorithms that approximate the posterior of LDA.

There are two categories of approximation algorithms:

- sampling-based algorithms
- variational algorithms

Using the Gibbs sampling method, we can build a Markov chain over the sequence of random variables (see Eq 1). The chain is run until it approaches its limiting distribution, and samples drawn from it approximate the posterior.
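A minimal collapsed Gibbs sampler for LDA can be sketched as follows. This is the standard textbook variant (topic assignments are resampled one word at a time from their full conditional), not the exact algorithm of any particular library; documents are lists of integer word ids:

```python
import numpy as np

def lda_gibbs(docs, K, V, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over docs of integer word ids."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))  # topic counts per document
    nkw = np.zeros((K, V))          # word counts per topic
    nk = np.zeros(K)                # total words per topic
    z = []                          # z[d][n]: topic of the nth word of doc d
    for d, doc in enumerate(docs):
        zs = rng.integers(K, size=len(doc))
        z.append(zs)
        for w, t in zip(doc, zs):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]
                # Remove the current assignment, then resample z_{d,n}
                # from its full conditional given all other assignments
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][n] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw

toy_docs = [[0, 1, 0, 2], [3, 4, 3], [0, 2, 1]]
ndk, nkw = lda_gibbs(toy_docs, K=2, V=5)
```

The returned count matrices, after adding the priors and normalizing, estimate the topic-document and term-topic distributions.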

scikit-learn did not support latent Dirichlet allocation when this tutorial was written (it has since added `sklearn.decomposition.LatentDirichletAllocation`).

Therefore, we are going to use the gensim package in Python.

Gensim is developed by Radim Řehůřek, a machine learning researcher and consultant in the Czech Republic. We must start by installing it. We can achieve this by running the following command:

```
pip install gensim
```

In [1]:
```
%matplotlib inline
from __future__ import print_function
from wordcloud import WordCloud
from gensim import corpora, models, similarities, matutils
import matplotlib.pyplot as plt
import numpy as np
```

Download the Associated Press corpus from http://www.cs.princeton.edu/~blei/lda-c/ap.tgz

Unzip the data and put it into /Users/chengjun/bigdata/ap/

In [2]:
```
# Load the data
corpus = corpora.BleiCorpus('/Users/chengjun/bigdata/ap/ap.dat', '/Users/chengjun/bigdata/ap/vocab.txt')
```

In [53]:
```
' '.join(dir(corpus))
```

In [112]:
```
corpus.id2word.items()[:3]
```

In [3]:
```
NUM_TOPICS = 100
```

In [4]:
```
model = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, alpha=None)
```

In [54]:
```
' '.join(dir(model))
```

In [5]:
```
document_topics = [model[c] for c in corpus]
```

In [6]:
```
# How many topics does one document cover?
document_topics[2]
```

In [7]:
```
# The first topic
# format: weight, term
model.show_topic(0, 10)
```

In [8]:
```
# The 100th topic
# format: weight, term
model.show_topic(99, 10)
```
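One caveat: in the gensim version used here, `show_topic` returns (weight, term) pairs, whereas modern gensim returns (term, probability) pairs. A small helper (the name `as_weight_term` is ours, not gensim's) can normalize either order:

```python
def as_weight_term(pair):
    """Normalize a (weight, term) or (term, weight) pair to (weight, term)."""
    a, b = pair
    return (a, b) if isinstance(a, float) else (b, a)

# Works for both conventions
print(as_weight_term((0.02, 'court')))  # (0.02, 'court')
print(as_weight_term(('court', 0.02)))  # (0.02, 'court')
```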
In [213]:
```
words = model.show_topic(0, 5)
words
```

In [215]:
```
model.show_topics(4)
```

In [214]:
```
for f, w in words[:10]:
    print(f, w)
```
In [39]:
```
# Write out topics with 10 terms and their normalized weights
for ti in range(model.num_topics):
    words = model.show_topic(ti, 10)
    tf = sum(f for f, w in words)
    with open('/Users/chengjun/github/cjc2016/data/topics_term_weight.txt', 'a') as output:
        for f, w in words:
            line = str(ti) + '\t' + w + '\t' + str(f / tf)
            output.write(line + '\n')
```

In [40]:
```
# We first identify the most discussed topic, i.e., the one with the
# highest total weight
topics = matutils.corpus2dense(model[corpus], num_terms=model.num_topics)
weight = topics.sum(1)
max_topic = weight.argmax()
```
In [216]:
```
# Get the top 64 words for this topic
# Without the argument, show_topic would return only 10 words
words = model.show_topic(max_topic, 64)
words = np.array(words).T
words_freq = [float(i) * 10000000 for i in words[0]]
words = zip(words[1], words_freq)
```

In [219]:
```
wordcloud = WordCloud().generate_from_frequencies(words)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
```

In [104]:
```
num_topics_used = [len(model[doc]) for doc in corpus]
fig, ax = plt.subplots()
ax.hist(num_topics_used, np.arange(42))
ax.set_ylabel('Nr of documents')
ax.set_xlabel('Nr of topics')
fig.tight_layout()
# fig.savefig('Figure_04_01.png')
```
In [109]:
```
# Now, repeat the same exercise using alpha=1.0
# You can edit the constant below to play around with this parameter
ALPHA = 1.0
model1 = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, alpha=ALPHA)
num_topics_used1 = [len(model1[doc]) for doc in corpus]
```

In [108]:
```
fig, ax = plt.subplots()
ax.hist([num_topics_used, num_topics_used1], np.arange(42))
ax.set_ylabel('Nr of documents')
ax.set_xlabel('Nr of topics')
# The coordinates below were fit by trial and error to look good
plt.text(9, 223, r'default alpha')
plt.text(26, 156, 'alpha=1.0')
fig.tight_layout()
```
In [113]:
```
with open('/Users/chengjun/bigdata/ap/ap.txt', 'r') as f:
    dat = f.readlines()
```

In [130]:
```
dat[:6]
```

In [194]:
```
dat[4].strip()[0]
```

In [195]:
```
# Keep only the lines that are actual text (not SGML tags such as <TEXT>)
docs = []
for i in dat[:100]:
    if i.strip()[0] != '<':
        docs.append(i)
```
In [196]:
```
# Strip punctuation
def clean_doc(doc):
    doc = doc.replace('.', '').replace(',', '')
    doc = doc.replace('``', '').replace('"', '')
    doc = doc.replace('_', '').replace("'", '')
    doc = doc.replace('!', '')
    return doc

docs = [clean_doc(doc) for doc in docs]
```

In [197]:
```
texts = [doc.lower().split() for doc in docs]
```

In [198]:
```
from nltk.corpus import stopwords
stop = stopwords.words('english')
```
In [222]:
```
' '.join(stop)
```

In [223]:
```
stop.append('said')
```

In [199]:
```
# Drop words that occur only once, as well as stopwords
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1 and token not in stop]
         for text in texts]
```
In [200]:
```
docs[8]
```

In [203]:
```
' '.join(texts[9])
```
In [204]:
```
dictionary = corpora.Dictionary(texts)
lda_corpus = [dictionary.doc2bow(text) for text in texts]
# The function doc2bow() simply counts the number of occurrences of each distinct word,
# converts the word to its integer word id, and returns the result as a sparse vector.
```
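To make the doc2bow step concrete, here is a minimal pure-Python equivalent (a sketch of the idea, not gensim's actual implementation):

```python
def doc2bow(tokens, vocab):
    """Count token occurrences; vocab maps word -> integer id and is
    extended in place when an unseen word appears."""
    counts = {}
    for tok in tokens:
        wid = vocab.setdefault(tok, len(vocab))
        counts[wid] = counts.get(wid, 0) + 1
    return sorted(counts.items())

vocab = {}
print(doc2bow(['apple', 'banana', 'apple'], vocab))  # [(0, 2), (1, 1)]
print(vocab)                                         # {'apple': 0, 'banana': 1}
```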
In [205]:
```
lda_model = models.ldamodel.LdaModel(
    lda_corpus, num_topics=NUM_TOPICS, id2word=dictionary, alpha=None)
```

In [206]:
```
import pyLDAvis.gensim
ap_data = pyLDAvis.gensim.prepare(lda_model, lda_corpus, dictionary)
```
In [221]:
```
pyLDAvis.enable_notebook()
pyLDAvis.display(ap_data)
```

In [220]:
```
pyLDAvis.save_html(ap_data, '/Users/chengjun/github/cjc2016/vis/ap_ldavis.html')
```
- Richert W, Coelho LP. Building Machine Learning Systems with Python, Chapter 4. Packt Publishing; 2013.
- "The east wind at night blooms a thousand trees": a preliminary topic analysis of Song ci poetry (in Chinese). http://chengjun.github.io/cn/2013/09/topic-modeling-of-song-peom/