On the eve of the 2014 college entrance examination (gaokao), Baidu, "drawing on a massive collection of model essays and search data, used probabilistic topic models to predict the likely directions of the 2014 gaokao essay prompts." As the figure above shows, the predictions fall into six themes: time, life, nation, education, the mind, and development. Each theme in turn contains a set of concrete keywords; the "life" theme, for example, includes: ordinariness, freedom, beauty, dreams, striving, youth, happiness, and loneliness.
The simplest topic model (on which all others are based) is latent Dirichlet allocation (LDA).
The relationships between topics and documents, and between topics and terms, cannot be observed directly.
Topic models algorithmically identify the set of latent variables (topics) that best explains the observed distribution of terms in the documents.
The document-term matrix (DTM) is thereby decomposed into two matrices: a document-topic matrix and a topic-term matrix.
Each document can be assigned a primary topic, the one with the highest document-topic probability, and can then be linked to other topics with decreasing probabilities.
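As a toy numerical illustration of this decomposition (the numbers below are invented for the example and do not come from any real corpus): multiplying the document-topic matrix by the topic-term matrix gives back the expected term distribution of each document, which is what the decomposition of the DTM tries to recover.

import numpy as np
# hypothetical document-topic proportions for 4 documents and 2 topics
doc_topic = np.array([[0.9, 0.1],
                      [0.8, 0.2],
                      [0.1, 0.9],
                      [0.3, 0.7]])
# hypothetical topic-term distributions over a 6-word vocabulary
topic_term = np.array([[0.40, 0.30, 0.20, 0.05, 0.03, 0.02],
                       [0.02, 0.03, 0.05, 0.20, 0.30, 0.40]])
expected_terms = doc_topic.dot(topic_term)
print(expected_terms.shape)  # (4, 6): one expected term distribution per document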
Assume there are K topics across D documents; the topics are denoted $\phi_{1:K}$.
Each topic $\phi_k$ is a distribution over the fixed vocabulary of the given documents.
The topic proportions for document d are denoted $\theta_d$.
Let $w_{d,n}$ denote the nth term in document d.
Further, topic models assign topics to each document and to each of its terms: $z_{d,n}$ denotes the topic assignment of the nth term in document d, and $z_{1:D}$ collects the assignments for all documents.
According to Blei et al., the joint distribution of $\phi_{1:K}$, $\theta_{1:D}$, $z_{1:D}$, and the observed words $w_{1:D}$, which encodes the generative process of LDA, can be expressed as:
$ p(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\phi_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d) \, p(w_{d,n} \mid \phi_{1:K}, z_{d,n}) \right) $
Note that $\phi_{1:K}$, $\theta_{1:D}$, and $z_{1:D}$ are latent, unobservable variables. The central computational challenge of LDA is therefore to compute their conditional distribution given the observed words $w_{d,n}$ in the documents.
Accordingly, the posterior distribution of LDA can be expressed as:
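$ p(\phi_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{p(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})} $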
Because the number of possible topic structures is exponentially large, this posterior cannot be computed exactly: the denominator, the marginal probability $p(w_{1:D})$ of the observed words, is intractable. Topic modeling algorithms therefore aim to approximate the posterior efficiently.
Using Gibbs sampling, we can build a Markov chain over the sequence of random variables (see Eq 1) whose limiting distribution is this posterior; drawing samples from the chain after it converges therefore approximates the posterior.
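To make the idea concrete, here is a minimal collapsed Gibbs sampler for LDA. It is only an illustrative sketch: the function gibbs_lda, its arguments, and the default hyperparameters are invented for this example. Note also that gensim's LdaModel, used below, estimates the model with online variational Bayes rather than Gibbs sampling.

import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """docs: list of documents, each a list of integer word ids in [0, V)."""
    rng = np.random.RandomState(seed)
    D = len(docs)
    ndk = np.zeros((D, K))            # document-topic counts
    nkw = np.zeros((K, V))            # topic-word counts
    nk = np.zeros(K)                  # total tokens assigned to each topic
    z = []                            # topic assignment of every token
    for d, doc in enumerate(docs):    # random initialization
        zd = rng.randint(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]           # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional p(z_{d,n}=k | all other assignments, words)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k           # resample and restore the counts
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    phi = (nkw + beta) / (nkw.sum(axis=1, keepdims=True) + V * beta)      # topic-term
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + K * alpha)  # document-topic
    return phi, theta

Here phi and theta are smoothed point estimates of the topic-term and document-topic distributions.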
Unfortunately, at the time of writing scikit-learn did not support latent Dirichlet allocation.
Therefore, we are going to use the gensim package in Python.
Gensim was developed by Radim Řehůřek, a machine learning researcher and consultant in the Czech Republic. We must start by installing it, which we can do by running the following command:
pip install gensim
In [21]:
%matplotlib inline
from __future__ import print_function
from wordcloud import WordCloud
from gensim import corpora, models, similarities, matutils
import matplotlib.pyplot as plt
import numpy as np
Download the Associated Press sample data from http://www.cs.princeton.edu/~blei/lda-c/ap.tgz
Unzip it and put the files into /Users/chengjun/bigdata/ap/
In [22]:
# Load the data
corpus = corpora.BleiCorpus('/Users/chengjun/bigdata/ap/ap.dat', '/Users/chengjun/bigdata/ap/vocab.txt')
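# ap.dat is in Blei's lda-c format: each line describes one document as
#   N id1:count1 id2:count2 ...
# where N is the number of unique terms in the document, and vocab.txt lists
# one term per line (the line number is the term's integer id)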
In [23]:
' '.join(dir(corpus))
Out[23]:
In [24]:
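# note: the original code assumes Python 2; under Python 3, dict.items()
# returns a view and cannot be sliced, so use list(corpus.id2word.items())[:3]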
corpus.id2word.items()[:3]
Out[24]:
In [25]:
NUM_TOPICS = 100
In [26]:
model = models.ldamodel.LdaModel(
corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, alpha=None)
In [27]:
' '.join(dir(model))
Out[27]:
In [28]:
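# model[c] returns the document's topic distribution as a sparse list of
# (topic_id, probability) pairs; gensim omits topics whose probability falls
# below a small minimum threshold (0.01 by default)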
document_topics = [model[c] for c in corpus]
In [29]:
# how many topics does one document cover?
document_topics[2]
Out[29]:
In [30]:
# The first topic
# format: weight, term
model.show_topic(0, 10)
Out[30]:
In [31]:
# The 100th topic (index 99)
# format: weight, term
model.show_topic(99, 10)
Out[31]:
In [32]:
words = model.show_topic(0, 5)
words
Out[32]:
In [33]:
model.show_topics(4)
Out[33]:
In [34]:
for f, w in words[:10]:
    print(f, w)
In [39]:
# write out topics with 10 terms each and their weights
for ti in range(model.num_topics):
    words = model.show_topic(ti, 10)
    tf = sum(f for f, w in words)
    with open('/Users/chengjun/github/cjc2016/data/topics_term_weight.txt', 'a') as output:
        for f, w in words:
            line = str(ti) + '\t' + w + '\t' + str(f/tf)
            output.write(line + '\n')
In [40]:
# We first identify the most discussed topic, i.e., the one with the
# highest total weight
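# corpus2dense turns the sparse topic distributions into a dense
# (num_topics x num_documents) matrix, so summing along axis 1 gives
# each topic's total weight across the corpus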
topics = matutils.corpus2dense(model[corpus], num_terms=model.num_topics)
weight = topics.sum(1)
max_topic = weight.argmax()
In [216]:
# Get the top 64 words for this topic
# Without the argument, show_topic would return only 10 words
words = model.show_topic(max_topic, 64)
words = np.array(words).T
words_freq=[float(i)*10000000 for i in words[0]]
words = zip(words[1], words_freq)
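Note: newer releases of the wordcloud package expect a dictionary mapping words to frequencies in generate_from_frequencies, so with a recent version pass dict(words) in the next cell instead of the list of pairs.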
In [219]:
wordcloud = WordCloud().generate_from_frequencies(words)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
In [104]:
num_topics_used = [len(model[doc]) for doc in corpus]
fig,ax = plt.subplots()
ax.hist(num_topics_used, np.arange(42))
ax.set_ylabel('Nr of documents')
ax.set_xlabel('Nr of topics')
fig.tight_layout()
#fig.savefig('Figure_04_01.png')
In [109]:
# Now, repeat the same exercise using alpha=1.0
# (alpha controls how concentrated each document's topic distribution is:
# a larger alpha spreads each document's weight over more topics)
# You can edit the constant below to play around with this parameter
ALPHA = 1.0
model1 = models.ldamodel.LdaModel(
corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, alpha=ALPHA)
num_topics_used1 = [len(model1[doc]) for doc in corpus]
In [108]:
fig,ax = plt.subplots()
ax.hist([num_topics_used, num_topics_used1], np.arange(42))
ax.set_ylabel('Nr of documents')
ax.set_xlabel('Nr of topics')
# The coordinates below were fit by trial and error to look good
plt.text(9, 223, r'default alpha')
plt.text(26, 156, 'alpha=1.0')
fig.tight_layout()
http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb
pip install pyldavis
In [1]:
with open('/Users/chengjun/bigdata/ap/ap.txt', 'r') as f:
    dat = f.readlines()
In [2]:
dat[:6]
Out[2]:
In [3]:
dat[4].strip()[0]
Out[3]:
In [4]:
docs = []
for i in dat[:100]:
    if i.strip()[0] != '<':
        docs.append(i)
In [5]:
def clean_doc(doc):
    doc = doc.replace('.', '').replace(',', '')
    doc = doc.replace('``', '').replace('"', '')
    doc = doc.replace('_', '').replace("'", '')
    doc = doc.replace('!', '')
    return doc
docs = [clean_doc(doc) for doc in docs]
In [6]:
texts = [doc.lower().split() for doc in docs]
In [ ]:
import nltk
nltk.download()
# A download window will open; select "book" and click Download. Once the download finishes, the corpora are ready to use.
In [7]:
from nltk.corpus import stopwords
stop = stopwords.words('english') # if this raises an error, run the code in the previous block first
In [8]:
' '.join(stop)
Out[8]:
In [9]:
stop.append('said')
In [10]:
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1 and token not in stop]
         for text in texts]
In [11]:
docs[8]
Out[11]:
In [12]:
' '.join(texts[9])
Out[12]:
In [15]:
dictionary = corpora.Dictionary(texts)
lda_corpus = [dictionary.doc2bow(text) for text in texts]
# The function doc2bow() simply counts the number of occurrences of each distinct word,
# converts the word to its integer word id, and returns the result as a sparse vector.
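As a quick illustration of what doc2bow returns, here is a tiny standalone example (the words and resulting ids are invented for the example and do not come from the AP corpus):

toy_texts = [['human', 'computer', 'interaction'],
             ['computer', 'system', 'human', 'human']]
toy_dictionary = corpora.Dictionary(toy_texts)
print(toy_dictionary.doc2bow(toy_texts[1]))
# something like [(0, 2), (1, 1), (3, 1)]: (word id, count) pairs,
# with the exact ids depending on how the dictionary was built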
In [18]:
lda_model = models.ldamodel.LdaModel(
lda_corpus, num_topics=NUM_TOPICS, id2word=dictionary, alpha=None)
In [19]:
import pyLDAvis.gensim
ap_data = pyLDAvis.gensim.prepare(lda_model, lda_corpus, dictionary)
In [20]:
pyLDAvis.enable_notebook()
pyLDAvis.display(ap_data)
Out[20]:
In [220]:
pyLDAvis.save_html(ap_data, '/Users/chengjun/github/cjc2016/vis/ap_ldavis.html')
Willi Richert and Luis Pedro Coelho (2013). Building Machine Learning Systems with Python, Chapter 4. Packt Publishing.
东风夜放花千树: A preliminary topic analysis of Song ci poetry (in Chinese). http://chengjun.github.io/cn/2013/09/topic-modeling-of-song-peom/
Chandra, Y., Jiang, L. C., and Wang, C.-J. (2016). Mining Social Entrepreneurship Strategies Using Topic Modeling. PLoS ONE 11(3): e0151342. doi:10.1371/journal.pone.0151342