Latent Dirichlet Allocation (LDA) is a well-known probabilistic model for discovering mixtures of topics in an unsupervised way. It has been applied to a large number of problems (Blei, 2012), and the original paper (Blei et al., 2003) has over 10,000 citations.
In statistical terms, a generative model is a model for randomly generating observable data; it specifies a joint distribution over observed and latent variables. The joint distribution for the LDA model can be shown as a graphical model (the plate diagram below). There is also a generative story, which can sometimes be more intuitive than the plate diagram.
In [1]:
from IPython.display import Image
Image(filename='lda_plate.png')
Out[1]:
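As a rough illustration of that generative story, here is a minimal numpy sketch (the sizes and hyperparameters are arbitrary placeholders, and this is not the model fit later in the notebook): each document draws topic proportions from a Dirichlet, each word draws a topic from those proportions, and the word itself is drawn from that topic's word distribution.

import numpy as np

np.random.seed(0)
n_topics, n_vocab, n_docs, n_words = 3, 10, 5, 20
topic_word = np.random.dirichlet(np.ones(n_vocab), size=n_topics)  # one word distribution per topic
toy_docs = []
for d in range(n_docs):
    doc_topics = np.random.dirichlet(np.ones(n_topics))            # topic proportions for this document
    z = np.random.choice(n_topics, size=n_words, p=doc_topics)     # a topic for each word slot
    toy_docs.append([np.random.choice(n_vocab, p=topic_word[k]) for k in z])
print(toy_docs[0])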
Recall that the Dirichlet Process (DP) (Ferguson, 1973) is essentially a distribution over distributions: each draw from a DP is itself a distribution and, importantly for clustering applications, it serves as a natural prior that lets the number of clusters grow as the data grows. The DP has a base distribution parameter $\beta$ and a strength or concentration parameter $\alpha$.
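To make the "distribution over distributions" idea concrete, here is a minimal stick-breaking sketch (the standard normal base distribution, the concentration value, and the truncation level are arbitrary choices for illustration, not values used elsewhere in this notebook):

import numpy as np

np.random.seed(1)
dp_alpha = 2.0   # concentration: larger values spread mass over more atoms
n_atoms = 1000   # truncation level for this finite sketch

v = np.random.beta(1.0, dp_alpha, size=n_atoms)                  # stick-breaking proportions
weights = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # lengths of the broken sticks
atoms = np.random.normal(0.0, 1.0, size=n_atoms)                 # atom locations from the base distribution

# (weights, atoms) together form one draw from the DP: a discrete distribution
print(weights[:5])
print(atoms[:5])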
We are still modeling topics; however, documents become sites (positions in the alignment) and words become transitions. Transitions may be defined in nucleotide, amino acid, or codon space.
For each of the transition positions $(m, n)$, where $n \in \{1,\dots,N\}$ and $m \in \{1,\dots,M\}$:
$\phi$ is a $K \times V$ Markov (row-stochastic) matrix, each row of which is the transition distribution of a topic.
The vocabulary (of size $V$) is the set of all transition types that we want to consider. Each codon transition can be represented as codon1-codon2-feature. For example, CAC-CAT-CpG=True would represent a synonymous mutation for histidine in the vicinity of a CpG island.
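As a small illustration of that encoding (a hypothetical helper, not part of the simulation code below; the CpG flag is supplied by the caller):

def transition_token(codon_from, codon_to, cpg):
    # encode a codon transition plus a context feature as a single vocabulary item
    return "%s-%s-CpG=%s" % (codon_from, codon_to, cpg)

print(transition_token("CAC", "CAT", True))   # CAC-CAT-CpG=True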
The reason that $\beta$ is not connected directly to $w_{n}$ is that words in documents tend to be sparse; this formulation is a smoothed version of LDA, and it helps deal with the large number of zero probabilities.
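To see why the smoothing helps (a standard Dirichlet-multinomial identity, stated here for intuition rather than taken from the original paper): with a symmetric Dirichlet($\beta$) prior on each topic's word distribution, the posterior expectation of topic $k$ assigning probability to word $v$, given counts $n_{k,v}$, is $\frac{n_{k,v} + \beta}{\sum_{v'} n_{k,v'} + V\beta}$, so transitions that were never observed under a topic still receive non-zero probability.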
Then we generate sequences: we assume that each codon position is governed by a specific class, and that each sequence is only one mutation away from the root sequence.
Transform our sequences into codon transitions (i.e. words); ignoring non-transitions, there are $(61 \times 61) - 61 = 3660$ possible codon transitions.
In [13]:
vocabulary = []
for cdn1 in sim.codons:
    for cdn2 in sim.codons:
        if cdn1 == cdn2:
            continue
        vocabulary.append(cdn1 + "-" + cdn2)
print 'vocabulary: ', len(vocabulary)

# transition relative to root
transitions = np.zeros((N, M-1)).astype(str)
for i in range(sequences.shape[0]):
    for j in range(sequences.shape[1] - 1):
        if sequences[i, 0] == sequences[i, j+1]:
            transitions[i, j] = '-'
        else:
            transitions[i, j] = sequences[i, 0] + "-" + sequences[i, j+1]
print transitions
In [14]:
# convert words into vectors
vocab = set([])
for w in range(transitions.shape[0]):
    posTransitions = transitions[w, :]
    for t in posTransitions:
        if t != '-':
            vocab.update([t])
vocab = list(vocab)
print vocab

## documents are positions in alignment
data = []
for w in range(transitions.shape[0]):
    posTransitions = transitions[w, :]
    document = []
    for v in vocab:
        document.append(len(np.where(posTransitions == v)[0]))
    data.append(document)
    print document
data = np.array(data)
In [15]:
import numpy as np
import pymc as pm

K = 3           # number of topics
V = len(vocab)  # number of words (transition types)

# the categorical word likelihood below expects each document as a list of word
# indices, so expand the count matrix from the previous cell into word tokens
docs = [np.repeat(np.arange(V), data[d]) for d in range(data.shape[0])]
D = len(docs)   # number of documents (positions in the alignment)

alpha = np.ones(K)  # symmetric prior on per-document topic proportions
beta = np.ones(V)   # symmetric prior on per-topic word distributions

theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i,
                                            pm.Dirichlet("ptheta_%s" % i, theta=alpha))
                      for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k,
                                          pm.Dirichlet("pphi_%s" % k, theta=beta))
                    for k in range(K)])

Wd = [len(doc) for doc in docs]  # number of word tokens per document

z = pm.Container([pm.Categorical('z_%i' % d,
                                 p=theta[d],
                                 size=Wd[d],
                                 value=np.random.randint(K, size=Wd[d]))
                  for d in range(D)])

# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d, i),
                                 p=pm.Lambda('phi_z_%i_%i' % (d, i),
                                             lambda z=z[d][i], phi=phi: phi[z]),
                                 value=docs[d][i],
                                 observed=True)
                  for d in range(D) for i in range(Wd[d])])

model = pm.Model([theta, phi, z, w])
M = pm.MCMC(model)
M.sample(5000, burn=500)
pm.Matplot.plot(M)
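After sampling, one way to inspect the fitted topics (a sketch relying on PyMC's default traces of the CompletedDirichlet nodes named "phi_%s" above):

# posterior mean of each topic's distribution over codon transitions
for k in range(K):
    phi_samples = M.trace('phi_%s' % k)[:]        # MCMC samples of topic k's word distribution
    print(phi_samples.mean(axis=0).ravel())       # average over samples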