Latent Dirichlet allocation (LDA) is a well-known probabilistic model for handling mixtures of topics in an unsupervised way. It has been applied to a large number of problems (Blei, 2012), and the original paper (Blei et al., 2003) has over 10,000 citations.
It is also important to note that the results come with associated probabilities, so additional models can be appended to make the output more CAT-like.
Rodrigue et al. (2009) discuss phenomenological approaches (e.g. Huelsenbeck, 2006) versus mechanistic ones (e.g. Robinson et al., 2003). The former focus on detecting selective effects, while the latter try to explain selective effects through the model. Generative models can be either phenomenological or mechanistic, but instead of trying to detect or explain selective effects directly, the approach focuses on explaining how the data are generated. Given a reasonable process that explains the data, we can detect selective effects from the posterior.
In statistical terms, a generative model is a model for randomly generating observable data; it specifies a joint distribution over observed and latent variables. The joint distribution for the LDA model can be shown as a graphical model (the plate diagram below). There is also a generative story (sketched below) that can sometimes be more intuitive than the plate diagram.
In [10]:
from IPython.display import Image
Image(filename='lda_plate.png')
Out[10]:
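To complement the plate diagram, here is a minimal sketch of the LDA generative story in NumPy. It is only an illustration; the sizes (K, V, number and length of documents) are arbitrary choices and not taken from the model developed below.

import numpy as np

np.random.seed(42)
K, V, D, Nd = 3, 20, 5, 10            # topics, vocabulary size, documents, words per document
alpha, beta = np.ones(K), np.ones(V)  # symmetric Dirichlet hyperparameters

phi = np.random.dirichlet(beta, size=K)     # phi_k ~ Dirichlet(beta): topic -> word distribution
theta = np.random.dirichlet(alpha, size=D)  # theta_d ~ Dirichlet(alpha): document -> topic proportions

documents = []
for d in range(D):
    words = []
    for n in range(Nd):
        z = np.argmax(np.random.multinomial(1, theta[d]))  # z_{d,n} ~ Categorical(theta_d)
        w = np.argmax(np.random.multinomial(1, phi[z]))    # w_{d,n} ~ Categorical(phi_{z_{d,n}})
        words.append(w)
    documents.append(words)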
Recall that the Dirichlet Process (DP) (Ferguson, 1973) is essentially a distribution over distributions: each draw from a DP is itself a distribution, and, importantly for clustering applications, it serves as a natural prior that lets the number of clusters grow as the data grow. The DP has a base distribution parameter $\beta$ and a strength or concentration parameter $\alpha$.
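As a quick reminder of what a single draw from a DP looks like, here is a minimal stick-breaking sketch. The truncation level and the standard normal used as a stand-in for the base distribution $\beta$ are arbitrary choices, purely for illustration.

import numpy as np

np.random.seed(0)
alphaDP = 2.0   # concentration parameter (the alpha above)
T = 50          # truncation level for this sketch

# stick-breaking weights: v_k ~ Beta(1, alpha), pi_k = v_k * prod_{j<k} (1 - v_j)
v = np.random.beta(1.0, alphaDP, size=T)
pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

# atoms drawn from the base distribution (here N(0,1) stands in for beta)
atoms = np.random.normal(0.0, 1.0, size=T)

# the (truncated) draw G ~ DP(alpha, beta) is the discrete distribution sum_k pi_k * delta_{atoms_k}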
There are different ways of going about this; here is my first take on it. We are still modeling topics; however, documents become genes and words become codon-specific nucleotide transitions.
For each of the transition positions $(m,n)$, where $n \in \{1,\dots,N\}$ and $m \in \{1,\dots,M\}$, choose a topic $z_{m,n} \sim \mathrm{Multinomial}(\theta_{m})$ and then a transition $w_{m,n} \sim \mathrm{Multinomial}(\phi_{z_{m,n}})$.
$\phi$ is a $K \times V$ Markov matrix, each row of which denotes the transition distribution of a topic.
The vocabulary (of size $V$) is the set of all possible transition types that we want to consider. Each codon transition can be represented as codon1-codon2-feature. For example, CAC-CAT-CpG=True would represent a synonymous mutation for histidine in the vicinity of a CpG island.
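For concreteness, a token like CAC-CAT-CpG=True could be built with something along these lines; encode_transition and the boolean CpG flag are hypothetical and only illustrate the codon1-codon2-feature format (the simulation below uses the simpler codon1-codon2 form).

# hypothetical helper illustrating the codon1-codon2-feature token format
def encode_transition(codon1, codon2, cpg):
    return "%s-%s-CpG=%s" % (codon1, codon2, cpg)

print encode_transition("CAC", "CAT", True)  # CAC-CAT-CpG=True (synonymous change for histidine)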
The reason that $\beta$ is not connected directly to $w_{n}$ is that words in documents tend to be sparse; this formulation is a smoothed version of LDA, which helps deal with the large number of zero probabilities.
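To see what the smoothing buys us, compare the maximum-likelihood estimate of a topic's transition distribution with the posterior mean under a symmetric Dirichlet($\beta$) prior. The counts below are made up purely for illustration; the point is that every transition type ends up with nonzero probability.

import numpy as np

V = 5                                    # toy vocabulary size
beta = 1.0                               # symmetric Dirichlet hyperparameter
counts = np.array([4., 0., 1., 0., 0.])  # toy transition counts for one topic

mle = counts / counts.sum()                             # unsmoothed: zeros for unseen transitions
smoothed = (counts + beta) / (counts.sum() + V * beta)  # posterior mean: no zeros

print mle       # [ 0.8  0.   0.2  0.   0. ]
print smoothed  # [ 0.5  0.1  0.2  0.1  0.1]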
In [11]:
import sys,os
sys.path.append(os.path.join("..","library"))
%matplotlib inline
from Simulations import Simulator
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
# instantiate the simulator class
sim = Simulator()
# set a non-zero probability of all codon transitions
sim.baseMean = 0.001
sim.baseSd = 0.0001
# set the within class probabilities
sim.withinClsMean = 30
sim.withinClsSd = 3
# non-synomymous to synonymous rate (non-zero and <= 1.0)
sim.nsRate = 0.2
# distribution for no transitions
sim.ntMean = 80.0
sim.ntSd = 1.0
# get a base matrix
mat = sim.get_base_matrix()
# create a prob matrix for each class and specify the class codon relationships
cMatrices = {}
classCodons = {}
for c,classAa in sim.pp2aa.iteritems():
    _classCodons = []
    for aa in classAa:
        for codon in sim.aa2codon[aa]:
            _classCodons.append(codon)
    cMatrices[c] = sim.generate_class_matrix(mat,_classCodons)
    classCodons[c] = _classCodons
# short hand notation for the classes
classShort = {"neg":"1",\
"pos":"2",\
"pnc":"3",\
"aro":"4",\
"npa":"5"}
# summarizing plot
cmap = plt.cm.PuBuGn
fig = plt.figure(figsize=(20,25),dpi=500)
subplt = 0
for c in sim.pp2aa.keys():
    cmat = cMatrices[c]
    subplt += 1
    ax = fig.add_subplot(3,2,subplt)
    hmap = ax.imshow(cmat, interpolation='nearest',aspect='auto',cmap=cmap)
    _ = ax.set_title("class=%s"%c)
    _ = ax.set_xticks(range(sim.codons.size))
    _ = ax.set_yticks(range(sim.codons.size))
    _ = ax.set_xticklabels(sim.codons,fontsize=7)
    _ = ax.set_yticklabels(sim.codons,fontsize=7)
    xlabs = ax.get_xticklabels()
    plt.setp(xlabs, rotation=90)
    ax.set_aspect(1./ax.get_data_ratio())
#cb1 = mpl.colorbar.ColorbarBase(axCb,cmap=cmap,orientation='vertical')
plt.subplots_adjust(hspace=0.1,wspace=0.00001)
Then we generate sequences: we assume that each codon position is governed by a specific class, and that each sequence is only one mutation away from the root sequence.
In [16]:
M = 10 ## total genes (transitions are M-1)
N = 5 ## total codons
## reprint the classes just for reference
for key,item in classCodons.iteritems():
    print key,item
## randomly select the governing classes
classes = np.array(classCodons.keys())
seedClasses = classes[np.random.randint(0,4,N)]
print 'seed classes', seedClasses
## use a matrix to store the sequences (column 0 is the root)
sequences = np.zeros((N,M),).astype(str)
## get the root sequence codons given the class
for j in range(N):
    # select a root codon
    cls = seedClasses[j]
    clsIndx = np.where(classes==cls)[0]
    cmat = cMatrices[cls]
    normalizedProbs = cmat.sum(axis=0) / cmat.sum(axis=0).sum()
    pairs = [(normalizedProbs[i],sim.codons[i]) for i in range(len(normalizedProbs))]
    _rootCodon = np.random.multinomial(1, zip(*pairs)[0])
    rootCodon = sim.codons[np.where(_rootCodon==1)[0][0]]
    sequences[j,0] = rootCodon
    # fill in the rest of the codon sequence
    for c in range(1,M):
        codonIndx = np.where(sim.codons == rootCodon)[0][0]
        normalizedProbs = cmat[:,codonIndx] / cmat[:,codonIndx].sum()
        pairs = [(normalizedProbs[i],sim.codons[i]) for i in range(len(normalizedProbs))]
        _nextCodon = np.random.multinomial(1, zip(*pairs)[0])
        nextCodon = sim.codons[np.where(_nextCodon==1)[0][0]]
        sequences[j,c] = nextCodon
print sequences
Transform our sequences into codon transitions (i.e. words). Ignoring non-transitions, there are $(61 \times 61) - 61 = 3660$ possible transition types.
In [13]:
vocabulary = []
for cdn1 in sim.codons:
    for cdn2 in sim.codons:
        if cdn1 == cdn2:
            continue
        vocabulary.append(cdn1+"-"+cdn2)
print 'vocabulary: ', len(vocabulary)
transitions = np.zeros((N,M-1),).astype(str)
# transition relative to root
for i in range(sequences.shape[0]):
    for j in range(sequences.shape[1]):
        if j == sequences.shape[1] - 1:
            continue
        if sequences[i,0] == sequences[i,j+1]:
            transitions[i,j] = '-'
        else:
            transitions[i,j] = sequences[i,0]+"-"+sequences[i,j+1]
print transitions
In [14]:
# convert words into vector
vocab = set([])
for w in range(transitions.shape[0]):
    posTransitions = transitions[w,:]
    for t in posTransitions:
        if t != '-':
            vocab.update([t])
vocab = list(vocab)
print vocab
## documents are positions in alignment
data = []
for w in range(transitions.shape[0]):
    posTransitions = transitions[w,:]
    document = []
    for v in vocab:
        document.append(len(np.where(posTransitions == v)[0]))
    data.append(document)
print document
data = np.array(data)
In [15]:
import numpy as np
import pymc as pm
K = 3 # number of topics
V = len(vocab) # number of words
D = 5 # number of documents
#data = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]])
alpha = np.ones(K)
beta = np.ones(V)
theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])
Wd = [len(doc) for doc in data]
z = pm.Container([pm.Categorical('z_%i' % d,
                                 p=theta[d],
                                 size=Wd[d],
                                 value=np.random.randint(K, size=Wd[d]))
                  for d in range(D)])
# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d,i),
                                 p=pm.Lambda('phi_z_%i_%i' % (d,i),
                                             lambda z=z[d][i], phi=phi: phi[z]),
                                 value=data[d][i],
                                 observed=True)
                  for d in range(D) for i in range(Wd[d])])
model = pm.Model([theta, phi, z, w])
#mcmc = pm.MCMC(model)
#mcmc.sample(100)
M = pm.MCMC(model)
M.sample(5000,burn=500)
pm.Matplot.plot(M)