Latent Dirichlet Allocation - LDA

LDA is a well-known probabilistic model for handling mixtures of topics in an unsupervised way. It has been applied to a large number of problems (Blei, 2012), and the original paper (Blei et al., 2003) has over 10,000 citations.

  • My goal has been to borrow from the LDA literature to improve or modify the CAT family of models
  • But today I am exploring the possibility of using LDA directly
  • Why?
    • The approach is generative and makes fewer assumptions than the CAT models
    • The basic idea is nearly the same as the CAT models -- using a DP to cluster underlying latent classes
    • There exist highly optimized libraries to carry out inference at scale (e.g. millions of documents); see the sketch below

It is also important to note that we obtain probabilities associated with the results, so additional models can be appended to make the output more CAT-like.
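As an aside, gensim is one example of such an optimized library (named here as an assumed tooling choice; it is not used later in this notebook). A minimal sketch of fitting its LDA implementation to hypothetical documents of transition words:

from gensim import corpora, models

# two hypothetical 'documents', each a bag of codon-transition words
texts = [["AGG-AAA", "AGG-CAT", "AGG-CGT"],
         ["GTG-TTG", "GTG-CTG", "GTG-GCT"]]

dictionary = corpora.Dictionary(texts)                  # map each word to an integer id
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words counts per document
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())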

Generative approaches

In Rodrigue et al. (2009) there is a discussion of phenomenological approaches (e.g. Huelsenbeck et al. (2006)) versus mechanistic ones (e.g. Robinson et al. (2003)). The former are focused on detecting selective effects, while the latter try to explain selective effects through the model. Generative models can be both phenomenological and mechanistic, but instead of trying to detect or explain the selective effects the approach focuses on explaining how the data are generated. Given a reasonable process that explains the data, we can detect selective effects from the posterior.

In statistical terms, a generative model is a model for randomly generating observable data: it specifies a joint distribution over observed and latent variables. The joint distribution for the LDA model can be shown as a graphical model. There is also a generative story (see below) that can sometimes be more intuitive than the plate diagram.


In [10]:
from IPython.display import Image

Image(filename='lda_plate.png')


Out[10]: [LDA plate diagram rendered from lda_plate.png]

Recall that the Dirichlet Process (DP) (Ferguson, 1973) is essentially a distribution over distributions: each draw from a DP is itself a distribution, and, importantly for clustering applications, it serves as a natural prior that lets the number of clusters grow as the data grow. The DP has a base distribution parameter $\beta$ and a strength or concentration parameter $\alpha$.
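A minimal sketch of this "clusters grow with the data" behaviour, using the Chinese restaurant process view of the DP (an illustration only; it is not used in the analysis below):

import numpy as np

def crp(n_items, alpha=1.0):
    """Sample cluster assignments for n_items points under a CRP with concentration alpha."""
    assignments = [0]                                    # the first item starts the first cluster
    for i in range(1, n_items):
        counts = np.bincount(assignments)                # current cluster sizes
        probs = np.append(counts, alpha) / (i + alpha)   # existing clusters vs. a new cluster
        assignments.append(np.random.choice(len(probs), p=probs))
    return assignments

print(crp(50, alpha=1.0))   # the number of distinct clusters grows slowly with the data

In the LDA plate diagram: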

  • $\alpha$ is a hyperprior for the DP over per-document topic distributions
  • $\beta$ is the hyperprior for the DP over per-topic word distributions
  • $\theta_{m}$ is the topic distribution for document $m$
  • $\phi_{k}$ is the word distribution for topic $k$
  • $z_{m,n}$ is the topic for the $n$th word in document $m$
  • $w_{m,n}$ is the specific word
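Putting these together, the joint distribution specified by the (smoothed) plate diagram above is

$$p(\theta, \phi, z, w \mid \alpha, \beta) = \prod_{k=1}^{K} p(\phi_{k} \mid \beta) \prod_{m=1}^{M} p(\theta_{m} \mid \alpha) \prod_{n=1}^{N_{m}} p(z_{m,n} \mid \theta_{m}) \, p(w_{m,n} \mid \phi_{z_{m,n}})$$

where $N_{m}$ is the number of words in document $m$.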

Documents and words as genes and nucleotides---the generative story

There are different ways of going about this; here is my first take on it. We are still modeling topics; however, documents become genes and words become codon-specific nucleotide transitions.

  • $\alpha$ is a hyperprior for the DP over per-gene topic distributions
  • $\beta$ is the hyperprior for the DP over per-topic transition distributions
  • $\theta_{m}$ is the topic distribution for gene $m$
  • $\phi_{k}$ is the nucleotide transition distribution for topic $k$
  • $z_{m,n}$ is the topic for the $n$th nucleotide transition in gene $m$
  • $w_{m,n}$ is the specific transition

The generative process

  1. Choose $\theta_{m} \sim \textrm{Dir}(\alpha)$, where $m \in \{1,...M\}$ and $\textrm{Dir}(\alpha)$ is the Dirichlet distribution with parameter $\alpha$
  2. Choose $\phi_{k} \sim \textrm{Dir}(\beta)$, where $k \in \{1,...K\}$
  3. For each of the transition positions ($m$,$n$), where $n \in \{1,...N\}$, and $m \in \{1,...M\}$

    • Choose a topic $z_{m,n} \sim \textrm{Multinomial}(\theta_{m})$
    • Choose a transition $w_{m,n} \sim \textrm{Multinomial}(\phi_{z_{m,n}})$

$\phi$ is a $K \times V$ row-stochastic (Markov) matrix, each row of which is the transition distribution of a topic.
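As a sanity check, here is a minimal numpy sketch of the generative process above (the sizes and the symmetric hyperparameter values are arbitrary choices for illustration):

import numpy as np

K, V, M, N = 3, 10, 5, 8              # topics, vocabulary size, genes, transitions per gene
alpha, beta = 0.5, 0.1                # symmetric Dirichlet hyperparameters (illustrative values)

theta = np.random.dirichlet(alpha * np.ones(K), size=M)   # per-gene topic distributions
phi = np.random.dirichlet(beta * np.ones(V), size=K)      # per-topic transition distributions

words = np.zeros((M, N), dtype=int)
for m in range(M):
    for n in range(N):
        z = np.random.choice(K, p=theta[m])          # topic for transition (m, n)
        words[m, n] = np.random.choice(V, p=phi[z])  # transition (word index) drawn from that topic

print(words)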

Vocabulary and smoothing

The vocabulary (of size $V$) is the set of all possible transition types that we want to consider. Each codon transition can be represented as codon1-codon2-feature. For example, CAC-CAT-CpG=True would represent a synonymous mutation for histidine in the vicinity of a CpG island.
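A small sketch of how such a vocabulary could be enumerated (the CpG flag is one hypothetical feature; the simulation later in this notebook drops the feature and uses plain codon1-codon2 pairs):

# enumerate the 61 sense codons and all codon1-codon2-feature transition types
stop_codons = ("TAA", "TAG", "TGA")
codons = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"
          if a + b + c not in stop_codons]

vocabulary = []
for cdn1 in codons:
    for cdn2 in codons:
        if cdn1 == cdn2:
            continue
        for feature in ("CpG=True", "CpG=False"):
            vocabulary.append("%s-%s-%s" % (cdn1, cdn2, feature))

print(len(vocabulary))   # (61*61 - 61) * 2 = 7320 possible transition types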

The reason that $\beta$ is not connected directly to $w_{m,n}$ is that words in documents tend to be sparse; this formulation is the smoothed version of LDA, which helps deal with the large number of zero probabilities.

Generate some data for an example

First we specify the distributions that the sequences come from.


In [11]:
import sys,os
sys.path.append(os.path.join("..","library"))
%matplotlib inline
from Simulations import Simulator
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

# instantiate the simulator class
sim = Simulator()

# set a non-zero probability of all codon transitions                                                                                                                                 
sim.baseMean = 0.001
sim.baseSd   = 0.0001

# set the within class probabilities
sim.withinClsMean = 30
sim.withinClsSd = 3

# non-synomymous to synonymous rate (non-zero and <= 1.0)                                                                                                                             
sim.nsRate = 0.2

# distribution for no transitions                                                                                                                                                     
sim.ntMean = 80.0
sim.ntSd = 1.0

# get a base matrix
mat = sim.get_base_matrix()

# create a prob matrix for each class and specify the class codon relationships                                                                                                                                                         
cMatrices = {}
classCodons = {}
for c,classAa in sim.pp2aa.iteritems():
    _classCodons = []
    for aa in classAa:
        for codon in sim.aa2codon[aa]:
            _classCodons.append(codon)
    cMatrices[c] = sim.generate_class_matrix(mat,_classCodons)
    classCodons[c] = _classCodons

# short hand notation for the classes
classShort = {"neg":"1",\
              "pos":"2",\
              "pnc":"3",\
              "aro":"4",\
              "npa":"5"}

# summarizing plot
cmap = plt.cm.PuBuGn
fig = plt.figure(figsize=(20,25),dpi=500)
subplt = 0
for c in sim.pp2aa.keys():
    cmat = cMatrices[c]
    subplt += 1
    ax = fig.add_subplot(3,2,subplt)
    hmap = ax.imshow(cmat, interpolation='nearest',aspect='auto',cmap=cmap)
    _ = ax.set_title("class=%s"%c)
    _ = ax.set_xticks(range(sim.codons.size))
    _ = ax.set_yticks(range(sim.codons.size))
    _ = ax.set_xticklabels(sim.codons,fontsize=7)
    _ = ax.set_yticklabels(sim.codons,fontsize=7)
    xlabs = ax.get_xticklabels()
    plt.setp(xlabs, rotation=90)
    ax.set_aspect(1./ax.get_data_ratio())

#cb1 = mpl.colorbar.ColorbarBase(axCb,cmap=cmap,orientation='vertical')

plt.subplots_adjust(hspace=0.1,wspace=0.00001)


Then we generate sequences. We assume that each codon position is governed by a specific class and that each observed codon is only one mutation away from the root sequence.


In [16]:
M = 10   ## total genes (transitions are M-1)
N = 5    ## total codons

## reprint the classes just for reference
for key,item in classCodons.iteritems():
    print key,item

## randomly select the governing classes
classes = np.array(classCodons.keys())
seedClasses = classes[np.random.randint(0,4,N)]
print 'seed classes', seedClasses

## use a matrix to store the sequences (column 0 is the root)
sequences = np.zeros((N,M),).astype(str)

## get the root sequence codons given the class
for j in range(N):
    # select a root codon
    cls = seedClasses[j]
    clsIndx = np.where(classes==cls)[0]
    cmat = cMatrices[cls]
    normalizedProbs = cmat.sum(axis=0) / cmat.sum(axis=0).sum()
    pairs = [(normalizedProbs[i],sim.codons[i]) for i in range(len(normalizedProbs))]
    _rootCodon = np.random.multinomial(1, zip(*pairs)[0]) 
    rootCodon = sim.codons[np.where(_rootCodon==1)[0][0]]
    sequences[j,0] = rootCodon

    # fill in the rest of the codon sequence 
    for c in range(1,M):
        codonIndx = np.where(sim.codons == rootCodon)[0][0]
        normalizedProbs = cmat[:,codonIndx] / cmat[:,codonIndx].sum()
        pairs = [(normalizedProbs[i],sim.codons[i]) for i in range(len(normalizedProbs))]
        _nextCodon = np.random.multinomial(1, zip(*pairs)[0])
        nextCodon = sim.codons[np.where(_nextCodon==1)[0][0]]
        sequences[j,c] = nextCodon
        
print sequences


neg ['GAT', 'GAC', 'GAG', 'GAA']
pnc ['TCT', 'TCG', 'TCA', 'TCC', 'AGC', 'AGT', 'ACC', 'ACA', 'ACG', 'ACT', 'TGT', 'TGC', 'ATG', 'AAC', 'AAT', 'CAA', 'CAG']
npa ['GGT', 'GGG', 'GGA', 'GGC', 'GCA', 'GCC', 'GCG', 'GCT', 'GTA', 'GTC', 'GTG', 'GTT', 'TTA', 'TTG', 'CTC', 'CTT', 'CTG', 'CTA', 'ATC', 'ATA', 'ATT', 'CCT', 'CCG', 'CCA', 'CCC']
pos ['AAG', 'AAA', 'CGA', 'CGC', 'CGG', 'CGT', 'AGG', 'AGA', 'CAT', 'CAC']
aro ['TTT', 'TTC', 'TAT', 'TAC', 'TGG']
seed classes ['npa' 'pnc' 'pnc' 'npa' 'neg']
[['AAA' 'AAA' 'AAA' 'AAA' 'AAA' 'AAA' 'AAA' 'AAA' 'AAA' 'AAA']
 ['GGA' 'GGA' 'GGA' 'GGA' 'GGA' 'GGA' 'GGA' 'GGA' 'GGA' 'GGA']
 ['TCC' 'TCC' 'AGT' 'AGT' 'TCC' 'TCT' 'TCC' 'TCG' 'TCA' 'CAA']
 ['GTT' 'TTG' 'ATT' 'GTG' 'GTT' 'GTC' 'GCT' 'GTT' 'GCT' 'CCA']
 ['GTC' 'GTC' 'GTC' 'GTC' 'GTC' 'GTC' 'GTC' 'GTC' 'GTC' 'GTC']]

Transform our sequences into codon transitions (i.e. words); ignoring non-transitions there are (61 x 61) - 61 = 3660 possible transitions.


In [13]:
vocabulary = []
for cdn1 in sim.codons:
    for cdn2 in sim.codons:
        if cdn1 == cdn2:
            continue
        vocabulary.append(cdn1+"-"+cdn2)
print 'vocabulary: ', len(vocabulary)

transitions = np.zeros((N,M-1),).astype(str)

# transition relative to root
for i in range(sequences.shape[0]):
    for j in range(sequences.shape[1]):
        if j == sequences.shape[1] - 1:
            continue
        if sequences[i,0] == sequences[i,j+1]:
            transitions[i,j] = '-'
        else:
            transitions[i,j] = sequences[i,0]+"-"+sequences[i,j+1]
      
print transitions


vocabulary:  3660
[['-' '-' 'AGG-AAA' '-' 'AGG-CAT' 'AGG-CGT' 'AGG-CGA' 'AGG-CGG' '-']
 ['GTG-TTG' 'GTG-CTG' '-' 'GTG-GCG' 'GTG-CCC' '-' '-' '-' 'GTG-GCT']
 ['GCA-GTG' 'GCA-TTA' 'GCA-CTT' 'GCA-CCT' 'GCA-GTC' 'GCA-TTA' 'GCA-GTT'
  'GCA-GGG' 'GCA-GTC']
 ['-' 'ATT-ATC' 'ATT-GGA' '-' '-' 'ATT-GGA' 'ATT-GGA' 'ATT-GGT' 'ATT-TTG']
 ['-' '-' '-' '-' '-' '-' '-' '-' '-']]

In [14]:
# convert words into vector

vocab = set([])
for w in range(transitions.shape[0]):
    posTransitions = transitions[w,:]
    for t in posTransitions:
        if t != '-':
            vocab.update([t])

vocab = list(vocab)
print vocab
## documents are positions in alignment
data = []
for w in range(transitions.shape[0]):
    posTransitions = transitions[w,:]
    document = []
    for v in vocab:
        document.append(len(np.where(posTransitions == v)[0]))
    data.append(document)
    print document
data = np.array(data)


['GCA-GTC', 'GCA-GGG', 'AGG-CGT', 'GCA-GTT', 'ATT-ATC', 'AGG-AAA', 'GTG-GCG', 'AGG-CAT', 'AGG-CGA', 'ATT-GGA', 'GCA-TTA', 'GTG-TTG', 'AGG-CGG', 'GCA-CTT', 'GCA-CCT', 'ATT-TTG', 'GTG-CCC', 'GTG-CTG', 'ATT-GGT', 'GTG-GCT', 'GCA-GTG']
[0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0]
[2, 1, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1]
[0, 0, 0, 0, 1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [15]:
import numpy as np
import pymc as pm


K = 3 # number of topics                                                                                                                                       
V = len(vocab) # number of words                                                                                                                                        
D = 5 # number of documents                                                                                                                                    

#data = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]])

alpha = np.ones(K)
beta = np.ones(V)

theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])
Wd = [len(doc) for doc in data]

z = pm.Container([pm.Categorical('z_%i' % d,
                     p = theta[d],
                     size=Wd[d],
                     value=np.random.randint(K, size=Wd[d]))
                  for d in range(D)])

# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic                                                                     
w = pm.Container([pm.Categorical("w_%i_%i" % (d,i),
                    p = pm.Lambda('phi_z_%i_%i' % (d,i),
                              lambda z=z[d][i], phi=phi: phi[z]),
                    value=data[d][i],
                    observed=True)
                  for d in range(D) for i in range(Wd[d])])

model = pm.Model([theta, phi, z, w])
#mcmc = pm.MCMC(model)                                                                                                                                         
#mcmc.sample(100)                                                                                                                                              

M = pm.MCMC(model)
M.sample(5000,burn=500)
pm.Matplot.plot(M)


[****************100%******************]  5000 of 5000 complete
Plotting ptheta_0_0
Plotting ptheta_0_1
Plotting theta_4_0_0
Plotting theta_4_0_1
Plotting theta_4_0_2
Plotting ptheta_3_0
Plotting ptheta_3_1
Plotting ptheta_4_0
Plotting ptheta_4_1
Plotting theta_2_0_0
Plotting theta_2_0_1
Plotting theta_2_0_2
Plotting theta_0_0_0
Plotting theta_0_0_1
Plotting theta_0_0_2
Plotting theta_3_0_0
Plotting theta_3_0_1
Plotting theta_3_0_2
Plotting theta_1_0_0
Plotting theta_1_0_1
Plotting theta_1_0_2
Plotting ptheta_1_0
Plotting ptheta_1_1
Plotting ptheta_2_0
Plotting ptheta_2_1
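To go beyond the default trace plots, the per-document topic mixtures can be summarized directly from the trace (a sketch, assuming the MCMC object M and the document count D from the cells above, and PyMC 2's trace interface):

import numpy as np

for d in range(D):
    samples = M.trace('theta_%i' % d)[:]          # posterior samples of theta_d; shape is typically (n_samples, 1, K)
    mean_mixture = samples.mean(axis=0).ravel()   # posterior mean topic mixture for document d
    print('document %i mean topic mixture: %s' % (d, mean_mixture))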

References

  • Blei, D. M. Probabilistic topic models. Communications of the ACM, 2012, 55, 77-84.
  • Blei, D. M.; Ng, A. Y. & Jordan, M. I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003, 3, 993-1022.
  • Ferguson, T. S. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1973, 1, 209-230.
  • Huelsenbeck, J. P.; Jain, S.; Frost, S. W. D. & Pond, S. L. K. A Dirichlet process model for detecting positive selection in protein-coding DNA sequences. PNAS, 2006, 103, 6263-8.
  • Robinson, D. M.; Jones, D. T.; Kishino, H.; Goldman, N. & Thorne, J. L. Protein evolution with dependence among codons due to tertiary structure. Molecular Biology and Evolution, 2003, 20, 1692-1704.
  • Rodrigue, N.; Kleinman, C. L.; Philippe, H. & Lartillot, N. Computational methods for evaluating phylogenetic models of coding sequence evolution with dependence between codons. Molecular Biology and Evolution, 2009, 26, 1663-76.

Ideas and next steps

  • simulation as a tree
  • distances using mutual information
  • can I reconstruct a similar tree?
  • hemoglobin data (N. Rodrigue)
  • select a few applications of LDA in biology for a better intro
  • data augmentation in the simulation example to infer a tree
  • cluster based on gene-wise topics (across columns)
  • quadruplet nature of sequences used to infer phylogenetic trees
