LDA on vertebrates

Notes on the data

In this example the tree is contstrained
In this example we have to extract position,transition, and branch.
The total position are broken into N 'splits'
Each split contains 30,000 files (except the last file that has less)
So that means that there are 30,000 * N total positions

Browsing the data



In [34]:

    
import os
import numpy as np
from vertebratesLib import *



In [39]:

    
split = "SPLIT1"
summaryTree,summarySpecies,splitPositions = get_split_data(split)
print summaryTree.shape









    



(30000, 400)

a sentence of words is represented as the transitions for a given position



In [36]:

    
def get_sentence(position,splitPositions,summary,ignore=False):
    splitIndex = np.where(splitPositions==position)[0]
    nonZero = np.where(summary[splitIndex,:] != 0)[1]
    sentence = []
    for nz in nonZero:
        if ignore and TRANSITIONS[nz].count(TRANSITIONS[nz][0]) == 2:
            continue
        
        count = int(summary[splitIndex,nz][0])
        sentence.extend([TRANSITIONS[nz]] * count)
    
    return sentence
        
position = '8500'
sentence1 = get_sentence(position,splitPositions,summaryTree,ignore=False)
sentence2 = get_sentence(position,splitPositions,summaryTree,ignore=True)
print("with same AA transition")
print(sentence1)
print("without same AA transition")
print(sentence2)









    



with same AA transition
['AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AA', 'AL', 'AV', 'FF', 'FF', 'FF', 'FF', 'GG', 'II', 'II', 'LA', 'LA', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LL', 'LV', 'LV', 'MM', 'TT', 'TT', 'TT', 'TT', 'TT', 'VA', 'VA', 'VF', 'VI', 'VL', 'VL', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV', 'VV']
without same AA transition
['AL', 'AV', 'LA', 'LA', 'LV', 'LV', 'VA', 'VA', 'VF', 'VI', 'VL', 'VL']

Simple test run with lda package



In [37]:

    
import lda

## the data matrix are the sentences by vocabulary

vocab = TRANSITIONS

#inPlaceTransitions = []
#for t in TRANSITIONS:

Recall that the Dirichlet Process (DP) (Ferguson, 1973) is essentially a distribution over distributions, where each draw from a DP is itself a distribution and importantly for clustering applications it serves as a natural prior that lets the number of clusters grow as the data grows. The DP has a base distribution parameter $\beta$ and a strength or concentration parameter $\alpha$.

$\alpha$ is a hyperprior for the DP over per-document topic distributions
$\beta$ is the hyperprior for the DP over per-topic word distributions
$\theta_{m}$ is the topic distribution for document $m$
$\phi_{k}$ is the word distribution for topic $k$
$z_{m,n}$ is the topic for the $n$th word in document $m$
$w_{m,n}$ is the specific word

The generative story for phylogenetics

We are still modeling topics. However, documents become sites and words become transitions. Transitions may defined in nucleotide, amino acid or codon space. Perhaps more logically though a document would be all of the sites for a given gene.

$\alpha$ is a hyperprior for the DP over per-site topic distributions
$\beta$ is the hyperprior for the DP over per-topic transition distributions
$\theta_{m}$ is the topic distribution for gene $m$
$\phi_{k}$ is the nucleotide transition distribution for topic $k$
$z_{m,n}$ is the topic for the $n$th nucleotide transition in gene $m$
$w_{m,n}$ is the specific transition

The generative process

Choose $\theta_{m} \sim \textrm{Dir}(\alpha)$, where $m \in \{1,...M\}$ and $\textrm{Dir}(\alpha)$ is the Dirichlet distribtion for $\alpha$
Choose $\phi_{k} \sim \textrm{Dir}(\beta)$, where $k \in \{1,...K\}$
For each of the transition positions ($m$,$n$), where $n \in \{1,...N\}$, and $m \in \{1,...M\}$
- Choose a topic $z_{m,n} \sim \textrm{Multinomial}(\theta_{m})$
- Choose a transition $w_{m,n} \sim \textrm{Multinomial}(\phi_{m,n})$

$\phi$ is a $K*V$ Markov matrix each row of which denotes the transition distribution of a topic.

The type of data to expect

Here I am borrowing from the package LDA, which uses a collapsed version of Gibbs sampling.

In this example.

Positions are documents
First 1000 positions
We consider in place transitions
20 topics
1500 MCMC iterations
7 words in a topic (transitions)

topics

Topic 0: EE ED EQ EK EG DE EA
Topic 1: YY YF FY FF YH YC YS
Topic 2: RR KK RK QQ RQ HH KR
Topic 3: AA AS AT AV SS VA SA
Topic 4: II MM IV IL IM MI ML
Topic 5: SS ST SA SN SP TT SG
Topic 6: WW WL YW WS SW WV WG
Topic 7: KK KR RR KQ RK KE KN
Topic 8: HH HY HQ HR HN QH YH
Topic 9: CC CS SC CY SS CF LC
Topic 10: VV VI IV VA VL II VM
Topic 11: TT TA TS TV TI ST TM
Topic 12: DD DE DN ED EE DG ND
Topic 13: QQ QK QE QH QR QL QP
Topic 14: FF FL LF FY FI FV FC
Topic 15: PP PS SP PA PL PT PQ
Topic 16: NN NS SS ND NK SN NT
Topic 17: LL LI LV LM LF MM LQ
Topic 18: RR RK RQ RT RW RY RV
Topic 19: GG GS GA GE GN SG GD

Where to go

whole tree summary
species specific tree summary
incorporate gene boundries
parameter search to optimize num. of words per topic and num. topics
add distance calculator (KL-divergence) to get distance between any two topic distributions
port to MALLET (U. Mass) or another tuned library



In [ ]:

References

Blei, D. Probabilistic topic models Communications of the ACM, 2012, 55, 77-84
Blei, D. M.; Ng, A. Y. & Jordan, M. I. Latent Dirichlet Allocation Journal of Machine Learning Research, 2003, 3, 993-1022
Ferguson, T. S. A Bayesian Analysis of Some Nonparametric Problems The Annals of Statistics, 1973, 1, 209-230
Huelsenbeck, J. P.; Jain, S.; Frost, S. W. D. & Pond, S. L. K. A Dirichlet process model for detecting positive selection in protein-coding DNA sequences PNAS, 2006, 103, 6263-8
Robinson, D. M.; Jones, D. T.; Kishino, H.; Goldman, N. & Thorne, J. L. Protein evolution with dependence among codons due to tertiary structure Mol. Biol. Evol., 2003, 20, 1692-1704
Rodrigue, N.; Kleinman, C. L.; Philippe, H. & Lartillot, N. Computational methods for evaluating phylogenetic models of coding sequence evolution with dependence between codons. Molecular biology and evolution, 2009, 26, 1663-76



In [ ]: