In [34]:
import os
import numpy as np
from vertebratesLib import *
In [39]:
split = "SPLIT1"
summaryTree,summarySpecies,splitPositions = get_split_data(split)
print summaryTree.shape
a sentence of words is represented as the transitions for a given position
In [36]:
def get_sentence(position,splitPositions,summary,ignore=False):
splitIndex = np.where(splitPositions==position)[0]
nonZero = np.where(summary[splitIndex,:] != 0)[1]
sentence = []
for nz in nonZero:
if ignore and TRANSITIONS[nz].count(TRANSITIONS[nz][0]) == 2:
continue
count = int(summary[splitIndex,nz][0])
sentence.extend([TRANSITIONS[nz]] * count)
return sentence
position = '8500'
sentence1 = get_sentence(position,splitPositions,summaryTree,ignore=False)
sentence2 = get_sentence(position,splitPositions,summaryTree,ignore=True)
print("with same AA transition")
print(sentence1)
print("without same AA transition")
print(sentence2)
In [37]:
import lda
## the data matrix are the sentences by vocabulary
vocab = TRANSITIONS
#inPlaceTransitions = []
#for t in TRANSITIONS:
Recall that the Dirichlet Process (DP) (Ferguson, 1973) is essentially a distribution over distributions, where each draw from a DP is itself a distribution and importantly for clustering applications it serves as a natural prior that lets the number of clusters grow as the data grows. The DP has a base distribution parameter $\beta$ and a strength or concentration parameter $\alpha$.
We are still modeling topics. However, documents become sites and words become transitions. Transitions may defined in nucleotide, amino acid or codon space. Perhaps more logically though a document would be all of the sites for a given gene.
For each of the transition positions ($m$,$n$), where $n \in \{1,...N\}$, and $m \in \{1,...M\}$
$\phi$ is a $K*V$ Markov matrix each row of which denotes the transition distribution of a topic.
Here I am borrowing from the package LDA, which uses a collapsed version of Gibbs sampling.
In this example.
topics
top topics
In [38]:
from IPython.display import Image
dataDir = None
for ddir in [os.path.join("..","data","herve-vertebrates"),\
os.path.join("/","media","ganda","mojo","phylogenetic-models","herve-vertebrates")]:
if os.path.isdir(ddir):
dataDir = ddir
split = "SPLIT1"
position = "0"
treeList = get_trees(split,position,dataDir)
countMatrix = np.zeros((len(treeList),len(TRANSITIONS)),)
t = 0
for t,pbTree in enumerate(treeList):
fixedTree,treeSummary = fix_tree(pbTree)
tlist = []
for item in treeSummary.itervalues():
tlist.extend(item['pairs'])
counts = transitions_to_counts(tlist)
countMatrix[t,:] = counts
figName1 = os.path.join("figures","lda-bplot-check.png")
profile_box_plot(countMatrix,figName1,figTitle='position - %s'%position)
Image(filename=figName1)
Out[38]:
In [ ]:
In [ ]: