In [1]:
import numpy as np
In [2]:
pi_string = """\
AAAA AAAC AAAG AAAT AACA AACC AACG AACT AAGA AAGC AAGG AAGT AATA AATC AATG AATT
ACAA ACAC ACAG ACAT ACCA ACCC ACCG ACCT ACGA ACGC ACGG ACGT ACTA ACTC ACTG ACTT
AGAA AGAC AGAG AGAT AGCA AGCC AGCG AGCT AGGA AGGC AGGG AGGT AGTA AGTC AGTG AGTT
ATAA ATAC ATAG ATAT ATCA ATCC ATCG ATCT ATGA ATGC ATGG ATGT ATTA ATTC ATTG ATTT
CAAA CAAC CAAG CAAT CACA CACC CACG CACT CAGA CAGC CAGG CAGT CATA CATC CATG CATT
CCAA CCAC CCAG CCAT CCCA CCCC CCCG CCCT CCGA CCGC CCGG CCGT CCTA CCTC CCTG CCTT
CGAA CGAC CGAG CGAT CGCA CGCC CGCG CGCT CGGA CGGC CGGG CGGT CGTA CGTC CGTG CGTT
CTAA CTAC CTAG CTAT CTCA CTCC CTCG CTCT CTGA CTGC CTGG CTGT CTTA CTTC CTTG CTTT
GAAA GAAC GAAG GAAT GACA GACC GACG GACT GAGA GAGC GAGG GAGT GATA GATC GATG GATT
GCAA GCAC GCAG GCAT GCCA GCCC GCCG GCCT GCGA GCGC GCGG GCGT GCTA GCTC GCTG GCTT
GGAA GGAC GGAG GGAT GGCA GGCC GGCG GGCT GGGA GGGC GGGG GGGT GGTA GGTC GGTG GGTT
GTAA GTAC GTAG GTAT GTCA GTCC GTCG GTCT GTGA GTGC GTGG GTGT GTTA GTTC GTTG GTTT
TAAA TAAC TAAG TAAT TACA TACC TACG TACT TAGA TAGC TAGG TAGT TATA TATC TATG TATT
TCAA TCAC TCAG TCAT TCCA TCCC TCCG TCCT TCGA TCGC TCGG TCGT TCTA TCTC TCTG TCTT
TGAA TGAC TGAG TGAT TGCA TGCC TGCG TGCT TGGA TGGC TGGG TGGT TGTA TGTC TGTG TGTT
TTAA TTAC TTAG TTAT TTCA TTCC TTCG TTCT TTGA TTGC TTGG TTGT TTTA TTTC TTTG TTTT
"""
## print as string
print pi_string
## also save as indexable array
pi_arr = np.array(pi_string.split()).reshape(16,16)
print pi_arr[:4,:4]
In [3]:
seqalign = np.array([
list("CATCATCATCAT")*10,
list("CATCATGATCAT")*10,
list("GATGATGATCCC")*10,
list("GATGATGATCCC")*10])
print seqalign[:, :10], '...'
We convert it to integers because they are easier to deal with than characters. The 'view' argument is used here to convert chars to 8bit integers. I then reassign the ints as 0, 1, 2, or 3, which can be used to quickly fill the the count array using a set of rules designated below.
In [4]:
## print as char
print seqalign[:, :10]
## get a copy with dtype=int8
seqint = np.copy(seqalign.view(np.int8))
print seqint[:, :10]
## convert ints to indexes of matrix
seqint[seqint == 65] = 0 ## As
seqint[seqint == 67] = 1 ## Cs
seqint[seqint == 71] = 2 ## Ts
seqint[seqint == 84] = 3 ## Gs
print seqint[:, :10]
This records the number of times each of the patterns in this array are observed. We will also fill two "alternate arrays", representing the site patterns observed for other two possible resolutions of our quartet. This first array assumes taxa are ordered as 12|34. We can fill the two alternates (13|24 and 14|23) from the patterns observed here.
In [5]:
## create a 16x16 array of zeros
invcounts = np.zeros((16,16), dtype=int)
## fill the array according to a set of rules
for idx in xrange(seqalign.shape[1]):
i = seqint[:, idx]
invcounts[(4*i[0])+i[1], (4*i[2])+i[3]] += 1
print invcounts
In [6]:
## Let's create three arrays
mats_ints = np.zeros((3, 16, 16), dtype=int)
mats_ints[0] = invcounts
## Fill the alternates
x = 0
for i in xrange(0, 16, 4):
for j in xrange(0, 16, 4):
mats_ints[1, i:i+4, j:j+4] = mats_ints[0, x].reshape(4, 4)
mats_ints[2, i:i+4, j:j+4] = mats_ints[0, x].reshape(4, 4).T
x += 1
print mats_ints[:, :, :]
In [7]:
## What does it look like as character strings
mats_strs = np.zeros((3, 16, 16), dtype="S4")
mats_strs[0] = pi_arr
## fill alternates
x = 0
for i in xrange(0, 16, 4):
for j in xrange(0, 16, 4):
mats_strs[1, i:i+4, j:j+4] = mats_strs[0, x].reshape(4, 4)
mats_strs[2, i:i+4, j:j+4] = mats_strs[0, x].reshape(4, 4).T
x += 1
## print just first 10 columns so that it's more readable
print mats_strs[:, :, :10]
In [8]:
### Well, first we can just calculate the matrix rank
### b/c if they have different rank then we just choose
### the lowest one
for mat in xrange(3):
print "mat={}, rank={}".format(mat, np.linalg.matrix_rank(mats_ints[mat, :, :]))
We didn't need it because there were so many empty columns/rows. However, most of the time with real data, and especially large data sets, the matrices will have little or no empty columns, and so the matrices will all be of equal rank. In that case we need to use SVD to calculate which invariants approach most closely to zero from the SVD scores.
Here we calculate the squared sum of SVD scores above matrix rank 4, the minimum rank observed in the data set... (not sure why this is, exactly).
In [9]:
for idx in xrange(3):
## if all
np.linalg.matrix_rank(mats_ints[idx])
## calculate SVD
scores = np.linalg.svd(mats_ints[idx])[1]
## sum square of X smallest scores
print "mat={}, rank={}, sum={}"\
.format(idx,
np.linalg.matrix_rank(mats_ints[idx]),
np.sqrt(np.sum(scores[4:]**2)))