Borel Numbers, Zipf's Law, and Short Tandem Repeats

Some housekeeping first: make sure the Gadfly statistical plotting package is available.



In [1]:

    
using Gadfly

We will start by building a random generator of short tandem repeats.



In [2]:

    
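function randomrepeats(n,m)
    # Preallocate a zero count for each of the 4^m possible sequences of length m;
    # the base-4 digits of l select the letters of the l-th sequence.
    d = [string(["A", "C", "G", "T"][digits(l,4,m) + 1]...) => 0 for l in 0:(4^m - 1)]
    # Draw n random sequences of length m and increment the count of each one.
    for i = 0:(n - 1)
        d[string(rand(["A","C","G","T"],m)...)] += 1
    end
    return d
end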









    Out[2]:





randomrepeats (generic function with 1 method)

This function works by:

Preallocating a dictionary of every possible sequence of length $m$, filled with zeros.
Creating $n$ random short sequences of length $m$.
Using each of the $n$ short sequences as a key to increment its dictionary slot.
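
The cell above uses the syntax of the Julia version this notebook was run on. As a minimal sketch, assuming a current Julia 1.x release, the same three steps could look roughly like this. The name `randomrepeats_modern` and the `rng` keyword are hypothetical, introduced only for illustration.

```julia
using Random

# Hypothetical Julia 1.x counterpart of `randomrepeats` (not part of the original notebook).
function randomrepeats_modern(n, m; rng = Random.default_rng())
    alphabet = ['A', 'C', 'G', 'T']
    # Step 1: preallocate every possible sequence of length m with a zero count.
    # The base-4 digits of l (padded to m digits) index into the alphabet.
    counts = Dict(
        join(alphabet[digits(l, base = 4, pad = m) .+ 1]) => 0 for l in 0:(4^m - 1)
    )
    # Steps 2 and 3: draw n random sequences and increment the matching slot.
    for _ in 1:n
        counts[join(rand(rng, alphabet, m))] += 1
    end
    return counts
end
```

Whatever the random draws are, the dictionary should hold $4^m$ keys and its values should sum to $n$.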

Let's give this bad boy a whirl!



In [6]:

    
randomrepeats(10,2)









    Out[6]:





Dict{ASCIIString,Int64} with 16 entries:
  "CC" => 1
  "GC" => 1
  "GG" => 0
  "CG" => 0
  "AT" => 1
  "CA" => 0
  "TG" => 1
  "TA" => 0
  "GT" => 0
  "GA" => 2
  "TT" => 3
  "AC" => 0
  "CT" => 1
  "AA" => 0
  "AG" => 0
  "TC" => 0

Let's generate a larger sample and plot a histogram of the counts, sorted from the most to the least frequent sequence.



In [8]:

    
h = randomrepeats(2^14,4)
# Counts per sequence, sorted from most to least frequent (computed once, used for both axes)
sorted = sort(collect(values(h)),rev=true)
plot(
    x = cumsum(sorted),
    y = sorted,
    Geom.bar,
    Guide.xlabel("Cumulative samples"),
    Guide.ylabel("Samples")
)









    Out[8]:

So what is going on here?

Well, consider the probability that a whole sequence of length $n$ is composed entirely of 'A's (here $n$ denotes the sequence length, the role played by `m` in `randomrepeats`):

$$ \left(\frac{1}{4}\right)^n $$

Conversely, the probability that a sequence of length $n$ contains no 'A's at all is

$$ \left(\frac{3}{4}\right)^n $$

The number of 'A's in a sequence of length $n$ is binomially distributed:

$$ \mathbb{P}[\#A=x] = \binom{n}{x} \left(\frac{1}{4}\right)^x \left(\frac{3}{4}\right)^{n-x} $$
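
As a quick consistency check, assuming each base is drawn independently and uniformly as in `randomrepeats`, setting $x = n$ and $x = 0$ in the binomial formula recovers the two special cases above:

$$ \mathbb{P}[\#A=n] = \binom{n}{n} \left(\frac{1}{4}\right)^n \left(\frac{3}{4}\right)^{0} = \left(\frac{1}{4}\right)^n, \qquad \mathbb{P}[\#A=0] = \binom{n}{0} \left(\frac{1}{4}\right)^0 \left(\frac{3}{4}\right)^{n} = \left(\frac{3}{4}\right)^n $$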

This is of course also true when counting 'C's, 'G's, or 'T's individually.
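
The binomial claim can be checked numerically. The following is a minimal sketch, assuming a current Julia 1.x release and independent, uniformly drawn bases. The helper names `binom_pmf` and `empirical_a_counts`, and the fixed seed, are hypothetical choices for illustration.

```julia
using Random

# Binomial probability of exactly x 'A's in a sequence of length n,
# assuming each base is 'A' with probability 1/4 independently of the others.
binom_pmf(n, x) = binomial(n, x) * (1 / 4)^x * (3 / 4)^(n - x)

# Empirical frequency of each possible 'A'-count (0 through n)
# over `trials` random sequences of length n.
function empirical_a_counts(n, trials; rng = MersenneTwister(1))
    freq = zeros(Float64, n + 1)
    for _ in 1:trials
        x = count(c -> c == 'A', rand(rng, ['A', 'C', 'G', 'T'], n))
        freq[x + 1] += 1      # index x + 1 holds the tally for x 'A's
    end
    return freq ./ trials
end

theory = [binom_pmf(4, x) for x in 0:4]
observed = empirical_a_counts(4, 20_000)
```

With enough trials, `observed` should sit close to `theory` entry by entry. By symmetry, counting any other base in place of 'A' should give the same picture.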