Borel Numbers, Zipf's Law, and Short Tandem Repeats

Some house keeping first, make sure we have statistics and plotting packages available


In [1]:
using(Gadfly)

We will start by building a random generator of short tandem repeats


In [2]:
function randomrepeats(n,m)
    d = [string(["A", "C", "G", "T"][digits(l,4,m) + 1]...) => 0 for l in 0:(4^m - 1)]
    for i = 0:(n - 1)
        d[string(rand(["A","G","C","T"],m)...)] += 1
    end
    return d
end


Out[2]:
randomrepeats (generic function with 1 method)

This functions works by:

  1. Preallocating a dictionary of every possible sequence of length $m$, filled with zeros.
  2. Creating $n$ random short sequences of lenght $m$.
  3. Using the $n$ short sequences as a key to increment the dictionary slot.

Lets give this bad boy a whirl!


In [6]:
randomrepeats(10,2)


Out[6]:
Dict{ASCIIString,Int64} with 16 entries:
  "CC" => 1
  "GC" => 1
  "GG" => 0
  "CG" => 0
  "AT" => 1
  "CA" => 0
  "TG" => 1
  "TA" => 0
  "GT" => 0
  "GA" => 2
  "TT" => 3
  "AC" => 0
  "CT" => 1
  "AA" => 0
  "AG" => 0
  "TC" => 0

Lets generate a larger sample and generate a histogram


In [8]:
h = randomrepeats(2^14,4)
plot(
    x = cumsum(sort([v for v in values(h)],rev=true)),
    y = sort([v for v in values(h)],rev=true),
    Geom.bar,
    Guide.xlabel("Cumulative samples"),
    Guide.ylabel("Samples")
)


Out[8]:
Cumulative samples -2.5×10⁴ -2.0×10⁴ -1.5×10⁴ -1.0×10⁴ -5.0×10³ 0 5.0×10³ 1.0×10⁴ 1.5×10⁴ 2.0×10⁴ 2.5×10⁴ 3.0×10⁴ 3.5×10⁴ 4.0×10⁴ 4.5×10⁴ -2.0×10⁴ -1.9×10⁴ -1.8×10⁴ -1.7×10⁴ -1.6×10⁴ -1.5×10⁴ -1.4×10⁴ -1.3×10⁴ -1.2×10⁴ -1.1×10⁴ -1.0×10⁴ -9.0×10³ -8.0×10³ -7.0×10³ -6.0×10³ -5.0×10³ -4.0×10³ -3.0×10³ -2.0×10³ -1.0×10³ 0 1.0×10³ 2.0×10³ 3.0×10³ 4.0×10³ 5.0×10³ 6.0×10³ 7.0×10³ 8.0×10³ 9.0×10³ 1.0×10⁴ 1.1×10⁴ 1.2×10⁴ 1.3×10⁴ 1.4×10⁴ 1.5×10⁴ 1.6×10⁴ 1.7×10⁴ 1.8×10⁴ 1.9×10⁴ 2.0×10⁴ 2.1×10⁴ 2.2×10⁴ 2.3×10⁴ 2.4×10⁴ 2.5×10⁴ 2.6×10⁴ 2.7×10⁴ 2.8×10⁴ 2.9×10⁴ 3.0×10⁴ 3.1×10⁴ 3.2×10⁴ 3.3×10⁴ 3.4×10⁴ 3.5×10⁴ 3.6×10⁴ 3.7×10⁴ 3.8×10⁴ 3.9×10⁴ 4.0×10⁴ -2×10⁴ 0 2×10⁴ 4×10⁴ -2.0×10⁴ -1.8×10⁴ -1.6×10⁴ -1.4×10⁴ -1.2×10⁴ -1.0×10⁴ -8.0×10³ -6.0×10³ -4.0×10³ -2.0×10³ 0 2.0×10³ 4.0×10³ 6.0×10³ 8.0×10³ 1.0×10⁴ 1.2×10⁴ 1.4×10⁴ 1.6×10⁴ 1.8×10⁴ 2.0×10⁴ 2.2×10⁴ 2.4×10⁴ 2.6×10⁴ 2.8×10⁴ 3.0×10⁴ 3.2×10⁴ 3.4×10⁴ 3.6×10⁴ 3.8×10⁴ 4.0×10⁴ -150 -100 -50 0 50 100 150 200 250 -100 -95 -90 -85 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 165 170 175 180 185 190 195 200 -100 0 100 200 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 Samples

So what is going on here?

Well consider, the probability that the whole sequence of length $n$ is composed of 'A's

$$ \left(\frac{1}{4}\right)^n $$

Conversely, the probability that a sequence of lenght A excludes 'A's is

$$ \left(\frac{3}{4}\right)^n $$

The distribution of, say, 'A's in an sequence of length $n$, is binomial:

$$ \mathbb{P}[\#A=x] = \binom{n}{x} \left(\frac{1}{4}\right)^x \left(\frac{3}{4}\right)^{n-x} $$

This is of course true for counting, individually, 'C's, 'G's, and 'T's, as well