Borel Numbers, Zipf's Law, and Short Tandem Repeats

Some house keeping first, make sure we have statistics and plotting packages available


In [1]:
using(Gadfly)

We will start by building a random generator of short tandem repeats


In [2]:
function randomrepeats(n,m)
    d = [string(["A", "C", "G", "T"][digits(l,4,m) + 1]...) => 0 for l in 0:(4^m - 1)]
    for i = 0:(n - 1)
        d[string(rand(["A","G","C","T"],m)...)] += 1
    end
    return d
end


Out[2]:
randomrepeats (generic function with 1 method)

This functions works by:

  1. Preallocating a dictionary of every possible sequence of length $m$, filled with zeros.
  2. Creating $n$ random short sequences of lenght $m$.
  3. Using the $n$ short sequences as a key to increment the dictionary slot.

Lets give this bad boy a whirl!


In [6]:
randomrepeats(10,2)


Out[6]:
Dict{ASCIIString,Int64} with 16 entries:
  "CC" => 1
  "GC" => 1
  "GG" => 0
  "CG" => 0
  "AT" => 1
  "CA" => 0
  "TG" => 1
  "TA" => 0
  "GT" => 0
  "GA" => 2
  "TT" => 3
  "AC" => 0
  "CT" => 1
  "AA" => 0
  "AG" => 0
  "TC" => 0

Lets generate a larger sample and generate a histogram


In [8]:
h = randomrepeats(2^14,4)
plot(
    x = cumsum(sort([v for v in values(h)],rev=true)),
    y = sort([v for v in values(h)],rev=true),
    Geom.bar,
    Guide.xlabel("Cumulative samples"),
    Guide.ylabel("Samples")
)


Out[8]:
Cumulative samples -2.5×10⁴ -2.0×10⁴ -1.5×10⁴ -1.0×10⁴ -5.0×10³ 0 5.0×10³ 1.0×10⁴ 1.5×10⁴ 2.0×10⁴ 2.5×10⁴ 3.0×10⁴ 3.5×10⁴ 4.0×10⁴ 4.5×10⁴ -2.0×10⁴ -1.9×10⁴ -1.8×10⁴ -1.7×10⁴ -1.6×10⁴ -1.5×10⁴ -1.4×10⁴ -1.3×10⁴ -1.2×10⁴ -1.1×10⁴ -1.0×10⁴ -9.0×10³ -8.0×10³ -7.0×10³ -6.0×10³ -5.0×10³ -4.0×10³ -3.0×10³ -2.0×10³ -1.0×10³ 0 1.0×10³ 2.0×10³ 3.0×10³ 4.0×10³ 5.0×10³ 6.0×10³ 7.0×10³ 8.0×10³ 9.0×10³ 1.0×10⁴ 1.1×10⁴ 1.2×10⁴ 1.3×10⁴ 1.4×10⁴ 1.5×10⁴ 1.6×10⁴ 1.7×10⁴ 1.8×10⁴ 1.9×10⁴ 2.0×10⁴ 2.1×10⁴ 2.2×10⁴ 2.3×10⁴ 2.4×10⁴ 2.5×10⁴ 2.6×10⁴ 2.7×10⁴ 2.8×10⁴ 2.9×10⁴ 3.0×10⁴ 3.1×10⁴ 3.2×10⁴ 3.3×10⁴ 3.4×10⁴ 3.5×10⁴ 3.6×10⁴ 3.7×10⁴ 3.8×10⁴ 3.9×10⁴ 4.0×10⁴ -2×10⁴ 0 2×10⁴ 4×10⁴ -2.0×10⁴ -1.8×10⁴ -1.6×10⁴ -1.4×10⁴ -1.2×10⁴ -1.0×10⁴ -8.0×10³ -6.0×10³ -4.0×10³ -2.0×10³ 0 2.0×10³ 4.0×10³ 6.0×10³ 8.0×10³ 1.0×10⁴ 1.2×10⁴ 1.4×10⁴ 1.6×10⁴ 1.8×10⁴ 2.0×10⁴ 2.2×10⁴ 2.4×10⁴ 2.6×10⁴ 2.8×10⁴ 3.0×10⁴ 3.2×10⁴ 3.4×10⁴ 3.6×10⁴ 3.8×10⁴ 4.0×10⁴ -150 -100 -50 0 50 100 150 200 250 -100 -95 -90 -85 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 165 170 175 180 185 190 195 200 -100 0 100 200 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 Samples Created with Snap

So what is going on here?

Well consider, the probability that the whole sequence of length $n$ is composed of 'A's

$$ \left(\frac{1}{4}\right)^n $$

Conversely, the probability that a sequence of lenght A excludes 'A's is

$$ \left(\frac{3}{4}\right)^n $$

The distribution of, say, 'A's in an sequence of length $n$, is binomial:

$$ \mathbb{P}[\#A=x] = \binom{n}{x} \left(\frac{1}{4}\right)^x \left(\frac{3}{4}\right)^{n-x} $$

This is of course true for counting, individually, 'C's, 'G's, and 'T's, as well