Some house keeping first, make sure we have statistics and plotting packages available
In [1]:
using(Gadfly)
We will start by building a random generator of short tandem repeats
In [2]:
function randomrepeats(n,m)
d = [string(["A", "C", "G", "T"][digits(l,4,m) + 1]...) => 0 for l in 0:(4^m - 1)]
for i = 0:(n - 1)
d[string(rand(["A","G","C","T"],m)...)] += 1
end
return d
end
Out[2]:
This functions works by:
Lets give this bad boy a whirl!
In [6]:
randomrepeats(10,2)
Out[6]:
Lets generate a larger sample and generate a histogram
In [8]:
h = randomrepeats(2^14,4)
plot(
x = cumsum(sort([v for v in values(h)],rev=true)),
y = sort([v for v in values(h)],rev=true),
Geom.bar,
Guide.xlabel("Cumulative samples"),
Guide.ylabel("Samples")
)
Out[8]:
So what is going on here?
Well consider, the probability that the whole sequence of length $n$ is composed of 'A's
Conversely, the probability that a sequence of lenght A excludes 'A's is
The distribution of, say, 'A's in an sequence of length $n$, is binomial:
This is of course true for counting, individually, 'C's, 'G's, and 'T's, as well