Rosalind is a platform for learning bioinformatics and programming through problem solving.
A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.
An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G' and 'T') is "ATGCTTCAGAAAGGTCTTACG".
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
20 12 17 21
In [2]:
# data = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
f = open("/Users/markustadej/Projects/rosalind/data/rosalind_dna.txt")
data = readstring(f)
close(f)
nt = Dict{Char, Int64}()
for n in data
nt[n] = get(nt, n, 0) + 1
end
println(nt['A'], ' ', nt['C'], ' ', nt['G'], ' ', nt['T'])
An RNA string is a string formed from the alphabet containing 'A', 'C', 'G' and 'U'.
Given a DNA string $t$ corresponding to a coding strand, its transcribed RNA string $u$ is formed by replacing all occurrences of 'T' in $t$ with 'U' in $u$.
GATGGAACTTGACTACGTAAATT
GAUGGAACUUGACUACGUAAAUU
In [10]:
data = "GATGGAACTTGACTACGTAAATT"
println(replace(data, 'T', 'U'))
In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.
The reverse complement of a DNA string $s$ is the string $s^c$ formed by reversing the symbols of $s$, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").
AAAACCCGGT
ACCGGGTTTT
In [15]:
data = "AAAACCCGGT"
# Solution 1
function compl(c::Char)
c == 'A' && return 'T'
c == 'C' && return 'G'
c == 'G' && return 'C'
c == 'T' && return 'A'
end
println(reverse(map(compl, data)))
# Solution 2
d = Dict("A" => 'T', "C" => 'G', "G" => 'C', "T" => 'A')
println(reverse(replace(data, r"[ACGT]{1}", x -> d[x])))
A sequence is an ordered collection of objects (usually numbers), which are allowed to repeat. Sequences can be finite or infinite. Two examples are the finite sequence $(\pi,-\sqrt{2},0,\pi)$ and the infinite sequence of odd numbers $(1,3,5,7,9,…)$. We use the notation $a_n$ to represent the $n$-th term of a sequence.
A recurrence relation is a way of defining the terms of a sequence with respect to the values of previous terms. In the case of Fibonacci's rabbits from the introduction, any given month will contain the rabbits that were alive the previous month, plus any new offspring. A key observation is that the number of offspring in any month is equal to the number of rabbits that were alive two months prior. As a result, if $F_n$ represents the number of rabbit pairs alive after the $n$-th month, then we obtain the Fibonacci sequence having terms $F_n$ that are defined by the recurrence relation $F_n = F_{n-1} + F_{n-2}$ (with $F_1 = F_2 = 1$ to initiate the sequence). Although the sequence bears Fibonacci's name, it was known to Indian mathematicians over two millennia ago.
When finding the $n$-th term of a sequence defined by a recurrence relation, we can simply use the recurrence relation to generate terms of progressively larger values of $n$. This problem introduces us to the computational tehnique dynamic programming, which successively builds up solutions by using the answers to smaller cases.
5 3
19
In [21]:
function fib(c, d)
a = 1
# b = BigInt(1)
b = 1
for i in 3:c
a, b = b, d*a + b
end
return b
end
# fib(n, k) = n < 2 ? n : fib(n - 1) + k*fib(n - 2)
# fib(100, 4)
function test()
for n in 1:10000000
fib(30, 3)
end
end
@time test()
The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.
In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
Rosalind_0808
60.919540
We say that a number $x$ is within an absolute error of $y$ to a correct solution if $x$ is within $y$ of the correct solution. For example, if an exact solution is 6.157892, then for $x$ to be within an absolute error of 0.001, we must have that $\lvert x - 6.157892 \rvert < 0.001$, or $6.156892 < x < 6.158892$.
Error bounding is a bital practical tool because of the inherent round-off error in representing decimals in a computer, where only a finite number of decimal places are allotted to any number. After being compounded over a number of operations, this round-off error can become evident. As a result, rather than testing whether two numbers are equal with $x = z$, you may wish to simply verify that $\lvert x - z \rvert$ is very small.
The mathematical field of numerical analysis is devoted to rigorously studying the nature of computational approximation.
In [104]:
f = open("/Users/markustadej/Projects/rosalind/data/rosalind_gc.txt")
seqs = Array{String, 1}()
names = Array{String, 1}()
for line in eachline(f)
if line[1] == '>'
push!(names, line[2:end])
else
push!(seqs, line)
end
end
close(f)
println(names)
println(seqs)
In [ ]:
In [ ]: