ROSALIND

Rosalind is a platform for learning bioinformatics and programming through problem solving.

Counting DNA Nucleotides

Problem

A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G' and 'T') is "ATGCTTCAGAAAGGTCTTACG".

  • Given: A DNA string $s$ of length at most 1000 nt
  • Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G' and 'T' occur in $s$

Sample Dataset

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

Sample Output

20 12 17 21

In [2]:
# data = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"

f = open("/Users/markustadej/Projects/rosalind/data/rosalind_dna.txt")
data = readstring(f)
close(f)

nt = Dict{Char, Int64}()

for n in data
    nt[n] = get(nt, n, 0) + 1
end

println(nt['A'], ' ', nt['C'], ' ', nt['G'], ' ', nt['T'])


256 243 234 255

Transcribing DNA into RNA

Problem

An RNA string is a string formed from the alphabet containing 'A', 'C', 'G' and 'U'.

Given a DNA string $t$ corresponding to a coding strand, its transcribed RNA string $u$ is formed by replacing all occurrences of 'T' in $t$ with 'U' in $u$.

  • Given: A DNA string $t$ having length at most 1000 nt
  • Return: The transcribed RNA string of $t$

Sample Dataset

GATGGAACTTGACTACGTAAATT

Sample Output

GAUGGAACUUGACUACGUAAAUU

In [10]:
data = "GATGGAACTTGACTACGTAAATT"

println(replace(data, 'T', 'U'))


GAUGGAACUUGACUACGUAAAUU

Complementing a Strand of DNA

Problem

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string $s$ is the string $s^c$ formed by reversing the symbols of $s$, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

  • Given: A DNA string $s$ of length at most 1000 bp
  • Return: The reverse complement $s^c$ of $s$

Sample Dataset

AAAACCCGGT

Sample Output

ACCGGGTTTT

In [15]:
data = "AAAACCCGGT"

# Solution 1
function compl(c::Char)
    c == 'A' && return 'T'
    c == 'C' && return 'G'
    c == 'G' && return 'C'
    c == 'T' && return 'A'
end

println(reverse(map(compl, data)))

# Solution 2
d = Dict("A" => 'T', "C" => 'G', "G" => 'C', "T" => 'A')
println(reverse(replace(data, r"[ACGT]{1}", x -> d[x])))


ACCGGGTTTT
ACCGGGTTTT

Rabbits and Recurrence Relations

Problem

A sequence is an ordered collection of objects (usually numbers), which are allowed to repeat. Sequences can be finite or infinite. Two examples are the finite sequence $(\pi,-\sqrt{2},0,\pi)$ and the infinite sequence of odd numbers $(1,3,5,7,9,…)$. We use the notation $a_n$ to represent the $n$-th term of a sequence.

A recurrence relation is a way of defining the terms of a sequence with respect to the values of previous terms. In the case of Fibonacci's rabbits from the introduction, any given month will contain the rabbits that were alive the previous month, plus any new offspring. A key observation is that the number of offspring in any month is equal to the number of rabbits that were alive two months prior. As a result, if $F_n$ represents the number of rabbit pairs alive after the $n$-th month, then we obtain the Fibonacci sequence having terms $F_n$ that are defined by the recurrence relation $F_n = F_{n-1} + F_{n-2}$ (with $F_1 = F_2 = 1$ to initiate the sequence). Although the sequence bears Fibonacci's name, it was known to Indian mathematicians over two millennia ago.

When finding the $n$-th term of a sequence defined by a recurrence relation, we can simply use the recurrence relation to generate terms of progressively larger values of $n$. This problem introduces us to the computational tehnique dynamic programming, which successively builds up solutions by using the answers to smaller cases.

  • Given: Positive integers $n ≤ 40$ and $k ≤ 5$.
  • Return: The total number of rabbit pairs that will be present after $n$ months, if we begin with 1 pair and in each generation, every pair of reproduction-age rabbits produces a litter of $k$ rabbit pairs (instead of only 1 pair).

Sample Dataset

5 3

Sample Output

19

In [21]:
function fib(c, d)
    a = 1
#     b = BigInt(1)
    b = 1
    for i in 3:c
        a, b = b, d*a + b
    end
    return b
end

# fib(n, k) = n < 2 ? n : fib(n - 1) + k*fib(n - 2)
# fib(100, 4)

function test()
    for n in 1:10000000
        fib(30, 3)
    end
end

@time test()


  0.256356 seconds (2.17 k allocations: 118.699 KiB)

Computing GC Content

Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

  • Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
  • Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

Sample Dataset

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output

Rosalind_0808
60.919540
Note on Absolute Error

We say that a number $x$ is within an absolute error of $y$ to a correct solution if $x$ is within $y$ of the correct solution. For example, if an exact solution is 6.157892, then for $x$ to be within an absolute error of 0.001, we must have that $\lvert x - 6.157892 \rvert < 0.001$, or $6.156892 < x < 6.158892$.

Error bounding is a bital practical tool because of the inherent round-off error in representing decimals in a computer, where only a finite number of decimal places are allotted to any number. After being compounded over a number of operations, this round-off error can become evident. As a result, rather than testing whether two numbers are equal with $x = z$, you may wish to simply verify that $\lvert x - z \rvert$ is very small.

The mathematical field of numerical analysis is devoted to rigorously studying the nature of computational approximation.


In [104]:
f = open("/Users/markustadej/Projects/rosalind/data/rosalind_gc.txt")

seqs = Array{String, 1}()
names = Array{String, 1}()


for line in eachline(f)
    if line[1] == '>'
        push!(names, line[2:end])
    else
        push!(seqs, line)
    end
end

close(f)


println(names)
println(seqs)


String["Rosalind_5899", "Rosalind_3908", "Rosalind_0134", "Rosalind_2372", "Rosalind_3941", "Rosalind_5679"]
String["AGGCCACCACTGGTGAAAACAGGATTCGAGCGCTGTTTGCAGCCCGTACACTACCGTTGA", "CCCCGCACTTGCGTCTCGTCTGTATCAGGACACTCTATGCATCCTCGCGGAATTAGCTGC", "TTGAGGAGGCTTCTAAAAGTGCTACGCCTTATTAGTGCCGTCCTCTTGGCTTTGTGGAGG", "CCATGGTCGTTACCGAAAATGCCCCGTGTTCAACCCAGTACTTCGAAACGACGTTGACAA", "AATCCACAATAGAAATCACCTACTAACATTTCTGCCCATTTCTATGTTTCTGCAGTGAAC", "TTGATATGATCGGGATGTACTCATGCGGTGCCGTAACAAGAGATATGCGTAAGAACCTAC", "GGCTCGGACGTCATGCTCGAGGAGAGCTTCATCGCGTCGCAGACTCTGTATCTTAGGAAG", "TTAGGATCCTTACAAAACAGGTCGCTAGCGCTGTGAGAGCAGCATTCAAAACCCGCCGGA", "CGAATGAGTACAGTCGGGCGTTGGATCTGTTTTCAGGGAAAAGTGCCTAAGAGGTCTTAT", "CCTCGTGAATCTGTTAGCAGTTTCGGGAACATCTACACGAACGCAGATACAATAACCCTG", "ACACTTTAGCTAGATGGGAAGTACAACATCGCCGGTAGTAGGGAGACGACCACACGACAA", "TAATGTGAGACGCGTAGCCAGGACTCTTTTCACCTCACTTCGTTGTTTCAGGCCTTGAAA", "GGCGTGTAGAATGTCTTGGCTTGGAGGAAAGATGCCCCAAGGAATTTGCTTCAATGCGAT", "GCTTGTAGTGCGGATCGTTATTTAGACCCAACTTTCTCAGTTCAAACGATCTATGAGAGA", "TGTGTAAACTTTACGGGTCCAGCCTGTGGGCCCTTTATTGCCCTTTAAGAGAAATAATAT", "CCG", "TTTTTTTGAACAGCTCCTGATATAGGTACACTTGGATCCGCACTGTGCAGACATTTGCGC", "GTACCAGCCGCATGGCAGATTTGGGGTCCAGTATACTGATGCATAGCTAAAAAGCTATCG", "GAATGACCTGGTGGCTCACTGAGTATCTGGGGGTCTAAACGTGCAGATATGCACGTACGT", "GGCGAAGAGCGTTTCACCCTAACTTTTCCTTTCGCACGTCCATTGAGAGCCGGCGCTAGA", "AAGGGGCCGAAGGCGCGAGTGGAAACAACGGGATTTAGAACCGCAGTTAGAAAAGGAGAA", "CGCAAGCACCTGCCAGTGTCCGGTTGCGTCAAGTGGCCTGCGTTGTTCTAACGCGACCTC", "CCTAATGAACTTATACAGCTGGAGCAATGGACAATTGGTTACGTCCTGTGGTCTCGCCGC", "GGAGATCTATTGTTGCGTCGCGCTCCAGGGCTAGTGTAGCGCGATTTATGTGTTTAGCCG", "ATCAGTCAGCTTAAGCAGGCACTCATTCTTAGAATCGCTCCTCGATCCACGAATGGGTTC", "AATACATGCACCGAATCACTAGCCTAACATAGAGAGTCGAATTCTGCTTCGGCCACGGGA", "AATGCTTTGAGGTTACATCCATACGGTAGATCAGTTATTCGTACTGATTTTGGACCACCA", "AATATTGGACGCAAAATGCTGGATTCAAAGCATGAGTACGAATTAGGGCTAGGTCAAACG", "ACAGCTTCAATGAATAGTAAGTTCGTCTAGAGCAACATATTATGTCCCAGCGTGTGAAGG", "TTAACCCAAGCGGCGAAGCTATGACCGTCAAAGTCGGTGATGGAGCCTGAAGCGCCGAGG", "GGTGGTCGCGTCAACATGATCGGAGGTTTCGGTTGAGCGGAGCGAGGATAACGCCCGTAT", "GGGAAGGAGTGAAGCGTGACTCGAACGAGCAACGTCTGGGAGGGCCGGGTTTTTCCTTGC", "GT", "CTTAATGTGTAAAATTTTCCACATGCCTCTGCAGCCGGGTCCTGCGGCAGTTCTTCGCTA", "TACCAAGCTTGACCGCGAGCAACCCTGGTGTCCCGAATGACGAAAAGTCATCCTCCCTCA", "GATTCCTGATGACGGTTACCCACATAAAATGGGCTCCATAGAAGAACATGGAAAAGATGG", "ATTATTGACCTGATGGTGACTCACCTCGCCAGTTATTCGGGTGAGGAATGAACGGGCATG", "GCGTGTCCCGTTATGTACAACGCGAGTACCGGGTTCTACTCCTTAACGGTTGAGCTGGGC", "AGAATTATAACCGATCAGTCAATGACGCGGGCCTCACAGCGTGGCAAACTAGGGGTGGCG", "TGACAGTCCTAACGAACCGTGGCTCGGAATCCTATCAGGGTTCTTGCCGGCCAGAATAAG", "AGCTGCACGAGAGTACGACCAATCAACACTCAGATACCCCTCTTGCAGTGTAAGCGAGGT", "CTACATCCGCCCAACTCCATCCACGCAGTAGGGGGCATTTGCCGACCTGAGGCTTGTGAG", "TCAAACACGAGAGAGACAATTGTGGAGGTGTCCGGGGAAGTTCTACTCAGTACCTGCATC", "ATCCTCCGCAGCTCTTTGAGGTTGAGGCTGTCCCAATGGCCAACGTCTCTTGGTATTGCC", "AGATGTATTTGACCGGTGGGGATTGTGCAATAAACAGTGCTTTATTGAAAGCTCGAAGTA", "TAGTATGAAATCCCGCGGTGTGCCTTCGTGATTGTCGCCGGTATACATCACTACACGACA", "CCCAGTGATCATCGTGTTGGCGCTAGGTAACGTATTGAGCCCAAGCTGTTGATTCCGTTA", "TTTACTCCA", "TCAATGCCAACAGGCGGGCATTCTTCCCCTGTTATATCAGCTGTCATCTCTTCAGATGCA", "TAATGGCTACTGCAGTATATTCGTCCTGCTTGCGGCCCGGTTTGGTGCTGAGGTGGTGCA", "GACACACGTAGAACCACTAATGCCTGCACGGCAGAATTAAGTTCACGGCTTCTGTGTCTG", "ATGATGGCGCCACCGGGTCTAATGATGCCAAGTGAGTCCTGCTGGTAGTGCTAAGCGCCG", "GGACGGCAGTAAATCACATGTAAAGTTCCGAATGGGAAGACCATTGAGCCAATGAAACTG", "CTCGATGAGGACTATTGCACCCCCTCTACATCCATCACCCCGTTGCAATGGGAGAAAGTC", "GGGTCTATCGACGCGACTAGCGAAGGCGATCTCGATCACTCGGCGTGCTATCAGTCATTT", "ATCTGAGTGAACACAACAGCTTTATGAGGAAACAAAACAACATCTAGTCAATGACTGTCC", "CGGCTAGTGCTAAGAAAGAGGTGCTTGTCTTGCGGCATCGTGACAGCGGGGAGATCAGTT", "ACCAGAATCAGACAAATACGAGGGGGTGCCGTACGGAAGATTAGTTTACGTTGGGTGGTT", "TCGGATATGCCGGTAAACAGAAGGCGTCGCCAAGGCTAGGTACTCAGTTCCCTTTTATTC", "GACAGTGGGTCGGTCTTCGTCAAAATGACTGGTCCCGGAGTGAGCGGAGAAAAGTCTCTG", "CTACTATCGTTACGCCTCAGTAGAGTAGTCTGCTGTCACAATGAAGATTCTTCCAATTCC", "ATCGTCACCACCAAATCGTCGAGCAGCGTGTACGCC", "CCCCGCGTAAGGGATGAGCTTGTGATAGAAGTGAAATGGTCTGCGCGTCGGTATAATGAC", "TCATTCGTATGACCAACATAGCCTGTGCCACCCTTGCCGGTCTAAACACCATGCGCATTG", "TGATTCTAGCTGCCAATAGGCCTCATATGACCTAGGCGATATTTTGTACGCGGCTTGATT", "AAACTCAATGATAGCGGTTAAGTAAGACGACTTAGGAATGCGCCTCACTACGAATGAGTA", "CCTGGGATGTCCCCCAAACGTGCGCCACAGAGATTTCTAGCAAACTTGCTCGAAAGATGC", "TCAAAAGTTCGCTAGCTCTGGAACTAGCCGTCTAGCTTTCTTATAGCAGCACACCTAACG", "ACGATGGCTACACAGGGTTCCGGCCCACTGCGTCCCGTGACTAGGCCTCACTGGGGCCCG", "GAAACGCGCCTGAGCTCGACTTAAGATCAAGAGGACATCCCGTACGTGTCTCTGTGTTAG", "GCTCGTGGGCTCACAACAAGCCAAGGCCGCACAGATTTGCAGCATGTGTTTGGCCACCAT", "TCCGTGGAGCTACCATGTCCAAGGCCCGCCCCGTTGCCACCTCCGTTCGGTTTGAATCAG", "CTTTTAGGTGATGGTGGTTGGTGCCTGAAGCAGCGGCTGCTTGCTGTTTGTATAAGTCAA", "GCCCTAAGAGAACCATGGCACTTCTTTCCGTCACACTGTCCGTGGGTGCGGGAGGTACCT", "CGCGAGTCAGCAGGTATTGCTACATACTGGCAAAGCCGTAGCGATTCGGGATATGTCTAG", "TGCGGTACCTAAACTGCTAGAATCCAGCTTTGCCTTTACTCCTGACGCCGAGGGCGAACT", "ACGAGCGTATGAAGGCGGCACTACATGTCCAGATCTTTAAGTTAACACGATACCTTCTCA", "CGAGTATTTGCGTCTGGAAATCCACGCCATCAGCGTAAAGGCCATGGGCTGTATGGGTGT", "GGCGTACAAGACTACTCCGG", "CTTCTTAGACTCCCGGGATTTTTCGGCTATCGTGGGTCTCGTATGGCGCGGCATCGTCAA", "TTTATTAGGTTCGAAGACGCTGAGCCATTTAGCGGTAGCATTTGGCGCTGCTATGCGTAA", "CTTAGCCGTCTTCTTCTTACACCTCTTTGAATCAACGTTGCTGAGCGCCAGAGCTTAATT", "CCATGCATTGCGATCGAGCGCGTTGATTGGGAACATAAGAAGTTAGGGAAAAACCTCTGG", "ACTGAGACTTTAATCCGCCAGATCCTGTCCTCTTGGTAGGCTGCCAACATCACCCATGAT", "AATTCCACTGTTTATGGGGTATAAAGCTATAGTTCTCCGGATTTAACAACCCGCACTGGT", "CAGATAGTTCCCCAGCGGCCTTTGTCGGGTATCCGGACTTACATTGCGCCGCAGCACTCA", "AGAAAAACTTCTACGCGAGCCTAGGAGGTCACATACGGATCATAGTCCAATCTAGTTCCA", "ATCGCACAAAATAGTATACGTTATAGTCGTAGGCTCCGTAACCTCCGGGAGGCCAAAAAA", "GCTCTGACCATTAAATTCCGTCTGTCGTCTGGTATTCGTGCGACGCTTCTCCGTGGTATT", "GCAGTGATGCCGTTCGTCATAGGGTATACCGGCTAGCTTCACAAACTCCCATTCCTTAGT", "TTTACCCCGCCGAAGGCCCAATAGGCGGTATGGGACGGTAGATCCTAGATCTTGTCCCAC", "ATACGGGCATACTTGGCTGTTACTGACCCGTCGTTCAAAATTCGCACCGGGGAGTGGCTT", "TCGGAGACAAGGGGTGATATCTGATGCGCGCTACACAAGAGGTTTTAAGGTAACTCTCGG", "GTCGAACCGGAACGCGCCAGGTGTTTGCTCTCAGGACTCAAACGAGTTTCGTAAACCGTC", "CGGGACGTCGTAGGCGGCCCGACGGCAATGTCAAAGAGTGTTTCC"]

In [ ]:


In [ ]: