Exercise 1: Reading FASTA files

The FASTA format is a text-based format for nucleotide and protein sequences. A FASTA record begins with a single-line description, indicated by a leading ">", followed by the sequence itself.

Example:

>gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKIIRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGFIRENE
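
The format described above can be parsed in a few lines. The following is a minimal sketch rather than a full parser: the helper name parse_fasta and the two short records are made up for illustration, and it assumes that every record starts with a ">" line and that a sequence may be wrapped over several lines.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA-formatted text."""
    records = []
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, ''.join(chunks)))
            description = line[1:]
            chunks = []
        elif description is not None:
            chunks.append(line)  # sequence lines of one record are concatenated
    if description is not None:
        records.append((description, ''.join(chunks)))
    return records


# Hypothetical two-record input, made up for this example.
example = """>seq1 first toy record
ATGCGT
ACGT
>seq2 second toy record
TTGACA
"""

records = parse_fasta(example)
for description, sequence in records:
    print('%s: %d residues' % (description, len(sequence)))
```

For a file with a single record, such as the one used in this exercise, the list contains one tuple and the sequence is its second element.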

  • parse the nucleotide sequence in the file GPD1_seq.fasta (omit the description line)
  • write a function compute_nt_composition( sequence ), which returns a dictionary containing the number of occurrences of each base in a given sequence
  • compute the nucleotide composition of the GPD1 gene and pickle the result to a file

In [76]:
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3


def compute_nt_composition(sequence):
    """Return a dictionary with the number of occurrences of each base."""
    return {base: sequence.count(base) for base in 'ATGC'}


# Read the file, skip the description line and join the remaining lines.
with open('GPD1_seq.fasta', 'r') as f:
    sequence = ''.join(line.strip() for line in f if not line.startswith('>'))

composition = compute_nt_composition(sequence)
print('The sequence has %(A)d "A"s, %(T)d "T"s, %(G)d "G"s and %(C)d "C"s.' % composition)

# Pickle the result to a file (the file name is chosen here for illustration).
with open('GPD1_nt_composition.pkl', 'wb') as f:
    pickle.dump(composition, f)

The sequence has 324 "A"s, 336 "T"s, 276 "G"s and 240 "C"s.

Exercise 2: Plot a histogram

Take the nucleotide composition of the gene above and plot a histogram of the A, T, G and C frequencies. Label the histogram clearly and give it a title. Choose for yourself whether to display horizontal or vertical bars. Advanced options include changing the color of individual bars, the width of the bars, and the alignment of labels and bars.
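
One possible way to approach the plot is sketched below. It is only an illustration: the helper names composition_to_bars and plot_composition are made up, and the counts are small placeholder numbers that should be replaced by the dictionary computed in Exercise 1. matplotlib is imported inside the plotting function so that the data preparation can be tried on its own.

```python
def composition_to_bars(composition, order='ATGC'):
    """Return base labels and relative frequencies in a fixed order."""
    total = float(sum(composition[base] for base in order))
    labels = list(order)
    frequencies = [composition[base] / total for base in order]
    return labels, frequencies


def plot_composition(composition, title='Nucleotide composition'):
    """Draw a vertical bar chart with one bar per base."""
    import matplotlib.pyplot as plt  # imported here so the helper above needs no plotting library

    labels, frequencies = composition_to_bars(composition)
    positions = list(range(len(labels)))
    fig, ax = plt.subplots()
    ax.bar(positions, frequencies, width=0.6, align='center',
           color=['green', 'red', 'orange', 'blue'])  # one colour per bar
    ax.set_xticks(positions)
    ax.set_xticklabels(labels)
    ax.set_xlabel('Base')
    ax.set_ylabel('Relative frequency')
    ax.set_title(title)
    return ax


# Placeholder counts, made up for illustration only.
example_counts = {'A': 30, 'T': 20, 'G': 25, 'C': 25}
labels, frequencies = composition_to_bars(example_counts)
for label, frequency in zip(labels, frequencies):
    print('%s: %.2f' % (label, frequency))
```

In the notebook, plot_composition(example_counts) draws the chart. Horizontal bars can be obtained with ax.barh, in which case the base labels go on the y axis.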


In [79]:
%matplotlib inline 
from pylab import *

Exercise 3: Plot a scatterplot

The file mycoplasma_gene_sequences.csv contains the genomic sequences of all Mycoplasma genitalium genes. The file contains two columns separated by a comma: WholeCellModelID and Sequence.

  • Read and parse the file and compute the nucleotide composition for each gene using the compute_nt_composition( seq ) function that you defined in Exercise 1. Collect the nucleotide compositions for all genes. Then use the scatter function to plot a scatterplot of A content versus T content for each gene (don't forget to normalize the nucleotide content by gene length).

  • Indicate the length of each sequence by the dot size in the scatterplot (hint: the s argument of the scatter function)

  • Plot the scatterplot for each pairwise combination of A, G, T and C (use subplot)
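
The steps in this exercise could be sketched roughly as follows. This is an illustration written for Python 3, with several assumptions: the CSV file is assumed to start with a header row naming the columns WholeCellModelID and Sequence, the helper names normalized_compositions and plot_a_versus_t are made up, the two toy genes are invented, and compute_nt_composition is restated so that the block is self-contained.

```python
import csv
import io


def compute_nt_composition(sequence):
    """Count each base (restated here so the sketch is self-contained)."""
    return {base: sequence.count(base) for base in 'ATGC'}


def normalized_compositions(handle):
    """Return (gene IDs, sequence lengths, per-gene base fractions) from an open CSV file."""
    ids, lengths, fractions = [], [], []
    # Assumption: the first row is a header naming the two columns.
    for row in csv.DictReader(handle):
        sequence = row['Sequence'].strip().upper()
        if not sequence:
            continue  # skip genes without a sequence
        counts = compute_nt_composition(sequence)
        ids.append(row['WholeCellModelID'])
        lengths.append(len(sequence))
        # Normalize each count by the gene length.
        fractions.append({base: counts[base] / float(len(sequence)) for base in 'ATGC'})
    return ids, lengths, fractions


def plot_a_versus_t(lengths, fractions):
    """Scatter A content against T content; the dot area scales with gene length."""
    import matplotlib.pyplot as plt  # imported lazily, only needed for plotting

    a_content = [f['A'] for f in fractions]
    t_content = [f['T'] for f in fractions]
    fig, ax = plt.subplots()
    # The scaling factor 0.05 is an arbitrary choice to keep the dots readable.
    ax.scatter(a_content, t_content, s=[0.05 * n for n in lengths], alpha=0.5)
    ax.set_xlabel('A content')
    ax.set_ylabel('T content')
    return ax


# Toy data made up for illustration; the real file would be opened with
# open('mycoplasma_gene_sequences.csv') instead.
toy_csv = io.StringIO('WholeCellModelID,Sequence\nGENE_1,AATTGGCC\nGENE_2,AAAATTGC\n')
ids, lengths, fractions = normalized_compositions(toy_csv)
print(ids, lengths)
```

For the remaining base pairs, the same scatter call can be repeated in a grid of axes created with plt.subplots, one panel per pair from itertools.combinations('ATGC', 2).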


In [82]:

Exercise 4: Plot the phase space

In the NumPy tutorial yesterday, you examined how, in theory, populations of predator and prey evolve over time (the Lotka-Volterra system). Today, revisit the system and plot the phase space of the two species. In a phase space, the two variables are plotted against each other rather than against time.

As a next step, imagine we would like to visualize how different starting conditions affect the population behavior. Try plotting several starting conditions in the same phase space plot.
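
A phase space plot needs the two population trajectories first. The sketch below shows one way to obtain them. Several things in it are assumptions and are not taken from the tutorial: the standard form of the equations (prey change a*x - b*x*y, predator change -c*y + d*x*y), the parameter values, and the starting conditions are illustrative, and a hand-written fourth-order Runge-Kutta step stands in for scipy.integrate so that the sketch runs without SciPy. The helper names integrate_rk4 and plot_phase_space are made up.

```python
def lotka_volterra(state, t, a, b, c, d):
    """Right-hand side of the Lotka-Volterra equations for state = (prey, predator)."""
    x, y = state
    return [a * x - b * x * y, -c * y + d * x * y]


def integrate_rk4(func, state0, t_end, dt, args):
    """Integrate with a fixed-step fourth-order Runge-Kutta scheme; returns the list of states."""
    n_steps = int(round(t_end / dt))
    states = [list(state0)]
    for i in range(n_steps):
        t = i * dt
        s = states[-1]
        k1 = func(s, t, *args)
        k2 = func([v + 0.5 * dt * k for v, k in zip(s, k1)], t + 0.5 * dt, *args)
        k3 = func([v + 0.5 * dt * k for v, k in zip(s, k2)], t + 0.5 * dt, *args)
        k4 = func([v + dt * k for v, k in zip(s, k3)], t + dt, *args)
        states.append([v + dt / 6.0 * (p + 2 * q + 2 * r + w)
                       for v, p, q, r, w in zip(s, k1, k2, k3, k4)])
    return states


def plot_phase_space(trajectories):
    """Plot predator against prey, one curve per starting condition."""
    import matplotlib.pyplot as plt  # imported lazily, only needed for plotting

    fig, ax = plt.subplots()
    for start, states in trajectories.items():
        ax.plot([s[0] for s in states], [s[1] for s in states],
                label='start = %s' % (start,))
    ax.set_xlabel('Prey population')
    ax.set_ylabel('Predator population')
    ax.set_title('Lotka-Volterra phase space')
    ax.legend()
    return ax


# Illustrative parameters (a, b, c, d) and starting conditions: assumptions,
# not values from the tutorial. (20, 10) is the fixed point (c/d, a/b).
params = (1.0, 0.1, 1.5, 0.075)
starts = [(10.0, 5.0), (15.0, 8.0), (20.0, 10.0)]
trajectories = {start: integrate_rk4(lotka_volterra, start, 15.0, 0.01, params)
                for start in starts}
```

In the notebook, plot_phase_space(trajectories) draws all starting conditions in one figure. With SciPy available, scipy.integrate.odeint(lotka_volterra, start, t, args=params), with t an array of time points, accepts the same right-hand-side function.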


In [1]:
import scipy.integrate