The FASTA format is a text-based format for nucleotide and protein sequences. A FASTA file begins with a single-line description, marked by a leading ">", followed by the sequence data.
Example:
>gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKIIRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGFIRENE
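As a rough illustration of this layout, the following Python 3 sketch splits FASTA-formatted text into description/sequence pairs. The function name parse_fasta and the two toy records are hypothetical and exist only for this example; the sketch assumes that every record starts with a ">" line and that the sequence may be wrapped over several lines.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into (description, sequence) pairs."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, ''.join(chunks)))
            description, chunks = line[1:], []
        elif description is not None:
            # Sequence data may span several lines; collect the pieces.
            chunks.append(line)
    if description is not None:
        records.append((description, ''.join(chunks)))
    return records


# Toy input for illustration only (not real sequence data).
example = """>seq1 hypothetical first record
ATGCGT
TTAGC
>seq2 hypothetical second record
GGCCAA
"""

records = parse_fasta(example)
for description, sequence in records:
    print(description, len(sequence))
```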
Read the sequence from the file GPD1_seq.fasta (omit the description line).
Write a function compute_nt_composition( sequence ), which returns a dictionary containing the number of occurrences of each base in a given sequence.
In [76]:
# Read the FASTA file and join all sequence lines, skipping the '>' description line.
with open('GPD1_seq.fasta', 'r') as f:
    lines = f.readlines()
sequence = ''.join(line.strip() for line in lines if not line.startswith('>'))

def compute_nt_composition(sequence):
    """Return a dictionary with the number of occurrences of each base."""
    # Count on the argument itself instead of relying on global counters,
    # so that repeated calls do not add to earlier results.
    return {base: sequence.count(base) for base in 'ATGC'}

data = compute_nt_composition(sequence)
print('The sequence has {A} "A"s, {T} "T"s, {G} "G"s and {C} "C"s.'.format(**data))
print(data['A'])
Take the nucleotide composition of the gene above and plot a histogram of the A, T, G and C frequencies. Label the histogram clearly and give it a title. Choose for yourself whether to display horizontal or vertical bars. Advanced options include changing the color of individual bars, the width of the bars, and the alignment of labels and bars.
In [79]:
%matplotlib inline
from pylab import *
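One possible way to approach the plot is sketched below in Python 3, using matplotlib.pyplot directly instead of the pylab star import. The helper plot_nt_composition, the toy sequence and the chosen colors are assumptions made for this illustration; in the exercise, the dictionary returned by compute_nt_composition for the GPD1 sequence would take the place of toy_composition.

```python
import matplotlib.pyplot as plt


def plot_nt_composition(composition, title='Nucleotide composition'):
    """Draw a vertical bar chart of base counts and return (fig, ax)."""
    bases = ['A', 'T', 'G', 'C']
    counts = [composition.get(base, 0) for base in bases]
    positions = list(range(len(bases)))
    # One color per bar; these particular colors are an arbitrary choice.
    colors = ['green', 'red', 'orange', 'blue']

    fig, ax = plt.subplots()
    ax.bar(positions, counts, width=0.6, color=colors, align='center')
    ax.set_xticks(positions)
    ax.set_xticklabels(bases)
    ax.set_xlabel('Nucleotide')
    ax.set_ylabel('Number of occurrences')
    ax.set_title(title)
    return fig, ax


# Toy sequence for illustration only; replace with the real composition.
toy_sequence = 'ATGCGTTTAGCAATG'
toy_composition = {base: toy_sequence.count(base) for base in 'ATGC'}
fig, ax = plot_nt_composition(toy_composition, title='Toy sequence')
```

For horizontal bars, ax.barh can be used in place of ax.bar, with the tick and axis labels moved to the y axis. Outside a notebook, a call to plt.show() is needed to display the figure.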
The file mycoplasma_gene_sequences.csv contains the genomic sequences of all Mycoplasma genitalium genes. The file contains two columns separated by a comma: the WholeCellModelID and the Sequence.
Read and parse the file and compute the nucleotide composition for each gene using the compute_nt_composition( seq ) function that you defined in Exercise 1. Collect the nucleotide compositions of all genes. Then use the scatter function to draw a scatterplot of A content versus T content for each gene (don't forget to normalize the nucleotide content by gene length).
Indicate the length of each sequence by the dot size in the scatterplot (hint: the s argument of the scatter function).
Draw the scatterplot for each pairwise combination of A, G, T and C (use subplot).
In [82]:
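The steps above could be sketched roughly as follows in Python 3. This is an illustration under stated assumptions, not a reference solution: it assumes the CSV file has a header row naming the columns WholeCellModelID and Sequence, it redefines compute_nt_composition so the block is self-contained, and the helpers read_gene_fractions and plot_pairwise_scatter, the size_scale parameter and the three toy genes are all hypothetical.

```python
import csv
import io
from itertools import combinations

import matplotlib.pyplot as plt


def compute_nt_composition(sequence):
    """Return a dictionary with the number of occurrences of each base."""
    return {base: sequence.count(base) for base in 'ATGC'}


def read_gene_fractions(handle):
    """Return gene lengths and, per base, a list of length-normalized contents."""
    lengths = []
    fractions = {base: [] for base in 'ATGC'}
    # Assumption: the first row is a header with a 'Sequence' column.
    for row in csv.DictReader(handle):
        sequence = row['Sequence'].strip().upper()
        if not sequence:
            continue  # skip genes without sequence data
        composition = compute_nt_composition(sequence)
        lengths.append(len(sequence))
        for base in 'ATGC':
            fractions[base].append(composition[base] / float(len(sequence)))
    return lengths, fractions


def plot_pairwise_scatter(lengths, fractions, size_scale=1.0):
    """Scatter every pair of bases in its own subplot; dot size follows gene length."""
    pairs = list(combinations('ATGC', 2))  # the six unordered base pairs
    fig, axes = plt.subplots(2, 3, figsize=(12, 7))
    sizes = [size_scale * length for length in lengths]
    for ax, (x_base, y_base) in zip(axes.flat, pairs):
        ax.scatter(fractions[x_base], fractions[y_base], s=sizes, alpha=0.5)
        ax.set_xlabel('%s content (fraction of gene length)' % x_base)
        ax.set_ylabel('%s content (fraction of gene length)' % y_base)
        ax.set_title('%s versus %s' % (x_base, y_base))
    fig.tight_layout()
    return fig, axes


# Toy data for illustration only; the real input is mycoplasma_gene_sequences.csv.
toy_csv = io.StringIO(
    'WholeCellModelID,Sequence\n'
    'gene_1,ATGCATGCAA\n'
    'gene_2,TTTTGGCC\n'
    'gene_3,AAGGCCTTAAGG\n'
)
lengths, fractions = read_gene_fractions(toy_csv)
fig, axes = plot_pairwise_scatter(lengths, fractions, size_scale=10.0)
```

With the real file, the handle would come from open('mycoplasma_gene_sequences.csv', newline='') instead of io.StringIO. Since real genes are much longer than the toy sequences, a size_scale well below 1 will probably give more readable dots.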
In the numpy tutorial yesterday, you examined how a population of predators and prey can theoretically evolve over time (Lotka-Volterra system). Today, revisit the system and plot the phase space of the two species. In a phase space plot, the two variables are plotted against each other rather than against time.
As a next step, suppose we would like to visualize how different starting conditions affect the behavior of the populations. Try plotting several starting conditions in the same phase space plot.
In [1]:
import scipy.integrate
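A minimal phase space sketch in Python 3 might look like the following. It assumes the common form of the Lotka-Volterra equations, d(prey)/dt = alpha*prey - beta*prey*predator and d(predator)/dt = delta*prey*predator - gamma*predator. The parameter values, the time grid and the starting populations are arbitrary choices for illustration and are not taken from the tutorial; substitute the values used there.

```python
import numpy as np
import scipy.integrate
import matplotlib.pyplot as plt


def lotka_volterra(state, t, alpha, beta, gamma, delta):
    """Right-hand side of the Lotka-Volterra equations for odeint."""
    prey, predator = state
    d_prey = alpha * prey - beta * prey * predator
    d_predator = delta * prey * predator - gamma * predator
    return [d_prey, d_predator]


# Assumed parameter values (alpha, beta, gamma, delta); illustration only.
params = (1.0, 0.1, 1.5, 0.075)
t = np.linspace(0.0, 15.0, 1000)

# Assumed starting conditions as (prey, predator) pairs.
initial_conditions = [(10.0, 5.0), (10.0, 8.0), (10.0, 15.0), (10.0, 20.0)]

fig, ax = plt.subplots()
trajectories = []
for prey0, predator0 in initial_conditions:
    # Integrate the system over the time grid for this starting point.
    solution = scipy.integrate.odeint(
        lotka_volterra, [prey0, predator0], t, args=params)
    trajectories.append(solution)
    # Phase space: prey on the x axis, predator on the y axis, no time axis.
    ax.plot(solution[:, 0], solution[:, 1],
            label='start: prey=%g, predator=%g' % (prey0, predator0))

ax.set_xlabel('Prey population')
ax.set_ylabel('Predator population')
ax.set_title('Lotka-Volterra phase space')
ax.legend()
```

Each starting condition traces its own closed orbit around the equilibrium point (gamma/delta, alpha/beta), which makes the effect of the initial populations directly visible in one plot.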