Exercise 4.1 Parse the GO file into a dictionary

Download http://purl.obolibrary.org/obo/go.obo. Write a function that parses the file and returns a dictionary with GO term ids as keys and named tuples (id, name, namespace) as values. If necessary, reread the week 3 lecture notes to see how named tuples are created.
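One possible shape for such a parser is sketched below. It assumes the usual OBO layout, in which each term is a `[Term]` stanza made of `key: value` lines and other stanzas such as `[Typedef]` should be ignored. The names `GOTerm`, `_store` and `parse_go_obo`, as well as the short in-memory sample, are illustrative and not part of the exercise.

```python
from collections import namedtuple

# Hypothetical record type; the field names follow the exercise text.
GOTerm = namedtuple("GOTerm", ["id", "name", "namespace"])


def _store(terms, fields):
    """Add a finished stanza to `terms` if it has all three fields."""
    if fields is not None and all(key in fields for key in GOTerm._fields):
        terms[fields["id"]] = GOTerm(**fields)


def parse_go_obo(lines):
    """Return {GO id: GOTerm} from an iterable of OBO-formatted lines."""
    terms = {}
    fields = None  # dict for the [Term] stanza being read, None elsewhere
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A stanza header closes the previous stanza.
            _store(terms, fields)
            fields = {} if line == "[Term]" else None
        elif fields is not None and ": " in line:
            key, value = line.split(": ", 1)
            # Keep only the three wanted keys, first occurrence wins.
            if key in GOTerm._fields and key not in fields:
                fields[key] = value
    _store(terms, fields)  # the last stanza has no header after it
    return terms


# Small in-memory sample so the sketch runs without the download.
sample = """format-version: 1.2

[Term]
id: GO:0005576
name: extracellular region
namespace: cellular_component

[Typedef]
id: part_of
name: part of
"""

go_terms = parse_go_obo(sample.splitlines())
print(go_terms["GO:0005576"].name)  # extracellular region
```

For the real file, the same function could be called on an open file handle, for example `with open("go.obo") as handle: go_terms = parse_go_obo(handle)`, assuming the download was saved under that name.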



In [ ]:

Exercise 4.2 Protein locations

Print the human-friendly names of the GO terms in the cellular component namespace that are associated with the 10 longest human proteins (see Exercise 3.4). Use the function written in Exercise 4.1 to find which of the GO annotations belong to the given namespace and what their human-friendly names are. Your output could look like this:

P35555: microfibril, extracellular region, proteinaceous extracellular matrix, basement membrane, extracellular space, extracellular matrix, extracellular exosome
P50851: lysosome, endoplasmic reticulum, Golgi apparatus, plasma membrane, endomembrane system, membrane, integral component of membrane, cytoplasmic, membrane-bounded vesicle, extrinsic component of membrane
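The filtering step could look roughly like the sketch below. How the GO ids of each protein are obtained depends on Exercise 3.4, so the `annotations` mapping here is a made-up stand-in, and `names_in_namespace` is a hypothetical helper name. The two-term `go_terms` dictionary only imitates the result of Exercise 4.1.

```python
from collections import namedtuple

GOTerm = namedtuple("GOTerm", ["id", "name", "namespace"])


def names_in_namespace(go_ids, go_terms, namespace="cellular_component"):
    """Return the names of the given GO ids that belong to `namespace`."""
    return [
        go_terms[go_id].name
        for go_id in go_ids
        # Ids missing from the dictionary are skipped rather than raising.
        if go_id in go_terms and go_terms[go_id].namespace == namespace
    ]


# Illustrative stand-ins: a two-term dictionary and one made-up protein entry.
go_terms = {
    "GO:0005576": GOTerm("GO:0005576", "extracellular region", "cellular_component"),
    "GO:0005509": GOTerm("GO:0005509", "calcium ion binding", "molecular_function"),
}
annotations = {"protein_A": ["GO:0005576", "GO:0005509"]}

for accession, go_ids in annotations.items():
    print(f"{accession}: {', '.join(names_in_namespace(go_ids, go_terms))}")
```

Joining the names with `", "` reproduces the layout of the sample output above.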



In [ ]:

Exercise 4.3 BLAST query

a) Run a BLAST query for the following protein sequences against the Protein Data Bank (pdb). Save the results to a file for further analysis.

>protein_1
YYERLGLIPAIERTEKGYR
>protein_2
HWGAASSEISGSDHTVDG

b) How many hits are there in each query result?
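A rough outline of both parts, assuming Biopython is installed and that the course works with `Bio.Blast.NCBIWWW` and `Bio.SearchIO`, is given below. The function names `run_blast` and `count_hits` and the output file name are hypothetical. Depending on the installed Biopython version, a different entry point for web BLAST may be preferred. The search itself needs network access and can take a while, so nothing is submitted when the block is loaded.

```python
# Both queries in one FASTA string, exactly as given in the exercise.
QUERY_FASTA = """>protein_1
YYERLGLIPAIERTEKGYR
>protein_2
HWGAASSEISGSDHTVDG
"""


def run_blast(fasta_text, out_path="blast_results.xml"):
    """Submit a blastp search against pdb and save the XML reply."""
    from Bio.Blast import NCBIWWW  # requires Biopython and network access

    result_handle = NCBIWWW.qblast("blastp", "pdb", fasta_text)
    with open(out_path, "w") as out_file:
        out_file.write(result_handle.read())
    return out_path


def count_hits(xml_path):
    """Return {query id: number of hits} for a saved BLAST XML file."""
    from Bio import SearchIO  # requires Biopython

    return {
        qresult.id: len(qresult)
        for qresult in SearchIO.parse(xml_path, "blast-xml")
    }
```

A typical session would call `count_hits(run_blast(QUERY_FASTA))` once and then reuse the saved file in the following exercises instead of querying the server again.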



In [ ]:

Exercise 4.4 Best BLAST hits

Filter the hits obtained in Exercise 4.3 so that only HSPs with an E-value below $10^{-4}$ remain. Sort the hits by E-value in ascending order. Print the E-values along with the corresponding hit ids.

(If a hit has more than one HSP, only consider the one with the smallest E-value.)
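The selection logic can be sketched independently of the BLAST library, as below. The helper name `best_hsps` is hypothetical, and the code only assumes that each hit exposes an `id` and a non-empty list `hsps` whose items carry an `evalue`, which is how `Bio.SearchIO` hit objects are understood to behave. The hits used here are invented stand-ins, not real BLAST results.

```python
from types import SimpleNamespace


def best_hsps(hits, max_evalue=1e-4):
    """Return (evalue, hit id) pairs below `max_evalue`, best E-value first."""
    selected = []
    for hit in hits:
        # Represent each hit by its HSP with the smallest E-value.
        best = min(hit.hsps, key=lambda hsp: hsp.evalue)
        if best.evalue < max_evalue:
            selected.append((best.evalue, hit.id))
    return sorted(selected)  # tuples sort by E-value first


# Invented stand-in hits, for illustration only.
hits = [
    SimpleNamespace(id="hit_A", hsps=[SimpleNamespace(evalue=3e-7),
                                      SimpleNamespace(evalue=0.5)]),
    SimpleNamespace(id="hit_B", hsps=[SimpleNamespace(evalue=1e-20)]),
    SimpleNamespace(id="hit_C", hsps=[SimpleNamespace(evalue=2.0)]),
]

for evalue, hit_id in best_hsps(hits):
    print(f"{evalue:.1e}\t{hit_id}")
```

Taking the minimum per hit before filtering means a hit is kept whenever at least one of its HSPs passes the threshold.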



In [ ]:

Exercise 4.5 Information from HSP

Print the following pieces of information for each HSP obtained in Exercise 4.4:

the start and end positions of the matched segment of the query sequence
the start and end positions of the matched segment of the database sequence
the percent identity (i.e. the percentage of alignment positions that hold identical residues)
the percent coverage (i.e. the percentage of the query sequence that was matched)
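The two percentages in the list above reduce to simple ratios, sketched below with hypothetical function names. With `Bio.SearchIO`, the inputs are believed to be available as `hsp.ident_num`, `hsp.aln_span`, `hsp.query_start`, `hsp.query_end` and `qresult.seq_len`; treat those attribute names as an assumption to check against the installed version. The coverage formula assumes a 0-based start and an exclusive end, so the matched length is simply `end - start`.

```python
def percent_identity(identical, alignment_length):
    """Identical positions as a percentage of the alignment length."""
    return 100.0 * identical / alignment_length


def percent_coverage(query_start, query_end, query_length):
    """Matched part of the query as a percentage of its full length.

    Assumes a 0-based start and an exclusive end coordinate.
    """
    return 100.0 * (query_end - query_start) / query_length


# Made-up numbers: 15 identical positions in an 18-column alignment that
# spans positions 0..18 of a 19-residue query.
print(f"identity: {percent_identity(15, 18):.1f}%")
print(f"coverage: {percent_coverage(0, 18, 19):.1f}%")
```

If the coordinates at hand are 1-based and inclusive instead, the matched length becomes `end - start + 1`.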



In [ ]: