Project

The file mRNA.fasta contains a set of mRNA sequences expressed by a hypothetical organism. The objective of this project is to analyse these target sequences to discover what proteins the organism produces and how they differ from the corresponding proteins in other species.

The project is divided into four tasks that form a series of analyses. In each task, you will be asked to save the output data to a file and summarise your discoveries. The expected outputs of tasks 1 and 2 are provided as additional files so that you can attempt the later tasks even if you fail to solve some of the earlier ones.

To pass the project, you must submit a Jupyter notebook file that contains the report and the code of your analysis. Your submission must demonstrate that you understand how to analyse the data to reach the objectives and how to implement the analysis. Some errors are allowed as long as your approach is correct. Explain your approach and results in your report, and include comments in your code to clarify your implementation. You may be asked to revise your submission if it is not acceptable.

There are also extra tasks. Successfully completing these tasks will earn you extra points in the exam.

1. Coding sequence extraction

Extract the coding sequences from the mRNA sequences.

a) For each target mRNA sequence, find the longest open reading frame (ORF) in the sequence. An open reading frame is a continuous sequence that begins with the start codon ATG and ends with a stop codon, with no in-frame stop codons in between. Remember that the start codon can be at any position in the sequence.

b) Translate the ORFs to protein sequences for further analysis.

c) Save the target protein sequences to the file 1-proteins.gb in the GenBank format.

d) Summarise your discoveries by reporting the position and length of the longest ORF for each target mRNA sequence.
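The ORF definition in a) can be sketched in plain Python as follows. This is only a minimal illustration, not the expected solution: the function name `longest_orf` is hypothetical, the sequences are assumed to use the DNA alphabet (T rather than U), only the three forward reading frames are scanned, and the reported coordinates (0-based start, exclusive end, stop codon included) are one possible convention.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end, orf) for the longest ORF in seq, or None if there is none.

    Assumed convention: start is 0-based, end is exclusive,
    and the ORF includes its stop codon.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None  # first ATG seen since the last stop codon in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                # An ORF closes here; keep it if it is the longest so far.
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None
    if best is None:
        return None
    return best[0], best[1], seq[best[0]:best[1]]


result = longest_orf("CCATGAAATAGGATGCCCGGGTTTTAAC")
print(result)
```

Each frame is read codon by codon, so a stop codon that is out of frame never terminates the ORF. The resulting nucleotide sequence could then be translated with, for example, the `translate` method of a Biopython `Seq` object.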



In [ ]:

2. Similarity search

Find similar proteins from the NCBI Protein database.

a) For each target protein sequence from task 1, run BLAST to obtain similar sequences. Save the BLAST results to the file 2-blast.xml.

b) Select the hits containing a high-scoring pair (HSP) with the following properties:

the E-value is $\leq 10^{-6}$
the HSP spans $\geq 75\%$ of the target sequence
the percent identity is $\geq 30\%$

c) Update the records of the target protein sequences by adding the database ids of the selected hits. Use a sequence feature with the type BLAST and the qualifier key db_xref containing a list of ids. Save the updated records to the file 2-proteins.gb.

d) Summarise your discoveries by reporting the ids of the selected hits for each target protein sequence.
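The three HSP criteria in b) can be combined into a single predicate, sketched below. The function name `hsp_passes` and its parameters are hypothetical. Two interpretations are assumptions rather than part of the task text: coverage is computed as the length of the aligned query range divided by the target length, and percent identity as identical positions divided by the alignment length. The input values correspond to what Biopython's BLAST XML parser exposes as `expect`, `query_start`, `query_end`, `identities` and `align_length` on an HSP.

```python
def hsp_passes(expect, query_start, query_end, identities, align_length,
               query_length, max_expect=1e-6, min_coverage=0.75,
               min_identity=0.30):
    """Return True if an HSP meets the E-value, coverage and identity limits.

    query_start and query_end are assumed to be 1-based and inclusive,
    as in BLAST output.
    """
    coverage = (query_end - query_start + 1) / query_length
    identity = identities / align_length
    return (expect <= max_expect
            and coverage >= min_coverage
            and identity >= min_identity)


# Hypothetical HSP: 90 of 100 query residues aligned, 40 identities in 95 columns.
print(hsp_passes(1e-20, 1, 90, 40, 95, 100))
```

A hit would then be selected if at least one of its HSPs passes this check.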



In [ ]:

3. Protein function prediction

Infer the putative functions by similarity.

a) Download the UniProt entries of the hits selected in task 2. Save the entries to the files XXXXXX.xml where XXXXXX is the UniProt id.

b) Separately for each target protein sequence from task 1, find the Gene Ontology (GO) terms that occur in all of the matched UniProt entries.

c) Update the records of the target protein sequences by adding the GO terms found. Use a sequence feature with the type GO and the qualifier key db_xref containing a list of ids. Also add the qualifier evidence with the value inferred by similarity to the feature. Save the updated records to the file 3-proteins.gb.

d) Summarise your discoveries by reporting the putative GO terms for each target protein sequence. Print both the GO term ids and their human-friendly names.
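Finding the GO terms shared by all matched entries in b) amounts to a set intersection. The sketch below assumes the GO ids of each UniProt entry have already been extracted into a collection; the function name `common_go_terms` is hypothetical and the ids in the example are placeholders only.

```python
def common_go_terms(entries):
    """Return the GO ids present in every entry.

    entries is a list with one collection of GO ids per matched UniProt entry.
    An empty list yields an empty set.
    """
    term_sets = [set(terms) for terms in entries]
    if not term_sets:
        return set()
    return set.intersection(*term_sets)


# Placeholder GO ids for three hypothetical UniProt entries.
entries = [
    {"GO:0005524", "GO:0016301"},
    {"GO:0005524"},
    {"GO:0005524", "GO:0005737"},
]
print(common_go_terms(entries))
```

Note that a single entry without any GO annotation makes the intersection empty, which may be worth reporting explicitly in the summary.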



In [ ]:

4. Sequence alignment

Find the positions at which the sequences differ from the consensus.

a) For each target protein sequence from task 1, align the target sequence with the sequences of the UniProt entries from task 3. Save the alignments to the file 3-alignment-X.fasta in the FASTA format where X is a running number.

b) Find the positions in the alignments at which the target sequence is not identical to the consensus. Use a 60% consensus (i.e. a consensus exists at a position if at least 60% of the aligned sequences contain the same residue). Ignore the non-consensus positions and the gaps in either sequence.

c) Summarise your discoveries by reporting the positions where the target protein sequence is not identical to the consensus sequence for each alignment.

d) (extra, 1p) Update the records of the target protein sequences by adding the non-identical positions. Use a sequence feature with the type mutation and the qualifier keys observed and consensus to store the residues in the target sequence and the consensus sequence, respectively. Save the updated records to the file 4-proteins.gb.
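The 60% consensus rule in b) can be sketched column by column as shown below. The function name `differences_from_consensus` is hypothetical, and several choices are assumptions: the target itself is counted among the aligned sequences, all aligned sequences have equal length, gaps are written as `-`, and positions are reported as 1-based alignment columns rather than positions in the ungapped target.

```python
from collections import Counter


def differences_from_consensus(target, aligned, threshold=0.6, gap="-"):
    """Return (column, target_residue, consensus_residue) tuples.

    aligned is the list of all aligned sequences (assumed to include target).
    Columns without a consensus, or with a gap in either the target or the
    consensus, are skipped.
    """
    diffs = []
    for col in range(len(target)):
        column = [seq[col] for seq in aligned]
        residue, count = Counter(column).most_common(1)[0]
        if count / len(column) < threshold:
            continue  # no residue reaches the consensus threshold
        if residue == gap or target[col] == gap:
            continue  # gap in the consensus or in the target
        if target[col] != residue:
            diffs.append((col + 1, target[col], residue))
    return diffs


# Toy alignment: the target differs from the consensus only in column 2.
target = "MKV-LA"
aligned = [target, "MRVALA", "MRVALG", "MRIALS"]
print(differences_from_consensus(target, aligned))
```

In the toy alignment, column 4 is skipped because the target has a gap there, and column 6 is skipped because no residue reaches 60%.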



In [ ]:

5. (extra, 2p) Prediction of protein features

Predict protein features by using DTU online services.

a) For each target protein sequence from task 1, use the NetPhosBac service to predict the phosphorylation sites in the sequence.

b) For each target protein sequence from task 1, use the NetTurnP service to predict the residues that reside within turns. Do not predict the types of turns.

c) Update the records of the target protein sequences by adding the predicted features. Use sequence features to store the relevant information. Save the updated records to the file 5-proteins.gb.

d) Summarise your discoveries by reporting the predicted phosphorylation sites and turns for each target protein sequence.



In [ ]: