Individual means that you do it yourself. You won't learn to code if you don't struggle for yourself and write your own code. Remember that while you can discuss the general (algorithmic) way to solve a problem, you should not even be looking at anyone else's code or showing anyone else your code for an individual assignment.
Review the Group Work guidelines on Canvas and/or ask an instructor if you have any questions.
Be sure to spell all function names correctly - misspelled functions will lose points (and often break anyway since no one is sure what to type to call it). If you prefer showing your earlier, scratch work as you figure out what you are doing, please be sure that you make a final, complete, correct last function in its own cell that you then call several times to test. In other words, separate your thought process/working versions from the final one (a comment that tells us which is the final version would be lovely).
Every function should have at least a docstring at the start that states what it does (see Lesson3 Team Notebook if you need a reminder). Make other comments as necessary.
Make sure that you are running test cases (plural) for everything and commenting on the results in markdown. Your comments should discuss how you know that the test case results are correct.
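As an illustration of the docstring and test-case expectations above, here is a minimal sketch. The function `base_count` is a hypothetical helper invented for this example and is not part of the assignment; the point is only the shape of a docstring followed by several test calls whose results can be checked by hand.

```python
def base_count(dna, base):
    """Return how many times base occurs in the DNA string dna (case-insensitive)."""
    # base_count is a hypothetical example function, not one of the assigned functions.
    return dna.upper().count(base.upper())


# Several test cases (plural), each small enough to verify by counting by eye.
print(base_count("ATGGCA", "G"))  # two G's in the sequence
print(base_count("atggca", "g"))  # same sequence in lowercase
print(base_count("", "A"))        # empty input
```

In a notebook, each call would be followed by a markdown comment explaining why the result is correct, for example that counting the G's in ATGGCA by eye gives 2.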
Don't forget to refer to the DNA, RNA, and Protein information that is in the Pre-Activity for Lesson 10.
A. Using the pseudocode that your team developed as a reference, create a codon function that takes a DNA sequence string as its parameter and prints the DNA codons. A codon should be represented as a string of three characters.
Note: Your list of codons should start with the first ATG found in the sequence and end at the first stop codon (if there is a stop codon) or the end of the DNA sequence input (if there is not a stop codon). At this point we are only going to analyze the DNA from the first base given to the last base given (which corresponds to the 5'→3' direction). The function should not print anything if no ATG is found. For now don’t worry about multiple ATG codons since any internal ATG codons just make a Met amino acid and for now we will assume any after the stop are not translated. Assume any one DNA string will only be translated once from the first ATG.
Restriction: For this exercise, you cannot use the Python find string method.
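One possible way to approach part A is sketched below; it is not the only valid design and may differ from your team's pseudocode. It assumes the three standard stop codons (TAA, TAG, TGA), prints one codon per line, prints the stop codon when one is found, and ignores any incomplete trailing fragment of one or two bases. The first ATG is located with a plain loop and slicing so that the find method is never used.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")  # assumption: the standard stop codons


def codon(dna):
    """Print the codons of dna from the first ATG through the first in-frame stop codon."""
    # Scan one base at a time for the first ATG (no str.find allowed).
    start = None
    for i in range(len(dna) - 2):
        if dna[i:i + 3] == "ATG":
            start = i
            break
    if start is None:
        return  # no ATG found, so nothing is printed

    # Step through the sequence three bases at a time, in frame with the ATG.
    for i in range(start, len(dna) - 2, 3):
        triplet = dna[i:i + 3]
        print(triplet)
        if triplet in STOP_CODONS:
            break  # stop after printing the first stop codon


codon("TCATGACAGCTATGACCATGATTACGGATTCACTGTAACTGTATGTCCGTGTCGTT")
```

The upper bound `len(dna) - 2` in both loops guarantees that every slice is a complete three-character codon.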
In [ ]:
Test your code with the DNA sequence below:
'TCATGACAGCTATGACCATGATTACGGATTCACTGTAACTGTATGTCCGTGTCGTT'
Since your function should only start at the first ATG, you should get the following codons from this test case with this version of codon (shown as a list):
['ATG', 'ACA', 'GCT', 'ATG', 'ACC', 'ATG', 'ATT', 'ACG', 'GAT', 'TCA', 'CTG']
or
['ATG', 'ACA', 'GCT', 'ATG', 'ACC', 'ATG', 'ATT', 'ACG', 'GAT', 'TCA', 'CTG', 'TAA']
Note: Biologically, it doesn't matter whether you include the stop codon (TAA) in the sequence. When we analyze DNA sequence data to look for ORFs (as we often do with new genome sequences), the stop codons are never made into amino acids and are therefore not included in the protein sequence, so we don't really need them. But if you made me pick, I would say include them: it's nice to know that there is actually a stop codon in there and not just some mistaken ending to your algorithm.
Make sure to comment on the accuracy of your function.
In [ ]:
Write, test, and interpret the results of at least four more of your own test cases. They should contain a decent number of bases. Your tests should contain (at least):
In [ ]:
B. Copy, paste, and modify your codon function so that it executes successfully if there are uppercase and/or lowercase letters in the DNA string sequence passed to the function.
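A minimal sketch of one way to handle mixed case is to normalize the input to uppercase once, before any comparison. The rest of the function here follows the same assumptions as a typical part A solution: standard stop codons (TAA, TAG, TGA), one codon printed per line, the stop codon included, and no use of the find method.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")  # assumption: the standard stop codons


def codon(dna):
    """Print codons from the first ATG to the first stop codon, ignoring letter case."""
    dna = dna.upper()  # normalize once so 'atg', 'Atg', and 'ATG' all match

    # Scan for the first ATG without using str.find.
    start = None
    for i in range(len(dna) - 2):
        if dna[i:i + 3] == "ATG":
            start = i
            break
    if start is None:
        return  # no start codon: print nothing

    # Read complete triplets in frame with the start codon.
    for i in range(start, len(dna) - 2, 3):
        triplet = dna[i:i + 3]
        print(triplet)
        if triplet in STOP_CODONS:
            break


codon("tcAtgacaTAGcc")
```

Converting once at the top keeps every later comparison simple; the printed codons come out in uppercase regardless of the input case, which is a design choice rather than a requirement of the assignment.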
In [ ]:
Write, test, and interpret the results of at least two of your own test cases. One should have mixed uppercase and lowercase DNA bases.
In [ ]:
C. Copy, paste, and modify your codon function to return a list of codons instead of printing all the codons.
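The change from printing to returning can be sketched as follows: collect each codon in a list and return the list at the end. This sketch assumes the standard stop codons (TAA, TAG, TGA), includes the stop codon in the result, and returns an empty list when no ATG is found; that last behavior is an assumption, since the assignment only specifies what to do when printing.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")  # assumption: the standard stop codons


def codon(dna):
    """Return a list of codons from the first ATG through the first in-frame stop codon."""
    dna = dna.upper()
    codons = []

    # Scan for the first ATG without using str.find.
    start = None
    for i in range(len(dna) - 2):
        if dna[i:i + 3] == "ATG":
            start = i
            break
    if start is None:
        return codons  # assumption: an empty list signals "no ATG found"

    # Collect complete triplets instead of printing them.
    for i in range(start, len(dna) - 2, 3):
        triplet = dna[i:i + 3]
        codons.append(triplet)
        if triplet in STOP_CODONS:
            break
    return codons


result = codon("TCATGACAGCTATGACCATGATTACGGATTCACTGTAACTGTATGTCCGTGTCGTT")
print(result)
```

Returning the list makes the result easy to test with a direct comparison and lets later functions reuse it.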
In [ ]:
Write, test, and interpret the results of at least two of your own test cases.
D. Copy and paste your codon function. Rename it orf and modify it to return a single string of all of the codons together. This is what we call an ORF (Open Reading Frame). Your ORF can have all of the bases together in one string, or be blocked into codons separated by spaces. Either of these two options is acceptable:
['ATGATGACGATCGTAGATCGGCTGATTAA'] or
['ATG ACA GCT ATG ACC ATG ATT ACG GAT TCA CTG TAA']
See the note above in part A about stop codons.
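A minimal sketch of the space-separated option is shown below. It assumes the standard stop codons (TAA, TAG, TGA), keeps the stop codon in the ORF, and returns an empty string when no ATG is found (an assumption, since that case is not specified for orf). Replacing `" ".join` with `"".join` would give the unseparated option instead.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")  # assumption: the standard stop codons


def orf(dna):
    """Return the first ORF in dna as one string of space-separated codons."""
    dna = dna.upper()

    # Scan for the first ATG without using str.find.
    start = None
    for i in range(len(dna) - 2):
        if dna[i:i + 3] == "ATG":
            start = i
            break
    if start is None:
        return ""  # assumption: an empty string signals "no ORF found"

    # Collect complete triplets in frame, stopping after the first stop codon.
    codons = []
    for i in range(start, len(dna) - 2, 3):
        triplet = dna[i:i + 3]
        codons.append(triplet)
        if triplet in STOP_CODONS:
            break
    return " ".join(codons)


print(orf("TCATGACAGCTATGACCATGATTACGGATTCACTGTAACTGTATGTCCGTGTCGTT"))
```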
In [ ]:
Test your code again with the DNA sequence below:
'TCATGACAGCTATGACCATGATTACGGATTCACTGTAACTGTATGTCCGTGTCGTT'
It should produce output like what is shown above. Make sure to comment on your results.
In [ ]:
Write, test, and interpret the results of at least two of your own test cases. One should have mixed uppercase and lowercase DNA bases.
In [ ]:
E. Copy, paste, and modify your orf function to return multiple ORFs if another ATG in the DNA sequence is found after the stop codon of the initial ORF. The stop codon for the first ORF must be "in frame" (triplet aligned) with its start codon, but any subsequent start codon does not need to be in the same frame with the initial input base or the previous stop codon. For example, the second ATG could be 1, 2, 3, 4, ... bases after the stop codon for the first ORF.
Restriction: For this exercise, you still cannot use the Python find string method.
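One way to sketch the multiple-ORF search is with a single index that alternates between two modes: sliding one base at a time while looking for an ATG, then jumping three bases at a time while reading an ORF. The sketch assumes the standard stop codons (TAA, TAG, TGA), keeps each stop codon, drops any incomplete trailing fragment, and returns a list of space-separated ORF strings. It does not use the find method.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")  # assumption: the standard stop codons


def orf(dna):
    """Return a list of ORFs in dna, each as a string of space-separated codons."""
    dna = dna.upper()
    found = []
    i = 0
    while i <= len(dna) - 3:
        # Search mode: slide one base at a time, so any frame is allowed.
        if dna[i:i + 3] != "ATG":
            i += 1
            continue

        # Read mode: step in triplets, in frame with this ATG.
        codons = []
        while i <= len(dna) - 3:
            triplet = dna[i:i + 3]
            codons.append(triplet)
            i += 3
            if triplet in STOP_CODONS:
                break  # the search resumes right after the stop codon
        found.append(" ".join(codons))
    return found


print(orf("TCATGACAGCTATGACCATGATTACGGATTCACTGTAACTGTATGTCCGTGTCGTT"))
```

Because the search resumes at the base immediately after the stop codon and advances one base at a time, a later ATG is found regardless of its frame. With this sketch the second ORF for the test sequence ends at TCG, since the trailing TT is not a complete codon.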
In [ ]:
Test your code again with the DNA sequence below:
'TCATGACAGCTATGACCATGATTACGGATTCACTGTAACTGTATGTCCGTGTCGTT'
It should produce output like what is shown below (with some minor differences on the ends - TAA in first orf and TT in second may not be present):
['ATG ACA GCT ATG ACC ATG ATT ACG GAT TCA CTG TAA', 'ATG TCC GTG TCG TT']
Make sure to comment on your results.
In [ ]:
Write, test, and interpret the results of at least two more of your own test cases. Make sure at least one has multiple orfs.
In [ ]:
Don't forget to refer to the DNA, RNA, and Protein information that is in the Pre-Activity for Lesson 10.
F. Define a transcribe function that takes a sequence of DNA as its parameter (first to last base is 5'→3' by convention) and returns the corresponding RNA sequence (same 5'→3' convention). Both input and output sequences should be string objects (str).
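A minimal sketch follows, under one important assumption: the input is treated as the coding (sense) strand, so the RNA has the same sequence with every T replaced by U. If the Pre-Activity defines the input as the template strand instead, the function would need to build the reverse complement, which this sketch does not do. Converting to uppercase is also an assumption carried over from the earlier parts.

```python
def transcribe(dna):
    """Return the RNA string for a coding-strand DNA string (T replaced by U)."""
    # Assumption: dna is the coding strand, read 5' to 3', so no reversal is needed.
    return dna.upper().replace("T", "U")


print(transcribe("ATGACAGCT"))
```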
In [ ]:
Write, test, and interpret the results of at least three of your own test cases.
Don't forget to refer to the DNA, RNA, and Protein information that is in the Pre-Activity for Lesson 10. GC content varies across the genomes of different organisms and is used for several different kinds of analyses. Link for more info.
G. Define a gc_percent function that takes a sequence string of DNA or RNA as its parameter, and:
In [ ]:
Write, test, and interpret the results of at least three of your own test cases. At least one should be RNA and one DNA.
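As a starting point for such test cases, here is a sketch of only the core calculation. The full list of requirements for gc_percent is not reproduced in this text, so the sketch assumes the function returns the percentage of bases that are G or C as a float, treats letter case as irrelevant, and returns 0.0 for an empty string. Any additional required behavior would need to be added.

```python
def gc_percent(sequence):
    """Return the percentage of bases in a DNA or RNA string that are G or C."""
    sequence = sequence.upper()
    if len(sequence) == 0:
        return 0.0  # assumption: avoid dividing by zero on empty input
    gc_count = sequence.count("G") + sequence.count("C")
    return 100 * gc_count / len(sequence)


print(gc_percent("ATGC"))  # DNA example
print(gc_percent("AUGC"))  # RNA example
```

Because only G and C are counted, the same code works for DNA and RNA; whether the sequence contains T or U affects only the length.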
In [ ]: