Lesson 13 Individual Assignment

Individual means that you do it yourself. You won't learn to code if you don't struggle for yourself and write your own code. Remember that while you can discuss the general (algorithmic) way to solve a problem, you should not even be looking at anyone else's code or showing anyone else your code for an individual assignment.
Review the Group Work guidelines on Canvas and/or ask an instructor if you have any questions.

Background info

Downloading and parsing files, then extracting and processing the data you need, is essential to modern science. Here we continue our examination of files from biological sequence databases, but most scientific data is housed in databases that have their own file formats, and writing code to open, extract, process, and otherwise manipulate records or data files automatically is a very common task. Here are some examples from other disciplines:

Chemistry:
https://www.rsc.org/merck-index
http://www.chemspider.com/
Physics:
http://ned.ipac.caltech.edu/
Neuroscience:
https://ida.loni.usc.edu/
Environmental Science:
https://edg.epa.gov/metadata/catalog/main/home.page

Programming Practice

Be sure to spell all function names correctly. Misspelled function names will lose points (and often break the code anyway, since no one is sure what to type to call the function). If you prefer to show your earlier scratch work as you figure out what you are doing, please make sure that you put a final, complete, correct version of the function in its own cell, and then call it several times to test it. In other words, separate your thought process and working versions from the final one (a comment that tells us which is the final version would be lovely).

Every function should have at least a docstring at the start that states what it does (see Lesson 3 Team Notebook if you need a reminder). Make other comments as necessary.

Make sure that you are running test cases (plural) for everything and commenting on the results in markdown. Your comments should discuss how you know that the test case results are correct.

A. Pseudocode. Copy and paste your high-level pseudocode for your readXYZfromPDB function from Lesson 13 Team Notebook. High-level pseudocode is an outline: for example, you might just say "loop" instead of writing out a while-loop with its condition and how the variable is incremented inside the loop.

Next, either:

copy and paste your detailed pseudocode for your readXYZfromPDB function from Lesson 13 Team Notebook, or
list the order in which you will develop your code (i.e., a walking skeleton which we will discuss in a future class).

B. You should have a PDB file 2LDJ.pdb from the pre-activity. Make sure that either you have the PDB file in the same directory as this Jupyter notebook, or you provide a complete file path.

Next download 1L2Y.pdb (download from PDB http://www.rcsb.org/pdb/home/home.do) as a second test case.

For each of the two files, list the first three x coordinates, the first three y coordinates, and the first three z coordinates.

2LDJ.pdb:

1L2Y.pdb:

Next, define a readXYZfromPDB function that:

takes the name of an input file as a parameter
opens and reads in the named PDB file, and
returns three different lists containing atomic coordinates (as floats) of all atoms in the first model in the PDB file.

To clarify the output, for the PDB file specified in the argument, your function should return three lists that each contains the floating-point numbers of the x, y, or z coordinates, respectively. For example, the x-coordinate list for 2LDJ.pdb would be [11.030, 9.640, 8.650, 8.185, ...].
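As a minimal sketch of where the coordinates live, the snippet below pulls x, y, and z out of a single ATOM record. It assumes the standard fixed-column PDB layout (x in columns 31-38, y in columns 39-46, z in columns 47-54). The record itself is hypothetical, built field by field so the columns line up; it is not taken from 2LDJ.pdb or 1L2Y.pdb, and the helper name parse_atom_coordinates is only illustrative. It handles one line and does not open files or stop at the end of the first model.

```python
# Hypothetical ATOM record, assembled field by field so every value sits
# in its fixed-width column (made-up values, not from either test file).
sample_line = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom serial number
    " "           # column 12: blank
    " N  "        # columns 13-16: atom name
    " "           # column 17: alternate location indicator
    "ALA"         # columns 18-20: residue name
    " "           # column 21: blank
    "A"           # column 22: chain identifier
    "   1"        # columns 23-26: residue sequence number
    "    "        # columns 27-30: insertion code and blanks
    "   1.234"    # columns 31-38: x coordinate
    "  -5.678"    # columns 39-46: y coordinate
    "   9.012"    # columns 47-54: z coordinate
    "  1.00"      # columns 55-60: occupancy
    "  0.00"      # columns 61-66: temperature factor
    "          "  # columns 67-76: blank
    " N"          # columns 77-78: element symbol
)


def parse_atom_coordinates(line):
    """Return (x, y, z) as floats from one fixed-column ATOM record."""
    # Python slices are 0-based and exclude the end index, so
    # columns 31-38 become line[30:38], and so on.
    x = float(line[30:38])
    y = float(line[38:46])
    z = float(line[46:54])
    return x, y, z


print(parse_atom_coordinates(sample_line))  # (1.234, -5.678, 9.012)
```

Slicing by column is used here instead of line.split() because neighboring fields in a fixed-column record can touch each other with no space in between.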



In [ ]:

Test your readXYZfromPDB function with the files 2LDJ.pdb and 1L2Y.pdb as test cases. Make sure you compare your function's output to the coordinates you listed for each file above.

When you are done testing, write a brief interpretation of the test results.



In [ ]:

C. You are about to define a plain_coord function that extracts the coordinates from a PDB file. It should:

take two strings as its parameters:
1. the name of a pdb file and
2. the name of an output file
read the x, y, and z coordinates from the first model in the pdb file and
write a text file as output that is nicely formatted with X-Y-Z coordinates and the atom type designated by the atomic number from the periodic table (e.g., hydrogen = 1, carbon = 6, nitrogen = 7, oxygen = 8 and sulfur = 16)

Example output in the file would look like this:

6   -1.415689945221    0.020083000883    0.255733013153
6   -0.670023024082    1.240668058395   -0.075638003647
6    0.656983017921    1.203186035156   -0.208253994584
1    1.132277011871   -2.126605033875   -0.221330001950
1    1.316627025604   -0.007073000073    1.437821984291
1    0.618321001530   -0.923030018806   -1.357839941978
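The layout of the example lines above can be reproduced with a format specification. This is a sketch under assumptions: the field width of 18 characters with 12 decimal places is inferred from the sample output, not stated by the assignment, and the helper name format_coord_line and the ATOMIC_NUMBERS dictionary are illustrative. It also assumes you already have the element symbol for each atom.

```python
# Atomic numbers for the elements named in the assignment.
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8, "S": 16}


def format_coord_line(element, x, y, z):
    """Return one output line: atomic number, then x, y, z in fixed-width columns."""
    number = ATOMIC_NUMBERS[element.strip().upper()]
    # Width 18 with 12 decimals is an assumption read off the example output.
    return f"{number}{x:18.12f}{y:18.12f}{z:18.12f}"


print(format_coord_line("C", -1.415689945221, 0.020083000883, 0.255733013153))
# 6   -1.415689945221    0.020083000883    0.255733013153
```

An element missing from the dictionary raises a KeyError, which is a reasonable way to find out that a test file contains an atom type you have not handled yet.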

Before writing any Python code, provide pseudocode that shows how the plain_coord function will work with the readXYZfromPDB function from Part B.

Now define the plain_coord function:



In [ ]:

Test your plain_coord function with at least the files 2LDJ.pdb and 1L2Y.pdb. Make sure you also predict and report at least part of the expected output for these test cases.

When you are done testing, write a brief interpretation of the test results.



In [ ]:

D. First write high-level pseudocode for a function called multi_translate that will read in a single DNA sequence in FASTA file format (see below and Lesson 13 Pre-Activity) and output all of the translated protein sequence(s) (ORFs) from that DNA in FASTA format. You have already written the "guts" of this in the Lesson 10 and Lesson 11 Individual assignments, so you should have functions that make codons and translate. The new parts are the input and output files and parsing/outputting FASTA format.

Your function should:

take the name/path of the DNA FASTA file as a parameter
open that file to read the DNA sequence (skipping over the comment line)
call all appropriate functions (written previously, copied, pasted, and defined again in the same cell here) that will convert the DNA sequence to a single letter code protein sequence
write out a text file of the amino acid sequence(s) in FASTA format. This file should:
- have the same name as the DNA FASTA file, but with "_ORFS" appended (e.g., if the input DNA FASTA is "test.fasta", then the protein FASTA output should be "test_ORFS.fasta")
- give each protein ORF a description line with "ORF" and a number, e.g. ORF1, ORF2, etc. (if there is only one, then ORF1 is fine)
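The two output requirements above (the "_ORFS" file name and the numbered description lines) can be sketched as follows. The helper names orf_output_name and format_orf_records are hypothetical, and the sketch only builds the name and the text; opening the file and writing to it is left to your own function.

```python
import os


def orf_output_name(dna_fasta_path):
    """Return the output path: the input path with "_ORFS" before the extension."""
    root, ext = os.path.splitext(dna_fasta_path)
    return f"{root}_ORFS{ext}"


def format_orf_records(proteins):
    """Return FASTA text with one >ORFn description line per protein sequence."""
    records = []
    for number, protein in enumerate(proteins, start=1):
        records.append(f">ORF{number}\n{protein}\n")
    return "".join(records)


print(orf_output_name("test.fasta"))  # test_ORFS.fasta
print(format_orf_records(["MLVKAPRLEMQ"]), end="")
```

os.path.splitext is used so that a path with directories or a different extension still gets "_ORFS" inserted just before the extension.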

More info on FASTA format here.

Examples:
(input file test.fasta, contents below)

>short test seq
TCATGTTGGTGAAGGCGCCACGACTCGAGATGCAGTAGGAGG

(expected output - file name test_ORFS.fasta)

>ORF1
MLVKAPRLEMQ
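One way to convince yourself that the expected output is right is to split the example sequence into codons by hand or with a few lines of code. The sketch below assumes the ORF starts at the first ATG and ends at the first in-frame stop codon; your functions from earlier lessons may define ORFs differently, and the helper name first_orf_codons is hypothetical. It stops short of translating, since you already have a translate function.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_orf_codons(dna):
    """Return the codons from the first ATG up to, not including, the first in-frame stop."""
    start = dna.find("ATG")
    if start == -1:
        return []  # no start codon found
    codons = []
    # Step through the sequence three bases at a time, ignoring a partial codon at the end.
    for i in range(start, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in STOP_CODONS:
            break
        codons.append(codon)
    return codons


# Contents of test.fasta as given in the example above.
fasta_text = ">short test seq\nTCATGTTGGTGAAGGCGCCACGACTCGAGATGCAGTAGGAGG\n"

# Skip the description line (it starts with ">") and join the sequence lines.
sequence = "".join(
    line.strip() for line in fasta_text.splitlines() if not line.startswith(">")
)

codons = first_orf_codons(sequence)
print(codons)
print(len(codons))  # 11, one codon per letter in MLVKAPRLEMQ
```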

Now write the multi_translate function described and pseudocoded above.



In [ ]:

Test your multi_translate function with the following files (from Canvas):

test.fasta
test2.fasta
EU285557.fasta

The expected output for test.fasta is shown above.

When you are done testing, write a brief interpretation of the test results.



In [ ]: