Programming Bootcamp 2016

Lesson 7 Exercises

!!! This is the last graded problem set! And it's due early: 11pm on 9/29 !!!

Winners will be announced after the final lecture on 9/30, and mugs will be distributed then.
If you earn a mug but can't attend, you can pick it up from me later. Send me an email to arrange a time.

Earning points (optional)

Enter your name below.
Email your .ipynb file to me (sarahmid@mail.med.upenn.edu) before 11:00 pm on 9/29.
You do not need to complete all the problems to get points.
I will give partial credit for effort when possible.
At the end of the course, everyone who gets at least 90% of the total points will get a prize (bootcamp mug!).

Name:

1. Creating your function file (1pt)

Up until now, we've created all of our functions inside the Jupyter notebook. However, in order to use your functions across different scripts, it's best to put them into a separate file. Then you can load this file from anywhere with a single line of code and have access to all your custom functions!

Do the following:

Open up your plain-text editor.
Copy and paste all of the functions you created in lab6, question 3, into a blank file and save it as my_utils.py (don't forget the .py extension). Save it anywhere on your computer.
Edit the code below so that it imports your my_utils.py functions. You will need to change '../utilities/my_utils.py' to where you saved your file.

Some notes:

You need to supply the path relative to where this notebook is. See the slides for some examples on how to specify paths.
You can use this method to import your functions from anywhere on your computer! In contrast, a regular import statement will only find your custom functions if the file is in the same directory as the current notebook/script (or in another directory on Python's module search path).



In [ ]:

    
import imp
my_utils = imp.load_source('my_utils', '../utilities/my_utils.py') #CHANGE THIS PATH

# test that this worked
print "Test my_utils.gc():", my_utils.gc("ATGGGCCCAATGG")
print "Test my_utils.reverse_compl():", my_utils.reverse_compl("GGGGTCGATGCAAATTCAAA")
print "Test my_utils.read_fasta():", my_utils.read_fasta("horrible.fasta")
print "Test my_utils.rand_seq():", my_utils.rand_seq(23)
print "Test my_utils.shuffle_nt():", my_utils.shuffle_nt("AAAAAAGTTTCCC")

print "\nIf the above produced no errors, then you're good!"

Feel free to use these functions (and any others you've created) to solve the problems below. You can see in the test code above how they can be accessed.
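A side note for anyone running a newer Python: the imp module used above is deprecated in Python 3 and was removed in Python 3.12. A rough equivalent can be sketched with importlib.util. The helper name load_module_from_path is hypothetical (it is not part of the course materials), and the path in the usage comment is the same placeholder as in the cell above.

```python
import importlib.util

def load_module_from_path(name, path):
    """Load the .py file at `path` as a module called `name` (Python 3 only)."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file, defining its functions
    return module

# Usage (placeholder path; change it to wherever you saved my_utils.py):
# my_utils = load_module_from_path('my_utils', '../utilities/my_utils.py')
```

Once loaded this way, the functions are accessed exactly as in the test code above, e.g. my_utils.gc(...).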

2. Command line arguments (6pts)

Note: Do the following in a SCRIPT, not in the notebook. You cannot use command line arguments within Jupyter notebooks.

After testing your code as a script, copy and paste it here for grading purposes only.

(A) Write a script that expects four arguments and prints them to the screen. Test this script by running it (on the command line) as shown in the lecture. Copy and paste the code below once you have it working.



In [ ]:

(B) Write a script that expects 3 numerical arguments ("a", "b", and "c") from the command line.

Check that the correct number of arguments is supplied (based on the length of sys.argv)
If not, print an error message and exit
Otherwise, go on to add the three numbers together and print the result.

Copy and paste your code below once you have it working.

Note: All command line arguments are read in as strings (just like with raw_input). To use them as numbers, you must convert them with float().
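To see why the conversion matters, here is a minimal sketch that uses a made-up list in place of sys.argv (the script name and values are hypothetical). It deliberately multiplies two numbers rather than solving the exercise, and the function name multiply_args is an assumption for illustration only. The print() calls are written so they behave the same under Python 2 and Python 3.

```python
def multiply_args(argv):
    """Expect argv = [script_name, x, y]; return x * y, or None if the count is wrong."""
    if len(argv) != 3:  # the script name itself is argv[0]
        print("Usage: %s x y" % argv[0])
        return None
    x = float(argv[1])  # arguments arrive as strings, so convert first
    y = float(argv[2])
    return x * y

fake_argv = ["multiply.py", "2", "3.5"]  # stand-in for sys.argv
print(fake_argv[1] + fake_argv[2])       # string concatenation: prints 23.5
print(multiply_args(fake_argv))          # numeric result: prints 7.0
```

In a real script you would pass sys.argv instead of fake_argv and exit (for example with sys.exit()) when the argument count is wrong.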



In [ ]:

(C) Here you will create a script that generates a random dataset of sequences.

Your script should expect the following command line arguments, in this order. Remember to convert strings to ints when needed:

outFile - string; name of the output file the generated sequences will be printed to
numSeqs - integer; number of sequences to create
minLength - integer; minimum sequence length
maxLength - integer; maximum sequence length

The script should read in these arguments and check if the correct number of arguments is supplied (exit if not).

If all looks good, then print the indicated number of randomly generated sequences as follows:

the length of each individual sequence should be randomly chosen to be between minLength and maxLength (so that not all sequences are the same length)
each sequence should be given a unique ID (e.g. using a counter to make names like seq1, seq2, ...)
the output should be in fasta format (>seqID\nsequence\n)
the output should be printed to the indicated file
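As an illustration of the record format only (not a full solution), the sketch below builds a single fasta record with a randomly chosen length. The helper name random_fasta_record is hypothetical, and picking each nucleotide uniformly from "ACGT" is an assumption; your own rand_seq() function from problem 1 could stand in for that line.

```python
import random

def random_fasta_record(seq_id, min_len, max_len):
    """Return one fasta-formatted record containing a random DNA sequence."""
    length = random.randint(min_len, max_len)  # inclusive on both ends
    seq = "".join(random.choice("ACGT") for _ in range(length))
    return ">%s\n%s\n" % (seq_id, seq)

record = random_fasta_record("seq1", 5, 10)
print(record)
```

Writing many such records to a file is then a matter of looping with a counter and calling write() on an open output file.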

Then, run your script to create a file called fake.fasta containing 100,000 random sequences of random length 50-500 nt.

Copy and paste your code below once you have it working.



In [ ]:

3. `time` practice (7pts)

For the following problems, use the file you created in the previous problem (fake.fasta) and the time.time() function. (Note: there is also a copy of fake.fasta on Piazza if you need it.)

Note: Do not include the time it takes to read the file in your time calculation! Loading files can take a while.
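A common pattern with time.time() is to record a start time just before the code of interest and subtract it from the time afterwards. The sketch below shows this, with a made-up list standing in for data already read from a file, so that setup stays outside the timed region. The variable names and the stand-in data are assumptions for illustration.

```python
import time

# Setup that should NOT be timed (stands in for reading a file)
data = ["ACGT" * 100 for _ in range(1000)]  # hypothetical stand-in data

start = time.time()            # start the clock only after setup is done
total = 0
for seq in data:
    total += len(seq)
elapsed = time.time() - start  # elapsed wall-clock time in seconds (a float)

print("Total length: %d" % total)
print("Elapsed: %f seconds" % elapsed)
```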

(A) Initial practice with timing. Add code to the following cell to time how long it takes to run. Print the result.



In [ ]:

    
sillyList = []
for i in range(50000):
    sillyList.append(sum(sillyList))

(B) Counting characters. Is it faster to use the built-in function str.count() or to loop through a string and count characters manually? To find out, count all the A's in all the sequences in fake.fasta using each method, and time how long each one takes to run.

(You do not need to output the counts)



In [ ]:

    
# Method 1 (Manual counting)



In [ ]:

    
# Method 2 (.count())

Which was faster? Your answer:

(C) Replacing characters. Is it faster to use the built-in function str.replace() or to loop through a string and replace characters manually? To find out, replace all the T's with U's in all the sequences in fake.fasta using each method, and time how long each one takes to run.

(You do not need to output the edited sequences)



In [ ]:

    
# Method 1 (Manual replacement)



In [ ]:

    
# Method 2 (.replace())

Which was faster? Your answer:

(D) Lookup speed in data structures. Is it faster to get unique IDs using a list or a dictionary? Read in fake.fasta, ignoring everything but the header lines. Count the number of unique IDs (headers) once using a list and once using a dictionary, and compare how long each method takes to run.

Be patient; this one might take a while to run!



In [ ]:

    
# Method 1 (list)



In [ ]:

    
# Method 2 (dictionary)

Which was faster? Your answer:

If you're curious, below is a brief explanation of the outcomes you should have observed:

(B) The built-in method should be much faster! Most built-in functions are pretty well optimized, so they will often (but not always) be faster.

(C) Again, the built-in function should be quite a bit faster.

(D) If you did this right, then the dictionary should be faster by several orders of magnitude. When you use a dictionary, Python computes a hash of the requested key and jumps directly to where that key would be stored, whether or not it is actually there. This is very fast (it's an O(1) operation on average, for those who are familiar with the terminology). With lists, on the other hand, Python scans through the list element by element until it finds the requested element (or reaches the end). This gets slower on average as you add more elements (it's an O(n) operation). Just something to keep in mind if you start working with very large datasets!
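The difference described in (D) can be seen on a small scale with the sketch below. The IDs are made up (hypothetical names, each repeated twice), and the exact timings will vary from machine to machine, so no particular numbers are promised here.

```python
import time

ids = ["seq%d" % i for i in range(2000)] * 2  # hypothetical IDs, each appearing twice

start = time.time()
seen_list = []
for x in ids:
    if x not in seen_list:   # scans the list element by element: O(n) per check
        seen_list.append(x)
list_time = time.time() - start

start = time.time()
seen_dict = {}
for x in ids:
    if x not in seen_dict:   # hash lookup: O(1) on average
        seen_dict[x] = True
dict_time = time.time() - start

print("list: %d unique in %f s" % (len(seen_list), list_time))
print("dict: %d unique in %f s" % (len(seen_dict), dict_time))
```

Both approaches find the same number of unique IDs; only the time taken differs, and the gap widens as the number of IDs grows.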

4. `os` and `glob` practice (6pts)

Use horrible.fasta as a test fasta file for the following.

(A) Write code that prompts the user (using raw_input()) for two pieces of information: an input file name (assumed to be a fasta file) and an output folder name (does not need to already exist). Then do the following:

Check if the input file exists
If it doesn't, print an error message
Otherwise, go on to check if the output folder exists
If it doesn't, create it
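The checks listed above map onto a few functions in the os module. Below is a minimal sketch, assuming the file and folder names have already been obtained from the user; the function name prepare_output is hypothetical, and the prompting step is intentionally left out.

```python
import os

def prepare_output(in_file, out_dir):
    """Return True if in_file exists (creating out_dir if needed), else False."""
    if not os.path.isfile(in_file):
        print("Error: cannot find input file %s" % in_file)
        return False
    if not os.path.isdir(out_dir):
        os.mkdir(out_dir)  # creates one folder; its parent must already exist
    return True
```

os.path.exists() would also work for either check; isfile() and isdir() are used here because they additionally confirm that the path is the expected kind of thing.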



In [ ]:

(B) Add to the code above so that it also does the following after creating the output folder:

Read in the fasta file (ONLY if it exists)
Print each individual sequence to a separate file in the specified output folder.
The files should be named <SEQID>.fasta, where <SEQID> is the name of the sequence (from the fasta header)



In [ ]:

(C) Now use glob to get a list of all files in the output folder from part (B) that have a .fasta extension. For each file, print just the file name (not the file path) to the screen.
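As a hint at the pieces involved: glob.glob() returns full paths that match a wildcard pattern, and os.path.basename() strips a path down to its final component. The sketch below combines the two; the function name list_names and its default pattern are assumptions for illustration.

```python
import glob
import os

def list_names(folder, pattern="*.fasta"):
    """Return the sorted bare file names in `folder` that match `pattern`."""
    paths = glob.glob(os.path.join(folder, pattern))  # full paths to matches
    return sorted(os.path.basename(p) for p in paths)
```

Building the pattern with os.path.join() avoids hard-coding a path separator, so the same code works on Windows, Mac, and Linux.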



In [ ]:
