Programming Bootcamp 2016

Lesson 7 Exercises - ANSWERS



1. Creating your function file (1pt)

Up until now, we've created all of our functions inside the Jupyter notebook. However, in order to use your functions across different scripts, it's best to put them into a separate file. Then you can load this file from anywhere with a single line of code and have access to all your custom functions!

Do the following:

  • Open up your plain-text editor.
  • Copy and paste all of the functions you created in lab6, question 3, into a blank file and save it as my_utils.py (don't forget the .py extension). Save it anywhere on your computer.
  • Edit the code below so that it imports your my_utils.py functions. You will need to change '../utilities/my_utils.py' to where you saved your file.

Some notes:

  • You need to supply the path relative to where this notebook is. See the slides for some examples on how to specify paths.
  • You can use this method to import your functions from anywhere on your computer! In contrast, the regular import function will only find custom functions that are in the same directory as the current notebook/script.

In [3]:
import imp
my_utils = imp.load_source('my_utils', '../utilities/my_utils.py') #CHANGE THIS PATH

# test that this worked
print "Test my_utils.gc():", my_utils.gc("ATGGGCCCAATGG")
print "Test my_utils.reverse_compl():", my_utils.reverse_compl("GGGGTCGATGCAAATTCAAA")
print "Test my_utils.read_fasta():", my_utils.read_fasta("horrible.fasta")
print "Test my_utils.rand_seq():", my_utils.rand_seq(23)
print "Test my_utils.shuffle_nt():", my_utils.shuffle_nt("AAAAAAGTTTCCC")

print "\nIf the above produced no errors, then you're good!"


Test my_utils.gc(): 0.62
Test my_utils.reverse_compl(): TTTGAATTTGCATCGACCCC
Test my_utils.read_fasta(): {'varlen2_uc007xie.1_4456': 'ACACCGGCTGAAACAGCTGGCAGCACCACAGGCCCAGCTGTAAGCACAACCTCTGCTGGTTCCACATCTGCAAGAGCCGCCACCACCTCCCCAGGAGGTT', 'varlen2_uc010mlp.1_79': 'CCGCGCAGCGCCTACCCTTTCCTCGCTCCGGGCCGGCAGTGTGGGGCGGCGCGCTGGGGGCGCGGCGTGTCTGGGGACATCTTGTGATGTTGGCGAGAACAGGACATGATCTCACATGGCGAGAAGCTCTTTAGTTCCTTAATCATTTCACGGTGCCTTCGGACGCTTTTTTTCCACCTAAAACGTTTAGTTTCAGCTCAGTGATCAG', 'varlen2_uc009bxt.1_1728': 'AGAAGAAGGTTTTTATGAGTTAATTTTACCCATAAATATTTCAGTAATTTTGGAGTGGAAGTTGGGTTGCTAACATCAGAGCTCATGTTCATGTTTTAGAAGCCCTTTCTTTCTAGGTGTCTCTCAAAAGACTGGCTTAATTGGAATAAAATAGCTTTTTTTTTTTAAGTGTCCCAAAGATAAGATATCTTGGGGTCCACCATGAGCTTCCAGTTTCAGCCGAGGCAGCAGTTTTCCTGGGATAATTAGCCAAGGGTTCCCCAGAGTCTTCCTAGGTATCTCTGGGAACCAGCAGAAGGAATTTTCACCTA', 'varlen2_uc009div.2_242': 'CCAACGGTCATCAAGAGAAGTGCGCCTTCCTCCTCAAACCGACGCGAGGGACGTTCCTTAGGGGT', 'varlen2_uc003its.2_2976': 'CGCCCACCTGGATGTCCCAAAGTGCTAGATTCCATGGAATTTTCCTGCATGGAGTGCTACATCCTCTTGTTCCAGTCTCAGCTGGCTGCTCCGTGGGCATACTTCATACCTTTCATCACAGAGCTTTGCTTGGCCTCTCTCCTGTGTGTAATCTCTGTTTCTTGGATCCTGTGCTTACC', 'varlen2_uc003nvg.4_2466': 'CTCGGGCTGGCAGCAGTGAAAGTATTGCT', 'varlen2_uc029ygd.1_73': 'AGAAGAATGGCTCTTAAGAAATTGAAGGTGATAC', 'varlen2_uc007kxx.1_2963': 'TGTGTACAAAGATCAAATATCAGTATTGCCAAATGAACAAGACCTTGCGAGAGAAGAAGCTCAGAAAAT', 'varlen2_uc009wph.3_423': 'AAGGGGGTTTTCTGGGCCTCTCCAATATTAAGTTCAGGCCAGGATCTGTGGTGGTACAATTGACTCTGGCCTTCCGAGAAGGTAC', 'varlen2_uc010osx.2_1007': 'CAAATTTGATGGGAAGAGGTTCCAGAATGTTGTCAAGGAAGGAATGAAGTTTGATAAAAGTGAGAAAACTAAGGAGAGTCATGAAGCAGTTGAGAAAGAATTTGAGCCTCTGCTCAATTGGATGAAAGATAAAGCCCTTAAGGACAAAATTGAGAAGGCTGTGGTGTCTCAGCACCTGACAGAATCTCTGTGTGCTTTG', 'varlen2_uc001agr.3_7': 'TCGCTGCGCTGGGCCTCTCGGCGGTCCCCCCCACCCGGGTCCCGGGCCC', 'varlen2_uc001pmn.3_3476': 'TCTTACTATTATGTGGAATTTAACTTTTCAGAGAAAATTTCCAAAGTCCTCTACCTTATTTGGCTTTTATCTGTGTCTTTTGGAAAAGGAGCTGCATGGCCTCAGTGCCTCCATAAAAAGCTTAAGTTGAGTTCATGGGAGAACATTAAGATCGTTTTTTTTCCCTTTTAGCTCGGTAATCTCATGTACCACAGGTAATGACCAAGCATCTCTGCTAGAGGATGTAACTCTCATAACTCCTGTGTAAAATGAGGATCCTGTTGGTGATCCTCATTCCAGTTTCCAAAGC', 'varlen2_uc003khi.3_3': 'CACTATGTGCTTCCGGTCTCCAATACCGCAGGAGGGCGGTCTTCCCCGGCTCGCCAACTCGGCTGCTCTGGGGGATTCGTGCGCGGTACCACTGATGAAATTATGACCAGGTGGGCCCGAGTTAGTACCACATATAACAAGAGACCCTTGCCTGCAACATCATGGGAGGACATGAAGAAGGGATCCTTTGAGGGAACAAGCCAAAACCTACCAAAGCGTAAACAACTTGAAGCCAATAGGCTATCCCTCAAAAATGATGCACCCCAAGCAAAACATAAAAAGAACAAAAAGAAAAAAGAGTACTTAAATGAAGATGTGAATGGATTCATGGAATACCTAAGACAGAATTCACAGATGGTTCACAATGGGCAAATTATAGCAACAGACAGTGAGGAAGTAAGGGAAGAAATTGCAGTTGCTTTAAAGAAAGACAGTCGACGGGAAGGAAGAAGATTAAAA', 'varlen2_uc021qfk.1>2_1472': 'CCTTGGATAAGGAAATTACATCAGCAAATTATGCTGGTGTCTGTACATCATCTGTGATTAAAGAAGAAAACATTGA', 'varlen2_uc011moe.2_5914': 'CTGAGGTCTAACATTGTTATCCTATATCATTTTCATCCCAAGTAGTGATATCTGTGAAATCACAGGTTTGATGTGTGCTAATTATGTATTCTTCTAATACATATTAAAAGACATAACTATCAAAACAAAATAAATTTGTCTGTTTTCAACCAAAGAAGTCACGTACCACTGGTGGTACTGTGTGCCATAATTTGGCAAATGCTGGCCTTTATGGACGAGCACAATTCGGGGGTCAGACCTGGTTCAAATTCTAGCTGTAGAAACTTGTGCAAGTTACTTCACCTCTGAGCCTAAGTTTCCACATCTGTAAAAGGAGATAATAAACACCTACCTTGCAGTAGTGAAGCAAAGAGAAAATTAAATATATATGAAG', 'varlen2_uc003hyy.2_273': 'CCACTCTGGAAATCTT', 'varlen2_uc007nte.2_374': 'GGACTGACCTCTGTCTTCCAGATTCAAGTGCTACTGCTTTCTCAGCTTCTGGAGATGGTGTAGTTTCAAGAAACCAGAGTTTCCTGAGAACTGCAATTCAAAGGACACCTCATGAAGTAATGAGAAGAGAAAGCCACAGACTGTCTGCCCCTTCTTACCTTGTCAGGAGCCTAGCAGATGTCCCTCGAGAGTGTGGCTCATCACAGTCATTTTTGACAGAAGTTAACTTTGCTGTTGAGAATGGAGACTCTGGCTCCCGATACTTCTTCTCAGATAACTTTTTTGATGGACAGAGAAGGCGGCCACTTGGAGATCGTGCACAAGAAGATTACAGATATTATGAATACAACCATGATCTCTTCCAGAGGATGCCACAGAGTCAGGGGAGGCACACTTCAGGTATTGGGAGAGTCACGGCTACATCTCTAGGGAATTTAACTAACCATGG', 'varlen2_uc007fws.1_377': 'AAACCAGTCCCGGAAAGG', 'varlen2_uc003pij.1_129': 'GATTTTATTATGGCAGAAGTCAATTT', 'varlen2_uc002wkt.1_1569': 'AGGCTCAGCCCGCTGGAGCATGAGCTCTGTGTGTACCGAGAACTGGGGTGATGTTTTACTTTTCACAGTATGGGCTACACAGCAGCTGTTCA', 'varlen2_uc010suq.2_3895': 'TAAGATGTCAGTATTCTCCTCATGGATATTTTCAGTCTCAATGTTGCCAGTCTAATAGTATAGTATGTGGTTGCTTTACTGCTGTTCTCCCCACCCCCGTGGAGTTGTGTCATTATTTTAATGAATGTGAGCTCTTGACTTACTCTAGAATTCTAATACAGATACTTTCTGCACTAGTATCACTTTTATTATTGCAGTTCCATTTAAAAGCATAACTGGCTAAGTCACCGCCCCACCCGCCGCATTACATTTCCTAAAACATACTGCTGCCAAAGACCAACTGTAGAATCCTTAAGTCCTGTTCAAGTGTCACTTCATTTTTTATTAGGGGTTTCTTAGTTGGAGGGCATGAGAAACTTGGCCCTTTCCAATGTTGTCTTCACATTGGAGTATAAGAATTCTCTATAGGACTCTTTAATTTTTTTTTTTTTTTGGTAGAGATGGGGTCTTGCTGTGTTGCCCAGGCTGGTCTCAAACTCCTAGGCTTAAGC', 'varlen2_uc003yos.2_1634': 'CCATCAGGAAAATGCTTTTTCCAGGGAGACCACGGATTTGATAACAAGGTCAACAGCATGCAGACTGTTTTTGTAGGTTATGGCTCAACATTTAAGTACAAGACTAAAGTGCCTCCATTTGAAAACATTGAACTTTACAATGTTATGTGTGATCTCCTGGGATTGAAGCCAGCTCCTAATAATGGGACCCATGGAAGTTTGAATCATCTCCTGCGCACTAATACCTTCAGGCCAACCATGC'}
Test my_utils.rand_seq(): CCACCTAATCCCACCAACATACA
Test my_utils.shuffle_nt(): TAGCAATCAATAC

If the above produced no errors, then you're good!

Feel free to use these functions (and any others you've created) to solve the problems below. You can see in the test code above how they can be accessed.


2. Command line arguments (6pts)

Note: Do the following in a SCRIPT, not in the notebook. You can not use command line arguments within Jupyter notebooks.

After testing your code as a script, copy and paste it here for grading purposes only.

(A) Write a script that expects 4 arguments, and prints those four arguments to the screen. Test this script by running it (on the command line) as shown in the lecture. Copy and paste the code below once you have it working.


In [ ]:
import sys

print sys.argv[1]
print sys.argv[2]
print sys.argv[3]
print sys.argv[4]

(B) Write a script that expects 3 numerical arguments ("a", "b", and "c") from the command line.

  • Check that the correct number of arguments is supplied (based on the length of sys.argv)
  • If not, print an error message and exit
  • Otherwise, go on to add the three numbers together and print the result.

Copy and paste your code below once you have it working.

Note: All command line arguments are read in as strings (just like with raw_input). To use them as numbers, you must convert them with float().


In [ ]:
import sys

if len(sys.argv) == 4:
    a = float(sys.argv[1])
    b = float(sys.argv[2])
    c = float(sys.argv[3])
else: 
    print "Incorrect args. Please enter three numbers."
    sys.exit()

print a + b + c

(C) Here you will create a script that generates a random dataset of sequences.

Your script should expect the following command line arguments, in this order. Remember to convert strings to ints when needed:

  1. outFile - string; name of the output file the generated sequences will be printed to
  2. numSeqs - integer; number of sequences to create
  3. minLength - integer; minimum sequence length
  4. maxLength - integer; maximum sequence length

The script should read in these arguments and check if the correct number of arguments is supplied (exit if not).

If all looks good, then print the indicated number of randomly generated sequences as follows:

  • the length of each individual sequence should be randomly chosen to be between minLength and maxLength (so that not all sequences are the same length)
  • each sequence should be given a unique ID (e.g. using a counter to make names like seq1, seq2, ...)
  • the output should be in fasta format (>seqID\nsequence\n)
  • the output shold be printed to the indicated file

Then, run your script to create a file called fake.fasta containing 100,000 random sequences of random length 50-500 nt.

Copy and paste your code below once you have it working.


In [ ]:
import sys, random, imp
my_utils = imp.load_source('my_utils', '../utilities/my_utils.py')

if len(sys.argv) == 5:
    outFile = sys.argv[1]
    numSeqs = int(sys.argv[2])
    minLength = int(sys.argv[3])
    maxLength = int(sys.argv[4])
else: 
    print "Incorrect args. Please enter an outfile, numSeqs, minLength, maxLength."
    sys.exit()

outs = open(outFile, 'w')
for i in range(numSeqs):
    randLen = random.randint(minLength, maxLength)
    randSeq = my_utils.rand_seq(randLen)
    seqName = "seq" + str(i)
    outs.write(">" + seqName + "\n" + randSeq + "\n")

outs.close()

3. time practice (7pts)

For the following problems, use the file you created in the previous problem (fake.fasta) and the time.time() function. (Note: there is also a copy of fake.fasta on Piazza if you need it.)

Note: Do not include the time it takes to read the file in your time calculation! Loading files can take a while.

(A) Initial practice with timing. Add code to the following cell to time how long it takes to run. Print the result.


In [1]:
import time

start = time.time()

sillyList = []
for i in range(50000):
    sillyList.append(sum(sillyList))

end = time.time()

print end - start


6.65499997139

(B) Counting characters. Is it faster to use the built-in function str.count() or to loop through a string and count characters manually? Compare the two by counting all the A's in all the sequences in fake.fasta using each method and comparing how long they take to run.

(You do not need to output the counts)


In [4]:
# Method 1 (Manual counting)
seqDict = my_utils.read_fasta("fake.fasta")

start = time.time()

for seqID in seqDict:
    seq = seqDict[seqID]
    count = 0
    for char in seq:
        if char == "A":
            count += 1
            
end = time.time()
print end - start


2.55799984932

In [5]:
# Method 2 (.count())
seqDict = my_utils.read_fasta("fake.fasta")

start = time.time()

for seqID in seqDict:
    seq = seqDict[seqID]
    count = seq.count("A")
            
end = time.time()
print end - start


0.111999988556

Which was faster? Method 2, str.count()

(C) Replacing characters. Is it faster to use the built-in function str.replace() or to loop through a string and replace characters manually? Compare the two by replacing all the T's with U's in all the sequences in fake.fasta using each method, and comparing how long they take to run.

(You do not need to output the edited sequences)


In [6]:
# Method 1 (Manual replacement)
seqDict = my_utils.read_fasta("fake.fasta")

start = time.time()

for seqID in seqDict:
    seq = seqDict[seqID]
    newSeq = ""
    for char in seq:
        if char == "T":
            newSeq += "U"
        else:
            newSeq += char
            
end = time.time()
print end - start


5.87399983406

In [7]:
# Method 2 (.replace())
seqDict = my_utils.read_fasta("fake.fasta")

start = time.time()

for seqID in seqDict:
    seq = seqDict[seqID]
    newSeq = seq.replace("T", "U")
            
end = time.time()
print end - start


0.12700009346

Which was faster? Method 2, str.replace()

(D) Lookup speed in data structures. Is it faster to get unique IDs using a list or a dictionary? Read in fake.fasta, ignoring everything but the header lines. Count the number of unique IDs (headers) using a list or dictionary, and compare how long each method takes to run.

Be patient; this one might take a while to run!


In [8]:
# Method 1 (list)
seqDict = my_utils.read_fasta("fake.fasta")

uniqueIDs = []
start = time.time()

for seqID in seqDict:
    if seqID not in uniqueIDs:
        uniqueIDs.append(seqID)
            
end = time.time()
print end - start


101.358999968

In [9]:
# Method 2 (dictionary)
seqDict = my_utils.read_fasta("fake.fasta")

uniqueIDs = {}
start = time.time()

for seqID in seqDict:
    if seqID not in uniqueIDs:
        uniqueIDs[seqID] = True
            
end = time.time()
print end - start


0.0309998989105

Which was faster? Method 2, dictionary

If you're curious, below is a brief explanation of the outcomes you should have observed:

(B) The built-in method should be much faster! Most built in functions are pretty well optimized, so they will often (but not always) be faster.

(C) Again, the built in function should be quite a bit faster.

(D) If you did this right, then the dictionary should be faster by several orders of magnitude. When you use a dictionary, Python jumps directly to where the requested key should be, if it were in the dictionary. This is very fast (it's an O(1) operation, for those who are familiar with the terminology). With lists, on the other hand, Python will scan through the whole list until it finds the requested element (or until it reaches the end). This gets slower and slower on average as you add more elements (it's an O(n) operation). Just something to keep in mind if you start working with very large datasets!


4. os and glob practice (6pts)

Use horrible.fasta as a test fasta file for the following.

(A) Write code that prompts the user (using raw_input()) for two pieces of information: an input file name (assumed to be a fasta file) and an output folder name (does not need to already exist). Then do the following:

  • Check if the input file exists
  • If it doesn't, print an error message
  • Otherwise, go on to check if the output folder exists
  • If it doesn't, create it

Note: I did this below with command line args instead of raw_input() to give more examples of using args


In [ ]:
import sys, os

# read in command line args
if len(sys.argv) == 3:
    inputFile = sys.argv[1]
    outputFolder = sys.argv[2]
else:
    print ">>Error: Incorrect args. Please provide an input file name and an output folder. Exiting."
    sys.exit()

# check if input file / output directory exist
if not os.path.exists(inputFile):
    print ">>Error: input file (%s) does not exist. Exiting." % inputFile
    sys.exit()
    
if not os.path.exists(outputFolder):
    print "Creating output folder (%s)" % outputFolder
    os.mkdir(outputFolder)

(B) Add to the code above so that it also does the following after creating the output folder:

  • Read in the fasta file (ONLY if it exists)
  • Print each individual sequence to a separate file in the specified output folder.
  • The files should be named <SEQID>.fasta, where <SEQID> is the name of the sequence (from the fasta header)

In [ ]:
import sys, os, my_utils

# read in command line args
if len(sys.argv) == 3:
    inputFile = sys.argv[1]
    outputFolder = sys.argv[2]
else:
    print ">>Error: Incorrect args. Please provide an input file name and an output folder. Exiting."
    sys.exit()

# check if input file / output directory exist
if not os.path.exists(inputFile):
    print ">>Error: input file (%s) does not exist. Exiting." % inputFile
    sys.exit()
    
if not os.path.exists(outputFolder):
    print "Creating output folder (%s)" % outputFolder
    os.mkdir(outputFolder)
    
# read in sequences from fasta file & print to separate output files
# you'll get an error for one of them because there's a ">" in the sequence id,
# which is not allowed in a file name. you can handle this however you want.
# here I used a try-except statement and just skipped the problematic file (with a warning message)
seqs = my_utils.read_fasta(inputFile)
for seqID in seqs:
    outFile = "%s/%s.fasta" % (outputFolder, seqID)
    outStr = ">%s\n%s\n" % (seqID, seqs[seqID])
    
    try:
        outs = open(outFile, 'w')
        outs.write(outStr)
        outs.close()
    except IOError:
        print ">>Warning: Could not print (%s) file. Skipping." % outFile

(C) Now use glob to get a list of all files in the output folder from part (B) that have a .fasta extension. For each file, print just the file name (not the file path) to the screen.


In [ ]:
import sys, glob, os

# read in command line args
if len(sys.argv) == 2:
    folderName = sys.argv[1]
else:
    print ">>Error: Incorrect args. Please provide an folder name. Exiting."
    sys.exit()

fastaList = glob.glob(folderName + "/*.fasta")
for filePath in fastaList:
    print os.path.basename(filePath)