Up until now, we've created all of our functions inside the Jupyter notebook. However, in order to use your functions across different scripts, it's best to put them into a separate file. Then you can load this file from anywhere with a single line of code and have access to all your custom functions!
Do the following:
my_utils.py
functions. You will need to change '../utilities/my_utils.py'
to where you saved your file. Some notes:
import
function will only find custom functions that are in the same directory as the current notebook/script.
In [3]:
import imp
my_utils = imp.load_source('my_utils', '../utilities/my_utils.py') #CHANGE THIS PATH
# test that this worked
print "Test my_utils.gc():", my_utils.gc("ATGGGCCCAATGG")
print "Test my_utils.reverse_compl():", my_utils.reverse_compl("GGGGTCGATGCAAATTCAAA")
print "Test my_utils.read_fasta():", my_utils.read_fasta("horrible.fasta")
print "Test my_utils.rand_seq():", my_utils.rand_seq(23)
print "Test my_utils.shuffle_nt():", my_utils.shuffle_nt("AAAAAAGTTTCCC")
print "\nIf the above produced no errors, then you're good!"
Feel free to use these functions (and any others you've created) to solve the problems below. You can see in the test code above how they can be accessed.
(A) Write a script that expects 4 arguments, and prints those four arguments to the screen. Test this script by running it (on the command line) as shown in the lecture. Copy and paste the code below once you have it working.
In [ ]:
import sys
print sys.argv[1]
print sys.argv[2]
print sys.argv[3]
print sys.argv[4]
(B) Write a script that expects 3 numerical arguments ("a", "b", and "c") from the command line.
Copy and paste your code below once you have it working.
Note: All command line arguments are read in as strings (just like with raw_input). To use them as numbers, you must convert them with float().
In [ ]:
import sys
if len(sys.argv) == 4:
a = float(sys.argv[1])
b = float(sys.argv[2])
c = float(sys.argv[3])
else:
print "Incorrect args. Please enter three numbers."
sys.exit()
print a + b + c
(C) Here you will create a script that generates a random dataset of sequences.
Your script should expect the following command line arguments, in this order. Remember to convert strings to ints when needed:
The script should read in these arguments and check if the correct number of arguments is supplied (exit if not).
If all looks good, then print the indicated number of randomly generated sequences as follows:
Then, run your script to create a file called fake.fasta
containing 100,000 random sequences of random length 50-500 nt.
Copy and paste your code below once you have it working.
In [ ]:
import sys, random, imp
my_utils = imp.load_source('my_utils', '../utilities/my_utils.py')
if len(sys.argv) == 5:
outFile = sys.argv[1]
numSeqs = int(sys.argv[2])
minLength = int(sys.argv[3])
maxLength = int(sys.argv[4])
else:
print "Incorrect args. Please enter an outfile, numSeqs, minLength, maxLength."
sys.exit()
outs = open(outFile, 'w')
for i in range(numSeqs):
randLen = random.randint(minLength, maxLength)
randSeq = my_utils.rand_seq(randLen)
seqName = "seq" + str(i)
outs.write(">" + seqName + "\n" + randSeq + "\n")
outs.close()
time
practice (7pts)For the following problems, use the file you created in the previous problem (fake.fasta
) and the time.time()
function. (Note: there is also a copy of fake.fasta
on Piazza if you need it.)
Note: Do not include the time it takes to read the file in your time calculation! Loading files can take a while.
(A) Initial practice with timing. Add code to the following cell to time how long it takes to run. Print the result.
In [1]:
import time
start = time.time()
sillyList = []
for i in range(50000):
sillyList.append(sum(sillyList))
end = time.time()
print end - start
(B) Counting characters. Is it faster to use the built-in function str.count()
or to loop through a string and count characters manually? Compare the two by counting all the A's in all the sequences in fake.fasta
using each method and comparing how long they take to run.
(You do not need to output the counts)
In [4]:
# Method 1 (Manual counting)
seqDict = my_utils.read_fasta("fake.fasta")
start = time.time()
for seqID in seqDict:
seq = seqDict[seqID]
count = 0
for char in seq:
if char == "A":
count += 1
end = time.time()
print end - start
In [5]:
# Method 2 (.count())
seqDict = my_utils.read_fasta("fake.fasta")
start = time.time()
for seqID in seqDict:
seq = seqDict[seqID]
count = seq.count("A")
end = time.time()
print end - start
Which was faster? Method 2, str.count()
(C) Replacing characters. Is it faster to use the built-in function str.replace()
or to loop through a string and replace characters manually? Compare the two by replacing all the T's with U's in all the sequences in fake.fasta
using each method, and comparing how long they take to run.
(You do not need to output the edited sequences)
In [6]:
# Method 1 (Manual replacement)
seqDict = my_utils.read_fasta("fake.fasta")
start = time.time()
for seqID in seqDict:
seq = seqDict[seqID]
newSeq = ""
for char in seq:
if char == "T":
newSeq += "U"
else:
newSeq += char
end = time.time()
print end - start
In [7]:
# Method 2 (.replace())
seqDict = my_utils.read_fasta("fake.fasta")
start = time.time()
for seqID in seqDict:
seq = seqDict[seqID]
newSeq = seq.replace("T", "U")
end = time.time()
print end - start
Which was faster? Method 2, str.replace()
(D) Lookup speed in data structures. Is it faster to get unique IDs using a list or a dictionary? Read in fake.fasta
, ignoring everything but the header lines. Count the number of unique IDs (headers) using a list or dictionary, and compare how long each method takes to run.
Be patient; this one might take a while to run!
In [8]:
# Method 1 (list)
seqDict = my_utils.read_fasta("fake.fasta")
uniqueIDs = []
start = time.time()
for seqID in seqDict:
if seqID not in uniqueIDs:
uniqueIDs.append(seqID)
end = time.time()
print end - start
In [9]:
# Method 2 (dictionary)
seqDict = my_utils.read_fasta("fake.fasta")
uniqueIDs = {}
start = time.time()
for seqID in seqDict:
if seqID not in uniqueIDs:
uniqueIDs[seqID] = True
end = time.time()
print end - start
Which was faster? Method 2, dictionary
If you're curious, below is a brief explanation of the outcomes you should have observed:
(B) The built-in method should be much faster! Most built in functions are pretty well optimized, so they will often (but not always) be faster.
(C) Again, the built in function should be quite a bit faster.
(D) If you did this right, then the dictionary should be faster by several orders of magnitude. When you use a dictionary, Python jumps directly to where the requested key should be, if it were in the dictionary. This is very fast (it's an O(1) operation, for those who are familiar with the terminology). With lists, on the other hand, Python will scan through the whole list until it finds the requested element (or until it reaches the end). This gets slower and slower on average as you add more elements (it's an O(n) operation). Just something to keep in mind if you start working with very large datasets!
(A) Write code that prompts the user (using raw_input()
) for two pieces of information: an input file name (assumed to be a fasta file) and an output folder name (does not need to already exist). Then do the following:
Note: I did this below with command line args instead of raw_input() to give more examples of using args
In [ ]:
import sys, os
# read in command line args
if len(sys.argv) == 3:
inputFile = sys.argv[1]
outputFolder = sys.argv[2]
else:
print ">>Error: Incorrect args. Please provide an input file name and an output folder. Exiting."
sys.exit()
# check if input file / output directory exist
if not os.path.exists(inputFile):
print ">>Error: input file (%s) does not exist. Exiting." % inputFile
sys.exit()
if not os.path.exists(outputFolder):
print "Creating output folder (%s)" % outputFolder
os.mkdir(outputFolder)
(B) Add to the code above so that it also does the following after creating the output folder:
<SEQID>.fasta
, where <SEQID>
is the name of the sequence (from the fasta header)
In [ ]:
import sys, os, my_utils
# read in command line args
if len(sys.argv) == 3:
inputFile = sys.argv[1]
outputFolder = sys.argv[2]
else:
print ">>Error: Incorrect args. Please provide an input file name and an output folder. Exiting."
sys.exit()
# check if input file / output directory exist
if not os.path.exists(inputFile):
print ">>Error: input file (%s) does not exist. Exiting." % inputFile
sys.exit()
if not os.path.exists(outputFolder):
print "Creating output folder (%s)" % outputFolder
os.mkdir(outputFolder)
# read in sequences from fasta file & print to separate output files
# you'll get an error for one of them because there's a ">" in the sequence id,
# which is not allowed in a file name. you can handle this however you want.
# here I used a try-except statement and just skipped the problematic file (with a warning message)
seqs = my_utils.read_fasta(inputFile)
for seqID in seqs:
outFile = "%s/%s.fasta" % (outputFolder, seqID)
outStr = ">%s\n%s\n" % (seqID, seqs[seqID])
try:
outs = open(outFile, 'w')
outs.write(outStr)
outs.close()
except IOError:
print ">>Warning: Could not print (%s) file. Skipping." % outFile
(C) Now use glob
to get a list of all files in the output folder from part (B) that have a .fasta extension. For each file, print just the file name (not the file path) to the screen.
In [ ]:
import sys, glob, os
# read in command line args
if len(sys.argv) == 2:
folderName = sys.argv[1]
else:
print ">>Error: Incorrect args. Please provide an folder name. Exiting."
sys.exit()
fastaList = glob.glob(folderName + "/*.fasta")
for filePath in fastaList:
print os.path.basename(filePath)