Programming Bootcamp 2016

Lesson 6 Exercises


Earning points (optional)

  • Enter your name below.
  • Email your .ipynb file to me (sarahmid@mail.med.upenn.edu) before 9:00 am on 9/27.
  • You do not need to complete all the problems to get points.
  • I will give partial credit for effort when possible.
  • At the end of the course, everyone who gets at least 90% of the total points will get a prize (bootcamp mug!).

Name:


1. Guess the output: scope practice (2pts)

Refer to the code below to answer the following questions:


In [ ]:
def fancy_calc(a, b, c):
    x1 = basic_calc(a,b)
    x2 = basic_calc(b,c)
    x3 = basic_calc(c,a)
    z = x1 * x2 * x3
    return z

def basic_calc(x, y):
    result = x + y
    return result

x = 1
y = 2
z = 3
result = fancy_calc(x, y, z)

(A) List the line numbers of the code above in the order that they will be executed. If a line will be executed more than once, list it each time.

NOTE: Select the cell above and hit "L" to activate line numbering!

Your answer:

(B) Guess the output of each of the following pieces of code, assuming it is run immediately after the code above. Then run the code to see if you're right. (Remember to run the code above first.)


In [ ]:
print x

Your guess:


In [ ]:
print z

Your guess:


In [ ]:
print x1

Your guess:


In [ ]:
print result

Your guess:


2. Data structure woes (2pts)

(A) Passing a data structure to a function. Guess the output of the following lines of code if you were to run them immediately following the code block below. Then run the code yourself to see if you're right.


In [ ]:
# run this first!

def getMax(someList):
    someList.sort()
    x = someList[-1]
    return x

scores = [9, 5, 7, 1, 8]
maxScore = getMax(scores)

In [ ]:
print maxScore

Your guess:


In [ ]:
print someList

Your guess:


In [ ]:
print scores

Your guess:

Why does scores get sorted?

When you pass a data structure (such as a list or dictionary) to a function, the function does not receive a copy. It receives a reference to the very same object, so any change made in place inside the function (like someList.sort()) is visible outside the function too. Simple values such as numbers and strings are passed the same way, but since they can't be modified in place, they behave as if they were copied.

This is done because data structures can be fairly large, and copying the whole thing on every function call would cost both time and memory. So doing things this way is more efficient. It can also surprise you, though, if you're not aware it's happening. If you would like to learn more about this, look up "Pass by reference vs pass by value".
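If you don't want the caller's list to change, one option is to avoid in-place operations inside the function. Here is a minimal sketch, assuming the built-in sorted() is acceptable for your purposes (it returns a new sorted list and leaves the original alone). The name get_max_safe is made up for this illustration, and the single-argument print(...) form is used so the code runs under both Python 2 and Python 3.

```python
# Hypothetical variant of getMax that does not modify the caller's list.
def get_max_safe(some_list):
    ordered = sorted(some_list)   # sorted() builds a new list; some_list is untouched
    return ordered[-1]

scores = [9, 5, 7, 1, 8]
max_score = get_max_safe(scores)

print(max_score)   # the largest score
print(scores)      # still in the original order
```

The difference from getMax is only that someList.sort() changes the list it is called on, while sorted(someList) does not.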

(B) Copying data structures. Guess the output of the following lines of code if you were to run them immediately following the code block below. Then run the code yourself to see if you're right.


In [ ]:
# run this first!
list1 = [1, 2, 3, 4]
list2 = list1
list2[0] = "HELLO"

In [ ]:
print list2

Your guess:


In [ ]:
print list1

Your guess:

Yes, that's right: assigning a list to a new variable does not make a copy. Both names are just references to the same list! This is called aliasing. The same thing will happen with a dictionary. This can really trip you up if you don't know it's happening.
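One way to check whether two names are aliases is the "is" operator, which tests whether two names refer to the same object (while "==" only compares contents). A small sketch, with variable names made up for this illustration:

```python
list_a = [1, 2, 3, 4]
list_b = list_a          # alias: a second name for the same list object
list_c = [1, 2, 3, 4]    # a separate list that happens to have the same contents

same_object = list_b is list_a        # True: one object, two names
same_contents = list_c == list_a      # True: the contents are equal
separate_object = list_c is list_a    # False: two different objects

print(same_object)
print(same_contents)
print(separate_object)
```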

So what if we want to make a truly separate copy? Here's a way for lists:


In [ ]:
# for lists
list1 = [1, 2, 3, 4]
list2 = list(list1) #make a true copy of the list
list2[0] = "HELLO"

print list2
print list1

And here's a way for dictionaries:


In [ ]:
# for dictionaries
dict1 = {'A':1, 'B':2, 'C':3}
dict2 = dict1.copy() #make a true copy of the dict
dict2['A'] = 99

print dict2
print dict1
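One caveat worth knowing: list(...) and .copy() make what is called a shallow copy. The outer container is new, but any lists or dictionaries stored inside it are still shared. For nested structures, the standard library's copy.deepcopy makes a fully independent copy. A minimal sketch (variable names are made up for this illustration; print(...) is written so it runs in both Python 2 and 3):

```python
import copy

nested1 = [[1, 2], [3, 4]]
shallow = list(nested1)          # new outer list, but the inner lists are shared
deep = copy.deepcopy(nested1)    # new outer list and new inner lists

shallow[0][0] = "HELLO"          # changes an inner list that nested1 also holds
deep[1][1] = "BYE"               # changes only the deep copy

print(nested1)   # [['HELLO', 2], [3, 4]]
print(shallow)   # [['HELLO', 2], [3, 4]]
print(deep)      # [[1, 2], [3, 'BYE']]
```

For the flat lists and dictionaries used in this lesson, the shallow copies above are all you need.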

3. Writing custom functions (8pts)

Complete the following. For some of these problems, you can use your code from previous labs as a starting point.

(If you didn't finish those problems, feel free to use the code from the answer sheet. Just make sure you understand how it works! Optionally, for extra practice you can try rewriting your solutions using some of the new things we've learned since then.)

(A) (1pt) Create a function called "gc" that takes a single sequence as a parameter and returns the GC content of the sequence (as a float rounded to 2 decimal places).


In [ ]:

(B) (1pt) Create a function called "reverse_compl" that takes a single sequence as a parameter and returns the reverse complement.


In [ ]:

(C) (1pt) Create a function called "read_fasta" that takes a file name as a parameter (which is assumed to be in fasta format), puts each fasta entry into a dictionary (using the header line as a key and the sequence as a value), and then returns the dictionary.


In [ ]:

(D) (2pts) Create a function called "rand_seq" that takes an integer length as a parameter, and then returns a random DNA sequence of that length.

Hint: make a list of the possible nucleotides


In [ ]:

(E) (2pts) Create a function called "shuffle_nt" that takes a single sequence as a parameter and returns a string that is a shuffled version of the sequence (i.e. the same nucleotides, but in a random order).

Hint: Look for Python functions that will make this easier. For example, the random module has some functions for shuffling. There may also be some built-in string functions that are useful. However, you can also do this just using things we've learned.


In [ ]:

(F) (1pt) Run the code below to show that all of your functions work. Try to fix any that have problems.


In [ ]:
##### testing gc
gcCont = gc("ATGGGCCCAATGG")

if type(gcCont) != float:
    print ">> Problem with gc: answer is not a float, it is a %s." % type(gcCont)
elif gcCont != 0.62:
    print ">> Problem with gc: incorrect answer (should be 0.62; your code gave", gcCont, ")"
else:
    print "gc: Passed."


##### testing reverse_compl
revCompl = reverse_compl("GGGGTCGATGCAAATTCAAA")

if type(revCompl) != str:
    print ">> Problem with reverse_compl: answer is not a string, it is a %s." % type(revCompl)
elif revCompl != "TTTGAATTTGCATCGACCCC":
    print ">> Problem with reverse_compl: answer (%s) does not match expected (%s)" % (revCompl, "TTTGAATTTGCATCGACCCC")
else:
    print "reverse_compl: Passed."
    

##### testing read_fasta
try:
    ins = open("horrible.fasta", 'r')
except IOError:
    print ">> Cannot test read_fasta because horrible.fasta is missing. Please add it to the directory with this notebook."
else:
    ins.close()  # the file was only opened to check that it exists
    seqDict = read_fasta("horrible.fasta")
    
    if type(seqDict) != dict:
        print ">> Problem with read_fasta: answer is not a dictionary, it is a %s." % type(seqDict)
    elif len(seqDict) != 22:
        print ">> Problem with read_fasta: # of keys in dictionary (%s) does not match expected (%s)" % (len(seqDict), 22)
    else:
        print "read_fasta: Passed."


##### testing rand_seq
randSeq1 = rand_seq(23)
randSeq2 = rand_seq(23)

if type(randSeq1) != str:
    print ">> Problem with rand_seq: answer is not a string, it is a %s." % type(randSeq1)
elif len(randSeq1) != 23:
    print ">> Problem with rand_seq: answer length (%s) does not match expected (%s)." % (len(randSeq1), 23)
elif randSeq1 == randSeq2:
    print ">> Problem with rand_seq: generated the same sequence twice (%s) -- are you sure this is random?" % randSeq1
else:
    print "rand_seq: Passed."


##### testing shuffle_nt
shuffSeq = shuffle_nt("AAAAAAGTTTCCC")

if type(shuffSeq) != str:
    print ">> Problem with shuffle_nt: answer is not a string, it is a %s." % type(shuffSeq)
elif len(shuffSeq) != 13:
    print ">> Problem with shuffle_nt: answer length (%s) does not match expected (%s)." % (len(shuffSeq), 13)
elif shuffSeq == "AAAAAAGTTTCCC":
    print ">> Problem with shuffle_nt: answer is exactly the same as the input. Are you sure this is shuffling?"
elif shuffSeq.count('A') != 6:
    print ">> Problem with shuffle_nt: answer doesn't contain the same # of each nt as the input."
else:
    print "shuffle_nt: Passed."

4. Using your functions (5pts)

Use the functions you created above to complete the following.

(A) (1pt) Create 20 random nucleotide sequences of length 50 and print them to the screen.


In [ ]:

(B) (1pt) Read horrible.fasta into a dictionary. For each sequence, print its reverse complement to the screen.


In [ ]:

(C) (3pts) Read horrible.fasta into a dictionary. For each sequence, find the length and the GC content. Print the results to the screen in the following format:

SeqID    Len    GC
...      ...    ...

That is, print the header shown above (separating each column's title by a tab (\t)), followed by the corresponding info about each sequence on a separate line. The "columns" should be separated by tabs. Remember that you can do this printing as you loop through the dictionary... that way you don't have to store the length and gc content.

(In general, this is the sort of formatting you should use when printing data files!)
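As a reminder of how tab-separated printing can look, here is a small sketch on unrelated, made-up data (the rows and column names below are invented for illustration and have nothing to do with horrible.fasta). It joins the fields of each row with "\t" after converting them to strings, and uses the single-argument print(...) form so it runs in both Python 2 and 3.

```python
# Made-up example rows: (name, count, fraction)
rows = [("itemA", 12, 0.5), ("itemB", 7, 0.25)]

# header line: column titles separated by tabs
header = "\t".join(["Name", "Count", "Frac"])
print(header)

lines = []
for name, count, frac in rows:
    # convert every field to a string, then join the fields with tabs
    line = "\t".join([name, str(count), "%.2f" % frac])
    lines.append(line)
    print(line)
```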


In [ ]:


Bonus question: K-mer generation (+2 bonus points)

This question is optional, but if you complete it, I'll give you two bonus points. You won't lose points if you skip it.

Create a function called get_kmers that takes a single integer parameter, k, and returns a list of all possible k-mers of A/T/G/C. For example, if the supplied k was 2, you would generate all possible 2-mers, i.e. [AA, AT, AG, AC, TA, TT, TG, TC, GA, GT, GG, GC, CA, CT, CG, CC].

Notes:

  • This function must be generic, in the sense that it can take any integer value of k and produce the corresponding set of k-mers.
  • As there are $4^k$ possible k-mers for a given k, stick to smaller values of k for testing!!
  • I have not really taught you any particularly obvious way to solve this problem, so feel free to get creative in your solution!

There are many ways to do this, and plenty of examples online. Since the purpose of this question is to practice problem solving, don't directly look up "k-mer generation"... try to figure it out yourself. You're free to look up more generic things, though.


In [ ]:
def get_kmers(k):
    kmers = []

    # your code here

    return kmers


Extra problems (0pts)

(A) Create a function that counts the number of occurrences of each nt in a specified string. Your function should accept a nucleotide string as a parameter, and should return a dictionary with the counts of each nucleotide (where the nt is the key and the count is the value).


In [ ]:

(B) Create a function that generates a random nt sequence of a specified length with specified nt frequencies. Your function should accept as parameters:

  • a length
  • a dictionary of nt frequencies

and should return the generated string. You'll need to figure out a way to use the supplied frequencies to generate the sequence.

An example of the nt freq dictionary could be: {'A':0.60, 'G':0.10, 'C':0.25, 'T':0.05}


In [ ]: