Programming Bootcamp 2016

Lesson 6 Exercises -- ANSWERS


Earning points (optional)

  • Enter your name below.
  • Email your .ipynb file to me (sarahmid@mail.med.upenn.edu) before 9:00 am on 9/27.
  • You do not need to complete all the problems to get points.
  • I will give partial credit for effort when possible.
  • At the end of the course, everyone who gets at least 90% of the total points will get a prize (bootcamp mug!).

Name:


1. Guess the output: scope practice (2pts)

Refer to the code below to answer the following questions:


In [2]:
def fancy_calc(a, b, c):
    x1 = basic_calc(a,b)
    x2 = basic_calc(b,c)
    x3 = basic_calc(c,a)
    z = x1 * x2 * x3
    return z

def basic_calc(x, y):
    result = x + y
    return result

x = 1
y = 2
z = 3
result = fancy_calc(x, y, z)

(A) List the line numbers of the code above in the order that they will be executed. If a line will be executed more than once, list it each time.

NOTE: Select the cell above and hit "L" to activate line numbering!

Answer:

12
13
14
15
1
2
8
9
10
2
3
8
9
10
3
4
8
9
10
4
5
6
15

(B) Guess the output if you were to run each of the following pieces of code immediately after running the code above. Then run the code to see if you're right. (Remember to run the code above first)


In [3]:
print x


1

In [4]:
print z


3

In [5]:
print x1


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-5-8d4362aff03b> in <module>()
----> 1 print x1

NameError: name 'x1' is not defined

In [6]:
print result


60

2. Data structure woes (2pt)

(A) Passing a data structure to a function. Guess the output of the following lines of code if you were to run them immediately following the code block below. Then run the code yourself to see if you're right.


In [7]:
# run this first!

def getMax(someList):
    someList.sort()
    x = someList[-1]
    return x

scores = [9, 5, 7, 1, 8]
maxScore = getMax(scores)

In [8]:
print maxScore


9

In [9]:
print someList


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-9-1767a8b07a7f> in <module>()
----> 1 print someList

NameError: name 'someList' is not defined

In [10]:
print scores


[1, 5, 7, 8, 9]

Why does scores get sorted?

When you pass a data structure as a parameter to a function, it's not a copy of the data structure that gets passed (as what happens with regular variables). What gets passed is a direct reference to the data structure itself.

The reason this is done is because data structures are typically expected to be fairly large, and copying/re-assigning the whole thing can be both time- and memory-consuming. So doing things this way is more efficient. It can also surprise you, though, if you're not aware it's happening. If you would like to learn more about this, look up "Pass by reference vs pass by value".

(B) Copying data structures. Guess the output of the following code if you were to run them immediately following the code block below. Then run the code yourself to see if you're right.


In [11]:
# run this first!
list1 = [1, 2, 3, 4]
list2 = list1
list2[0] = "HELLO"

In [12]:
print list2


['HELLO', 2, 3, 4]

In [13]:
print list1


['HELLO', 2, 3, 4]

Yes, that's right--even when you try to make a new copy of a list, it's actually just a reference to the same list! This is called aliasing. The same thing will happen with a dictionary. This can really trip you up if you don't know it's happening.

So what if we want to make a truly separate copy? Here's a way for lists:


In [14]:
# for lists
list1 = [1, 2, 3, 4]
list2 = list(list1) #make a true copy of the list
list2[0] = "HELLO"

print list2
print list1


['HELLO', 2, 3, 4]
[1, 2, 3, 4]

And here's a way for dictionaries:


In [15]:
# for dictionaries
dict1 = {'A':1, 'B':2, 'C':3}
dict2 = dict1.copy() #make a true copy of the dict
dict2['A'] = 99

print dict2
print dict1


{'A': 99, 'C': 3, 'B': 2}
{'A': 1, 'C': 3, 'B': 2}

3. Writing custom functions (8pts)

Complete the following. For some of these problems, you can use your code from previous labs as a starting point.

(If you didn't finish those problems, feel free to use the code from the answer sheet, just make sure you understand how they work! Optionally, for extra practice you can try re-writing them using some of the new things we've learned since then.)

(A) (1pt) Create a function called "gc" that takes a single sequence as a parameter and returns the GC content of the sequence (as a 2 decimal place float).


In [16]:
def gc(seq):
    gcCount = seq.count("C") + seq.count("G")
    gcFrac = float(gcCount) / len(seq)
    
    return round(gcFrac,2)

(B) (1pt) Create a function called "reverse_compl" that takes a single sequence as a parameter and returns the reverse complement.


In [17]:
def reverse_compl(seq):
    complements = {'A':'T', 'C':'G', 'G':'C', 'T':'A'}
    compl = ""
    for char in seq:
        compl = complements[char] + compl
    
    return compl

(C) (1pt) Create a function called "read_fasta" that takes a file name as a parameter (which is assumed to be in fasta format), puts each fasta entry into a dictionary (using the header line as a key and the sequence as a value), and then returns the dictionary.


In [18]:
def read_fasta(fileName):
    ins = open(fileName, 'r')
    seqDict = {}
    activeID = ""

    for line in ins:
        line = line.rstrip('\r\n')
        
        if line[0] == ">":
            activeID = line[1:] 
            if activeID in seqDict:
                print ">>> Warning: repeat id:", activeID, "-- overwriting previous ID."
            seqDict[activeID] = ""
            
        else:
            seqDict[activeID] += line 
    
    ins.close()    
    
    return seqDict

(D) (2pts) Create a function called "rand_seq" that takes an integer length as a parameter, and then returns a random DNA sequence of that length.

Hint: make a list of the possible nucleotides


In [19]:
def rand_seq(length):
    import random
    
    nts = ['A','C','G','T']
    seq = ""
    for i in range(length):
        seq += random.choice(nts)
    
    return seq

(E) (2pts) Create a function called "shuffle_nt" that takes a single sequence as a parameter and returns a string that is a shuffled version of the sequence (i.e. the same nucleotides, but in a random order).

Hint: Look for Python functions that will make this easier. For example, the random module has some functions for shuffling. There may also be some built-in string functions that are useful. However, you can also do this just using things we've learned.


In [20]:
def shuffle_nt(seq):
    import random
    
    strList = list(seq)
    random.shuffle(strList)
    shuffSeq = "".join(strList)
    
    return shuffSeq

(F) (1pt) Run the code below to show that all of your functions work. Try to fix any that have problems.


In [21]:
##### testing gc
gcCont = gc("ATGGGCCCAATGG")

if type(gcCont) != float:
    print ">> Problem with gc: answer is not a float, it is a %s." % type(gcCont)
elif gcCont != 0.62:
    print ">> Problem with gc: incorrect answer (should be 0.62; your code gave", gcCont, ")"      
else:
    print "gc: Passed."


##### testing reverse_compl
revCompl = reverse_compl("GGGGTCGATGCAAATTCAAA")

if type(revCompl) != str:
        print ">> Problem with reverse_compl: answer is not a string, it is a %s." % type(revCompl)  
elif revCompl != "TTTGAATTTGCATCGACCCC":
    print ">> Problem with reverse_compl: answer (%s) does not match expected (%s)" % (revCompl, "TTTGAATTTGCATCGACCCC")         
else:
    print "reverse_compl: Passed."
    

##### testing read_fasta
try:
    ins = open("horrible.fasta", 'r')
except IOError:
    print ">> Can not test read_fasta because horrible.fasta is missing. Please add it to the directory with this notebook."
else:
    seqDict = read_fasta("horrible.fasta")
    
    if type(seqDict) != dict:
        print ">> Problem with read_fasta: answer is not a dictionary, it is a %s." % type(seqDict)
    elif len(seqDict) != 22:
        print ">> Problem with read_fasta: # of keys in dictionary (%s) does not match expected (%s)" % (len(seqDict), 22)
    else:
        print "read_fasta: Passed."


##### testing rand_seq
randSeq1 = rand_seq(23)
randSeq2 = rand_seq(23)

if type(randSeq1) != str:
    print ">> Problem with rand_seq: answer is not a string, it is a %s." % type(randSeq1)
elif len(randSeq1) != 23:
    print ">> Problem with rand_seq: answer length (%s) does not match expected (%s)." % (len(randSeq1), 23)
elif randSeq1 == randSeq2:
    print ">> Problem with rand_seq: generated the same sequence twice (%s) -- are you sure this is random?" % randSeq1
else:
    print "rand_seq: Passed."


##### testing shuffle_nt
shuffSeq = shuffle_nt("AAAAAAGTTTCCC")

if type(shuffSeq) != str:
    print ">> Problem with shuffle_nt: answer is not a string, it is a %s." % type(shuffSeq)
elif len(shuffSeq) != 13:
    print ">> Problem with shuffle_nt: answer length (%s) does not match expected (%s)." % (len(shuffSeq), 12)
elif shuffSeq == "AAAAAAGTTTCCC":
    print ">> Problem with shuffle_nt: answer is exactly the same as the input. Are you sure this is shuffling?"
elif shuffSeq.count('A') != 6:
    print ">> Problem with shuffle_nt: answer doesn't contain the same # of each nt as the input."
else:
    print "shuff_seq: Passed."


gc: Passed.
reverse_compl: Passed.
read_fasta: Passed.
rand_seq: Passed.
shuff_seq: Passed.

4. Using your functions (5pts)

Use the functions you created above to complete the following.

(A) (1pt) Create 20 random nucleotide sequences of length 50 and print them to the screen.


In [22]:
for i in range(20):
    print rand_seq(50)


AGGATTGGTATTTACAATCCAGGGATATATTACATGTGCTCGACCCCGGA
GCAGCGGACGAACAGCTTGGCCCTCAATCGCACGGAGCCATAAACCCATC
TTGTGCGCACTCGCAGGGCCTCAATCTGCTTCGGTCCTGCAATCCTCCTG
TTCAGCGTGGTGAGGGGGGTGACTGTTAGCCAGCCGGGTACAGTGGGGAG
GGGAACTATGCATCTAGGCCCCGTTGTACGTACAACCTCGGCTAAGCTCC
ATTCAACAAGCGTAATGCCACAATCAATTAGTTTATCGATGGCCTAAGCT
TCGCCGGGTTACGAGACGGGCTCCGTGGTAGAGGGGCGCCACCTTGATGG
CGCGTGTATCTAATCCCAGAAACGGATGCCCCTCTCGTACCCGCCCCACA
CGGTGGGGCGAAGCGAGATCCCACTTCATTAATGTGCCCCTTACTCGATG
GCGTCGGAAGATCACAAACGTGTGCATAAAGCCCCAAGAAGCCACTAGCT
CCTACATTAAGACATTCAGCAATAATATTCTTTCTTGTGGGTAGTACGGA
AGTTGTGCTTGACGGGGTATGTACATGGCGTAATAAAGACCGTAACGACA
AAATGGCACCTAGACTTGCCGACGCTTGCCAGTTTATTTAGTTTGCGAAC
GTGGGTTGCGCCAGACACTGAGTGTTGAGTCGGCAGGCGTGATCAAATTA
TAAGTCTCAGGGAGGACCCATCCATTTCATGCTGTAAATATCGAACAGTC
TAGATGGGCAAGGTGCTTCGGTACAACCTCTCGCTTCATTCATGCCCTAC
TCATCGATCAATACCCTATACACTGCCAGCCGGAAGCGAGGAGAGATATG
AACATGGTCTATCTACGGCCCTAGACAAAGACCCGAGACTTTTGATCGCC
TAGGGAATGCTGTATATCCACAATAGTGGGATCTCAGCTTACACATGCGG
TCTTCCGCTCGTCTGTAACTCCACAATTCTGTGTCATAAAGTGCCCGAAG

(B) (1pt) Read in horrible.fasta into a dictionary. For each sequence, print its reverse complement to the screen.


In [23]:
seqDict = read_fasta("horrible.fasta")

for seqID in seqDict:
    print reverse_compl(seqDict[seqID])


AACCTCCTGGGGAGGTGGTGGCGGCTCTTGCAGATGTGGAACCAGCAGAGGTTGTGCTTACAGCTGGGCCTGTGGTGCTGCCAGCTGTTTCAGCCGGTGT
CTGATCACTGAGCTGAAACTAAACGTTTTAGGTGGAAAAAAAGCGTCCGAAGGCACCGTGAAATGATTAAGGAACTAAAGAGCTTCTCGCCATGTGAGATCATGTCCTGTTCTCGCCAACATCACAAGATGTCCCCAGACACGCCGCGCCCCCAGCGCGCCGCCCCACACTGCCGGCCCGGAGCGAGGAAAGGGTAGGCGCTGCGCGG
TAGGTGAAAATTCCTTCTGCTGGTTCCCAGAGATACCTAGGAAGACTCTGGGGAACCCTTGGCTAATTATCCCAGGAAAACTGCTGCCTCGGCTGAAACTGGAAGCTCATGGTGGACCCCAAGATATCTTATCTTTGGGACACTTAAAAAAAAAAAGCTATTTTATTCCAATTAAGCCAGTCTTTTGAGAGACACCTAGAAAGAAAGGGCTTCTAAAACATGAACATGAGCTCTGATGTTAGCAACCCAACTTCCACTCCAAAATTACTGAAATATTTATGGGTAAAATTAACTCATAAAAACCTTCTTCT
ACCCCTAAGGAACGTCCCTCGCGTCGGTTTGAGGAGGAAGGCGCACTTCTCTTGATGACCGTTGG
GGTAAGCACAGGATCCAAGAAACAGAGATTACACACAGGAGAGAGGCCAAGCAAAGCTCTGTGATGAAAGGTATGAAGTATGCCCACGGAGCAGCCAGCTGAGACTGGAACAAGAGGATGTAGCACTCCATGCAGGAAAATTCCATGGAATCTAGCACTTTGGGACATCCAGGTGGGCG
AGCAATACTTTCACTGCTGCCAGCCCGAG
GTATCACCTTCAATTTCTTAAGAGCCATTCTTCT
ATTTTCTGAGCTTCTTCTCTCGCAAGGTCTTGTTCATTTGGCAATACTGATATTTGATCTTTGTACACA
GTACCTTCTCGGAAGGCCAGAGTCAATTGTACCACCACAGATCCTGGCCTGAACTTAATATTGGAGAGGCCCAGAAAACCCCCTT
CAAAGCACACAGAGATTCTGTCAGGTGCTGAGACACCACAGCCTTCTCAATTTTGTCCTTAAGGGCTTTATCTTTCATCCAATTGAGCAGAGGCTCAAATTCTTTCTCAACTGCTTCATGACTCTCCTTAGTTTTCTCACTTTTATCAAACTTCATTCCTTCCTTGACAACATTCTGGAACCTCTTCCCATCAAATTTG
GGGCCCGGGACCCGGGTGGGGGGGACCGCCGAGAGGCCCAGCGCAGCGA
GCTTTGGAAACTGGAATGAGGATCACCAACAGGATCCTCATTTTACACAGGAGTTATGAGAGTTACATCCTCTAGCAGAGATGCTTGGTCATTACCTGTGGTACATGAGATTACCGAGCTAAAAGGGAAAAAAAACGATCTTAATGTTCTCCCATGAACTCAACTTAAGCTTTTTATGGAGGCACTGAGGCCATGCAGCTCCTTTTCCAAAAGACACAGATAAAAGCCAAATAAGGTAGAGGACTTTGGAAATTTTCTCTGAAAAGTTAAATTCCACATAATAGTAAGA
TTTTAATCTTCTTCCTTCCCGTCGACTGTCTTTCTTTAAAGCAACTGCAATTTCTTCCCTTACTTCCTCACTGTCTGTTGCTATAATTTGCCCATTGTGAACCATCTGTGAATTCTGTCTTAGGTATTCCATGAATCCATTCACATCTTCATTTAAGTACTCTTTTTTCTTTTTGTTCTTTTTATGTTTTGCTTGGGGTGCATCATTTTTGAGGGATAGCCTATTGGCTTCAAGTTGTTTACGCTTTGGTAGGTTTTGGCTTGTTCCCTCAAAGGATCCCTTCTTCATGTCCTCCCATGATGTTGCAGGCAAGGGTCTCTTGTTATATGTGGTACTAACTCGGGCCCACCTGGTCATAATTTCATCAGTGGTACCGCGCACGAATCCCCCAGAGCAGCCGAGTTGGCGAGCCGGGGAAGACCGCCCTCCTGCGGTATTGGAGACCGGAAGCACATAGTG
TCAATGTTTTCTTCTTTAATCACAGATGATGTACAGACACCAGCATAATTTGCTGATGTAATTTCCTTATCCAAGG
CTTCATATATATTTAATTTTCTCTTTGCTTCACTACTGCAAGGTAGGTGTTTATTATCTCCTTTTACAGATGTGGAAACTTAGGCTCAGAGGTGAAGTAACTTGCACAAGTTTCTACAGCTAGAATTTGAACCAGGTCTGACCCCCGAATTGTGCTCGTCCATAAAGGCCAGCATTTGCCAAATTATGGCACACAGTACCACCAGTGGTACGTGACTTCTTTGGTTGAAAACAGACAAATTTATTTTGTTTTGATAGTTATGTCTTTTAATATGTATTAGAAGAATACATAATTAGCACACATCAAACCTGTGATTTCACAGATATCACTACTTGGGATGAAAATGATATAGGATAACAATGTTAGACCTCAG
AAGATTTCCAGAGTGG
CCATGGTTAGTTAAATTCCCTAGAGATGTAGCCGTGACTCTCCCAATACCTGAAGTGTGCCTCCCCTGACTCTGTGGCATCCTCTGGAAGAGATCATGGTTGTATTCATAATATCTGTAATCTTCTTGTGCACGATCTCCAAGTGGCCGCCTTCTCTGTCCATCAAAAAAGTTATCTGAGAAGAAGTATCGGGAGCCAGAGTCTCCATTCTCAACAGCAAAGTTAACTTCTGTCAAAAATGACTGTGATGAGCCACACTCTCGAGGGACATCTGCTAGGCTCCTGACAAGGTAAGAAGGGGCAGACAGTCTGTGGCTTTCTCTTCTCATTACTTCATGAGGTGTCCTTTGAATTGCAGTTCTCAGGAAACTCTGGTTTCTTGAAACTACACCATCTCCAGAAGCTGAGAAAGCAGTAGCACTTGAATCTGGAAGACAGAGGTCAGTCC
CCTTTCCGGGACTGGTTT
AAATTGACTTCTGCCATAATAAAATC
TGAACAGCTGCTGTGTAGCCCATACTGTGAAAAGTAAAACATCACCCCAGTTCTCGGTACACACAGAGCTCATGCTCCAGCGGGCTGAGCCT
GCTTAAGCCTAGGAGTTTGAGACCAGCCTGGGCAACACAGCAAGACCCCATCTCTACCAAAAAAAAAAAAAAATTAAAGAGTCCTATAGAGAATTCTTATACTCCAATGTGAAGACAACATTGGAAAGGGCCAAGTTTCTCATGCCCTCCAACTAAGAAACCCCTAATAAAAAATGAAGTGACACTTGAACAGGACTTAAGGATTCTACAGTTGGTCTTTGGCAGCAGTATGTTTTAGGAAATGTAATGCGGCGGGTGGGGCGGTGACTTAGCCAGTTATGCTTTTAAATGGAACTGCAATAATAAAAGTGATACTAGTGCAGAAAGTATCTGTATTAGAATTCTAGAGTAAGTCAAGAGCTCACATTCATTAAAATAATGACACAACTCCACGGGGGTGGGGAGAACAGCAGTAAAGCAACCACATACTATACTATTAGACTGGCAACATTGAGACTGAAAATATCCATGAGGAGAATACTGACATCTTA
GCATGGTTGGCCTGAAGGTATTAGTGCGCAGGAGATGATTCAAACTTCCATGGGTCCCATTATTAGGAGCTGGCTTCAATCCCAGGAGATCACACATAACATTGTAAAGTTCAATGTTTTCAAATGGAGGCACTTTAGTCTTGTACTTAAATGTTGAGCCATAACCTACAAAAACAGTCTGCATGCTGTTGACCTTGTTATCAAATCCGTGGTCTCCCTGGAAAAAGCATTTTCCTGATGG

(C) (3pts) Read in horrible.fasta into a dictionary. For each sequence, find the length and the gc content. Print the results to the screen in the following format:

SeqID    Len    GC
...      ...    ...

That is, print the header shown above (separating each column's title by a tab (\t)), followed by the corresponding info about each sequence on a separate line. The "columns" should be separated by tabs. Remember that you can do this printing as you loop through the dictionary... that way you don't have to store the length and gc content.

(In general, this is the sort of formatting you should use when printing data files!)


In [25]:
seqDict = read_fasta("horrible.fasta")
print "SeqID\tLen\tGC"

for seqID in seqDict:
    seq = seqDict[seqID]
    seqLen = len(seq)
    seqGC = gc(seq)
    print seqID + "\t" + str(seqLen) + "\t" + str(seqGC)


SeqID	Len	GC
varlen2_uc007xie.1_4456	100	0.61
varlen2_uc010mlp.1_79	208	0.57
varlen2_uc009bxt.1_1728	311	0.4
varlen2_uc009div.2_242	65	0.58
varlen2_uc003its.2_2976	179	0.5
varlen2_uc003nvg.4_2466	29	0.55
varlen2_uc029ygd.1_73	34	0.35
varlen2_uc007kxx.1_2963	69	0.36
varlen2_uc009wph.3_423	85	0.51
varlen2_uc010osx.2_1007	199	0.41
varlen2_uc001agr.3_7	49	0.84
varlen2_uc001pmn.3_3476	289	0.39
varlen2_uc003khi.3_3	459	0.45
varlen2_uc021qfk.1>2_1472	76	0.34
varlen2_uc011moe.2_5914	373	0.36
varlen2_uc003hyy.2_273	16	0.44
varlen2_uc007nte.2_374	448	0.46
varlen2_uc007fws.1_377	18	0.56
varlen2_uc003pij.1_129	26	0.27
varlen2_uc002wkt.1_1569	92	0.52
varlen2_uc010suq.2_3895	491	0.4
varlen2_uc003yos.2_1634	241	0.42

Bonus question: K-mer generation (+2 bonus points)

This question is optional, but if you complete it, I'll give you two bonus points. You won't lose points if you skip it.

Create a function called get_kmers that takes a single integer parameter, k, and returns a list of all possible k-mers of A/T/G/C. For example, if the supplied k was 2, you would generate all possible 2-mers, i.e. [AA, AT, AG, AC, TA, TT, TG, TC, GA, GT, GG, GC, CA, CT, CG, CC].

Notes:

  • This function must be generic, in the sense that it can take any integer value of k and produce the corresponding set of k-mers.
  • As there are $4^k$ possible k-mers for a given k, stick to smaller values of k for testing!!
  • I have not really taught you any particularly obvious way to solve this problem, so feel free to get creative in your solution!

There are many ways to do this, and plenty of examples online. Since the purpose of this question is to practice problem solving, don't directly look up "k-mer generation"... try to figure it out yourself. You're free to look up more generic things, though.


In [36]:
# Method 1
# Generic kmer generation for any k and any alphabet (default is DNA nt)
# Pretty fast
def get_kmers1(k, letters=['A','C','G','T']):
    kmers = []
    choices = len(letters)
    finalNum = choices ** k
    
    # initialize to blank strings
    for i in range(finalNum):
        kmers.append("")
    
    # imagining the kmers lined up vertically, generate one "column" at a time
    for i in range(k): 
        consecReps = choices ** (k - (i + 1))   #number of times to consecutively repeat each letter
        patternReps = choices ** i #number of times to repeat pattern of letters
        
        # create the current column of letters
        index = 0
        for j in range(patternReps):
            for m in range(choices):
                for n in range(consecReps):
                    kmers[index] += letters[m]
                    index += 1
            
    return kmers

In [37]:
get_kmers1(3)


Out[37]:
['AAA',
 'AAC',
 'AAG',
 'AAT',
 'ACA',
 'ACC',
 'ACG',
 'ACT',
 'AGA',
 'AGC',
 'AGG',
 'AGT',
 'ATA',
 'ATC',
 'ATG',
 'ATT',
 'CAA',
 'CAC',
 'CAG',
 'CAT',
 'CCA',
 'CCC',
 'CCG',
 'CCT',
 'CGA',
 'CGC',
 'CGG',
 'CGT',
 'CTA',
 'CTC',
 'CTG',
 'CTT',
 'GAA',
 'GAC',
 'GAG',
 'GAT',
 'GCA',
 'GCC',
 'GCG',
 'GCT',
 'GGA',
 'GGC',
 'GGG',
 'GGT',
 'GTA',
 'GTC',
 'GTG',
 'GTT',
 'TAA',
 'TAC',
 'TAG',
 'TAT',
 'TCA',
 'TCC',
 'TCG',
 'TCT',
 'TGA',
 'TGC',
 'TGG',
 'TGT',
 'TTA',
 'TTC',
 'TTG',
 'TTT']

In [ ]:
# Method 2 
# Generate numbers, discard any that aren't 1/2/3/4's, convert to letters. 
# Super slow~
def get_kmers2(k):
    discard = ["0", "5", "6", "7", "8", "9"]
    convert = {"1": "A", "2": "T", "3": "G", "4": "C"}
    min = int("1" * k)
    max = int("4" * k)
    kmers = []
    tmp = []
    for num in range(min, (max + 1)): # generate numerical kmers
        good = True
        for digit in str(num):
            if digit in discard:
                good = False
                break
        if good == True:
            tmp.append(num)        
    for num in tmp: # convert numerical kmers to ATGC
        result = ""
        for digit in str(num):
            result += convert[digit]
        kmers.append(result)
    
    return kmers

In [ ]:
# Method 3 (by Nate)
# A recursive solution. Fast!
# (A recursive function is a function that calls itself)
def get_kmers3(k):
    nt = ['A', 'T', 'G', 'C']
    k_mers = []

    if k == 1:
        return nt

    else:
        for i in get_kmers3(k - 1):
            for j in nt:
                k_mers.append(i + j)
    return k_mers

In [ ]:
# Method 4 (by Nate)
# Fast
def get_kmers4(k):
    nt = ['A', 'T', 'G', 'C']
    k_mers = []
    total_kmers = len(nt)**k

    # make a list of size k with all zeroes. 
    # this keeps track of which base we need at each position 
    pointers = []
    for p in range(k):
        pointers.append(0)

    for k in range(total_kmers):

        # use the pointers to generate the next k-mer
        k_mer = ""
        for p in pointers:
            k_mer += nt[p]
        k_mers.append(k_mer)

        # get the pointers ready for the next k-mer by updating them left to right
        pointersUpdated = False
        i = 0
        while not pointersUpdated and i < len(pointers):
            if pointers[i] < len(nt) - 1:
                pointers[i] += 1
                pointersUpdated = True
            else:
                pointers[i] = 0
                i += 1
    
    return k_mers

In [ ]:
# Method 5 (by Justin Becker, bootcamp 2013)
# Fast!
def get_kmers5(k):   #function requires int as an argument
    kmers = [""]
    for i in range(k):   #after each loop, kmers will store the complete set of i-mers
        currentNumSeqs = len(kmers)
        for j in range(currentNumSeqs): #each loop takes one i-mer and converts it to 4 (i+1)=mers
            currentSeq = kmers[j]
            kmers.append(currentSeq + 'C')
            kmers.append(currentSeq + 'T')
            kmers.append(currentSeq + 'G')
            kmers[j] += 'A'
    return kmers

In [ ]:
# Method 6 (by Nick)
# Convert to base-4
def get_kmers6(k):
    bases = ['a', 'g', 'c', 't']
    kmers = []

    for i in range(4**k):
        digits = to_base4(i, k)
        mystr = ""
        for baseidx in digits:
            mystr += bases[baseidx]

        kmers.append(mystr)

    return kmers

# convert num to a k-digit base-4 int
def to_base4(num, k):

    digits = []

    while k > 0:

        digits.append(num/4**(k-1))

        num %= 4**(k-1)
        k -= 1

    return digits

In [ ]:
# Below: more from Nate
import random
import time
alphabet = ['A', 'C', 'G', 'T']

## Modulus based
def k_mer_mod(k):
    k_mers = []
    for i in range(4**k):
        k_mer = ''
        for j in range(k):
            k_mer = alphabet[(i/4**j) % 4]+ k_mer
        k_mers.append(k_mer)
    return k_mers


## maybe the range operator slows things down by making a big tuple
def k_mer_mod_1(k):
    k_mers = []
    total = 4**k
    i = 0
    while i < total:
        k_mer = ''
        for j in range(k):
            k_mer = alphabet[(i/4**j) % 4]+ k_mer
        k_mers.append(k_mer)
        i += 1
    return k_mers



## Does initializing the list of k_mers help?
def k_mer_mod_2(k):
    k_mers = [''] * 4**k
    for i in range(4**k):
        k_mer = ''
        for j in range(k):
            k_mer = alphabet[(i/4**j) % 4] + k_mer
        k_mers[i] = k_mer
    return k_mers


## What's faster? element assignment or hashing?
def k_mer_mod_set(k):
    k_mers = set() 
    for i in range(4**k):
        k_mer = ''
        for j in range(k):
            k_mer = alphabet[(i/4**j) % 4] + k_mer
        k_mers.add(k_mer)
    return list(k_mers)


## does creating the string up front help?
#def k_mer_mod_3(k):
#n    k_mers = []
#    k_mer = "N" * k    
#    for i in range(4**k):
#        for j in range(k):
#            k_mer[j] = alphabet[(i/4**j) % 4]
#        k_mers.append(k_mer)
#    return k_mers
# Nope! String are immutable, dummy!


# maybe we can do something tricky with string substitution
def k_mer_mod_ssub(k):
    template = "\%s" * k
    k_mers = []

    for i in range(4**k):
        k_mer = []
        for j in range(k):
            k_mer.append(alphabet[(i/4**j) % 4])
        k_mers.append(template % k_mer)
    return k_mers


# what about using a list?
def k_mer_mod_4(k):
    k_mers = [''] * 4**k
    k_mer = [''] * k
    
    for i in range(4**k):
        for j in range(k):
            k_mer[j] = alphabet[(i/4**j) % 4]
        k_mers[i] = "".join(k_mer)
    return k_mers

## recursive version
def k_mer_recursive(k):
    if k == 0:
        return ['']
    else:
        k_mers = []
        for k_mer in k_mer_recursive(k-1):
            for n in alphabet:
                k_mers.append("%s%s" % (k_mer, n))
        return k_mers

             
## That works, but what I wanted to be like, really obnoxious about it
def k_mer_recursive_2(k):
    if k == 0:
        return ['']
    else:
        k_mers = []
        [[k_mers.append("%s%s" % (k_mer, n)) for n in alphabet] for k_mer in k_mer_recursive_2(k-1)]
        return k_mers

# using list instead of strings to store the k_mers
def k_mer_recursive_3(k, j = False):
    if k == 0:
        return [[]]
    else:
        k_mers = []
        [[k_mers.append((k_mer + [n])) if j else k_mers.append("".join(k_mer + [n])) for n in alphabet] for k_mer in k_mer_recursive_3(k-1, True)]
        return k_mers

## stochastic (I have a good feeling about this one!)
def k_mer_s(k):
    s = set()
    i = 0
    while i < 4**k:
        k_mer = ''
        for j in range(k):
            k_mer = k_mer + random.choice(alphabet)
        if k_mer not in s:
            s.add(k_mer)
            i += 1
    return list(s)


## I sure hope this works because now we're pretty much cheating
import array
def k_mer_mod_array(k):

    k_mers = [] 
    k_mer = array.array('c', ['N'] * k)
    
    for i in range(4**k):
        for j in range(k):
            k_mer[j] = alphabet[(i/4**j) % 4]
        k_mers.append("".join(k_mer))
    return k_mers
## That could have gone better.


Extra problems (0pts)

(A) Create a function that counts the number of occurences of each nt in a specified string. Your function should accept a nucleotide string as a parameter, and should return a dictionary with the counts of each nucleotide (where the nt is the key and the count is the value).


In [26]:
def nt_counts(seq):
    counts = {}
    for nt in seq:
        if nt not in counts:
            counts[nt] = 1
        else:
            counts[nt] += 1
    
    return counts

In [27]:
nt_counts("AAAAATTTTTTTGGGGC")


Out[27]:
{'A': 5, 'C': 1, 'G': 4, 'T': 7}

(B) Create a function that generates a random nt sequence of a specified length with specified nt frequencies. Your function should accept as parameters:

  • a length
  • a dictionary of nt frequences.

and should return the generated string. You'll need to figure out a way to use the supplied frequencies to generate the sequence.

An example of the nt freq dictionary could be: {'A':0.60, 'G':0.10, 'C':0.25, 'T':0.05}


In [30]:
def generate_nucleotide(length, freqs):
    import random
    seq = ""
    samplingStr = ""
    
    # maybe not the best way to do this, but fun:
    # create a list with the indicated freq of nt
    for nt in freqs:
        occurPer1000 = int(1000*freqs[nt])
        samplingStr += nt*occurPer1000
    samplingList = list(samplingStr)

    # sample from the list
    for i in range(length):
        newChar = random.choice(samplingList)
        seq += newChar

    return seq

In [31]:
generate_nucleotide(100, {'A':0.60, 'G':0.10, 'C':0.25, 'T':0.05})


Out[31]:
'ACAAAACTGCAACAAGACACAAACACGACACAAAAACAAAACACGCAAAAAAAAAAAACACCAAAGCTAAACCAACCAAATTGAAAAAAAGCAGAAAAAA'

In [35]:
# let's check if it's really working
n = 10000
testSeq = generate_nucleotide(n, {'A':0.60, 'G':0.10, 'C':0.25, 'T':0.05})
obsCounts = nt_counts(testSeq)
for nt in obsCounts:
    print nt, float(obsCounts[nt]) / n


A 0.5941
C 0.2568
T 0.0457
G 0.1034

Looks good!