Programming Bootcamp 2016

Lesson 4 Exercises

Earning points (optional)

Enter your name below.
Email your .ipynb file to me (sarahmid@mail.med.upenn.edu) before 9:00 am on 9/20.
You do not need to complete all the problems to get points.
I will give partial credit for effort when possible.
At the end of the course, everyone who gets at least 90% of the total points will get a prize (bootcamp mug!).

Name:

1. Guess the output: list practice (1pt)

For the following blocks of code, first try to guess what the output will be, and then run the code yourself. Points will be given for filling in the guesses; guessing wrong won't be penalized.



In [ ]:

    
ages = [65, 34, 96, 47]

print len(ages)

Your guess:



In [ ]:

    
names = ["Wilfred", "Manfred", "Wadsworth", "Jeeves"]
ages = [65, 34, 96, 47]

print len(ages) == len(names)

Your guess:



In [ ]:

    
ages = [65, 34, 96, 47]

for hippopotamus in ages:
    print hippopotamus

Your guess:



In [ ]:

    
ages = [65, 34, 96, 47]

print ages[1:3]

Your guess:



In [ ]:

    
names = ["Wilfred", "Manfred", "Wadsworth", "Jeeves"]

if "Willard" not in names:
    names.append("Willard")
print names

Your guess:



In [ ]:

    
names = ["Wilfred", "Manfred", "Wadsworth", "Jeeves"]
ages = [65, 34, 96, 47]

for i in range(len(names)):
    print names[i],"is",ages[i]

Your guess:



In [ ]:

    
ages = [65, 34, 96, 47]

ages.sort()
print ages

Your guess:



In [ ]:

    
ages = [65, 34, 96, 47]

ages = ages.sort()
print ages

Your guess:

Remember that .sort() is an in-place method: it reorders the list itself and returns None. (None is a special value in Python that basically means "null". It's sometimes used as a placeholder when we don't want to give something a value.) Assigning the result of ages.sort() back to ages therefore replaces the list with None.
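To see the difference side by side, here is a minimal sketch on made-up data (the variable names scores, original, and ordered are just for illustration). It contrasts the in-place .sort() method with the built-in sorted() function, which returns a new list and leaves the original alone. The print calls are written with parentheses and a single argument so they behave the same in Python 2 and Python 3.

```python
# Hypothetical data, separate from the exercise above.
scores = [3, 1, 2]

result = scores.sort()      # sorts scores in place; the method itself returns None
print(result is None)       # True
print(scores)               # [1, 2, 3]

original = [3, 1, 2]
ordered = sorted(original)  # the built-in sorted() returns a new sorted list
print(ordered)              # [1, 2, 3]
print(original)             # [3, 1, 2] (unchanged)
```

If you want to keep the unsorted list around, sorted() is usually the safer choice.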



In [ ]:

    
ages = [65, 34, 96, 47]

print max(ages)

Your guess:



In [ ]:

    
cat = "Mitsworth"

for i in range(len(cat)):
    print cat[i]

Your guess:



In [ ]:

    
cat = "Mitsworth"

print cat[:4]

Your guess:



In [ ]:

    
str1 = "Good morning, Mr. Mitsworth."

parts = str1.split()
print parts
print str1

Your guess:



In [ ]:

    
str1 = "Good morning, Mr. Mitsworth."

parts = str1.split(",")
print parts

Your guess:



In [ ]:

    
names = ["Wilfred", "Manfred", "Wadsworth", "Jeeves"]

print names[-1]

Your guess:



In [ ]:

    
oldList = [2, 2, 6, 1, 2, 6]
newList = []

for item in oldList:
    if item not in newList:
        newList.append(item)
print newList

Your guess:

2. On your own: Lists (3pts)

Write code to accomplish each of the following tasks using list functions. Do not copy and paste the list to make changes. You must pretend you don't know what's in the list.



In [ ]:

    
# run this first!
geneNames = ["Ppia", "Gria2", "Mecp2", "Omd", "Zfp410", "Hsp1", "Mtap1a", "Cfl1", 
             "Slc25a40", "Dync1i1", "V1ra4", "Fmnl1", "Mtap2", "Atp5b", "Olfr259", 
             "Atf3", "Vapb", "Dhx8", "Slc22a15", "Orai3", "Ifitm7", "Kcna2", "Timm23", "Shank1"]

(A) (1pt) Replace the 12th element in geneNames with the string "Camk2a".



In [ ]:

(B) (1pt) Add the string "Shank3" to the end of the geneNames list.



In [ ]:

(C) (1pt) Using a loop, print the elements in geneNames that start with "S" or "O".



In [ ]:

3. On your own: Strings & Splitting (3pts)

Write code to accomplish each of the following tasks using string slicing and splitting.

(A) (1pt) Using string slicing, print the characters at indices 3 through 9 (counting from 0) of the string stored in the variable magicWords. (You should get "acadabr".)



In [ ]:

    
magicWords = "abracadabra"

(B) (1pt) Prompt the user to input a sentence using raw_input(). Print each word of their sentence on a separate line.



In [ ]:

(C) (1pt) Using raw_input(), prompt the user to enter some good ideas for cat names (have them enter the names on one line, separating each name with a comma). Separate the names so that each one is a separate element of a list. Then choose one randomly (using the random module) and print it out.
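One thing to watch for when splitting on commas: people often type a space after each comma, and .split(",") keeps those spaces. A small sketch on a made-up string (not part of the exercise) shows the problem and one possible cleanup using .strip():

```python
line = "red, green ,blue"        # hypothetical input, not from the exercise
raw = line.split(",")
print(raw)                       # ['red', ' green ', 'blue']

cleaned = []
for item in raw:
    cleaned.append(item.strip()) # strip() removes leading/trailing whitespace
print(cleaned)                   # ['red', 'green', 'blue']
```

Whether you need this step depends on how you want the final name to look when printed.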



In [ ]:

4. File reading and lists (3pts)

For this problem, use the file genes.txt provided on Piazza. It contains some gene IDs, one on each line.

(A) (1pt) Read the file and print to the screen only the gene IDs that contain "uc007".

[ Check your answer ] You should end up printing 14 gene IDs.



In [ ]:

(B) (1 pt) Print to the screen only unique gene IDs (remove the duplicates). Do not assume repeat IDs appear consecutively in the file.

Hint: see problem 1 for an example of duplicate checking.

[ Check your answer ] You should end up printing 49 gene IDs.



In [ ]:

(C) (1 pt) Print to the screen only the gene IDs that are still unique after removing the ".X" suffix (where X is a number).

[ Check your answer ] You should end up printing 46 gene IDs.



In [ ]:

5. Practice with `.split()` (4pts)

Use init_sites.txt to complete the following. This file contains a subset of translation initiation sites in mouse, identified by Ingolia et al. (Cell, 2011). Note that this file has a header, which you will want to skip over.
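As a reminder of the general pattern, here is a sketch of skipping a header and pulling one column out of each line. It uses a made-up three-column table stored in a list instead of a real file, and it assumes the columns are separated by tabs. Check the actual file to confirm its delimiter and layout. Note that "column N" in the instructions corresponds to index N-1 after splitting.

```python
# Hypothetical stand-in for lines read from a file. The real init_sites.txt
# has different columns, and a tab delimiter is an assumption here.
lines = [
    "id\tname\tlength\n",
    "g1\talpha\t120\n",
    "g2\tbeta\t80\n",
]

lengths = []
for line in lines[1:]:                      # slicing from 1 skips the header line
    fields = line.rstrip("\n").split("\t")  # drop the newline, then split on tabs
    lengths.append(int(fields[2]))          # 3rd column is index 2

print(lengths)                              # [120, 80]
```

With a real file you could get the same list of lines from .readlines(), or call .readline() once before the loop to consume the header.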

(A) (2 pt) Write a script that reads this file and computes the average CDS length (i.e. average the values in the 7th column).

[ Check your answer ] You should get an average of 236.36.



In [ ]:

(B) (2 pt) Write a script that reads this file and prints the "Init Context" from each line (i.e. the 6th column) if and only if the "Codon" column (column 12) is "aug" for that line.

[ Check your answer ] You should print 38 init contexts.



In [ ]:

(C) (0 pt) For fun (?), copy and paste your output from (B) into http://weblogo.berkeley.edu/logo.cgi to create a motif logo of the sequence around these initiation sites. Which positions/nucleotides seem to be most common?

Your answer:

6. Cross-referencing (4pts)

Here you will extract and print the data from init_sites.txt that corresponds to genes with high expression. There isn't gene expression data in init_sites.txt, so we'll have to integrate information from another file.

First, use gene_expr.txt to create a list of genes with high expression. We'll say high expression is anything >= 50.
Then read through init_sites.txt and print the GeneName (2nd column) and PeakScore (11th column) from any line that matches an ID in your high-expression list.
Finally, separately compute the average PeakScore for high expression genes and non-high expression genes. Print both averages to the screen.

[ Check your answer ] There should be 10 lines corresponding to high-expression genes that you print info about. Your average peak scores should be 4.371 and 4.39325 for high and non-high expression genes, respectively.



In [ ]:

7. All-against-all comparisons (4pts)

A common situation that arises in data analysis is that we have a list of data points and we would like to compare each data point to each other data point. Here, we will write a script that computes the "distance" between each pair of strings in a file and outputs a distance matrix. We will define "distance" between two strings as the number of mismatches between two strings when they are lined up, divided by their length.

First we'll use a toy dataset. We'll create a list as follows:

things = [1, 2, 5, 10, 25, 50]

We'll start off by doing a very simple type of pairwise comparison: taking the numerical difference between two numbers. To systematically do this for all possible pairs of numbers in our list, we can make a nested for loop:



In [ ]:

    
things = [1, 2, 5, 10, 25, 50]

for i in range(len(things)):
    for j in range(len(things)):
        print abs(things[i] - things[j]) #absolute value of the difference

Try running this code yourself and observe the output. Everything prints out on its own line, which isn't what we want -- we'd usually prefer a matrix-type format. Try this slightly modified code:



In [ ]:

    
things = [1, 2, 5, 10, 25, 50]

for i in range(len(things)):
    for j in range(len(things)):
        print abs(things[i] - things[j]), "\t",
    print ""

This gives us the matrix format we want. Make sure you understand how this code works. FYI, the "\t" is a tab character, and much like "\n", it is invisible once it's printed (it becomes a tab). The comma at the very end of the print statement suppresses the \n that print usually adds on to the end.
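The trailing-comma trick only works with the Python 2 print statement. An alternative that doesn't rely on it is to collect the values for one row in a list and glue them together with "\t".join(). The sketch below does that for the same things list. The names cells and rows are just illustrative. Unlike the version above, it doesn't leave a trailing tab at the end of each row.

```python
things = [1, 2, 5, 10, 25, 50]

rows = []
for i in range(len(things)):
    cells = []
    for j in range(len(things)):
        # join() needs strings, so convert each difference with str()
        cells.append(str(abs(things[i] - things[j])))
    row = "\t".join(cells)   # one tab between neighboring values
    rows.append(row)
    print(row)
```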

So now we know how to do an all-against-all comparison. But how do we compute the number of mismatches between strings? As long as the strings are the same length, we can do something simple like the following:



In [ ]:

    
str1 = "Wilfred"
str2 = "Manfred"
diffs = 0

for k in range(len(str1)):
    if str1[k] != str2[k]: #compare the two strings at the same index
        diffs = diffs + 1
        
print "dist =", round(float(diffs) / len(str1), 2)

So this outputs the distance between the two strings, where the distance is defined as the fraction of the sequence length that is mismatched.
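If you want to reuse this calculation for many pairs, one option is to wrap it in a function. The sketch below is one way to do it, not the only way: the function name mismatch_fraction is made up, it uses zip() to walk through both strings together instead of indexing, and the choices to raise an error for unequal lengths and to return 0.0 for two empty strings are assumptions.

```python
def mismatch_fraction(a, b):
    """Return the fraction of positions at which equal-length strings a and b differ."""
    if len(a) != len(b):
        raise ValueError("strings must be the same length")
    if len(a) == 0:
        return 0.0               # assumption: two empty strings have distance 0
    diffs = 0
    for ch_a, ch_b in zip(a, b): # zip pairs up characters at the same index
        if ch_a != ch_b:
            diffs += 1
    return float(diffs) / len(a)

dist = mismatch_fraction("Wilfred", "Manfred")
print(round(dist, 2))            # 0.43
```

Here 3 of the 7 positions differ (W/M, i/a, l/n), so the distance is 3/7, which rounds to 0.43.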

Using these two pieces of code as starting points, complete the following:

(A) (2 pt) Create a list of a few short strings of the same length. For example:

things = ["bear", "pear", "boar", "tops", "bops"]

Write code that prints a distance matrix for this list. As in the last example, use the fraction of mismatches between a given pair of words as the measure of their "distance" from each other. Round the distances to 2 decimals.



In [ ]:

(B) (2 pt) Now, instead of using a hard-coded list like you did in (A), create a list of DNA sequences by reading in the file sequences2.txt. Compute the distance matrix between these sequences and print it. Looking at this matrix, do you see a pair of sequences that are much less "distant" from each other than all the rest?



In [ ]:

Extra questions (0pts)

These questions are for people who would like extra practice. They will not be counted for points.

(A) Following from problem 7 above: As you can see, the distance matrix you made in (B) is symmetrical around the diagonal. This means dist(i,j) is the same as dist(j,i), so we're doing some redundant calculations.

Change the code so that we don't do any unnecessary calculations (including comparing a sequence to itself, which is always 0). For any calculations you skip, you can print "-" or some other placeholder to keep the printed matrix looking neat.

Hint: There's a really simple way to do this! Think about the range of the second loop...



In [ ]:

(B) Below is a loop that creates a list. Do the same thing but with a list comprehension instead.



In [ ]:

    
# Loop version:
import random
randomNums = []

for i in range(100):
    randomNums.append(random.randint(0,10))
    
print randomNums
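As a syntax reminder, here are two list comprehensions on unrelated, made-up data: one that transforms every item, and one that keeps only the items passing a test. The names nums, squares, and big are just for illustration.

```python
nums = [1, 2, 3, 4]

squares = [n * n for n in nums]    # [expression for item in list]
big = [n for n in nums if n > 2]   # an optional "if" at the end filters items

print(squares)  # [1, 4, 9, 16]
print(big)      # [3, 4]
```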



In [ ]:

    
# Your list comprehension version here:

(C) Below is a loop that creates a list. Do the same thing but with a list comprehension instead.



In [ ]:

    
# Loop version:
import random
randomNums = []

for i in range(100):
    randNum = random.randint(0,10)
    if (randNum % 2) == 0:
        randNumStr = str(randNum)
        randomNums.append(randNumStr)
    
print randomNums



In [ ]:

    
# Your list comprehension version here:
