Earning points (optional)
Email your .ipynb file to me (sarahmid@mail.med.upenn.edu) before 9:00 am on 9/20.
Name:
In [ ]:
ages = [65, 34, 96, 47]
print len(ages)
Your guess:
In [ ]:
names = ["Wilfred", "Manfred", "Wadsworth", "Jeeves"]
ages = [65, 34, 96, 47]
print len(ages) == len(names)
Your guess:
In [ ]:
ages = [65, 34, 96, 47]
for hippopotamus in ages:
    print hippopotamus
Your guess:
In [ ]:
ages = [65, 34, 96, 47]
print ages[1:3]
Your guess:
In [ ]:
names = ["Wilfred", "Manfred", "Wadsworth", "Jeeves"]
if "Willard" not in names:
    names.append("Willard")
print names
Your guess:
In [ ]:
names = ["Wilfred", "Manfred", "Wadsworth", "Jeeves"]
ages = [65, 34, 96, 47]
for i in range(len(names)):
    print names[i], "is", ages[i]
Your guess:
In [ ]:
ages = [65, 34, 96, 47]
ages.sort()
print ages
Your guess:
In [ ]:
ages = [65, 34, 96, 47]
ages = ages.sort()
print ages
Your guess:
Remember that .sort() is an in-place method: it reorders the list itself, and its return value is None. (None is a special value in Python that basically means "null". It's sometimes used as a placeholder when we don't want to give something a value.)
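To make the difference concrete, here is a small sketch contrasting .sort() with the built-in sorted(). The list scores and the variable names are made up for illustration, and print(...) is written with a single argument so the code runs under both the Python 2 used in this notebook and Python 3.

```python
scores = [3, 1, 2]

# sorted() leaves the original list alone and returns a new, sorted list
sorted_copy = sorted(scores)

# .sort() reorders the list itself and returns None
result = scores.sort()

print(sorted_copy)  # the new list made by sorted()
print(scores)       # the original list, now sorted in place
print(result)       # None, because .sort() has no useful return value
```

If you want to keep the original order around, sorted() is usually the safer choice; if you only need the sorted version, call .sort() on its own line without assigning the result.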
In [ ]:
ages = [65, 34, 96, 47]
print max(ages)
Your guess:
In [ ]:
cat = "Mitsworth"
for i in range(len(cat)):
    print cat[i]
Your guess:
In [ ]:
cat = "Mitsworth"
print cat[:4]
Your guess:
In [ ]:
str1 = "Good morning, Mr. Mitsworth."
parts = str1.split()
print parts
print str1
Your guess:
In [ ]:
str1 = "Good morning, Mr. Mitsworth."
parts = str1.split(",")
print parts
Your guess:
In [ ]:
names = ["Wilfred", "Manfred", "Wadsworth", "Jeeves"]
print names[-1]
Your guess:
In [ ]:
oldList = [2, 2, 6, 1, 2, 6]
newList = []
for item in oldList:
    if item not in newList:
        newList.append(item)
print newList
Your guess:
In [ ]:
# run this first!
geneNames = ["Ppia", "Gria2", "Mecp2", "Omd", "Zfp410", "Hsp1", "Mtap1a", "Cfl1",
             "Slc25a40", "Dync1i1", "V1ra4", "Fmnl1", "Mtap2", "Atp5b", "Olfr259",
             "Atf3", "Vapb", "Dhx8", "Slc22a15", "Orai3", "Ifitm7", "Kcna2", "Timm23", "Shank1"]
(A) (1pt) Replace the 12th element in geneNames with the string "Camk2a".
In [ ]:
(B) (1pt) Add the string "Shank3" to the end of the geneNames list.
In [ ]:
(C) (1pt) Using a loop, print the elements in geneNames that start with "S" or "O".
In [ ]:
(A) (1pt) Print characters 3-9 of the string stored in the variable magicWords using string slicing. (You should get "acadabr")
In [ ]:
magicWords = "abracadabra"
(B) (1pt) Prompt the user to input a sentence using raw_input(). Print each word of their sentence on a separate line.
In [ ]:
(C) (1pt) Using raw_input(), prompt the user to enter some good ideas for cat names (have them enter the names on one line, separating each name with a comma). Separate the names so that each one is a separate element of a list. Then choose one randomly (using the random module) and print it out.
In [ ]:
(A) (1pt) Read the file and print to the screen only the gene IDs that contain "uc007".
[ Check your answer ] You should end up printing 14 gene IDs.
In [ ]:
(B) (1 pt) Print to the screen only unique gene IDs (remove the duplicates). Do not assume repeat IDs appear consecutively in the file.
Hint: see problem 1 for an example of duplicate checking.
[ Check your answer ] You should end up printing 49 gene IDs.
In [ ]:
(C) (1 pt) Print to the screen only the gene IDs that are still unique after removing the ".X" suffix (where X is a number).
[ Check your answer ] You should end up printing 46 gene IDs.
In [ ]:
(A) (2 pt) Write a script that reads this file and computes the average CDS length (i.e. average the values in the 7th column).
[ Check your answer ] You should get an average of 236.36.
In [ ]:
(B) (2 pt) Write a script that reads this file and prints the "Init Context" from each line (i.e. the 6th column) if and only if the "Codon" column (column 12) is "aug" for that line.
[ Check your answer ] You should print 38 init contexts.
In [ ]:
(C) (0 pt) For fun (?), copy and paste your output from (B) into http://weblogo.berkeley.edu/logo.cgi to create a motif logo of the sequence around these initiation sites. Which positions and nucleotides seem to be most common?
Your answer:
Here you will extract and print the data from init_sites.txt that corresponds to genes with high expression. There isn't gene expression data in init_sites.txt, so we'll have to integrate information from another file.
Use gene_expr.txt to create a list of genes with high expression. We'll say high expression is anything >= 50.
Then read init_sites.txt and print the GeneName (2nd column) and PeakScore (11th column) from any line that matches an ID in your high-expression list.
[ Check your answer ] There should be 10 lines corresponding to high-expression genes that you print info about. Your average peak scores should be 4.371 and 4.39325 for high and non-high expression genes, respectively.
In [ ]:
A common situation that arises in data analysis is that we have a list of data points and we would like to compare each data point to each other data point. Here, we will write a script that computes the "distance" between each pair of strings in a file and outputs a distance matrix. We will define "distance" between two strings as the number of mismatches between two strings when they are lined up, divided by their length.
First we'll use a toy dataset. We'll create a list as follows:
things = [1, 2, 5, 10, 25, 50]
We'll start off by doing a very simple type of pairwise comparison: taking the numerical difference between two numbers. To systematically do this for all possible pairs of numbers in our list, we can make a nested for loop:
In [ ]:
things = [1, 2, 5, 10, 25, 50]
for i in range(len(things)):
    for j in range(len(things)):
        print abs(things[i] - things[j])  # absolute value of the difference
Try running this code yourself and observe the output. Everything prints out on its own line, which isn't what we want -- we'd usually prefer a matrix-type format. Try this slightly modified code:
In [ ]:
things = [1, 2, 5, 10, 25, 50]
for i in range(len(things)):
    for j in range(len(things)):
        print abs(things[i] - things[j]), "\t",
    print ""
This gives us the matrix format we want. Make sure you understand how this code works. FYI, the "\t" is a tab character, and much like "\n", it is invisible once it's printed (it becomes a tab). The comma at the very end of the print statement suppresses the \n that print usually adds on to the end.
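If the trailing-comma trick feels opaque, one alternative (a sketch, not the approach this notebook asks for) is to build each row as a string and print it once. The names cells and rows are illustrative, and the code is written to run under both Python 2 and Python 3.

```python
things = [1, 2, 5, 10, 25, 50]

rows = []
for i in range(len(things)):
    cells = []
    for j in range(len(things)):
        # store each difference as a string so the row can be joined later
        cells.append(str(abs(things[i] - things[j])))
    # "\t".join puts a tab between the cells of one row
    rows.append("\t".join(cells))

# one row per line
print("\n".join(rows))
```

The only visible difference from the print-with-comma version is that the rows here do not end with a trailing tab.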
So now we know how to do an all-against-all comparison. But how do we compute the number of mismatches between strings? As long as the strings are the same length, we can do something simple like the following:
In [ ]:
str1 = "Wilfred"
str2 = "Manfred"
diffs = 0
for k in range(len(str1)):
    if str1[k] != str2[k]:  # compare the two strings at the same index
        diffs = diffs + 1
print "dist =", round(float(diffs) / len(str1), 2)
So this outputs the distance between the two strings, where the distance is defined as the fraction of the sequence length that is mismatched.
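The same calculation can also be wrapped in a function, which makes it easier to reuse for many pairs. This is only a sketch: the function name mismatch_fraction and the example DNA-like strings are made up for illustration, and the length check is an added assumption about how unequal strings should be handled.

```python
def mismatch_fraction(a, b):
    """Fraction of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must be the same length")
    diffs = 0
    # zip pairs up the characters at the same index in both strings
    for char_a, char_b in zip(a, b):
        if char_a != char_b:
            diffs += 1
    return round(float(diffs) / len(a), 2)

# these two strings differ at 2 of their 7 positions
print(mismatch_fraction("GATTACA", "GACTATA"))
```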
Using these two pieces of code as starting points, complete the following:
(A) (2 pt) Create a list of a few short strings of the same length. For example:
things = ["bear", "pear", "boar", "tops", "bops"]
Write code that prints a distance matrix for this list. As in the last example, use the fraction of mismatches between a given pair of words as the measure of their "distance" from each other. Round the distances to 2 decimals.
In [ ]:
(B) (2 pt) Now, instead of using a hard coded list like you did in (A), create a list of DNA sequences by reading in the file sequences2.txt. Compute the distance matrix between these sequences and print the distance matrix. Looking at this matrix, do you see a pair of sequences that are much less "distant" from each other than all the rest?
In [ ]:
(A) Following from problem 7 above: As you can see, the distance matrix you made in (B) is symmetrical around the diagonal. This means dist(i,j) is the same as dist(j,i), so we're doing some redundant calculations.
Change the code so that we don't do any unnecessary calculations (including comparing a sequence to itself, which is always 0). For any calculations you skip, you can print "-" or some other placeholder to keep the printed matrix looking neat.
Hint: There's a really simple way to do this! Think about the range of the second loop...
In [ ]:
(B) Below is a loop that creates a list. Do the same thing but with a list comprehension instead.
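Before the exercise's own loop, here is a reminder of the general pattern on an unrelated example. The variable names are illustrative only, and this is a sketch of the syntax rather than an answer to the exercise.

```python
# Loop version: start with an empty list and append inside the loop
squares_loop = []
for n in range(5):
    squares_loop.append(n * n)

# List comprehension version: [expression for item in sequence]
squares_comp = [n * n for n in range(5)]

# An optional "if" at the end filters which items are kept
odd_strs = [str(n) for n in range(10) if n % 2 == 1]

print(squares_comp)
print(odd_strs)
```

The expression at the front is whatever you would have passed to .append() in the loop version.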
In [ ]:
# Loop version:
import random
randomNums = []
for i in range(100):
    randomNums.append(random.randint(0, 10))
print randomNums
In [ ]:
# Your list comprehension version here:
(C) Below is a loop that creates a list. Do the same thing but with a list comprehension instead.
In [ ]:
# Loop version:
import random
randomNums = []
for i in range(100):
    randNum = random.randint(0, 10)
    if (randNum % 2) == 0:
        randNumStr = str(randNum)
        randomNums.append(randNumStr)
print randomNums
In [ ]:
# Your list comprehension version here: