In [45]:
ages = [65, 34, 96, 47]
print len(ages)
In [46]:
names = ["Wilfred", "Manfred", "Wadsworth", "Jeeves"]
ages = [65, 34, 96, 47]
print len(ages) == len(names)
In [47]:
ages = [65, 34, 96, 47]
for hippopotamus in ages:
    print hippopotamus
In [48]:
ages = [65, 34, 96, 47]
print ages[1:3]
In [49]:
names = ["Wilfred", "Manfred", "Wadsworth", "Jeeves"]
if "Willard" not in names:
    names.append("Willard")
print names
In [50]:
names = ["Wilfred", "Manfred", "Wadsworth", "Jeeves"]
ages = [65, 34, 96, 47]
for i in range(len(names)):
    print names[i], "is", ages[i]
In [51]:
ages = [65, 34, 96, 47]
ages.sort()
print ages
In [52]:
ages = [65, 34, 96, 47]
ages = ages.sort()
print ages
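Remember that .sort() sorts the list in place and returns None. (None is a special value in Python that basically means "null". It's sometimes used as a placeholder when we don't want to give something a value.) Assigning the result of ages.sort() back to ages therefore replaces the list with None, which is why the code above prints None.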
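If you do want a sorted copy that you can assign to a variable, the built-in sorted() function is one option. The short sketch below contrasts the two; the names sorted_ages and result are just illustrative. The single-argument print(...) form is used so the snippet behaves the same in Python 2 and Python 3.

```python
ages = [65, 34, 96, 47]

# sorted() returns a new sorted list and leaves the original untouched
sorted_ages = sorted(ages)
print(sorted_ages)   # [34, 47, 65, 96]
print(ages)          # [65, 34, 96, 47]

# .sort() rearranges the list itself and returns None
result = ages.sort()
print(result)        # None
print(ages)          # [34, 47, 65, 96]
```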
In [53]:
ages = [65, 34, 96, 47]
print max(ages)
In [54]:
cat = "Mitsworth"
for i in range(len(cat)):
    print cat[i]
In [55]:
cat = "Mitsworth"
print cat[:4]
In [56]:
str1 = "Good morning, Mr. Mitsworth."
parts = str1.split()
print parts
print str1
In [57]:
str1 = "Good morning, Mr. Mitsworth."
parts = str1.split(",")
print parts
In [58]:
names = ["Wilfred", "Manfred", "Wadsworth", "Jeeves"]
print names[-1]
In [59]:
oldList = [2, 2, 6, 1, 2, 6]
newList = []
for item in oldList:
    if item not in newList:
        newList.append(item)
print newList
This is an example of how to remove duplicates.
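For long lists, checking "item not in newList" gets slow because Python has to scan the whole list each time. One possible variation, sketched below, keeps a set of items already seen alongside the list; the set (called seen here, an illustrative name) makes the membership check fast, while the list preserves the original order.

```python
oldList = [2, 2, 6, 1, 2, 6]
newList = []
seen = set()   # items we have already added to newList

for item in oldList:
    if item not in seen:
        seen.add(item)
        newList.append(item)

print(newList)  # [2, 6, 1]
```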
In [60]:
# run this first!
geneNames = ["Ppia", "Gria2", "Mecp2", "Omd", "Zfp410", "Hsp1", "Mtap1a", "Cfl1",
             "Slc25a40", "Dync1i1", "V1ra4", "Fmnl1", "Mtap2", "Atp5b", "Olfr259",
             "Atf3", "Vapb", "Dhx8", "Slc22a15", "Orai3", "Ifitm7", "Kcna2", "Timm23", "Shank1"]
(A) (1pt) Replace the 12th element in geneNames with the string "Camk2a".
In [61]:
geneNames[11] = "Camk2a"
(B) (1pt) Add the string "Shank3" to the end of the geneNames list.
In [62]:
geneNames.append("Shank3")
(C) (1pt) Using a loop, print the elements in geneNames that start with "S" or "O".
In [63]:
for gene in geneNames:
    firstChar = gene[0]
    if firstChar == "S" or firstChar == "O":
        print gene
(A) (1pt) Print characters 3-9 of the string stored in the variable magicWords using string slicing. (You should get "racadab")
In [64]:
magicWords = "abracadabra"
print magicWords[2:9]
(B) (1pt) Prompt the user to input a sentence using raw_input(). Print each word of their sentence on a separate line.
In [66]:
sentence = raw_input("Input a sentence: ")
splitSentence = sentence.split()
for word in splitSentence:
    print word
(C) (1pt) Using raw_input(), prompt the user to enter some good ideas for cat names (have them enter the names on one line, separating each name with a comma). Separate the names so that each one is a separate element of a list. Then choose one randomly (using the random module) and print it out.
In [67]:
import random
catNamesStr = raw_input("Input some cat names (separate by comma): ")
catNamesList = catNamesStr.split(",")
randNum = random.randint(0, len(catNamesList)-1)
print catNamesList[randNum]
Note: if you want to strip off the extra whitespace, you can do this:
print catNamesList[randNum].strip(" ")
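Building on the note above, another way to handle this is to strip every name right after splitting, so the whole list is clean, and then let random.choice() pick the element directly. This is only a sketch: the input string is hard-coded as a stand-in for what raw_input() would return, and cleanNames and chosen are illustrative names.

```python
import random

# hypothetical stand-in for the user's input
catNamesStr = "Mitsworth, Whiskers ,Tom"

cleanNames = []
for name in catNamesStr.split(","):
    cleanNames.append(name.strip())   # strip() with no argument removes all surrounding whitespace

print(cleanNames)   # ['Mitsworth', 'Whiskers', 'Tom']

chosen = random.choice(cleanNames)    # picks one element at random
print(chosen)
```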
(A) (1pt) Read the file and print to the screen only the gene IDs that contain "uc007".
[ Check your answer ] You should end up printing 14 gene IDs.
In [68]:
inFile = "genes.txt"
ins = open(inFile, 'r')
for line in ins:
    line = line.rstrip('\n')
    if "uc007" in line:
        print line
ins.close()
(B) (1 pt) Print to the screen only unique gene IDs (remove the duplicates). Do not assume repeat IDs appear consecutively in the file.
Hint: see problem 1 for an example of duplicate checking.
[ Check your answer ] You should end up printing 49 gene IDs.
In [69]:
inFile = "genes.txt"
geneList = []
ins = open(inFile, 'r')
for line in ins:
    line = line.rstrip('\n')
    if line not in geneList: #keep track of what we've already seen using a list
        geneList.append(line)
        print line
ins.close()
(C) (1 pt) Print to the screen only the gene IDs that are still unique after removing the ".X" suffix (where X is a number).
[ Check your answer ] You should end up printing 46 gene IDs.
In [70]:
inFile = "genes.txt"
geneList = []
ins = open(inFile, 'r')
for line in ins:
    line = line.rstrip('\n')
    splitLine = line.split(".") #could also use string slicing, line[:-2], if the suffix is always a single digit
    geneID = splitLine[0] #the splitting returns a list, and the geneID should be the first part
    if geneID not in geneList:
        geneList.append(geneID)
        print geneID
ins.close()
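The two ways of removing the suffix mentioned in the comment above behave differently on unusual IDs. The sketch below compares slicing with rsplit(".", 1), which splits only at the last period. The IDs are made up for illustration and are not taken from genes.txt.

```python
# made-up IDs for illustration only
geneIDs = ["uc007afd.1", "uc007afd.2", "uc008xyz.12", "uc009abc"]

bySlicing = []
byRsplit = []
for line in geneIDs:
    bySlicing.append(line[:-2])              # always drops the last two characters
    byRsplit.append(line.rsplit(".", 1)[0])  # drops everything after the last "."

print(bySlicing)  # ['uc007afd', 'uc007afd', 'uc008xyz.', 'uc009a']
print(byRsplit)   # ['uc007afd', 'uc007afd', 'uc008xyz', 'uc009abc']
```

Slicing goes wrong as soon as the suffix has two digits or is missing, so splitting on the period is the safer choice when you aren't sure what the file contains.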
(A) (2 pt) Write a script that reads this file and computes the average CDS length (i.e. average the values in the 7th column).
[ Check your answer ] You should get an average of 236.36.
In [71]:
fileName = "init_sites.txt"
totalLen = 0
numLines = 0
ins = open(fileName, 'r')
ins.readline()
for line in ins:
    line = line.rstrip('\n')
    lineParts = line.split()
    totalLen = totalLen + int(lineParts[6]) #all file input is read as string; must convert to int
    numLines = numLines + 1
print float(totalLen)/numLines
ins.close()
(B) (2 pt) Write a script that reads this file and prints the "Init Context" from each line (i.e. the 6th column) if and only if the "Codon" column (column 12) is "aug" for that line.
[ Check your answer ] You should print 38 init contexts.
In [72]:
fileName = "init_sites.txt"
ins = open(fileName, 'r')
for line in ins:
    line = line.rstrip('\n')
    lineParts = line.split()
    if lineParts[11] == "aug":
        print lineParts[5]
ins.close()
(C) (0 pt) For fun (?), copy and paste your output from (b) into http://weblogo.berkeley.edu/logo.cgi to create a motif logo of the sequence around these initiation sites. What positions/nt seem to be most common?
ATG at positions 4,5,6 (as expected, since this is the canonical initiation codon) and a preference for A/G at position 1
Here you will extract and print the data from init_sites.txt that corresponds to genes with high expression. There isn't gene expression data in init_sites.txt, so we'll have to integrate information from another file.
Use gene_expr.txt to create a list of genes with high expression. We'll say high expression is anything >= 50.
Then read init_sites.txt and print the GeneName (2nd column) and PeakScore (11th column) from any line that matches an ID in your high-expression list.
[ Check your answer ] There should be 10 lines corresponding to high-expression genes that you print info about. Your average peak scores should be 4.371 and 4.39325 for high and non-high expression genes, respectively.
In [73]:
initFile = "init_sites.txt"
exprFile = "gene_expr.txt"
# create a list of "high expression" genes
highExpr = []
ins = open(exprFile)
ins.readline() #skip header
for line in ins:
    line = line.rstrip('\n')
    lineParts = line.split()
    if float(lineParts[1]) >= 50:
        highExpr.append(lineParts[0])
ins.close()
# average the peak scores for high vs low genes
highExprTotal = 0
highExprCount = 0
lowExprTotal = 0
lowExprCount = 0
ins = open(initFile, 'r')
ins.readline() #skip header
for line in ins:
    line = line.rstrip('\n')
    lineParts = line.split()
    if lineParts[0] in highExpr:
        print lineParts[1], "\t", lineParts[10]
        highExprTotal = highExprTotal + float(lineParts[10])
        highExprCount = highExprCount + 1
    else:
        lowExprTotal = lowExprTotal + float(lineParts[10])
        lowExprCount = lowExprCount + 1
ins.close()
print ""
print "Avg PeakScore high expression genes:", float(highExprTotal) / highExprCount
print "Avg PeakScore low expression genes:", float(lowExprTotal) / lowExprCount
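The same two-pass idea (first build a collection of IDs from one source, then use it to filter a second source) can be tried without any files. The sketch below uses small made-up lists of tuples in place of gene_expr.txt and init_sites.txt, and stores the high-expression IDs in a set rather than a list, which is an optional change that makes the "in" check faster. All IDs and values are invented for illustration.

```python
# made-up (gene ID, expression) pairs standing in for gene_expr.txt
exprRows = [("g1", 75.0), ("g2", 12.5), ("g3", 50.0)]

# pass 1: collect the IDs of high-expression genes
highExpr = set()
for geneID, expr in exprRows:
    if expr >= 50:
        highExpr.add(geneID)

# made-up (gene ID, PeakScore) pairs standing in for init_sites.txt
siteRows = [("g1", 4.0), ("g2", 3.0), ("g3", 5.0), ("g4", 6.0)]

# pass 2: sort each score into the high or low group
highScores = []
lowScores = []
for geneID, score in siteRows:
    if geneID in highExpr:
        highScores.append(score)
    else:
        lowScores.append(score)

highAvg = sum(highScores) / len(highScores)
lowAvg = sum(lowScores) / len(lowScores)
print(highAvg)  # 4.5
print(lowAvg)   # 4.5
```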
A common situation that arises in data analysis is that we have a list of data points and we would like to compare each data point to each other data point. Here, we will write a script that computes the "distance" between each pair of strings in a file and outputs a distance matrix. We will define "distance" between two strings as the number of mismatches between two strings when they are lined up, divided by their length.
First we'll use a toy dataset. We'll create a list as follows:
things = [1, 2, 5, 10, 25, 50]
We'll start off by doing a very simple type of pairwise comparison: taking the numerical difference between two numbers. To systematically do this for all possible pairs of numbers in our list, we can make a nested for loop:
In [74]:
things = [1, 2, 5, 10, 25, 50]
for i in range(len(things)):
    for j in range(len(things)):
        print abs(things[i] - things[j]) #absolute value of the difference
Try running this code yourself and observe the output. Everything prints out on its own line, which isn't what we want -- we'd usually prefer a matrix-type format. Try this slightly modified code:
In [75]:
things = [1, 2, 5, 10, 25, 50]
for i in range(len(things)):
    for j in range(len(things)):
        print abs(things[i] - things[j]), "\t",
    print ""
This gives us the matrix format we want. Make sure you understand how this code works. FYI, the "\t" is a tab character, and much like "\n", it is invisible once it's printed (it becomes a tab). The comma at the very end of the print statement suppresses the \n that print usually adds on to the end.
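The trailing-comma trick is specific to the Python 2 print statement. A different way to get the same matrix layout, sketched below, is to collect each row's values as strings and glue them together with "\t".join(), then print one finished row at a time. The names rows and cells are illustrative; the single-argument print(...) form is used so the snippet behaves the same in Python 2 and Python 3.

```python
things = [1, 2, 5, 10, 25, 50]

rows = []
for i in range(len(things)):
    cells = []
    for j in range(len(things)):
        cells.append(str(abs(things[i] - things[j])))  # join() needs strings
    rows.append("\t".join(cells))   # one tab-separated line per row

for row in rows:
    print(row)
```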
So now we know how to do an all-against-all comparison. But how do we compute the number of mismatches between strings? As long as the strings are the same length, we can do something simple like the following:
In [76]:
str1 = "Wilfred"
str2 = "Manfred"
diffs = 0
for k in range(len(str1)):
    if str1[k] != str2[k]: #compare the two strings at the same index
        diffs = diffs + 1
print "dist =", round(float(diffs) / len(str1), 2)
So this outputs the distance between the two strings, where the distance is defined as the fraction of the sequence length that is mismatched.
Using these two pieces of code as starting points, complete the following:
(A) (2 pt) Create a list of a few short strings of the same length. For example:
things = ["bear", "pear", "boar", "tops", "bops"]
Write code that prints a distance matrix for this list. As in the last example, use the fraction of mismatches between a given pair of words as the measure of their "distance" from each other. Round the distances to 2 decimals.
In [77]:
things = ["bear", "pear", "boar", "tops", "bops"]
for i in range(len(things)):
    for j in range(len(things)):
        str1 = things[i]
        str2 = things[j]
        diffs = 0
        for k in range(len(str1)):
            if str1[k] != str2[k]:
                diffs += 1
        print round(float(diffs)/len(str1),2), "\t",
    print ""
(B) (2 pt) Now, instead of using a hard coded list like you did in (A), create a list of DNA sequences by reading in the file sequences2.txt. Compute the distance matrix between these sequences and print the distance matrix. Looking at this matrix, do you see a pair of sequences that are much less "distant" from each other than all the rest?
In [78]:
# create list of sequences
things = []
ins = open("sequences2.txt", 'r')
for line in ins:
    line = line.rstrip('\n')
    things.append(line)
ins.close()
# create distance matrix
for i in range(len(things)):
    for j in range(len(things)):
        str1 = things[i]
        str2 = things[j]
        diffs = 0
        for k in range(len(str1)):
            if str1[k] != str2[k]:
                diffs += 1
        print round(float(diffs)/len(str1),2), "\t",
    print ""
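Since the mismatch-counting code is identical in both answers above, one optional cleanup is to move it into a function. The sketch below is one way that could look; mismatch_fraction is a made-up name, and it assumes (as the problem does) that the two strings are the same length. It uses zip() to walk through both strings in step instead of indexing with k.

```python
def mismatch_fraction(str1, str2):
    """Fraction of positions at which two equal-length strings differ."""
    diffs = 0
    for char1, char2 in zip(str1, str2):   # pairs up characters at the same index
        if char1 != char2:
            diffs += 1
    return round(float(diffs) / len(str1), 2)

print(mismatch_fraction("Wilfred", "Manfred"))  # 0.43
print(mismatch_fraction("bear", "pear"))        # 0.25
```

Inside the nested loops, the body would then shrink to a single call such as mismatch_fraction(things[i], things[j]).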
(A) Following from problem 7 above: As you can see, the distance matrix you made in (B) is symmetrical around the diagonal. This means dist(i,j) is the same as dist(j,i), so we're doing some redundant calculations.
Change the code so that we don't do any unnecessary calculations (including comparing a sequence to itself, which is always 0). For any calculations you skip, you can print "-" or some other placeholder to keep the printed matrix looking neat.
Hint: There's a really simple way to do this! Think about the range of the second loop...
In [79]:
### METHOD #1
print ""
print "METHOD 1: lower triangle"
print ""
things = []
ins = open("sequences2.txt", 'r')
for line in ins:
    line = line.rstrip('\n')
    things.append(line)
ins.close()
for i in range(len(things)):
    for j in range(0,i): #<-- this is the only change we need to make
        str1 = things[i]
        str2 = things[j]
        diffs = 0
        for k in range(len(str1)):
            if str1[k] != str2[k]:
                diffs += 1
        print round(float(diffs)/len(str1),2), "\t",
    print ""
### METHOD #2
# note that the way it prints will look weird! see below for another example that fixes this
print ""
print "METHOD 2: upper triangle, messed up format"
print ""
things = []
ins = open("sequences2.txt", 'r')
for line in ins:
    line = line.rstrip('\n')
    things.append(line)
ins.close()
for i in range(len(things)):
    for j in range(i+1,len(things)): #<--
        str1 = things[i]
        str2 = things[j]
        diffs = 0
        for k in range(len(str1)):
            if str1[k] != str2[k]:
                diffs += 1
        print round(float(diffs)/len(str1),2), "\t",
    print ""
### METHOD #3
# similar to above, but fixed output formatting to be more easily interpreted
print ""
print "METHOD 3: upper triangle with formatting"
print ""
things = []
ins = open("sequences2.txt", 'r')
for line in ins:
    line = line.rstrip('\n')
    things.append(line)
ins.close()
for i in range(len(things)):
    for j in range(len(things)):
        if j > i:
            str1 = things[i]
            str2 = things[j]
            diffs = 0
            for k in range(len(str1)):
                if str1[k] != str2[k]:
                    diffs += 1
            print round(float(diffs)/len(str1),2), "\t",
        else:
            print " - \t",
    print ""
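If you only need each unordered pair once and don't care about the matrix layout, the standard library's itertools.combinations() will generate the pairs for you, which does the same job as the range(i+1, len(things)) trick. The sketch below uses the toy word list from problem 7 rather than sequences2.txt, and prints one pair per line instead of a matrix; pairs and dist are illustrative names.

```python
import itertools

things = ["bear", "pear", "boar", "tops", "bops"]

pairs = []
# combinations(things, 2) yields every pair exactly once, never pairing an item with itself
for str1, str2 in itertools.combinations(things, 2):
    diffs = 0
    for k in range(len(str1)):
        if str1[k] != str2[k]:
            diffs += 1
    dist = round(float(diffs) / len(str1), 2)
    pairs.append((str1, str2, dist))
    print(str1 + "\t" + str2 + "\t" + str(dist))
```

With 5 words this does 10 comparisons instead of the 25 done by the full nested loop.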
(B) Below is a loop that creates a list. Do the same thing but with a list comprehension instead.
In [80]:
# Loop version:
import random
randomNums = []
for i in range(100):
    randomNums.append(random.randint(0,10))
print randomNums
In [81]:
# Your list comprehension version here:
randomNums = [random.randint(0,10) for i in range(100)]
print randomNums
(C) Below is a loop that creates a list. Do the same thing but with a list comprehension instead.
In [82]:
# Loop version:
import random
randomNums = []
for i in range(100):
    if (i % 2) == 0:
        randNum = random.randint(0,10)
        randNumStr = str(randNum)
        randomNums.append(randNumStr)
print randomNums
Note: I changed this slightly from how it was originally.
In [83]:
# Your list comprehension version here:
randomNums = [str(random.randint(0,10)) for i in range(100) if i % 2 == 0]
print randomNums