In [1]:
# run this cell first!
fruits = {"apple":"red", "banana":"yellow", "grape":"purple"}
In [2]:
print fruits["banana"]
In [3]:
query = "apple"
print fruits[query]
In [4]:
print fruits[0]
There's no concept of "first element" in a dictionary, since it's unordered. (Of course, if you happened to have a key in your dictionary that was literally 0, this code would work.)
In [5]:
print fruits.keys()
In [6]:
print fruits.values()
In [7]:
for key in fruits:
print fruits[key]
In [8]:
del fruits["banana"]
print fruits
In [9]:
print fruits["pear"]
In [10]:
fruits["pear"] = "green"
print fruits["pear"]
In [11]:
fruits["apple"] = fruits["apple"] + " or green"
print fruits["apple"]
Since fruits["apple"] contains a string, we can concatenate to that string just like we would normally. And if instead the value was a number, we could add to it. Basically, you can treat the "dict[key]" as a variable that holds whatever the value is.
(A) Create a dictionary based on the data above, where each person's name is a key, and their favorite foods are the values.
In [15]:
favFoods = {"Wilfred":"Steak", "Manfred":"French fries", "Wadsworth":"Spaghetti", "Jeeves":"Ice cream"}
print favFoods
(B) Using a for
loop, go through the dictionary you created above and print each name and food combination in the format:
<NAME>'s favorite food is <FOOD>
In [16]:
for name in favFoods:
print name + "'s favorite food is " + favFoods[name]
(C) (1) Change the dictionary so that Wilfred's favorite food is Pizza. (2) Add a new entry for Mitsworth, whose favorite food is Tuna.
Do not recreate the whole dictionary while doing these things. Edit the dictionary you created in (A) using the syntax described in the lecture.
In [17]:
favFoods["Wilfred"] = "Pizza"
favFoods["Mitsworth"] = "Tuna"
print favFoods
(D) Prompt the user to input a name. Check if the name they entered is a valid key in the dictionary using an if
statement. If the name is in the dictionary, print out the corresponding favorite food. If not, print a message saying "That name is not in our database".
In [18]:
name = raw_input("Enter a name: ")
if name in favFoods:
print favFoods[name]
else:
print "That name is not in our database"
(E) Print just the names in the dictionary in alphabetical order. Use the sorting example from the slides.
In [19]:
for name in sorted(favFoods):
print name
(F) Print just the names in sorted order based on their favorite food. Use the value-sorting example from the slides.
In [20]:
for name in sorted(favFoods, key=favFoods.get):
print name
(A) Write code that prints "Hello, world" to a file called hello.txt
In [21]:
outs = open("hello.txt", 'w')
outs.write("Hello, world")
outs.close()
(B) Write code that prints the following text to a file called meow.txt
. It must be formatted exactly as it here (you will need to use \n and \t):
Dear Mitsworth,
Meow, meow meow meow.
Sincerely,
A friend
In [22]:
outs = open("meow.txt", 'w')
outs.write("Dear Mitsworth,\n\n\tMeow, meow meow meow.\n\nSincerely,\nA friend")
outs.close()
(C) Write code that reads in the gene IDs from genes.txt
and prints the unique gene IDs to a new file called genes_unique.txt
. (You can re-use your code or the answer sheet from lab4 for getting the unique IDs.)
In [23]:
inFile = "genes.txt"
outFile = "genes_unique.txt"
unique = []
ins = open(inFile, 'r')
outs = open(outFile, 'w')
for line in ins:
line = line.rstrip('\n')
if line not in unique:
outs.write(line + "\n")
unique.append(line)
outs.close()
ins.close()
(A) Write code that reads a file of sequences and tallies how many sequences there are of each length. Use sequences3.txt
as input.
Hint: you can use a dictionary to keep track of all the tallies. For example:
In [ ]:
# hint code:
tallyDict = {}
seq = "ATGCTGATCGATATA"
length = len(seq)
if length not in tallyDict:
tallyDict[length] = 1 #initialize to 1 if this is the first occurrence of the length...
else:
tallyDict[length] = tallyDict[length] + 1 #...otherwise just increment the count.
In [24]:
inFile = "sequences3.txt"
lenDict = {}
ins = open(inFile, 'r')
for line in ins:
line = line.rstrip('\n') #important to do here, since '\n' counts as a character and thus increases the length of the sequence by 1
seqLen = len(line)
if seqLen not in lenDict:
lenDict[seqLen] = 1
else:
lenDict[seqLen] += 1
ins.close()
(B) Using the tally dictionary you created above, figure out which sequence length was the most common, and print it to the screen.
In [25]:
# print all tallies for informational purposes
for name in sorted(lenDict.keys()):
print name, ":", lenDict[name]
# now to get the max length, we can sort the keys by value
sortedLens = sorted(lenDict, key=lenDict.get) #this returns a list of the keys sorted by value
# this sorts from smallest to largest, so we take the last element
mostCommon = sortedLens[-1]
print "Most common length is", mostCommon, "nts"
For this question, use codon_table.txt
, which contains a list of all possible codons and their corresponding amino acids. We will be using this info to create a dictionary, which will allow us to translate a nucleotide sequence into amino acids. Each part of this question builds off the previous parts.
(A) Thinkin' question (short answer, not code): If we want to create a codon dictionary and use it to translate nucleotide sequences, would it be better to use the codons or amino acids as keys?
Codons should be the keys, since we are starting with nucleotide sequence. That is the info we'll want to use to retrieve the amino acid.
If you were instead going from protein --> nucleotide, then you'd want the amino acids to be the keys. (But that can be problematic because a single amino acid usually maps to more than one codon!)
(B) Read in codon_table.txt
(note that it has a header line) and use it to create a codon dictionary. Then use raw_input()
prompt the user to enter a single codon (e.g. ATG) and print the amino acid corresponding to that codon to the screen.
In [26]:
inFile = "codon_table.txt"
codon2aa = {}
ins = open(inFile, 'r')
ins.readline() #skip header
for line in ins:
line = line.rstrip('\n')
lineParts = line.split()
codon = lineParts[0]
aa = lineParts[1]
if codon not in codon2aa:
codon2aa[codon] = aa
else:
print "Warning! Multiple entries found for the same codon (" + codon + "). Skipping."
ins.close()
request = raw_input("Codon to translate: ").upper() #covert to uppercase, just in case
if request in codon2aa:
print request, "-->", codon2aa[request]
else:
print "Did not recognize that codon."
(C) Now we will adapt the code in (B) to translate a longer sequence. Instead of prompting the user for a single codon, allow them to enter a longer sequence. First, check that the sequence they entered has a length that is a multiple of 3 (Hint: use the mod operator, %), and print an error message if it is not. If it is valid, then go on to translate every three nucleotides to an amino acid. Print the final amino acid sequence to the screen.
In [27]:
inFile = "codon_table.txt"
codon2aa = {}
ins = open(inFile, 'r')
ins.readline() #skip header
for line in ins:
line = line.rstrip('\n')
lineParts = line.split()
codon = lineParts[0]
aa = lineParts[1]
if codon not in codon2aa:
codon2aa[codon] = aa
else:
print "Warning! Multiple entries found for the same codon (" + codon + "). Skipping."
ins.close()
# get user input
request = raw_input("Sequence to translate (multiple of 3): ").upper()
protSeq = ""
if (len(request) % 3) == 0:
# this indexing/slicing is tricky! definitely test this sort of thing to make sure you get it right.
for i in range(0,len(request),3):
codon = request[i:i+3]
if codon in codon2aa:
protSeq = protSeq + codon2aa[codon]
else:
print "Warning! Unrecognized codon ("+codon+"). Using X as a place holder."
protSeq = protSeq + "X"
print "Your protein sequence is:", protSeq
else:
print "Please enter a sequence length that is a multiple of 3."
(D) Now, instead of taking user input, you will apply your translator to a set of sequences stored in a file. Read in the sequences from sequences3.txt
(assume each line is a separate sequence), translate it to amino acids, and print it to a new file called proteins.txt
.
In [28]:
inFile = "codon_table.txt"
codon2aa = {}
ins = open(inFile, 'r')
ins.readline() #skip header
for line in ins:
line = line.rstrip('\n')
lineParts = line.split()
codon = lineParts[0]
aa = lineParts[1]
if codon not in codon2aa:
codon2aa[codon] = aa
else:
print "Warning! Multiple entries found for the same codon (" + codon + "). Skipping."
ins.close()
# read file of sequences
inFile = "sequences3.txt"
outFile = "proteins.txt"
ins = open(inFile, 'r')
outs = open(outFile, 'w')
for line in ins:
line = line.rstrip('\n')
protSeq = "" #best to define this with the loop so it's re-created for each separate sequence.
if (len(line) % 3) == 0:
for i in range(0,len(line),3):
codon = line[i:i+3]
if codon in codon2aa:
protSeq += codon2aa[codon]
else:
print "Warning! Unrecognized codon ("+codon+"). Using X as a place holder."
protSeq += "X"
outs.write(protSeq + '\n') # write to output file
else:
print "Encountered sequence length that is not a multiple of 3. Skipping it. Sequence:", line
outs.close()
ins.close()
Write code that reads sequences from a fasta file and stores them in a dictionary according to their header (i.e. use the header line as the key and sequence as the value). You will use horrible.fasta
to test your code.
If you are not familiar with fasta files, they have the following general format:
>geneName1
ATCGCTAGTCGATCGATGGTTTCGCGTAGCGTTGCTAGCGTAGCTGATG
TCGATCGATGGTTTCGCGTAGCGTTGCTAGCGTAGCTGATGATGCTCAA
GCTGGATGGCTAGCTGATGCTAG
>geneName2
ATCGATGGGCTGGATCGATGCGGCTCGGCGATCGA
...
There are many slight variations; for example the header often contains different information, depending where you got the file from, and the sequence for a given entry may span any number of lines. To write a good fasta parser, you must make as few assumptions about the formatting as possible. This will make your code more "robust".
For fasta files, pretty much the only things you can safely assume are that a new entry will be marked by the >
sign, which is immediately followed by a (usually) unique header, and all sequence belonging to that entry will be located immediately below. However, you can't assume how many lines the sequence will take up.
With this in mind, write a robust fasta parser that reads in horrible.fasta
and stores each sequence in a dictionary according to its header line. Call the dictionary seqDict
. Remove any newline characters. Don't include the >
sign in the header. Hint: use string slicing or .lstrip()
In [29]:
ins = open("horrible.fasta", 'r')
seqDict = {}
activeID = ""
for line in ins:
line = line.rstrip('\n')
# if the first character is >, this line is a header.
if line[0] == ">":
activeID = line[1:]
if activeID in seqDict:
print ">>> Warning: repeat id:", activeID, "-- overwriting previous ID."
seqDict[activeID] = ""
# otherwise, this is a sequence line--add it to the activeID entry
else:
seqDict[activeID] = seqDict[activeID] + line
ins.close()
After you've written your code above and you think it works, run it and then run the following code to to spot-check whether you did everything correctly. If you didn't name your dictionary seqDict
, you'll need to change it below to whatever you named your dictionary.
In [30]:
error = False
if ">varlen2_uc001pmn.3_3476" in seqDict:
print "Remove > chars from headers!"
error = True
elif "varlen2_uc001pmn.3_3476" not in seqDict:
print "Something's wrong with your dictionary: missing keys"
error = True
if "varlen2_uc021qfk.1>2_1472" not in seqDict:
print "Only remove the > chars from the beginning of the header!"
error = True
if len(seqDict["varlen2_uc009wph.3_423"]) > 85:
if "\n" in seqDict["varlen2_uc009wph.3_423"]:
print "Remove newline chars from sequences"
error = True
else:
print "Length of sequences longer than expected for some reason"
error = True
elif len(seqDict["varlen2_uc009wph.3_423"]) < 85:
print "Length of sequences shorter than expected for some reason"
error = True
if error == False:
print "Congrats, you passed all my tests!"