Earning points (optional)
Email your completed .ipynb file to me (sarahmid@mail.med.upenn.edu) before 9:00 am on 9/23.
Name:
In [ ]:
# run this cell first!
fruits = {"apple":"red", "banana":"yellow", "grape":"purple"}
In [ ]:
print fruits["banana"]
Your guess:
In [ ]:
query = "apple"
print fruits[query]
Your guess:
In [ ]:
print fruits[0]
Your guess:
In [ ]:
print fruits.keys()
Your guess:
In [ ]:
print fruits.values()
Your guess:
In [ ]:
for key in fruits:
    print fruits[key]
Your guess:
In [ ]:
del fruits["banana"]
print fruits
Your guess:
In [ ]:
print fruits["pear"]
Your guess:
In [ ]:
fruits["pear"] = "green"
print fruits["pear"]
Your guess:
In [ ]:
fruits["apple"] = fruits["apple"] + " or green"
print fruits["apple"]
Your guess:
(A) Create a dictionary based on the data above, where each person's name is a key, and their favorite foods are the values.
In [ ]:
(B) Using a for loop, go through the dictionary you created above and print each name and food combination in the format:
<NAME>'s favorite food is <FOOD>
In [ ]:
(C) (1) Change the dictionary so that Wilfred's favorite food is Pizza. (2) Add a new entry for Mitsworth, whose favorite food is Tuna.
Do not recreate the whole dictionary while doing these things. Edit the dictionary you created in (A) using the syntax described in the lecture.
In [ ]:
(D) Prompt the user to input a name. Check if the name they entered is a valid key in the dictionary using an if statement. If the name is in the dictionary, print out the corresponding favorite food. If not, print a message saying "That name is not in our database".
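As a rough illustration of the key check described in (D), the sketch below tests membership with the in operator before looking anything up. The pets dictionary and the lookup function are made-up names for this example, not part of the lab data, and print is written with parentheses so the code runs on both Python 2 and Python 3.

```python
# Hypothetical data for illustration only; not the lab's table.
pets = {"Rex": "dog", "Tom": "cat"}

def lookup(name, table):
    """Return the value stored under name, or a fallback message if name is not a key."""
    if name in table:  # 'in' checks the dictionary's keys, not its values
        return table[name]
    return "That name is not in our database"

print(lookup("Rex", pets))  # dog
print(lookup("Bob", pets))  # That name is not in our database
```

Checking with in first avoids the KeyError that a direct lookup of a missing key would raise.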
In [ ]:
(E) Print just the names in the dictionary in alphabetical order. Use the sorting example from the slides.
In [ ]:
(F) Print just the names in sorted order based on their favorite food. Use the value-sorting example from the slides.
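The slides are not reproduced here, so the following is only one possible way to do the two kinds of sorting asked for in (E) and (F), and it may differ from the slide examples. It assumes a made-up colors dictionary: sorted() on a dictionary returns its keys in order, and passing the dictionary's get method as the key argument orders the keys by their values instead.

```python
# Hypothetical dictionary for illustration only.
colors = {"cherry": "red", "lime": "green", "plum": "purple", "kiwi": "brown"}

# Keys in alphabetical order.
names_alpha = sorted(colors)

# Keys ordered by their values: each key is ranked by colors.get(key).
names_by_value = sorted(colors, key=colors.get)

print(names_alpha)     # ['cherry', 'kiwi', 'lime', 'plum']
print(names_by_value)  # ['kiwi', 'lime', 'plum', 'cherry']
```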
In [ ]:
(A) Write code that prints "Hello, world" to a file called hello.txt
In [ ]:
(B) Write code that prints the following text to a file called meow.txt. It must be formatted exactly as it appears here (you will need to use \n and \t):
Dear Mitsworth,
Meow, meow meow meow.
Sincerely,
A friend
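For reference, here is a small sketch of how \n and \t behave when written to a file. It uses unrelated example text and a hypothetical file name (escape_demo.txt, placed in the system's temporary directory), so it is not the answer to (B), just a demonstration of the escape characters.

```python
import os
import tempfile

# \n ends a line; \t inserts a tab at the start of the second line.
text = "Line one\n\tIndented line two\nLine three\n"

# Hypothetical output location, chosen so the demo does not clutter the lab folder.
path = os.path.join(tempfile.gettempdir(), "escape_demo.txt")

with open(path, "w") as outfile:
    outfile.write(text)  # write() adds nothing on its own; newlines must be in the string

with open(path) as infile:
    contents = infile.read()

print(contents)
```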
In [ ]:
(C) Write code that reads in the gene IDs from genes.txt and prints the unique gene IDs to a new file called genes_unique.txt. (You can re-use your code or the answer sheet from lab4 for getting the unique IDs.)
In [ ]:
(A) Write code that reads a file of sequences and tallies how many sequences there are of each length. Use sequences3.txt as input.
Hint: you can use a dictionary to keep track of all the tallies. For example:
In [ ]:
# hint code:
tallyDict = {}
seq = "ATGCTGATCGATATA"
length = len(seq)
if length not in tallyDict:
    tallyDict[length] = 1  # initialize to 1 if this is the first occurrence of the length...
else:
    tallyDict[length] = tallyDict[length] + 1  # ...otherwise just increment the count.
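The hint above handles a single sequence. As a sketch of how the same idea extends to many sequences, the loop below tallies a short made-up list (seqs is hypothetical, standing in for lines read from a file). It uses dict.get with a default of 0, which is an alternative to the if/else in the hint rather than the method the lab requires.

```python
# Hypothetical sequences for illustration only.
seqs = ["ATG", "ATGC", "TTA", "GGCCA", "CCG"]

tally = {}
for seq in seqs:
    length = len(seq)
    # get() returns the current count, or 0 if this length has not been seen yet.
    tally[length] = tally.get(length, 0) + 1

print(tally)
```

Here three sequences have length 3 and one each has length 4 and 5, so the tally maps 3 to 3, 4 to 1, and 5 to 1.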
In [ ]:
(B) Using the tally dictionary you created above, figure out which sequence length was the most common, and print it to the screen.
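One way to find the key with the largest value, sketched on a small made-up tally (the numbers are illustrative, not results from sequences3.txt). The loop keeps track of the best count seen so far; the built-in max with key=tally.get is shown as a shorter alternative.

```python
# Hypothetical tally for illustration only: length -> number of sequences.
tally = {3: 3, 4: 1, 5: 1}

best_length = None
best_count = 0
for length in tally:
    if tally[length] > best_count:  # strictly greater, so the first length seen wins a tie
        best_length = length
        best_count = tally[length]

print(best_length)  # 3

# Shorter alternative: ask max() to compare keys by their counts.
print(max(tally, key=tally.get))  # 3
```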
In [ ]:
For this question, use codon_table.txt, which contains a list of all possible codons and their corresponding amino acids. We will be using this info to create a dictionary, which will allow us to translate a nucleotide sequence into amino acids. Each part of this question builds off the previous parts.
(A) Thinkin' question (short answer, not code): If we want to create a codon dictionary and use it to translate nucleotide sequences, would it be better to use the codons or amino acids as keys?
Your answer:
(B) Read in codon_table.txt (note that it has a header line) and use it to create a codon dictionary. Then use raw_input() to prompt the user to enter a single codon (e.g. ATG) and print the amino acid corresponding to that codon to the screen.
In [ ]:
(C) Now we will adapt the code in (B) to translate a longer sequence. Instead of prompting the user for a single codon, allow them to enter a longer sequence. First, check that the sequence they entered has a length that is a multiple of 3 (Hint: use the mod operator, %), and print an error message if it is not. If it is valid, then go on to translate every three nucleotides to an amino acid. Print the final amino acid sequence to the screen.
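The two steps in (C), the length check and the walk through the sequence three letters at a time, can be sketched as follows. The mini_table dictionary is a hypothetical three-codon stand-in for the dictionary built from codon_table.txt, and writing the stop codon as "*" is an assumption about how that file labels it. The function name translate is also just for this example.

```python
# Hypothetical three-entry table; the real one would be built from codon_table.txt.
mini_table = {"ATG": "M", "GCC": "A", "TAA": "*"}

def translate(seq, table):
    """Translate seq codon by codon, or return None if its length is not a multiple of 3."""
    if len(seq) % 3 != 0:  # a nonzero remainder means an incomplete final codon
        return None
    protein = ""
    for i in range(0, len(seq), 3):  # i takes the values 0, 3, 6, ...
        codon = seq[i:i + 3]
        protein = protein + table[codon]
    return protein

print(translate("ATGGCCTAA", mini_table))  # MA*
print(translate("ATGGC", mini_table))      # None
```

This sketch assumes uppercase input and that every codon is present in the table; a codon missing from the table would raise a KeyError.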
In [ ]:
(D) Now, instead of taking user input, you will apply your translator to a set of sequences stored in a file. Read in the sequences from sequences3.txt (assume each line is a separate sequence), translate each one to amino acids, and print the results to a new file called proteins.txt.
In [ ]:
Write code that reads sequences from a fasta file and stores them in a dictionary according to their header (i.e. use the header line as the key and sequence as the value). You will use horrible.fasta to test your code.
If you are not familiar with fasta files, they have the following general format:
>geneName1
ATCGCTAGTCGATCGATGGTTTCGCGTAGCGTTGCTAGCGTAGCTGATG
TCGATCGATGGTTTCGCGTAGCGTTGCTAGCGTAGCTGATGATGCTCAA
GCTGGATGGCTAGCTGATGCTAG
>geneName2
ATCGATGGGCTGGATCGATGCGGCTCGGCGATCGA
...
There are many slight variations; for example the header often contains different information, depending where you got the file from, and the sequence for a given entry may span any number of lines. To write a good fasta parser, you must make as few assumptions about the formatting as possible. This will make your code more "robust".
For fasta files, pretty much the only things you can safely assume are that a new entry will be marked by the > sign, which is immediately followed by a (usually) unique header, and all sequence belonging to that entry will be located immediately below. However, you can't assume how many lines the sequence will take up.
With this in mind, write a robust fasta parser that reads in horrible.fasta and stores each sequence in a dictionary according to its header line. Call the dictionary seqDict. Remove any newline characters. Don't include the > sign in the header. Hint: use string slicing or .lstrip().
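To make the parsing logic concrete, here is a minimal sketch that works on a made-up list of lines rather than on horrible.fasta. The names lines, parse_fasta, and demo are hypothetical, and this is one possible approach under the assumptions stated above (entries start with >, sequence may span any number of lines), not the official solution.

```python
# Hypothetical input for illustration only; note the ">" inside the second header.
lines = [
    ">gene1 some description\n",
    "ATCG\n",
    "GGTA\n",
    ">gene>2\n",
    "TTAA\n",
]

def parse_fasta(lines):
    """Build a dictionary mapping each header (without the leading '>') to its full sequence."""
    seqs = {}
    header = None
    for line in lines:
        line = line.rstrip("\r\n")  # drop the newline at the end of the line
        if line.startswith(">"):
            header = line[1:]       # slicing removes only the first character
            seqs[header] = ""
        elif header is not None and line != "":
            seqs[header] = seqs[header] + line  # append this line to the current entry
    return seqs

demo = parse_fasta(lines)
print(demo["gene1 some description"])  # ATCGGGTA
print(demo["gene>2"])                  # TTAA
```

Because only the leading > is removed, a > that appears later in a header is preserved. If two entries shared the same header, the later one would overwrite the earlier one in this sketch.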
In [ ]:
After you've written your code above and you think it works, run it and then run the following code to spot-check whether you did everything correctly. If you didn't name your dictionary seqDict, you'll need to change it below to whatever you named your dictionary.
In [ ]:
error = False
if ">varlen2_uc001pmn.3_3476" in seqDict:
    print "Remove > chars from headers!"
    error = True
elif "varlen2_uc001pmn.3_3476" not in seqDict:
    print "Something's wrong with your dictionary: missing keys"
    error = True
if "varlen2_uc021qfk.1>2_1472" not in seqDict:
    print "Only remove the > chars from the beginning of the header!"
    error = True
if len(seqDict["varlen2_uc009wph.3_423"]) > 85:
    if "\n" in seqDict["varlen2_uc009wph.3_423"]:
        print "Remove newline chars from sequences"
        error = True
    else:
        print "Length of sequences longer than expected for some reason"
        error = True
elif len(seqDict["varlen2_uc009wph.3_423"]) < 85:
    print "Length of sequences shorter than expected for some reason"
    error = True
if not error:
    print "Congrats, you passed all my tests!"