Programming Bootcamp 2016

Lesson 5 Exercises


Earning points (optional)

  • Enter your name below.
  • Email your .ipynb file to me (sarahmid@mail.med.upenn.edu) before 9:00 am on 9/23.
  • You do not need to complete all the problems to get points.
  • I will give partial credit for effort when possible.
  • At the end of the course, everyone who gets at least 90% of the total points will get a prize (bootcamp mug!).

Name:


1. Guess the output: dictionary practice (1pt)

For the following blocks of code, first try to guess what the output will be, and then run the code yourself. Points will be given for filling in the guesses; guessing wrong won't be penalized.


In [ ]:
# run this cell first!
fruits = {"apple":"red", "banana":"yellow", "grape":"purple"}

In [ ]:
print fruits["banana"]

Your guess:


In [ ]:
query = "apple"
print fruits[query]

Your guess:


In [ ]:
print fruits[0]

Your guess:


In [ ]:
print fruits.keys()

Your guess:


In [ ]:
print fruits.values()

Your guess:


In [ ]:
for key in fruits:
    print fruits[key]

Your guess:


In [ ]:
del fruits["banana"]
print fruits

Your guess:


In [ ]:
print fruits["pear"]

Your guess:


In [ ]:
fruits["pear"] = "green"
print fruits["pear"]

Your guess:


In [ ]:
fruits["apple"] = fruits["apple"] + " or green"
print fruits["apple"]

Your guess:


2. On your own: using dictionaries (6pts)

Using the info in the table below, write code to accomplish the following tasks.

Name        Favorite Food
Wilfred     Steak
Manfred     French fries
Wadsworth   Spaghetti
Jeeves      Ice cream

(A) Create a dictionary based on the data above, where each person's name is a key and their favorite food is the corresponding value.


In [ ]:

(B) Using a for loop, go through the dictionary you created above and print each name and food combination in the format:

<NAME>'s favorite food is <FOOD>

In [ ]:

(C) (1) Change the dictionary so that Wilfred's favorite food is Pizza. (2) Add a new entry for Mitsworth, whose favorite food is Tuna.

Do not recreate the whole dictionary while doing these things. Edit the dictionary you created in (A) using the syntax described in the lecture.


In [ ]:

(D) Prompt the user to input a name. Check if the name they entered is a valid key in the dictionary using an if statement. If the name is in the dictionary, print out the corresponding favorite food. If not, print a message saying "That name is not in our database".
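If the membership check is unfamiliar, here is a minimal sketch of the idea on made-up data. The names `colors` and `query` are hypothetical and not part of the exercise, and a hard-coded string stands in for the user's input. The single-argument print(...) form is used so the snippet should run in both Python 2 and Python 3.

```python
# Hypothetical data, unrelated to the table in this exercise.
colors = {"sky": "blue", "grass": "green"}

query = "sky"  # stands in for whatever the user typed

# "in" checks the dictionary's keys, not its values
if query in colors:
    message = colors[query]
else:
    message = "not found"

print(message)
```

Note that `"blue" in colors` would be False here, because "blue" is a value rather than a key.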


In [ ]:

(E) Print just the names in the dictionary in alphabetical order. Use the sorting example from the slides.


In [ ]:

(F) Print just the names in sorted order based on their favorite food. Use the value-sorting example from the slides.
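The slides are not reproduced here, so the following is only one common way to do both kinds of sorting, shown on a made-up dictionary (`ages` is a hypothetical name). The approach in the slides may look different; treat this as a sketch rather than the expected answer.

```python
# Hypothetical data: names as keys, ages as values.
ages = {"Rex": 3, "Ada": 7, "Bo": 5}

# sorted() on a dictionary returns its keys in sorted order
names_alphabetical = sorted(ages)

# key=ages.get tells sorted() to compare the values instead of the keys
names_by_age = sorted(ages, key=ages.get)

print(names_alphabetical)
print(names_by_age)
```

In both cases the result is a list of keys; only the thing being compared changes.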


In [ ]:


3. File writing (3pts)

(A) Write code that prints "Hello, world" to a file called hello.txt


In [ ]:

(B) Write code that prints the following text to a file called meow.txt. It must be formatted exactly as it appears here (you will need to use \n and \t):

Dear Mitsworth,

    Meow, meow meow meow.

Sincerely,
A friend
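As a reminder of how \n and \t behave when written to a file, here is a small sketch using different text. The file name example.txt and the temporary folder are assumptions made only so the snippet can run anywhere; in your answer you would write straight to meow.txt.

```python
import os
import tempfile

# \t inserts a tab and \n ends the line; neither is added automatically by write()
text = "Item\tCount\napples\t3\n"

# Hypothetical output location: a throwaway folder, so nothing is overwritten
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w") as out:
    out.write(text)

# Read the file back to confirm what was written
with open(path) as infile:
    contents = infile.read()

print(contents)
```

Unlike print, write() does not add a newline at the end, so every line break has to be spelled out with \n.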

In [ ]:

(C) Write code that reads in the gene IDs from genes.txt and prints the unique gene IDs to a new file called genes_unique.txt. (You can re-use your code or the answer sheet from lab4 for getting the unique IDs.)


In [ ]:


4. The "many counters" problem (4pts)

(A) Write code that reads a file of sequences and tallies how many sequences there are of each length. Use sequences3.txt as input.

Hint: you can use a dictionary to keep track of all the tallies. For example:


In [ ]:
# hint code:

tallyDict = {}
seq = "ATGCTGATCGATATA"
length = len(seq)

if length not in tallyDict:
    tallyDict[length] = 1  #initialize to 1 if this is the first occurrence of the length...
else:
    tallyDict[length] = tallyDict[length] + 1   #...otherwise just increment the count.
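The hint above handles a single sequence. As a rough sketch of how the same pattern behaves inside a loop, here it is applied to a made-up list of words instead of a file (`words` and `tally` are hypothetical names, and reading sequences3.txt is left to you).

```python
# Hypothetical data standing in for lines read from a file.
words = ["cat", "horse", "dog", "zebra", "ox"]

tally = {}
for word in words:
    n = len(word)
    if n not in tally:
        tally[n] = 1             # first time this length is seen
    else:
        tally[n] = tally[n] + 1  # length seen before, so add one

print(tally)
```

Because the dictionary is created once, before the loop, the counts accumulate across all the items.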

In [ ]:

(B) Using the tally dictionary you created above, figure out which sequence length was the most common, and print it to the screen.
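One possible way to find the key with the largest value is to keep track of the best one seen so far while looping. The sketch below uses a made-up dictionary (`scores`, `best_key`, and `best_count` are hypothetical names) and is not the only valid approach.

```python
# Hypothetical data: the values are counts.
scores = {"red": 4, "blue": 9, "green": 2}

best_key = None
best_count = 0
for key in scores:
    if scores[key] > best_count:  # found a larger count than any seen so far
        best_key = key
        best_count = scores[key]

print(best_key)
```

The built-in call `max(scores, key=scores.get)` should give the same key in one line. If two keys tie for the largest count, which one you get depends on the order in which the dictionary is traversed.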


In [ ]:


5. Codon table (6pts)

For this question, use codon_table.txt, which contains a list of all possible codons and their corresponding amino acids. We will be using this info to create a dictionary, which will allow us to translate a nucleotide sequence into amino acids. Each part of this question builds off the previous parts.

(A) Thinkin' question (short answer, not code): If we want to create a codon dictionary and use it to translate nucleotide sequences, would it be better to use the codons or amino acids as keys?

Your answer:

(B) Read in codon_table.txt (note that it has a header line) and use it to create a codon dictionary. Then use raw_input() to prompt the user to enter a single codon (e.g. ATG) and print the amino acid corresponding to that codon to the screen.


In [ ]:

(C) Now we will adapt the code in (B) to translate a longer sequence. Instead of prompting the user for a single codon, allow them to enter a longer sequence. First, check that the sequence they entered has a length that is a multiple of 3 (Hint: use the mod operator, %), and print an error message if it is not. If it is valid, then go on to translate every three nucleotides to an amino acid. Print the final amino acid sequence to the screen.
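To illustrate the two building blocks mentioned here, the sketch below checks a length with % and then walks through a string three characters at a time. It uses a made-up string of letters rather than a nucleotide sequence, and the names `text` and `chunks` are hypothetical; looking each piece up in your codon dictionary is left to you.

```python
text = "ABCDEFGHI"  # hypothetical input, 9 characters long

chunks = []
if len(text) % 3 != 0:
    # the remainder is non-zero, so the length is not a multiple of 3
    print("Length is not a multiple of 3")
else:
    # range(start, stop, step) gives 0, 3, 6, ... so each slice is 3 characters
    for i in range(0, len(text), 3):
        chunks.append(text[i:i + 3])

print(chunks)
```

For comparison, `10 % 3` is 1, so a 10-character string would take the error branch.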


In [ ]:

(D) Now, instead of taking user input, you will apply your translator to a set of sequences stored in a file. Read in the sequences from sequences3.txt (assume each line is a separate sequence), translate each one to amino acids, and print the translated sequences to a new file called proteins.txt.


In [ ]:



Bonus question: Parsing fasta files (+2 bonus pts)

This question is optional, but if you complete it, I'll give you two bonus points. You won't lose points if you skip it.

Write code that reads sequences from a fasta file and stores them in a dictionary according to their header (i.e. use the header line as the key and sequence as the value). You will use horrible.fasta to test your code.

If you are not familiar with fasta files, they have the following general format:

>geneName1
ATCGCTAGTCGATCGATGGTTTCGCGTAGCGTTGCTAGCGTAGCTGATG
TCGATCGATGGTTTCGCGTAGCGTTGCTAGCGTAGCTGATGATGCTCAA
GCTGGATGGCTAGCTGATGCTAG
>geneName2
ATCGATGGGCTGGATCGATGCGGCTCGGCGATCGA
...

There are many slight variations; for example, the header often contains different information depending on where you got the file, and the sequence for a given entry may span any number of lines. To write a good fasta parser, you must make as few assumptions about the formatting as possible. This will make your code more "robust".

For fasta files, pretty much the only things you can safely assume are that a new entry will be marked by the > sign, which is immediately followed by a (usually) unique header, and all sequence belonging to that entry will be located immediately below. However, you can't assume how many lines the sequence will take up.

With this in mind, write a robust fasta parser that reads in horrible.fasta and stores each sequence in a dictionary according to its header line. Call the dictionary seqDict. Remove any newline characters. Don't include the > sign in the header. Hint: use string slicing or .lstrip()
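To make the hint concrete, here is a small comparison of three ways to clean up a header line. The header string is made up for illustration (it is not taken from horrible.fasta), and the variable names are hypothetical.

```python
line = ">gene>A_1\n"  # hypothetical header with a second > in the middle

cleaned = line.rstrip("\n")  # drop the trailing newline first

header_slice = cleaned[1:]               # skips exactly one leading character
header_lstrip = cleaned.lstrip(">")      # strips > only from the left end
header_replace = cleaned.replace(">", "")  # removes every >, including the middle one

print(header_slice)
print(header_lstrip)
print(header_replace)
```

Slicing and .lstrip(">") both keep the > in the middle of the header, while .replace() removes it and changes the name. One difference to be aware of: .lstrip(">") removes all consecutive > characters at the start, whereas slicing removes exactly one.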


In [ ]:

After you've written your code above and you think it works, run it and then run the following code to spot-check whether you did everything correctly. If you didn't name your dictionary seqDict, you'll need to change it below to whatever you named your dictionary.


In [ ]:
error = False
if ">varlen2_uc001pmn.3_3476" in seqDict:
    print "Remove > chars from headers!"
    error = True
elif "varlen2_uc001pmn.3_3476" not in seqDict:
    print "Something's wrong with your dictionary: missing keys"
    error = True
if "varlen2_uc021qfk.1>2_1472" not in seqDict:
    print "Only remove the > chars from the beginning of the header!"
    error = True
if len(seqDict["varlen2_uc009wph.3_423"]) > 85:
    if "\n" in seqDict["varlen2_uc009wph.3_423"]:
        print "Remove newline chars from sequences"
        error = True
    else:
        print "Length of sequences longer than expected for some reason"
        error = True
elif len(seqDict["varlen2_uc009wph.3_423"]) < 85:
    print "Length of sequences shorter than expected for some reason"
    error = True

if not error:
    print "Congrats, you passed all my tests!"