Day 17 In-class assignment: Shotgun Sequencing

Today we are going to think about how we would write a program to do shotgun sequencing.

Student Names

//Put the names of everybody in your group here!

Learning Goals

  • Understand genome sequencing
  • Get some practice working with functions, strings and loops

Work in pairs and Shotgun Assemble by hand

We are going to try to assemble some sequences from a famous book in class. Your instructor(s) will hand out strips of paper. See if you can assemble the original text from the strips of paper!

Question 1: How hard was this task to do by hand? Did you manage to get the entire text? (Also, please take a picture of what you have submit the picture to d2l with this notebook.)

// put your answer here

Question 2: Write some pseudocode or a flowchart showing the steps you would take to write your own code to assemble strips of paper. We strongly suggest that you describe this pseudocode as a bunch of high-level function calls with general descriptions of what your functions will look like.

// Include your pseudocode here or reference a picture that you will submit with your notebook to d2l.

Question 3: What functions do you think will be the most difficult to write? Why?

// put your answer here.

Before moving on, verify your answers to questions 1-3 with an instructor.

Code Review and Testing

In this section we are going to program a basic shotgun assembler. With your group, read though the program and try to understand what it is doing. Code review is an extreamly important skill in computational science - it's very common for people to inherit code from coworkers, or to find examples that they would like to use on the web. Dissecting the code helps you to learn, and is a great way to make yourself a better programmer. Some questions to ask yourself while looking at code include:

  1. Can you identify any clever techniques that you could use in your own coding practices?
  2. Can you identify any bad techniques that you could change to do better?
  3. What assumptions do you think the programmer is making about the model or problem? Are these assumptions well documented?
  4. Can you come up with nominally "valid" test cases that will break the code?

Let's write the first piece of an assembler. To start, consider the following example strings we will use as data. Can you see how these strings go together?


In [ ]:
start_string_list =  ['er_way__in_short_the_period_was_so_far_like_the_pr', \
                      '__in_short_the_period_was_so_far_like_the_present_', \
                      'he_present_period_that_some_of_its_noisiest_author', \
                      '_period_that_some_of_its_noisiest_authorities_insi']

Shotgun Assembly Program:

Let us assume we have inherited the following code from a previous student in our research lab. The student has most of the code written but did not get a chance to finish the slide_left_right function. The student let you know that everything is working and all you need to do to make a usable shotgun assembler is to finish slide_left_right. Run the program on the above string to see what it does.


In [ ]:
#define the internal slide function
# If the strings overlap, return the combined longer string and the length of the overlap
def slide_left_right(left,right):
    # remember to calculate the overlap here - THIS NEEDS TO BE FINISHED!
    overlap=10
    return([left[:-overlap]+right, overlap ])

In [ ]:
def string_overlap(s1, s2):
    #Slide string 1 over string 2 and calculate the overlap
    [s3_1, i1] = slide_left_right(s1,s2)
    
    #Slide string 2 over string 1 and calculate the overlap
    [s3_2, i2] = slide_left_right(s2,s1)
    
    #Check to see which overlap is biggest. 
    if(i1 > i2):
        return(s3_1)    
    if(i2 == 0):
        #Return empty string if the strings do not overlap
        return('')
    return(s3_2)

In [ ]:
#new shotgun function 
def shotgun(cuts):
    #use the first cut to initialize the string
    assembled_string = cuts[0]
    
    #Make a list of the remaining cuts
    unassembled_cuts = cuts[1:]
    #Counter to help us keep track if the list did not change
    previous_list_len = 0; 
    
    #Keep looping until we found all overlapping regions. 
    #i.e. there are still some unassembled_cuts and the list length is changing
    while(unassembled_cuts and previous_list_len != len(unassembled_cuts)):
        #reset list of unmatching cuts
        non_overlapping_cuts = []
        
        #Record the previous list length
        previous_list_len = len(unassembled_cuts)
        
        #Loop though unassembled cuts and add them to the assembled_string
        for cut in unassembled_cuts:
            s_tmp = string_overlap(assembled_string, cut)
            
            #If the temporary string is empty store the cut in the non_overlapping_cuts list
            if(s_tmp == ''):
                non_overlapping_cuts.append(cut)
            else:
                assembled_string = s_tmp;
        unassembled_cuts = non_overlapping_cuts.copy()
    return(assembled_string)

In [ ]:
#Run the shotgun assembly
print(shotgun(start_string_list))

Question 4: Discuss this code with your partner or group and describe what it does. What is the slide_left_right function supposed to do? What are the inputs and outputs to slide_left_right?

//Put your answer here.

The purpose of this function is to "slide" the two strings past each other, compare the overlapping segments, and return the string that corresponds to the maximum overlap between the two strings (i.e., stitch the strings together). The inputs are the two strings; the outputs are a single combined string and a number indicating the number of characters in those strings that overlap.

Question 5: Assume for now, that the student who wrote this code is correct and it will work. Help the student finish the program by writing the slide_right_left function.


In [ ]:
#copy, paste and modify the slide_left_right function from above
def slide_left_right(left,right):
    overlap = 0;
    for i in range(1,len(left)+1):
        if(left[-i:] == right[:i]):
            overlap = i;
    if overlap == 0:
        if right in left:
            return left, len(left)
    return left[:-overlap]+right, overlap

Unit Testing

It is difficult to write code and have it work the first time. Before trying to run your function with the rest of the code, it is a good idea to test just the single function. This is sometimes called "unit testing", and is a standard programming practice. For example, this code checks the slide_left_right function. Before you execute this code, see if you can predict what the output will be!


In [ ]:
# Before running the code see if you can predict what the output will be.
s3, i = slide_left_right(start_string_list[0],start_string_list[1])
print(s3)
print(i)

Did that work? Sometimes it is hard to tell. A better unit test is simple. Consider the following example, and before you run it predict what you think the output will be:


In [ ]:
#Before running the code see if you can predict what the output will be.
s3,i= slide_left_right('aaabbb','bbbbcccc')
print(s3)
print(i)

Using the much simpler imput strings helps you quickly see if the code is working.

Question 6: (Unit Testing, four parts). Now, let's write your own test cases. Since this is your first time, we will be very specfic about what tests you should write. However, you want to be able to write your own unit tests. Always write tests for odd, but valid, inputs. For example, write a unit test for each of the four cases below.

In each case shown below you should be able to just copy the test case from above and make some minor changes. The goal is to learn something different for each case. Write a test case for each of the following conditions:

1) Write a test case to see what happens if you pass slide_left_right two strings that are the same.


In [ ]:
#Test to see if the function works with exactly the same strings
#Put your test code here
[s3, i] = slide_left_right('aaa','aaa')

print(s3)
print(i)

2) Write a test case to see what happends if the strings do not overlap (i.e., they are entirely different).


In [ ]:
#Test to see if the code works for non-overlapping strings
#Put your test code here
[s3, i] = slide_left_right('aaa','bbb')
print(s3)
print(i)

3) Write a test case to see what happens if the right string is entirely inside the left string


In [ ]:
#Test to see if the code works if the right string is entirely inside the left one
#Put your test code here
[s3, i] = slide_left_right('aaabbbccc','bbb')
print(s3)
print(i)

4) Write a test case to see what happens if the right string is empty. (i.e. '')


In [ ]:
#Different Overlapping strings
#Put your test code here
[s3, i] = slide_left_right('aaabbbccc','')
print(s3)
print(i)

Question 7: As a group, discuss the unit tests shown above. What did you learn from these test cases? Can you think of any additional test cases that you haven't implemented? And, finally, did the code do what you expected it to do?

//Write your answers here

Before you go any further:

Make sure that you check with one of the instructors to make sure your answers make sense!

Full Test:

Before moving to this section, make sure you are comfortable with the above tests. Now let's test the entire shotgun program again using the new function you have written:


In [ ]:
#Run the shotgun assembly on the orginial string list
shotgun(start_string_list)

Make sure this got the answer you are expecting. Let's try a different order to the strings to make sure that is working as well:


In [ ]:
second_string_list = [start_string_list[0], \
                      start_string_list[3], \
                      start_string_list[2], \
                      start_string_list[1]]

print(shotgun(second_string_list))

Before moving on, make sure you got the same answer for both inputs. Adjust the function slide_left_right as needed to make sure this happens!

Generate Genome Cuts

Now let's download a really big test to see if the code can do what you did with the slips of paper in class.

First, download the "Example-Assembly_Exercise.ipynb" notebook from the d2l website. Run the notebook to generate the "Genome_cuts.txt" file. You do not need to understand the notebook in order to generate the file. However, it would be a good exercise to read though the notebook and see what pieces make sense.

Open the "Genome_cuts.txt" file using the following command:


In [ ]:
cuts = open("Genome_cuts.txt").read()
cuts=cuts.split('\n')

Run the shotgun assembler on the code and talk about the outputs with your group members:


In [ ]:
print(shotgun(cuts))

Question 8: Although it is possible that the randomly-generated "Genome_cuts.txt" produced input to the function that work, most likely the above string is not fully assembled (i.e., the assembly doesn't produce the paragraph of text you would expect). Why do you think this may not have worked?

//put your answer here

Question 9: When writing unit tests. It is generally best to come up with the smallest test that reproduces the error. This small tests help you isolate the problem. Come up with a very short unit test that demonstrates the failure exhibited above.

//put your answer here

Question 10: Sequencing machines are not always correct. How would you change the program to account for errors in some of the letters?

//put your answer here

Feedback on this assignment

Question 11: What questions do you (or does your group) have after this assignment?

Put your answer here.

Submit this assignment

Log into the course Desire2Learn website (d2l.msu.edu) and go to the "In-class assignments" and the "Day 17" folder. Upload this notebook there. You only have to upload one notebook per group - just make sure that everybody's name is at the top of the notebook!