MAFFT (http://www.ebi.ac.uk/Tools/msa/mafft/) is an online tool for aligning three or more sequences. It works on a set of nucleotide or protein sequences in FASTA format. Remember from the databases lecture that the FASTA format can be used to store multiple sequences. MAFFT takes as input all the sequences you want to align in a single file. Therefore you must first copy and paste your sequences of interest into one file (as specified in the preparatory work for this lab). In this exercise we will analyze the set of Twist gene homologs you generated in the prelab.
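If you would rather build the single input file with a script than by copying and pasting, the following is a minimal sketch of one way to do it. The function names and the toy records are hypothetical; they are not real Twist sequences, and the code only assumes plain FASTA text with ">" header lines.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from multi-record FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def combine_fasta(texts):
    """Concatenate several FASTA texts into one multi-record FASTA string."""
    lines = []
    for text in texts:
        for header, seq in parse_fasta(text):
            lines.append(">" + header)
            lines.append(seq)
    return "\n".join(lines) + "\n"


# Hypothetical toy records used only for illustration.
human = ">human_twist1\nATGATG\nCAGGAC\n"
mouse = ">mouse_twist1\nATGATGCAGGAT\n"
combined = combine_fasta([human, mouse])
print(combined)
```

The combined text can then be saved to one file and uploaded to MAFFT.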
Question 1: Which two sequences are the closest, according to the pairwise score table? To obtain this table, go to the "Results Summary" tab and then click on "Percent Identity Matrix". Copy and paste the output matrix.
Question 2: Which two sequences are the furthest (most divergent)?
Question 3: The data in the new alignment file has two numbers at the top. What do these numbers mean?
All PHYLIP programs have a similar interface: each parameter is assigned a single letter, and the options are displayed as a list. In some cases, pressing the key corresponding to a parameter (followed by “Enter”) will cycle through the options. In other cases, you may be asked to input your setting for the parameter. As specified in the prelab instructions, you can find information on the PHYLIP suite of programs at http://evolution.genetics.washington.edu/phylip/
Note: To use PHYLIP on the command line, type phylip plus the name of the command. To view all commands, type phylip and press Enter.
Task: Find the maximum parsimony tree for the DNA sequences in twist.aln.
Run phylip dnapars in a terminal. Note: By default, PHYLIP programs always look for input in a file named infile and write output to files whose names start with out. It is always a good idea to rename your input and output files to something more meaningful.
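The file handling around a PHYLIP run can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the ".out" suffix is our own choice, and the demo writes stand-in files to a temporary directory instead of running PHYLIP.

```python
import shutil
import tempfile
from pathlib import Path


def stage_input(alignment, workdir):
    """Copy the alignment to the default name PHYLIP reads ('infile')."""
    target = Path(workdir) / "infile"
    shutil.copyfile(alignment, target)
    return target


def keep_outputs(workdir, prefix):
    """Rename 'outfile' and 'outtree' so a later run does not overwrite them."""
    renamed = {}
    for default, suffix in (("outfile", ".out"), ("outtree", ".tree")):
        src = Path(workdir) / default
        if src.exists():
            dst = Path(workdir) / (prefix + suffix)
            src.rename(dst)
            renamed[default] = dst
    return renamed


with tempfile.TemporaryDirectory() as tmp:
    aln = Path(tmp) / "twist.aln"
    aln.write_text("stand-in alignment\n")
    stage_input(aln, tmp)
    # Stand-ins for the files a PHYLIP program would write.
    (Path(tmp) / "outfile").write_text("stand-in report\n")
    (Path(tmp) / "outtree").write_text("(A,B,C);\n")
    renamed = keep_outputs(tmp, "twist-dnapars")
    print(sorted(p.name for p in renamed.values()))
```

With the prefix twist-dnapars, the tree file ends up as twist-dnapars.tree, the name used later in this lab.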
Task: Ignoring the edge distances, draw an unrooted representation of this tree from the Newick format.
Run phylip drawtree in the terminal window. When prompted for the input file, type twist-dnapars.tree.
Question 4: What are the possible bi-partitions of this tree? You can denote the partition using the | character.
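If you want to check a hand-derived list of bi-partitions, the sketch below enumerates them from a Newick string. It is a simplified parser written for this illustration: it assumes unquoted leaf names, ignores branch lengths and internal labels, and reports only the non-trivial splits (those with at least two leaves on each side). The five-leaf tree is hypothetical, not the Twist tree.

```python
def parse_newick(text):
    """Parse a Newick string into nested lists of leaf names (lengths ignored)."""
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def skip_length():
        nonlocal pos
        if pos < len(text) and text[pos] == ":":
            pos += 1
            while pos < len(text) and text[pos] not in ",()":
                pos += 1

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "("
            children = [node()]
            while text[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # consume ")"
            while pos < len(text) and text[pos] not in ",():":
                pos += 1  # skip an optional internal label
            skip_length()
            return children
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        skip_length()
        return name

    return node()


def leaves(tree):
    """Return the set of leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial splits, each as a frozenset of two leaf sets."""
    everything = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = everything - side
        if len(side) > 1 and len(other) > 1:
            splits.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return splits


def format_split(split):
    """Render a split with the | character, e.g. 'A B | C D E'."""
    sides = sorted(sorted(side) for side in split)
    return " | ".join(" ".join(side) for side in sides)


# Hypothetical tree with made-up leaf names and branch lengths.
tree = parse_newick("((A:0.1,B:0.2):0.05,C:0.3,(D:0.1,E:0.1):0.2);")
for line in sorted(format_split(s) for s in bipartitions(tree)):
    print(line)
```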
Question 5: Which event probably occurred first: division of the Twist gene into Twist1 and Twist2, or the divergence of mice and humans from their common ancestor? Explain the answer in terms of the tree generated by PHYLIP.
Question 6: Match the number with the letter(s). Remember that orthologs are defined as homologs due to speciation and that paralogs are defined as homologs due to gene duplication. Conceptually, each internal node of the tree should represent a common ancestor (either ancestor of the species before they diverged, or a copy of the gene before it was duplicated in the genome).
1) Human Twist1 and Human Twist2 are
2) Human Twist1 and Mouse Twist1 are
3) Human Twist1 and Mouse Twist2 are
A. Orthologs
B. Homologs
C. Paralogs
Question 7: How would we make a best guess for what the sequence was for each ancestral node? Note: you do not need to actually calculate the ancestral sequences, but discuss how you would do it.
Task: Draw a rooted version of your twist-dnapars.tree tree.
Run phylip retree at the command prompt in the phylogeny_lab directory. This allows you to modify and manipulate the tree in an interactive setting.
Question 8: Based on your knowledge of the tree of life, where is the most logical place to root this tree? Does this match the midpoint root?
Question 9a: Give the Newick format representation of this tree (ignoring distances) and explain how it is different from the maximum parsimony tree you drew in Question 8.
Task: Assess the robustness of our tree in twist.aln by using the bootstrapping tool in PHYLIP.
First, generate bootstrap samples of the alignment:
phylip seqboot
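To see what a bootstrap sample of an alignment is, the sketch below resamples alignment columns with replacement, so each replicate has the same length as the original. This illustrates the general idea only: it uses Python's own random number generator, the function names and toy alignment are hypothetical, and its output will not match the replicates seqboot produces. The seed 15 is the value this lab asks you to use.

```python
import random


def bootstrap_alignment(aligned, rng):
    """Build one replicate by sampling alignment columns with replacement."""
    names = list(aligned)
    length = len(aligned[names[0]])
    columns = [rng.randrange(length) for _ in range(length)]
    # The same sampled columns are used for every sequence, so columns stay intact.
    return {name: "".join(aligned[name][i] for i in columns) for name in names}


def bootstrap_replicates(aligned, n, seed):
    """Return n bootstrap replicates generated from a single seeded generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(aligned, rng) for _ in range(n)]


# Hypothetical toy alignment, not the Twist data.
aligned = {
    "seqA": "ATGCATGCAT",
    "seqB": "ATGCATGCAA",
    "seqC": "ATG-TTGGTA",
}
replicates = bootstrap_replicates(aligned, n=3, seed=15)
for rep in replicates:
    print(rep)
```

A tree is then built from each replicate, which is what the next step does.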
Second, make the tree corresponding to each of these alignment samples.
Run phylip dnapars at the command prompt in the phylogeny_lab directory.
Question 9b: What is contained in this file?
Task: Find the consensus tree from all of the trees generated by the bootstrap program.
Run phylip consense at the command prompt in the phylogeny_lab directory.
Question 10: In which ways (if any) were the input trees the same? How do you know?
Question 11: What do the edge weights indicate on this tree?
Question 12: What does this consensus tree tell us about the robustness of the maximum parsimony tree of the sequences in twist.aln?
We have found that students who attempt this homework without reading the lecture notes discussed in the pre-lab struggle mightily, and the homework becomes very time-consuming.
Task: Think of your favorite gene, possibly one you're researching or one you find interesting. Using BLAST, NCBI, or the RefSeq database, obtain the sequences of at least 5 of its homologs (not isoforms; see below) from at least 2 different species that you think will be exciting. Make an alignment, make a tree, and bootstrap that tree as above. If possible, root the tree in a manner that makes sense. Use the parameter settings that make the most sense to you.
We'll also note that some of this software uses pseudo-random numbers. Systems for generating pseudo-random numbers all start with a seed. If you're asked for a seed, choose the value 15. This will allow us to regenerate your results. We talk more about pseudo-random numbers and random seeds in later portions of the class.
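A short sketch of why a fixed seed lets us regenerate your results: the same seed produces the same sequence of pseudo-random numbers every time. This uses Python's random module purely for illustration; PHYLIP has its own generator, and the function name is hypothetical.

```python
import random


def draw(seed, n=5):
    """Return n pseudo-random integers from a generator started at the given seed."""
    rng = random.Random(seed)
    return [rng.randrange(100) for _ in range(n)]


first = draw(15)
second = draw(15)
print(first == second)  # prints True: the same seed gives the same numbers
```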
This is an open-ended question. Some sequences will be more alignable than others or will produce a tree with better bootstrap values. This is okay. Often, the most interesting phylogenetic questions are the most challenging.
On the topic of isoforms: You might be tempted to analyze different isoforms of the same gene. If you find yourself tempted to do this, it is a great opportunity to engage in a thought experiment about why that might not make sense. When you're done with this thought experiment, scroll to the bottom of this exercise for an answer.
Question 1: Describe how you obtained your sequences. (1 point)
Question 2: Copy and paste your alignment below: (2 points)
Question 3: Copy and paste the contents of the bootstrapped tree file (the outtree file generated by dnapars): (2 points)
Question 4: Provide a one or two paragraph-length description of your thinking as you performed this analysis. What choices did you have to make as you went through this process, especially when it came to the alignment? How did you decide what to do? What parameters did you use (give us enough detail that we could reproduce your analysis)? Were there any challenges?
In particular, describe what the resulting tree tells you about the evolutionary relationship between these sequences (try to use terms like ortholog and paralog). Are you confident in the output of the tree based on the results of bootstrapping? If you performed this analysis again, what would you do differently? Are there any parameters you would change or sequences you would add? Did you learn anything surprising or interesting from the phylogeny? (5 points)
More on isoforms: Isoforms are different sequences produced from the same gene. Phylogenetics is used to examine evolutionary differences. However, differences in isoforms arise from splicing, not evolution, and thus are not suitable for phylogenetic analyses. There are entire classes of algorithms that focus on the analysis of splicing.