In-Class: Building Phylogenies from Alignments

Part 1: Multiple Sequence Alignment using MAFFT

MAFFT (http://www.ebi.ac.uk/Tools/msa/mafft/) is an online tool for aligning three or more sequences. It works on a set of nucleotide or protein sequences in FASTA format. Remember from the databases lecture that the FASTA format can be used to store multiple sequences. MAFFT takes as input all the sequences you want to align in a single file. Therefore you must first copy and paste your sequences of interest into one file (as specified in the preparatory work for this lab). In this exercise we will analyze the set of Twist gene homologs you generated in the prelab.

  • Go to the MAFFT website (http://www.ebi.ac.uk/Tools/msa/mafft/).
  • Configure this run of the MAFFT algorithm. Set the upload file to twist.fasta or copy the entire set of genes into the text box at the bottom of the form. Leave all parameters at their default settings.
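  • Before submitting, check that your sequence names are short (fewer than 9 characters). PHYLIP, used in Part 2, stores each name in a fixed-width field of 10 characters, so longer names cause problems downstream.
  • Click the “Submit” button. The results should appear in a few minutes at most.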
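If you would rather check name lengths programmatically than by eye, the following is a minimal sketch. The helper names (read_fasta, long_names) and the example sequences are made up for illustration; it assumes the name is the first word after ">" on each header line.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Assumption: the name is the first whitespace-delimited token.
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def long_names(records, max_len=8):
    """Return the names that are longer than max_len characters."""
    return [name for name, _ in records if len(name) > max_len]


# Made-up example input, not real Twist sequences.
example = """>HsTwist1
ATGATGCAG
GACGTG
>MmTwist2_long_name
ATGGAGGAG
"""

print(long_names(read_fasta(example)))
```

Any name it prints should be shortened in twist.fasta before you run the alignment.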

Question 1: Which two sequences are the closest, according to the pairwise score table? To obtain this table, go to the "Results Summary" tab and then click on "Percent Identity Matrix". Copy and paste the output matrix.

Question 2: Which two sequences are the furthest (most divergent)?
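To see what a percent identity score measures, here is a small sketch that computes it for every pair of aligned sequences and reports the most and least similar pairs. It assumes identity is counted only over columns where neither sequence has a gap; the EBI tool may define the score slightly differently, so do not expect the numbers to match its matrix exactly. The sequences and names are toy data.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent of matching positions, skipping columns with a gap in either sequence."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def closest_and_furthest(aligned):
    """Return (closest pair, furthest pair) of names from a dict of aligned sequences."""
    scores = {
        (p, q): percent_identity(aligned[p], aligned[q])
        for p, q in combinations(sorted(aligned), 2)
    }
    return max(scores, key=scores.get), min(scores, key=scores.get)


# Toy alignment for illustration only.
toy = {
    "seqA": "ATGC-ATGCA",
    "seqB": "ATGCTATGCA",
    "seqC": "TTGCTAAGGA",
}
print(closest_and_furthest(toy))
```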

  • Create an output alignment file called twist.aln that is compatible with PHYLIP. To get the MAFFT alignment, click the 'Alignments' tab, then click 'Download Alignment File'. This file is in FASTA format, which PHYLIP cannot read, so you need to convert it to PHYLIP format. To do this, use EMBOSS Seqret: http://www.ebi.ac.uk/Tools/sfc/emboss_seqret/. Choose DNA as the sequence type, fasta as the input format, and phylip interleaved as the output format. Use the 'Download' button to download the resulting file, and name it twist.aln. Make sure the file you save is plain text (.txt) rather than rich text (.rtf).

Question 3: The data in the new alignment file has two numbers at the top. What do these numbers mean?

Part 2: Phylogeny Reconstruction with PHYLIP

All PHYLIP programs have a similar interface: each parameter is assigned a single letter, and the options are displayed as a list. In some cases, pressing the key corresponding to a parameter (followed by “Enter”) will cycle through the options. In other cases, you may be asked to input your setting for the parameter. As specified in the prelab instructions, you can find information on the PHYLIP suite of programs here: http://evolution.genetics.washington.edu/phylip/

Note: To use PHYLIP on the command line, type phylip followed by the name of the program (for example, phylip dnapars). To list all available programs, type phylip and press Enter.

Task: Find the maximum parsimony tree for the DNA sequences in twist.aln.

  • Upload twist.aln to CoCalc by clicking on the 'New' Tab in CoCalc, and then using the 'Upload files from your computer' option.
  • Type: phylip dnapars in a terminal.
  • You will be prompted for an input file; type twist.aln. Leave the default parameters and just type “Y” to proceed. PHYLIP will run the maximum parsimony algorithm and write its results to outfile and outtree text files in the same directory.
  • Rename outfile to twist-dnapars.out (mv outfile twist-dnapars.out) and rename outtree to twist-dnapars.tree (mv outtree twist-dnapars.tree).
  • Open the tree file in a text editor to confirm that it has only one tree. This is in the Newick format mentioned in the powerpoint.
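A Newick file can also be inspected with a few lines of code. The sketch below counts the trees in a piece of Newick text (each tree ends with a semicolon) and lists the leaf names of one tree. It is a rough illustration, not a full Newick parser: it assumes leaf names are unquoted and contain no parentheses, commas, colons, or spaces. The trees shown are invented.

```python
import re


def split_trees(text):
    """Split Newick text into one string per tree; each tree ends with ';'."""
    return [part.strip() + ";" for part in text.split(";") if part.strip()]


def leaf_names(newick):
    """Return leaf labels, i.e. names that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


# Invented example with generic leaf names.
text = "(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);\n(A:0.1,C:0.2,(B:0.3,D:0.4):0.5);\n"
trees = split_trees(text)
print(len(trees))            # number of trees in the text
print(leaf_names(trees[0]))  # leaf names of the first tree
```

To apply it to your own result, read the contents of twist-dnapars.tree into a string and pass it to split_trees.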

Note: PHYLIP programs by default always look for input kept in the infile file and generate output in the files starting with out. It is always a good idea to rename your input/output to something more meaningful.
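Because every PHYLIP run writes to the same outfile and outtree names, it is easy to overwrite earlier results. The following sketch wraps the two mv commands in one helper so the renaming is harder to forget. The function name rename_phylip_outputs is hypothetical, and the .out/.tree suffixes simply follow the naming used in this lab.

```python
from pathlib import Path


def rename_phylip_outputs(directory, prefix):
    """Rename outfile/outtree in `directory` to <prefix>.out and <prefix>.tree.

    Returns the list of new file names; files that do not exist are skipped.
    """
    directory = Path(directory)
    renamed = []
    for old, new in (("outfile", prefix + ".out"), ("outtree", prefix + ".tree")):
        source = directory / old
        if source.exists():
            target = directory / new
            source.rename(target)
            renamed.append(target.name)
    return renamed
```

For example, calling rename_phylip_outputs(".", "twist-dnapars") right after running dnapars would have the same effect as the two mv commands above.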

Task: Ignoring the edge distances, draw an unrooted representation of this tree from the Newick format.

  • Use drawtree to make a picture of the unrooted tree. Type phylip drawtree in the terminal window. When prompted for the input file, type twist-dnapars.tree.
  • Rename the output file plotfile to twist-dnapars.ps (mv plotfile twist-dnapars.ps).
  • Download the twist-dnapars.ps file and open it. (If you cannot open it, try dragging it onto a PowerPoint slide or using an online PDF converter like online2pdf.com/convert-ps-to-pdf)

Question 4: What are the possible bi-partitions of this tree? You can denote the partition using the | character.

Question 5: Which event probably occurred first: division of the Twist gene into Twist1 and Twist2, or the divergence of mice and humans from their common ancestor? Explain the answer in terms of the tree generated by PHYLIP.

Question 6: Match the number with the letter(s). Remember that orthologs are defined as homologs due to speciation and that paralogs are defined as homologs due to gene duplication. Conceptually, each internal node of the tree should represent a common ancestor (either ancestor of the species before they diverged, or a copy of the gene before it was duplicated in the genome).

1) Human Twist1 and Human Twist2 are

2) Human Twist1 and Mouse Twist1 are

3) Human Twist1 and Mouse Twist2 are

A. Orthologs

B. Homologs

C. Paralogs

Question 7: How would we make a best guess for what the sequence was for each ancestral node? Note: you do not need to actually calculate the ancestral sequences, but discuss how you would do it.

Task: Draw a rooted version of your twist-dnapars.tree tree.

  • Type phylip retree at the command prompt in the phylogeny_lab directory. This allows you to modify and manipulate the tree in an interactive setting.
  • Leave the default parameters and type “Y” and press “Enter”.
  • You will be asked for an input tree. Type twist-dnapars.tree and you should see your tree appear on the screen.
  • Type “?” and press “Enter” to see the list of options. Use the option that roots the tree at its midpoint (press “M”).

Question 8: Based on your knowledge of the tree of life, where is the most logical place to root this tree? Does this match the midpoint root?

  • Save the rooted tree to a file: Use the “W” option to output the new rooted tree (make sure to select “R” to output as a rooted tree). Rename the outtree file to twist-rooted.tree (mv outtree twist-rooted.tree) and open it in a text editor.

Question 9a: Give the Newick format representation of this tree (ignoring distances) and explain how it differs from the unrooted maximum parsimony tree you drew earlier.

Task: Assess the robustness of the tree built from twist.aln by using the bootstrapping tool in PHYLIP.

First, generate bootstrap samples of the alignment:

  • Type phylip seqboot
  • When prompted for input, type twist.aln as the input file. Set the number of replicates to 10 (Note: in a real phylogenetic analysis, you would want to perform at least 100 bootstrap replicates). Use the number 15 for the random seed when it asks for one. After running, rename outfile to twist-boot.aln (mv outfile twist-boot.aln). This file contains “bootstrap samples”: random samples, drawn with replacement, of the columns from the true alignment.
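To make the idea of a bootstrap sample concrete, the sketch below builds one replicate by drawing alignment columns at random, with replacement, until the replicate is as long as the original. This only illustrates the idea; it does not use seqboot's random number generator, so the same seed will not reproduce seqboot's output. The function name and toy alignment are made up.

```python
import random


def bootstrap_alignment(aligned, seed):
    """Return one bootstrap replicate of a dict mapping name -> aligned sequence."""
    rng = random.Random(seed)
    names = list(aligned)
    length = len(aligned[names[0]])
    # Draw `length` column indices with replacement: some columns repeat,
    # others are left out entirely.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(aligned[name][c] for c in columns) for name in names}


# Toy alignment for illustration only.
toy = {"seqA": "ACGTAC", "seqB": "ACGTTC", "seqC": "TCGAAC"}
replicate = bootstrap_alignment(toy, seed=15)
for name, sequence in replicate.items():
    print(name, sequence)
```

Every sequence is resampled with the same column indices, so each column of the replicate is still a genuine column of the original alignment.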

Second, make the tree corresponding to each of these alignment samples.

  • Type phylip dnapars at the command prompt in the phylogeny_lab directory.
  • When prompted for input, type twist-boot.aln as the input file. Make sure to set the parameter for “Analyze multiple datasets?” to “Yes”.
  • Use the “D” option to indicate full data sets, not weights, enter 10 for the number of data sets, choose 55 for the seed, and set jumble to 1.
  • Rename outfile to twist-boot.out (mv outfile twist-boot.out) and outtree to twist-boot.tree (mv outtree twist-boot.tree). Take a look at the file twist-boot.tree in any text editor.

Question 9b: What is contained in this file?

Task: Find the consensus tree from all of the trees generated by the bootstrap program.

  • Type phylip consense at the command prompt in the phylogeny_lab directory.
  • When prompted for input, type twist-boot.tree as the input file. Make sure to set the rule to “Majority Rule”, not the default “Majority Rule (extended)” (type “C” to change this parameter). “Majority Rule” is the version of the algorithm that includes only edges found in more than 50% of the trees.
  • Rename outfile to twist-boot-cons.out (mv outfile twist-boot-cons.out) and outtree to twist-boot-cons.tree (mv outtree twist-boot-cons.tree). Open twist-boot-cons.tree in any text editor.
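The majority rule itself is simple to sketch. In the code below, each tree is represented as a set of groups, where a group is the set of leaf names on one side of an edge. This representation is an assumption made for illustration (consense reads Newick files instead), and the rule is taken to be a strict majority, meaning a group is kept only if it appears in more than half of the trees. The leaf names are generic placeholders.

```python
from collections import Counter


def majority_rule(trees):
    """Return the groups that occur in more than half of the input trees.

    Each tree is an iterable of frozensets of leaf names.
    """
    counts = Counter(group for tree in trees for group in set(tree))
    return {group for group, n in counts.items() if n > len(trees) / 2}


# Four invented trees over placeholder leaves.
toy_trees = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("CE")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AC")},
]
print(majority_rule(toy_trees))
```

Here the group {A, B} appears in three of four trees and is kept, while {C, D} appears in exactly half and is dropped under the strict rule.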

Question 10: In which ways (if any) were the input trees the same? How do you know?

Question 11: What do the edge weights indicate on this tree?

Question 12: What does this consensus tree tell us about the robustness of the maximum parsimony tree of the sequences in twist.aln?

Homework: Your turn

We have found that students who attempt this homework without reading the lecture notes discussed in the pre-lab struggle mightily, and the homework becomes very time consuming.

Task: Think of your favorite gene, possibly one you're researching or one you find interesting. Using BLAST, NCBI, or the RefSeq database, obtain the sequences of at least 5 of its homologs (not isoforms; see below), from at least 2 different species, that you think will be exciting. Make an alignment, make a tree, and bootstrap that tree as above. If possible, root the tree in a manner that makes sense. Use the parameter settings that make the most sense to you.

We'll also note that some of this software uses pseudo-random numbers. Systems for generating pseudo-random numbers all start with a seed. If you're asked for a seed, choose the value 15. This will allow us to regenerate your results. We talk more about pseudo-random numbers and random seeds in later portions of the class.

This is an open-ended question. Some sequences will be more alignable than others or will produce a tree with better bootstrap values. This is okay. Often, the most interesting phylogenetic questions are the most challenging.

On the topic of isoforms: You might be tempted to analyze different isoforms of the same gene. If you find yourself tempted to do this, this is a great opportunity to engage in a thought experiment about why that might not make sense. When you're done with this thought experiment, scroll to the bottom of this exercise for an answer.
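The short sketch below shows why a fixed seed makes results reproducible: a pseudo-random generator started from the same seed produces the same sequence of numbers every time. It uses Python's random module purely as an illustration; PHYLIP has its own generator, so these are not the numbers PHYLIP would draw.

```python
import random


def draw(seed, n=5):
    """Return n pseudo-random integers in [0, 100) from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randrange(100) for _ in range(n)]


first = draw(15)
second = draw(15)
print(first)
print(second)
print(first == second)  # the same seed gives the same sequence
```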

Question 1: Describe how you obtained your sequences. (1 point)

Question 2: Copy and paste your alignment below: (2 points)

Question 3: Copy and paste the contents of the bootstrapped tree file (the outtree file generated by dnapars): (2 points)

Question 4: Provide a one or two paragraph-length description of your thinking as you performed this analysis. What choices did you have to make as you went through this process, especially when it came to the alignment? How did you decide what to do? What parameters did you use (give us enough detail that we could reproduce your analysis)? Were there any challenges?

In particular, describe what the resulting tree tells you about the evolutionary relationship between these sequences (try to use terms like ortholog and paralog). Are you confident in the output of the tree based on the results of bootstrapping? If you performed this analysis again, what would you do differently? Are there any parameters you would change or sequences you would add? Did you learn anything surprising or interesting from the phylogeny? (5 points)

More on isoforms: Isoforms are different sequences produced from the same gene. Phylogenetics is used to examine evolutionary differences. However, differences in isoforms arise from splicing, not evolution, and thus are not suitable for phylogenetic analyses. There are entire classes of algorithms that focus on the analysis of splicing.