BEAST in a notebook

BEAST (mostly BEAST1, but increasingly BEAST2) is widely used to fit complex evolutionary models to sequence data, often from viruses. The codebase is large and mature, but this can also be daunting for anyone trying to learn the API to develop their own code on top of BEAST. Wouldn't it be nice if one could play with the code interactively, as R and Python users are used to doing? Thanks to Scala, a JVM based language that includes a read-eval-print loop (REPL), and Jupyter, the rebranded IPython notebook, we can do just that.

Installation

Installing Scala

First of all, one needs to install Scala, which can be downloaded from here.

On OSX, you can do this using Homebrew.

brew update
brew install scala
brew install sbt

Installing IPython/Jupyter

For IPython/Jupyter, you'll need an up-to-date Python distribution plus a bunch of dependencies (notably ZeroMQ). Detailed instructions can be found on the website, but assuming you have the dependencies and tho Python package manager pip, you can install as follows.

sudo pip install ipython[all] --upgrade

You may also be using pip3 for the Python3 version of pip.

Installing jupyter-scala

To use Jupyter with Scala, one needs to install a kernel for running Scala. One of these is IScala, but I'll use the more lightweight jupyter-scala kernel. To build from source, do the following - it'll take some time.

git clone https://github.com/alexarchambault/jupyter-scala.git
cd jupyter-scala
sbt cli/packArchive
cd cli/target
tar -zxvf jupyter-scala*.tar.gz
cd jupyter-scala*/bin
chmod +x ./jupyter-scala
./jupyter-scala

Running

If all has installed correctly, change to the working directory, then launch Jupyter.

jupyter notebook

You should then see (currently) scala211 as one of the kernels you can use in a new notebook.

An example: root-to-tip regression

As an example, I'll show how root-to-tip regression can be used to root a phylogenetic tree based on the sampling time of the tips. This is implemented within Andrew Rambaut's Path-O-Gen, and in the source code for BEAST there is also a file, which implements root-to-tip regression for a batch of trees. However, it assumes a single set of taxa, and outputs a NEXUS file. In the following, I'll show how a Scala version of this code can be used to analyse multiple (possibly different) trees, and output a file containing Newick trees.

I first load the BEAST, BEAUTI and Path-O-Gen jars locally. In jupyter-scala, I use load.jar for this. If I was running as a script, I would call Scala using the -cp flag to set the classpath, or within the normal Scala REPL, use :cp instead.


In [1]:
load.jar("lib/beast.jar")
load.jar("lib/beauti.jar")
load.jar("lib/pathogen.jar")



Next, I import the classes needed.


In [2]:
import dr.app.beauti.options.DateGuesser
import dr.app.util.Arguments
import dr.evolution.io.Importer
import dr.evolution.io.NewickImporter
import dr.evolution.io.TreeImporter
import dr.evolution.tree.Tree
import dr.evolution.tree.Tree.Utils
import dr.evolution.util.TaxonList
import dr.stats.Regression
import dr.stats.Variate
import dr.util.Version
import dr.app.pathogen.TemporalRooting
import java.io.FileReader
import java.io.IOException
import java.io.PrintStream
import java.io.PrintWriter
import java.util.ArrayList
import java.util.List


import dr.app.beauti.options.DateGuesser
import dr.app.util.Arguments
import dr.evolution.io.Importer
import dr.evolution.io.NewickImporter
import dr.evolution.io.TreeImporter
import dr.evolution.tree.Tree
import dr.evolution.tree.Tree.Utils
import dr.evolution.util.TaxonList
import dr.stats.Regression
import dr.stats.Variate
import dr.util.Version
import dr.app.pathogen.TemporalRooting
import java.io.FileReader
import java.io.IOException
import java.io.PrintStream
import java.io.PrintWriter
import java.util.ArrayList
import java.util.List

I set the input and output filenames; the input tree has to be rooted. I'm using some simulated data from the PANGEA-HIV methods comparison exercise.


In [3]:
var inputFileName: String = "village.tre"
var outputRegressionFileName: String = "village_regression.txt"
var outputRTTTreeFileName: String = "village_rtt.tre"
var dateOrder = "LAST"


inputFileName: String = "village.tre"
outputRegressionFileName: String = "village_regression.txt"
outputRTTTreeFileName: String = "village_rtt.tre"
dateOrder: String = "LAST"

I initialize some variables, as well as setting up file readers, some containers for the trees and root-to-tip regressions, and a class to guess dates based on the taxon names.


In [4]:
var firstTree = true
var totalTrees = 0
val fileReader = new FileReader(inputFileName)
val importer = new NewickImporter(fileReader)
val regressions = new ArrayList[Regression]()
val trees = new ArrayList[Tree]()
val dg = new DateGuesser()


firstTree: Boolean = true
totalTrees: Int = 0
fileReader: FileReader = java.io.FileReader@33c9bd8d
importer: NewickImporter = dr.evolution.io.NewickImporter@3c346c8b
regressions: ArrayList[Regression] = []
trees: ArrayList[Tree] = []
dg: DateGuesser = dr.app.beauti.options.DateGuesser@754f90cb

The following code extracts the sampling times from the taxon names in the tree.


In [5]:
dg.fromLast = false
  if (dateOrder == "FIRST") {
    dg.order = 0
  } else if (dateOrder == "LAST") {
    dg.order = 0
    dg.fromLast = true
  } else {
    dg.order = java.lang.Integer.parseInt(dateOrder) - 1
    if (dg.order < 0 || dg.order > 100) {
      System.err.println("Error Parsing order of date field: " + dateOrder)
    }
}



The TemporalRooting.findRoot method finds the root of the tree that minimises (in this case) the correlation between the root-to-tip distance and the sequence sampling times. The resulting rerooted trees are stored in an ArrayList.


In [6]:
var taxa: TaxonList = null
var temporalRooting: TemporalRooting = null
try {
  while (importer.hasTree()) {
    val tree = importer.importNextTree()
    taxa = tree
    dg.guessDates(taxa)
    temporalRooting = new TemporalRooting(taxa)
    var rootedTree = tree
    rootedTree = temporalRooting.findRoot(tree, TemporalRooting.RootingFunction.CORRELATION)
    regressions.add(temporalRooting.getRootToTipRegression(rootedTree))
    trees.add(rootedTree)
    totalTrees += 1
  }
} catch {
  case e: Importer.ImportException => {
    System.err.println("Error Parsing Input Tree: " + e.getMessage)
  }
}


taxa: TaxonList = ((((HOUSE5440-5424-MALE_SAMPLED_23.9398544887081:0.030304288760000006,HOUSE213-8853-MALE_SAMPLED_30.4440753691847:0.048564451310000006):0.001566584422999992,(((HOUSE4425-11265-MALE_SAMPLED_40.7310725144362:0.0367453096,HOUSE930-9401-MALE_SAMPLED_40.9680201422226:0.03202150755000001):0.002228724244999994,HOUSE3476-7434-MALE_SAMPLED_41.240714742628:0.027473325450000002):0.036025480959999995,((HOUSE1066-9744-FEMALE_SAMPLED_40.5949480801292:0.034257315159999996,(HOUSE5013-4985-FEMALE_SAMPLED_40.0179873942715:0.030488125040000008,HOUSE269-13719-MALE_SAMPLED_44.8127190966625:0.041655773509999994):0.0018901675680000063):0.03418911803000001,HOUSE4957-11691-FEMALE_SAMPLED_41.1275386752759:0.06180793280000001):0.0013417752599999971):0.001216050718000003):0.026975177350000004,(((((((HOUSE436-423-FEMALE_SAMPLED_20.5177062214352:0.010084563010000006,(HOUSE2317-10437-MALE_SAMPLED_42.9017288412433:0.012632914669999995,HOUSE1993-1981-FEMALE_SAMPLED_40.4272734640254:0.010659886300000004):0.06635069577):0.008366545813999995,((HOUSE5552-8570-MALE_SAMPLED_40.1816582144663:0.015123975250000005,HOUSE3651-3595-FEMALE_SAMPLED_41.9283354467792:0.025518900090000005):0.030767162969999998,((HOUSE1699-13906-FEMALE_SAMPLED_43.6345837940462:0.05880356948,HOUSE946-11446-MALE_SAMPLED_43.6794608389027:0.050249877129999995):6.460262866000016E-4,HOUSE1528-9191-FEMALE_SAMPLED_31.991910797311:0.025338908529999996):0.005331173425000002):0.012397956749999994):2.8890048349994313E-6,(HOUSE6019-14747-FEMALE_SAMPLED_44...
temporalRooting: TemporalRooting = dr.app.pathogen.TemporalRooting@4298f7e7

Now that the results have been generated, I save them to a text file.


In [7]:
var printWriter: PrintWriter = null  
printWriter = new PrintWriter(outputRegressionFileName)
printWriter.println("tree\trttslope\tx-intercept\ty-intercept\tcorrelation")
for (i <- 0 until totalTrees) {
  val r = regressions.get(i)
  printWriter.print(i)
  printWriter.print("\t" + r.getGradient)
  printWriter.print("\t" + r.getXIntercept)
  printWriter.print("\t" + r.getYIntercept)
  printWriter.println("\t" + r.getCorrelationCoefficient)
}
printWriter.close()


printWriter: PrintWriter = java.io.PrintWriter@411d1f7a

Although there is code to export trees in Nexus format (which is what was done in RootToTip.java), there isn't any convenience function to export Newick trees (aside from some code from JEBL). However, Tree does provide some utilities, amongst them one which generates a Newick string, that can be saved to file.


In [8]:
var rttPrintStream: PrintStream = null
rttPrintStream = new PrintStream(outputRTTTreeFileName)
for(i <- 0 until totalTrees) {
  rttPrintStream.println(Utils.newick(trees.get(i)))
}
rttPrintStream.close()


rttPrintStream: PrintStream = java.io.PrintStream@40e2f54d

So, that's it for now. Whenever I get a chance, I'll write a bit about calling R from Scala using rscala, as well as writing plugins for BEAST2 using Scala.