BEAST (mostly BEAST1, but increasingly BEAST2) is widely used to fit complex evolutionary models to sequence data, often from viruses. The codebase is large and mature, but this can also be daunting for anyone trying to learn the API to develop their own code on top of BEAST. Wouldn't it be nice if one could play with the code interactively, as R and Python users are used to doing? Thanks to Scala, a JVM based language that includes a read-eval-print loop (REPL), and Jupyter, the rebranded IPython notebook, we can do just that.
First of all, one needs to install Scala, which can be downloaded from here.
On OSX, you can do this using Homebrew.
brew update
brew install scala
brew install sbt
For IPython/Jupyter, you'll need an up-to-date Python distribution plus a bunch of dependencies (notably ZeroMQ). Detailed instructions can be found on the website, but assuming you have the dependencies and tho Python package manager pip, you can install as follows.
sudo pip install ipython[all] --upgrade
You may also be using pip3 for the Python3 version of pip.
To use Jupyter with Scala, one needs to install a kernel for running Scala. One of these is IScala, but I'll use the more lightweight jupyter-scala kernel. To build from source, do the following - it'll take some time.
git clone https://github.com/alexarchambault/jupyter-scala.git
cd jupyter-scala
sbt cli/packArchive
cd cli/target
tar -zxvf jupyter-scala*.tar.gz
cd jupyter-scala*/bin
chmod +x ./jupyter-scala
./jupyter-scala
If all has installed correctly, change to the working directory, then launch Jupyter.
jupyter notebook
You should then see (currently) scala211 as one of the kernels you can use in a new notebook.
As an example, I'll show how root-to-tip regression can be used to root a phylogenetic tree based on the sampling time of the tips. This is implemented within Andrew Rambaut's Path-O-Gen, and in the source code for BEAST there is also a file, which implements root-to-tip regression for a batch of trees. However, it assumes a single set of taxa, and outputs a NEXUS file. In the following, I'll show how a Scala version of this code can be used to analyse multiple (possibly different) trees, and output a file containing Newick trees.
I first load the BEAST, BEAUTI and Path-O-Gen jars locally. In jupyter-scala, I use load.jar for this. If I was running as a script, I would call Scala using the -cp flag to set the classpath, or within the normal Scala REPL, use :cp instead.
In [1]:
load.jar("lib/beast.jar")
load.jar("lib/beauti.jar")
load.jar("lib/pathogen.jar")
Next, I import the classes needed.
In [2]:
import dr.app.beauti.options.DateGuesser
import dr.app.util.Arguments
import dr.evolution.io.Importer
import dr.evolution.io.NewickImporter
import dr.evolution.io.TreeImporter
import dr.evolution.tree.Tree
import dr.evolution.tree.Tree.Utils
import dr.evolution.util.TaxonList
import dr.stats.Regression
import dr.stats.Variate
import dr.util.Version
import dr.app.pathogen.TemporalRooting
import java.io.FileReader
import java.io.IOException
import java.io.PrintStream
import java.io.PrintWriter
import java.util.ArrayList
import java.util.List
I set the input and output filenames; the input tree has to be rooted. I'm using some simulated data from the PANGEA-HIV methods comparison exercise.
In [3]:
var inputFileName: String = "village.tre"
var outputRegressionFileName: String = "village_regression.txt"
var outputRTTTreeFileName: String = "village_rtt.tre"
var dateOrder = "LAST"
I initialize some variables, as well as setting up file readers, some containers for the trees and root-to-tip regressions, and a class to guess dates based on the taxon names.
In [4]:
var firstTree = true
var totalTrees = 0
val fileReader = new FileReader(inputFileName)
val importer = new NewickImporter(fileReader)
val regressions = new ArrayList[Regression]()
val trees = new ArrayList[Tree]()
val dg = new DateGuesser()
The following code extracts the sampling times from the taxon names in the tree.
In [5]:
dg.fromLast = false
if (dateOrder == "FIRST") {
dg.order = 0
} else if (dateOrder == "LAST") {
dg.order = 0
dg.fromLast = true
} else {
dg.order = java.lang.Integer.parseInt(dateOrder) - 1
if (dg.order < 0 || dg.order > 100) {
System.err.println("Error Parsing order of date field: " + dateOrder)
}
}
The TemporalRooting.findRoot method finds the root of the tree that minimises (in this case) the correlation between the root-to-tip distance and the sequence sampling times. The resulting rerooted trees are stored in an ArrayList.
In [6]:
var taxa: TaxonList = null
var temporalRooting: TemporalRooting = null
try {
while (importer.hasTree()) {
val tree = importer.importNextTree()
taxa = tree
dg.guessDates(taxa)
temporalRooting = new TemporalRooting(taxa)
var rootedTree = tree
rootedTree = temporalRooting.findRoot(tree, TemporalRooting.RootingFunction.CORRELATION)
regressions.add(temporalRooting.getRootToTipRegression(rootedTree))
trees.add(rootedTree)
totalTrees += 1
}
} catch {
case e: Importer.ImportException => {
System.err.println("Error Parsing Input Tree: " + e.getMessage)
}
}
Now that the results have been generated, I save them to a text file.
In [7]:
var printWriter: PrintWriter = null
printWriter = new PrintWriter(outputRegressionFileName)
printWriter.println("tree\trttslope\tx-intercept\ty-intercept\tcorrelation")
for (i <- 0 until totalTrees) {
val r = regressions.get(i)
printWriter.print(i)
printWriter.print("\t" + r.getGradient)
printWriter.print("\t" + r.getXIntercept)
printWriter.print("\t" + r.getYIntercept)
printWriter.println("\t" + r.getCorrelationCoefficient)
}
printWriter.close()
Although there is code to export trees in Nexus format (which is what was done in RootToTip.java), there isn't any convenience function to export Newick trees (aside from some code from JEBL). However, Tree does provide some utilities, amongst them one which generates a Newick string, that can be saved to file.
In [8]:
var rttPrintStream: PrintStream = null
rttPrintStream = new PrintStream(outputRTTTreeFileName)
for(i <- 0 until totalTrees) {
rttPrintStream.println(Utils.newick(trees.get(i)))
}
rttPrintStream.close()
So, that's it for now. Whenever I get a chance, I'll write a bit about calling R from Scala using rscala, as well as writing plugins for BEAST2 using Scala.