Which maximum likelihood tree builder should I use?

Geneious contains plugins for maximum-likelihood tree builders PHYML, Garli, RAxML, PAUP* and FastTree.  In this post, we'll briefly review what sort of datasets each is most suitable for, which is fastest, and what options you get with each.  For details on the algorithms each program uses, please refer to the program's website.

If you are publishing your results from these plugins, please remember to cite the original authors of the program you used.  Citation information can be found on the respective plugin pages.

Background to each program

1. PHYML

PHYML was written by Stephane Guindon and his colleagues at LIRMM, University of Montpellier, France. It was first published in 2003, and the Geneious plugin uses version 3.2, which is described in this paper. PHYML is one of the best-known maximum-likelihood programs, valued for its simplicity, accuracy and speed.

2. RAxML

RAxML comes from Alexandros Stamatakis’ Exelixis Lab at the Heidelberg Institute for Theoretical Studies, Germany. It was developed to handle large datasets, with comparatively low memory consumption, advanced search algorithms and accelerated likelihood calculations.

The Geneious plugin currently uses RAxML version 8.2.7, so the features listed in the table below are for that version.

3. Garli

Garli is written and maintained by Derrick Zwickl, who is currently at the University of Kansas. It is loosely based on the program GAML (Lewis 1998). Documentation for the program can be found here.

4. PAUP*

PAUP* is a popular phylogenetics program written by Dave Swofford, which can be used to build maximum parsimony, distance and maximum likelihood trees. The information regarding PAUP* in this article relates only to maximum likelihood trees. PAUP* 4.0b10 used to be available for purchase from Sinauer Associates, but the program is currently undergoing a major update. Free "test" versions are currently available from here.

Note that the Geneious PAUP* plugin does not contain the program itself; it only provides an interface for running your own copy of PAUP*. You must download your own copy of PAUP* and set the path to the executable the first time you run the plugin in Geneious. The plugin is currently compatible with the old 4.0b10 version and with the new test alpha versions (4.0a149 and above).

5. FastTree

FastTree was developed by Morgan N. Price in Adam Arkin’s group at Lawrence Berkeley National Lab. It is optimized for extremely large alignments of up to 1 million sequences and uses a combination of neighbor-joining, minimum evolution and maximum likelihood to infer approximately-maximum-likelihood trees. A detailed description of how it works is given here, but to summarize, FastTree uses neighbor-joining to get an approximate starting tree, then minimum evolution methods to reduce the length of the tree, and then maximum likelihood to further improve the tree. Geneious implements FastTree 2.1.5.

What can you do with these programs?

All programs will build trees from both DNA and protein alignments; however, there are some differences in the options you get with each one, summarized in the table below. Note that PAUP* will build maximum parsimony and distance trees for protein alignments, but not maximum likelihood trees.

[Table image: Screen_Shot_2017-07-27_at_2.52.05_PM.png]

**Because of the way Garli is set up, only the default options of GTR+G+I model and no bootstrapping are currently implemented in the Geneious plugin. However, if you want a different option, such as bootstrapping or partitioning, please contact support, or you can edit the Garli config file (located in the plugin folder) yourself according to the Garli documentation.

PHYML and PAUP* give you the widest choice of models, with the ability to input most of the models that Modeltest compares for DNA data. However, bear in mind that most of these models are nested within the General-Time-Reversible (GTR) model, which is implemented in the other programs. PAUP* includes Modeltest, so you have the option to run this as part of the tree building process. For PHYML and the other programs you will need to run jModelTest outside of Geneious, then manually configure the appropriate model options in Geneious.
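To illustrate what "nested within GTR" means, here is a minimal Python sketch that builds a GTR rate matrix and then obtains two simpler models (HKY85 and JC69) purely by constraining its parameters. It is not taken from any of the programs above; the function names and the way the parameters are laid out are assumptions made for illustration only.

```python
# Illustrative sketch: simpler DNA models as constrained cases of GTR.
# Function names and parameter layout are assumptions, not any program's API.

BASES = "ACGT"
PAIRS = ("AC", "AG", "AT", "CG", "CT", "GT")


def gtr_rate_matrix(rates, freqs):
    """Build an unnormalised GTR rate matrix Q (rows and columns in ACGT order).

    rates: dict mapping each unordered base pair (e.g. "AC") to its
           exchangeability, six values in total.
    freqs: dict mapping each base to its equilibrium frequency.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                pair = "".join(sorted(x + y))
                q[i][j] = rates[pair] * freqs[y]
        # The diagonal is set so that each row sums to zero.
        q[i][i] = -sum(q[i])
    return q


def hky_parameters(kappa, freqs):
    """HKY85: transitions (A<->G, C<->T) share one rate, transversions another."""
    rates = {pair: 1.0 for pair in PAIRS}
    rates["AG"] = kappa
    rates["CT"] = kappa
    return rates, freqs


def jc_parameters():
    """JC69: all exchangeabilities equal and all base frequencies equal."""
    return hky_parameters(1.0, {base: 0.25 for base in BASES})


jc_q = gtr_rate_matrix(*jc_parameters())
hky_q = gtr_rate_matrix(
    *hky_parameters(2.0, {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3})
)
```

Because every simpler model here is just GTR with some parameters tied together, a program that only offers GTR can still represent them. It simply estimates more free parameters than the simpler model would.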

PHYML also gives you a variety of methods for calculating support values, but it does have an inbuilt constraint on the number of taxa.  I’m not aware of similar dataset size constraints for Garli, PAUP* and RAxML (although as you’ll see below these programs are all out-performed by FastTree for really large datasets).

RAxML and PAUP* allow you to partition your data, for example if you wish to estimate different rates for different codon positions or genes.  In PAUP* this is done by editing the custom command block - see the PAUP* command line guide for a full list of the options you can implement this way.  
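As a rough illustration of what partitioning by codon position means, the Python sketch below works out which alignment columns belong to each codon position and formats them in the style of a stand-alone RAxML partition file. The function names are hypothetical, the code assumes the coding region starts in frame, and how the Geneious plugins expect partitions to be entered is not covered here, so check the plugin options and the program documentation before relying on this format.

```python
# Illustrative sketch: splitting a coding region into codon-position partitions.
# Function names are hypothetical; the region is assumed to start in frame.


def codon_position_columns(start, end):
    """Return 1-based alignment columns for codon positions 1, 2 and 3.

    The coding region spans columns start..end inclusive, and column
    `start` is assumed to be a first codon position.
    """
    return {
        pos: list(range(start + pos - 1, end + 1, 3))
        for pos in (1, 2, 3)
    }


def raxml_style_partition_lines(start, end):
    """Format the three codon positions as 'DNA, name = first-last\\3' lines."""
    return [
        f"DNA, codon{pos} = {start + pos - 1}-{end}\\3"
        for pos in (1, 2, 3)
    ]


columns = codon_position_columns(1, 9)
lines = raxml_style_partition_lines(1, 9)
for line in lines:
    print(line)
```

For a nine-column region this assigns columns 1, 4, 7 to the first codon position, 2, 5, 8 to the second and 3, 6, 9 to the third, and each group can then be given its own rate.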

A brief note about how these programs run in Geneious

These plugins don’t run within the Geneious Java run-time environment, and thus they do not use the RAM allocated to Geneious. Instead they run as stand-alone programs with Geneious providing an interface. Geneious exports your file to the plugin, the plugin program runs, and then the results are imported back into Geneious. Although the tree building process itself does not use the RAM allocated to Geneious, you do need to have enough RAM allocated to Geneious to be able to handle the export/import of files - and for large files this can require a significant amount.

Which is fastest?

The answer to this question depends a lot on the type of dataset you have.  As a very general rule, speed goes something like this:  FastTree >> RAxML > PHYML > Garli >> PAUP*.  

FastTree is by far the fastest program for trees with a large number of taxa. FastTree can produce a 10,000 taxon tree with support values in only a couple of minutes, whereas the same tree built by RAxML or Garli may take several days to run. PHYML won’t even run on an alignment this large, as it has a built-in cutoff of 4000 taxa. However, trees produced by FastTree are “approximately maximum likelihood” trees, and for datasets where the relationships between taxa are not so clear-cut, they may not be as accurate as trees produced by the other methods, which perform a more intensive search of tree topologies (see the FastTree website for a more thorough discussion on the speed and accuracy of FastTree vs PHYML vs RAxML).

If you have extremely long sequences, but only a few taxa (for example if you’re building a tree from a small number of bacterial genomes), then RAxML and PHYML out-perform FastTree. A tree of five sequences, 4 million bases in length (computed without support values) took around 14 minutes in FastTree and only about 1 minute in RAxML and PHYML. Garli does not handle long sequences well and is best used for shorter alignments.

Of the full maximum-likelihood tree builders, RAxML appears to be most efficient for large trees from DNA data. PHYML is a good choice for smaller datasets: according to the PHYML manual, the “comfort zone” for PHYML generally lies around 100-200 sequences, each less than 2,000 characters long. The PHYML website has some extensive comparisons between PHYML and RAxML using a range of datasets.
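The rules of thumb above can be condensed into a small Python sketch. This is only a rough guide written for this article, not part of Geneious: the function name is made up, the `huge_taxa` threshold is an arbitrary illustrative value rather than a documented limit, and whether the PHYML cutoff applies at exactly 4000 taxa or just above it is an assumption.

```python
# Illustrative sketch of the rules of thumb in this article.
# The function name and the huge_taxa threshold are hypothetical.

PHYML_MAX_TAXA = 4000  # built-in cutoff mentioned above; exact boundary assumed


def suggest_tree_builders(n_taxa, alignment_length, huge_taxa=5000):
    """Return suggested programs for a dataset, most preferred first."""
    if n_taxa >= huge_taxa:
        # Very many taxa: only the approximate method finishes quickly.
        return ["FastTree"]
    if n_taxa <= 200 and alignment_length < 2000:
        # PHYML "comfort zone": small alignments of short sequences.
        return ["PHYML", "RAxML"]
    # Larger trees, or few taxa with very long sequences.
    suggestions = ["RAxML"]
    if n_taxa <= PHYML_MAX_TAXA:
        suggestions.append("PHYML")
    return suggestions


print(suggest_tree_builders(10000, 1500))
print(suggest_tree_builders(5, 4_000_000))
print(suggest_tree_builders(150, 1200))
```

Treat the output as a starting point only, since speed and accuracy also depend on the model, the support method and how clear-cut the relationships in your data are.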

PAUP* is the slowest of the maximum likelihood tree builders, particularly when run with the default options. PAUP* uses tree bisection and reconnection (TBR) by default for topology searching, which evaluates many more trees than the default topology search options in PHYML (NNI, nearest neighbour interchange) or RAxML (rapid hill climbing). To configure PAUP* to use NNI rather than TBR, open the custom command block and add SWAP=NNI to the HSEARCH line. This will speed things up considerably, but the speed still does not approach that of PHYML or RAxML.
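If you keep your PAUP* command blocks in text files, the edit described above can be sketched in Python as follows. The example command block is a hypothetical stand-in (the default block in the plugin may look different), and the helper name is made up. The only PAUP* details assumed are that an HSEARCH command ends with a semicolon and accepts a SWAP option.

```python
# Illustrative sketch: adding SWAP=NNI to HSEARCH commands in a PAUP* block.
# The helper name and the example block are hypothetical.
import re


def add_nni_swap(command_block):
    """Append SWAP=NNI to every HSEARCH command that has no SWAP option."""

    def patch(match):
        command = match.group(0)
        if re.search(r"\bswap\s*=", command, flags=re.IGNORECASE):
            return command  # leave an explicit SWAP setting untouched
        # Drop the trailing semicolon, add the option, then restore it.
        return command[:-1].rstrip() + " SWAP=NNI;"

    # An HSEARCH command runs from the keyword to the next semicolon.
    return re.sub(
        r"\bhsearch\b[^;]*;", patch, command_block, flags=re.IGNORECASE
    )


example_block = "SET CRITERION=LIKELIHOOD;\nHSEARCH;\n"  # hypothetical block
patched = add_nni_swap(example_block)
print(patched)
```

In practice it is just as easy to type SWAP=NNI into the command block by hand. The sketch is mainly meant to show exactly where the option goes.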

How can I make my tree run faster?

The short answer is to get a faster computer.  Having more RAM available for your treebuilder won’t necessarily speed it up, but may mean you can build larger trees without running out of memory.  Speed is primarily determined by the speed of your processor, and currently all the tree builders mentioned here only use a single processor and cannot be configured to run across multiple cores.

So, which tree is best?

There is no one answer to this question as it is entirely dependent on the nature of your dataset, and how well the chosen model fits your data. Maximum likelihood tree-builders search for the tree that gives your data the highest likelihood under the model you have chosen, but because of the differences in algorithms, the likelihood values produced by each program can’t be directly compared. It is good practice to use more than one method of tree building to assess how robust your tree topology is.
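One simple way to compare the topologies from two tree builders is to count the splits (bipartitions of the taxa) that they share. The Python sketch below does this for small trees written as nested tuples. It is an illustration only: the function names are made up, it assumes both trees contain exactly the same taxa, and it ignores branch lengths and support values. The number of splits found in only one of the two trees is the Robinson-Foulds distance.

```python
# Illustrative sketch: comparing two tree topologies by their shared splits.
# Trees are nested tuples of taxon names; function names are hypothetical.


def leaves(tree):
    """Return the set of taxon names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions of a tree.

    Each split is stored as the side that excludes an arbitrary reference
    taxon, so the result does not depend on where the tree is rooted.
    """
    all_taxa = leaves(tree)
    reference = min(all_taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        side = all_taxa - clade if reference in clade else clade
        # Skip trivial splits (one taxon versus all the others).
        if 1 < len(side) < len(all_taxa) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def compare_topologies(tree_1, tree_2):
    """Return (number of shared splits, number of differing splits)."""
    if leaves(tree_1) != leaves(tree_2):
        raise ValueError("both trees must contain the same taxa")
    s1, s2 = splits(tree_1), splits(tree_2)
    return len(s1 & s2), len(s1 ^ s2)


tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = (("A", "B"), ("C", ("D", "E")))  # same topology, rooted elsewhere
tree_c = ((("A", "C"), "B"), ("D", "E"))  # B and C swapped

print(compare_topologies(tree_a, tree_b))
print(compare_topologies(tree_a, tree_c))
```

Splits that turn up in the trees from several programs are the parts of the topology you can have more confidence in. Splits found by only one program deserve a closer look.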

 
