4. About
These slides give a short introduction to phylogenetic
reconstruction in R. They focus mostly on the packages ape and
phangorn. I have to thank Emmanuel Paradis for his work on ape.
The slides are produced with literate programming using LaTeX,
Beamer, Sweave and R, so all the code and graphics are "real"!
5. Help
To install an R package it helps to have administrator rights.
Download R from www.cran.r-project.org. You can easily install
packages from within R:
> install.packages("phangorn")
> install.packages("phytools")
> install.packages("pegas")
> install.packages("seqLogo")
> q()
Then you can load the packages you need:
> library("phangorn")
> library("seqLogo")
6. Help
The R homepage provides lots of general documentation, FAQs, etc.
There are help pages for all the functions and most of them
contain examples.
> library(help="phangorn")
> help.start()
> ?pml
> help(pml)
> example(pml)
> vignette("Ancestral")
Copying and pasting parts of the code in the examples is a good
start. If you prefer reading a book (even though books become outdated quickly):
Paradis, E. (2012) Analysis of Phylogenetics and Evolution with R
(Second Edition). New York: Springer.
There is a mailing list, stat.ethz.ch/mailman/listinfo/r-sig-phylo,
where you can ask questions after browsing through the archive.
8. Data Structures
Reminder:
1. Data in R are made of vector + attribute(s) (and
combinations of these). Vector: a series of elements all of the
same kind (a list is a vector of pointers).
2. The class is the attribute determining the action of generic
functions (plot, summary, etc.)
We will make heavy use of the following three data structures:
1. phyDat: sequences (DNA, AA, codons, user defined) in
phangorn
2. DNAbin: DNA sequences (ape format)
3. phylo: phylogenetic trees
9. Class phylo
This class represents phylogenetic trees. Tip labels may be
replicated, and node labels may be absent. Input:
1. read.tree: Newick files
2. read.nexus: NEXUS files
If the file contains several trees, these two functions return an
object of class multiPhylo which is a list of trees of class phylo.
You can write objects of class phylo using write.tree or
write.nexus.
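As a small illustration of the input and output functions above, the following sketch parses trees from Newick strings instead of files. The Newick strings are made up for the example; the `text` argument of `read.tree` is used so that no file is needed.

```r
library(ape)

# Parse a single tree from a Newick string instead of a file.
tree <- read.tree(text = "((A,B),(C,D));")
class(tree)        # "phylo"
tree$tip.label     # tip labels
tree$Nnode         # number of internal nodes
tree$edge          # two-column matrix: parent node, child node

# Several Newick strings give an object of class "multiPhylo".
trees <- read.tree(text = c("((A,B),(C,D));", "((A,C),(B,D));"))
class(trees)       # "multiPhylo"
length(trees)      # 2

# With no file name given, write.tree returns the Newick string.
newick <- write.tree(tree)
```

Each element of `trees` is itself an object of class phylo, so `trees[[1]]` can be passed to any function that expects a single tree.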
10. Plotting trees
ape has great plotting capabilities.
> help(plot.phylo)
A simple example:
> tree <- rtree(10)
> par(mfrow=c(2,2), mar=rep(0,4))
> plot(tree)
> plot(tree, type="fan")
> plot(tree, type="unrooted")
> plot(tree, type="cladogram")
12. Transforming trees
There are many functions in ape and phangorn to transform trees
(i.e. objects of class phylo):
> root(tree, outgroup)
> drop.tip(tree, "t1")
> extract.clade(tree, node) # node: number of an internal node
> bind.tree(tree1, tree2)
> unroot(tree)
> multi2di(tree)
> di2multi(tree)
> nni(tree)
> rSPR(tree)
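The calls above are schematic. A runnable sketch of a few of them, using only ape, might look as follows; the seed, the tip names chosen for removal and rooting, and the node number are arbitrary choices for illustration.

```r
library(ape)

set.seed(1)                       # rtree() is random; fix the seed
tree <- rtree(10)                 # random rooted tree with tips "t1" ... "t10"

smaller  <- drop.tip(tree, "t1")  # remove one tip
unrooted <- unroot(tree)          # remove the root
rerooted <- root(unrooted, outgroup = "t2", resolve.root = TRUE)

# Internal nodes are numbered from Ntip(tree) + 1 (the root) upwards.
clade <- extract.clade(tree, node = Ntip(tree) + 2)

Ntip(smaller)         # 9
is.rooted(unrooted)   # FALSE
is.rooted(rerooted)   # TRUE
```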
13. Class phyDat
The starting point for phylogenetic reconstruction is a sequence
alignment. ape can call clustal, tcoffee and muscle, and
phyloch can call mafft, prank and gblocks.
More frequently you will just read in an alignment:
> align1 <- read.phyDat("myfile")
phangorn (phyDat) and ape (DNAbin) use different formats to
represent alignments, but it is easy to convert between them:
> align2 <- read.dna("myfile") # ape format
> align3 <- as.phyDat(align2) # phangorn format
15. Distance based methods
Distance methods take a distance or dissimilarity matrix as input.
Ultrametric    Additive
upgma (a)      fastme.ols
wpgma (a)      fastme.bal
               nj
               UNJ (a)
               bionj
(a) in phangorn, the rest in ape.
1. The methods are fast, O(n^2) or O(n^3), so big data sets can be
analysed.
2. Distances can be calculated for different kinds of data.
3. In phylogenetics they are often used to compute starting trees for ML
and MP, or inside species tree methods.
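To make the "ultrametric" column concrete, here is a minimal base-R sketch. It does not use phangorn or ape: it relies on the fact that average-linkage clustering in `hclust` is the UPGMA algorithm. The distance matrix and the helper `is_ultrametric` are made up for the illustration.

```r
# Toy distance matrix for four taxa (hypothetical values).
d <- matrix(c(0, 2, 5, 7,
              2, 0, 6, 6,
              5, 6, 0, 4,
              7, 6, 4, 0),
            nrow = 4, dimnames = list(LETTERS[1:4], LETTERS[1:4]))

# Ultrametric inequality: d(i,j) <= max(d(i,k), d(j,k)) for all i, j, k.
is_ultrametric <- function(m, tol = 1e-8) {
  n <- nrow(m)
  for (i in 1:n) for (j in 1:n) for (k in 1:n) {
    if (m[i, j] > max(m[i, k], m[j, k]) + tol) return(FALSE)
  }
  TRUE
}

hc   <- hclust(as.dist(d), method = "average")   # UPGMA clustering
coph <- as.matrix(cophenetic(hc))                # distances implied by the tree

is_ultrametric(d)      # FALSE: the input is not ultrametric
is_ultrametric(coph)   # TRUE: the UPGMA tree always is
hc$height              # merge heights 2, 4, 6
```

The tree distances differ from the input distances whenever the data are not ultrametric, which is why additive methods such as nj are usually preferred when rates differ between lineages.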
19. Maximum parsimony
In contrast to the distance methods, (maximum) parsimony uses
sequence alignments as input. The goal is to minimise an
optimality criterion, i.e. a score assigned to a tree given the data. For the
parsimony method the score is the minimal number of substitutions
needed to explain the data on a phylogeny.
> data(Laurasiatherian)
> tree = nj(dist.ml(Laurasiatherian))
> parsimony(tree, Laurasiatherian)
[1] 9776
> tree2 = optim.parsimony(tree, Laurasiatherian,
trace=FALSE, rearrangement="SPR")
> parsimony(tree2, Laurasiatherian)
[1] 9713
> tree3 = pratchet(Laurasiatherian, rearrangement="SPR", t
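For a single site, the minimal number of substitutions can be computed with Fitch's algorithm. The following is a minimal base-R sketch, not phangorn's implementation; the nested-list tree representation, the function name `fitch` and the toy site are assumptions made for the illustration.

```r
# Fitch's algorithm for one site on a rooted binary tree.
# A tree is a nested list; a tip is a character string (its label).
fitch <- function(node, states) {
  if (is.character(node)) {
    # Tip: the state set is the observed state, no changes needed.
    return(list(set = states[[node]], cost = 0))
  }
  left   <- fitch(node[[1]], states)
  right  <- fitch(node[[2]], states)
  common <- intersect(left$set, right$set)
  if (length(common) > 0) {
    # The children agree on at least one state: no extra change.
    list(set = common, cost = left$cost + right$cost)
  } else {
    # The children disagree: take the union and count one change.
    list(set = union(left$set, right$set),
         cost = left$cost + right$cost + 1)
  }
}

site  <- c(t1 = "a", t2 = "a", t3 = "g", t4 = "g")
tree1 <- list(list("t1", "t2"), list("t3", "t4"))
tree2 <- list(list("t1", "t3"), list("t2", "t4"))

fitch(tree1, site)$cost   # 1
fitch(tree2, site)$cost   # 2
```

The parsimony score of an alignment is the sum of these per-site scores, so `tree1` would be preferred for this site.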
20. Branch and bound
Normally it is not possible to evaluate an optimality criterion for all
trees, as there are just too many trees.
> sapply(3:10, howmanytrees, FALSE)
[1] 1 3 15 105 945 10395
[7] 135135 2027025
> howmanytrees(20, FALSE)
[1] 2.216431e+20
For small datasets it is possible to find all most parsimonious trees
using a branch and bound algorithm. For datasets with more than
10 taxa this can take a long time and depends strongly on how
tree-like the data are.
> besttree <- bab(subset(Laurasiatherian,1:10), trace=0)
> parsimony(besttree, Laurasiatherian)
[1] 2695
21. Ancestral reconstruction
To reconstruct ancestral sequences we first load some data and
reconstruct a tree:
> primates = read.phyDat("primates.dna")
> tree = pratchet(primates, trace=0)
> tree = acctran(tree, primates)
> parsimony(tree, primates)
[1] 746
In parsimony analysis the edge lengths represent the observed
number of changes. Reconstructing ancestral states therefore
also defines the edge lengths of a tree. However, there can be
several equally parsimonious reconstructions, or states can be
ambiguous, so edge lengths can differ between methods (e.g. "MPR" or
"ACCTRAN").
> anc.acctran = ancestral.pars(tree, primates, "ACCTRAN")
> anc.mpr = ancestral.pars(tree, primates, "MPR")
23. Ancestral reconstruction MPR
> plotAnc(tree, anc.mpr, 17)
> title("MPR")
[Figure: output of plotAnc for the MPR reconstruction, titled "MPR". Tips: Mouse, Bovine, Lemur, Tarsier, Squir Monk, Jpn Macaq, Rhesus Mac, Crab-E.Mac, BarbMacaq, Gibbon, Orang, Gorilla, Chimp, Human. Legend: a, c, g, t.]
24. Ancestral reconstruction ACCTRAN
> plotAnc(tree, anc.acctran, 17)
> title("ACCTRAN")
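[Figure: output of plotAnc for the ACCTRAN reconstruction, titled "ACCTRAN". Tips: Mouse, Bovine, Lemur, Tarsier, Squir Monk, Jpn Macaq, Rhesus Mac, Crab-E.Mac, BarbMacaq, Gibbon, Orang, Gorilla, Chimp, Human. Legend: a, c, g, t.]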
26. Maximum Likelihood
"[In 1961] I had visions of evolutionary tree estimation being much
the same [as linkage estimation] but with the addition of the
need to estimate the form of the tree itself, surely a fatal
complexity: my intuition was that there would be insufficient data
for the task."
—A.W.F. Edwards (2009)
Phylogenetic likelihood is the probability f(x | θ, τ) of observing an
alignment x given a model of (nucleotide) substitution with
parameters θ and phylogenetic tree τ:
\[ L(\theta, \tau, x) = \prod_{i=1}^{N} f(x_i \mid \theta, \tau) \]
where N is the number of sites in the alignment. It is common to
maximise the log-likelihood function
\[ \ell(\theta, \tau, x) = \sum_{i=1}^{N} \log f(x_i \mid \theta, \tau) \]
which also maximises L(θ, τ, x).
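One practical reason for working on the log scale is numerical: a product of many small site likelihoods underflows to zero in double precision, while the sum of their logarithms does not. A small base-R sketch with made-up site likelihoods:

```r
# 100 hypothetical site likelihoods, each equal to 1e-5.
site_lik <- rep(1e-5, 100)

lik    <- prod(site_lik)       # 1e-500 is below the smallest double: gives 0
loglik <- sum(log(site_lik))   # about -1151.29, no underflow

lik
loglik
```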
27. Applications in phylogenetics
Felsenstein (1981) introduced the pruning algorithm which made
the computation of the likelihood feasible. Let nodes j and k have
a direct ancestor h. We can compute the conditional likelihood
\[ L_h(x_h) = \left[ \sum_{x_j} L_j(x_j)\, p_{x_j, x_h}(t_j) \right] \times \left[ \sum_{x_k} L_k(x_k)\, p_{x_k, x_h}(t_k) \right] \]
The likelihood of the tree is evaluated by traversing the tree in
postorder fashion from the tips towards the root. For unrooted
trees, a root can be chosen arbitrarily as our models are
time-reversible. We get the likelihood of the tree if we multiply the
conditional likelihood of the root node r with the base composition
π, as
\[ f_h(x \mid \theta, \tau) = \sum_{x_r} \pi_{x_r} L_r(x_r). \]
These formulas can be adapted to estimate ancestral sequences.
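The pruning recursion can be sketched in a few lines of base R for one site. This is an illustration under stated assumptions, not phangorn's implementation: it uses the Jukes-Cantor model (equal base frequencies, branch lengths in expected substitutions per site), a made-up four-taxon tree, and hypothetical helper names `jc69_P`, `tip` and `cond_lik`.

```r
bases <- c("a", "c", "g", "t")

# JC69 transition probabilities for a branch of length t.
jc69_P <- function(t) {
  e <- exp(-4 * t / 3)
  P <- matrix(0.25 - 0.25 * e, 4, 4, dimnames = list(bases, bases))
  diag(P) <- 0.25 + 0.75 * e
  P
}

# Conditional likelihood vector of a tip: 1 for the observed base, else 0.
tip <- function(state) as.numeric(bases == state)

# Conditional likelihood at ancestor h from its children j and k.
cond_lik <- function(Lj, tj, Lk, tk) {
  as.vector(jc69_P(tj) %*% Lj) * as.vector(jc69_P(tk) %*% Lk)
}

# Tree ((t1:0.1,t2:0.2):0.05,(t3:0.1,t4:0.3):0.05); site states a, a, g, t.
L_h1   <- cond_lik(tip("a"), 0.1, tip("a"), 0.2)
L_h2   <- cond_lik(tip("g"), 0.1, tip("t"), 0.3)
L_root <- cond_lik(L_h1, 0.05, L_h2, 0.05)

base_freq <- rep(0.25, 4)                 # JC69: equal base frequencies
site_lik  <- sum(base_freq * L_root)

# Moving the root along the edge between h1 and h2 (here: onto h2)
# leaves the likelihood unchanged, because the model is time-reversible.
site_lik_alt <- sum(base_freq * cond_lik(L_h1, 0.1, L_h2, 0))
all.equal(site_lik, site_lik_alt)         # TRUE
```

The last comparison illustrates why the root of an unrooted tree can be placed arbitrarily when evaluating the likelihood.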
31. ML in phylogenetics
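[Figure: four-taxon tree annotated with conditional likelihood vectors.
Tips: 1|0|0|0, 1|0|0|0, 0|0|1|0, 0|0|0|1.
Internal nodes: 0.000988|0.000031|0.000595|0.000744, 0.027161|0.000559|0.016240|0.000559, 0.923613|0.000168|0.000168|0.000169.]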
32. Finding the best topology
A binary unrooted tree with 4 tips has 5 edges and 3 distinct topologies. Here
are the general formulas for binary unrooted trees with n tips:
2n − 3 edges
(2n − 5)!! = 1 × 3 × 5 × · · · × (2n − 5) topologies
Rooted binary trees have 2n − 2 edges and (2n − 3)!! topologies.
A function exists for this:
> howmanytrees(4, rooted=FALSE)
[1] 3
> howmanytrees(10, rooted=FALSE)
[1] 2027025
> howmanytrees(20, rooted=FALSE)
[1] 2.216431e+20
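The double factorials can also be evaluated directly in base R. The function names below are hypothetical helpers for this sketch; ape's `howmanytrees` remains the function to use in practice.

```r
# Number of unrooted binary topologies with n >= 3 tips: (2n - 5)!!
n_unrooted <- function(n) prod(seq(1, 2 * n - 5, by = 2))

# Number of rooted binary topologies with n >= 2 tips: (2n - 3)!!
n_rooted <- function(n) prod(seq(1, 2 * n - 3, by = 2))

n_unrooted(4)    # 3
n_unrooted(10)   # 2027025
n_rooted(3)      # 3
```

An unrooted tree with n tips has 2n − 3 edges on which a root can be placed, so `n_rooted(n)` equals `(2 * n - 3) * n_unrooted(n)`.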
33. Finding the best trees
Evaluating the likelihood criterion for all trees in
order to find the best tree topology is in most cases
impractical. Instead, local tree rearrangements are used to
search locally within the tree space. The idea behind such a
heuristic is to take a starting tree and search locally for improved
scores (parsimony, maximum likelihood, least squares) until no
further rearrangement leads to a tree with a better score.
34. Nearest neighbor interchange
For any internal edge of a binary tree there exist three different
ways to connect its four subtrees, one of which is the current tree.
[Figure: the three arrangements of subtrees A, B, C, D around an internal edge: (A,B | C,D), (A,C | B,D) and (A,D | B,C).]
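The three arrangements around one internal edge can be written down explicitly. The following base-R sketch builds them as Newick strings; the function names are hypothetical, and the subtrees may be tip labels or Newick sub-strings.

```r
# The three ways to connect four subtrees around an internal edge.
nni_arrangements <- function(a, b, c, d) {
  c(sprintf("((%s,%s),(%s,%s));", a, b, c, d),   # current tree
    sprintf("((%s,%s),(%s,%s));", a, c, b, d),   # swap b and c
    sprintf("((%s,%s),(%s,%s));", a, d, b, c))   # swap b and d
}

nni_arrangements("A", "B", "C", "D")

# An unrooted binary tree with n tips has n - 3 internal edges, each with
# two alternative arrangements, hence 2 * (n - 3) NNI neighbours.
n_nni_neighbours <- function(n) 2 * (n - 3)
n_nni_neighbours(10)   # 14
```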
35. Modelling rate variation
We assume that the substitution rate varies between different sites
(intron vs. exon, codon positions, etc.). Two approaches are
commonly used:
1. define different partitions
2. model rate variation with different rate categories, using a
(discrete) Γ distribution and/or a proportion of invariable sites
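A discrete Γ distribution replaces the continuous rate distribution by k equally probable rate categories. The base-R sketch below uses the median of each category and rescales the rates to mean 1. This is one common variant and an assumption of the example: software may instead use the category means, so the numbers need not match those of a particular package. The function name is hypothetical.

```r
# Rates of k equally probable categories of a Gamma(shape, rate = shape)
# distribution (mean 1), using the median of each category.
discrete_gamma_median <- function(shape, k = 4) {
  probs <- (2 * seq_len(k) - 1) / (2 * k)        # midpoints of the k bins
  rates <- qgamma(probs, shape = shape, rate = shape)
  rates / mean(rates)                            # rescale to mean rate 1
}

discrete_gamma_median(shape = 0.5, k = 4)   # strong rate heterogeneity
discrete_gamma_median(shape = 10,  k = 4)   # rates close to 1
```

A small shape parameter means that most sites evolve slowly and a few evolve fast; a large shape parameter approaches a single rate for all sites.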
36. Comparing trees and models
The phylogenetic likelihood allows us to compare many different
models or trees. There is often a bias vs. variance trade-off.
Simple models are easy to interpret but can often be biased.
[Figure: MSE, variance and squared bias plotted against the number of parameters.]
37. Comparing trees and models
The phylogenetic likelihood allows us to compare many different
models or trees.
1. If two models are nested, that is, one model can be described
as a special case of the other, then we can directly compare
their likelihoods under their ML parameter estimates for a
fixed tree using a likelihood ratio test (LRT).
2. For non-nested models we can use the Akaike Information
Criterion (AIC) or the Bayesian Information Criterion (BIC):
\[ \mathrm{AIC} = -2\,\ell(\theta, \tau, x) + 2\,df \]
\[ \mathrm{BIC} = -2\,\ell(\theta, \tau, x) + \ln(n)\,df \]
where ℓ is the log-likelihood, df is the number of parameters of the
model and n the number of sites.
3. Or use the Shimodaira-Hasegawa test or similar bootstrap
approaches.
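Both criteria are simple functions of the log-likelihood. A base-R sketch using the standard definitions (AIC = −2 log-likelihood + 2 df, BIC = −2 log-likelihood + ln(n) df); the function names and all numbers are hypothetical and serve only to show the arithmetic.

```r
aic <- function(loglik, df)    -2 * loglik + 2 * df
bic <- function(loglik, df, n) -2 * loglik + log(n) * df

# Two hypothetical models fitted to an alignment with n = 500 sites.
aic(loglik = -1000, df = 10)            # 2020
aic(loglik = -990,  df = 18)            # 2016: preferred by AIC
bic(loglik = -1000, df = 10, n = 500)   # smaller BIC: preferred by BIC
bic(loglik = -990,  df = 18, n = 500)
```

Smaller values are better for both criteria. BIC penalises additional parameters more strongly than AIC as soon as ln(n) > 2, so the two criteria can disagree, as in this example.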
38. Detection of molecular adaptation
We look at each triplet of nucleotides (codon) and assume that only one
nucleotide can be replaced at a time. Then we can distinguish
between nucleotide substitutions that result in the same amino
acid (synonymous substitutions) and those that result in a different
amino acid (non-synonymous substitutions). The ratio dN/dS of
non-synonymous to synonymous substitutions can indicate
the kind of selective pressure acting on the codon site. Under
negative selection, we expect non-synonymous substitutions
to accumulate more slowly than synonymous ones. Under
positive or diversifying selection, we expect more
amino-acid-changing replacements.
39. Applications with phangorn
The two main functions are pml to set up the model and
optim.pml for optimising parameters and the tree with ML.
Example session for Jukes Cantor, GTR and GTR+Γ+I model:
> data(Laurasiatherian)
> tr <- nj(dist.ml(Laurasiatherian))
> m0 <- pml(tr, Laurasiatherian)
> m.jc69 <- optim.pml(m0, optNni=TRUE)
> m.gtr <- optim.pml(m0, optNni=TRUE, model="GTR")
> m.gtr.G.I <- optim.pml(update(m.gtr, k=4), model=
"GTR", optNni=TRUE, optGamma=TRUE, optInv=TRUE)
By default, only the edge lengths are optimized. Currently
phangorn only supports NNI tree rearrangements (equivalent to
PhyML vers. 2)
40. There exist several useful generic functions like update, anova or
AIC for objects of class pml.
> methods(class="pml")
[1] anova.pml logLik.pml plot.pml print.pml
[5] update.pml vcov.pml
For example, because the models are nested, we can compare
them with a likelihood ratio test:
> anova(m.jc69, m.gtr, m.gtr.G.I)
Likelihood Ratio Test Table
Log lik. Df Df change Diff log lik. Pr(>|Chi|)
1 -54113 91
2 -50603 99 8 7020 < 2.2e-16 ***
3 -44527 101 2 12151 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘
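The test statistics in this table can be recomputed by hand from the rounded log-likelihoods and degrees of freedom it reports. A base-R sketch (the vector names are chosen for the example):

```r
# Values copied from the anova table above (rounded log-likelihoods).
loglik <- c(jc69 = -54113, gtr = -50603, gtr_g_i = -44527)
df     <- c(91, 99, 101)

stat <- 2 * diff(loglik)   # likelihood ratio statistics: 7020, 12152
ddf  <- diff(df)           # change in degrees of freedom: 8, 2
p    <- pchisq(stat, df = ddf, lower.tail = FALSE)

stat
p                          # both far below 2.2e-16
```

The second statistic differs from the table by one unit (12152 vs. 12151) only because the log-likelihoods used here are rounded.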
41. Partition models
pmlPart(global ~ local, object, model)
global local
bf bf
Q Q
inv inv
shape shape
edge edge
rate
nni
Each component can be used only once in the formula.
42. Partition models
There are two different ways to set up partition models.
1. Setting up partition models for different genes.
> fit1 <- pml(tree, g1)
> fit2 <- pml(tree, g2)
> fit3 <- pml(tree, g3)
> fit4 <- pml(tree, g4)
> genePart <- pmlPart(Q + bf ~ edge,
list(fit1, fit2, fit3, fit4), optRooted=TRUE)
> trees <- lapply(genePart$fits, function(x)x$tree)
> class(trees) <- "multiPhylo"
> densiTree(trees, type="phylogram", col="red")
where g1, g2, g3 and g4 are objects of class phyDat.
44. Partition models
2. Partitioning via a weight matrix.
> woody <- phyDat(woodmouse)
> tree <- nj(dist.ml(woody))
> fit <- pml(tree, woody)
> w <- attr(woody, "index")
> weight <- table(w, rep(c(1,2,3), length=length(w)))
> codonPart <- pmlPart(edge ~ rate, fit,
model=c("JC", "JC", "GTR"), weight=weight)
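How the `table()` call builds the weight matrix is easiest to see on a toy example in base R. The index vector below is made up: it stands for the site-pattern number of each of six alignment columns, which is the kind of information the "index" attribute is used for above.

```r
# Hypothetical site-pattern index for an alignment with 6 sites:
# sites 1 and 4 share pattern 1, sites 2 and 5 share pattern 2, etc.
w   <- c(1, 2, 3, 1, 2, 4)
pos <- rep(c(1, 2, 3), length.out = length(w))   # codon position of each site

# Rows: site patterns. Columns: partitions (codon positions).
# Each entry counts how often a pattern occurs in a partition.
weight <- table(w, pos)
weight

rowSums(weight)   # overall weight of each site pattern: 2 2 1 1
colSums(weight)   # number of sites per codon position:  2 2 2
```

Each column of the matrix then serves as the pattern weights of one partition, here one per codon position.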
45. Model / tree comparison
Alternatively we can use the Shimodaira-Hasegawa test to check
for differences between models:
> SH.test(m.jc69, m.gtr, m.gtr.G.I)
Trees ln L Diff ln L p-value
[1,] 1 -54112.74 9585.685 0.0000
[2,] 2 -50602.74 6075.683 0.0000
[3,] 3 -44527.06 0.000 0.5911
46. Model selection
Two possibilities
ape: phymltest
> write.phyDat(woody, "woody.phy")
> out <- phymltest("woody.phy", execname =
"~/phyml")
phangorn: modelTest
> mt <- modelTest(Laurasiatherian, model=c("JC",
"F81", "HKY", "GTR"))
modelTest also works for amino acid models, similar to ProtTest.
> mt <- modelTest(myAAData, model=c("WAG", "JTT",
"LG","Dayhoff"))
49. Codon Models
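\[ q_{ij} = \begin{cases}
0 & \text{if } i \text{ and } j \text{ differ in more than one position} \\
\pi_j & \text{for synonymous transversion} \\
\pi_j \kappa & \text{for synonymous transition} \\
\pi_j \omega & \text{for non-synonymous transversion} \\
\pi_j \omega \kappa & \text{for non-synonymous transition}
\end{cases} \]
or, if we leave out π_j (the equilibrium frequency of codon j):
\[ q_{ij} = \begin{cases}
0 & \text{if } i \text{ and } j \text{ differ in more than one position} \\
1 & \text{for synonymous transversion} \\
\kappa & \text{for synonymous transition} \\
\omega & \text{for non-synonymous transversion} \\
\omega \kappa & \text{for non-synonymous transition}
\end{cases} \]
where ω is the dN/dS ratio, κ the transition/transversion ratio and
π_j is the equilibrium frequency of codon j.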
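The case distinction without the codon frequencies can be sketched in base R. This is an illustration, not phangorn's implementation: it assumes the standard genetic code, does not treat stop codons specially, and the function name `codon_rate` is hypothetical.

```r
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
bases  <- c("T", "C", "A", "G")
g      <- expand.grid(p3 = bases, p2 = bases, p1 = bases,
                      stringsAsFactors = FALSE)
codons <- paste0(g$p1, g$p2, g$p3)
aa <- strsplit(paste0("FFLLSSSSYY**CC*W", "LLLLPPPPHHQQRRRR",
                      "IIIMTTTTNNKKSSRR", "VVVVAAAADDEEGGGG"), "")[[1]]
names(aa) <- codons

# Relative rate from codon i to codon j (without the factor pi_j).
codon_rate <- function(i, j, kappa, omega) {
  x <- strsplit(i, "")[[1]]
  y <- strsplit(j, "")[[1]]
  pos <- which(x != y)
  if (length(pos) != 1) return(0)    # identical, or more than one difference
  purines    <- c("A", "G")
  transition <- (x[pos] %in% purines) == (y[pos] %in% purines)
  synonymous <- aa[[i]] == aa[[j]]
  (if (transition) kappa else 1) * (if (synonymous) 1 else omega)
}

codon_rate("TTT", "TTC", kappa = 2, omega = 0.3)  # synonymous transition
codon_rate("CTT", "CTA", kappa = 2, omega = 0.3)  # synonymous transversion
codon_rate("TTT", "TTA", kappa = 2, omega = 0.3)  # non-synonymous transversion
codon_rate("TTT", "CTT", kappa = 2, omega = 0.3)  # non-synonymous transition
codon_rate("TTT", "CCT", kappa = 2, omega = 0.3)  # two differences: 0
```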
50. Codon Models
> (dat <- phyDat(as.character(yeast), "CODON"))
> tree <- nj(dist.ml(yeast))
> fit <- pml(tree, dat)
> ctr <- pml.control(trace=0)
> fit0 <- optim.pml(fit, control = ctr)
> fit1 <- optim.pml(fit0, model="codon1", control=ctr)
> fit2 <- optim.pml(fit0, model="codon2", control=ctr)
> fit3 <- optim.pml(fit0, model="codon3", control=ctr)
Model κ ω
codon0 1 1
codon1 free free
codon2 1 free
codon3 free 1
Additionally, the equilibrium frequencies of the codons π_j can be
estimated by setting the parameter optBf=TRUE.