Phylogenetics Analysis in R

Phylogenetic Analyses in R
Klaus Schliep
Universidad de Vigo
Porto, 15–16 July 2013

Outline
Getting started
Data Structures
Distance based methods
Maximum Parsimony
Maximum likelihood

About
These slides give a short introduction to phylogenetic
reconstruction in R. They focus mostly on the packages ape and
phangorn. I have to thank Emmanuel Paradis for his work on ape.
The slides are produced with literate programming using LaTeX,
Beamer, Sweave and R, so all the code and graphics are "real"!

Help
To install an R package it is good to have administrator rights.
Download R from www.cran.r-project.org. You can easily install
packages from within R:
> install.packages("phangorn")
> install.packages("phytools")
> install.packages("pegas")
> install.packages("seqLogo")
> q()
Then you can load the packages you need:
> library("phangorn")
> library("seqLogo")

Help
The R homepage provides lots of general documentation, FAQs, etc.
There are help pages for all the functions, and most of them
contain examples.
> library(help="phangorn")
> help.start()
> ?pml
> help(pml)
> example(pml)
> vignette("Ancestral")
Copying and pasting parts of the code in the examples is a good
start. If you prefer reading a book (even though books become
outdated quickly):
Paradis, E. (2012) Analysis of Phylogenetics and Evolution with R
(Second Edition) New York: Springer
There is a mailing list, stat.ethz.ch/mailman/listinfo/r-sig-phylo,
where you can ask questions after browsing through the archive.

Data Structures
Reminder:
1. Data in R are made of vectors plus attributes (and
combinations of these). A vector is a series of elements all of
the same kind (a list is a vector of pointers).
2. The class is the attribute that determines the action of
generic functions (plot, summary, etc.).
We will make heavy use of the following 3 data structures:
1. phyDat: sequences (DNA, AA, codons, user defined) in
phangorn
2. DNAbin: DNA sequences (ape format)
3. phylo: phylogenetic trees
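
As a reminder of how attributes and classes interact, the following base R sketch turns a vector into a matrix by attaching a dim attribute and then defines a print method for a made-up class. The class name toytree and its label field are hypothetical and only serve the illustration; they are not part of ape or phangorn.

```r
# A vector becomes a matrix once it carries a "dim" attribute
x <- 1:6
attr(x, "dim") <- c(2, 3)
is_mat <- is.matrix(x)

# A list with a class attribute; "toytree" is a hypothetical class
tr <- structure(list(label = c("t1", "t2", "t3")), class = "toytree")

# The generic print() dispatches on the class attribute
print.toytree <- function(x, ...) {
  cat("toy tree with", length(x$label), "tips\n")
  invisible(x)
}
msg <- capture.output(print(tr))
```

Objects of class phylo, phyDat and DNAbin work the same way: plot, print and summary pick the method that matches the class attribute.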

Class phylo
This class represents phylogenetic trees. Tip labels may be
replicated; node labels may be absent. Input:
1. read.tree: Newick files
2. read.nexus: NEXUS files
If the file contains several trees, these two functions return an
object of class multiPhylo, which is a list of trees of class phylo.
You can write objects of class phylo using write.tree or
write.nexus.

Plotting trees
ape has great plotting capabilities.
> help(plot.phylo)
A simple example:
> tree <- rtree(10)
> par(mfrow=c(2,2), mar=rep(0,4))
> plot(tree)
> plot(tree, type="fan")
> plot(tree, type="unrooted")
> plot(tree, type="cladogram")

Plotting trees
[Figure: the same random 10-tip tree (tips t1 to t10) drawn four times, with the default type and with type = "fan", "unrooted" and "cladogram".]

Transforming trees
There are many functions in ape and phangorn to transform trees
(i.e. objects of class phylo):
> root(tree, outgroup)
> drop.tip(tree, "t1")
> extract.clade(tree, node)
> bind.tree(tree1, tree2)
> unroot(tree)
> multi2di(tree)
> di2multi(tree)
> nni(tree)
> rSPR(tree)

Class phyDat
The starting point for phylogenetic reconstruction is a sequence
alignment. ape can call clustal, tcoffee and muscle, and
phyloch can call mafft, prank and gblocks.
More frequently you will just read in an alignment:
> align1 <- read.phyDat("myfile")
phangorn (phyDat) and ape (DNAbin) use different formats to
represent alignments, but it is easy to convert between them.
> align2 <- read.dna("myfile") # ape format
> align3 <- as.phyDat(align2) # phangorn format

Distance based methods

Distance methods take a distance or dissimilarity matrix as input.
Ultrametric   Additive
upgma (a)     fastme.ols
wpgma (a)     fastme.bal
              nj
              UNJ (a)
              bionj
(a) in phangorn; the rest are in ape.
Fast methods, O(n^2) or O(n^3) → big data sets can be
analysed.
Distances can be calculated for different kinds of data.
In phylogenetics, distance methods are often used to compute
starting trees for ML or MP, or inside species tree methods.
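
To make the input of distance methods concrete, here is a minimal base R sketch that computes pairwise p-distances, applies the Jukes-Cantor correction d = -3/4 ln(1 - 4p/3), and clusters with average linkage, which is the UPGMA criterion. The four sequences are invented for illustration; for real data one would rather use dist.ml or dist.dna together with upgma or nj.

```r
# Hypothetical toy alignment (4 sequences, 10 sites)
seqs <- c(s1 = "ACGTACGTAC", s2 = "ACGTACGTAT",
          s3 = "ACGAACGTTT", s4 = "TCGAACGATT")
chars <- do.call(rbind, strsplit(seqs, ""))
n <- nrow(chars)

# p-distance: proportion of sites at which two sequences differ
p <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    p[i, j] <- mean(chars[i, ] != chars[j, ])
  }
}

# Jukes-Cantor correction for multiple substitutions (needs p < 0.75)
d_jc <- -3 / 4 * log(1 - 4 / 3 * p)

# Average-linkage clustering on the corrected distances (UPGMA criterion)
clust <- hclust(as.dist(d_jc), method = "average")
```

The corrected distances are always at least as large as the p-distances, and the first pair joined is the one with the smallest distance (here s1 and s2).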

> set.seed(1)
> bs <- bootstrap.phyDat(Laurasiatherian, FUN = function(x
> class(bs) <- 'multiPhylo'
> cnet = consensusNet(bs, .3)
> plot(cnet, show.tip.label=FALSE, show.nodes=TRUE)

Maximum parsimony
In contrast to the distance methods, (maximum) parsimony uses
sequence alignments as input. The aim is to minimize an
optimality criterion, i.e. a score assigned to a tree given the
data. For the parsimony method the score is the minimal number of
substitutions needed to account for the data on a phylogeny.
> data(Laurasiatherian)
> tree = nj(dist.ml(Laurasiatherian))
> parsimony(tree, Laurasiatherian)
[1] 9776
> tree2 = optim.parsimony(tree, Laurasiatherian,
trace=FALSE, rearrangement="SPR")
> parsimony(tree2, Laurasiatherian)
[1] 9713
> tree3 = pratchet(Laurasiatherian, rearrangement="SPR", t
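
How such a score arises for a single site can be sketched with Fitch's algorithm in base R. This is an illustration only and not the code used by phangorn; the tree is written as a nested list, and the site pattern is invented.

```r
# Fitch parsimony for one site on a rooted binary tree.
# A leaf is a character state; an internal node is a list of two children.
fitch <- function(node) {
  if (is.character(node)) {
    return(list(states = node, cost = 0))
  }
  left <- fitch(node[[1]])
  right <- fitch(node[[2]])
  common <- intersect(left$states, right$states)
  if (length(common) > 0) {
    # the children agree on at least one state: no extra change needed
    list(states = common, cost = left$cost + right$cost)
  } else {
    # the children disagree: take the union and count one substitution
    list(states = union(left$states, right$states),
         cost = left$cost + right$cost + 1)
  }
}

# Hypothetical site pattern on the tree ((a, a), (g, t))
site <- list(list("a", "a"), list("g", "t"))
score <- fitch(site)$cost
```

For this pattern the score is 2, since three different states need at least two changes. The parsimony score of an alignment is the sum of such scores over all sites.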

Branch and bound
Normally it is not possible to evaluate an optimality criterion for all
trees, as there are just too many trees.
> sapply(3:10, howmanytrees, FALSE)
[1] 1 3 15 105 945 10395
[7] 135135 2027025
> howmanytrees(20, FALSE)
[1] 2.216431e+20
For small datasets it is possible to find all most parsimonious trees
using a branch and bound algorithm. For datasets with more than
10 taxa this can take a long time, and the running time depends
strongly on how tree-like the data are.
> besttree <- bab(subset(Laurasiatherian,1:10), trace=0)
> parsimony(besttree, Laurasiatherian)
[1] 2695

Ancestral reconstruction
To reconstruct ancestral sequences we first load some data and
reconstruct a tree:
> primates = read.phyDat("primates.dna")
> tree = pratchet(primates, trace=0)
> tree = acctran(tree, primates)
> parsimony(tree, primates)
[1] 746
In parsimony analysis the edge lengths represent the observed
number of changes. Reconstructing ancestral states therefore
also defines the edge lengths of a tree. However, there can exist
several equally parsimonious reconstructions, or states can be
ambiguous, and therefore edge lengths can differ between methods
(e.g. "MPR" or "ACCTRAN").
> anc.acctran = ancestral.pars(tree, primates, "ACCTRAN")
> anc.mpr = ancestral.pars(tree, primates, "MPR")

Ancestral reconstruction
> seqLogo( t(subset(anc.mpr, getRoot(tree), 1:20)[[1]]), i
[Figure: sequence logo of the reconstructed root states for positions 1 to 20; x-axis: Position, y-axis: Probability.]

Ancestral reconstruction MPR
> plotAnc(tree, anc.mpr, 17)
> title("MPR")
[Figure: MPR reconstruction for site 17 plotted on the primate tree (14 tips, from Mouse to Human); legend: a, c, g, t.]

Ancestral reconstruction ACCTRAN
> plotAnc(tree, anc.acctran, 17)
> title("ACCTRAN")
[Figure: ACCTRAN reconstruction for site 17 plotted on the primate tree (14 tips, from Mouse to Human); legend: a, c, g, t.]

Maximum Likelihood
"[In 1961] I had visions of evolutionary tree estimation being much
the same [as linkage estimation] but with the addition of the
need to estimate the form of the tree itself, surely a fatal
complexity: my intuition was that there would be insufficient data
for the task."
—A.W.F. Edwards (2009)
Phylogenetic likelihood is the probability $f(x \mid \theta, \tau)$ of observing an
alignment $x$ given a model of (nucleotide) substitution with
parameters $\theta$ and phylogenetic tree $\tau$.
$$L(\theta, \tau, x) = \prod_{i=1}^{N} f(x_i \mid \theta, \tau)$$
where N is the number of sites in the alignment. It is common to
maximise the log-likelihood function
$$\ell(\theta, \tau, x) = \sum_{i=1}^{N} \log f(x_i \mid \theta, \tau),$$
which also maximises $L(\theta, \tau, x)$.

Applications in phylogenetics
Felsenstein (1981) introduced the pruning algorithm, which made
the computation of the likelihood feasible. Let nodes j and k have
a direct ancestor h. We can compute the conditional likelihood
$$L_h(x_h) = \left[ \sum_{x_j} L_j(x_j)\, p_{x_j, x_h}(t_j) \right] \times \left[ \sum_{x_k} L_k(x_k)\, p_{x_k, x_h}(t_k) \right]$$
where $p_{x_j, x_h}(t_j)$ is the transition probability along the edge
of length $t_j$ leading to node j.
The likelihood of the tree is evaluated by traversing the tree in
postorder fashion from the tips towards the root. For unrooted
trees, a root can be chosen arbitrarily as our models are
time-reversible. We get the likelihood of a site if we weight the
conditional likelihood of the root node r by the base composition
π and sum over the states:
$$f(x_i \mid \theta, \tau) = \sum_{x_r} \pi_{x_r} L_r(x_r)$$
These formulas can be adapted to estimate ancestral sequences.

ML in phylogenetics
[Figure: tree with tips human, chimp, gorilla, orangutan and internal nodes 5, 6, 7.]
Conditional likelihood vectors at the tips (human, chimp, gorilla, orangutan):
1|0|0|0 1|0|0|0 0|0|1|0 0|0|0|1
Conditional likelihood vectors at the internal nodes:
0.000988|0.000031|0.000595|0.000744
0.027161|0.000559|0.016240|0.000559
0.923613|0.000168|0.000168|0.000169
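
The pruning recursion for such a four-taxon example can be sketched in base R. Everything specific here is an assumption made for illustration: the Jukes-Cantor model, a branch length of 0.1 on every edge, the ladder-shaped tree (((human, chimp), gorilla), orangutan) and the reading of the tip vectors as states a, a, g, t. The numbers will therefore not match the slide.

```r
bases <- c("a", "c", "g", "t")

# JC69 transition probabilities for an edge of length t
p_jc <- function(t) {
  P <- matrix(1 / 4 - 1 / 4 * exp(-4 * t / 3), 4, 4,
              dimnames = list(bases, bases))
  diag(P) <- 1 / 4 + 3 / 4 * exp(-4 * t / 3)
  P
}

# Conditional likelihood vector of a tip with an observed state
tip <- function(state) as.numeric(bases == state)

# Conditional likelihood of a parent from its two children
combine <- function(Lj, tj, Lk, tk) {
  as.vector(p_jc(tj) %*% Lj) * as.vector(p_jc(tk) %*% Lk)
}

# Postorder traversal: cherry first, then towards the root
L_cherry <- combine(tip("a"), 0.1, tip("a"), 0.1)  # human, chimp
L_mid <- combine(L_cherry, 0.1, tip("g"), 0.1)     # + gorilla
L_root <- combine(L_mid, 0.1, tip("t"), 0.1)       # + orangutan

# Weight the root vector by the base frequencies (1/4 each under JC69)
site_lik <- sum(0.25 * L_root)
```

Because the model is time-reversible, moving the root to the middle node and merging the two root edges into one edge of length 0.2 gives the same site likelihood.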

Finding the best topology
A binary unrooted tree with 4 tips has 5 edges and 3 distinct
topologies. Here are the general formulas for binary unrooted trees
with n tips:
2n − 3 edges
(2n − 5)!! = 1 × 3 × 5 × · · · × (2n − 5) topologies
Rooted binary trees have 2n − 2 edges and (2n − 3)!! topologies.
A function exists for this:
> howmanytrees(4, rooted=FALSE)
[1] 3
> howmanytrees(10, rooted=FALSE)
[1] 2027025
> howmanytrees(20, rooted=FALSE)
[1] 2.216431e+20
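
The double factorials are easy to check in base R. This sketch is independent of howmanytrees and only covers binary trees with at least 3 tips; the function names are hypothetical.

```r
# Unrooted binary topologies with n >= 3 tips: (2n - 5)!!
n_unrooted <- function(n) prod(seq(1, 2 * n - 5, by = 2))

# Rooted binary topologies with n >= 3 tips: (2n - 3)!!
n_rooted <- function(n) prod(seq(1, 2 * n - 3, by = 2))

unrooted_counts <- sapply(3:10, n_unrooted)
```

A rooted tree with n tips corresponds to an unrooted tree with n + 1 tips, so n_rooted(n) equals n_unrooted(n + 1).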

Finding the best trees
The strategy of evaluating the likelihood criterion for all trees in
order to find the best tree topology is in most cases highly
impractical. Instead, local tree rearrangements are used to
search locally within the tree space. The idea behind such a
heuristic is to use a starting tree and search locally for improved
scores (parsimony, maximum likelihood, least squares), until no
further rearrangement leads to a tree with a better score.

Nearest neighbor interchange
For any internal edge of a binary tree there exist three different
ways to connect its four subtrees, one of which is the current tree.
[Figure: the three arrangements of the subtrees A, B, C, D around an internal edge: AB|CD, AC|BD, AD|BC.]

Modelling rate variation
We assume that the substitution rate varies between different sites
(intron vs. exon, codon positions, etc). Two approaches are
commonly used:
define different partitions;
model rate variation with different rate categories, with a
(discrete) Γ distribution and/or a proportion of invariable sites.
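
One common way to discretise the Γ distribution is to split it into k equally probable categories and represent each category by its mean rate. The base R sketch below follows that scheme with mean rate 1 (shape equal to rate); it is an illustration under these assumptions, not necessarily what phangorn does internally, and the function name is hypothetical.

```r
# Mean rates of k equal-probability categories of Gamma(shape, rate = shape)
discrete_gamma_rates <- function(shape, k = 4) {
  # category boundaries at the quantiles 0, 1/k, ..., 1
  cuts <- qgamma((0:k) / k, shape = shape, rate = shape)
  # the integral of x * f(x) over a bin is a difference of Gamma(shape + 1)
  # distribution functions; dividing by the bin probability 1/k gives the mean
  diff(pgamma(cuts, shape = shape + 1, rate = shape)) * k
}

rates <- discrete_gamma_rates(shape = 0.5, k = 4)
```

The category rates average to 1, so edge lengths keep their interpretation as expected substitutions per site. A small shape parameter gives strongly unequal rates; a large one approaches equal rates.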

Comparing trees and models
The phylogenetic likelihood allows us to compare many different
models or trees. There is often a bias vs. variance trade-off.
Simple models are easy to interpret but can often be biased.
[Figure: MSE, variance and squared bias plotted against the number of parameters.]

Comparing trees and models
The phylogenetic likelihood allows us to compare many different
models or trees.
If two models are nested, that is, one model can be described
as a special case of the other, then we can directly compare
their likelihoods under their ML parameter estimates for a
fixed tree using a likelihood ratio test (LRT).
For non-nested models we can use the Akaike Information
Criterion (AIC) or the Bayesian Information Criterion (BIC):
AIC = −2 ℓ(θ, τ, x) + 2 ∗ df
BIC = −2 ℓ(θ, τ, x) + ln(n) ∗ df
where ℓ is the log-likelihood, df is the number of parameters of
the model and n the number of sites.
Or use the Shimodaira-Hasegawa test or similar bootstrap
approaches.

Detection of molecular adaptation
We look at each triplet of nucleotides and assume that only one
nucleotide can be replaced at a time. Then we can distinguish
between nucleotide substitutions that result in the same amino
acid (synonymous substitutions) and those that result in a
different amino acid (non-synonymous substitutions). The ratio
dN/dS of non-synonymous to synonymous substitutions can be an
indication of the kind of selective pressure acting on the codon
site. Under negative selection, we expect non-synonymous
substitutions to accumulate more slowly than synonymous ones.
Under positive or diversifying selection, we expect more amino
acid changing replacements.

Applications with phangorn
The two main functions are pml to set up the model and
optim.pml for optimising parameters and the tree with ML.
Example session for the Jukes-Cantor, GTR and GTR+Γ+I models:
> data(Laurasiatherian)
> tr <- nj(dist.ml(Laurasiatherian))
> m0 <- pml(tr, Laurasiatherian)
> m.jc69 <- optim.pml(m0, optNni=TRUE)
> m.gtr <- optim.pml(m0, optNni=TRUE, model="GTR")
> m.gtr.G.I <- optim.pml(update(m.gtr, k=4), model=
"GTR", optNni=TRUE, optGamma=TRUE, optInv=TRUE)
By default, only the edge lengths are optimized. Currently
phangorn only supports NNI tree rearrangements (equivalent to
PhyML vers. 2)

There exist several useful generic functions like update, anova or
AIC for objects of class pml.
> methods(class="pml")
[1] anova.pml logLik.pml plot.pml print.pml
[5] update.pml vcov.pml
For example, since the different models are nested, we can compare
them with a likelihood ratio test:
> anova(m.jc69, m.gtr, m.gtr.G.I)
Likelihood Ratio Test Table
Log lik. Df Df change Diff log lik. Pr(>|Chi|)
1 -54113 91
2 -50603 99 8 7020 < 2.2e-16 ***
3 -44527 101 2 12151 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘
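
The test statistic in this table can be recomputed by hand: it is twice the difference in log-likelihood, compared against a chi-squared distribution whose degrees of freedom equal the difference in the number of parameters. The sketch below uses the rounded log-likelihoods printed above, so the last digit can differ from the anova output; the helper name lrt is hypothetical.

```r
# Likelihood ratio test for two nested models (model 0 nested in model 1)
lrt <- function(ll0, df0, ll1, df1) {
  stat <- 2 * (ll1 - ll0)
  ddf <- df1 - df0
  c(statistic = stat, df = ddf,
    p.value = pchisq(stat, df = ddf, lower.tail = FALSE))
}

jc_vs_gtr <- lrt(-54113, 91, -50603, 99)
gtr_vs_gtr_gi <- lrt(-50603, 99, -44527, 101)
```

With statistics this large the p-values underflow to values below 2.2e-16, which is why R prints them in that form.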

Partition models
pmlPart(global ~ local, object, model)
global local
bf bf
Q Q
inv inv
shape shape
edge edge
rate
nni
Each component can be used only once in the formula.

Partition models
There are two different ways to set up partition models.
1. Setting up partition models for different genes.
> fit1 <- pml(tree, g1) # likewise fit2, fit3, fit4 from g2, g3, g4
> genePart <- pmlPart(Q + bf ~ edge,
list(fit1, fit2, fit3, fit4), optRooted=TRUE)
> trees <- lapply(genePart$fits, function(x) x$tree)
> class(trees) <- "multiPhylo"
> densiTree(trees, type="phylogram", col="red")
where g1, g2, g3 and g4 are objects of class phyDat.

ML in phylogenetics
[Figure: tree plot with tips Scer, Spar, Smik, Skud, Sbay, Scas, Sklu, Calb.]

Partition models
2. Partitioning via a weight matrix.
> data(woodmouse) # example DNAbin data shipped with ape
> woody <- phyDat(woodmouse)
> tree <- nj(dist.ml(woody))
> fit <- pml(tree, woody)
> w <- attr(woody, "index")
> weight <- table(w, rep(c(1,2,3), length=length(w)))
> codonPart <- pmlPart(edge ~ rate, fit,
model=c("JC", "JC", "GTR"), weight=weight)

Model / tree comparison
Alternatively we can use the Shimodaira-Hasegawa test to check
for differences between models:
> SH.test(m.jc69, m.gtr, m.gtr.G.I)
Trees ln L Diff ln L p-value
[1,] 1 -54112.74 9585.685 0.0000
[2,] 2 -50602.74 6075.683 0.0000
[3,] 3 -44527.06 0.000 0.5911

Model selection
Two possibilities
ape: phymltest
> write.phyDat(woody, "woody.phy")
> out <- phymltest("woody.phy", execname =
"~/phyml")
phangorn: modelTest
> mt <- modelTest(Laurasiatherian, model=c("JC",
"F81", "HKY", "GTR"))
modelTest also works for amino acid models, similar to ProtTest.
> mt <- modelTest(myAAData, model=c("WAG", "JTT",
"LG","Dayhoff"))

Model Selection
Model df logLik AIC BIC
1 JC 91.00 -54303.67 108789.35 109341.20
2 JC+I 92.00 -50673.32 101530.63 102088.55
3 JC+G 92.00 -48684.10 97552.19 98110.11
4 JC+G+I 93.00 -48605.03 97396.06 97960.05
5 F81 94.00 -54212.64 108613.27 109183.32
6 F81+I 95.00 -50549.53 101289.06 101865.17
7 F81+G 95.00 -48500.49 97190.99 97767.10
8 F81+G+I 96.00 -48416.26 97024.51 97606.69
9 HKY 95.00 -51275.86 102741.72 103317.83
10 HKY+I 96.00 -47451.73 95095.45 95677.63
11 HKY+G 96.00 -44893.11 89978.23 90560.40
12 HKY+G+I 97.00 -44770.18 89734.36 90322.60
13 GTR 99.00 -50759.89 101717.79 102318.16
14 GTR+I 100.00 -47081.77 94363.55 94969.98
15 GTR+G 100.00 -44759.49 89718.99 90325.42
16 GTR+G+I 101.00 -44624.02 89450.04 90062.54
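
The AIC and BIC columns of this table can be reproduced from the df and logLik columns in base R. The number of sites, n = 3179, is an assumption about the length of the Laurasiatherian alignment (it is the value consistent with the BIC column); the helper name ic is hypothetical.

```r
n_sites <- 3179  # assumed alignment length of Laurasiatherian

# AIC = -2 logLik + 2 df, BIC = -2 logLik + log(n) df
ic <- function(logLik, df, n) {
  c(AIC = -2 * logLik + 2 * df,
    BIC = -2 * logLik + log(n) * df)
}

jc <- ic(-54303.67, 91, n_sites)       # row 1 of the table
gtr_gi <- ic(-44624.02, 101, n_sites)  # row 16 of the table
```

Up to rounding of the printed log-likelihoods, both rows agree with the table. Both criteria prefer GTR+G+I, the model with the smallest values.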

Bootstrap
> bs <- bootstrap.pml(m.gtr, bs=100, optNni=TRUE)
> plotBS(m.gtr$tree, bs, type="phylo", bs.adj=c(.5,0))
[Figure: ML tree of the Laurasiatherian data (47 tips, from Platypus to GraySeal) with bootstrap support values on the edges.]

Codon Models
$$q_{ij} = \begin{cases}
0 & \text{if $i$ and $j$ differ in more than one position} \\
\pi_j & \text{for synonymous transversion} \\
\pi_j \kappa & \text{for synonymous transition} \\
\pi_j \omega & \text{for non-synonymous transversion} \\
\pi_j \omega \kappa & \text{for non-synonymous transition}
\end{cases}$$
or, if we ignore the equilibrium frequencies $\pi_j$:
$$q_{ij} = \begin{cases}
0 & \text{if $i$ and $j$ differ in more than one position} \\
1 & \text{for synonymous transversion} \\
\kappa & \text{for synonymous transition} \\
\omega & \text{for non-synonymous transversion} \\
\omega \kappa & \text{for non-synonymous transition}
\end{cases}$$
where ω is the dN/dS ratio, κ the transition/transversion ratio and
πj the equilibrium frequency of codon j.
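
The case distinction without the codon frequencies can be written as a small base R function. This is a sketch under simplifying assumptions: it uses the standard genetic code, does not exclude stop codons, and returns 0 for identical codons instead of filling in the diagonal of the rate matrix. The function name codon_rate is hypothetical.

```r
# Standard genetic code, codons ordered with bases T, C, A, G
b <- c("T", "C", "A", "G")
g <- expand.grid(third = b, second = b, first = b, stringsAsFactors = FALSE)
codons <- paste0(g$first, g$second, g$third)
aa <- strsplit(paste0("FFLLSSSSYY**CC*W", "LLLLPPPPHHQQRRRR",
                      "IIIMTTTTNNKKSSRR", "VVVVAAAADDEEGGGG"), "")[[1]]
names(aa) <- codons

# Relative rate from codon i to codon j, ignoring codon frequencies
codon_rate <- function(i, j, kappa, omega) {
  x <- strsplit(i, "")[[1]]
  y <- strsplit(j, "")[[1]]
  pos <- which(x != y)
  if (length(pos) != 1) return(0)            # more than one difference (or none)
  pair <- paste(sort(c(x[pos], y[pos])), collapse = "")
  rate <- 1
  if (pair %in% c("AG", "CT")) rate <- rate * kappa  # transition
  if (aa[[i]] != aa[[j]]) rate <- rate * omega       # non-synonymous
  rate
}

r_syn_ts <- codon_rate("TTT", "TTC", kappa = 2, omega = 0.1)  # Phe to Phe
```

With kappa = 2 and omega = 0.1, the synonymous transition TTT to TTC has rate 2, while the non-synonymous transition TTT to CTT (Phe to Leu) has rate 0.2.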

Codon Models
> (dat <- phyDat(as.character(yeast), "CODON"))
> tree <- nj(dist.ml(yeast))
> fit <- pml(tree, dat)
> ctr <- pml.control(trace=0)
> fit0 <- optim.pml(fit, control = ctr)
> fit1 <- optim.pml(fit0, model="codon1", control=ctr)
> fit2 <- optim.pml(fit0, model="codon2", control=ctr)
> fit3 <- optim.pml(fit0, model="codon3", control=ctr)
Model κ ω
codon0 1 1
codon1 free free
codon2 1 free
codon3 free 1
Additionally, the equilibrium frequencies of the codons πj can be
estimated by setting the parameter optBf=TRUE.

Codon Models
> anova(fit0, fit2, fit1)
1 -1054762 13
2 -648282 14 1 812961 < 2.2e-16 ***
3 -642807 15 1 10949 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘
> anova(fit0, fit3, fit1)
1 -1054762 13
2 -708674 14 1 692176 < 2.2e-16 ***
3 -642807 15 1 131735 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘
