Seminar presented at the Maths Department, University of Portsmouth, 19th November 2014
Evolutionary biologists represent actual or hypothesised evolutionary relations between living organisms using phylogenies: directed bifurcating graphs (trees) that describe evolutionary processes in terms of speciation or splitting events (nodes) and elapsed evolutionary time or distance (edges). Molecular evolution itself is largely dominated by mutations in DNA sequences, a stochastic process. Traditionally, probabilistic models of molecular evolution and phylogenies are fitted to DNA sequence data by maximum likelihood, on the assumption that a single simple phylogeny will serve to approximate the evolution of a majority of DNA positions in the dataset. However, modern studies now routinely sample several orders of magnitude more DNA positions, and this assumption no longer holds. Unfortunately, our conception of ‘tree space’ - a notional multidimensional surface containing all possible phylogenies - is extremely imprecise, and techniques for modelling how phylogenies fit very large datasets are similarly limited. I will show the background to this field and present some of the challenges arising from the present limited analytical framework.
Interpreting ‘tree space’ in the context of very large empirical datasets
1. Interpreting ‘tree space’ in the
context of very large empirical
datasets
Joe Parker
School of Biological and Chemical Sciences
Queen Mary University of London
2. Topics
• What evolutionary biology is
– And what we do in the lab
• Introducing phylogenies (trees / digraphs)
• Molecular evolution
• Tests involving phylogeny comparison
• Problems in phylogeny comparison
• Conclusion / thanks / questions
7. Prestin evolution
Human NDLTRNRFFENPALWELLFH… SIHDAVLGSQLREALAEQEASAPPSQ
Rat NDLTSNRFFENPALKELLFH… SIHDAVLGSQVREAMAEQETTVLPPQ
Dog NDLTQNRFFENPALKELLFH… SIHDAVLGSQLREALAEQEASALPPQ
Dolphin SDLTRNQFFENPALLDLLFH… SIHDAVLGSLVREALAEKEAAAATPQ
Horseshoe Bat SDLTRNRFFENPALLDLLFH… SIHDAVLGSLVREALEEKEAAAATPQ
11. Tree space
• Phylogeneticists often talk about tree space -
the set of all possible trees
• Within tree space two graphs are said to be
adjacent if they differ at e.g. one internal node
• Trees are said to be ‘near’ if they are similar
e.g. only a few rearrangements
• It is not actually a well-defined concept
however
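To give a feel for how large 'the set of all possible trees' is, the sketch below counts rooted, bifurcating, labelled trees using the standard double-factorial result (2n − 3)!! for n tips. The function name is illustrative only and does not come from the talk.

```python
def num_rooted_trees(n_tips: int) -> int:
    """Count rooted, bifurcating, labelled trees on n_tips tips: (2n - 3)!!."""
    if n_tips < 1:
        raise ValueError("need at least one tip")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3 (empty product for n <= 2).
    for k in range(3, 2 * n_tips - 2, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (3, 4, 5, 10, 20):
        print(n, num_rooted_trees(n))
```

Even at 20 tips the count is far too large to enumerate, which is one reason sampling strategies in tree space matter.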
13. Molecular evolution
• Molecular evolution is the study of the processes by
which DNA sequences change over time
• Stochastic changes dominate over short time-scales
but over longer ones directional natural selection is
apparent
• Normally modelled as a stochastic process
• Unlike classical physical phenomena, largely
understood as a statistical, not mechanical,
phenomenon
14. Simple model: Jukes-Cantor 69
• Letters {A,C,G,T}
• Equal frequencies at equilibrium
• Rate of change u / 3 to each other letter per unit time
• e.g. A → C:
$\Pr(C \mid A, u, t) = \frac{1}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$
• More generally: Felsenstein (2004) Inferring Phylogenies. Sinauer Associates, Sunderland, MA
(Following model figures and formulae: ibid.)
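As a minimal sketch, the Jukes-Cantor probability of observing C given A after time t at rate u, (1/4)(1 − e^(−4ut/3)), can be evaluated directly. The same-letter case is not on the slide; it is filled in here from the requirement that the four probabilities sum to one. The function name is hypothetical.

```python
import math

BASES = "ACGT"


def jc69_prob(start: str, end: str, u: float, t: float) -> float:
    """Probability of seeing `end` given `start` after time t under Jukes-Cantor."""
    decay = math.exp(-4.0 * u * t / 3.0)
    if start == end:
        # Complement of the three equal off-diagonal probabilities.
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


if __name__ == "__main__":
    for t in (0.0, 0.1, 1.0, 10.0):
        print(t, jc69_prob("A", "C", u=1.0, t=t))
```

As t grows, every probability tends to 1/4, matching the equal equilibrium frequencies assumed by the model.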
15. Maximum likelihood
• One of the most popular frameworks for
understanding and modelling molecular
evolution and phylogenies
• Likelihood of data given model, phylogeny:
$L = \Pr(D \mid T) = \prod_{i=1}^{m} \Pr(D^{(i)} \mid T)$
• Likelihood-maximisation gives a way to
parameterise model and/or phylogeny
16.
$L = \Pr(D \mid T) = \prod_{i=1}^{m} \Pr(D^{(i)} \mid T)$
Assumptions: (1) independence of sites; (2) independence of branches
$\Pr(D^{(i)} \mid T) = \sum_{x} \sum_{y} \sum_{z} \sum_{w} \Pr(A, C, C, C, G, x, y, z, w \mid T)$
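The site likelihood above sums over the unobserved states x, y, z, w at the interior nodes. The sketch below does this by brute force for a single site with tip states A, C, C, C, G. The tree shape, the branch lengths and every name here are assumptions made for illustration, and Jukes-Cantor is used as the substitution model; the talk does not specify this tree.

```python
import itertools
import math

BASES = "ACGT"


def jc69(start: str, end: str, ut: float) -> float:
    """Jukes-Cantor probability along a branch of length ut = u * t."""
    decay = math.exp(-4.0 * ut / 3.0)
    return 0.25 + 0.75 * decay if start == end else 0.25 - 0.25 * decay


# Hypothetical rooted topology: (parent, child, branch length u*t).
# Interior nodes are x (root), y, z, w; tips are t1..t5.
BRANCHES = [
    ("x", "y", 0.10), ("x", "z", 0.20),
    ("y", "t1", 0.10), ("y", "t2", 0.10),
    ("z", "t3", 0.10), ("z", "w", 0.05),
    ("w", "t4", 0.10), ("w", "t5", 0.10),
]
INTERIOR = ["x", "y", "z", "w"]
TIPS = {"t1": "A", "t2": "C", "t3": "C", "t4": "C", "t5": "G"}


def site_likelihood(tips: dict) -> float:
    """Pr(D(i) | T): sum the joint probability over all interior states."""
    total = 0.0
    for states in itertools.product(BASES, repeat=len(INTERIOR)):
        assign = dict(tips)
        assign.update(zip(INTERIOR, states))
        prob = 0.25  # equilibrium frequency of the root state
        # Independence of branches: the joint probability is a product.
        for parent, child, ut in BRANCHES:
            prob *= jc69(assign[parent], assign[child], ut)
        total += prob
    return total


def log_likelihood(sites: list) -> float:
    """ln L: independence of sites turns the product into a sum of logs."""
    return sum(math.log(site_likelihood(site)) for site in sites)


if __name__ == "__main__":
    print(site_likelihood(TIPS))
    print(log_likelihood([TIPS, TIPS]))
```

Brute force costs 4^k terms for k interior nodes, so it is only workable for toy trees; Felsenstein's pruning algorithm computes the same quantity efficiently and is what real software uses.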
17. Phylogenomics
• Advances mean data sets several orders of
magnitude larger
• Shift in emphasis from ML on specific
phylogenies to statistics of all
(Image credits: flickr/stephenjjohnson; Illumina.com; spectrum.ieee.org)
18. Phylogenomics
• Stochastic property of
molecular evolution
becomes apparent in
large datasets
• Goodness-of-fit varies by
site / gene for a single
phylogeny / model
• Corollary: goodness-of-fit
varies amongst
models for a single
genome
31. Continuous distributions
• Output approximates a continuous distribution
• Comparing alternative hypotheses, it is apparent that the choice of tree largely
determines location, skew, etc. (perhaps as expected)
• But given that the distribution tails are the part considered significant, the meaning of values in
these tails is problematic: are they comparable?
32. Significance by simulation
• Very common technique in evolutionary
biology – simulate a large dataset under the
null model, compare w/empirical
• In this context: simulate data to get an
'unexpectedness' U:
$U = 1 - \mathrm{cdf}(\Delta SSLS_{H_0 - H_a} \mid j)$
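A minimal sketch of the unexpectedness statistic, U = 1 − cdf(observed), using an empirical cdf built from values simulated under the null. Reading the conditioning on j as 'which simulated null distribution to compare against' is an assumption here, as are the function name and the toy numbers.

```python
from bisect import bisect_right


def unexpectedness(observed: float, null_values: list) -> float:
    """U = 1 - ecdf(observed), with the ecdf taken from simulated null values."""
    if not null_values:
        raise ValueError("need at least one simulated value")
    ordered = sorted(null_values)
    # Fraction of simulated values less than or equal to the observed one.
    cdf = bisect_right(ordered, observed) / len(ordered)
    return 1.0 - cdf


if __name__ == "__main__":
    simulated = [0.0, 1.0, 2.0, 3.0]  # stand-in for simulated null statistics
    print(unexpectedness(2.5, simulated))
```

With an empirical cdf, U cannot resolve values smaller than 1 / (number of simulations), so the number of simulated datasets limits how far into the tails the statistic is meaningful.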
34. Multiple hypotheses
• Alternative hypotheses drawn from tree space
• Same dataset, different Ha: different U
• What U is expected for a given Ha?
• More simulation – multiple draws from tree
space:
Uc,= U – mean Uc
35. Tree space
• In the context of ML tree
space can be thought of as the
distance in lnL units (or any
other related statistic*)
between two trees with
otherwise identical models /
data
• In our previous results this
appeared continuous.
• This may be misleading; in
reality tree space, or derived
statistics, can be highly
discontinuous.
36. Multiple comparisons
• However… recall that distance in tree space,
or the shape of tree space, is not well determined.
• How to sample effectively to control U (as Uc)?
• How to compare Uc for Ha?
• Sample every point (tree)?
• Sample lots?
• Sample systematically? Inverse-distance? Etc
37. Tree space
• Previously, with small empirical datasets, we
assumed a single phylogeny was a good descriptor
of most/many sites
• With large datasets this may not be true
– Small adjustments are a better fit for many sites
– And so are some large rearrangements
• Perhaps we need a better definition of tree space
• Considering two Ha equidistant from H0
38.
39. Tree distance properties
• Scalar distances informative
• Triangle inequality (triangularity)
• Proportional to L for a given model(?)
• Vectors informative (?)
40. Tree distance candidates
• Statistic or model-based measures:
– Parsimony, ML or amino-acid/nucleotide distance
– ΔlnL
• Topology-based measures:
– Number / type of rearrangement moves, e.g.
• Nearest-neighbour interchange
• Subtree prune-and-regraft
• Tree bisection-and-reconnection
• Algorithm-based measures:
– Number of algorithm move steps
– Wall clock time
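Rearrangement distances (NNI, SPR, TBR) are expensive to compute exactly, so as a stand-in the sketch below uses a different topology-based measure that is not on the slide: a rooted Robinson-Foulds distance, which counts the clades found in one tree but not the other. Trees are written as nested tuples of tip names, a representation chosen purely for illustration.

```python
def clades(tree):
    """Return (all tips, set of non-trivial clades) for a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):  # a tip
            return frozenset([node])
        tips = frozenset().union(*(walk(child) for child in node))
        found.add(tips)
        return tips

    all_tips = walk(tree)
    found.discard(all_tips)  # the root clade is shared by every tree
    return all_tips, found


def rf_distance(tree_a, tree_b) -> int:
    """Rooted Robinson-Foulds distance: size of the clade symmetric difference."""
    tips_a, clades_a = clades(tree_a)
    tips_b, clades_b = clades(tree_b)
    if tips_a != tips_b:
        raise ValueError("trees must share the same tip labels")
    return len(clades_a ^ clades_b)


if __name__ == "__main__":
    h0 = ((("A", "B"), "C"), ("D", "E"))
    ha = ((("A", "C"), "B"), ("D", "E"))  # one local rearrangement away from h0
    print(rf_distance(h0, ha))
```

This distance is a true metric and cheap to compute, but it saturates quickly and ignores branch lengths, so it need not track ΔlnL for a given model.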
41. Acknowledgements
• School of Biological and Chemical Sciences, Queen Mary, University of
London – Rossiter Group
– Prof. Steve Rossiter (PI)
– Drs Kalina Davies, Georgia Tsagkogeorga, Michael McGowen, Mao
Xiuguang
– Seb Bailey, Kim Warren
• Others:
– Profs Richard Nichols, Andrew Leitch (SBCS)
– Drs Yannick Wurm, Richard Buggs, Chris Faulkes, Steve Le Comber (SBCS)
– Drs Chris Walker & Rob Horton (GridPP HTC)
• Sanger Centre
– Dr James Cotton
(L-R): Joe Parker; Georgia Tsagkogeorga; Kalina Davies; Steve Rossiter; Xiuguang Mao; Seb Bailey
Editor's notes
The phylogenies
(OHC diagram)
PIC OF DARWIN
Phenotypes diverging
Genes diverging
Phylogeny
REMOVE wording from pentadactyly diagram
CLEARER example phylogeny
Prestin sequences
Prestin Phylogeny
BIOLOGISTS AND BIOCHEMISTS REPRESENT PROTEINS AS SEQUENCES OF LETTERS
REMINDER PHYLOGENY of mammals spp. tree
Given observed microbial diversity
Phylogeny reveals evolutionary history; trait acquired once?
Or multiple times – biologically significant…
Pervasive phylogenetic incongruence
We test for phylogenetic discordance attributable to genetic convergence;
applied to different contexts, the test could equally be used to measure discordance that has arisen by other processes,
some of which will be more applicable to tropical systems:
- Horizontal gene transfer among bacteria
- Introgression across species barriers
- Incomplete lineage sorting
Is there a way to work out the expectation of Uc (Ha) or a better measure?
Uc for two Ha dependent on distance
Ha<->b
What is tree distance?