Interpreting ‘tree space’ in the 
context of very large empirical 
datasets 
Joe Parker 
School of Biological and Chemical Sciences 
Queen Mary University of London
Topics 
• What evolutionary biology is 
– And what we do in the lab 
• Introducing phylogenies (trees / digraphs) 
• Molecular evolution 
• Tests involving phylogeny comparison 
• Problems in phylogeny comparison 
• Conclusion / thanks / questions
Introduction to our work (1/5)
A Tale of Bats and Whales
The Prestin gene & high-frequency hearing
Evolution
Prestin evolution 
Human NDLTRNRFFENPALWELLFH… SIHDAVLGSQLREALAEQEASAPPSQ 
Rat NDLTSNRFFENPALKELLFH… SIHDAVLGSQVREAMAEQETTVLPPQ 
Dog NDLTQNRFFENPALKELLFH… SIHDAVLGSQLREALAEQEASALPPQ 
Dolphin SDLTRNQFFENPALLDLLFH… SIHDAVLGSLVREALAEKEAAAATPQ 
Horseshoe Bat SDLTRNRFFENPALLDLLFH… SIHDAVLGSLVREALEEKEAAAATPQ
Introduction to phylogenies (2/5)
Phylogenies 
• Phylogenies are directed graphs that show 
evolutionary relations between taxa 
• Or our hypotheses about them
Comparative approaches
Tree space 
• Phylogeneticists often talk about tree space - 
the set of all possible trees 
• Within tree space two graphs are said to be 
adjacent if they differ at e.g. one internal node 
• Trees are said to be ‘near’ if they are similar 
e.g. only a few rearrangements 
• It is not actually a well-defined concept 
however
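To give a sense of how large "the set of all possible trees" becomes, the sketch below counts labelled, fully bifurcating topologies using the standard double-factorial formulae: (2n-5)!! unrooted and (2n-3)!! rooted trees for n taxa. The function names are illustrative and not part of the talk.

```python
def double_factorial(k: int) -> int:
    """Return k * (k - 2) * (k - 4) * ... down to 1 (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted, bifurcating, labelled topologies: (2n - 5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    return double_factorial(2 * n_taxa - 5)


def n_rooted_trees(n_taxa: int) -> int:
    """Number of rooted, bifurcating, labelled topologies: (2n - 3)!!"""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa for a rooted tree")
    return double_factorial(2 * n_taxa - 3)


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

Even for a few tens of taxa the count is far too large to enumerate, which is one reason sampling from tree space, rather than exhaustive evaluation, matters later in the talk.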
Introduction to molecular evolution 
(3/5)
Molecular evolution 
• Molecular evolution is the study of the processes by 
which DNA sequences change over time 
• Stochastic changes dominate over short time-scales 
but over longer ones directional natural selection is 
apparent 
• Normally modelled as a stochastic process 
• Unlike classical physical phenomena, largely 
understood as a statistical, not mechanical, 
phenomenon
Simple model: Jukes-Cantor 69 
• Letters {A,C,G,T} 
• Equal frequencies at equilibrium 
• Each base changes to each of the other three at rate u / 3 per unit time 
• e.g. A → C over time t: 
$$\Pr(C \mid A, u, t) = \frac{1}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$$
More generally: 
Felsenstein (2004) Inferring Phylogenies. Springer, NY 
(Following model figures and formulae: ibid.)
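The Jukes-Cantor probability on this slide can be written as a short function. This is a minimal sketch: the function name and argument order are illustrative, and the same-base case (1/4 + 3/4 e^{-4ut/3}) is the standard complement of the formula shown for A → C.

```python
import math

BASES = "ACGT"


def jc69_prob(from_base: str, to_base: str, u: float, t: float) -> float:
    """P(to_base | from_base, u, t) under Jukes-Cantor (1969).

    u is the total substitution rate per unit time (u / 3 to each other base),
    t is the elapsed time.
    """
    decay = math.exp(-4.0 * u * t / 3.0)
    if from_base == to_base:
        return 0.25 + 0.75 * decay
    return 0.25 * (1.0 - decay)


if __name__ == "__main__":
    # Probability of observing C given an ancestral A, for a few values of u * t.
    for ut in (0.01, 0.1, 1.0, 10.0):
        print(ut, jc69_prob("A", "C", u=1.0, t=ut))
```

As u·t grows, every entry tends to 1/4, which is the "equal frequencies at equilibrium" property listed above.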
Maximum likelihood 
• One of the most popular frameworks for 
understanding and modelling molecular 
evolution and phylogenies 
• Likelihood of data given model, phylogeny: 
$$L = \Pr(D \mid T) = \prod_{i=1}^{m} \Pr(D^{(i)} \mid T)$$
• Likelihood-maximisation gives a way to 
parameterise the model and/or phylogeny
Independence of sites (1): 
$$L = \Pr(D \mid T) = \prod_{i=1}^{m} \Pr(D^{(i)} \mid T)$$
Independence of branches (2): 
$$\Pr(D^{(i)} \mid T) = \sum_{x} \sum_{y} \sum_{z} \sum_{w} \Pr(A, C, C, C, G, x, y, z, w \mid T)$$
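The sum over unobserved internal states x, y, z, w can be illustrated for a single site with tip states A, C, C, C, G. The sketch below makes several assumptions that are not on the slide: the tree topology, the node labels, the branch lengths (all 0.1), and the use of Jukes-Cantor as the substitution model. It evaluates the site likelihood once by brute-force summation and once by the pruning recursion that the independence-of-branches assumption permits; the two should agree.

```python
import itertools
import math

BASES = "ACGT"

# Hypothetical rooted tree: internal node -> [(child, branch length as u * t)].
TREE = {
    "x": [("y", 0.1), ("z", 0.1)],
    "y": [("t1", 0.1), ("t2", 0.1)],
    "z": [("t3", 0.1), ("w", 0.1)],
    "w": [("t4", 0.1), ("t5", 0.1)],
}
INTERNAL = list(TREE)  # x (root), y, z, w
TIP_NAMES = ["t1", "t2", "t3", "t4", "t5"]


def jc69(a: str, b: str, ut: float) -> float:
    """Jukes-Cantor probability of b given a after branch length ut."""
    decay = math.exp(-4.0 * ut / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 * (1.0 - decay)


def site_likelihood_bruteforce(tips: dict) -> float:
    """Sum the joint probability over every assignment of internal states."""
    total = 0.0
    for states in itertools.product(BASES, repeat=len(INTERNAL)):
        assign = {**dict(zip(INTERNAL, states)), **tips}
        prob = 0.25  # equilibrium frequency of the root state
        for parent, children in TREE.items():
            for child, ut in children:
                prob *= jc69(assign[parent], assign[child], ut)
        total += prob
    return total


def conditional(node: str, state: str, tips: dict) -> float:
    """Likelihood of the data below `node`, given `node` is in `state`."""
    if node in tips:
        return 1.0 if tips[node] == state else 0.0
    result = 1.0
    for child, ut in TREE[node]:
        result *= sum(jc69(state, s, ut) * conditional(child, s, tips) for s in BASES)
    return result


def site_likelihood_pruning(tips: dict) -> float:
    """Same quantity computed with the pruning recursion from the root."""
    return sum(0.25 * conditional("x", s, tips) for s in BASES)


if __name__ == "__main__":
    tips = dict(zip(TIP_NAMES, "ACCCG"))
    print(site_likelihood_bruteforce(tips), site_likelihood_pruning(tips))
```

Under the independence-of-sites assumption, the log-likelihood of a whole alignment would then be the sum of `math.log(site_likelihood_pruning(tips))` over its columns.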
Phylogenomics 
• Advances mean data sets several orders of 
magnitude larger 
• Shift in emphasis from ML on specific 
phylogenies to statistics of all 
(Image credits: flickr/stephenjjohnson; Illumina.com; spectrum.ieee.org)
Phylogenomics 
• Stochastic property of 
molecular evolution 
becomes apparent in 
large datasets 
• Goodness-of-fit varies by 
site / gene for a single 
phylogeny / model 
• Corollary: goodness-of-fit 
varies amongst 
models for a single 
genome
Hypothesis-comparison tests using 
multiple phylogenies (4/5)
Convergence detection by ΔSSLS - 
Parker et al. (2013) 
• De novo genomes: 
– four taxa 
– 2,321 protein-coding loci 
– 801,301 codons 
• Published: 
– 18 genomes 
• ~69,000 simulated datasets 
• ~3,500 cluster cores 
$$\Delta SSLS_i = \ln L_{i,H_0} - \ln L_{i,H_a}$$
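The site-wise statistic above is a per-site difference of log-likelihoods under the two tree hypotheses. A minimal sketch follows, assuming the per-site log-likelihoods have already been computed by some ML program; the function names and the input numbers are illustrative and are not taken from the published pipeline.

```python
def delta_ssls(lnl_h0, lnl_ha):
    """Per-site difference ln L(i, H0) - ln L(i, Ha).

    A negative value means site i fits the alternative tree Ha better than H0.
    """
    if len(lnl_h0) != len(lnl_ha):
        raise ValueError("both hypotheses must be evaluated on the same sites")
    return [h0 - ha for h0, ha in zip(lnl_h0, lnl_ha)]


def mean_delta_ssls(lnl_h0, lnl_ha):
    """Mean of the per-site differences, e.g. one summary value per locus."""
    diffs = delta_ssls(lnl_h0, lnl_ha)
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Made-up site-wise log-likelihoods for a four-site alignment.
    h0 = [-2.0, -3.5, -1.25, -4.0]
    ha = [-2.5, -3.0, -1.25, -3.0]
    print(delta_ssls(h0, ha), mean_delta_ssls(h0, ha))
```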
Our pipeline for detecting genome-wide convergence
[Figure: distribution panels; mean = 0.05, mean = -0.01, mean = -0.08]
Continuous distributions 
• Output approximates a continuous distribution 
• Comparing alternative hypotheses, it is apparent that the choice of tree largely 
determines location, skew, etc. (perhaps as expected) 
• But given that distribution tails are considered significant, the meaning and 
comparability of values in these tails are problematic
Significance by simulation 
• Very common technique in evolutionary 
biology – simulate a large dataset under the 
null model, compare w/empirical 
• In this context, simulate data to get 
unexpectedness U: 
$$U = 1 - \mathrm{cdf}(\Delta SSLS_{H_0 - H_a} \mid j)$$
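One way to read the unexpectedness statistic is as one minus the empirical cumulative distribution of the null-simulated values, evaluated at the observed value. The sketch below takes that reading; the function name is illustrative, the conditioning term j is not defined on the slide and is left out, and the choice of an empirical (rather than fitted) cdf is an assumption.

```python
from bisect import bisect_right


def unexpectedness(observed: float, simulated_null) -> float:
    """U = 1 - ecdf(observed), with the ecdf built from null-simulated values.

    ecdf(v) is the fraction of simulated values less than or equal to v.
    """
    if not simulated_null:
        raise ValueError("need at least one simulated value")
    ordered = sorted(simulated_null)
    cdf = bisect_right(ordered, observed) / len(ordered)
    return 1.0 - cdf


if __name__ == "__main__":
    # Made-up null distribution of a statistic and one observed value.
    null_values = [0.0, 0.1, 0.2, 0.3]
    print(unexpectedness(0.15, null_values))
```

Which tail counts as "unexpected" depends on the sign convention of the statistic, so a real analysis would need to fix that convention before interpreting U.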
Problems in multiple-hypothesis 
phylogeny comparisons (5/5)
Multiple hypotheses 
• Alternative hypotheses drawn from tree space 
• Same dataset different Ha, different U 
• What U expected for Ha? 
• More simulation – multiple draws from tree 
space: 
Uc,= U – mean Uc
Tree space 
• In the context of ML, tree 
space can be thought of in 
terms of the distance in lnL 
units (or any other related 
statistic*) between two trees 
with otherwise identical 
models / data 
• In our previous results this 
appeared continuous. 
• This may be misleading; in 
reality tree space, or derived 
statistics, can be highly 
discontinuous.
Multiple comparisons 
• However, recall that distance in tree space, 
or the shape of tree space, is not well determined. 
• How to sample effectively to control U (as Uc)? 
• How to compare Uc for Ha? 
• Sample every point (tree)? 
• Sample lots? 
• Sample systematically? Inverse-distance? Etc
Tree space 
• Previously, with small empirical datasets, we 
assumed a single phylogeny was a good descriptor 
of most/many sites 
• With large datasets this may not be true 
– Small adjustments give a better fit for many sites 
– And so do some large rearrangements 
• Perhaps a better definition of tree space 
• Considering two Ha equidistant from H0
Tree distance properties 
• Scalar distances informative 
• Triangle inequality (‘triagonality’) 
• Proportional to L for a given model(?) 
• Vectors informative (?)
Tree distance candidates 
• Statistic or model-based measures: 
– Parsimony, ML or amino-acid/nucleotide distance 
– ΔlnL 
• Topology-based measures: 
– Number / type of rearrangement moves, e.g. 
• Nearest-neighbour interchange 
• Subtree prune-and-regraft 
• Tree bisection-and-reconnection 
• Algorithm-based measures: 
– Number of algorithm move steps 
– Wall clock time
Acknowledgements 
• School of Biological and Chemical Sciences, Queen Mary, University of 
London – Rossiter Group 
– Prof. Steve Rossiter (PI) 
– Drs Kalina Davies, Georgia Tsagkogeorga, Michael McGowen, Mao 
Xiuguang 
– Seb Bailey, Kim Warren 
• Others: 
– Profs Richard Nichols, Andrew Leitch (SBCS) 
– Drs Yannick Wurm, Richard Buggs, Chris Faulkes, Steve Le Comber (SBCS) 
– Drs Chris Walker & Rob Horton (GridPP HTC) 
• Sanger Centre 
– Dr James Cotton 
(L-R): Joe Parker; Georgia Tsagkogeorga; Kalina Davies; Steve Rossiter; Xiuguang Mao; Seb Bailey

Editor's notes

  1. Abstract: Interpreting ‘tree space’ in the context of very large empirical datasets Evolutionary biologists represent actual or hypothesised evolutionary relations between living organisms using phylogenies, directed bifurcating graphs (trees)  that describe evolutionary processes in terms of speciation or splitting events (nodes) and elapsed evolutionary time or distance (edges). Molecular evolution itself is largely dominated by mutations in DNA sequences, a stochastic process. Traditionally, probabilistic models of molecular evolution and phylogenies are fitted to DNA sequence data by maximum likelihood on the assumption that a single simple phylogeny will serve to approximate the evolution of a majority of DNA positions in the dataset. However modern studies now routinely sample several orders of magnitude more DNA positions, and this assumption no longer holds. Unfortunately, our conception of ‘tree space’ - a notional multidimensional surface containing all possible phylogenies - is extremely imprecise, and similarly techniques to model phylogeny model fitting in very large datasets are limited. I will show the background to this field and present some of the challenges arising from the present limited analytical framework.
  3. The phylogenies
  4. (OHC diagram)
  5. PIC OF DARWIN Phenotypes diverging Genes diverging Phylogeny REMOVE wording from pentadactyly diagram CLEARER example phylogeny
  6. Prestin sequences Prestin Phylogeny “”BIOLOGISTS AND BIOCHEMISTS REPRESENT PROTEINS AS SEQUENCES OF LETTERS REMINDER PHYLOGENY of mammals spp. tree
  8. Given observed microbial diversity Phylogeny reveals evolutionary history; trait acquired once? Or multiple times – biologically significant…
  11. Pervasive phylogenetic incongruence. Although this is a test for phylogenetic discordance attributable to genetic convergence, when applied to different contexts it could equally be used to measure discordance that has arisen by other processes, some of which will be more applicable to tropical systems: - Horizontal gene transfer among bacteria - Introgression across species barriers - Incomplete lineage sorting
  13. Is there a way to work out the expectation of Uc (Ha) or a better measure? Uc for two Ha dependent on distance Ha<->b What is tree distance?