Presentation of the models of gene duplication, transfer, loss, and incomplete lineage sorting developed by my colleagues and myself. Results on gene tree inference and species tree inference are presented. Groups of species studied include mammals, birds, fungi and cyanobacteria.
Models of gene duplication, transfer and loss to study genome evolution
1. Bastien Boussau
LBBE, CNRS, Université de Lyon
Models of gene duplication, transfer and loss to study genome evolution
2. Collaborators
Lyon collaborators:
• Adrián Arellano Davín
• Gergely Szöllősi (Budapest)
• Vincent Daubin
• Eric Tannier
• Thomas Bigot
• Magali Semeria
• Manolo Gouy
• Laurent Duret
• Nicolas Lartillot
Austin/Illinois collaborators:
• Siavash Mirarab
• Md. Shamsuzzoha Bayzid
• Tandy Warnow
RevBayes collaborators:
• Sebastian Hoehna
• Michael Landis
• Tracy Heath
• Fredrik Ronquist
• Brian Moore
• John Huelsenbeck
• …
3. Plan
1. Gene duplications and losses
• Mammalian genomes
2. Gene duplications, losses and transfers
• Fungi and Cyanobacteria
3. A fast approach to dealing with incomplete lineage sorting
• Birds
4. 2 vignettes
4. To study genome evolution:
1. One species tree
2. Thousands of gene trees
[Figure: tree on species A, B, C, D drawn along a time axis, annotated with a discrete character (a, a, b, a) and a continuous character (0.1, 0.2, 0.2, 0.4)]
7. Why our current pipeline can be improved
• Gene alignments:
• Error prone (genes are short)
• Point estimates
8. Why our current pipeline can be improved
• Gene trees:
• based on alignments
• Point estimates
• Gene alignments:
• Error prone (genes are short)
• Point estimates
9. Why our current pipeline can be improved
• Gene trees:
• based on alignments
• Point estimates
• Species trees:
• based on gene trees
• Gene alignments:
• Error prone (genes are short)
• Point estimates
11. [Figure: a tree on species A, B, C, D drawn along a time axis, beside a second tree annotated with a discrete character (a, a, b, a) and a continuous character (0.1, 0.2, 0.2, 0.4)]
12. [Figure: as on slide 11, with a third tree carrying the same annotations]
14. [Figure: as on slide 12, with the label D]
15. [Figure: as on slide 12, with the labels D and DL]
16. [Figure: as on slide 12, with the labels D, DL and LGT]
17. [Figure: as on slide 12, with the labels D, DL, LGT and ILS]
18. [Figure: three trees on species A, B, C, D drawn along a time axis, annotated with a discrete character (a, a, b, a) and a continuous character (0.1, 0.2, 0.2, 0.4); labels D, DL, LGT and ILS]
DL: Boussau et al., Genome Research 2013
19. [Figure: as on slide 18]
DL: Boussau et al., Genome Research 2013
DL+T: Szöllősi et al., PNAS 2013
20. [Figure: as on slide 18]
DL: Boussau et al., Genome Research 2013
DL+T: Szöllősi et al., PNAS 2013
ILS: Mirarab et al., Science 2014
21. PHYLDOG
Joint reconstruction of the species tree, gene trees, and numbers of duplications and losses
Input: all gene families (thousands of alignments)
Output: rooted species tree, numbers of duplications and losses, rooted gene trees
[Figure: species tree on A, B, C, D drawn along a time axis, with duplication counts D1 to D6 and loss counts L1 to L6 on its branches]
Boussau et al., Genome Research 2013
22. PHYLDOG
Joint reconstruction of the species tree, gene trees, and numbers of duplications and losses
Input: all gene families (thousands of alignments)
Output: rooted species tree, numbers of duplications and losses, rooted gene trees
[Figure: species tree on A, B, C, D drawn along a time axis, with duplication counts D1 to D6 and loss counts L1 to L6 on its branches]
Probabilistic models:
• sequence evolution
• gene family evolution
Boussau et al., Genome Research 2013
23. PHYLDOG: a model of gene duplication and loss
Assumptions
• Genes evolve along the species tree:
• birth events:
• duplications (rate of duplication)
• death events:
• losses (rate of loss)
• Each gene family is independent of other genes
• Each gene copy is independent of other copies
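The assumptions above amount to a linear birth-death process acting on gene copies. As a minimal sketch, and not PHYLDOG's own implementation (which computes likelihoods rather than simulating), the code below simulates the copy number of one gene family along a single branch. The function name, the rates, and the branch length are hypothetical values chosen for illustration.

```python
import random

def simulate_copies(n_start, dup_rate, loss_rate, branch_length, rng):
    """Gillespie simulation of gene copy number along one branch.

    Each copy duplicates at rate dup_rate and is lost at rate loss_rate,
    independently of the other copies (assumption of the sketch).
    """
    n, t = n_start, 0.0
    while n > 0:
        total_rate = n * (dup_rate + loss_rate)
        if total_rate == 0.0:
            break  # no event can happen
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > branch_length:
            break  # the branch ends before the next event
        if rng.random() < dup_rate / (dup_rate + loss_rate):
            n += 1  # duplication (birth)
        else:
            n -= 1  # loss (death)
    return n

rng = random.Random(42)
counts = [simulate_copies(1, 0.3, 0.2, 1.0, rng) for _ in range(10000)]
mean_copies = sum(counts) / len(counts)
# For a linear birth-death process the expected copy number is
# n_start * exp((dup_rate - loss_rate) * branch_length).
print(mean_copies)
```

Because copies are independent, the total event rate scales with the current copy number, which is why the waiting time is drawn with rate `n * (dup_rate + loss_rate)`.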
24. Study of mammalian genome evolution
• Challenging but well-studied phylogeny
• 36 mammalian genomes available in Ensembl v. 57
• About 7000 gene families
• Correction for poorly sequenced genomes
26. Quality of the gene trees
Comparison between:
PhyML (used for the PhylomeDB and Homolens databases)
TreeBeST (used for the Ensembl-Compara database)
PHYLDOG
Two approaches:
• Looking at ancestral genome sizes
• Assessing how well one can recover ancestral syntenies using reconstructed gene trees (Bérard et al., Bioinformatics 2012)
29. Recent improvements to PHYLDOG
• Easier installation using CMake or a virtual machine
• Better algorithms for gene tree inference
• Better algorithm for starting species tree
• Faster computations using the Phylogenetic Likelihood Library (PLL, A. Stamatakis group)
• Python scripts to help run the program
30. Plan
1. Gene duplications and losses
• Mammalian genomes
2. Gene duplications, losses and transfers
• Fungi and Cyanobacteria
3. A fast approach to dealing with incomplete lineage sorting
• Birds
4. 2 vignettes
31. [Figure: tree on species A, B, C, D drawn along a time axis]
DL: Boussau et al., Genome Research 2013
DL+T: Szöllősi et al., PNAS 2013
ILS: Mirarab et al., Science 2014
32. [Figure: as on slide 31, with the labels D, DL, LGT and ILS]
DL: Boussau et al., Genome Research 2013
DL+T: Szöllősi et al., PNAS 2013
ILS: Mirarab et al., Science 2014
33. Gene transfers and the quixotic pursuit of the TOL
Doolittle WF, Science 1999
35. Gene transfers and the quixotic pursuit of the TOL
Doolittle WF, Science 1999
“The monistic concept of a single universal tree appears […] increasingly obsolete. […] [It is] no longer the most scientifically productive position to hold […] [It] accounts for only a minority of observations from genomes.”
Bapteste, O’Malley, Beiko, Ereshefsky, Gogarten, Franklin-Hall, Lapointe, Dupré, Dagan, Boucher, Martin, Biology Direct 2009.
36. exODT: a model of gene duplication, transfer, and loss
Assumptions
• Genes evolve along the species tree:
• birth events:
• duplications (rate of duplication)
• transfers (rate of receiving a gene)
• death events:
• losses (rate of loss)
• Each gene family is independent of other genes
• Each gene copy is independent of other copies
• Transfers can go through unsampled/extinct species
37. exODT: a model of gene duplication, transfer, and loss
Szöllősi et al., Syst. Biol. 2013a
39. Better gene trees, fewer transfers
[Figure: RF distance to real tree, usual approach vs. ALE + DTL]
Szöllősi et al., Syst. Biol. 2013b
40. Better gene trees, fewer transfers
[Figure: two panels comparing the usual approach with ALE + DTL: RF distance to real tree, and transfer events per family]
Szöllősi et al., Syst. Biol. 2013b
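The RF (Robinson-Foulds) distance used as an accuracy measure on these slides counts the groupings present in one tree but not in the other. The sketch below is one simple variant for rooted trees, written for illustration only: trees are nested tuples of leaf names, and the example trees are made up. The slides do not say which RF variant or software was used.

```python
def clades(tree):
    """Return the non-trivial clades (frozensets of leaf names) of a rooted
    tree given as nested tuples, e.g. (("A", "B"), ("C", "D"))."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        leaf_set = frozenset().union(*(leaves(child) for child in node))
        found.add(leaf_set)
        return leaf_set

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return found

def rf_distance(tree1, tree2):
    """Number of clades found in exactly one of the two rooted trees."""
    return len(clades(tree1) ^ clades(tree2))

# Hypothetical "real" and reconstructed trees on four species.
real_tree = ((("A", "B"), "C"), "D")
inferred_tree = ((("A", "C"), "B"), "D")
print(rf_distance(real_tree, inferred_tree))  # 2
```

A distance of 0 means the reconstructed tree has exactly the clades of the real tree; here the two trees disagree on one clade each, (A, B) versus (A, C).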
41. Application to real data: Cyanobacteria and Fungi
Cyanobacteria
• > 2.4 billion years old
• 40 species
• 1,200 to 4,500 protein coding genes
• 7,410 gene families
Fungi (Dikarya)
• ~ 1 billion years old
• 28 species
• 5,200 to 10,000 protein coding genes
• 11,387 gene families
Both cases:
• fixed species tree, gene trees inferred using the Duplication, Transfer and Loss model
Szöllősi et al., under review
42. Application to real data: Cyanobacteria and Fungi
Cyanobacteria
• > 2.4 billion years old
• 40 species
• 1,200 to 4,500 protein coding genes
• 7,410 gene families
Fungi (Dikarya)
• ~ 1 billion years old
• 28 species
• 5,200 to 10,000 protein coding genes
• 11,387 gene families
Both cases:
• fixed species tree, gene trees inferred using the Duplication, Transfer and Loss model
Transfers are expected
Transfers should be less frequent
Szöllősi et al., under review
49. Comparing transfer rates
• Cyanobacteria and Fungi differ in their age:
We can compare normalized numbers of events: T/(T+D)
• The Cyanobacteria and Fungi data sets differ in their number of species:
We can perform rarefaction studies
Szöllősi et al., under review
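Both corrections can be sketched in a few lines. The code below computes the normalized ratio T/(T+D) and draws equal-sized random species subsets, which is the sampling step of a rarefaction study; the inference would then be rerun on each subset. The function names and event counts are hypothetical, and subsampling the 40 Cyanobacteria down to the 28 species of the Fungi set is an assumption made for the example, not a detail given on the slide.

```python
import random

def transfer_fraction(n_transfers, n_duplications):
    """Normalized number of transfers, T / (T + D)."""
    total = n_transfers + n_duplications
    if total == 0:
        raise ValueError("no transfer or duplication events")
    return n_transfers / total

def rarefaction_subsets(species, subset_size, n_replicates, rng):
    """Draw random species subsets of equal size for a rarefaction study."""
    return [sorted(rng.sample(species, subset_size)) for _ in range(n_replicates)]

rng = random.Random(0)
cyano_species = [f"cyano_{i:02d}" for i in range(40)]     # placeholder names
subsets = rarefaction_subsets(cyano_species, 28, 5, rng)  # 5 replicates of 28 species
print(transfer_fraction(30, 10))  # made-up counts: 30 transfers, 10 duplications
```

Dividing by T+D removes the dependence on the total amount of gene birth, so data sets of different ages can be compared on the same scale.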
58. Using transfers to date clades
[Figure: species tree drawn along a time axis, with a question mark]
Because we can identify gene transfers, we have information for ordering the nodes of a species tree
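One way to see how transfers carry ordering information: a gene can only be received by a lineage that has not yet ended when the donor lineage exists, so the node at the top of the donor branch must be older than the node at the bottom of the recipient branch. The sketch below collects such constraints and checks that they are mutually compatible with a topological sort. It is an illustration under that single simplifying rule, not the method used in the cited work; the tree, node names, and transfers are hypothetical.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical rooted species tree as child -> parent; a branch is named
# after the node at its lower end.
parent = {"A": "AB", "B": "AB", "C": "CD", "D": "CD", "AB": "root", "CD": "root"}

def older_than_constraints(parent, transfers):
    """Map each node to the set of nodes that must be older than it."""
    older = {node: {par} for node, par in parent.items()}  # tree constraints
    for donor, recipient in transfers:
        # The donor branch starts at parent[donor]; that node must predate
        # the end of the recipient branch.
        older.setdefault(recipient, set()).add(parent[donor])
    return older

def node_order(parent, transfers):
    """Return one oldest-to-youngest node order compatible with all
    constraints, or None if the transfers contradict each other."""
    graph = older_than_constraints(parent, transfers)
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None

order = node_order(parent, [("A", "CD")])  # transfer from branch A to branch CD
print(order)
```

With the single transfer above, node AB is forced to be older than node CD, an ordering the tree topology alone cannot provide. Adding a transfer from branch C to branch AB would require the opposite order, and `node_order` would return `None`.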
59. Bayesian species tree inference accounting for DTL events
• STRALE:
• A Bayesian probabilistic method that can interpret thousands of gene trees in terms of:
• speciation events
• duplication events (D)
• transfer events (T)
• loss events (L)
• A method able to estimate the DTL rates
• A method able to reconstruct the species tree
• A method able to order the nodes of the species tree
61. Conclusion on DTL models
• The use of DTL models shows that the number of gene transfers has so far been overestimated
• DTL models can be used to study genome evolution and in particular rates of gene transfer
• DTL models can be used to date the nodes of a species phylogeny
• DTL models should provide a powerful tool to infer an accurate account of the history of life
62. Plan
1. Gene duplications and losses
• Mammalian genomes
2. Gene duplications, losses and transfers
• Fungi and Cyanobacteria
3. A fast approach to dealing with incomplete lineage sorting
• Birds
4. 2 vignettes
63. [Figure: tree on species A, B, C, D drawn along a time axis]
DL: Boussau et al., Genome Research 2013
DL+T: Szöllősi et al., PNAS 2013
ILS: Mirarab et al., Science 2014
64. [Figure: as on slide 63, with the labels D, DL, LGT and ILS]
DL: Boussau et al., Genome Research 2013
DL+T: Szöllősi et al., PNAS 2013
ILS: Mirarab et al., Science 2014
65. The multispecies coalescent
Rannala and Yang, Genetics 2003
• Divergence times in the species tree
• Divergence times in the gene trees
• Effective population sizes in the species tree
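To show how divergence times and effective population sizes interact, here is the textbook three-species case, which is not taken from the slide: under the multispecies coalescent, the probability that a gene tree has the same topology as the species tree ((A,B),C) is 1 - (2/3)·exp(-T), where T is the internal branch length in coalescent units. The sketch assumes diploid populations, so that T is the number of generations divided by 2·Ne; the function name and numbers are illustrative.

```python
import math

def concordance_probability(generations, effective_population_size):
    """P(gene tree topology matches species tree) for three species ((A,B),C).

    `generations` is the length of the internal branch of the species tree.
    For a diploid population (assumption of this sketch), its length in
    coalescent units is generations / (2 * Ne).
    """
    coalescent_units = generations / (2.0 * effective_population_size)
    return 1.0 - (2.0 / 3.0) * math.exp(-coalescent_units)

# Short internal branch relative to population size: frequent discordance.
print(concordance_probability(10_000, 100_000))
# Long internal branch: gene trees almost always match the species tree.
print(concordance_probability(1_000_000, 10_000))
```

A short internal branch or a large population both shrink T, which pushes the probability toward 1/3, the value expected when the three topologies are equally likely.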
66. Faster alternatives to the multispecies coalescent use fixed gene trees
E.g.: MP-EST (Liu, Yu and Edwards, 2010)
Input: fixed gene trees
Output: species tree with branch lengths in coalescent units
Has been shown to be consistent, under one notable assumption: gene trees are correct.
67. Errors in gene trees decrease the accuracy of estimated species trees
Mirarab, Bayzid and Warnow, Syst. Biol. 2014
76. Improving statistical binning: weighted statistical binning
Practice: weighted binning and unweighted binning have about the same accuracy
Theory: weighted statistical binning can be shown to be consistent, whereas unweighted statistical binning is not.
Mirarab et al., PLoS One, accepted
77. Plan
1. Gene duplications and losses
• Mammalian genomes
2. Gene duplications, losses and transfers
• Fungi and Cyanobacteria
3. A fast approach to dealing with incomplete lineage sorting
• Birds
4. 2 vignettes
78. RevBayes
• R-like language
• Model-based phylogenetics
• Many models of sequence evolution
• Models for dating
• Models for phylogeography
• Models for continuous traits
• Models for gene tree/species tree inference
• http://revbayes.net
• Sebastian Hoehna
• Michael Landis
• Tracy Heath
• Fredrik Ronquist
• Nicolas Lartillot
• Brian Moore
• John Huelsenbeck
• …
82. Conclusions
• We develop methods for gene tree and species tree inference
• Improvement of gene trees and species trees in the presence of:
• duplications and losses,
• transfers,
• incomplete lineage sorting
• Parallel algorithms applicable to genome-scale data
• We study the evolution of life, ancient and recent
83. RevBayes collaborators:
• Sebastian Hoehna
• Michael Landis
• Tracy Heath
• Fredrik Ronquist
• Brian Moore
• John Huelsenbeck
• …
Lyon collaborators:
• Adrián Arellano Davín
• Gergely Szöllősi (Budapest)
• Vincent Daubin
• Eric Tannier
• Thomas Bigot
• Magali Semeria
• Manolo Gouy
• Laurent Duret
• Nicolas Lartillot
Austin/Illinois collaborators:
• Siavash Mirarab
• Md. Shamsuzzoha Bayzid
• Tandy Warnow
Thanks!