This is a progress report presented to the Phylogenomics Group at UVigo in May 2013, about the current status of the software guenomu and the Bayesian model implemented.
At that time I was experimenting with a mixture model, which has since been abandoned, and with the Hdist, which is still experimental. The presentation also describes the exchange algorithm for doubly-intractable distributions, the generalized Multiple-Try Metropolis, and the parallel PRNG used to minimize communication between jobs.
2. Outline
1 The Model
2 The Sampling
3 The Code
Leo Martins (U Vigo) guenomu software 2013/5/16 2 / 15
3. Hierarchical Bayesian model
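$$
P(S, \Theta \mid D) \propto P(\theta_0)\,P(\lambda_0)\,P(\alpha_0)\,P(S) \times \prod_{i=1}^{N} P(D_i \mid G_i, \theta_i)\,P(\theta_i \mid \theta_0)\,P(G_i \mid \lambda_i, w_i, S)\,P(\lambda_i \mid \lambda_0)\,P(w_i \mid \alpha_i)\,P(\alpha_i \mid \alpha_0)
$$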
8. The mixture of distance distributions
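$$
P(G \mid \lambda, w, S) = \frac{w_1 e^{-\left(d_{DUPS}(G,S)/\lambda_{DUPS} + d_{LOSS}(G,S)/\lambda_{LOSS}\right)} + w_2 e^{-d_{ILS}(G,S)/\lambda_{ILS}} + w_3 e^{-d_{RF}(G,S)/\lambda_{RF}}}{Z(\lambda, w, S)}
$$
$w_i \sim \mathrm{Gamma}(\alpha_{gene}, 1)$
$\lambda_x \sim \mathrm{Exp}(\Lambda_x)$
each gene has its own set of $w_i$ and $\lambda_i$
the distances $d_x(G, S)$ are scaled to account for different gene family sizes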
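As an illustration of the mixture above, here is a minimal Python sketch of its numerator only. The function name and the dictionary layout are hypothetical, the distances are assumed to be already computed and scaled, and the normalizing constant Z(λ, w, S) is deliberately not evaluated.

```python
import math

def unnormalized_mixture(d, lam, w):
    """Numerator of P(G | lambda, w, S); Z(lambda, w, S) is left out.

    d, lam: dicts keyed by "DUPS", "LOSS", "ILS", "RF" (hypothetical layout)
    w: the three mixture weights (w1, w2, w3)
    """
    # Duplications and losses share one component of the mixture.
    duploss = math.exp(-(d["DUPS"] / lam["DUPS"] + d["LOSS"] / lam["LOSS"]))
    ils = math.exp(-(d["ILS"] / lam["ILS"]))
    rf = math.exp(-(d["RF"] / lam["RF"]))
    return w[0] * duploss + w[1] * ils + w[2] * rf

# Made-up values, for illustration only.
lam = {"DUPS": 2.0, "LOSS": 2.0, "ILS": 1.0, "RF": 4.0}
zero = {"DUPS": 0.0, "LOSS": 0.0, "ILS": 0.0, "RF": 0.0}
value_at_zero = unnormalized_mixture(zero, lam, (0.5, 0.3, 0.2))
value_at_lam = unnormalized_mixture(lam, lam, (1.0, 1.0, 1.0))
```

When every distance is zero the numerator reduces to the sum of the weights, which is a quick sanity check on any implementation.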
15. Doubly-intractable distributions
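$$
\pi(y \mid \theta) = \frac{q_\theta(y)}{Z(\theta)} = \frac{e^{\theta^t s(y)}}{Z(\theta)}; \qquad Z(\theta) = \sum_{y} e^{\theta^t s(y)} \tag{1}
$$
augmented distribution: $\pi(\theta', y', \theta \mid y) \propto \pi(y \mid \theta)\,\pi(\theta)\,h(\theta' \mid \theta)\,\pi(y' \mid \theta')$
Gibbs update of the auxiliary variables $\theta'$, $y'$:
I. draw $\theta' \sim h(\cdot \mid \theta)$
II. draw $y' \sim \pi(\cdot \mid \theta')$
exchange ratio from $\theta$ to $\theta'$:
$$
\min\left(1, \frac{q_{\theta}(y')\,\pi(\theta')\,h(\theta \mid \theta')\,q_{\theta'}(y)}{q_{\theta}(y)\,\pi(\theta)\,h(\theta' \mid \theta)\,q_{\theta'}(y')}\right) \tag{2}
$$
We draw $y'$ (the gene tree) through a secondary MCMC starting at its current value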
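The update above can be sketched on a toy problem. This is not the guenomu implementation: all names are hypothetical, the proposal h is assumed to be a symmetric random walk (so the h terms of ratio (2) cancel), and the auxiliary y' is drawn exactly by enumerating a small state space rather than through a secondary MCMC.

```python
import math
import random

STATES = list(range(11))           # toy state space for y, with s(y) = y

def log_q(theta, y):
    return theta * y               # log of the unnormalized q_theta(y)

def log_prior(theta):
    return -0.5 * theta * theta    # standard normal prior, up to a constant

def sample_aux(theta, rng):
    # Exact draw y' ~ pi(. | theta); feasible only because the toy space is tiny.
    weights = [math.exp(log_q(theta, s)) for s in STATES]
    return rng.choices(STATES, weights=weights)[0]

def exchange_step(theta, y, log_q, log_prior, sample_aux, rng, step=0.5):
    """One exchange update of theta given data y (symmetric h assumed)."""
    theta_new = theta + rng.gauss(0.0, step)      # I.  theta' ~ h(. | theta)
    y_aux = sample_aux(theta_new, rng)            # II. y' ~ pi(. | theta')
    # Log of ratio (2); Z(theta) never appears.
    log_ratio = (log_q(theta, y_aux) + log_prior(theta_new) + log_q(theta_new, y)
                 - log_q(theta, y) - log_prior(theta) - log_q(theta_new, y_aux))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return theta_new
    return theta

rng = random.Random(1)
theta = 0.0
for _ in range(200):
    theta = exchange_step(theta, 7, log_q, log_prior, sample_aux, rng)
```

The point of the construction is visible in `log_ratio`: only unnormalized terms are evaluated, because the normalizing constants of the data and of the auxiliary draw cancel.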
16. Species tree proposal with the exchange algorithm
22. Generalized Multiple-Try Metropolis
MH: sample $y$ and accept it with probability $\min(1, r)$
$$
r = \frac{\pi(y)}{\pi(x)}\,\frac{q(y, x)}{q(x, y)} = \frac{\pi(y)}{\pi(x)}\,\frac{p(x \mid y)}{p(y \mid x)}
$$
MTM: choose $y$ among several samples, according to their relative weights
$$
r = \frac{w(y_1, x) + \cdots + w(y_k, x)}{w(x_1^*, y) + \cdots + w(x_k^*, y)}
$$
where $w(x, y) = \pi(x)\,q(x, y)\,\lambda(x, y) = \pi(x)\,p(y \mid x)\,\lambda(x, y)$
GMTM: the weights $w(\cdot)$ do not need to represent probability distributions.
$$
r = \frac{\pi(y)\,p_k(x \mid y)}{\pi(x)\,p_k(y \mid x)}\,\frac{W_x}{W_y}
$$
where $W_y = \dfrac{w_i(y_i, x)}{\sum_{j=1}^{k} w_j(y_j, x)}$ for the chosen element $i$
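A minimal sketch of the plain MTM step (not GMTM) on the real line may help to fix ideas. It assumes a symmetric Gaussian proposal and λ(x, y) = 1/q(x, y), so that the weight reduces to w(x, y) = π(x); the function name, the target and the tuning values are all hypothetical.

```python
import math
import random

def mtm_step(x, log_pi, rng, k=5, step=1.0):
    """One Multiple-Try Metropolis update with weights w(x, y) = pi(x)."""
    # Draw k candidates from the proposal centred at the current state.
    trials = [x + rng.gauss(0.0, step) for _ in range(k)]
    w_trials = [math.exp(log_pi(t)) for t in trials]
    # Choose y among the candidates according to their relative weights.
    y = rng.choices(trials, weights=w_trials)[0]
    # Reference set: k - 1 draws centred at y, plus the current state x.
    refs = [y + rng.gauss(0.0, step) for _ in range(k - 1)] + [x]
    w_refs = [math.exp(log_pi(v)) for v in refs]
    r = sum(w_trials) / sum(w_refs)
    return y if rng.random() < min(1.0, r) else x

def log_normal(v):
    return -0.5 * v * v            # toy target: standard normal, unnormalized

rng = random.Random(42)
x, samples = 0.0, []
for _ in range(5000):
    x = mtm_step(x, log_normal, rng)
    samples.append(x)
```

Each step costs 2k - 1 target evaluations instead of one, which pays off only when those evaluations are cheap relative to the gain in mixing.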
23. gene tree proposal with GMTM or MTM
27. RF distance, Assignment cost (Hdist)
30. A parallel pseudo-random number generator (PRNG)
Given a seed and an algorithm, we have a stream of PRNs.
[Figure: diagram with the nodes seed, PRNG1 and four PRNG2 instances, the streams x1, x2, x3, x4, and the values x11, x12.]
Using a second algorithm, the first stream will give us a sequence of seeds. We use the 150 parameter sets for the Tausworthe (LFSR) generators (L'Ecuyer, Math. Comput. 1999, p. 261).
Therefore, given the seed, we can predict all states of all streams.
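The seeding scheme can be sketched in a few lines of Python. This is an assumption-laden illustration: `make_streams` is a hypothetical helper, and Python's `random.Random` (Mersenne Twister) stands in for both generators, whereas the slides use Tausworthe (LFSR) generators with distinct parameter sets.

```python
import random

def make_streams(seed, n_streams):
    """Derive n_streams generators from one master seed.

    The master generator plays the role of PRNG1: its output is used only
    as seeds for the generators that produce the actual streams.
    """
    master = random.Random(seed)
    child_seeds = [master.getrandbits(64) for _ in range(n_streams)]
    return [random.Random(s) for s in child_seeds]

# Rebuilding from the same seed reproduces every stream exactly.
first = [g.random() for g in make_streams(2013, 4)]
again = [g.random() for g in make_streams(2013, 4)]
```

Because everything is a deterministic function of the single seed, any process that knows the seed can rebuild any stream without receiving generator states from another process.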
34. A parallel pseudo-random number generator (PRNG)
In our gene/species model:
[Figure: diagram with the nodes seed, PRNG1 and four PRNG2 instances, the streams x1, x2, x3, x4, and the values x11, x12.]
we split gene families among jobs
all jobs receive the seed (broadcast) and can therefore reproduce the same x1. That is cheaper than communicating the states.
each job uses its own x(i+1) for sampling new gene trees etc. and can work in parallel. All jobs use the common x1 for sampling e.g. a new species tree, which needs synchronization.
the only thing that must be shared is thus the proposal values (AllReduce) when updating "global" parameters, so that all jobs can make the same acceptance/rejection decision.
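The scheme above can be imitated in a single process, without MPI, to see why the jobs stay synchronized. Everything here is a hypothetical stand-in: `random.Random` replaces the Tausworthe generators, a plain `sum` replaces AllReduce, and the log-ratio contribution is a toy expression rather than the guenomu likelihood.

```python
import math
import random

def job_streams(seed, job, n_jobs):
    """(common, private) generators for one job, rebuilt locally from the seed."""
    master = random.Random(seed)
    seeds = [master.getrandbits(64) for _ in range(n_jobs + 1)]
    return random.Random(seeds[0]), random.Random(seeds[job + 1])

def simulate(seed=2013, n_jobs=3, n_iter=5):
    jobs = [job_streams(seed, j, n_jobs) for j in range(n_jobs)]
    local = [0.0] * n_jobs     # one private parameter per job
    shared = [0.0] * n_jobs    # each job's own copy of the global parameter
    for _ in range(n_iter):
        # Private updates use the private streams and need no communication.
        for j, (_, private) in enumerate(jobs):
            local[j] += private.gauss(0.0, 1.0)
        # Every job draws the same proposal from its copy of the common stream.
        proposals = [shared[j] + common.gauss(0.0, 1.0)
                     for j, (common, _) in enumerate(jobs)]
        # Stand-in for AllReduce: sum the per-job toy contributions.
        total = sum(abs(local[j] - shared[j]) - abs(local[j] - proposals[j])
                    for j in range(n_jobs))
        # Same uniform draw on every job, hence the same decision.
        for j, (common, _) in enumerate(jobs):
            if common.random() < math.exp(min(0.0, total)):
                shared[j] = proposals[j]
    return shared, local

shared, local = simulate()
```

After any number of iterations the copies in `shared` are identical across jobs even though no proposal or random state was exchanged; only the summed contribution was.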
35. Each job looks like an independent analysis