A paper on parallel Monte-Carlo Tree Search:
@inproceedings{bourki:inria-00512854,
hal_id = {inria-00512854},
url = {http://hal.inria.fr/inria-00512854},
title = {{Scalability and Parallelization of Monte-Carlo Tree Search}},
author = {Bourki, Amine and Chaslot, Guillaume and Coulm, Matthieu and Danjean, Vincent and Doghmen, Hassen and H{\'e}rault, Thomas and Hoock, Jean-Baptiste and Rimmel, Arpad and Teytaud, Fabien and Teytaud, Olivier and Vayssi{\`e}re, Paul and Yu, Ziqin},
booktitle = {{The International Conference on Computers and Games 2010}},
address = {Kanazawa, Japan},
audience = {international},
collaboration = {Grid'5000},
year = {2010},
pdf = {http://hal.inria.fr/inria-00512854/PDF/newcluster.pdf},
}
And a paper on parallel optimization:
@inproceedings{teytaud:inria-00369781,
hal_id = {inria-00369781},
url = {http://hal.inria.fr/inria-00369781},
title = {{On the parallel speed-up of Estimation of Multivariate Normal Algorithm and Evolution Strategies}},
author = {Teytaud, Fabien and Teytaud, Olivier},
abstract = {{Motivated by parallel optimization, we experiment EDA-like adaptation-rules in the case of $\lambda$ large. The rule we use, essentially based on estimation of multivariate normal algorithm, is (i) compliant with all families of distributions for which a density estimation algorithm exists (ii) simple (iii) parameter-free (iv) better than current rules in this framework of $\lambda$ large. The speed-up as a function of $\lambda$ is consistent with theoretical bounds.}},
language = {English},
affiliation = {Institut National de la Recherche en Informatique et en Automatique - INRIA FUTURS , UFR Sciences - Universit{\'e} Paris-Sud XI , TAO - INRIA Futurs , Laboratoire de Recherche en Informatique - LRI , TAO - INRIA Saclay - Ile de France},
booktitle = {{EvoNum (evostar workshop)}},
publisher = {Springer},
address = {Tuebingen, Germany},
volume = {EvoNum},
audience = {international},
collaboration = {Grid'5000},
year = {2009},
pdf = {http://hal.inria.fr/inria-00369781/PDF/lambdaLarge.pdf},
}
Parallel Artificial Intelligence and Parallel Optimization: a Bias and Variance Point of View
1. High-performance computing
High-performance computing in Artificial Intelligence & Optimization
Olivier.Teytaud@inria.fr + many people
TAO, Inria-Saclay IDF, Cnrs 8623,
Lri, Univ. Paris-Sud,
Digiteo Labs, Pascal
Network of Excellence.
NCHC, Taiwan.
November 2010.
2. Disclaimer
Many works in parallelism are about
technical tricks in SMP programming,
message-passing, network organization.
==> often moderate improvements, but
for all users of a given
library/methodology.
Here, the opposite point of view:
don't worry about a 10% loss due to
suboptimal programming;
try to benefit from huge machines.
6. Parallelism
Basic principle (here!):
Using more CPUs to be faster
Various cases:
Many cores in one machine (shared memory)
Many cores on a same fast network
(explicit fast communications)
7. Parallelism
Basic principle (here!):
Using more CPUs to be faster
Various cases:
Many cores in one machine (shared memory)
Many cores on a same fast network
(explicit fast communications)
Many cores on a network
(explicit slow communications)
8. Parallelism
Various cases:
Many cores in one machine (shared memory)
==> your laptop
Many cores on a same fast network
(explicit fast communications)
==> your favorite cluster
Many cores on a network
(explicit slow communications)
==> your grid or your lab or internet
9. Parallelism
Definitions:
p = number of processors
Speed-up(P) = (time to reach a given precision with p = 1)
              / (time to reach the same precision with p = P)
Efficiency(p) = speed-up(p)/p
(usually at most 1)
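As a quick illustration of these definitions, a minimal Python sketch (the timing numbers are hypothetical, just for the example):
```python
# Minimal sketch of the speed-up / efficiency definitions above.
# time_to_precision[p] = wall-clock time needed to reach a fixed
# target precision with p processors (hypothetical measurements).

def speed_up(time_to_precision, p):
    """Time with 1 processor divided by time with p processors."""
    return time_to_precision[1] / time_to_precision[p]

def efficiency(time_to_precision, p):
    """Speed-up divided by the number of processors (usually <= 1)."""
    return speed_up(time_to_precision, p) / p

times = {1: 100.0, 10: 12.0, 100: 2.5}   # seconds to reach the target precision
for p in (10, 100):
    print(p, speed_up(times, p), efficiency(times, p))
```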
11. Bias and variance
I compute x on a computer.
It's imprecise, I get x'.
How can I parallelize this to
make it faster ?
12. Bias and variance
I compute x on a computer.
It's imprecise, I get x'.
What happens if I compute x
1000 times,
on 1000 different machines ?
I get x'1,...,x'1000.
x' = average( x'1,...,x'1000 )
13. Bias and variance
x' = average( x'1,...,x'1000 )
If the algorithm is deterministic:
all x'i are equal
no benefit
Speed-up = 1, efficiency → 0
==> not good! (trouble=bias!)
14. Bias and variance
x' = average( x'1,...,x'1000 )
If the algorithm is deterministic:
all x'i are equal
no benefit
Speed-up = 1, efficiency → 0
==> not good!
If unbiased Monte-Carlo estimate:
- speed-up=p, efficiency=1
==> ideal case! (trouble = variance)
15. Bias and variance, concluding
Two classical notions for an estimator x':
Bias = E(x' - x)
Variance = E[(x' - E x')²]
Parallelism can easily reduce variance;
parallelism cannot easily reduce the bias.
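A small Python experiment illustrating this point: averaging over p machines shrinks the variance roughly like 1/p, but the bias survives the averaging unchanged (the biased, noisy estimator below is made up for illustration):
```python
import random

# Toy illustration of the bias/variance remark above.
# True value x = 0; each machine returns x' = x + bias + noise.
# Averaging over p machines reduces the noise (variance ~ 1/p)
# but the systematic bias term is unaffected.

TRUE_X = 0.0
BIAS = 0.5        # systematic error of the estimator (hypothetical)
NOISE_STD = 2.0   # standard deviation of the per-machine noise

def one_machine_estimate(rng):
    return TRUE_X + BIAS + rng.gauss(0.0, NOISE_STD)

def parallel_estimate(p, rng):
    """Average of p independent estimates (one per machine)."""
    return sum(one_machine_estimate(rng) for _ in range(p)) / p

rng = random.Random(0)
for p in (1, 10, 100, 1000):
    estimates = [parallel_estimate(p, rng) for _ in range(200)]
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    print(f"p={p:5d}  mean error={mean - TRUE_X:+.3f}  variance={var:.4f}")
```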
16. AI & optimization: bias &
variance everywhere
Parallelism
Bias & variance
AI & Optimization
Optimization
Supervised machine learning
Multistage decision making
Conclusions
17. AI & optimization: bias &
variance everywhere
Many (parts of) algorithms can be rewritten
as follows:
Generate sample x1,...,xλ using current
knowledge
Work on x1,...,xλ, get y1,...,yλ.
Update knowledge.
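A minimal Python skeleton of this template, parallelizing only the "work" step; all function bodies are toy placeholders (not the talk's code):
```python
from multiprocessing import Pool
import random

# Generic "generate / work / update" template from the slide above.
# Only the embarrassingly-parallel "work" step is farmed out to workers;
# the knowledge update stays sequential. All bodies are toy placeholders.

def generate(knowledge, lam, rng):
    """Draw lambda candidates from the current knowledge (here: a mean)."""
    return [knowledge + rng.gauss(0.0, 1.0) for _ in range(lam)]

def work(x):
    """Expensive evaluation of one candidate (toy fitness)."""
    return x * x

def update(knowledge, xs, ys):
    """Refit the knowledge from the evaluated sample (toy rule: best point)."""
    best = min(zip(xs, ys), key=lambda xy: xy[1])[0]
    return 0.5 * knowledge + 0.5 * best

if __name__ == "__main__":
    rng = random.Random(0)
    knowledge, lam = 5.0, 32
    with Pool() as pool:
        for _ in range(20):
            xs = generate(knowledge, lam, rng)
            ys = pool.map(work, xs)        # parallel "work" step
            knowledge = update(knowledge, xs, ys)
    print("final knowledge:", knowledge)
```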
19. Example 1: evolutionary
optimization
Initial knowledge = Gaussian
distribution
While (I have time)
Generate sample x1,...,xλ using current
knowledge
Work on x1,...,xλ, get y1,...,yλ.
Update knowledge.
20. Example 1: evolutionary
optimization
Initial knowledge = Gaussian
distribution G (mean m, variance σ²)
While (I have time)
Generate sample x1,...,xλ using G
Work on x1,...,xλ, get y1,...,yλ.
Update knowledge.
21. Example 1: evolutionary
optimization
Initial knowledge = Gaussian
distribution G (mean m, variance σ²)
While (I have time)
Generate sample x1,...,xλ using G
Work on x1,...,xλ, get
y1=fitness(x1),...,yλ=fitness(xλ).
Update knowledge.
22. Example 1: evolutionary
optimization
Initial knowledge = Gaussian
distribution G (mean m, variance σ²)
While (I have time)
Generate sample x1,...,xλ using G
Work on x1,...,xλ, get
y1=fitness(x1),...,yλ=fitness(xλ).
Update G (rank the xi's):
m = mean(selected xi's)
σ² = var(selected xi's)
23. Example 1: evolutionary
optimization
MANY EVOLUTIONARY ALGORITHMS ARE WEAK FOR
LAMBDA LARGE. THEY CAN BE EASILY OPTIMIZED
BY A BIAS / VARIANCE ANALYSIS.
Initial knowledge = Gaussian
distribution G (mean m, variance σ²)
While (I have time)
Generate sample x1,...,xλ using G
Work on x1,...,xλ, get
y1=fitness(x1),...,yλ=fitness(xλ).
Update G (rank the xi's):
m = mean(selected xi's)
σ² = var(selected xi's)
24. Ex. 1: bias & variance for EO
Initial knowledge = Gaussian
distribution G (mean m, variance σ²)
While (I have time)
Generate sample x1,...,xλ using G
Work on x1,...,xλ, get
y1=fitness(x1),...,yλ=fitness(xλ).
Update G (rank the xi's):
m = mean(selected xi's) <== unweighted!
σ² = var(selected xi's)
25. Ex. 1: bias & variance for EO
Huge improvement in EMNA for lambda
large, just by taking the bias/variance
decomposition into account: reweighting is
necessary for cancelling the bias.
Other improvements by classical statistical
tricks:
Reducing the selection ratio μ/λ for λ large;
Using quasi-random mutations.
==> really simple and crucial for large
population sizes (not just for publishing :-) ).
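A toy 1-D sketch of such an EMNA-style update in Python, with a reweighting switch; the log-rank weights and the μ = λ/4 truncation are illustrative choices, not necessarily the exact rule of the cited paper:
```python
import math
import random

# Toy 1-D EMNA-style loop illustrating the slides above.
# weighted=False: plain update (unweighted mean/variance of the mu best).
# weighted=True: log-rank weights, one common way of reweighting the
# selected points (illustrative choice, not necessarily the paper's rule).

def fitness(x):
    return (x - 3.0) ** 2   # toy objective, optimum at x = 3

def emna_step(m, sigma, lam, rng, weighted):
    xs = [m + sigma * rng.gauss(0.0, 1.0) for _ in range(lam)]
    xs.sort(key=fitness)                      # rank the xi's
    mu = lam // 4                             # keep the best quarter
    sel = xs[:mu]
    if weighted:
        w = [math.log(mu + 0.5) - math.log(i + 1) for i in range(mu)]
        s = sum(w)
        w = [wi / s for wi in w]
    else:
        w = [1.0 / mu] * mu                   # unweighted!
    new_m = sum(wi * xi for wi, xi in zip(w, sel))
    new_var = sum(wi * (xi - new_m) ** 2 for wi, xi in zip(w, sel))
    return new_m, math.sqrt(new_var) + 1e-12

rng = random.Random(1)
m, sigma = 0.0, 5.0
for _ in range(30):
    m, sigma = emna_step(m, sigma, lam=1000, rng=rng, weighted=True)
print("estimated optimum:", m)
```
With weighted=False the selected points are averaged uniformly, which is exactly the "unweighted!" update criticized above.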
26. Ex. 1: bias & variance for EO
Initial knowledge = Gaussian
distribution G (mean m, variance σ²)
While (I have time)
Generate sample x1,...,xλ using G
Work on x1,...,xλ, get
y1=fitness(x1),...,yλ=fitness(xλ).
Update G (rank the xi's):
m = mean(selected xi's) <== unweighted!
σ² = var(selected xi's)
27. Example 2: supervised machine
learning (huge dataset)
Generate sample x1,...,xλ using current
knowledge
Work on x1,...,xλ, get y1,...,yλ.
Update knowledge.
28. Example 2: supervised machine
learning (huge dataset D)
Generate data sets D1,...,Dλ using current
knowledge (subsets of the database)
Work on D1,...,Dλ, get f1,...,fλ (by learning)
Average the fi's.
==> (su)bagging: Di = subset of D
==> random subspace: Di = projection of D on a
random vector space
==> random noise: Di = D + noise
==> random forest: Di = D, but noisy algo
29. Example 2: supervised machine
learning (huge dataset D)
Easy tricks for parallelizing supervised
machine learning:
- use (su)bagging
- use random subspaces
- use averages of randomized algorithms
(random forests)
- do the cross-validation in parallel
==> from my experience, complicated parallel tools
are not that important: …
- polemical issue: many papers on sophisticated
parallel supervised machine learning algorithms;
- I might be wrong :-)
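A minimal Python sketch of (su)bagging as above: each worker learns on a random subset Di of D and the predictors fi are averaged; the 1-D least-squares base learner is just a placeholder for any learner:
```python
import random
from multiprocessing import Pool

# Toy sketch of (su)bagging for parallel supervised learning: each worker
# fits a base learner on a random subset of D, and the resulting predictors
# f_i are simply averaged. The base learner (1-D least-squares line) is a
# placeholder; any learner works the same way.

def fit_line(subset):
    """Least-squares fit y ~ a*x + b on a list of (x, y) pairs."""
    n = len(subset)
    sx = sum(x for x, _ in subset); sy = sum(y for _, y in subset)
    sxx = sum(x * x for x, _ in subset); sxy = sum(x * y for x, y in subset)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def predict(models, x):
    """Average the predictions of all base models."""
    return sum(a * x + b for a, b in models) / len(models)

if __name__ == "__main__":
    rng = random.Random(0)
    D = [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.5)) for x in
         [rng.uniform(0, 10) for _ in range(10_000)]]
    subsets = [rng.sample(D, 500) for _ in range(16)]   # D_i = subset of D
    with Pool() as pool:
        models = pool.map(fit_line, subsets)            # learn in parallel
    print("prediction at x=4:", predict(models, 4.0))   # ~ 9.0
```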
30. Example 2: active supervised
machine learning (huge dataset)
While I have time
Generate sample x1,...,xλ using current
knowledge (e.g. sample the
max-uncertainty region)
Work on x1,...,xλ, get y1,...,yλ (labels by
experts / expensive code)
Update knowledge (approximate model).
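A toy sketch of this active-learning loop: the "model" is a 1-D threshold, uncertainty is distance to the current threshold, and the expensive expert labelling of the λ selected points runs in parallel; all names and rules here are illustrative, not from the talk:
```python
from multiprocessing import Pool
import random

# Toy active-learning loop: label the lambda most uncertain points in
# parallel, then update the model. The "expert" is an expensive oracle
# (here just a hidden threshold); everything is an illustrative placeholder.

HIDDEN_THRESHOLD = 0.37   # what the expert knows and the model must learn

def expert_label(x):
    """Expensive labelling step, one call per worker."""
    return 1 if x >= HIDDEN_THRESHOLD else 0

if __name__ == "__main__":
    rng = random.Random(0)
    lam, threshold = 8, 0.5
    labelled = []
    with Pool() as pool:
        for it in range(10):
            candidates = [rng.random() for _ in range(1000)]
            if it == 0:
                xs = [i / (lam - 1) for i in range(lam)]   # first batch: spread out
            else:
                # sample the max-uncertainty region (closest to the threshold)
                xs = sorted(candidates, key=lambda x: abs(x - threshold))[:lam]
            ys = pool.map(expert_label, xs)                # parallel labelling
            labelled += list(zip(xs, ys))
            zeros = [x for x, y in labelled if y == 0]
            ones = [x for x, y in labelled if y == 1]
            if zeros and ones:
                threshold = (max(zeros) + min(ones)) / 2   # update the model
    print("estimated threshold:", round(threshold, 3))
```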
31. Example 3: decision making
under uncertainty
While I have time
Generate simulations x1,...,xλ using
current knowledge
Work on x1,...,xλ, get y1,...,yλ (get
rewards)
Update knowledge (approximate model).
43. Example 3: decision making
under uncertainty
While I have time
Generate simulations x1,...,xλ using
current knowledge (= scoring rule based
on statistics)
Work on x1,...,xλ, get y1,...,yλ (get
rewards)
Update knowledge (= update statistics in
memory).
44. Example 3: decision making
under uncertainty: parallelizing
While I have time
Generate simulations x1,...,xλ using
current knowledge (= scoring rule based
on statistics)
Work on x1,...,xλ, get y1,...,yλ (get
rewards)
Update knowledge (= update statistics in
memory).
==> “easily” parallelized on multicore
machines
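A sketch of the shared-memory case: several threads run simulations and update one shared statistics table under a lock. The bandit-style scoring rule and coin-flip "simulation" are toy placeholders, not the MoGo code (and in CPython a real speed-up also requires the simulation to release the GIL, e.g. by being C code):
```python
import random
import threading
from collections import defaultdict

# Shared-memory parallelization sketch: all threads share one statistics
# table (move -> [simulations, wins]) protected by a lock. The simulation
# is a toy coin flip; a real program would run a Monte-Carlo playout here.

stats = defaultdict(lambda: [0, 0])   # move -> [sims, wins]
lock = threading.Lock()
MOVES = ["a", "b", "c"]
WIN_PROB = {"a": 0.4, "b": 0.55, "c": 0.5}   # hidden, to be discovered

def worker(n_sims, seed):
    rng = random.Random(seed)
    for _ in range(n_sims):
        with lock:   # read current knowledge (scoring rule based on statistics)
            move = max(MOVES, key=lambda m: (stats[m][1] + 1) / (stats[m][0] + 2)
                                            + rng.random() * 0.1)
        win = rng.random() < WIN_PROB[move]          # run one simulation
        with lock:   # update statistics in memory
            stats[move][0] += 1
            stats[move][1] += int(win)

threads = [threading.Thread(target=worker, args=(5000, s)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print({m: tuple(v) for m, v in stats.items()})   # "b" should get most simulations
```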
45. Example 3: decision making
under uncertainty: parallelizing
While I have time
Generate simulations x1,...,xλ using current knowledge (= scoring rule
based on statistics)
Work on x1,...,xλ, get y1,...,yλ (get rewards)
Update knowledge (= update statistics in memory).
==> parallelized on clusters: one
knowledge base per machine,
average statistics only for crucial
nodes:
nodes with more than 5 % of the sims
nodes at depth < 4
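A sketch of this cluster scheme: each machine keeps local (simulations, wins) statistics, and only crucial nodes (depth < 4, more than 5% of the local simulations) are exchanged and merged. The data layout and the merge-by-summing rule are simplifying assumptions, not the MoGo implementation; the communication layer is abstracted away:
```python
# Cluster parallelization sketch: each machine owns a local tree of
# statistics; only "crucial" nodes (depth < 4 and more than 5% of the
# local simulations) are exchanged and merged across machines.

Stats = dict  # node key -> (simulations, wins); key = (depth, move sequence)

def crucial_nodes(tree: Stats, total_sims: int) -> Stats:
    """Keep only shallow, heavily simulated nodes."""
    return {k: v for k, v in tree.items()
            if k[0] < 4 and v[0] > 0.05 * total_sims}

def merge(trees: list) -> Stats:
    """Sum the (sims, wins) statistics of the crucial nodes of all machines."""
    merged: Stats = {}
    for tree in trees:
        for key, (sims, wins) in tree.items():
            s, w = merged.get(key, (0, 0))
            merged[key] = (s + sims, w + wins)
    return merged

# Two hypothetical machines, each having run 1000 local simulations.
machine_a = {(0, ()): (1000, 520), (1, ("e5",)): (400, 230), (5, ("e5", "d3")): (80, 40)}
machine_b = {(0, ()): (1000, 480), (1, ("e5",)): (30, 12)}

shared = merge([crucial_nodes(machine_a, 1000), crucial_nodes(machine_b, 1000)])
print(shared)   # deep or rarely visited nodes stay local
```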
46. Example 3: decision making
under uncertainty: parallelizing
Good news first: it's simple and it
works on huge clusters!!!
Comparison with voting schemes;
40 machines, 2 seconds per move.
47. Example 3: decision making
under uncertainty: parallelizing
Good news first: it's simple and it
works on huge clusters!!!
Comparing N machines and P machines
==> consistent with linear speed-up in 19x19 !
48. Example 3: decision making
under uncertainty: parallelizing
Good news first: it's simple and it
works on huge clusters!!!
When we have produced these numbers, we
believed we were ready to play Go against very
strong players.
Unfortunately not at all :-)
49. Go: from 29 to 6 stones
1998: loss against amateur (6d) 19x19 H29
2008: win against a pro (8p) 19x19, H9 MoGo
2008: win against a pro (4p) 19x19, H8 CrazyStone
2008: win against a pro (4p) 19x19, H7 CrazyStone
2009: win against a pro (9p) 19x19, H7 MoGo
2009: win against a pro (1p) 19x19, H6 MoGo
2010: win against a pro (4p) 19x19, H6 Zen
2007: win against a pro (5p) 9x9 (blitz) MoGo
2008: win against a pro (5p) 9x9 white MoGo
2009: win against a pro (5p) 9x9 black MoGo
2009: win against a pro (9p) 9x9 white Fuego
2009: win against a pro (9p) 9x9 black MoGoTW
==> still 6 stones at least!
50. Go: from 29 to 6 stones
1998: loss against amateur (6d) 19x19 H29
2008: win against a pro (8p) 19x19, H9 MoGo
2008: win against a pro (4p) 19x19, H8 CrazyStone
2008: win against a pro (4p) 19x19, H7 CrazyStone
2009: win against a pro (9p) 19x19, H7 MoGo
2009: win against a pro (1p) 19x19, H6 MoGo
2010: win against a pro (4p) 19x19, H6 Zen
Wins with H6 / H7 are lucky (rare) wins
2007: win against a pro (5p) 9x9 (blitz) MoGo
2008: win against a pro (5p) 9x9 white MoGo
2009: win against a pro (5p) 9x9 black MoGo
2009: win against a pro (9p) 9x9 white Fuego
2009: win against a pro (9p) 9x9 black MoGoTW
==> still 6 stones at least!
51. Example 3: decision making
under uncertainty: parallelizing
So what happened ?
great speed-up + moderate results;
= contradiction ? ? ?
52. Example 3: decision making
under uncertainty: parallelizing
So what happened ?
great speed-up + moderate results;
= contradiction ? ? ?
Ok, we can simulate the sequential algorithm very
quickly = success.
But even the sequential algorithm is limited, even
with huge computation time!
53. Example 3: decision making
under uncertainty: parallelizing
Poorly handled situation, even with 10 days of CPU!
54. Example 3: decision making
under uncertainty: limited
scalability
(game of Havannah)
==> killed by the bias!
55. Example 3: decision making
under uncertainty: limited
scalability
(game of Go)
==> bias trouble!!!
we reduce the variance but not the
systematic bias.
56. Conclusions
We have seen that “good old”
bias/variance analysis is
quite efficient;
not widely known / used.
57. Conclusions
easy tricks for evolutionary optimization on
grids
==> we published papers with great
speed-ups with just one line of code:
Reweighting mainly,
and also
quasi-random,
selective pressure modified for large pop size.
58. Conclusions
easy tricks for supervised machine
learning:
==> bias/variance analysis here boils
down to: choose an algorithm with more
variance than bias and average:
random subspace;
random subset (subagging);
noise introduction;
“hyper”parameters to be tuned (cross-
validation).
59. Conclusions
For sequential decision making under
uncertainty, disappointing results:
the best algorithms are not
“that” scalable.
A systematic bias remains.
60. Conclusions and references
Our experiments: often on Grid'5000:
~5000 cores - Linux
homogeneous environment
union of high-performance clusters
contains multi-core machines
Monte-Carlo Tree Search for decision
making and uncertainty: Coulom, Kocsis
& Szepesvari, Chaslot et al,...
For parallel evolutionary algorithms: Beyer
et al, Teytaud et al (this Teytaud is not me...).