A paper on parallel Monte-Carlo Tree Search:
@inproceedings{bourki:inria-00512854,
hal_id = {inria-00512854},
url = {http://hal.inria.fr/inria-00512854},
title = {{Scalability and Parallelization of Monte-Carlo Tree Search}},
author = {Bourki, Amine and Chaslot, Guillaume and Coulm, Matthieu and Danjean, Vincent and Doghmen, Hassen and H{\'e}rault, Thomas and Hoock, Jean-Baptiste and Rimmel, Arpad and Teytaud, Fabien and Teytaud, Olivier and Vayssi{\`e}re, Paul and Yu, Ziqin},
booktitle = {{The International Conference on Computers and Games 2010}},
address = {Kanazawa, Japan},
audience = {international},
collaboration = {Grid'5000},
year = {2010},
pdf = {http://hal.inria.fr/inria-00512854/PDF/newcluster.pdf},
}
And a paper on parallel optimization:
@inproceedings{teytaud:inria-00369781,
hal_id = {inria-00369781},
url = {http://hal.inria.fr/inria-00369781},
title = {{On the parallel speed-up of Estimation of Multivariate Normal Algorithm and Evolution Strategies}},
author = {Teytaud, Fabien and Teytaud, Olivier},
abstract = {{Motivated by parallel optimization, we experiment EDA-like adaptation-rules in the case of $\lambda$ large. The rule we use, essentially based on estimation of multivariate normal algorithm, is (i) compliant with all families of distributions for which a density estimation algorithm exists (ii) simple (iii) parameter-free (iv) better than current rules in this framework of $\lambda$ large. The speed-up as a function of $\lambda$ is consistent with theoretical bounds.}},
language = {English},
affiliation = {Institut National de la Recherche en Informatique et en Automatique - INRIA FUTURS , UFR Sciences - Universit{\'e} Paris-Sud XI , TAO - INRIA Futurs , Laboratoire de Recherche en Informatique - LRI , TAO - INRIA Saclay - Ile de France},
booktitle = {{EvoNum (evostar workshop)}},
publisher = {Springer},
address = {Tuebingen, Germany},
volume = {EvoNum},
audience = {international},
collaboration = {Grid'5000},
year = {2009},
pdf = {http://hal.inria.fr/inria-00369781/PDF/lambdaLarge.pdf},
}
Parallel Artificial Intelligence and Parallel Optimization: a Bias and Variance Point of View
1. High-performance computing
High-performance computing in Artificial Intelligence & Optimization
Olivier.Teytaud@inria.fr + many people
TAO, Inria-Saclay IDF, Cnrs 8623,
Lri, Univ. Paris-Sud,
Digiteo Labs, Pascal
Network of Excellence.
NCHC, Taiwan.
November 2010.
2. Disclaimer
Many works in parallelism are about
technical tricks in SMP programming,
message-passing, network organization.
==> often moderate improvements, but
for all users of a given
library/methodology.
Here, the opposite point of view:
don't worry about a 10% loss due to
suboptimal programming;
try to benefit from huge machines.
6. Parallelism
Basic principle (here!):
Using more CPUs to be faster
Various cases:
Many cores in one machine (shared memory)
Many cores on a same fast network
(explicit fast communications)
7. Parallelism
Basic principle (here!):
Using more CPUs to be faster
Various cases:
Many cores in one machine (shared memory)
Many cores on a same fast network
(explicit fast communications)
Many cores on a network
(explicit slow communications)
8. Parallelism
Various cases:
Many cores in one machine (shared memory)
==> your laptop
Many cores on a same fast network
(explicit fast communications)
==> your favorite cluster
Many cores on a network
(explicit slow communications)
==> your grid or your lab or internet
9. Parallelism
Definitions:
p = number of processors
Speed-up(P) = (time to reach a given precision with p = 1)
              / (time to reach the same precision with p = P)
Efficiency(p) = speed-up(p)/p
(usually at most 1)
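As a quick illustration of these definitions, a minimal Python sketch (the timing numbers are hypothetical, just for the example):
```python
# Minimal sketch of the speed-up / efficiency definitions above.
# time_to_precision[p] = wall-clock time needed to reach a fixed
# target precision with p processors (hypothetical measurements).

def speed_up(time_to_precision, p):
    """Time with 1 processor divided by time with p processors."""
    return time_to_precision[1] / time_to_precision[p]

def efficiency(time_to_precision, p):
    """Speed-up divided by the number of processors (usually <= 1)."""
    return speed_up(time_to_precision, p) / p

times = {1: 100.0, 10: 12.0, 100: 2.5}   # seconds to reach the target precision
for p in (10, 100):
    print(p, speed_up(times, p), efficiency(times, p))
```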
11. Bias and variance
I compute x on a computer.
It's imprecise, I get x'.
How can I parallelize this to
make it faster ?
12. Bias and variance
I compute x on a computer.
It's imprecise, I get x'.
What happens if I compute x
1000 times,
on 1000 different machines ?
I get x'1,...,x'1000.
x' = average( x'1,...,x'1000 )
13. Bias and variance
x' = average( x'1,...,x'1000 )
If the algorithm is deterministic:
all x'i are equal
no benefit
Speed-up = 1, efficiency → 0
==> not good! (trouble=bias!)
14. Bias and variance
x' = average( x'1,...,x'1000 )
If the algorithm is deterministic:
all x'i are equal
no benefit
Speed-up = 1, efficiency → 0
==> not good!
If unbiased Monte-Carlo estimate:
- speed-up=p, efficiency=1
==> ideal case! (trouble = variance)
15. Bias and variance, concluding
Two classical notions for an estimator x':
Bias = E(x' - x)
Variance = E[(x' - E x')²]
Parallelism can easily reduce variance;
parallelism cannot easily reduce the bias.
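A small Python experiment illustrating this point: averaging over p machines shrinks the variance roughly like 1/p, but the bias survives the averaging unchanged (the biased, noisy estimator below is made up for illustration):
```python
import random

# Toy illustration of the bias/variance remark above.
# True value x = 0; each machine returns x' = x + bias + noise.
# Averaging over p machines reduces the noise (variance ~ 1/p)
# but the systematic bias term is unaffected.

TRUE_X = 0.0
BIAS = 0.5        # systematic error of the estimator (hypothetical)
NOISE_STD = 2.0   # standard deviation of the per-machine noise

def one_machine_estimate(rng):
    return TRUE_X + BIAS + rng.gauss(0.0, NOISE_STD)

def parallel_estimate(p, rng):
    """Average of p independent estimates (one per machine)."""
    return sum(one_machine_estimate(rng) for _ in range(p)) / p

rng = random.Random(0)
for p in (1, 10, 100, 1000):
    estimates = [parallel_estimate(p, rng) for _ in range(200)]
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    print(f"p={p:5d}  mean error={mean - TRUE_X:+.3f}  variance={var:.4f}")
```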
16. AI & optimization: bias &
variance everywhere
Parallelism
Bias & variance
AI & Optimization
Optimization
Supervised machine learning
Multistage decision making
Conclusions
17. AI & optimization: bias &
variance everywhere
Many (parts of) algorithms can be rewritten
as follows:
Generate sample x1,...,xλ using current
knowledge
Work on x1,...,xλ, get y1,...,yλ.
Update knowledge.
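A minimal Python skeleton of this template, parallelizing only the "work" step; all function bodies are toy placeholders (not the talk's code):
```python
from multiprocessing import Pool
import random

# Generic "generate / work / update" template from the slide above.
# Only the embarrassingly-parallel "work" step is farmed out to workers;
# the knowledge update stays sequential. All bodies are toy placeholders.

def generate(knowledge, lam, rng):
    """Draw lambda candidates from the current knowledge (here: a mean)."""
    return [knowledge + rng.gauss(0.0, 1.0) for _ in range(lam)]

def work(x):
    """Expensive evaluation of one candidate (toy fitness)."""
    return x * x

def update(knowledge, xs, ys):
    """Refit the knowledge from the evaluated sample (toy rule: best point)."""
    best = min(zip(xs, ys), key=lambda xy: xy[1])[0]
    return 0.5 * knowledge + 0.5 * best

if __name__ == "__main__":
    rng = random.Random(0)
    knowledge, lam = 5.0, 32
    with Pool() as pool:
        for _ in range(20):
            xs = generate(knowledge, lam, rng)
            ys = pool.map(work, xs)        # parallel "work" step
            knowledge = update(knowledge, xs, ys)
    print("final knowledge:", knowledge)
```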
19. Example 1: evolutionary
optimization
Initial knowledge = Gaussian
distribution
While (I have time)
Generate sample x1,...,xλ using current
knowledge
Work on x1,...,xλ, get y1,...,yλ.
Update knowledge.
20. Example 1: evolutionary
optimization
Initial knowledge = Gaussian
distribution G (mean m, variance σ²)
While (I have time)
Generate sample x1,...,xλ using G
Work on x1,...,xλ, get y1,...,yλ.
Update knowledge.
21. Example 1: evolutionary
optimization
Initial knowledge = Gaussian
distribution G (mean m, variance σ²)
While (I have time)
Generate sample x1,...,xλ using G
Work on x1,...,xλ, get
y1=fitness(x1),...,yλ=fitness(xλ).
Update knowledge.
22. Example 1: evolutionary
optimization
Initial knowledge = Gaussian
distribution G (mean m, variance σ²)
While (I have time)
Generate sample x1,...,xλ using G
Work on x1,...,xλ, get
y1=fitness(x1),...,yλ=fitness(xλ).
Update G (rank the xi's):
m = mean(selected xi's)
σ² = var(selected xi's)
23. Example 1: evolutionary
optimization
MANY EVOLUTIONARY ALGORITHMS ARE WEAK FOR
LAMBDA LARGE. THEY CAN BE EASILY OPTIMIZED
BY A BIAS / VARIANCE ANALYSIS.
Initial knowledge = Gaussian
distribution G (mean m, variance σ²)
While (I have time)
Generate sample x1,...,xλ using G
Work on x1,...,xλ, get
y1=fitness(x1),...,yλ=fitness(xλ).
Update G (rank the xi's):
m = mean(selected xi's)
σ² = var(selected xi's)
24. Ex. 1: bias & variance for EO
Initial knowledge = Gaussian
distribution G (mean m, variance σ²)
While (I have time)
Generate sample x1,...,xλ using G
Work on x1,...,xλ, get
y1=fitness(x1),...,yλ=fitness(xλ).
Update G (rank the xi's):
m = mean(selected xi's) <== unweighted!
σ² = var(selected xi's)
25. Ex. 1: bias & variance for EO
Huge improvement in EMNA for lambda
large, just by taking the bias/variance
decomposition into account: reweighting is
necessary for cancelling the bias.
Other improvements by classical statistical
tricks:
Reducing the selection ratio μ/λ for λ large;
Using quasi-random mutations.
==> really simple and crucial for large
population sizes (not just for publishing :-) ).
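A toy 1-D sketch of such an EMNA-style update in Python, with a reweighting switch; the log-rank weights and the μ = λ/4 truncation are illustrative choices, not necessarily the exact rule of the cited paper:
```python
import math
import random

# Toy 1-D EMNA-style loop illustrating the slides above.
# weighted=False: plain update (unweighted mean/variance of the mu best).
# weighted=True: log-rank weights, one common way of reweighting the
# selected points (illustrative choice, not necessarily the paper's rule).

def fitness(x):
    return (x - 3.0) ** 2   # toy objective, optimum at x = 3

def emna_step(m, sigma, lam, rng, weighted):
    xs = [m + sigma * rng.gauss(0.0, 1.0) for _ in range(lam)]
    xs.sort(key=fitness)                      # rank the xi's
    mu = lam // 4                             # keep the best quarter
    sel = xs[:mu]
    if weighted:
        w = [math.log(mu + 0.5) - math.log(i + 1) for i in range(mu)]
        s = sum(w)
        w = [wi / s for wi in w]
    else:
        w = [1.0 / mu] * mu                   # unweighted!
    new_m = sum(wi * xi for wi, xi in zip(w, sel))
    new_var = sum(wi * (xi - new_m) ** 2 for wi, xi in zip(w, sel))
    return new_m, math.sqrt(new_var) + 1e-12

rng = random.Random(1)
m, sigma = 0.0, 5.0
for _ in range(30):
    m, sigma = emna_step(m, sigma, lam=1000, rng=rng, weighted=True)
print("estimated optimum:", m)
```
With weighted=False the selected points are averaged uniformly, which is exactly the "unweighted!" update criticized above.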
26. Ex. 1: bias & variance for EO
Initial knowledge = Gaussian
distribution G (mean m, variance σ²)
While (I have time)
Generate sample x1,...,xλ using G
Work on x1,...,xλ, get
y1=fitness(x1),...,yλ=fitness(xλ).
Update G (rank the xi's):
m = mean(selected xi's) <== unweighted!
σ² = var(selected xi's)
27. Example 2: supervised machine
learning (huge dataset)
Generate sample x1,...,xλ using current
knowledge
Work on x1,...,xλ, get y1,...,yλ.
Update knowledge.
28. Example 2: supervised machine
learning (huge dataset D)
Generate data sets D1,...,Dλ using current
knowledge (subsets of the database)
Work on D1,...,Dλ, get f1,...,fλ (by learning)
Average the fi's.
==> (su)bagging: Di = subset of D
==> random subspace: Di = projection of D on a
random vector space
==> random noise: Di = D + noise
==> random forest: Di = D, but noisy algo
29. Example 2: supervised machine
learning (huge dataset D)
Easy tricks for parallelizing supervised
machine learning:
- use (su)bagging
- use random subspaces
- use averages of randomized algorithms
(random forests)
- do the cross-validation in parallel
==> from my experience, complicated parallel tools
are not that important: …
- polemical issue: many papers on sophisticated
parallel supervised machine learning algorithms;
- I might be wrong :-)
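A minimal Python sketch of (su)bagging as above: each worker learns on a random subset Di of D and the predictors fi are averaged; the 1-D least-squares base learner is just a placeholder for any learner:
```python
import random
from multiprocessing import Pool

# Toy sketch of (su)bagging for parallel supervised learning: each worker
# fits a base learner on a random subset of D, and the resulting predictors
# f_i are simply averaged. The base learner (1-D least-squares line) is a
# placeholder; any learner works the same way.

def fit_line(subset):
    """Least-squares fit y ~ a*x + b on a list of (x, y) pairs."""
    n = len(subset)
    sx = sum(x for x, _ in subset); sy = sum(y for _, y in subset)
    sxx = sum(x * x for x, _ in subset); sxy = sum(x * y for x, y in subset)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def predict(models, x):
    """Average the predictions of all base models."""
    return sum(a * x + b for a, b in models) / len(models)

if __name__ == "__main__":
    rng = random.Random(0)
    D = [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.5)) for x in
         [rng.uniform(0, 10) for _ in range(10_000)]]
    subsets = [rng.sample(D, 500) for _ in range(16)]   # D_i = subset of D
    with Pool() as pool:
        models = pool.map(fit_line, subsets)            # learn in parallel
    print("prediction at x=4:", predict(models, 4.0))   # ~ 9.0
```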
30. Example 2: active supervised
machine learning (huge dataset)
While I have time
Generate sample x1,...,xλ using current
knowledge (e.g. sample the
max-uncertainty region)
Work on x1,...,xλ, get y1,...,yλ (labels by
experts / expensive code)
Update knowledge (approximate model).
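A toy sketch of this active-learning loop: the "model" is a 1-D threshold, uncertainty is distance to the current threshold, and the expensive expert labelling of the λ selected points runs in parallel; all names and rules here are illustrative, not from the talk:
```python
from multiprocessing import Pool
import random

# Toy active-learning loop: label the lambda most uncertain points in
# parallel, then update the model. The "expert" is an expensive oracle
# (here just a hidden threshold); everything is an illustrative placeholder.

HIDDEN_THRESHOLD = 0.37   # what the expert knows and the model must learn

def expert_label(x):
    """Expensive labelling step, one call per worker."""
    return 1 if x >= HIDDEN_THRESHOLD else 0

if __name__ == "__main__":
    rng = random.Random(0)
    lam, threshold = 8, 0.5
    labelled = []
    with Pool() as pool:
        for it in range(10):
            candidates = [rng.random() for _ in range(1000)]
            if it == 0:
                xs = [i / (lam - 1) for i in range(lam)]   # first batch: spread out
            else:
                # sample the max-uncertainty region (closest to the threshold)
                xs = sorted(candidates, key=lambda x: abs(x - threshold))[:lam]
            ys = pool.map(expert_label, xs)                # parallel labelling
            labelled += list(zip(xs, ys))
            zeros = [x for x, y in labelled if y == 0]
            ones = [x for x, y in labelled if y == 1]
            if zeros and ones:
                threshold = (max(zeros) + min(ones)) / 2   # update the model
    print("estimated threshold:", round(threshold, 3))
```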
31. Example 3: decision making
under uncertainty
While I have time
Generate simulations x1,...,xλ using
current knowledge
Work on x1,...,xλ, get y1,...,yλ (get
rewards)
Update knowledge (approximate model).
43. Example 3: decision making
under uncertainty
While I have time
Generate simulations x1,...,xλ using
current knowledge (= scoring rule based
on statistics)
Work on x1,...,xλ, get y1,...,yλ (get
rewards)
Update knowledge (= update statistics in
memory).
44. Example 3: decision making
under uncertainty: parallelizing
While I have time
Generate simulations x1,...,xλ using
current knowledge (= scoring rule based
on statistics)
Work on x1,...,xλ, get y1,...,yλ (get
rewards)
Update knowledge (= update statistics in
memory).
==> “easily” parallelized on multicore
machines
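A sketch of the shared-memory case: several threads run simulations and update one shared statistics table under a lock. The bandit-style scoring rule and coin-flip "simulation" are toy placeholders, not the MoGo code (and in CPython a real speed-up also requires the simulation to release the GIL, e.g. by being C code):
```python
import random
import threading
from collections import defaultdict

# Shared-memory parallelization sketch: all threads share one statistics
# table (move -> [simulations, wins]) protected by a lock. The simulation
# is a toy coin flip; a real program would run a Monte-Carlo playout here.

stats = defaultdict(lambda: [0, 0])   # move -> [sims, wins]
lock = threading.Lock()
MOVES = ["a", "b", "c"]
WIN_PROB = {"a": 0.4, "b": 0.55, "c": 0.5}   # hidden, to be discovered

def worker(n_sims, seed):
    rng = random.Random(seed)
    for _ in range(n_sims):
        with lock:   # read current knowledge (scoring rule based on statistics)
            move = max(MOVES, key=lambda m: (stats[m][1] + 1) / (stats[m][0] + 2)
                                            + rng.random() * 0.1)
        win = rng.random() < WIN_PROB[move]          # run one simulation
        with lock:   # update statistics in memory
            stats[move][0] += 1
            stats[move][1] += int(win)

threads = [threading.Thread(target=worker, args=(5000, s)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print({m: tuple(v) for m, v in stats.items()})   # "b" should get most simulations
```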
45. Example 3: decision making
under uncertainty: parallelizing
While I have time
Generate simulations x1,...,xλ using current knowledge (= scoring rule
based on statistics)
Work on x1,...,xλ, get y1,...,yλ (get rewards)
Update knowledge (= update statistics in memory).
==> parallelized on clusters: one
knowledge base per machine,
average statistics only for crucial
nodes:
nodes with more than 5 % of the sims
nodes at depth < 4
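A sketch of this cluster scheme: each machine keeps local (simulations, wins) statistics, and only crucial nodes (depth < 4, more than 5% of the local simulations) are exchanged and merged. The data layout and the merge-by-summing rule are simplifying assumptions, not the MoGo implementation; the communication layer is abstracted away:
```python
# Cluster parallelization sketch: each machine owns a local tree of
# statistics; only "crucial" nodes (depth < 4 and more than 5% of the
# local simulations) are exchanged and merged across machines.

Stats = dict  # node key -> (simulations, wins); key = (depth, move sequence)

def crucial_nodes(tree: Stats, total_sims: int) -> Stats:
    """Keep only shallow, heavily simulated nodes."""
    return {k: v for k, v in tree.items()
            if k[0] < 4 and v[0] > 0.05 * total_sims}

def merge(trees: list) -> Stats:
    """Sum the (sims, wins) statistics of the crucial nodes of all machines."""
    merged: Stats = {}
    for tree in trees:
        for key, (sims, wins) in tree.items():
            s, w = merged.get(key, (0, 0))
            merged[key] = (s + sims, w + wins)
    return merged

# Two hypothetical machines, each having run 1000 local simulations.
machine_a = {(0, ()): (1000, 520), (1, ("e5",)): (400, 230), (5, ("e5", "d3")): (80, 40)}
machine_b = {(0, ()): (1000, 480), (1, ("e5",)): (30, 12)}

shared = merge([crucial_nodes(machine_a, 1000), crucial_nodes(machine_b, 1000)])
print(shared)   # deep or rarely visited nodes stay local
```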
46. Example 3: decision making
under uncertainty: parallelizing
Good news first: it's simple and it
works on huge clusters!!!
Comparison with voting schemes;
40 machines, 2 seconds per move.
47. Example 3: decision making
under uncertainty: parallelizing
Good news first: it's simple and it
works on huge clusters!!!
Comparing N machines and P machines
==> consistent with linear speed-up in 19x19 !
48. Example 3: decision making
under uncertainty: parallelizing
Good news first: it's simple and it
works on huge clusters!!!
When we have produced these numbers, we
believed we were ready to play Go against very
strong players.
Unfortunately not at all :-)
49. Go: from 29 to 6 stones
1998: loss against amateur (6d) 19x19 H29
2008: win against a pro (8p) 19x19, H9 MoGo
2008: win against a pro (4p) 19x19, H8 CrazyStone
2008: win against a pro (4p) 19x19, H7 CrazyStone
2009: win against a pro (9p) 19x19, H7 MoGo
2009: win against a pro (1p) 19x19, H6 MoGo
2010: win against a pro (4p) 19x19, H6 Zen
2007: win against a pro (5p) 9x9 (blitz) MoGo
2008: win against a pro (5p) 9x9 white MoGo
2009: win against a pro (5p) 9x9 black MoGo
2009: win against a pro (9p) 9x9 white Fuego
2009: win against a pro (9p) 9x9 black MoGoTW
==> still 6 stones at least!
50. Go: from 29 to 6 stones
1998: loss against amateur (6d) 19x19 H29
2008: win against a pro (8p) 19x19, H9 MoGo
2008: win against a pro (4p) 19x19, H8 CrazyStone
2008: win against a pro (4p) 19x19, H7 CrazyStone
2009: win against a pro (9p) 19x19, H7 MoGo
2009: win against a pro (1p) 19x19, H6 MoGo
2010: win against a pro (4p) 19x19, H6 Zen
Wins with H6 / H7 are lucky (rare) wins
2007: win against a pro (5p) 9x9 (blitz) MoGo
2008: win against a pro (5p) 9x9 white MoGo
2009: win against a pro (5p) 9x9 black MoGo
2009: win against a pro (9p) 9x9 white Fuego
2009: win against a pro (9p) 9x9 black MoGoTW
==> still 6 stones at least!
51. Example 3: decision making
under uncertainty: parallelizing
So what happened ?
great speed-up + moderate results;
= contradiction ? ? ?
52. Example 3: decision making
under uncertainty: parallelizing
So what happened ?
great speed-up + moderate results;
= contradiction ? ? ?
Ok, we can simulate the sequential algorithm very
quickly = success.
But even the sequential algorithm is limited, even
with huge computation time!
53. Example 3: decision making
under uncertainty: parallelizing
Poorly handled situation, even with 10 days of CPU!
54. Example 3: decision making
under uncertainty: limited
scalability
(game of Havannah)
==> killed by the bias!
55. Example 3: decision making
under uncertainty: limited
scalability
(game of Go)
==> bias trouble!!!
we reduce the variance but not the
systematic bias.
56. Conclusions
We have seen that “good old”
bias/variance analysis is
quite efficient;
not widely known / used.
57. Conclusions
easy tricks for evolutionary optimization on
grids
==> we published papers with great
speed-ups with just one line of code:
Reweighting mainly,
and also
quasi-random,
selective pressure modified for large pop size.
58. Conclusions
easy tricks for supervised machine
learning:
==> bias/variance analysis here boils
down to: choose an algorithm with more
variance than bias and average:
random subspace;
random subset (subagging);
noise introduction;
“hyper”parameters to be tuned (cross-
validation).
59. Conclusions
For sequential decision making under
uncertainty, disappointing results:
the best algorithms are not
“that” scalable.
A systematic bias remains.
60. Conclusions and references
Our experiments: often on Grid'5000:
~5000 cores - Linux
homogeneous environment
union of high-performance clusters
contains multi-core machines
Monte-Carlo Tree Search for decision
making and uncertainty: Coulom, Kocsis
& Szepesvari, Chaslot et al,...
For parallel evolutionary algorithms: Beyer
et al, Teytaud et al (this Teytaud is not me...).