Complexity bounds in parallel evolution
    A. Auger, H. Fournier,
    N. Hansen, P. Rolet,
    F. Teytaud, O. Teytaud

             Paris, 2010


Tao, Inria Saclay Ile-De-France,
LRI (Université Paris Sud, France),
UMR CNRS 8623, I&A team, Digiteo,
Pascal Network of Excellence.
Outline



   Introduction
   Complexity bounds
   Branching Factor
   Automatic Parallelization
   Real-world algorithms
   Log() corrections



Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud   parallel evolution   2
Outline




   Introduction
    - What is optimization ?
    - What are comparison-based optimization
              algorithms ?
    - Why are we interested in comparison-based
              optimization ?
    - Why do we consider parallel machines ?


Introduction: what is optimization ?


  Consider
                                f: X --> R

  We look for x* such that
                 ∀x, f(x*) ≤ f(x)

  f is randomly drawn; f(x) = f(x,w),
                       with w a random variable.

Introduction: what is optimization ?


  Quality of “Opt” quantified as follows
  (to be minimized), with w a random variable.

Introduction: what is optimization ?


  Consider
               f: X --> R
  We look for x* such that
                  ∀x, f(x*) ≤ f(x)
  ==> Quasi-Newton, random search,
      Newton, Simplex, Interior points...

Comparison-based optimization


  An algorithm “Opt” is comparison-based if
  its iterates depend on f only through the
  results of comparisons f(xi) ≤ f(xj).

The main rules for step-size adaptation


       While ( I have time )
       {
            Generate points (x1,...,xλ) distributed as N(x,σ)
            Evaluate the fitness at x1,...,xλ
            Update x, update σ
       }

Main trouble: choosing σ

Cumulative step-size adaptation

Mutative self-adaptation

Estimation of Multivariate Normal Algorithm
Example 1: Estimation of Multivariate Normal Algorithm


        While ( I have time )
        {
                 Generate points (x1,...,xλ) distributed as N(x,σ)
                 Evaluate the fitness at x1,...,xλ
                 x = mean of the μ best points
                 σ = standard deviation of the μ best points
        }

  With λ=6, μ=3: I have a Gaussian... I generate 6 points,
  I select the three best, I update the Gaussian.
  Obviously 6-parallel.
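The EMNA loop above can be sketched in a few lines of NumPy. The sphere fitness f(x)=||x||², λ=6, μ=3 and the iteration count are illustrative choices, not part of the original slide.

```python
import numpy as np

def emna_sphere(dim=3, lam=6, mu=3, iters=200, seed=0):
    """Minimal EMNA sketch on the sphere f(x) = ||x||^2 (illustrative only)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)                  # current mean
    sigma = np.ones(dim)                      # per-coordinate step-sizes
    f = lambda p: np.sum(p**2, axis=-1)
    f0 = f(x)
    for _ in range(iters):
        pts = x + sigma * rng.normal(size=(lam, dim))  # lambda offspring
        best = pts[np.argsort(f(pts))[:mu]]            # keep the mu best
        x = best.mean(axis=0)                          # update the mean
        sigma = best.std(axis=0)                       # update the spread
    return f0, f(x)

f0, fT = emna_sphere()
```

All λ fitness evaluations inside one iteration are independent, which is why the slide calls this loop "obviously 6-parallel".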
Example 2: Mutative self-adaptation


       μ = λ / 4
       While ( I have time )
       {
            Generate step-sizes (σ1,...,σλ) as σ x exp(- k.N)
            Generate points (x1,...,xλ), xi distributed as N(x,σi)
            Select the μ best points
            Update x (= mean), update σ (= log. mean)
       }
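A NumPy sketch of this self-adaptive loop, again on the sphere; the learning rate k = 1/sqrt(2·dim) is a common choice assumed here, not taken from the slide.

```python
import numpy as np

def sa_es(dim=3, lam=8, iters=300, seed=1):
    """Sketch of mutative self-adaptation on the sphere (illustrative)."""
    rng = np.random.default_rng(seed)
    mu = lam // 4
    k = 1.0 / np.sqrt(2 * dim)            # learning rate (assumed constant)
    x, sigma = rng.normal(size=dim), 1.0
    f = lambda p: np.sum(p**2, axis=-1)
    f0 = f(x)
    for _ in range(iters):
        sigmas = sigma * np.exp(k * rng.normal(size=lam))      # mutate step-sizes
        pts = x + sigmas[:, None] * rng.normal(size=(lam, dim))
        idx = np.argsort(f(pts))[:mu]                          # mu best points
        x = pts[idx].mean(axis=0)                              # x = mean
        sigma = np.exp(np.log(sigmas[idx]).mean())             # sigma = log. mean
    return f0, f(x)

f0, fT = sa_es()
```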
Plenty of comparison-based algorithms




  EMNA and other EDA

  Self-adaptive algorithms

  Cumulative step-size adaptation

  Pattern Search Methods ...
Families of comparison-based algorithms



Main parameter = λ = number of
       evaluations per iteration = parallelism

Full-Ranking vs Selection-Based (param μ)
   FR: we know the ranking of the μ best
   SB: we just know which are the μ best

Elitist or not
    Elitist: comparison with all visited points
    Non-elitist: only within current offspring
EMNA ? Self-adaptation ?

Main parameter = λ = number of
        evaluations per iteration = parallelism

Full-Ranking vs Selection-Based
   FR: we know the ranking of all visited points
   SB: we just know which are the μ best

Elitist or not
    Elitist: comparison with all visited points
    Non-elitist: only within current offspring

==> yet, they work quite well
Comparison-based algorithms are robust


 Consider
               f: X --> R
 We look for x* such that
                  ∀x, f(x*) ≤ f(x)
 ==> what if we see g o f (g increasing) ?
 ==> x* is the same, but xn might change
 ==> comparison-based methods are then
      optimal

Robustness of comparison-based algorithms: formal
statement


    The behaviour of a comparison-based
     algorithm does not depend on g
    ==> a comparison-based algorithm is
     optimal for the worst case over
     compositions g o f

     (I don't give a proof here, but I promise it's true)

Introduction: I like λ large


   ● Grid5000 = 5 000 cores (increasing)
   ● Submitting jobs ==> grouping runs
      ==> λ much bigger than number of cores.
   ● Next generations of computers: tens,
      hundreds, thousands of cores.
   ● Evolutionary algorithms are population-
     based but they have a bad speed-up.

Introduction: concluding :-)


   ● Optimization = finding minima
   ● Many algorithms are comparison-based
      ==> good idea for robustness
   ● Parallel case interesting

   ==> now we can have fun with bounds

Outline


   Introduction                                      On a given domain D,
                                                     on a space F of objective
   Complexity bounds                                  functions such that
                                                           {x*(f); f∈F} = D
   Branching Factor
   Automatic Parallelization
   Real-world algorithms
   Log(λ) corrections

Complexity bounds (N = dimension)


  Cost = nb of fitness evaluations for reaching precision ε
             with probability at least ½, for all f

      N(ε) = covering number of the search space

      Exp ( - Convergence ratio ) = Convergence rate

      Convergence ratio ~ 1 / computational cost
      ==> more convenient for speed-ups
Complexity bounds: basic technique
  We want to know how many iterations we need for reaching precision ε
    in an evolutionary algorithm.

  Key observation: (most) evolutionary algorithms are comparison-based

  Let's consider (for simplicity) a deterministic selection-based non-elitist
   algorithm

  First idea: how many different branches do we have in a run ?
     We select μ points among λ
     Therefore, at most K = λ! / ( μ! ( λ - μ )! ) different branches

  Second idea: how many different answers should we be able to give ?
     Use packing numbers: at least N(ε) different possible answers

  Conclusion: the number n of iterations should verify
                    K^n ≥ N( ε )

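The bound K^n ≥ N(ε) can be made concrete with a few lines of Python. On [0,1]^dim the packing number scales roughly like (1/ε)^dim; that estimate, and the values λ=6, μ=3, are illustrative assumptions.

```python
import math

def min_iterations(lam, mu, dim, eps):
    """Smallest n with K^n >= N(eps), K = C(lam, mu), N(eps) ~ (1/eps)^dim."""
    K = math.comb(lam, mu)             # branches per iteration (selection-based)
    log_N = dim * math.log(1.0 / eps)  # log of the packing-number estimate
    return math.ceil(log_N / math.log(K))

print(min_iterations(lam=6, mu=3, dim=3, eps=1e-3))  # -> 7
```

So with K = 20 branches per iteration, at least 7 iterations are needed to distinguish 10^9 candidate answers.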
Complexity bounds on the convergence ratio




      FR: full ranking (selected points are ranked)
      SB: selection-based (selected points are not ranked)
Complexity bounds on the convergence ratio

    Linear in λ ?
Linear speed-up ?                                   My bound is
                                                        tight,
                                                   I've proved it!



Bounds:
On a given domain D
On a space F of objective
  functions such that
      {x*(f);f∈F}=D
==> very strange F possible!
==> much easier than
    F={||x-x*|| ; x*∈ D }



Linear speed-up ?                                   My bound is
                                                        tight,
                                                   I've proved it!

                                  Ok, tight bound.
                                  But what happens
                                  with a better model ?

Complexity bounds on the convergence ratio

  - Comparison-based optimization
                          (or opt. with limited-precision numbers)
  - We have developed bounds based on:

Branching factor: finitely many possible
pieces of information on the problem per time step
(→ communication complexity)

Packing number (lower bound on the number of
possible outcomes)

    Adding assumptions ==> better bounds ?
Complexity bounds: improved technique
 We want to know how many iterations we need for reaching precision ε
   in an evolutionary algorithm.

 Key observation: (most) evolutionary algorithms are comparison-based

 Let's consider (for simplicity) a deterministic selection-based non-elitist
  algorithm

 First idea: how many different branches do we have in a run ?
    We select μ points among λ
    Therefore, at most K = λ! / ( μ! ( λ - μ )! ) different branches

 Second idea: how many different answers should we be able to give ?
    Use packing numbers: at least N(ε) different possible answers

 Conclusion: the number n of iterations should verify
                   K^n ≥ N( ε )

                                      Many of these K^n
                                      branches are very unlikely !
Complexity bounds: improved technique

                        Many of these K^n branches are very unlikely !
                                                                  We'll use...
                                                               … VC-dimension !
(these slides on “shattering + VC-dim” are
                    extracted from Xue Mei's talk
                            at ENEE698A)


Definition of shattering:
  A set S of points is shattered by a set H of
  sets if for every dichotomy of S there is a
  consistent hypothesis in H
Example: Shattering

         Is this set of points shattered by the set H of circles ?
Yes!
Is this set of points shattered by circles?
How About This One?
VC-dimension

  VC-dimension( set of sets ) =
          maximum cardinality of a shattered set
  VC-dimension( set of functions ) =
                  VC-dimension( sublevel sets )
  Known (as a function of the dimension)
                    for many sets of functions
  In particular, quadratic for ellipsoids,
     linear for homotheties of a fixed ellipsoid,
                        linear for circles...

VC-dimension: the link with optimization ?


  Sauer's lemma:
   the number of subsets of n points consistent
    with a set of VC-dim V is at most n^V
  So what ?
   the number of possible selections is at most
                     K ≤ λ^V
                              ==> instead of K = λ! / ( μ! ( λ - μ )! )

                                       (V at least 3, otherwise a few details change...)

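The gain from Sauer's lemma is easy to quantify numerically; the values λ=100, μ=50, V=3 below are illustrative (V=3 would correspond to a low-VC-dimension family of level sets, e.g. circles in the plane).

```python
import math

lam, mu, V = 100, 50, 3      # illustrative values
K_comb = math.comb(lam, mu)  # all mu-subsets: lambda! / (mu! (lambda-mu)!)
K_vc = lam ** V              # Sauer's lemma bound: at most lambda^V selections
print(K_vc < K_comb)         # the VC bound is vastly smaller here
```

Here K_comb is about 10^29 while K_vc is only 10^6: the branch count per iteration drops from exponential in λ to polynomial when the fitness class has small VC-dimension.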
Complexity bounds on the convergence ratio


                                                   Should not be
                                                   linear in λ !

                                                                    Something
                                                                     remains!


      FR: full ranking (selected points are ranked)
      SB: selection-based (selected points are not ranked)
Sphere: fitness increases with distance to optimum
                                                    1 comparison = 1 hyperplane
Branching factor K (more in Gelly06; Fournier08)

Rewrite your evolutionary algorithm as follows:




g has values in a finite set of cardinality K:
 - e.g. subsets of {1,2,...,λ} of size μ  ( K = λ! / ( μ! ( λ - μ )! ) )
 - or ordered subsets  ( K = λ! / ( λ - μ )! )
 - ...

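The two branching factors on this slide correspond directly to stdlib combinatorics functions; λ=6, μ=3 are the running example from the EMNA slide.

```python
import math

lam, mu = 6, 3
K_subsets = math.comb(lam, mu)   # unordered mu-subsets: lambda! / (mu! (lambda-mu)!)
K_ordered = math.perm(lam, mu)   # ordered mu-subsets:   lambda! / (lambda-mu)!
print(K_subsets, K_ordered)      # -> 20 120
```

Selection-based algorithms see only which μ points survived (K = 20 branches); full-ranking algorithms also see their order (K = 120 branches).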
Outline

                                                   Upper bounds for the
   Introduction                                     dependency in λ
   Complexity bounds
   Branching Factor
   Automatic Parallelization
   Real-world algorithms
   Log(λ) corrections

Automatic parallelization




Speculative parallelization with branching factor 3




                   Consider the sequential algorithm
                   (iterations 1, 2, 3).

  Parallel version for D=2:
  Population = union of all possible populations
  for 2 consecutive iterations.

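The cost of this speculative construction is easy to count: one λ-population per node of the depth-D branch tree. This is a counting sketch of the idea, not the authors' implementation.

```python
def speculative_evals(lam, K, D):
    """Fitness evaluations needed to run D sequential iterations in one
    parallel step: one lambda-population per node of the K-ary branch tree."""
    nodes = sum(K ** d for d in range(D))   # 1 + K + ... + K^(D-1)
    return lam * nodes

print(speculative_evals(lam=6, K=3, D=2))  # -> 24
```

With branching factor 3 and D=2, the parallel version evaluates the single iteration-1 population plus the 3 possible iteration-2 populations, i.e. 4·λ = 24 points at once.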
Outline


   Introduction                                    Tighter lower bounds for
   Complexity bounds                                 specific algorithms ?
   Branching Factor
   Automatic Parallelization
   Real-world algorithms
   Log(λ) corrections

Real world algorithms


  Define σ*, the updated step-size.

  Necessary condition for a log(λ) speed-up:
   - E log( σ* / σ ) ~ log(λ)

   But for many algorithms,
   - E log( σ* / σ ) = O(1) ==> constant speed-up

One-fifth rule: - E log( σ* / σ ) = O(1)

  p = proportion of mutated points better than x

                    While ( I have time )
                    {
                         Generate points (x1,...,xλ) distributed as N(x,σ)
                         Evaluate the fitness at x1,...,xλ
                         Update x = mean
                         Update σ by the 1/5th rule
                    }

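A sketch of a 1/5th-rule run on the sphere. The damping constant c=0.85 is a common textbook choice, and moving x to the best offspring (rather than a mean) is a simplification assumed here.

```python
import numpy as np

def one_fifth_run(dim=3, lam=10, iters=200, c=0.85, seed=2):
    """Sketch: adapt sigma with the 1/5th success rule on f(x) = ||x||^2."""
    rng = np.random.default_rng(seed)
    x, sigma = rng.normal(size=dim), 1.0
    f = lambda p: np.sum(p**2, axis=-1)
    f0 = f(x)
    for _ in range(iters):
        pts = x + sigma * rng.normal(size=(lam, dim))
        fits = f(pts)
        p_succ = np.mean(fits < f(x))          # proportion better than x
        x = pts[np.argmin(fits)]               # move to the best offspring
        sigma = sigma / c if p_succ > 0.2 else sigma * c
    return f0, f(x)

f0, fT = one_fifth_run()
```

Note that the per-iteration change of sigma is bounded by the constants c and 1/c regardless of λ, which is exactly the "- E log(σ*/σ) = O(1)" behaviour the slide criticizes.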
One-fifth rule: - E log( σ* / σ ) = O(1)

  p = proportion of mutated points better than x
Consider e.g.



 Or consider e.g.


                    In both cases σ* / σ is lower-bounded
                    independently of λ
                    ==> parameters should
                        strongly depend on λ !
Self-adaptation, cumulative step-size adaptation




    In many cases, the same result:
    with parameters depending on the
    dimension only (and not on λ),
    the speed-up is limited by a constant!

Outline



   Introduction
   Complexity bounds
   Branching Factor
   Automatic Parallelization
   Real-world algorithms
   Log() corrections



Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud   parallel evolution   73
The starting point of this work

   ● We have shown tight bounds.
   ● Usual algorithms don't reach the bounds
                  for λ large.

   ● Trouble: the algorithms we propose are
   boring (too complicated); people prefer usual
   algorithms.

   ● A simple patch for these algorithms?
Log() corrections


   ●   In the discrete case (XPs): automatic
             parallelization surprisingly efficient.

   ●   Simple trick in the continuous case:
         - E log( *) should be linear in log()

       (this provides corrections which
          work for SA, EMNA and CSA)
Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud   parallel evolution   75
Example 1: Estimation of Multivariate Normal Algorithm


        While ( I have time )
        {
                 Generate points (x1,...,xλ) distributed as N(x,σ)
                 Evaluate the fitness at x1,...,xλ
                 x = mean of the μ best points
                 σ = standard deviation of the μ best points
                 σ /= log( λ / 7 )^(1 / d)
        }
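The corrected EMNA divides σ each iteration by log(λ/7)^(1/d), a factor that grows with λ, so larger populations shrink the step-size faster. A quick check of the factor (d=3 as in the slides' experiments):

```python
import math

def emna_correction(lam, d):
    """Per-iteration divisor applied to sigma in the corrected EMNA."""
    return math.log(lam / 7) ** (1.0 / d)

print(emna_correction(100, 3))   # about 1.39
```

Note the divisor only exceeds 1 once λ > 7·e, i.e. the correction is really aimed at the large-λ regime.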
Ex 2: Log(λ) correction for mutative self-adapt.




        μ = λ / 4  ==>  μ = min( λ/4, d )
       While ( I have time )
       {
            Generate step-sizes (σ1,...,σλ) as σ x exp(- k.N)
            Generate points (x1,...,xλ), xi distributed as N(x,σi)
            Select the μ best points
            Update x (= mean), update σ (= log. mean)
       }
Log() corrections (SA, dim 3)


   ●   In the discrete case (XPs): automatic
             parallelization surprisingly efficient.

   ●   Simple trick in the continuous case
          - E log( *) should be linear in log()

       (this provides corrections which
          work for SA and CSA)
Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud   parallel evolution   78
Log() corrections


   ●   In the discrete case (XPs): automatic
             parallelization surprisingly efficient.

   ●   Simple trick in the continuous case
          - E log( *) should be linear in log()

       (this provides corrections which
          work for SA and CSA)
Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud   parallel evolution   79
Conclusion

 The case of large population size is not well
 handled by usual algorithms.
 We proposed
      (I) theoretical bounds
      (II) an automatic parallelization
               matching the bound, and
               which works well in the discrete case.
      (III) a necessary condition for the
              continuous case, which provides
               useful hints.


Main limitation (of the application to the design of algorithms)

  All this is about a logarithmic speed-up.

  The computational
  power grows like this ==>

                      <== and the result grows like that.

  ==> much better speed-up for noisy
  optimization.

Further work 1



 Apply VC-bounds for considering only
 “reasonable” branches in the automatic
 parallelization.

 Theoretically easy, but provides extremely
 complicated algorithms.

Further work 2


 We have:
 - proofs for complicated algorithms
 - efficient (unproved) hints for usual
 algorithms

 Proofs for the versions with the “trick” ?
 NB: the discrete case is moral: the best
     algorithm is the proved one :-)

Further work 3




 What if the optimum is not a point but a
 subset with topological dimension
 N' < N ?




Further work 4



 Parallel bandits ?
 Experimentally, parallel UCT >> sequential UCT,
  with a speed-up depending on the number of arms.

 Theory ? Perhaps not very hard, but not
  done yet.



Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud   parallel evolution   85
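To make the parallel-bandit question concrete, here is a minimal, hypothetical batch-UCB sketch (not an algorithm from the slides, and unproved): at each round, λ pulls are allocated to the arms with the highest UCB1 scores, so how much the batching helps naturally depends on the number of arms.

```python
import math
import random

def parallel_ucb(means, lam, rounds, seed=0):
    """Toy batch ('parallel') UCB1 for Bernoulli arms: each round, lam
    pulls are spread over the arms with the highest UCB1 scores, with
    any surplus pulls going to the top-scored arm.  Returns the index
    of the empirically best arm (most pulled)."""
    rng = random.Random(seed)
    k = len(means)
    counts, sums, t = [0] * k, [0.0] * k, 0

    def score(i):
        if counts[i] == 0:
            return float("inf")        # force at least one pull per arm
        return sums[i] / counts[i] + math.sqrt(2.0 * math.log(max(2, t)) / counts[i])

    for _ in range(rounds):
        ranked = sorted(range(k), key=score, reverse=True)
        chosen = ranked[: min(lam, k)]
        extra = max(0, lam - k)         # surplus pulls for the top arm
        for i in chosen:
            for _ in range(1 + (extra if i == chosen[0] else 0)):
                reward = 1.0 if rng.random() < means[i] else 0.0
                counts[i] += 1
                sums[i] += reward
                t += 1
    return max(range(k), key=lambda i: counts[i])
```

When λ is much larger than the number of arms, most of the batch is spent re-pulling the same few arms, which is one intuition for why the observed speed-up of parallel UCT depends on the number of arms.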


Complexity bounds for comparison-based optimization and parallel optimization

  • 1. Complexity bounds in parallel  evolution A. Auger, H. Fournier, N. Hansen, P. Rolet, F. Teytaud, O. Teytaud Paris, 2010 Tao, Inria Saclay Ile-De-France, LRI (Université Paris Sud, France), UMR CNRS 8623, I&A team, Digiteo, Pascal Network of Excellence.
  • 2. Outline Introduction Complexity bounds Branching Factor Automatic Parallelization Real-world algorithms Log() corrections Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 2
  • 3. Outline Introduction - What is optimization ? - What are comparison-based optimization algorithms ? - Why we are interested in cp-based opt ? - Why we consider parallel machines ? Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 3
  • 4. Introduction: what is optimization ? Consider f: X --> R We look for x* such that x,f(x*) ≤ f(x) w random variable f is randomly drawn; f(x) = f(x,w). Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 4
  • 5. Introduction: what is optimization ? Quality of “Opt” quantified as follows: (to be minimized) w random variable Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 5
  • 6. Introduction: what is optimization ? Consider f: X --> R We look for x* such that x,f(x*) ≤ f(x) ==> Quasi-Newton, random search, Newton, Simplex, Interior points... Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 6
  • 7. Comparison-based optimization is comparison-based if Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 7
  • 8. The main rules for step-size adaptation While ( I have time ) { Generate points (x1,...,x) distributed as N(x,) Evaluate the fitness at x1,...,x Update x, update  } Main trouble: choosing  Cumulative step-size adaptation Mutative self-adaptation Estimation of Multivariate Normal Algorithm
  • 9. Example 1: Estimation of Multivariate Normal Algorithm While ( I have time ) { Generate points (x1,...,x) distributed as N(x,) Evaluate the fitness at x1,...,x X= mean  best points = standard deviation of  best points } I have a Gaussian...
  • 10. Example 1: Estimation of Multivariate Normal Algorithm While ( I have time ) { Generate points (x1,...,x) distributed as N(x,) Evaluate the fitness at x1,...,x X= mean  best points = standard deviation of  best points } I generate 6 points
  • 11. Example 1: Estimation of Multivariate Normal Algorithm While ( I have time ) { Generate points (x1,...,x) distributed as N(x,) Evaluate the fitness at x1,...,x X= mean  best points = standard deviation of  best points } I select the three best
  • 12. Example 1: Estimation of Multivariate Normal Algorithm While ( I have time ) { Generate points (x1,...,x) distributed as N(x,) Evaluate the fitness at x1,...,x X= mean  best points = standard deviation of  best points } I update the Gaussian
  • 13. Example 1: Estimation of Multivariate Normal Algorithm While ( I have time ) { Generate points (x1,...,x) distributed as N(x,) Evaluate the fitness at x1,...,x X= mean  best points = standard deviation of  best points } Obviously 6-parallel
  • 14. Example 2: Mutative self-adaptation  = / 4 While ( I have time ) { Generate points (1,...,) as  x exp(- k.N) Generate points (x1,...,x) distributed as N(x,i) Select the  best points Update x (=mean), update (=log. mean) }
  • 15. Plenty of comparison-based algorithms EMNA and other EDA Self-adaptive algorithms Cumulative step-size adaptation Pattern Search Methods ...
  • 16. Families of comparison-based algorithms Main parameter =  = number of evaluations per iteration = parallelism Full-Ranking vs Selection-Based (param ) FR: we know the ranking of the  best SB: we just know which are the  best Elitist or not Elitist: comparison with all visited points Non-elitist: only within current offspring
  • 17. EMNA ? Self-adaptation ? Main parameter =  = number of evaluations per iteration = parallelism Full-Ranking vs Selection-Based FR: we know the ranking of all visited points SB: we just know which are the  best Elitist or not Elitist: comparison with all visited points Non-elitist: only within current offspring ==> yet, they work quite well
  • 18. Comparison-based algorithms are robust Consider f: X --> R We look for x* such that x,f(x*) ≤ f(x) ==> what if we see g o f (g increasing) ? ==> x* is the same, but xn might change ==> then, comparison-based methods are optimal Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 18
  • 19. Robustness of comparison-based algorithms: formal statement this does not depend on g for a comparison-based algorithm a comparison-based algorithm is optimal for (I don't give a proof here, but I promise it's true) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 19
  • 20. Introduction: I like  large ● Grid5000 = 5 000 cores (increasing) ● Submitting jobs ==> grouping runs ==>  much bigger than number of cores. ● Next generations of computers: tenths, hundreds, thousands of cores. ● Evolutionary algorithms are population based but they have a bad speed-up. Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 20
  • 21. Introduction: I like  large ● Grid5000 = 5 000 cores (increasing) ● Submitting jobs ==> grouping runs ==>  much bigger than number of cores. ● Next generations of computers: tenths, hundreds, thousands of cores. ● Evolutionary algorithms are population based but they have a bad speed-up. Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 21
  • 22. Introduction: I like  large ● Grid5000 = 5 000 cores (increasing) ● Submitting jobs ==> grouping runs ==>  much bigger than number of cores. ● Next generations of computers: tenths, hundreds, thousands of cores. ● Evolutionary algorithms are population based but they have a bad speed-up. Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 22
  • 23. Introduction: I like  large ● Grid5000 = 5 000 cores (increasing) ● Submitting jobs ==> grouping runs ==>  much bigger than number of cores. ● Next generations of computers: tenths, hundreds, thousands of cores. ● Evolutionary algorithms are population based but they have a bad speed-up. Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 23
  • 24. Introduction: concluding :-) ● Optimization = finding minima ● Many algorithms are comparison-based ● ==> good idea for robustness ● Parallel case interesting ● ==> now we can have fun with bounds Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 24
  • 25. Outline Introduction On a given domain D On a space F of objective Complexity bounds functions such that {x*(f);f∈F}=D Branching Factor Automatic Parallelization Real-world algorithms Log() corrections Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 25
  • 26. Complexity bounds (N = dimension) = nb of fitness evaluations for precision  with probability at least ½ for all f N() = cov. number of the search space Exp ( - Convergence ratio ) = Convergence rate Convergence ratio ~ 1 / computational cost ==> more convenient for speed-ups Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 26
  • 27. Complexity bounds ½ = nb of fitness evaluations for precision  with probability at least ½ for all f N() = cov. number of the search space Exp ( - Convergence ratio ) = Convergence rate Convergence ratio ~ 1 / computational cost ==> more convenient for speed-ups Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 27
  • 28. Complexity bounds: basic technique We want to know how many iterations we need for reaching precision  in an evolutionary algorithm. Key observation: (most) evolutionary algorithms are comparison-based Let's consider (for simplicity) a deterministic selection-based non-elitist algorithm First idea: how many different branches we have in a run ? We select  points among  Therefore, at most K = ! / ( ! (  -  )!) different branches Second idea: how many different answers should we able to give ? Use packing numbers: at least N() different possible answers Conclusion: the number n of iterations should verify Kn ≥ N (  ) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 28
  • 29. Complexity bounds: basic technique We want to know how many iterations we need for reaching precision  in an evolutionary algorithm. Key observation: (most) evolutionary algorithms are comparison-based Let's consider (for simplicity) a deterministic selection-based non-elitist algorithm First idea: how many different branches we have in a run ? We select  points among  Therefore, at most K = ! / ( ! (  -  )!) different branches Second idea: how many different answers should we able to give ? Use packing numbers: at least N() different possible answers Conclusion: the number n of iterations should verify Kn ≥ N (  ) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 29
  • 30. Complexity bounds: basic technique We want to know how many iterations we need for reaching precision  in an evolutionary algorithm. Key observation: (most) evolutionary algorithms are comparison-based Let's consider (for simplicity) a deterministic selection-based non-elitist algorithm First idea: how many different branches we have in a run ? We select  points among  Therefore, at most K = ! / ( ! (  -  )!) different branches Second idea: how many different answers should we able to give ? Use packing numbers: at least N() different possible answers Conclusion: the number n of iterations should verify Kn ≥ N (  ) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 30
  • 31. Complexity bounds: basic technique We want to know how many iterations we need for reaching precision  in an evolutionary algorithm. Key observation: (most) evolutionary algorithms are comparison-based Let's consider (for simplicity) a deterministic selection-based non-elitist algorithm First idea: how many different branches we have in a run ? We select  points among  Therefore, at most K = ! / ( ! (  -  )!) different branches Second idea: how many different answers should we able to give ? Use packing numbers: at least N() different possible answers Conclusion: the number n of iterations should verify Kn ≥ N (  ) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 31
  • 32. Complexity bounds: -balls We want to know how many iterations we need for reaching precision  in an evolutionary algorithm. Key observation: (most) evolutionary algorithms are comparison-based Let's consider (for simplicity) a deterministic selection-based non-elitist algorithm First idea: how many different branches we have in a run ? We select  points among  Therefore, at most K = ! / ( ! (  -  )!) different branches Second idea: how many different answers should we able to give ? Use packing numbers: at least N() different possible answers Conclusion: the number n of iterations should verify Kn ≥ N (  ) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 32
  • 33. Complexity bounds: -balls We want to know how many iterations we need for reaching precision  in an evolutionary algorithm. Key observation: (most) evolutionary algorithms are comparison-based Let's consider (for simplicity) a deterministic selection-based non-elitist algorithm First idea: how many different branches we have in a run ? We select  points among  Therefore, at most K = ! / ( ! (  -  )!) different branches Second idea: how many different answers should we able to give ? Use packing numbers: at least N() different possible answers Conclusion: the number n of iterations should verify Kn ≥ N (  ) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 33
  • 34. Complexity bounds: -balls We want to know how many iterations we need for reaching precision  in an evolutionary algorithm. Key observation: (most) evolutionary algorithms are comparison-based Let's consider (for simplicity) a deterministic selection-based non-elitist algorithm First idea: how many different branches we have in a run ? We select  points among  Therefore, at most K = ! / ( ! (  -  )!) different branches Second idea: how many different answers should we able to give ? Use packing numbers: at least N() different possible answers Conclusion: the number n of iterations should verify Kn ≥ N (  ) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 34
  • 35. Complexity bounds: basic technique We want to know how many iterations we need for reaching precision  in an evolutionary algorithm. Key observation: (most) evolutionary algorithms are comparison-based Let's consider (for simplicity) a deterministic selection-based non-elitist algorithm First idea: how many different branches we have in a run ? We select  points among  Therefore, at most K = ! / ( ! (  -  )!) different branches Second idea: how many different answers should we able to give ? Use packing numbers: at least N() different possible answers Conclusion: the number n of iterations should verify Kn ≥ N (  ) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 35
  • 36. Complexity bounds on the convergence ratio FR: full ranking (selected points are ranked) SB: selection-based (selected points are not ranked) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 36
  • 37. Complexity bounds on the convergence ratio Linear in  ? FR: full ranking (selected points are ranked) SB: selection-based (selected points are not ranked) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 37
  • 38. Linear speed-up ? My bound is tight, I've proved it! Bounds: On a given domain D On a space F of objective functions such that {x*(f);f∈F}=D ==> very strange F possible! ==> much easier than F={||x-x*|| ; x*∈ D } Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 38
  • 39. Linear speed-up ? My bound is Ok, tight bound. tight, But what I've proved it! happens with a better model ? Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 39
  • 40. Complexity bounds on the convergence ratio - Comparison-based optimization (or opt. with limited precision numbers) - We have developped bounds based on: Branching factor: finitely many possible informations on the problem per time step (→ communication. compl) Packing number (lower bound on number of possible outcomes) Adding assumptions ==> better bounds ? Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 40
  • 41. Complexity bounds: improved technique We want to know how many iterations we need for reaching precision  in an evolutionary algorithm. Key observation: (most) evolutionary algorithms are comparison-based Let's consider (for simplicity) a deterministic selection-based non-elitist algorithm First idea: how many different branches we have in a run ? We select  points among  Therefore, at most K = ! / ( ! (  -  )!) different branches Second idea: how many different answers should we able to give ? Use packing numbers: at least N() different possible answers Conclusion: the number nMany of these K verify of iterations should Kn ≥ Nbranches are ( ) very unlikely ! Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 41
  • 42. Complexity bounds: improved technique We want to know how many iterations we need for reaching precision  in an evolutionary algorithm. Key observation: (most) evolutionary algorithms are comparison-based Let's consider (for simplicity) a deterministic selection-based non-elitist algorithm First idea: how many different branches we have in a run ? We select  points among  Therefore, at most K = ! / ( ! (  -  )!) different branches Second idea: how many different answers should we able to give ? Use packing numbers: at least N() different possible answers Conclusion: the number n of iterations should verify Many of these K n K ≥ N( ) branches are We'll use... … VC-dimension ! very unlikely ! Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 42
  • 43. (these slides “shattering + VC-dim” extracted from Xue Mei's talk at ENEE698A) Definition of shattering: A set S of points is shattered by a set H of sets if for every dichotomy of S there is a consistent hypothesis in H
  • 44. Example: Shattering Is this set of points shattered by the set H o
  • 45. Yes! + - + + + + + + - + + - + - - - - - - + + - - -
  • 46. Is this set of points shattered by circles?
  • 48. VC-dimension VC-dimension( set of sets ) = maximum cardinal of a shattered set VC-dimension (set of functions ) = VC-dimension ( level sets) Known (as a function of the dimension) for many sets of functions In particular, quadratic for ellipsoids, linear for homotheties of a fixed ellipsoid linear for circles... Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 48
  • 49. VC-dimension VC-dimension( set of sets ) = maximum cardinal of a shattered set VC-dimension (set of functions ) = VC-dimension ( level sets) Known (as a function of the dimension) for many sets of functions In particular, quadratic for ellipsoids, linear for homotheties of a fixed ellipsoid linear for circles... Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 49
  • 50. VC-dimension VC-dimension( set of sets ) = maximum cardinal of a shattered set VC-dimension (set of functions ) = VC-dimension ( level sets) Known (as a function of the dimension) for many sets of functions In particular, quadratic for ellipsoids, linear for homotheties of a fixed ellipsoid linear for circles... Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 50
  • 51. VC-dimension VC-dimension( set of sets ) = maximum cardinal of a shattered set VC-dimension (set of functions ) = VC-dimension ( sublevel sets) Known (as a function of the dimension) for many sets of functions In particular, quadratic for ellipsoids, linear for homotheties of a fixed ellipsoid linear for circles... Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 51
  • 52. VC-dimension: the link with optimization ? Sauer's lemma: number of subsets of V points consistent V with a set of VC-dim V at most  So what ? number of possible selections at most V K≤ ==> instead of K = ! / ( ! (  -  )!) (V at least 3, otherwise a few details change...) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 52
  • 53. Complexity bounds on the convergence ratio FR: full ranking (selected points are ranked) SB: selection-based (selected points are not ranked) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 53
  • 54. Complexity bounds on the convergence ratio Should not be linear in  ! FR: full ranking (selected points are ranked) SB: selection-based (selected points are not ranked) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 54
  • 55. Complexity bounds on the convergence ratio Something remains! FR: full ranking (selected points are ranked) SB: selection-based (selected points are not ranked) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 55
  • 56. Sphere: fitness increases with distance to optimum 1 comparison = 1 hyperplane Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 56
  • 57. Sphere: fitness increases with distance to optimum 1 comparison = 1 hyperplane Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 57
  • 58. Sphere: fitness increases with distance to optimum 1 comparison = 1 hyperplane Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 58
  • 59. Sphere: fitness increases with distance to optimum 1 comparison = 1 hyperplane Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 59
  • 60. Outline Introduction Complexity bounds Branching Factor Automatic Parallelization Real-world algorithms Log() corrections Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 60
  • 61. Branching factor K (more in Gelly06; Fournier08) Rewrite your evolutionary algorithm as follows: g has values in a finite set of cardinal K: - e.g. subsets of {1,2,...,} of size  (K=! / (!(-)!) ) - or ordered subsets (K=! / (-)! ). - ... Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 61
  • 62. Outline Upper bounds for the Introduction dependency in  Complexity bounds Branching Factor Automatic Parallelization Real-world algorithms Log() corrections Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 62
  • 63. Automatic parallelization Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 63
  • 64. Speculative parallelization with branching factor 3 Consider the sequential algorithm. (iteration 1) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 64
  • 65. Speculative parallelization with branching factor 3 Consider the sequential algorithm. (iteration 2) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 65
  • 66. Speculative parallelization with branching factor 3 Consider the sequential algorithm. (iteration 3) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 66
  • 67. Speculative parallelization with branching factor 3. Parallel version for D=2: population = union of all populations over the 2 iterations.
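The idea of slides 64–67 can be sketched as follows: speculatively unroll D iterations of the sequential algorithm, one branch per possible sequence of comparison outcomes, and evaluate the union of all requested points in parallel. This is only an illustration; `step` is an assumed helper (not from the slides) mapping (state, outcome) to (new state, points to evaluate).

```python
from itertools import product

def speculative_population(state, step, K=3, D=2):
    """Union of all populations the sequential algorithm could request
    over its next D iterations (K**D speculative branches)."""
    points = []
    for outcomes in product(range(K), repeat=D):
        s = state
        for o in outcomes:
            s, pts = step(s, o)   # one sequential iteration on this branch
            points.extend(pts)
    return points

# Toy step: each branch evaluates one point per iteration.
pop = speculative_population(0, lambda s, o: (s + 1, [(s, o)]), K=3, D=2)
print(len(pop))  # 3**2 branches * 2 iterations = 18 points
```

Once the comparisons of the real run are known, only one of the K**D branches is kept, so D iterations cost one parallel evaluation round.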
  • 68. Outline (tighter lower bounds for specific algorithms?): Introduction, Complexity bounds, Branching Factor, Automatic Parallelization, Real-world algorithms, Log() corrections.
  • 69. Real-world algorithms. Define the normalized step-size σ*. Necessary condition for a log(λ) speed-up: −E log(σ*) ~ log(λ). But for many algorithms, −E log(σ*) = O(1) ==> constant speed-up.
  • 70. One-fifth rule: −E log(σ*) = O(1). p = proportion of mutated points better than x. While (I have time) { Generate λ points (x1,...,xλ) distributed as N(x,σ); Evaluate the fitness at x1,...,xλ; Update x = mean; Update σ by the 1/5th rule }.
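As an illustration of the loop above, a minimal runnable sketch with 1/5th-rule step-size adaptation on the sphere; the constants (2.0, 0.84), λ = 10, and the elitist acceptance of the best mutant are sketch choices, not taken from the slides:

```python
import numpy as np

def one_fifth_es(f, x, sigma, lam=10, iters=100, rng=None):
    """Evolution strategy with 1/5th-rule step-size adaptation (a sketch)."""
    rng = np.random.default_rng(0) if rng is None else rng
    fx = f(x)
    for _ in range(iters):
        pts = x + sigma * rng.standard_normal((lam, len(x)))
        vals = np.array([f(p) for p in pts])
        success = np.mean(vals < fx)        # proportion of mutants better than x
        if vals.min() < fx:                 # elitist acceptance (sketch choice)
            x, fx = pts[vals.argmin()], vals.min()
        sigma *= 2.0 if success > 0.2 else 0.84  # 1/5th rule
    return x, fx

x, fx = one_fifth_es(lambda z: float(z @ z), np.ones(3), 1.0)
```

The point of slides 70–71 is precisely that such a σ-update, with constants independent of λ, cannot yield more than a constant speed-up as λ grows.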
  • 71. One-fifth rule: −E log(σ*) = O(1). p = proportion of mutated points better than x. Consider e.g. ..., or consider e.g. ...: in both cases σ* is lower-bounded independently of λ ==> parameters should strongly depend on λ!
  • 72. Self-adaptation, cumulative step-size adaptation: in many cases, the same result. With parameters depending on the dimension only (and not on λ), the speed-up is limited by a constant!
  • 73. Outline: Introduction, Complexity bounds, Branching Factor, Automatic Parallelization, Real-world algorithms, Log() corrections.
  • 74. The starting point of this work: we have shown tight bounds, but usual algorithms don't reach the bounds for λ large. Trouble: the algorithms we propose are boring (too complicated), and people prefer the usual algorithms. A simple patch for these algorithms?
  • 75. Log() corrections ● In the discrete case (XPs): automatic parallelization surprisingly efficient. ● Simple trick in the continuous case: - E log( *) should be linear in log() (this provides corrections which work for SA, EMNA and CSA) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 75
  • 76. Example 1: Estimation of Multivariate Normal Algorithm. While (I have time) { Generate λ points (x1,...,xλ) distributed as N(x,σ); Evaluate the fitness at x1,...,xλ; x = mean of the μ best points; σ = standard deviation of the μ best points; σ /= log(λ/7)^(1/d) } (I select the three best)
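A runnable sketch of this EMNA variant with the log(λ/7)^(1/d) correction, assuming a single pooled standard deviation (a simplification; the slide may use per-coordinate deviations). λ = 40 and μ = 10 are illustrative:

```python
import numpy as np

def emna_log_correction(f, x, sigma, lam=40, mu=10, iters=50, rng=None):
    """EMNA loop with the log(lambda/7)^(1/d) step-size correction (a sketch)."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = len(x)
    for _ in range(iters):
        pts = x + sigma * rng.standard_normal((lam, d))
        vals = np.array([f(p) for p in pts])
        best = pts[np.argsort(vals)[:mu]]    # the mu best points
        x = best.mean(axis=0)
        sigma = best.std()                   # pooled std over all coordinates
        sigma /= np.log(lam / 7) ** (1 / d)  # the log(lambda) correction
    return x

x_final = emna_log_correction(lambda z: float(z @ z), np.ones(3), 1.0)
```

Without the division by log(λ/7)^(1/d), the selected points' standard deviation shrinks too slowly as λ grows, which is exactly the constant-speed-up pathology of the previous slides.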
  • 77. Ex 2: Log(lambda) correction for mutative self-adaptation. μ = λ/4 ==> μ = min(λ/4, d). While (I have time) { Generate λ step-sizes (σ1,...,σλ) as σ × exp(k·N(0,1)); Generate λ points (x1,...,xλ) with xi distributed as N(x,σi); Select the μ best points; Update x (= mean), update σ (= log. mean) }
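A sketch of this mutative self-adaptation loop with the μ = min(λ/4, d) correction; k = 0.3 and the other constants are illustrative, not from the slide:

```python
import numpy as np

def sa_log_correction(f, x, sigma, lam=40, iters=60, k=0.3, rng=None):
    """Mutative self-adaptation with the mu = min(lambda/4, d) correction."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = len(x)
    mu = max(1, min(lam // 4, d))            # the log(lambda) correction
    for _ in range(iters):
        sigmas = sigma * np.exp(k * rng.standard_normal(lam))   # mutate sigma
        pts = x + sigmas[:, None] * rng.standard_normal((lam, d))
        vals = np.array([f(p) for p in pts])
        idx = np.argsort(vals)[:mu]          # the mu best points
        x = pts[idx].mean(axis=0)            # x = mean
        sigma = np.exp(np.log(sigmas[idx]).mean())  # sigma = log-mean
    return x

x_sa = sa_log_correction(lambda z: float(z @ z), np.ones(3), 1.0)
```

Capping μ at the dimension d keeps the selection pressure growing with λ instead of saturating, which is what restores a log(λ) speed-up.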
  • 78. Log() corrections (SA, dim 3) ● In the discrete case (XPs): automatic parallelization surprisingly efficient. ● Simple trick in the continuous case - E log( *) should be linear in log() (this provides corrections which work for SA and CSA) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 78
  • 79. Log() corrections ● In the discrete case (XPs): automatic parallelization surprisingly efficient. ● Simple trick in the continuous case - E log( *) should be linear in log() (this provides corrections which work for SA and CSA) Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud parallel evolution 79
  • 80. Conclusion. The case of large population sizes is not well handled by usual algorithms. We proposed (I) theoretical bounds, (II) an automatic parallelization matching the bounds, which works well in the discrete case, and (III) a necessary condition for the continuous case, which provides useful hints.
  • 81. Main limitation (of the application to the design of algorithms): all of this concerns a logarithmic speed-up. The computational power grows enormously, while the resulting gain grows only logarithmically. ==> much better speed-up for noisy optimization.
  • 82. Further work 1. Apply VC-bounds to consider only "reasonable" branches in the automatic parallelization. Theoretically easy, but it yields extremely complicated algorithms.
  • 83. Further work 2. We have: proofs for complicated algorithms; efficient (unproved) hints for usual algorithms. Proofs for the versions with the "trick"? NB: the discrete case is morally satisfying: the best algorithm is the proved one :-)
  • 84. Further work 3. What if the optimum is not a point but a subset with topological dimension N' < N?
  • 85. Further work 4. Parallel bandits? Experimentally, parallel UCT >> sequential UCT, with a speed-up depending on the number of arms. Theory? Perhaps not very hard, but not done yet.
