Learning for Optimization: EDAs, probabilistic modelling, or ...
1. Explicit Modelling in
Metaheuristic Optimization
Dr Marcus Gallagher
School of Information Technology and Electrical
Engineering
University of Queensland Q. 4072
marcusg@itee.uq.edu.au
2. Talk outline:
Optimization, heuristics and metaheuristics.
“Estimation of Distribution” (optimization)
algorithms (EDAs): a brief overview.
A framework for describing EDAs.
Other modelling approaches in
metaheuristics.
Summary
Marcus Gallagher - MASCOS Symposium, 26/11/04 2
3. “Hard” Optimization Problems
Goal: Find
$x^* \in S$ such that $f(x^*) \le f(x), \ \forall x \in S$
where S is often multi-dimensional; real-valued or
binary:
$S \subseteq \mathbb{R}^n$ or $S = \{0,1\}^n$
Many classes of optimization problems (and
algorithms) exist.
When might it be worthwhile to consider metaheuristic
or machine learning approaches?
4. Finding an “exact” solution is intractable.
Limited knowledge of f()
No derivative information.
May be discontinuous, noisy,…
Evaluating f() is expensive in terms of time
or cost.
f() is known or suspected to contain nasty
features
Many local minima, plateaus, ravines.
The search space is high-dimensional.
5. What is the “practical” goal of (global)
optimization?
“There exists a goal (e.g. to find as small a
value of f() as possible), there exist resources
(e.g. some number of trials), and the problem
is how to use these resources in an optimal
way.”
A. Törn and A. Žilinskas, Global Optimization. Springer-Verlag,
1989. Lecture Notes in Computer Science, Vol. 350.
6. Heuristics
Heuristic (or approximate) algorithms aim
to find a good solution to a problem in a
reasonable amount of computation time –
but with no guarantee of “goodness” or
“efficiency” (cf. exact or complete
algorithms).
Broad classes of heuristics:
Constructive methods
Local search methods
7. Metaheuristics
Metaheuristics are (roughly) high-level strategies
that combine lower-level techniques for
exploration and exploitation of the search space.
An overarching term to refer to algorithms including
Evolutionary Algorithms, Simulated Annealing, Tabu
Search, Ant Colony, Particle Swarm, Cross-Entropy,…
C. Blum and A. Roli. Metaheuristics in Combinatorial
Optimization: Overview and Conceptual Comparison. ACM
Computing Surveys, 35(3), 2003, pp. 268-308.
8. Learning/Modelling for Optimization
Most optimization algorithms make some (explicit or
implicit) assumptions about the nature of f().
Many algorithms vary their behaviour during execution
(e.g. simulated annealing).
In some optimization algorithms the search is adaptive:
the search points evaluated next depend on the points
already searched (and/or their f() values, derivatives of f(), etc.).
Learning/modelling can be implicit (e.g., adapting the
step size in gradient descent, or the population in an EA).
…or explicit; examples from the optimization literature:
Nelder-Mead simplex algorithm.
Response surfaces (metamodelling, surrogate function).
9. EDAs: Probabilistic Modelling for
Optimization
Based on the use of (unsupervised) density
estimators/generative statistical models.
Idea is to convert the optimization problem into a
search over probability distributions.
P. Larranaga and J. A. Lozano (eds.). Estimation of Distribution
Algorithms: a new tool for evolutionary computation. Kluwer
Academic Publishers, 2002.
The probabilistic model is in some sense an
explicit model of (currently) promising regions of
the search space.
12. GAs and EDAs compared
GA pseudocode
1. Initialize the population, X(t);
2. Evaluate the objective function for each
point;
3. Selection();
4. Crossover();
5. Mutation();
6. Form new population X(t+1);
7. While !(terminate()) Goto 2;
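The seven steps above can be sketched in Python as follows. This is a minimal illustration, not the formulation used in the talk: the OneMax objective, binary tournament selection, one-point crossover, the parameter values and the fixed generation count used as the termination test are all assumptions made for the example. Note that this toy problem is maximized, whereas the talk states the goal as minimization.

```python
import random


def onemax(bits):
    """Toy objective (assumed for this sketch): number of 1-bits, to be maximized."""
    return sum(bits)


def tournament(pop, fitness, rng):
    """Binary tournament: return the fitter of two randomly chosen individuals."""
    a, b = rng.randrange(len(pop)), rng.randrange(len(pop))
    return pop[a] if fitness[a] >= fitness[b] else pop[b]


def run_ga(n_bits=20, pop_size=30, generations=60, p_mut=0.02, seed=0):
    rng = random.Random(seed)
    # 1. Initialize the population X(t).
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):  # 7. Loop until the (assumed) budget is spent.
        # 2. Evaluate the objective function for each point.
        fitness = [onemax(ind) for ind in pop]
        new_pop = []
        while len(new_pop) < pop_size:
            # 3. Selection.
            p1 = tournament(pop, fitness, rng)
            p2 = tournament(pop, fitness, rng)
            # 4. Crossover (one-point).
            cut = rng.randint(1, n_bits - 1)
            child = p1[:cut] + p2[cut:]
            # 5. Mutation: flip each bit independently with probability p_mut.
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            new_pop.append(child)
        # 6. Form the new population X(t+1).
        pop = new_pop
    return max(pop, key=onemax)


best = run_ga()
```

The population itself is the only record of what the search has learned so far, which is the point of contrast with the EDA loop.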
13. GAs and EDAs compared
EDA pseudocode
1. Initialize a probability model, Q(x);
2. Create a population of points by
sampling from Q(x);
3. Evaluate the objective function for
each point;
4. Update Q(x) using selected population
and f() values;
5. While !(terminate()) Goto 2;
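A minimal Python sketch of this loop for a continuous problem is given below. It assumes a model Q(x) made of independent Gaussians that is refitted to the best sampled points at each iteration (truncation selection). The sphere objective, the initial mean and standard deviation, and all parameter values are illustrative assumptions rather than choices made in the talk.

```python
import numpy as np


def sphere(x):
    """Toy objective (assumed): sum of squares per row, minimized at the origin."""
    return np.sum(x ** 2, axis=1)


def gaussian_eda(f, dim=5, pop_size=100, n_select=30, iterations=50, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize a probability model Q(x): one independent Gaussian per variable.
    mean = np.full(dim, 3.0)
    std = np.full(dim, 2.0)
    for _ in range(iterations):  # 5. Loop for a fixed (assumed) number of iterations.
        # 2. Create a population of points by sampling from Q(x).
        pop = rng.normal(mean, std, size=(pop_size, dim))
        # 3. Evaluate the objective function for each point.
        values = f(pop)
        # 4. Update Q(x): refit the Gaussians to the n_select lowest-valued points.
        elite = pop[np.argsort(values)[:n_select]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-12  # small floor avoids a zero-width model
    return mean


best_mean = gaussian_eda(sphere)
```

Compared with the GA loop, crossover and mutation are replaced by "fit a model, then sample from it"; the model parameters (here `mean` and `std`) carry the search state between iterations.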
14. EDA Example 1
Population-based Incremental Learning
(PBIL)
S. Baluja, R. Caruana. Removing the Genetics from the
Standard Genetic Algorithm. ICML’95.
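The model is a vector of independent bit probabilities:
$p_1 = \Pr(x_1 = 1),\ p_2 = \Pr(x_2 = 1),\ \ldots,\ p_n = \Pr(x_n = 1)$
Each probability is moved towards the corresponding bit of the best sampled individual $x^b$, with learning rate $\alpha$:
$p_i \leftarrow (1 - \alpha)\, p_i + \alpha\, x_i^b$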
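As an illustration of the PBIL idea, the following Python sketch keeps one probability per bit and nudges it towards the best individual sampled in each iteration. The OneMax objective (maximized here), the learning-rate name `alpha` and the parameter values are assumptions for this example; the published algorithm includes further details (such as mutation of the probability vector) that are omitted.

```python
import numpy as np


def onemax(pop):
    """Toy objective (assumed): number of 1-bits in each row, to be maximized."""
    return pop.sum(axis=1)


def pbil(f, n_bits=20, pop_size=50, alpha=0.1, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    p = np.full(n_bits, 0.5)  # p[i] models Pr(x_i = 1); start uninformed
    for _ in range(iterations):
        # Sample a population of bit strings from the current probability vector.
        pop = (rng.random((pop_size, n_bits)) < p).astype(int)
        # Pick the best individual of this sample.
        best = pop[np.argmax(f(pop))]
        # Move each probability a fraction alpha towards the best individual's bit.
        p = (1 - alpha) * p + alpha * best
    return p


p = pbil(onemax)
```

No population survives between iterations; only the vector `p` does.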
15. EDA Example 2
Mutual Information Maximization for Input
Clustering (MIMIC)
J. De Bonet, C. Isbell and P. Viola. MIMIC: Finding optima by
estimating probability densities. Advances in Neural Information
Processing Systems, vol.9, 1997.
$p(x) = p(x_{i_1} \mid x_{i_2})\, p(x_{i_2} \mid x_{i_3}) \cdots p(x_{i_{n-1}} \mid x_{i_n})\, p(x_{i_n})$
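To make the chain factorization concrete, here is a small Python sketch that fits such a chain to binary data and samples from it. It assumes the variable ordering is already given, so it does not show how MIMIC chooses the ordering. The function names, the additive smoothing and the toy data are hypothetical choices for this example.

```python
import numpy as np


def fit_chain(data, order, smoothing=1.0):
    """Estimate p(x_in) and each p(x_ik = 1 | x_ik+1) for order = [i1, ..., in]."""
    root = order[-1]
    p_root = (data[:, root].sum() + smoothing) / (len(data) + 2 * smoothing)
    cond = {}
    for child, parent in zip(order[:-1], order[1:]):
        table = np.empty(2)
        for value in (0, 1):
            rows = data[data[:, parent] == value]
            table[value] = (rows[:, child].sum() + smoothing) / (len(rows) + 2 * smoothing)
        cond[child] = table  # table[v] = Pr(x_child = 1 | x_parent = v)
    return p_root, cond


def sample_chain(p_root, cond, order, n_samples, rng):
    """Sample x_in first, then walk back along the chain to x_i1."""
    samples = np.zeros((n_samples, len(order)), dtype=int)
    samples[:, order[-1]] = rng.random(n_samples) < p_root
    for child, parent in zip(order[-2::-1], order[:0:-1]):
        p_one = cond[child][samples[:, parent]]
        samples[:, child] = rng.random(n_samples) < p_one
    return samples


rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, size=2000)
# Toy data: variable 1 copies variable 0, variable 2 is independent noise.
data = np.column_stack([x0, x0, rng.integers(0, 2, size=2000)])
order = [0, 1, 2]  # encodes p(x0 | x1) p(x1 | x2) p(x2)
p_root, cond = fit_chain(data, order)
new_points = sample_chain(p_root, cond, order, 2000, rng)
agreement = float(np.mean(new_points[:, 0] == new_points[:, 1]))
```

Because the chain links variables 0 and 1 directly, the sampled points reproduce their dependency, which a model of independent bits (as in PBIL) could not capture.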
16. EDA Example 3
Combining Optimizers with Mutual Information
Trees (COMIT)
S. Baluja and S. Davies. Using optimal dependency-trees for combinatorial
optimization: learning the structure of the search space. Proc. ICML’97.
Uses a tree-structured graphical model.
Model can be constructed in $O(n^2)$ time using a
variant of the minimum spanning tree algorithm.
Model is optimal, given the restrictions, in the sense
that the Kullback-Leibler divergence between the
model and a full joint distribution is minimized.
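The tree-building step can be sketched in Python as below: compute the pairwise mutual information between variables, then grow a maximum-weight spanning tree with a Prim-style loop. This is an illustrative sketch of that general construction for binary data, not the COMIT implementation; the function names and the toy data are assumptions, and the sampling and local-search parts of the algorithm are not shown.

```python
import numpy as np


def mutual_information(a, b):
    """Empirical mutual information (in nats) between two binary columns."""
    mi = 0.0
    for u in (0, 1):
        for v in (0, 1):
            p_uv = np.mean((a == u) & (b == v))
            p_u, p_v = np.mean(a == u), np.mean(b == v)
            if p_uv > 0:
                mi += p_uv * np.log(p_uv / (p_u * p_v))
    return mi


def dependency_tree(data):
    """Return (parent, child) edges of a maximum-MI spanning tree rooted at variable 0."""
    n = data.shape[1]
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi[i, j] = mi[j, i] = mutual_information(data[:, i], data[:, j])
    best_link = np.zeros(n, dtype=int)  # strongest tree neighbour found so far
    best_mi = mi[0].copy()              # MI of that link; -inf marks "already in tree"
    best_mi[0] = -np.inf
    edges = []
    for _ in range(n - 1):
        child = int(np.argmax(best_mi))
        edges.append((int(best_link[child]), child))
        best_mi[child] = -np.inf
        # Nodes still outside the tree may now link more strongly to the new node.
        for j in range(n):
            if best_mi[j] != -np.inf and mi[child, j] > best_mi[j]:
                best_mi[j] = mi[child, j]
                best_link[j] = child
    return edges


rng = np.random.default_rng(0)
m = 2000
x0 = rng.integers(0, 2, m)
x1 = x0 ^ (rng.random(m) < 0.1).astype(int)  # noisy copy of x0
x2 = x1 ^ (rng.random(m) < 0.1).astype(int)  # noisy copy of x1
x3 = rng.integers(0, 2, m)                   # independent variable
edges = dependency_tree(np.column_stack([x0, x1, x2, x3]))
```

Both the pairwise mutual-information table and the Prim-style loop take a number of steps quadratic in the number of variables, in line with the construction cost quoted above.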
17. EDA Example 4
Bayesian Optimization Algorithm (BOA)
M. Pelikan, D. Goldberg and E. Cantu-Paz. BOA: The Bayesian
optimization algorithm. In Proc. GECCO’99.
Bayesian network model where nodes can
have at most k parents.
Greedy search over the Bayesian Dirichlet
equivalence metric to find the network
structure.
18. Further work on EDAs
EDAs have also been developed
For problems with continuous and mixed
variables.
That use mixture models and kernel
estimators - allowing for the modelling of
multi-modal distributions.
…and more!
19. A framework to describe building and adapting a
probabilistic model for optimization
See:
M. Gallagher and M. Frean. Population-Based Continuous
Optimization, Probabilistic Modelling and Mean Shift. To
appear, Evolutionary Computation, 2005.
Consider a continuous EDA with model
$$Q(x) = \prod_{i=1}^{n} Q_i(x_i)$$
Consider a Boltzmann distribution over f(x)
$$P(x) = \frac{1}{Z} \exp\left(-\frac{f(x)}{T}\right)$$
20. As T→0, P(x) tends towards a set of impulse
spikes over the global optima.
Now, we have a probability distribution whose form we
know, Q(x), and we would like to modify it to be close
to P(x). KL divergence:
$$K = \int_x Q(x) \log \frac{Q(x)}{P(x)} \, dx$$
Let Q(x) be a Gaussian; try to minimize K via
gradient descent with respect to the mean
parameter $\mu$ of Q(x).
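The two ingredients of this slide can be checked numerically on a one-dimensional grid. The Python sketch below assumes the minimization convention (weights proportional to exp(-f/T)), a toy quadratic objective with its minimum at x = 1, and a Gaussian Q of fixed variance; all of these are illustrative assumptions. It shows the Boltzmann mass concentrating near the optimum as T decreases, and the KL divergence shrinking when the mean of Q moves towards the optimum.

```python
import numpy as np


def boltzmann(f_values, T):
    """Normalized weights exp(-f/T)/Z on a finite grid (minimization convention assumed)."""
    logits = -(f_values - f_values.min()) / T  # shift by the minimum for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def kl_divergence(q, p):
    """Discrete KL(Q || P) = sum_x q(x) log(q(x) / p(x)), skipping points where q is zero."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))


x = np.linspace(-4.0, 4.0, 801)
f = (x - 1.0) ** 2  # assumed toy objective, global minimum at x = 1


def gaussian_on_grid(mu, v=0.25):
    """Gaussian with mean mu and variance v, discretized and normalized on the grid."""
    q = np.exp(-((x - mu) ** 2) / (2 * v))
    return q / q.sum()


# Probability mass within 0.2 of the optimum for decreasing temperatures.
near = np.abs(x - 1.0) < 0.2
mass_near_opt = {T: float(boltzmann(f, T)[near].sum()) for T in (10.0, 1.0, 0.01)}

# KL divergence from a Gaussian Q to P (at T = 1) for two choices of the mean.
kl_far = kl_divergence(gaussian_on_grid(-2.0), boltzmann(f, 1.0))
kl_near = kl_divergence(gaussian_on_grid(1.0), boltzmann(f, 1.0))
```

In a real problem Z and P(x) cannot be tabulated like this, which is why the derivation continues with a gradient that only needs samples from Q(x) and their f values.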
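21. The gradient becomes
$$\frac{\partial Q(x)}{\partial \mu} = Q(x)\, \frac{x - \mu}{v}$$
$$\nabla_\mu K = \frac{1}{vT} \int_x Q(x)\,(x - \mu)\, f(x)\, dx$$
(v denotes the variance of Q(x).)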
An approximation to the integral is to use a
sample S of n points $x_i$ drawn from Q(x)
$$\nabla_\mu K \approx \frac{1}{n v T} \sum_{x_i \in S} (x_i - \mu)\, f(x_i)$$
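The sample approximation can be sanity-checked in one dimension. For the assumed test function f(x) = x² and a Gaussian Q with mean μ and variance v, the integral has the closed form 2μ/T (since E[(x − μ) x²] = 2μv), so the sample average should approach that value. The function name and parameter values in this Python sketch are illustrative.

```python
import numpy as np


def kl_gradient_estimate(f, mu, v, T, n, rng):
    """Sample estimate (1 / (n v T)) * sum_i (x_i - mu) f(x_i), with x_i ~ N(mu, v)."""
    x = rng.normal(mu, np.sqrt(v), size=n)
    return float(np.sum((x - mu) * f(x)) / (n * v * T))


rng = np.random.default_rng(0)
mu, v, T = 1.5, 0.5, 2.0
estimate = kl_gradient_estimate(lambda x: x ** 2, mu, v, T, n=200_000, rng=rng)
exact = 2 * mu / T  # closed form for f(x) = x^2 under N(mu, v)
```

The estimate needs only function values at sampled points, not derivatives of f, which matches the "limited knowledge of f()" setting described earlier in the talk.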
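22. The algorithm update rule is then
$$\mu \leftarrow \mu - \frac{\alpha}{n} \sum_{x_i \in S} (x_i - \mu)\, \hat{f}(x_i)$$
(a gradient-descent step on K, with the constant factor 1/(vT) absorbed into the step size $\alpha$)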
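A minimal Python sketch of a search loop built on this kind of update is shown below. Several pieces are assumptions made for the example: the slide does not define $\hat{f}$, so here it is taken to be f standardized over the current sample (zero mean, unit standard deviation); the step size `alpha`, the fixed variance `v`, the sample size and the shifted-sphere objective are likewise illustrative.

```python
import numpy as np


def shifted_sphere(x):
    """Toy objective (assumed): squared distance from the point (3, ..., 3), per row."""
    return np.sum((x - 3.0) ** 2, axis=1)


def mean_update_search(f, mu0, v=1.0, alpha=0.2, n=200, iterations=200, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu0, dtype=float)
    for _ in range(iterations):
        # Sample n points from Q(x) = N(mu, v I); the variance is kept fixed here.
        x = rng.normal(mu, np.sqrt(v), size=(n, mu.size))
        values = f(x)
        # Assumption: f-hat is f standardized over the current sample.
        f_hat = (values - values.mean()) / (values.std() + 1e-12)
        # Move the mean away from high-f samples and towards low-f samples.
        mu = mu - (alpha / n) * np.sum((x - mu) * f_hat[:, None], axis=0)
    return mu


mu_final = mean_update_search(shifted_sphere, mu0=[0.0, 0.0])
```

Every sampled point contributes to the step, weighted by its objective value, so no separate selection step appears in the loop.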
Similar ideas can be found in:
A. Berny. Statistical Machine Learning and Combinatorial
Optimization. In L. Kallel et al. eds, Theoretical Aspects of
Evolutionary Computation, pp. 287-306. Springer. 2001.
M. Toussaint. On the evolution of phenotypic exploration
distributions. In C. Cotta et al. eds, Foundations of Genetic
Algorithms (FOGA VII), pp. 169-182. Morgan Kaufmann. 2003.
23. Some insights
The derived update rule is closely related
to those found in Evolution Strategies and
a version of PBIL for continuous spaces.
It is possible to view these existing
algorithms as approximately doing KL
minimization.
The objective function appears explicitly in
this update rule (no selection).
24. Other Research in Learning/Modelling
for Optimization
J. A. Boyan and A. W. Moore. Learning Evaluation Functions to
Improve Optimization by Local Search. Journal of Machine Learning
Research 1:2, 2000.
B. Anderson, A. Moore and D. Cohn. A Nonparametric Approach to
Noisy and Costly Optimization. International Conference on
Machine Learning, 2000.
D. R. Jones. A Taxonomy of Global Optimization Methods Based
on Response Surfaces. Journal of Global Optimization 21(4):345-
383, 2001.
Reinforcement learning
R. J. Williams (1992). Simple statistical gradient-following algorithms for
connectionist reinforcement learning. Machine Learning, 8:229-256.
V. V. Miagkikh and W. F. Punch III, An Approach to Solving Combinatorial
Optimization Problems Using a Population of Reinforcement Learning Agents,
Genetic and Evolutionary Computation Conf.(GECCO-99), p.1358-1365, 1999.
25. Summary
The field of metaheuristics (including
Evolutionary Computation) has:
Produced a large variety of optimization algorithms.
Demonstrated good performance on a range of real-world
problems.
Metaheuristics are considerably more general:
Can even be applied when there isn’t a “true”
objective function (coevolution).
Can evolve non-numerical objects.
26. Summary
EDAs take an explicit modelling approach to
optimization.
Existing statistical models and model-fitting algorithms can be
employed.
Potential for solving challenging problems.
Model can be more easily visualized/interpreted than a dynamic
population in a conventional EA.
Although the field is highly active, it is still relatively
immature:
Improve quality of experimental results.
Make sure research goals are well-defined.
Lots of preliminary ideas, but a lack of comparative/follow-up
research.
Difficult to keep up with the literature and see connections with
other fields.