AACIMP 2010 Summer School lecture by Leonidas Sakalauskas. "Applied Mathematics" stream. "Stochastic Programming and Applications" course. Part 8.
More info at http://summerschool.ssa.org.ua
1. Lecture 8
Stochastic Approximation and
Simulated Annealing
Leonidas Sakalauskas
Institute of Mathematics and Informatics
Vilnius, Lithuania <sakal@ktl.mii.lt>
EURO Working Group on Continuous Optimization
2. Content
Introduction.
Stochastic Approximation:
SPSA with Lipschitz perturbation operator;
SPSA with Uniform perturbation operator;
Standard Finite Difference Approximation
algorithm.
Simulated Annealing
Implementation and Applications
Wrap-Up and Conclusions
3. Introduction
In many practical problems of technical design,
some of the data may be subject to significant
uncertainty, which is captured by probabilistic-
statistical models.
Such problems can be viewed as constrained
stochastic programming tasks.
Stochastic Approximation can be considered as an
alternative to traditional optimization methods,
especially when the objective functions are
nondifferentiable or computed with noise.
4. Stochastic Approximation
The application of Stochastic Approximation to
optimization problems in which the objective
function is nondifferentiable or nonsmooth and
computed with noise is a topical theoretical and
practical problem.
The known Stochastic Approximation methods for
such problems use the idea of a stochastic
gradient together with certain step-length rules
that ensure convergence.
5. Formulation of the optimization problem
The optimization problem (minimization) is as follows:
$$f(x) \to \min_{x \in \mathbb{R}^n},$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is a Lipschitz function bounded from below.
6. Formulation of the optimization problem
Let $\partial f(x)$ be the generalized gradient of this function.
Assume $X^*$ to be the set of stationary points and $F^*$ to be the set of corresponding function values:
$$X^* = \{ x \mid 0 \in \partial f(x) \},$$
$$F^* = \{ z \mid z = f(x),\; x \in X^* \}.$$
7. We consider a function smoothed by a
perturbation operator:
$$f(x, \sigma) = E\, f(x + \sigma \xi), \quad \xi \sim p(\cdot),$$
where $\sigma > 0$ is the value of the perturbation
parameter.
The functions smoothed by this operator are
twice continuously differentiable (Rubinstein &
Shapiro (1993), Bartkute & Sakalauskas
(2004)), which offers certain opportunities for
creating optimization algorithms.
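As an illustration, a minimal Python sketch of this smoothing by Monte-Carlo, assuming $\xi$ uniform in the unit ball (as in the SPSA variant on slide 11); the sample size and the test function are arbitrary:

```python
import numpy as np

def sample_unit_ball(n, rng):
    """Draw a vector uniformly distributed in the n-dimensional unit ball."""
    z = rng.standard_normal(n)
    z /= np.linalg.norm(z)            # uniform direction on the sphere
    r = rng.uniform() ** (1.0 / n)    # radius correction for uniformity in the ball
    return r * z

def smoothed_value(f, x, sigma, n_samples=1000, seed=0):
    """Monte-Carlo estimate of f(x, sigma) = E f(x + sigma * xi)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    vals = [f(x + sigma * sample_unit_ball(x.size, rng)) for _ in range(n_samples)]
    return float(np.mean(vals))

# example: smoothing |x1| + |x2| (nondifferentiable at 0)
print(smoothed_value(lambda x: np.abs(x).sum(), [0.0, 0.0], sigma=0.1))
```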
8. Advantages of SPSA
Recently, interesting research has been focused
on Simultaneous Perturbation Stochastic
Approximation (SPSA).
In SPSA algorithms it suffices to calculate the
values of the function at only one or a few points
to estimate the stochastic gradient, which
promises to reduce the numerical complexity of
optimization.
9. SA algorithms
1. SPSA with Lipschitz perturbation operator.
2. SPSA with Uniform perturbation operator.
3. Standard Finite Difference Approximation
algorithm.
10. General Stochastic Approximation scheme
$$x^{k+1} = x^k - \rho_k\, g^k, \quad k = 1, 2, \ldots,$$
where $g^k = g(x^k, \sigma_k, \xi^k)$ is the stochastic gradient,
with $\bar{g}(x, \sigma) = E\, g(x, \sigma, \xi)$ and $\bar{g}(x, \sigma) \to \partial f(x)$ as $\sigma \to 0$.
This scheme is the same for the different Stochastic
Approximation algorithms, which are distinguished only by
the approach to stochastic gradient estimation.
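A minimal Python sketch of this general scheme; the step rules $\rho_k = a/k$ and $\sigma_k = b/k$ are assumed here for illustration, and `grad_estimator` can be any of the estimators sketched after the following slides:

```python
import numpy as np

def stochastic_approximation(grad_estimator, x0, a=1.0, b=0.1, n_iter=1000, seed=0):
    """General scheme x_{k+1} = x_k - rho_k * g(x_k, sigma_k, xi_k),
    with assumed step rules rho_k = a/k and sigma_k = b/k."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for k in range(1, n_iter + 1):
        rho_k, sigma_k = a / k, b / k
        x = x - rho_k * grad_estimator(x, sigma_k, rng)
    return x

# usage: stochastic_approximation(lambda x, s, rng: some_gradient(f, x, s, rng), x0)
```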
11. SPSA with Lipschitz perturbation operator
The gradient estimator of SPSA with the Lipschitz perturbation operator is
expressed as:
$$g(x, \sigma, \xi) = \frac{f(x + \sigma \xi) - f(x)}{\sigma}\,\xi,$$
where $\sigma$ is the value of the perturbation parameter and the
vector $\xi$ is uniformly distributed in the unit ball:
$$p(y) = \begin{cases} \dfrac{1}{V_n}, & \text{if } \|y\| \le 1, \\[4pt] 0, & \text{if } \|y\| > 1, \end{cases}$$
where $V_n$ is the volume of the n-dimensional ball (Bartkute & Sakalauskas
(2007)).
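A one-sample sketch of this estimator in Python, assuming the plain difference-quotient form reconstructed above (any dimension-dependent normalizing constant from Bartkute & Sakalauskas (2007) is omitted):

```python
import numpy as np

def spsa_lipschitz_gradient(f, x, sigma, rng):
    """g(x, sigma, xi) = ((f(x + sigma*xi) - f(x)) / sigma) * xi,
    with xi uniform in the unit ball."""
    n = x.size
    xi = rng.standard_normal(n)
    xi /= np.linalg.norm(xi)          # uniform direction
    xi *= rng.uniform() ** (1.0 / n)  # uniform radius in the ball
    return (f(x + sigma * xi) - f(x)) / sigma * xi
```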
12. SPSA with Uniform perturbation operator
The gradient estimator of SPSA with the Uniform perturbation operator is
expressed as:
$$g(x, \sigma, \xi) = \frac{f(x + \sigma \xi) - f(x - \sigma \xi)}{2 \sigma}\,\xi,$$
where $\sigma$ is the value of the perturbation parameter and
$\xi = (\xi_1, \xi_2, \ldots, \xi_n)$ is a vector of variables uniformly
distributed on the interval $[-1, 1]$ (Mikhalevitch et al
(1987)).
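A sketch of this estimator, assuming the symmetric-difference form reconstructed above:

```python
import numpy as np

def spsa_uniform_gradient(f, x, sigma, rng):
    """g(x, sigma, xi) = ((f(x + sigma*xi) - f(x - sigma*xi)) / (2*sigma)) * xi,
    with xi_i ~ U[-1, 1]; two function evaluations per sample."""
    xi = rng.uniform(-1.0, 1.0, size=x.size)
    return (f(x + sigma * xi) - f(x - sigma * xi)) / (2.0 * sigma) * xi
```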
13. Standard Finite Difference Approximation algorithm
The gradient estimator of the Standard Finite Difference Approximation
algorithm is expressed as:
$$g_i(x, \sigma) = \frac{f(x + \sigma \Delta_i) - f(x)}{\sigma}, \quad i = 1, \ldots, n,$$
where $\sigma$ is the value of the perturbation parameter and
$\Delta_i = (0, 0, \ldots, 1, \ldots, 0)$
is the vector with zero components except the i-th one,
which is equal to 1 (Mikhalevitch et al (1987)).
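A sketch of the finite-difference estimator; note that it needs n + 1 function evaluations per iteration, versus one or two for the SPSA estimators above:

```python
import numpy as np

def fd_gradient(f, x, sigma):
    """Standard finite-difference estimate g_i = (f(x + sigma*e_i) - f(x)) / sigma."""
    fx = f(x)
    g = np.empty(x.size)
    for i in range(x.size):
        e_i = np.zeros(x.size)
        e_i[i] = 1.0                      # i-th coordinate unit vector
        g[i] = (f(x + sigma * e_i) - fx) / sigma
    return g
```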
14. Rate of convergence
Let us consider that the function f(x) has a sharp
minimum at the point $x^*$, to which the algorithm
converges, with the step parameters chosen as
$$\rho_k = \frac{a}{k}, \quad a > 0, \qquad \sigma_k = \frac{b}{k^{\beta}}, \quad b > 0, \; 0 < \beta \le 1, \quad k = 1, 2, \ldots$$
Then, when $2aH > 1$,
$$E\,\|x^{k+1} - x^*_{k+1}\|^2 = \frac{A K^2 a^2}{(2aH - 1)\,k} + o\!\left(\frac{1}{k}\right),$$
where A > 0, H > 0, K > 0 are certain constants and $x^*_{k+1}$ is
the minimum point of the smoothed function.
15. Computer simulation
The proposed methods were tested with the following
functions:
$$f(x) = \sum_{k=1}^{n} a_k\,\lvert x_k - M \rvert,$$
where the $a_k$ are real numbers randomly and
uniformly generated in the interval $[\alpha, K]$,
$K > 0$.
Samples of T = 500 test functions were generated,
with $\alpha = 2$, $K = 5$.
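A possible generator for such test functions; the interval endpoints and the shift M are assumptions read off the slide, and n = 10 is arbitrary:

```python
import numpy as np

def make_test_function(n, low=2.0, high=5.0, M=0.0, seed=0):
    """One random piecewise-linear test function f(x) = sum_k a_k * |x_k - M|,
    a_k ~ U[low, high]; it has a sharp (nondifferentiable) minimum at x = (M,...,M)."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(low, high, size=n)
    return lambda x: float(np.sum(a * np.abs(np.asarray(x) - M)))

# a sample of T test functions, as on the slide
T = 500
tests = [make_test_function(n=10, seed=t) for t in range(T)]
```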
19. Volatility estimation by Stochastic Approximation algorithm
Let us consider the application of Stochastic
Approximation to the minimization of the mean
absolute pricing error for parameter calibration in
the Heston Stochastic Volatility model [Heston S. L. (1993)].
We consider the mean absolute pricing error
(MAE) defined as:
$$MAE(\kappa, \theta, \sigma, \rho, v_0, \lambda) = \frac{1}{N} \sum_{i=1}^{N} \left| C_i^H(\kappa, \theta, \sigma, \rho, v_0, \lambda) - C_i \right|,$$
where N is the total number of options, $C_i$ and $C_i^H$ represent
the realized market price and the theoretical model price,
respectively, while $(\kappa, \theta, \sigma, \rho, v_0, \lambda)$ (n = 6) are the parameters of the
Heston model to be estimated.
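A sketch of the MAE objective in Python; `heston_price` stands for a user-supplied Heston pricer and is a hypothetical interface, not part of the lecture material:

```python
import numpy as np

def mae(params, options, market_prices, heston_price):
    """Mean absolute pricing error over N options.
    params: (kappa, theta, sigma, rho, v0, lam); heston_price(params, option)
    is a hypothetical pricer returning the model price C_i^H."""
    errs = [abs(heston_price(params, opt) - c)
            for opt, c in zip(options, market_prices)]
    return float(np.mean(errs))
```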
20. To compute option prices with the Heston model,
one needs input parameters that can hardly be found
from the market data.
We need to estimate the above parameters by an
appropriate calibration procedure. The estimates of
the Heston model parameters are obtained by
minimizing the MAE:
$$MAE(\kappa, \theta, \sigma, \rho, v_0, \lambda) \to \min.$$
Let us consider the Heston model for the Call option
on SPX (29 May 2002).
22. Optimal Design of Cargo Oil Tankers
In cargo oil tanker design, it is necessary to
choose the dimensions of the bulkheads so that
the weight of the bulkheads is minimal.
23. The minimization of the weight of the bulkheads for the cargo oil tank can be
formulated as a nonlinear programming task (Reklaitis et al (1986)):
$$f(x) = \frac{5.885\, x_4 (x_1 + x_3)}{x_1 + \sqrt{x_3^2 - x_2^2}} \to \min$$
subject to
$$g_1(x) = x_2 x_4 \left(0.4\, x_1 + \frac{x_3}{6}\right) - 8.94 \left(x_1 + \sqrt{x_3^2 - x_2^2}\right) \ge 0,$$
$$g_2(x) = x_2^2 x_4 \left(0.2\, x_1 + \frac{x_3}{12}\right) - 2.2 \left(8.94 \left(x_1 + \sqrt{x_3^2 - x_2^2}\right)\right)^{4/3} \ge 0,$$
$$g_3(x) = x_4 - 0.0156\, x_1 - 0.15 \ge 0,$$
$$g_4(x) = x_4 - 0.0156\, x_3 - 0.15 \ge 0,$$
$$g_5(x) = x_4 - 1.05 \ge 0,$$
$$g_6(x) = x_3 - x_2 \ge 0,$$
where $x_1$ is the width, $x_2$ the depth, $x_3$ the length, and $x_4$ the plate thickness.
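The task transcribed into Python, with a simple exterior penalty (the penalty weight `mu` is an assumption) so that the stochastic methods above can treat it as unconstrained:

```python
import numpy as np

def bulkhead_objective(x):
    """Weight of the bulkheads; x = (width, depth, length, thickness).
    Requires x3 > x2, which constraint g6 enforces."""
    x1, x2, x3, x4 = x
    return 5.885 * x4 * (x1 + x3) / (x1 + np.sqrt(x3**2 - x2**2))

def bulkhead_constraints(x):
    """Constraint values g_1..g_6; feasibility requires g_i(x) >= 0."""
    x1, x2, x3, x4 = x
    s = np.sqrt(x3**2 - x2**2)
    return np.array([
        x2 * x4 * (0.4 * x1 + x3 / 6.0) - 8.94 * (x1 + s),
        x2**2 * x4 * (0.2 * x1 + x3 / 12.0) - 2.2 * (8.94 * (x1 + s))**(4.0 / 3.0),
        x4 - 0.0156 * x1 - 0.15,
        x4 - 0.0156 * x3 - 0.15,
        x4 - 1.05,
        x3 - x2,
    ])

def penalized(x, mu=1e3):
    """Exterior penalty: objective plus mu * sum of squared constraint violations."""
    g = bulkhead_constraints(x)
    return bulkhead_objective(x) + mu * np.sum(np.minimum(g, 0.0)**2)
```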
24. SPSA with Lipschitz perturbation for the
cargo oil tank design
[Figure: objective function value versus the number of iterations (100-10000).]
25. Confidence bounds of the minimum
(A=6.84241, T=100, N=1000)
[Figure: upper and lower confidence bounds and the minimum of the objective function versus the number of iterations.]
26. Simulated Annealing
Global optimization methods:
global algorithms (branch and bound algorithms,
dynamic programming, exhaustive search, etc.);
greedy optimization (local search);
heuristic optimization.
28. Simulated Annealing algorithm
The Simulated Annealing algorithm was
developed by modeling the steel
annealing process (Metropolis et al.
(1953)).
It has found many applications in Operational
Research, Data Analysis, etc.
29. Simulated Annealing
Main idea:
simulate a drift of the current solution with
probability distribution $P(x, T_k)$;
improve the solution by updating:
- the temperature function $T_k$;
- the neighborhood function $\kappa_k$.
30. Simulated Annealing algorithm
Step 1. Choose $T_0$ and $x^0$; set $k = 0$.
Step 2. Generate a drift $Z^{k+1}$ with probability
distribution $P(x, T_k)$.
Step 3. If $\|Z^{k+1}\| \le \kappa_k$ and either $f(x^k) \ge f(x^k + Z^{k+1})$
or (Metropolis rule)
$$e^{-\left(f(x^k + Z^{k+1}) - f(x^k)\right)/T_k} \ge U(0,1),$$
then accept $x^{k+1} = x^k + Z^{k+1}$, set $k = k + 1$; otherwise return to Step 2.
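A compact sketch of these steps in Python; the cooling schedule, the neighborhood scale, and the Cauchy drift (a heavy-tailed, Pareto-type choice, as motivated on the following slides) are illustrative assumptions:

```python
import numpy as np

def simulated_annealing(f, x0, T0=1.0, kappa0=1.0, n_iter=5000, seed=0):
    """Metropolis-rule SA with assumed cooling T_k = T0/(k+1) and
    neighborhood scale kappa_k = kappa0/sqrt(k+1)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    best_x, best_f = x.copy(), fx
    for k in range(n_iter):
        T_k = T0 / (k + 1)
        kappa_k = kappa0 / np.sqrt(k + 1)
        z = kappa_k * rng.standard_cauchy(x.size)   # heavy-tailed drift Z^{k+1}
        fz = f(x + z)
        # accept improvements always, worse moves with probability e^{-delta/T_k}
        if fz <= fx or np.exp(-(fz - fx) / T_k) >= rng.uniform():
            x, fx = x + z, fz
            if fx < best_f:
                best_x, best_f = x.copy(), fx
    return best_x, best_f
```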
31. Improvement of SA by Pareto Type
models
The theoretical investigation of SA convergence shows
that in these algorithms Pareto-type models can be
applied to form the search sequence (Yang (2000)).
Class of Pareto models, main feature and parameter:
the distributions of Pareto-type models have "heavy tails";
α is the main parameter of these models, which governs
the heaviness of the tail;
α-stable distributions are of Pareto type (this follows from the generalized C.L.T.).
32. Pareto type (Heavy-tailed)
distributions
Main features:
infinite variance and, possibly, infinite mean.
Introduced by Pareto in the 1920s.
Mandelbrot established the use of heavy-tailed
distributions to model real-world fractal
phenomena.
There are many other applications (financial
markets, traffic in computer and
telecommunication networks, etc.).
35. Comparison of tail probabilities
for standard normal, Cauchy and Levy
distributions
The tail probabilities of the three
distributions were compared. It is clear that the
tail probability of the normal distribution
quickly becomes negligible,
whereas the other two
distributions retain a significant
probability mass in the tail.
[Table: tail probabilities of the standard normal, Cauchy and Levy distributions.]
36. Improvement of SA by Pareto type
models
The convergence conditions (Yang (2000)) indicate that,
under suitable conditions, an appropriate choice of the
temperature and neighborhood size updating functions
ensures the convergence of the SA algorithm to the
global minimum of the objective function over the
domain of interest.
The following corollaries give different forms of
temperature and neighborhood size updating functions
corresponding to different kinds of generation probability
density functions to guarantee the global convergence
of the SA algorithm.
38. Improvement of SA in continuous
optimization
The above corollaries indicate that a different form of temperature
updating function has to be used with respect to a different kind of
generation probability density function in order to ensure the global
convergence of the corresponding SA algorithm.
41. Testing of SA for continuous optimization
When optimization algorithms are applied to global and
combinatorial optimization problems, their reliability and
efficiency need to be tested.
Special test functions, known from the literature, are used for
this purpose.
Some of these functions have one or more global minima;
some have both global and local minima.
With the help of these functions it can be verified that the
methods are efficient enough; thus, it is possible to test
whether algorithms get trapped in a local minimum, and to
monitor the speed and accuracy of convergence and other
parameters.
42. Testing criteria
By running the SA algorithm on several test functions
with two different distributions, and varying some
optional parameters, the following questions were addressed:
which of these distributions guarantees faster
convergence to the global minimum in terms of the
objective function value;
what the probabilities of finding the global minimum are,
and how changing some parameters impacts these
probabilities;
what the proper number of iterations is that
guarantees finding the global minimum with the
desired probability.
43. Testing criteria
Characteristics to be evaluated
by Monte-Carlo simulation:
the value of the minimized objective function;
the probability of finding the global minimum after
a given number of iterations.
These characteristics were computed by the
Monte-Carlo method: N realizations
(N = 100, 500, 1000) with K iterations each
(K = 100, 500, 1000, 3000, 10000, 30000).
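A sketch of how such a probability can be estimated; `run_sa` is a hypothetical wrapper that runs one SA realization and returns the best objective value found:

```python
def success_probability(run_sa, f_star, N=100, K=1000, tol=1e-2):
    """Monte-Carlo estimate of the probability that one SA run of K iterations
    ends within `tol` of the known global minimum value f_star."""
    hits = sum(run_sa(seed=i, n_iter=K) - f_star < tol for i in range(N))
    return hits / N
```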
44. Test functions
An example of a test function:
Branin's RCOS (RC) function (2 variables):
$$RC(x_1, x_2) = \left(x_2 - \frac{5.1}{4\pi^2}\, x_1^2 + \frac{5}{\pi}\, x_1 - 6\right)^2 + 10 \left(1 - \frac{1}{8\pi}\right) \cos(x_1) + 10;$$
Search domain:
$-5 < x_1 < 10$, $0 < x_2 < 15$;
3 minima:
$(x_1, x_2)^* = (-\pi,\ 12.275)$, $(\pi,\ 2.275)$, $(9.42478,\ 2.475)$;
$RC((x_1, x_2)^*) = 0.397887$.
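The function in Python, with a check of the three minimizers listed above:

```python
import numpy as np

def branin_rcos(x1, x2):
    """Branin's RCOS test function; global minimum value ~0.397887."""
    return ((x2 - 5.1 / (4.0 * np.pi**2) * x1**2 + 5.0 / np.pi * x1 - 6.0)**2
            + 10.0 * (1.0 - 1.0 / (8.0 * np.pi)) * np.cos(x1) + 10.0)

# evaluate RC at the three global minimizers
for p in [(-np.pi, 12.275), (np.pi, 2.275), (9.42478, 2.475)]:
    print(p, branin_rcos(*p))   # each prints ~0.397887
```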
48. Simulation results
[Figure 1: probability of finding the global minimum by SA for the Rastrigin function versus the number of iterations (1-3000).]
49. Wrap-Up and Conclusions
1. The Stochastic Approximation methods have been
considered and compared: SPSA with the Lipschitz
perturbation operator, SPSA with the Uniform
perturbation operator, and the SFDA method, as well
as Simulated Annealing.
2. Computer simulation by the Monte-Carlo method has
shown that the empirical estimates of the rate of
convergence of Stochastic Approximation for
nondifferentiable functions corroborate the theoretical
rates $O\!\left(\frac{1}{k^{\beta}}\right)$, $1 \le \beta \le 2$.