Stochastic modelling and quasi-random numbers
1. Stochastic models + quasi-random
(Teytaud, TAO (Inria), LRI (Paris-Sud), UMR CNRS 8623, France;
OASE Lab, NUTN, Taiwan)
First part: randomness.
What is a stochastic / randomized model
Terminology, tools
Second part: quasi-random points
Random points can be very disappointing
Sometimes quasi-random points are better
2. Useful maths
we will need these tools...
Prime number: 2,3,5,7,11,13,17,...
P(A|B): conditioning in probability.
P(dice=1 | dice in {1,2,3} ) ?
P(dice=3 | dice in {1,2} ) ?
Frequency in data x(1),x(2),...,x(n):
1,2,6,3,7: frequency(odd) ?
frequency ( x(i+1) > x(i) ) ?
frequency ( x(i+1) > 3 | x(i) < 4 ) ?
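To make these definitions concrete, here is a minimal C sketch (my illustration, not from the slides) that computes the three frequencies above on the example sequence 1,2,6,3,7.

#include <stdio.h>

int main(void) {
    double x[] = {1, 2, 6, 3, 7};            /* the example data above */
    int n = sizeof(x) / sizeof(x[0]);
    int odd = 0, up = 0, cond_num = 0, cond_den = 0;
    for (int i = 0; i < n; i++)
        if (((int)x[i]) % 2 == 1) odd++;     /* frequency(odd) */
    for (int i = 0; i + 1 < n; i++) {
        if (x[i+1] > x[i]) up++;             /* frequency( x(i+1) > x(i) ) */
        if (x[i] < 4) {                      /* conditional frequency */
            cond_den++;
            if (x[i+1] > 3) cond_num++;
        }
    }
    printf("frequency(odd) = %g\n", (double)odd / n);
    printf("frequency(x(i+1)>x(i)) = %g\n", (double)up / (n - 1));
    printf("frequency(x(i+1)>3 | x(i)<4) = %g\n", (double)cond_num / cond_den);
    return 0;
}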
3. Let's take time for understanding
random simulations
I guess you all know how to simulate a random
variable uniform in [0,1]
e.g. double u=drand48();
But do you know how to simulate one year of
weather in Tainan ?
Not so simple.
Let's see this in more detail.
4. Random sequence
in dimension 1
What is a climate model ?
Define:
w1 = weather at time step 1
w2 = weather at time step 2
w3 = weather at time step 3
w4 = weather at time step 4
…
==> let's keep it simple: let's define the weather
by a single number in [0,1].
5. I want a generative model
Just as I can repeat u=drand48() and
generate a sample u1, u2, u3, I want to be able
to generate
W1=(w11,w12,w13,...,w1T)
W2=(w21,w22,w23,...,w2T)
W3=...
…
==> think of a generator of
curves
6. Random sequence
in dimension 1
What is a climate model ?
Define:
w1 = weather at time step 1
The model tells you what w1 can look like. For example,
it gives the density function g:
P(w1 in I) = integral of g on I
7. Take-home message number 1:
a random variable w on R
is entirely defined by
P(w in I)
for each interval I
8. Random sequence
in dimension 1
P(w1 in I) = integral of g on I
P(w1 <= c) = integral of g on [-infinity,c] = G(c)
9. Generating w1: easy with the
inverse cumulative distribution
P(w1 in I) = integral of g on I
P(w1 <= c) = integral of g on [0,c] = G(c)
Consider invG= inverse of G.
G=cumulative distribution
i.e. G(invG(x))=x
Trick for generating w1:
u=drand48()
w1=invG(u)=invCDF(u);
10. Generating w1: easy with the
inverse cumulative distribution
[Figure, repeated on slides 10-14 as an animation: the uniform draw u on the vertical axis is mapped through the inverse of G to w1 = invG(u) on the horizontal axis]
u=drand48()
w1=invG(u);
15. Take-home message number 2:
a random variable w on R
is more conveniently defined by
P(w <= t)
for each t,
and the best is
invCDF = inverse of (t → P(w<=t))
Because then:
w=invCDF(drand48());
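As an illustration of this take-home message, here is a minimal C sketch (my example, not from the slides), using an exponential random variable because its inverse CDF has the closed form invCDF(u) = -log(1-u)/lambda.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Inverse CDF of an exponential(lambda) variable: G(w) = 1 - exp(-lambda*w). */
double invCDF_exponential(double u, double lambda) {
    return -log(1.0 - u) / lambda;
}

int main(void) {
    srand48(42);
    double lambda = 2.0, sum = 0.0;
    int n = 100000;
    for (int i = 0; i < n; i++) {
        double u = drand48();                       /* uniform in [0,1) */
        double w = invCDF_exponential(u, lambda);   /* w = invCDF(u) */
        sum += w;
    }
    printf("empirical mean = %g (theory: 1/lambda = %g)\n", sum / n, 1.0 / lambda);
    return 0;
}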
16. Generating w2: also easy with
inv. cumulative distribution ?
w1=invG1(drand48());
w2=invG2(drand48());
w3=invG3(drand48());
…
Can we generate each wi independently ?
==> very easy !
==> but very bad :-(
==> no correlation :-(
==> w4 very high and w5 very low is unrealistic;
but in this model it happens very often!
17. Generating wi: also easy with
inv. cumulative distribution ?
[Figure: a trajectory with large-scale variations is realistic; a trajectory from the independent model is unrealistic, with an almost constant average value]
18. So how can we do ?
The model should not give the (independent) distribution of w2, but the distribution of w2 given w1:
w1=invG1(drand48());
w2=invG2(w1,drand48());
w3=invG3(w2,drand48());
Does this make sense ? This is a Markov chain.
==> the wi should NOT be generated independently!
19. Variant
The model should not give the (independent) distribution of w2, but the distribution of w2 given the previous steps:
w1=invG1(drand48());
w2=invG2(w1, drand48());
w3=invG3(w2, w1, drand48());
w4=invG4(w3, w2, drand48());
w5=invG5(w4, w3, drand48());
==> this is a Markov chain of order 2
==> we keep order 1 for today
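A minimal C sketch of such an order-1 generator (my illustration, not the slides' climate model): the conditional inverse CDF invG below is a made-up toy that just pulls w(i) toward w(i-1), only to show the structure w(i) = invG(w(i-1), drand48()).

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical conditional inverse CDF: given the previous weather wprev
   and a uniform draw u, return a new weather in [0,1] correlated with wprev. */
double invG(double wprev, double u) {
    return 0.8 * wprev + 0.2 * u;    /* toy choice, NOT a real climate model */
}

int main(void) {
    srand48(123);
    int T = 10;
    double w = drand48();            /* w1 = invG1(drand48()) : initial weather */
    printf("w1 = %g\n", w);
    for (int t = 2; t <= T; t++) {
        w = invG(w, drand48());      /* w(t) = invG(w(t-1), drand48()) */
        printf("w%d = %g\n", t, w);
    }
    return 0;
}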
20. Let's see an example
Assume that we have a plant.
This plant is a function:
(Production,State,Benefit) =
f( Demand , State , Weather )
Demand = g(weather,economy,noise)
(where Economy is the part of the economy
which does not depend too much on the weather)
Benefit per year
21. Graphically
Weather:
w1, w2, w3, w4, w5, … ==> random sequence
==> we assume a distribution of w(i) | w(i-1)
==> this is a Markov Model ( forget w(i-2) )
Economy
e1, e2, e3, e4, e5, … ==> random sequence
==> we assume a distribution of e(i) | e(i-1)
Noise = given distribution
==> n1, n2, n3, ....
22. Graphically
[Diagram: three chains e1...e5, d1...d5, w1...w5, with arrows between them; arrows mean dependency]
The “model” should tell you how to generate d2, given d1, e2,w2.
(ei,di,wi) is a Markov chain. (di) is a hidden Markov chain:
a part is hidden.
23. How to build a
stochastic model ?
It's about uncertainties
Even without hidden models, it's complicated
We have not discussed how to design a
stochastic model (typically from historical
archive):
Typically, discretization: w(k) in I1 or I2 or I3
with I1 = ]-infinity, a],
I2 = ]a, b], I3 = ]b, +infinity[
G(w,w')= frequency of w(k+1) <= w'
for w(k) in same interval as w
24. Yet another take-home message
Typically, discretization: w(k) in I1 or I2 or I3
with I1 = ]-infinity, a], I2 = ]a, b], I3 = ]b, +infinity[
G(w,w')= frequency of w(k+1) <= w'
for w(k) in same interval as w
(obviously more intervals in many real cases...)
==> However, this reduces extreme values
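A minimal C sketch of this construction (my illustration: the archive, the thresholds a and b, and the function names are made up): it discretizes w(k) into the three intervals and reads G(w,w') off the archive as an empirical frequency.

#include <stdio.h>

#define N 12
/* a made-up historical archive of weather values in [0,1] */
static const double archive[N] = {0.10,0.15,0.40,0.55,0.70,0.65,0.30,0.20,0.45,0.80,0.75,0.50};
static const double a = 0.33, b = 0.66;   /* hypothetical discretization thresholds */

static int interval(double w) {           /* which of I1, I2, I3 contains w ? */
    if (w <= a) return 0;
    if (w <= b) return 1;
    return 2;
}

/* G(w, wprime) = frequency of w(k+1) <= wprime among the k with w(k) in the same interval as w */
static double G(double w, double wprime) {
    int count = 0, total = 0;
    for (int k = 0; k + 1 < N; k++)
        if (interval(archive[k]) == interval(w)) {
            total++;
            if (archive[k + 1] <= wprime) count++;
        }
    return total ? (double)count / total : 0.0;
}

int main(void) {
    printf("G(0.2, 0.5) = %g\n", G(0.2, 0.5));   /* transition CDF from interval I1 */
    printf("G(0.7, 0.5) = %g\n", G(0.7, 0.5));   /* transition CDF from interval I3 */
    return 0;
}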
25. A completely different
approach ?
Write p1,p2,p3,...,pN all the parameters of the
model
Collect data x1,...,xD
For each i in {1,2,...,D}, xi=(xi1,...,xiT) = a curve
Optimize p1,p2,p3,...,pN so that all moments of
order <= 2 are (nearly) the same as the moments of
the archive.
Moment1(i) = (x1i+x2i+...+xDi)/D ==> where is i ? (i indexes the time steps)
Moment2(i,j) = average over k of xki·xkj = (x1i·x1j+...+xDi·xDj)/D
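A minimal C sketch of the archive moments that the parameters p1,...,pN should (nearly) reproduce (my illustration; the archive values are made up).

#include <stdio.h>

#define D 4   /* number of archived curves (made up) */
#define T 3   /* number of time steps (made up) */

/* a tiny made-up archive: D curves of length T */
static const double x[D][T] = {
    {0.2, 0.4, 0.3},
    {0.5, 0.6, 0.4},
    {0.1, 0.3, 0.2},
    {0.4, 0.5, 0.6},
};

/* Moment1(i) = (x1i + ... + xDi) / D */
double moment1(int i) {
    double s = 0.0;
    for (int k = 0; k < D; k++) s += x[k][i];
    return s / D;
}

/* Moment2(i,j) = (x1i*x1j + ... + xDi*xDj) / D */
double moment2(int i, int j) {
    double s = 0.0;
    for (int k = 0; k < D; k++) s += x[k][i] * x[k][j];
    return s / D;
}

int main(void) {
    for (int i = 0; i < T; i++)
        printf("Moment1(%d) = %g\n", i, moment1(i));
    printf("Moment2(0,1) = %g\n", moment2(0, 1));
    return 0;
}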
26. Example of parametric HMM
Parameters = { parameters of e, parameters of
w, parameters of d } = {15 sets of parameters }
= very big
[Diagram: the chains e1...e5, d1...d5, w1...w5, as on slide 22]
27. Main troubles
OK, we know what a stochastic model is
The case of HMM is much more complicated
(but tools exist)
But gathering data is not always so easy.
For example, climate: do you trust the last 50
years for predicting the next 10 years ?
Even if you trust the past 50 years, do you think
it's enough for building a sophisticated model ?
We need a combination between
28. Validation
Statistical models always lie
Because the structure is wrong
Because there are not enough data
==> typically, extreme values are rarer in
models than in reality
Check the extreme events
Usually, it's good to have more extreme values
than in the data (because all models tend to make
them too rare...).
29. Example: French climate
France has a quiet climate
No big wind
No heavy rains
No heat wave
But:
2003: huge heat wave. 15 000 died in France
(6.2 times more than the 921 earthquake!)
1999: hurricane-like winds (96 died in Europe;
gusts at 169 km/h in Paris)
1987: huge rain falls (96 mm in 24 hours)
30. Example: 2003 heat wave
Paris:
9 days with max temp. > 35°C
1 night with no less than 25.5°C <== disaster
France: 15 000 died
Italy: 20 000 died
==> European countries
were not ready for this
31. Example: 2003 heat wave
==> plenty of take-home messages
Bad model: air conditioning sometimes
automatically stopped, because such
high temperatures were considered as
measurement bugs ==> extreme values
neglected
Heat wave + no wind ==> increased
pollution
==> old people die (babies carefully
32. Example: 2003 heat wave
==> plenty of take-home messages
Be careful with extreme values
neglected
==> extreme values are not always
measurement bugs
==> removing air conditioning
because it's too hot...
(some systems were not ready
33. Example: 2003 heat wave
==> plenty of take-home messages
Be careful with extreme values
neglected
==> extreme values are not always
measurement bugs
Independence is a very strong
assumption
35. Quasi-random points
(Teytaud, TAO (Inria), LRI (Paris-Sud), UMR CNRS 8623;
collabs with S. Gelly, J. Mary, S. Lallich, E. Prudhomme,...)
Quasi-random points ?
Dimension 1
Dimension n
Better in dimension n
Strange spaces
37. Why do we need random /
quasi-random points ?
Numerical integration [thousands of papers; Niederreiter 92]
integral(f) nearly equal to
(1/n) sum f(xi)
Learning [Cervellera et al, IEEETNN 2004, Mary phD 2005]
Optimization [Teytaud et al, EA'2005]
Modelization of random processes [Growe-Kruska et al, IEEEBPTP'03]
Path planning [Tuffin]
38. Where do we need numerical
integration ?
Just everywhere.
Expected pollution (=average pollution...)
= integral of possible
pollutions as a function of many random
variables
(weather, defects on parts, gasoline, use
of the car...)
39. Take-home message
When optimizing
the design of something
which is built in a factory,
take into account the variance in the production
system ==> all cars are different.
==> very important effect
==> real part != specifications
40. Why do we need numerical
integration ?
Expected benefit (=average benefit...)
= integral of possible
benefit as a function of many random
variables
(weather, prices of raw materials...)
==> economical benefit (company)
==> overall welfare (state)
41. Why do we need numerical
integration ?
Risk (=probability of failure...)
= integral of possible
failures as a function of many random
variables
(quakes, flood, heat waves,
electricity breakdowns, human error...)
42. Take-home message
Human error must be taken
into account:
- difficult to model
- e.g. a minimum probability that action X
is not performed (for all actions)
(or that unexpected action Y is performed)
(what about an adversarial human ?)
==> protection by independent validations
43. Why do we need numerical
integration ?
Expected benefit as a function
of many prices/random variables,
Expected efficiency depending on machining
vibrations
Evaluating schedules in industry (with
random events like faults, delays...)
(e.g. processors)
44. How to know if some points
are well distributed ?
I propose N points x=(x1,...,xN)
How to know if these points are well distributed ?
A naive solution:
f(x) = max over y of ( min over i of ||y - xi|| )
(to be minimized: a large value means a large hole with no point)
(naive, but not always so bad)
45. How to know if some points
are well distributed ?
I propose N points x=(x1,...,xN)
How to know if these points are well distributed ?
A naive solution:
g(x) = min over i, j != i of ||xj - xi||²   (to be maximized)
= “dispersion” (naive, but not always so bad)
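A minimal C sketch of both criteria in dimension 2 (my illustration): g is the exact minimum squared pairwise distance, and f, the largest-hole criterion of the previous slide, is only approximated on a regular grid of candidate points y.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 100   /* number of points, 2 dimensions */

static double dist2(const double a[2], const double b[2]) {
    double dx = a[0] - b[0], dy = a[1] - b[1];
    return dx * dx + dy * dy;
}

int main(void) {
    double x[N][2];
    srand48(7);
    for (int i = 0; i < N; i++) { x[i][0] = drand48(); x[i][1] = drand48(); }

    /* g(x) = min over i, j != i of ||xj - xi||^2  (large is better) */
    double g = 1e30;
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++) {
            double d = dist2(x[i], x[j]);
            if (d < g) g = d;
        }

    /* f(x) = max over y of min over i of ||y - xi||, approximated on a 51x51 grid */
    double f = 0.0;
    for (int gy = 0; gy <= 50; gy++)
        for (int gx = 0; gx <= 50; gx++) {
            double y[2] = {gx / 50.0, gy / 50.0}, best = 1e30;
            for (int i = 0; i < N; i++) {
                double d = dist2(y, x[i]);
                if (d < best) best = d;
            }
            if (best > f) f = best;
        }
    printf("g (min squared pairwise distance)    = %g\n", g);
    printf("f (largest hole radius, grid approx.) = %g\n", sqrt(f));
    return 0;
}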
48. Is there better than random
points for low discrepancy ?
Random --> Discrepancy ~ sqrt ( 1/n )
Quasi-random --> Discrepancy ~ log(n)^d/n
Quasi-random with N known --> Discrepancy ~ log(n)^(d-1)/n
Koksma & Hlawka :
error in Monte-Carlo integration
< Discrepancy x V
V= total variation (Hardy & Krause)
( many generalizations in Hickernell, A Generalized
Discrepancy and Quadrature Error Bound, 1998 )
==> sometimes V or log(n)^d huge
==> don't always trust QR
55. Dimension 1
What would you do ?
--> Van Der Corput
n=1, n=2, n=3...
n=1, n=10, n=11, n=100, n=101, n=110... (p=2)
x=.1, x=.01, x=.11, x=.001, x=.101, … (binary!)
56. Dimension 1
What would you do ?
--> Van Der Corput
n=1, n=2, n=3...
n=1, n=2, n=10, n=11, n=12, n=20... (p=3)
x=.1, x=.2, x=.01, x=.11, x=.21, x=.02... (ternary!)
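A minimal C sketch of the Van der Corput construction (the standard radical inverse; the printed values should reproduce the binary and ternary examples above).

#include <stdio.h>

/* Van der Corput radical inverse: reverse the base-p digits of n behind the decimal point. */
double van_der_corput(unsigned n, unsigned p) {
    double x = 0.0, base = 1.0 / p;
    while (n > 0) {
        x += (n % p) * base;   /* least significant digit becomes most significant */
        n /= p;
        base /= p;
    }
    return x;
}

int main(void) {
    for (unsigned n = 1; n <= 6; n++)
        printf("n=%u: base 2 -> %g, base 3 -> %g\n",
               n, van_der_corput(n, 2), van_der_corput(n, 3));
    return 0;
}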
57. Dimension 1 more general
p=2, but also p=3, 4, ...
but p=13 is not very nice :
58. Dimension 2: maybe just
use two Van Der Corput sequences
with same p ?
x --> (x,x) ?
60. Dimension 2 or n : Halton
x --> (x,x') with different prime numbers is ok
(needs maths...)
(as small numbers are better, use the n smallest primes...)
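A minimal C sketch of a Halton point (my illustration), reusing the Van der Corput construction above with the smallest primes as bases, one prime per coordinate.

#include <stdio.h>

/* Van der Corput radical inverse in base p (as sketched above). */
double van_der_corput(unsigned n, unsigned p) {
    double x = 0.0, base = 1.0 / p;
    while (n > 0) {
        x += (n % p) * base;
        n /= p;
        base /= p;
    }
    return x;
}

/* n-th Halton point in dimension d: one Van der Corput sequence per coordinate,
   with the d smallest primes as bases. */
void halton(unsigned n, int d, double out[]) {
    static const unsigned primes[] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};
    for (int k = 0; k < d; k++)
        out[k] = van_der_corput(n, primes[k]);
}

int main(void) {
    double pt[2];
    for (unsigned n = 1; n <= 5; n++) {
        halton(n, 2, pt);
        printf("Halton point %u: (%g, %g)\n", n, pt[0], pt[1]);
    }
    return 0;
}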
61. Dimension n+1 : Hammersley
(n/(N+1), xn, x'n) --> closed sequence
(i.e., the number of points N is known in advance)
62. Dimension n : the trouble
There are not so many small prime numbers
63. Dimension n : scrambling
(here, random comes back)
Pi(p) : [1,p-1] --> [1,p-1]
Pi(p) applied to
coordinate with
prime number p
64. Dimension n : scrambling
Pi(p) : [1,p-1] --> [1,p-1] (randomly chosen)
Pi(p) applied to the coordinate with prime p
(there exist much more complicated scramblings)
65. Beyond low discrepancy ?
Other discrepancies : why rectangles ?
Other solutions : lattices
{x0+nx} modulo 1
(very fast and simple; see the sketch below)
Let's see very different approaches
Low discrepancy for other spaces than [0,1]^n
Stratification
Symmetries
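A minimal C sketch of the lattice construction {x0 + n·x} modulo 1 mentioned above (my illustration; the starting point x0 and the irrational steps based on sqrt(2) and sqrt(3) are assumptions, not from the slides).

#include <stdio.h>
#include <math.h>

#define DIM 2

/* Lattice / Kronecker-type sequence: point(n) = { x0 + n * x } modulo 1, coordinate-wise. */
void lattice_point(unsigned n, const double x0[DIM], const double x[DIM], double out[DIM]) {
    for (int k = 0; k < DIM; k++)
        out[k] = fmod(x0[k] + n * x[k], 1.0);
}

int main(void) {
    /* One possible (assumed) choice: irrational steps from sqrt(2) and sqrt(3). */
    double x0[DIM] = {0.5, 0.5};
    double x[DIM]  = {sqrt(2.0) - 1.0, sqrt(3.0) - 1.0};
    double pt[DIM];
    for (unsigned n = 1; n <= 5; n++) {
        lattice_point(n, x0, x, pt);
        printf("point %u: (%g, %g)\n", n, pt[0], pt[1]);
    }
    return 0;
}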
66. Some animals are quite good
for low-discrepancy
Why in the square ?
Other spaces/distributions: Gaussians, sphere
67. Why in the square ?
Uniformity in the square is ok
But what about Gaussian distributions ?
x in ]0,1[^d
y(i) such that P( N > y(i) ) = x(i)
with N standard gaussian
then y is quasi-random and gaussian
==> so you can have
quasi-random Gaussian numbers
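A minimal C sketch of this conversion (my illustration): it solves P(N > y) = x for y by bisection on the standard normal tail written with erfc from math.h, so a quasi-random x in ]0,1[ becomes a quasi-random Gaussian y.

#include <stdio.h>
#include <math.h>

/* Standard normal tail: P(N > y) = erfc(y / sqrt(2)) / 2. */
static double tail(double y) {
    return 0.5 * erfc(y / sqrt(2.0));
}

/* Solve P(N > y) = x for y by bisection (tail() is decreasing in y). */
double gaussian_from_uniform(double x) {
    double lo = -10.0, hi = 10.0;
    for (int it = 0; it < 60; it++) {
        double mid = 0.5 * (lo + hi);
        if (tail(mid) > x) lo = mid;   /* tail still too big: move right */
        else               hi = mid;
    }
    return 0.5 * (lo + hi);
}

int main(void) {
    /* A few uniform inputs (in practice: quasi-random coordinates in ]0,1[). */
    double xs[] = {0.5, 0.841344746, 0.158655254, 0.975};
    for (int i = 0; i < 4; i++)
        printf("x = %g -> y = %g\n", xs[i], gaussian_from_uniform(xs[i]));
    return 0;
}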
68. Why in the square ?
Other n-dimensional random variables by the
“conditioning” trick
Consider a QR point: (x1,....xn) in [0,1]^n
You want to simulate z with distribution Z
z1=inf { z; P(Z1<z) >x1 } = invG1(x1)
z2=inf { z; P(Z2<z|Z1=z1) > x2 } =
invG2(z1,x2)
z3=inf { z; P(Z3<z|Z1=z1,Z2=z2) > x3 } =
invG3(z1,z2,x3)
69. Why in the square ?
Theorem: If x is random([0,1]^n),
then z is distributed as Z !
==> convert the uniform square into strange spaces or variables
70. Why not for random walks ?
500 steps of random walks ==> huge
dimension
Quasi-random basically does not work in huge
dimension
But the first coordinates of QR are ok; just use
them for the most important coordinates!
==> change the order of variables
and use conditioning !
71. Why not for random walks ?
Quasi-random number x in R^500
(e.g. Gaussian)
Change order: y(250) first (y(250) ---> x(1) )
y(1 | y(250) ) <---> x(2)
y(500 | y(1) and y(250)) <---> x(3)
72. Why not for random walks ?
500 steps of random walks ==> huge
dimension
But strong derandomization possible : start by
y(250), then y(1), then y(500), then y(125), then
y(375)...
73. Why not for random walks ?
500 steps of random walks ==> huge
dimension
But strong derandomization possible :
74. Very different approaches for
derandomization ?
Symmetries : instead of
x1 and x2 in [0,1],
try
x and 1-x
Or more generally, just draw n/2 points,
and use their symmetries
==> in dimension d, n/2^d points and their 2^d symmetric copies
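A minimal C sketch of the x / 1-x trick in dimension 1 (my example; the integrand f(x) = x*x is just a stand-in): each uniform draw is paired with its mirror image, so n/2 draws give n evaluation points.

#include <stdio.h>
#include <stdlib.h>

/* A stand-in integrand on [0,1]; its exact integral is 1/3. */
static double f(double x) { return x * x; }

int main(void) {
    srand48(2024);
    int n = 100000;                 /* total number of evaluation points */
    double sum = 0.0;
    for (int i = 0; i < n / 2; i++) {
        double u = drand48();
        sum += f(u) + f(1.0 - u);   /* the point and its symmetric copy */
    }
    printf("antithetic estimate of the integral = %g (exact: 1/3)\n", sum / n);
    return 0;
}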
77. Very different approaches for
derandomization ?
Control : instead of estimating
E f(x)
Choose g “looking like” f and estimate
E (f-g)(x)
Then E f = E g + E(f-g) is much better
Troubles:
You need a good g
You must be able to evaluate E g
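A minimal C sketch of a control variate (my example: f(x) = exp(x) on [0,1], with g(x) = 1+x as the function “looking like” f, whose expectation E g = 3/2 is known exactly).

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double f(double x) { return exp(x); }      /* E f = e - 1 (what we want) */
static double g(double x) { return 1.0 + x; }     /* control variate, E g = 3/2 known exactly */

int main(void) {
    srand48(7);
    int n = 100000;
    double plain = 0.0, corrected = 0.0;
    for (int i = 0; i < n; i++) {
        double u = drand48();
        plain     += f(u);              /* crude Monte-Carlo */
        corrected += f(u) - g(u);       /* estimate E(f - g), which has small variance */
    }
    double Eg = 1.5;
    printf("crude MC        : %g\n", plain / n);
    printf("control variate : %g\n", Eg + corrected / n);   /* E f = E g + E(f - g) */
    printf("exact value     : %g\n", exp(1.0) - 1.0);
    return 0;
}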
78. Very different approaches for
derandomization ?
Pi-estimation (importance sampling) : instead of estimating
E f(x), where x has density d
Look for y with density d' ≃ f·d (up to normalization)
Then E f(x) = E [ f(y) d(y)/d'(y) ]
==> Variance is much better
Troubles:
You have to generate y
You have to know (roughly) f
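A minimal C sketch of this idea (my example: x uniform on [0,1], f(x) = x^4, and y drawn with density 5·y^4, which is proportional to f times the density of x, so the reweighted estimator has almost no variance).

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double f(double x) { return pow(x, 4.0); }   /* E f(x) = 1/5 for x uniform on [0,1] */

int main(void) {
    srand48(11);
    int n = 100000;
    double crude = 0.0, is = 0.0;
    for (int i = 0; i < n; i++) {
        /* crude Monte-Carlo: x uniform, density d(x) = 1 */
        double x = drand48();
        crude += f(x);
        /* importance sampling: y with density d'(y) = 5*y^4,
           generated by the inverse CDF y = u^(1/5); keep u > 0 to avoid y = 0 */
        double u = 1.0 - drand48();
        double y = pow(u, 0.2);
        is += f(y) * 1.0 / (5.0 * pow(y, 4.0));   /* f(y) * d(y) / d'(y) */
    }
    printf("crude MC            : %g\n", crude / n);
    printf("importance sampling : %g\n", is / n);
    printf("exact value         : %g\n", 0.2);
    return 0;
}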
79. Very different approaches for
derandomization ?
Stratification (jittering) :
Instead of generating n points i.i.d
Generate
k points in stratum 1
k points in stratum 2
...
k points in stratum m
with m·k = n ==> more stable
==> depends on the choice of strata
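A minimal C sketch of stratification in dimension 1 (my example, again with f(x) = x*x as a stand-in integrand): [0,1] is cut into m equal strata and k points are drawn uniformly inside each stratum.

#include <stdio.h>
#include <stdlib.h>

static double f(double x) { return x * x; }    /* stand-in integrand, exact integral 1/3 */

int main(void) {
    srand48(5);
    int m = 100, k = 10;                       /* m strata, k points per stratum, n = m*k */
    double sum = 0.0;
    for (int s = 0; s < m; s++)                /* stratum s = [s/m, (s+1)/m] */
        for (int j = 0; j < k; j++) {
            double x = (s + drand48()) / m;    /* uniform inside stratum s */
            sum += f(x);
        }
    printf("stratified estimate = %g (exact: 1/3)\n", sum / (m * k));
    return 0;
}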
82. Summary on MC
improvements ?
In many books you will read that quasi-random
points are great.
Remember that people who spend their life
studying quasi-random numbers will rarely
conclude that all this was a bit useless.
Sometimes it's really good.
Sometimes it's similar to random.
Modern Quasi-Monte-Carlo methods
(randomized) are usually at least as good as
random methods ==> no risk.
83. Summary on MC
improvements ?
Carefully designing the model (from data) is
often more important than the randomization.
Typically, neglecting dependencies is often a
disaster.
Yet, there are cases in which improved MC are
the key.
Remarks on random search: dispersion much
better than discrepancy...
84. Biblio (almost all on google)
“Pi-estimation” books for stratification, symmetries, ...
Owen, A.B. "Quasi-Monte Carlo Sampling", A Chapter on
QMC for a SIGGRAPH 2003 course.
Fred J. Hickernell, A generalized discrepancy and
quadrature error bound, 1998
B. Tuffin, On the use of low-discrepancy sequences
in Monte-Carlo methods, 1996
Matousek, Geometric Discrepancy (book, 1999)
these slides : http://www.lri.fr/~teytaud/btr2.pdf
or http://www.lri.fr/~teytaud/btr2.ppt