The document discusses balancing speed and probability of winning in reinforcement learning problems. It presents a game called Decision Snakes and Ladders in which introducing a step-punishment term gives the agent an incentive to terminate faster; tuning that term leads to the policy that maximizes the ratio of winning probability to mean episode length. The approach extends to related problems and to trading off other metrics. Future work includes incorporating robustness to policy and state variations.
1. Winning slow, losing fast, and in between.
Reinaldo A Uribe Muriel
Colorado State University. Prof. C. Anderson
Oita University. Prof. K. Shibata
Universidad de Los Andes. Prof. F. Lozano
February 8, 2010
2. It’s all fun and games until someone proves a theorem.
Outline
1 Fun and games
2 A theorem
3 An algorithm
3. A game: Snakes & Ladders
Board: Crawford & Son, Melbourne, 1901. (Source: http://www.naa.gov.au/)
The player advances the number of steps indicated by a die.
Landing on a snake's mouth sends the player back to the tail.
Landing on a ladder's bottom moves the player forward to the top.
Goal: reaching state 100.
Boring! (No skill required, only luck.)
5. Variation: Decision Snakes and Ladders
Sets of "win" and "loss" terminal states.
Actions: either "advance" or "retreat," to be decided before throwing the die.
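To make the setting concrete, here is a minimal sketch of one such episode in Python. The 20-square board, the snake/ladder positions, and the win/loss squares are made up for illustration; the actual board from the talk is not reproduced.

```python
import random

# Hypothetical 20-square layout: mouth -> tail, ladder bottom -> top.
SNAKES_AND_LADDERS = {16: 4, 3: 11}
WIN_STATES, LOSS_STATES = {20}, {19}     # hypothetical terminal sets

def step(state, action, die):
    """Apply 'advance' (+die) or 'retreat' (-die), then snakes/ladders."""
    nxt = state + die if action == "advance" else state - die
    nxt = max(1, min(20, nxt))
    return SNAKES_AND_LADDERS.get(nxt, nxt)

def play_episode(policy, start=1):
    """Roll until a terminal state is reached; return (won, episode length)."""
    state, t = start, 0
    while state not in WIN_STATES and state not in LOSS_STATES:
        action = policy(state)           # chosen before the die is thrown
        state = step(state, action, random.randint(1, 6))
        t += 1
    return state in WIN_STATES, t

won, length = play_episode(lambda s: "advance")
print(won, length)
```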
6. Reinforcement Learning: Finding the optimal policy.
"Natural" rewards: ±1 on "win"/"loss", 0 otherwise.
The optimal policy maximizes total expected reward.
Dynamic programming quickly finds the optimal policy.
Probability of winning: p_w = 0.97222...
But...
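As an aside, here is a minimal value-iteration sketch of the kind of dynamic-programming solve meant above. The two-state transition model P (states s0, s1) is invented for illustration and is not the Decision Snakes and Ladders model.

```python
# Generic undiscounted value iteration with +1/-1 terminal rewards, 0 otherwise.
# P[s][a] = list of (probability, next_state); terminal values are fixed.
P = {
    "s0": {"advance": [(0.5, "win"), (0.5, "s1")],
           "retreat": [(1.0, "s1")]},
    "s1": {"advance": [(0.5, "win"), (0.5, "loss")],
           "retreat": [(1.0, "s0")]},
}
V = {"s0": 0.0, "s1": 0.0, "win": 1.0, "loss": -1.0}

for _ in range(100):                      # sweep until the values settle
    for s, actions in P.items():
        V[s] = max(sum(p * V[s2] for p, s2 in outcomes)
                   for outcomes in actions.values())

policy = {s: max(actions, key=lambda a: sum(p * V[s2] for p, s2 in actions[a]))
          for s, actions in P.items()}
print(V, policy)
```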
8. Claim:
It is not always desirable to find the optimal policy for that problem.
Hint: mean episode length of the optimal policy, d = 84.58333 steps.
17. A simple, yet powerful idea.
Introduce a step punishment term −r_step so the agent has an incentive to terminate faster.
At time t,
r(t) = +1 − r_step on "win", −1 − r_step on "loss", −r_step otherwise.
Origin: maze rewards, −1 everywhere except on termination.
Problem: r_step = ?
(The cost of staying in the game is usually incommensurable with the terminal rewards.)
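A direct transcription of that reward into code, with r_step left as a free parameter (choosing it is the whole point):

```python
# Step-punished reward from the slide; r_step is a free parameter.
def reward(outcome, r_step):
    """outcome: 'win', 'loss', or None for a non-terminal step."""
    if outcome == "win":
        return 1.0 - r_step
    if outcome == "loss":
        return -1.0 - r_step
    return -r_step
```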
25. Chess: White wins*
Uribe Muriel. Journal of Fabricated Results, Vol. 06, No. 8, 2010.
* in 10^8 ply.
Such a game visits only about the fifth root of the total number of valid states¹, but if a ply takes one second, an average game will last three years and two months.
The result is certainly unlikely, but finding policies of maximum winning probability remains, in fact, the usual goal in RL.
The discount factor γ, used to ensure values are finite, affects episode length, but its effect is unpredictable and suboptimal (for the p_w/d problem).
¹ Shannon, 1950.
29. Main result.
For a general ±1-rewarded problem, there exists an r*_step for which the value-optimal solution maximizes p_w/d and the value of the initial state is −1:
∃ r*_step such that π* = argmax_{π∈Π} v = argmax_{π∈Π} p_w/d, and v(s_0) = v* = −1.
30. Stating the obvious.
Every policy has a mean episode length d ≥ 1 and a probability of winning 0 ≤ p_w ≤ 1.
v = 2·p_w − 1 − r_step·d
(Lemma: extensible to vectors using indicator variables.)
The proof rests on a solid foundation of duh!
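A quick numerical sanity check of v = 2·p_w − 1 − r_step·d, using the p_w and d reported earlier and an arbitrary r_step. The geometric episode-length model is only a simulation convenience, not part of the result.

```python
import random

# Monte Carlo check: outcome ~ Bernoulli(p_w), length with mean d.
p_w, d, r_step = 0.97222, 84.58333, 0.01

def sample_return():
    length = 1
    while random.random() > 1.0 / d:      # geometric length, mean d
        length += 1
    terminal = 1.0 if random.random() < p_w else -1.0
    return terminal - r_step * length

estimate = sum(sample_return() for _ in range(200_000)) / 200_000
print(estimate, 2 * p_w - 1 - r_step * d)  # the two numbers should be close
```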
35. Key substitution.
The w − l space:
w = p_w / d,   l = (1 − p_w) / d.
Each policy is represented by a unique point in the w − l plane.
The policy cloud is bounded by the triangle with vertices (1,0), (0,1), and (0,0).
37. Execution and speed in the w − l space.
Winning probability: p_w = w / (w + l).   Mean episode length: d = 1 / (w + l).
38. Proof outline: value in the w − l space.
v = (w − l − r_step) / (w + l)
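A few lines verifying that the substitution is consistent, i.e. that (w − l − r_step)/(w + l) recovers 2·p_w − 1 − r_step·d for any (p_w, d); the sample numbers are arbitrary.

```python
# Consistency check of the w-l substitution against the direct value formula.
def check(p_w, d, r_step):
    w, l = p_w / d, (1 - p_w) / d
    assert abs(w / (w + l) - p_w) < 1e-9           # p_w = w/(w+l)
    assert abs(1 / (w + l) - d) < 1e-9             # d   = 1/(w+l)
    v_direct = 2 * p_w - 1 - r_step * d
    v_wl = (w - l - r_step) / (w + l)
    assert abs(v_direct - v_wl) < 1e-9
    return v_direct

print(check(0.97222, 84.58333, 0.01))
```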
39. So...
Value (for every r_step), mean episode length, and winning probability level sets are all lines.
All value level sets intersect at the same point, (r_step, −r_step).
There is a one-to-one relationship between values and slopes.
Optimal policies lie on the convex hull of the policy cloud.
40. And done!
π* = argmax_π p_w/d = argmax_π w
(Vertical level sets.) When v_t ≈ −1, we're there.
41. Algorithm
Set ε.
Initialize π_0.
r_step ← 0.
Repeat:
    Find π+, v_π+ (solve from π_0 by any RL method).
    r_step ← r_step+
    π_0 ← π+
Until |v_π+(s_0) + 1| < ε.
On termination, π+ ≈ π*.
r_step+ update, using a learning rate µ > 0:
    r_step+ = r_step + µ [v_π+(s_0) + 1]
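A sketch of this outer loop in Python. The inner `solve` routine stands in for "any RL method" and is assumed, not specified by the talk; `mu` and `eps` are placeholder values.

```python
# Outer loop from the slide: solve the step-punished problem, then nudge
# r_step until the optimal value at the start state is approximately -1.
def tune_r_step(solve, s0, mu=0.05, eps=1e-3, max_iters=1000):
    """solve(r_step, pi0) -> (policy, value_function); returns (policy, r_step)."""
    r_step, pi0 = 0.0, None
    for _ in range(max_iters):
        pi_plus, v = solve(r_step, pi0)          # find pi+, v_pi+ from pi0
        if abs(v[s0] + 1.0) < eps:               # v(s0) ~ -1: done
            return pi_plus, r_step
        r_step += mu * (v[s0] + 1.0)             # learning-rate update
        pi0 = pi_plus                            # warm-start the next solve
    return pi_plus, r_step
```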
43. Optimal r_step update.
Minimize the interval of r_step uncertainty in the next iteration.
This requires solving a min-max problem: either the root of an 8th-degree polynomial in r_step or the zero of the difference of two rational functions of order 4 (easy using the secant method).
O(log(1/ε)) complexity.
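For reference, a generic secant-method routine of the kind meant here. The actual 8th-degree polynomial (or rational-function difference) is not reproduced, so `f` is a placeholder and the example root is unrelated to r_step.

```python
# Generic secant method: find x with f(x) ~ 0 from two initial guesses.
def secant(f, x0, x1, tol=1e-10, max_iters=100):
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iters):
        if abs(f1 - f0) < 1e-15:                 # avoid division by ~0
            break
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)     # secant step
        if abs(x2 - x1) < tol:
            return x2
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    return x1

# Example on a simple polynomial (not the one from the talk):
print(secant(lambda x: x**3 - 2.0, 1.0, 2.0))    # ~ 2 ** (1/3)
```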
44. Extensions.
Problems solvable through a similar method:
Convex (linear) tradeoff: π* = argmax_{π∈Π} { α·p_w − (1 − α)·d }
Greedy tradeoff: π* = argmax_{π∈Π} (2·p_w − 1) / d
Arbitrary tradeoffs: π* = argmax_{π∈Π} (α·p_w − β) / d
Asymmetric rewards: r_win = a, r_loss = −b; a, b ≥ 0.
Games with tie outcomes.
Games with multiple win/loss rewards.
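Written out as plain functions of a policy's (p_w, d) summary, with α and β as user-chosen weights, the tradeoff objectives above look like this:

```python
# The extension objectives from the slide as functions of (p_w, d).
def convex_tradeoff(p_w, d, alpha):
    return alpha * p_w - (1 - alpha) * d

def greedy_tradeoff(p_w, d):
    return (2 * p_w - 1) / d

def arbitrary_tradeoff(p_w, d, alpha, beta):
    return (alpha * p_w - beta) / d

# A policy is then chosen as the argmax of one of these over the policy set,
# e.g. best = max(policies, key=lambda pi: greedy_tradeoff(pi.p_w, pi.d)).
```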
45. Harder family of problems
Maximize the probability of having won before n steps / m episodes.
Why harder? Non-linear level sets / non-convex functions in the w − l space.
46. Outline of future research.
Towards robustness.
Policy variation in tasks with fixed episode length. Inclusion of time as a component of the state space.
Defining policy neighbourhoods:
1. Continuous/discrete statewise action neighbourhoods.
2. Discrete policy neighbourhoods for structured tasks.
3. General policy neighbourhoods.
Feature robustness:
1. Value/speed/execution neighbourhoods in the w − l space.
2. Robustness as a trade-off between features.
Can traditional Reinforcement Learning methods still be used to handle the learning?