The document discusses balancing speed and probability of winning in reinforcement learning problems. It presents a game called Decision Snakes and Ladders in which introducing a step-punishment term gives the agent an incentive to terminate faster; tuning that term leads to the policy that maximizes the ratio of winning probability to mean episode length. The approach extends to related problems and to trading off other metrics. Future work includes incorporating robustness to policy and state variations.
1. Winning slow, losing fast, and in between.
Reinaldo A Uribe Muriel
Colorado State University. Prof. C. Anderson
Oita University. Prof. K. Shibata
Universidad de Los Andes. Prof. F. Lozano
February 8, 2010
2. It’s all fun and games until someone proves a theorem.
Outline
1 Fun and games
2 A theorem
3 An algorithm
3. A game: Snakes & Ladders
Board: Crawford & Son, Melbourne, 1901. (Source: http://www.naa.gov.au/)
The player advances the number of steps indicated by a die.
Landing on a snake's mouth sends the player back to the tail.
Landing on a ladder's bottom moves the player forward to the top.
Goal: reaching state 100.
Boring! (No skill required, only luck.)
5. Variation: Decision Snakes and Ladders
Sets of "win" and "loss" terminal states.
Actions: either "advance" or "retreat," to be decided before throwing the die.
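To make the setting concrete, here is a minimal sketch of one such episode in Python. The 20-square board, the snake/ladder positions, and the win/loss squares are made up for illustration; the actual board from the talk is not reproduced.

```python
import random

# Hypothetical 20-square layout: mouth -> tail, ladder bottom -> top.
SNAKES_AND_LADDERS = {16: 4, 3: 11}
WIN_STATES, LOSS_STATES = {20}, {19}     # hypothetical terminal sets

def step(state, action, die):
    """Apply 'advance' (+die) or 'retreat' (-die), then snakes/ladders."""
    nxt = state + die if action == "advance" else state - die
    nxt = max(1, min(20, nxt))
    return SNAKES_AND_LADDERS.get(nxt, nxt)

def play_episode(policy, start=1):
    """Roll until a terminal state is reached; return (won, episode length)."""
    state, t = start, 0
    while state not in WIN_STATES and state not in LOSS_STATES:
        action = policy(state)           # chosen before the die is thrown
        state = step(state, action, random.randint(1, 6))
        t += 1
    return state in WIN_STATES, t

won, length = play_episode(lambda s: "advance")
print(won, length)
```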
6. Reinforcement Learning: Finding the optimal policy.
"Natural" rewards: ±1 on "win"/"loss", 0 otherwise.
The optimal policy maximizes total expected reward.
Dynamic programming quickly finds the optimal policy.
Probability of winning: p_w = 0.97222...
But...
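As an aside, here is a minimal value-iteration sketch of the kind of dynamic-programming solve meant above. The two-state transition model P (states s0, s1) is invented for illustration and is not the Decision Snakes and Ladders model.

```python
# Generic undiscounted value iteration with +1/-1 terminal rewards, 0 otherwise.
# P[s][a] = list of (probability, next_state); terminal values are fixed.
P = {
    "s0": {"advance": [(0.5, "win"), (0.5, "s1")],
           "retreat": [(1.0, "s1")]},
    "s1": {"advance": [(0.5, "win"), (0.5, "loss")],
           "retreat": [(1.0, "s0")]},
}
V = {"s0": 0.0, "s1": 0.0, "win": 1.0, "loss": -1.0}

for _ in range(100):                      # sweep until the values settle
    for s, actions in P.items():
        V[s] = max(sum(p * V[s2] for p, s2 in outcomes)
                   for outcomes in actions.values())

policy = {s: max(actions, key=lambda a: sum(p * V[s2] for p, s2 in actions[a]))
          for s, actions in P.items()}
print(V, policy)
```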
8. Claim:
It is not always desirable to find the optimal policy for that problem.
Hint: mean episode length of the optimal policy, d = 84.58333 steps.
17. A simple, yet powerful idea.
Introduce a step punishment term −r_step so the agent has an incentive to terminate faster.
At time t,
r(t) = +1 − r_step on "win", −1 − r_step on "loss", −r_step otherwise.
Origin: maze rewards, −1 everywhere except on termination.
Problem: r_step = ?
(The cost of staying in the game is usually incommensurable with the terminal rewards.)
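A direct transcription of that reward into code, with r_step left as a free parameter (choosing it is the whole point):

```python
# Step-punished reward from the slide; r_step is a free parameter.
def reward(outcome, r_step):
    """outcome: 'win', 'loss', or None for a non-terminal step."""
    if outcome == "win":
        return 1.0 - r_step
    if outcome == "loss":
        return -1.0 - r_step
    return -r_step
```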
25. Chess: White wins*
Uribe Muriel. Journal of Fabricated Results, Vol. 06, No. 8, 2010.
* in 10^8 ply.
Such a game visits only about the fifth root of the total number of valid states¹, but if a ply takes one second, an average game will last three years and two months.
The result is certainly unlikely, but finding policies of maximum winning probability remains, in fact, the usual goal in RL.
The discount factor γ, used to ensure values are finite, affects episode length, but its effect is unpredictable and suboptimal (for the p_w/d problem).
¹ Shannon, 1950.
29. Main result.
For a general ±1-rewarded problem, there exists an r*_step for which the value-optimal solution maximizes p_w/d and the value of the initial state is −1:
∃ r*_step such that π* = argmax_{π∈Π} v = argmax_{π∈Π} p_w/d, and v(s_0) = v* = −1.
30. Stating the obvious.
Every policy has a mean episode length d ≥ 1 and a probability of winning 0 ≤ p_w ≤ 1.
v = 2·p_w − 1 − r_step·d
(Lemma: extensible to vectors using indicator variables.)
The proof rests on a solid foundation of duh!
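A quick numerical sanity check of v = 2·p_w − 1 − r_step·d, using the p_w and d reported earlier and an arbitrary r_step. The geometric episode-length model is only a simulation convenience, not part of the result.

```python
import random

# Monte Carlo check: outcome ~ Bernoulli(p_w), length with mean d.
p_w, d, r_step = 0.97222, 84.58333, 0.01

def sample_return():
    length = 1
    while random.random() > 1.0 / d:      # geometric length, mean d
        length += 1
    terminal = 1.0 if random.random() < p_w else -1.0
    return terminal - r_step * length

estimate = sum(sample_return() for _ in range(200_000)) / 200_000
print(estimate, 2 * p_w - 1 - r_step * d)  # the two numbers should be close
```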
35. Key substitution.
The w − l space:
w = p_w / d,   l = (1 − p_w) / d.
Each policy is represented by a unique point in the w − l plane.
The policy cloud is bounded by the triangle with vertices (1,0), (0,1), and (0,0).
37. Execution and speed in the w − l space.
Winning probability: p_w = w / (w + l).   Mean episode length: d = 1 / (w + l).
38. Proof outline: value in the w − l space.
v = (w − l − r_step) / (w + l)
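A few lines verifying that the substitution is consistent, i.e. that (w − l − r_step)/(w + l) recovers 2·p_w − 1 − r_step·d for any (p_w, d); the sample numbers are arbitrary.

```python
# Consistency check of the w-l substitution against the direct value formula.
def check(p_w, d, r_step):
    w, l = p_w / d, (1 - p_w) / d
    assert abs(w / (w + l) - p_w) < 1e-9           # p_w = w/(w+l)
    assert abs(1 / (w + l) - d) < 1e-9             # d   = 1/(w+l)
    v_direct = 2 * p_w - 1 - r_step * d
    v_wl = (w - l - r_step) / (w + l)
    assert abs(v_direct - v_wl) < 1e-9
    return v_direct

print(check(0.97222, 84.58333, 0.01))
```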
39. So...
Value (for every r_step), mean episode length, and winning probability level sets are all lines.
All value level sets intersect at the same point, (r_step, −r_step).
There is a one-to-one relationship between values and slopes.
Optimal policies lie on the convex hull of the policy cloud.
40. And done!
π* = argmax_π p_w/d = argmax_π w
(Vertical level sets.) When v_t ≈ −1, we're there.
41. Algorithm
Set ε.
Initialize π_0.
r_step ← 0.
Repeat:
    Find π+, v_π+ (solve from π_0 by any RL method).
    r_step ← r_step+
    π_0 ← π+
Until |v_π+(s_0) + 1| < ε.
On termination, π+ ≈ π*.
r_step+ update, using a learning rate µ > 0:
    r_step+ = r_step + µ [v_π+(s_0) + 1]
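A sketch of this outer loop in Python. The inner `solve` routine stands in for "any RL method" and is assumed, not specified by the talk; `mu` and `eps` are placeholder values.

```python
# Outer loop from the slide: solve the step-punished problem, then nudge
# r_step until the optimal value at the start state is approximately -1.
def tune_r_step(solve, s0, mu=0.05, eps=1e-3, max_iters=1000):
    """solve(r_step, pi0) -> (policy, value_function); returns (policy, r_step)."""
    r_step, pi0 = 0.0, None
    for _ in range(max_iters):
        pi_plus, v = solve(r_step, pi0)          # find pi+, v_pi+ from pi0
        if abs(v[s0] + 1.0) < eps:               # v(s0) ~ -1: done
            return pi_plus, r_step
        r_step += mu * (v[s0] + 1.0)             # learning-rate update
        pi0 = pi_plus                            # warm-start the next solve
    return pi_plus, r_step
```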
43. Optimal r_step update.
Minimize the interval of r_step uncertainty in the next iteration.
This requires solving a min-max problem: either the root of an 8th-degree polynomial in r_step or the zero of the difference of two rational functions of order 4 (easy using the secant method).
O(log(1/ε)) complexity.
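For reference, a generic secant-method routine of the kind meant here. The actual 8th-degree polynomial (or rational-function difference) is not reproduced, so `f` is a placeholder and the example root is unrelated to r_step.

```python
# Generic secant method: find x with f(x) ~ 0 from two initial guesses.
def secant(f, x0, x1, tol=1e-10, max_iters=100):
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iters):
        if abs(f1 - f0) < 1e-15:                 # avoid division by ~0
            break
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)     # secant step
        if abs(x2 - x1) < tol:
            return x2
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    return x1

# Example on a simple polynomial (not the one from the talk):
print(secant(lambda x: x**3 - 2.0, 1.0, 2.0))    # ~ 2 ** (1/3)
```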
44. Extensions.
Problems solvable through a similar method:
Convex (linear) tradeoff: π* = argmax_{π∈Π} { α·p_w − (1 − α)·d }
Greedy tradeoff: π* = argmax_{π∈Π} (2·p_w − 1) / d
Arbitrary tradeoffs: π* = argmax_{π∈Π} (α·p_w − β) / d
Asymmetric rewards: r_win = a, r_loss = −b; a, b ≥ 0.
Games with tie outcomes.
Games with multiple win/loss rewards.
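Written out as plain functions of a policy's (p_w, d) summary, with α and β as user-chosen weights, the tradeoff objectives above look like this:

```python
# The extension objectives from the slide as functions of (p_w, d).
def convex_tradeoff(p_w, d, alpha):
    return alpha * p_w - (1 - alpha) * d

def greedy_tradeoff(p_w, d):
    return (2 * p_w - 1) / d

def arbitrary_tradeoff(p_w, d, alpha, beta):
    return (alpha * p_w - beta) / d

# A policy is then chosen as the argmax of one of these over the policy set,
# e.g. best = max(policies, key=lambda pi: greedy_tradeoff(pi.p_w, pi.d)).
```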
45. Harder family of problems
Maximize the probability of having won before n steps / m episodes.
Why harder? Non-linear level sets / non-convex functions in the w − l space.
46. Outline of future research.
Towards robustness.
Policy variation in tasks with fixed episode length. Inclusion of time as a component of the state space.
Defining policy neighbourhoods:
1. Continuous/discrete statewise action neighbourhoods.
2. Discrete policy neighbourhoods for structured tasks.
3. General policy neighbourhoods.
Feature robustness:
1. Value/speed/execution neighbourhoods in the w − l space.
2. Robustness as a trade-off between features.
Can traditional Reinforcement Learning methods still be used to handle the learning?