Winning slow, losing fast, and in between.

             Reinaldo A Uribe Muriel

       Colorado State University. Prof. C. Anderson
             Oita University. Prof. K. Shibata
        Universidad de Los Andes. Prof. F. Lozano


                  February 8, 2010
It’s all fun and games until someone proves a theorem.
Outline




     1    Fun and games
     2    A theorem
     3    An algorithm
A game: Snakes & Ladders
Board: Crawford & Son, Melbourne, 1901. (Source: http://www.naa.gov.au/)

    Player advances the number of steps indicated by a die.
    Landing on a snake’s mouth sends the player back to the tail.
    Landing on a ladder’s bottom moves the player forward to the top.
    Goal: reaching state 100.

    Boring! (No skill required, only luck.)
Variation: Decision Snakes and Ladders

    Sets of “win” and “loss” terminal states.
    Actions: either “advance” or “retreat,” to be decided before throwing the die.
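
A minimal simulation sketch of the decision variant, to make the mechanics concrete. Everything specific here is a placeholder: the snake and ladder positions, the win/loss sets, and the function names are hypothetical, not the Crawford & Son board or the task used in the slides.

```python
import random

# Hypothetical layout: placeholder positions, not the actual board.
SNAKES  = {17: 4, 54: 19, 62: 18, 98: 79}   # mouth -> tail
LADDERS = {3: 22, 11: 49, 28: 76, 80: 99}   # bottom -> top
WIN, LOSS = {100}, {63}                      # assumed terminal-state sets

def step(state, action, die):
    """One move: action (+1 advance, -1 retreat) is chosen before the throw."""
    nxt = max(1, min(100, state + action * die))  # clamp to the board (a simplification)
    nxt = SNAKES.get(nxt, nxt)                    # snake mouth sends the player to the tail
    nxt = LADDERS.get(nxt, nxt)                   # ladder bottom lifts the player to the top
    return nxt, (nxt in WIN or nxt in LOSS)

def play_episode(policy, start=1, max_steps=10_000, rng=random):
    """Roll out one episode; policy maps state -> +1 or -1. Returns (won, length)."""
    state = start
    for t in range(1, max_steps + 1):
        state, done = step(state, policy(state), rng.randint(1, 6))
        if done:
            return state in WIN, t
    return False, max_steps                       # guard against non-terminating policies

# Example: the "always advance" policy.
won, length = play_episode(lambda s: +1)
```
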
Reinforcement Learning: Finding the optimal policy.

    “Natural” rewards: ±1 on “win”/“loss”, 0 otherwise.
    The optimal policy maximizes total expected reward.
    Dynamic programming quickly finds the optimal policy.
    Probability of winning: p_w = 0.97222...
    But...
Claim:

    It is not always desirable to find the optimal policy for that problem.

    Hint: the mean episode length of the optimal policy is d = 84.58333 steps.
Optimal policy revisited.

    Seek winning.
    Avoid losing.
    Stay safe.
A simple, yet powerful idea.

    Introduce a step-punishment term −r_step so the agent has an incentive to terminate faster.
    At time t,

$$
r(t) = \begin{cases}
+1 - r_{\text{step}} & \text{``win''} \\
-1 - r_{\text{step}} & \text{``loss''} \\
\;\;\;\;\; - r_{\text{step}} & \text{otherwise}
\end{cases}
$$

    Origin: maze rewards, −1 except on termination.
    Problem: r_step = ?
    (i.e., the cost of staying in the game is usually incommensurable with the terminal rewards.)
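
A direct transcription of that reward into code, as a sketch; `win_states` and `loss_states` stand for whatever terminal sets the task defines (assumed names, reusing the placeholder sets from the simulation above).

```python
def shaped_reward(next_state, r_step, win_states=WIN, loss_states=LOSS):
    """Terminal reward of ±1, minus the step punishment r_step on every step."""
    if next_state in win_states:
        return 1.0 - r_step
    if next_state in loss_states:
        return -1.0 - r_step
    return -r_step
```
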
Better than optimal?

    Optimal policy for r_step = 0.
Better than optimal?

    Optimal policy for r_step = 0.08701:
    p_w = 0.48673 (50.06% of the former 0.97222)
    d = 11.17627 (13.21% of the former 84.58333)

    This policy maximizes p_w / d.
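
A quick check with the slide’s own numbers (the arithmetic is mine, not on the slide) shows how much the ratio improves:

$$
\frac{p_w}{d}\bigg|_{\,r_{\text{step}}=0.08701} = \frac{0.48673}{11.17627} \approx 0.0436
\qquad\text{vs.}\qquad
\frac{p_w}{d}\bigg|_{\,r_{\text{step}}=0} = \frac{0.97222}{84.58333} \approx 0.0115,
$$

roughly a 3.8× higher winning probability per expected step.
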
Chess: White wins∗
Uribe Muriel. Journal of Fabricated Results, Vol. 06, No. 8, 2010

    ∗ in 10^8 ply.

    Such a game visits only about the fifth root of the total number of valid states¹, but, if a ply takes one second, an average game will last three years and two months.

    Certainly unlikely to be the case, but in fact finding policies of maximum winning probability remains the usual goal in RL.

    The discount factor γ, used to ensure values are finite, has an effect on episode length, but that effect is unpredictable and suboptimal (for the p_w/d problem).

    ¹ Shannon, 1950.
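
As a sanity check on those figures (my arithmetic, assuming one second per ply as the slide does):

$$
10^8 \text{ s} \;=\; \frac{10^8}{86\,400 \times 365.25}\ \text{years} \;\approx\; 3.17\ \text{years} \;\approx\; 3\ \text{years, }2\ \text{months}.
$$
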
Main result.

    For a general ±1-rewarded problem, there exists an r*_step for which the value-optimal solution maximizes p_w/d and the value of the initial state is −1:

$$
\exists\, r^*_{\text{step}} \;\Big|\;
\pi^* = \operatorname*{argmax}_{\pi\in\Pi} v = \operatorname*{argmax}_{\pi\in\Pi} \frac{p_w}{d},
\qquad v^*(s_0) = -1 .
$$
Stating the obvious.

    Every policy has a mean episode length d ≥ 1 and a probability of winning 0 ≤ p_w ≤ 1.

$$
v = 2p_w - 1 - r_{\text{step}}\, d
$$

    (Lemma: extensible to vectors using indicator variables.)

    The proof rests on a solid foundation of duh!
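
Spelled out (this derivation is mine, but it only takes expectations of the rewards defined earlier): the terminal reward contributes +1 with probability p_w and −1 otherwise, and the step punishment is paid once per step, d times on average,

$$
v = p_w(+1) + (1 - p_w)(-1) - r_{\text{step}}\,\mathbb{E}[\text{episode length}]
  = 2p_w - 1 - r_{\text{step}}\, d .
$$
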
Key substitution.
The w − l space

$$
w = \frac{p_w}{d}, \qquad l = \frac{1 - p_w}{d}
$$

    Each policy is represented by a unique point in the w − l plane.
    The policy cloud is limited by the triangle with vertices (1,0), (0,1), and (0,0).
Execution and speed in the w − l space.

$$
\text{Winning probability: } p_w = \frac{w}{w+l},
\qquad
\text{Mean episode length: } d = \frac{1}{w+l}
$$
Proof Outline - Value in the w − l space.

$$
v = \frac{w - l - r_{\text{step}}}{w + l}
$$
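
Filling in the algebra behind that expression (my step, consistent with the definitions above): since w + l = 1/d and w − l = (2p_w − 1)/d,

$$
v = 2p_w - 1 - r_{\text{step}}\, d
  = \big[(w - l) - r_{\text{step}}\big]\, d
  = \frac{w - l - r_{\text{step}}}{w + l}.
$$
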
So...

    All level sets intersect at the same point, (r_step, −r_step).
    There is a one-to-one relationship between values and slopes.
    Value (for all r_step), mean episode length, and winning probability level sets are lines.
    Optimal policies lie on the convex hull of the policy cloud.
And done!

$$
\pi^* = \operatorname*{argmax}_{\pi} \frac{p_w}{d} = \operatorname*{argmax}_{\pi}\, w
$$

    (Vertical level sets.) When v_t ≈ −1, we’re there.
Algorithm

    Set ε
    Initialize π0
    r_step ← 0
    Repeat:
        Find π+, v^{π+} (solve from π0 by any RL method)
        Update r_step (e.g., with the rule below)
        π0 ← π+
    Until |v^{π+}(s0) + 1| < ε

    On termination, π+ ≈ π*.
    A simple r_step update uses a learning rate μ > 0:

$$
r_{\text{step}} \leftarrow r_{\text{step}} + \mu\,\big[v^{\pi^+}(s_0) + 1\big]
$$
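
A minimal sketch of that loop in code, assuming a generic `solve(r_step, pi0)` routine (any RL or DP solver) that returns the value-optimal policy for the given step punishment, warm-started from `pi0`, together with the value of the initial state. The function and parameter names are placeholders, not from the slides.

```python
def find_ratio_optimal_policy(solve, mu=0.05, eps=1e-4, max_iters=500):
    """Iterate r_step until v(s0) ≈ -1; the returned policy then
    (approximately) maximizes p_w / d."""
    r_step, pi0 = 0.0, None
    for _ in range(max_iters):
        pi_plus, v_s0 = solve(r_step, pi0)   # value-optimal policy for this r_step
        if abs(v_s0 + 1.0) < eps:            # v(s0) ≈ -1: done
            break
        r_step += mu * (v_s0 + 1.0)          # learning-rate update from the slide
        pi0 = pi_plus                        # warm start the next solve
    return pi_plus, r_step
```
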
Optimal r_step update.

    Minimizes the interval of r_step uncertainty in the next iteration.
    Requires solving a minmax problem: either the root of an 8th-degree polynomial in r_step or the zero of the difference of two rational functions of order 4. (Easy using the secant method.)
    O(log(1/ε)) complexity.
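
The slides do not give the polynomial itself, so the following is only a generic secant iteration of the kind that root-finding step would use; `f`, the starting guesses, and the tolerances are all placeholders.

```python
def secant_root(f, x0, x1, tol=1e-10, max_iters=100):
    """Find a zero of a scalar function f by the secant method."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iters):
        if abs(f1 - f0) < 1e-15:              # secant nearly flat: give up
            break
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)  # secant step
        if abs(x2 - x1) < tol:
            return x2
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    return x1
```
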
Extensions.

    Problems solvable through a similar method:
        Convex (linear) tradeoff:  π* = argmax_{π∈Π} { α p_w − (1 − α) d }
        Greedy tradeoff:           π* = argmax_{π∈Π} (2p_w − 1)/d
        Arbitrary tradeoffs:       π* = argmax_{π∈Π} (α p_w − β)/d
        Asymmetric rewards:        r_win = a, r_loss = −b;  a, b ≥ 0
        Games with tie outcomes.
        Games with multiple win / loss rewards.
Harder family of problems




  Maximize the probability of having won before n steps / m
  episodes.


  Why? Non-linear level sets / non-convex functions in the w − l
  space.
Outline of future research.
Towards robustness.

    Policy variation in tasks with fixed episode length. Inclusion of time as a component of the state space.
    Defining policy neighbourhoods:
        1. Continuous/discrete statewise action neighbourhoods.
        2. Discrete policy neighbourhoods for structured tasks.
        3. General policy neighbourhoods.
    Feature-robustness:
        1. Value/Speed/Execution neighbourhoods in the w − l space.
        2. Robustness as a trading off of features.
    Can traditional Reinforcement Learning methods still be used to handle the learning?
Thank you.
muriel@cs.colostate.edu - r-uribe@uniandes.edu.co




        Untitled by Li Wei, School of Design, Oita University, 2009.
