Optimal Nudging
A new approach to solving SMDPs
Reinaldo Uribe M
Universidad de los Andes — Oita University
Colorado State University

Nov. 11, 2013
Snakes & Ladders

The player advances the number of steps indicated by a die.
Landing on a snake's mouth sends the player back to the tail.
Landing on a ladder's bottom moves the player forward to the top.
Goal: reaching state 100.

Boring! (No skill required, only luck.)
Variation: Decision Snakes and Ladders

Sets of "win" and "loss" terminal states.
Actions: either "advance" or "go back," to be decided before throwing the die.
Reinforcement Learning: finding an optimal policy.

"Natural" rewards: +1 on "win", −1 on "lose", 0 otherwise.
The optimal policy maximizes total expected reward.
Dynamic programming quickly finds the optimal policy (a generic sketch follows).
Probability of winning: pw = 0.97222...
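A minimal value-iteration sketch of the dynamic-programming step, under stated assumptions: the actual board is not reproduced here, so `transitions` is a hypothetical model mapping each state and action ("advance" or "go back") to (probability, next state, reward) triples, with the ±1/0 reward scheme above.

def value_iteration(transitions, terminal_states, tol=1e-9):
    # transitions[s][a] -> list of (probability, next_state, reward) triples
    states = set(transitions) | set(terminal_states)
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in transitions:
            if s in terminal_states:
                continue
            best = max(
                sum(p * (r + V[s2]) for p, s2, r in transitions[s][a])
                for a in transitions[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy: in each state, pick the action with the highest expected return.
    policy = {
        s: max(transitions[s],
               key=lambda a: sum(p * (r + V[s2]) for p, s2, r in transitions[s][a]))
        for s in transitions if s not in terminal_states
    }
    return V, policy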
We know a lot!

Markov Decision Process: states, actions, transition probabilities, rewards.
Policies and policy value.
Maximum winning probability = maximum expected earnings.
Taking an action has a cost (in units different from the rewards).
Different actions may have different costs.
Semi-Markov model with average rewards (a minimal model sketch follows).
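A minimal, purely illustrative container for the ingredients just listed (the names and layout are assumptions, not the deck's notation); the point is simply that reward and cost live in different units.

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class SMDP:
    states: List[str]
    actions: Dict[str, List[str]]                                # actions available in each state
    transition: Dict[Tuple[str, str], List[Tuple[float, str]]]   # (s, a) -> [(probability, next state)]
    reward: Dict[Tuple[str, str], float]                         # r(s, a), e.g. winnings
    cost: Dict[Tuple[str, str], float]                           # c(s, a), e.g. time spent, in other units

def empirical_gain(smdp, visited):
    # Gain of a trajectory: total reward divided by total cost.
    return sum(smdp.reward[sa] for sa in visited) / sum(smdp.cost[sa] for sa in visited)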
Better than optimal?

(Old optimal policy)
Better than optimal?

(Optimal policy) with average reward ρ = 0.08701.
pw = 0.48673 (was 0.97222; 50.06% of the original value).
d = 11.17627 (was 84.58333; 13.21% of the original value).
This policy maximizes pw / d.
So, how are average-reward optimal policies found?

Algorithm 1 Generic SMDP solver
  Initialize
  repeat forever
    Act
    Do RL to find the value of the current π
    Update ρ

Usually 1-step Q-learning. The average-adjusted Q-learning update:

Q_{t+1}(s_t, a_t) ← (1 − γ_t) Q_t(s_t, a_t) + γ_t [ r_{t+1} − ρ_t c_{t+1} + max_a Q_t(s_{t+1}, a) ]
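A minimal sketch of one inner step of Algorithm 1 using this update. The environment interface (env.actions, env.step returning next state, reward, and cost), the epsilon-greedy exploration, and the step sizes are illustrative assumptions; Q is assumed to be a defaultdict(float).

import random

def average_adjusted_q_step(env, Q, rho, s, gamma_t=0.1, epsilon=0.1):
    # Act: epsilon-greedy with respect to the current Q estimates.
    actions = env.actions(s)
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda b: Q[(s, b)])
    s_next, r, c = env.step(s, a)   # observed reward r and cost c of the transition
    # Average-adjusted 1-step Q-learning update, as in the equation above.
    target = r - rho * c + max(Q[(s_next, b)] for b in env.actions(s_next))
    Q[(s, a)] = (1.0 - gamma_t) * Q[(s, a)] + gamma_t * target
    return s_next   # rho itself is updated separately (see the gain updates below)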
Generic Learning Algorithm
Table of algorithms: ARRL (average-reward RL).

Algorithm | Reference | Gain update
AAC | Jalali and Ferguson 1989 | ρ_{t+1} ← ( Σ_{i=0..t} r(s_i, π^i(s_i)) ) / (t + 1)
R-Learning | Schwartz 1993 | ρ_{t+1} ← (1 − α) ρ_t + α [ r_{t+1} + max_a Q_t(s_{t+1}, a) − max_a Q_t(s_t, a) ]
SSP Q-Learning | Abounadi et al. 2001 | ρ_{t+1} ← ρ_t + α_t min_a Q_t(ŝ, a)
H-Learning | Tadepalli and Ok 1998 | ρ_{t+1} ← (1 − α_t) ρ_t + α_t [ r_{t+1} − H_t(s_t) + H_t(s_{t+1}) ],  with α_{t+1} ← α_t / (α_t + 1)
HAR | Ghavamzadeh and Mahadevan 2007 | ρ_{t+1} ← ( Σ_{i=0..t} r(s_i, π^i(s_i)) ) / (t + 1)
Generic Learning Algorithm
Table of algorithms: SMDP RL.

Algorithm | Reference | Gain update
SMART | Das et al. 1999 | ρ_{t+1} ← ( Σ_{i=0..t} r(s_i, π^i(s_i)) ) / ( Σ_{i=0..t} c(s_i, π^i(s_i)) )
MAX-Q | Ghavamzadeh and Mahadevan 2001 | ρ_{t+1} ← ( Σ_{i=0..t} r(s_i, π^i(s_i)) ) / ( Σ_{i=0..t} c(s_i, π^i(s_i)) )
Nudging

Algorithm 2 Nudged Learning
  Initialize (π, ρ, Q)
  repeat
    Set the reward scheme to (r − ρc).
    Solve by any RL method.
    Update ρ
  until Q^π(s_I) = 0

Note: 'by any RL method' refers to a well-studied problem for which better algorithms (both practical and with theoretical guarantees) exist.
ρ can (and will) be updated optimally. (A minimal sketch of the outer loop follows.)
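A minimal sketch of the outer loop, under stated assumptions: solve_rl stands in for any RL routine that, for a fixed ρ, approximately solves the task with reward scheme (r − ρc) and reports the resulting policy, its nudged value at the start state, and its average reward and cost. The simple ratio update for ρ shown here is only an illustrative stand-in for the optimal update discussed later.

def nudged_learning(solve_rl, rho=0.0, tol=1e-6, max_iters=100):
    policy = None
    for _ in range(max_iters):
        # Solve the nudged task (reward scheme r - rho * c) by any RL method.
        policy, q_start, avg_reward, avg_cost = solve_rl(rho)
        if abs(q_start) < tol:            # until Q^pi(s_I) = 0
            break
        rho = avg_reward / avg_cost       # illustrative ratio update, not the optimal one
    return policy, rho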
The w − l space.
Definition

(Policy π has expected average reward v^π and expected average cost c^π. Let D be a bound on the absolute value of v^π.)

w^π = (D + v^π) / (2c^π),    l^π = (D − v^π) / (2c^π)

[Figure: a cloud of sample policies plotted in the w − l plane; w on the horizontal axis and l on the vertical axis, both bounded by D. A small worked example follows.]
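A one-line computation of the coordinates, straight from the definition above (the numbers in the usage line are made up):

def to_wl(v, c, D):
    # v: expected average reward of the policy, c: expected average cost, D: bound on |v|
    return (D + v) / (2.0 * c), (D - v) / (2.0 * c)

# e.g. to_wl(v=0.5, c=2.0, D=1.0) returns (0.375, 0.125)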
The w − l space.
Value and Cost

(Policy π has expected average reward v^π and expected average cost c^π. Let D be a bound on the absolute value of v^π.)

w^π = (D + v^π) / (2c^π),    l^π = (D − v^π) / (2c^π)

[Figure: two panels in the w − l plane, one with level lines of policy value (at −D, −0.5D, 0, 0.5D, D) and one with level lines of policy cost (at 1, 2, 4, 8); w and l axes bounded by D. The short derivation below shows why these level sets are straight lines.]
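The straight-line level sets follow directly from the definition. Adding and subtracting the two coordinates gives

w^π + l^π = D / c^π,    w^π − l^π = v^π / c^π,

so that

c^π = D / (w^π + l^π),    v^π = D (w^π − l^π) / (w^π + l^π).

Hence all policies with the same cost c lie on the line w + l = D/c, and all policies with the same value v lie on the line through the origin (D − v) w = (D + v) l.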
The w − l space.
Nudged value

(Policy π has expected average reward v^π and expected average cost c^π. Let D be a bound on the absolute value of v^π.)

w^π = (D + v^π) / (2c^π),    l^π = (D − v^π) / (2c^π)

[Figure: the cloud of sample policies in the w − l plane, overlaid with level lines of the nudged value (including −D/2, 0, D/2); w and l axes bounded by D.]
The w − l space.
As a projective transformation.

[Figure: the (Episode Length, Policy Value) plane, with value ranging from −D to D and length starting at 1, and its image in the w − l plane under the transformation; w and l axes bounded by D.]
Sample task: two states, continuous actions

State s1: action a1 ∈ [0, 1], reward r1 = 1 + (a1 − 0.5)^2, cost c1 = 1 + a1.
State s2: action a2 ∈ [0, 1], reward r2 = 1 + a2, cost c2 = 1 + (a2 − 0.5)^2.
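A minimal sketch of evaluating one deterministic policy (a1, a2), assuming, purely for illustration (the slide does not state the dynamics), that the process simply alternates between s1 and s2:

def evaluate(a1, a2):
    r1, c1 = 1 + (a1 - 0.5) ** 2, 1 + a1
    r2, c2 = 1 + a2, 1 + (a2 - 0.5) ** 2
    v = (r1 + r2) / 2.0                   # expected average reward per visit
    c = (c1 + c2) / 2.0                   # expected average cost per visit
    rho = (r1 + r2) / (c1 + c2)           # gain: reward earned per unit of cost
    return v, c, rho

# e.g. evaluate(0.0, 1.0) returns v = 1.625, c = 1.125, rho ≈ 1.444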
Sample task: two states, continuous actions
Policy Space (Actions)

[Figure: the policy space is the unit square, with a1 on the horizontal axis and a2 on the vertical axis, both in [0, 1].]
Sample task: two states, continuous actions
Policy Values and Costs

[Figure: two surface plots over the (a1, a2) policy square, one of policy value and one of policy cost; both vertical axes run up to 4.]
Sample task: two states, continuous actions
Policy Manifold in w − l

[Figure: the image of the policy square in the w − l plane, a two-dimensional manifold; w and l axes shown up to D/2. A sketch of how such a picture could be generated follows.]
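A minimal sketch of generating the manifold, reusing evaluate() and to_wl() from the sketches above and assuming D = 4 (any bound on the absolute value of the policy values works):

def policy_manifold(n=50, D=4.0):
    points = []
    for i in range(n + 1):
        for j in range(n + 1):
            a1, a2 = i / n, j / n
            v, c, _ = evaluate(a1, a2)
            points.append(to_wl(v, c, D))
    return points   # (w, l) pairs tracing the image of the policy square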
And the rest...

Neat geometry: the relevant problems become linear in w − l.
This is easily exploited using straightforward algebra and calculus.
Updating the average reward between iterations can be optimized:
it becomes finding the (or rather an) intersection between two conics,
which can be solved in O(1) time.
In the worst case the remaining uncertainty is cut in half at each iteration; typically the reduction is much better.
Little extra complexity is added to methods that are already PAC.
Thank you.
r-uribe@uniandes.edu.co

Untitled by Li Wei, School of Design, Oita University, 2009.
