1. Optimal Nudging
A new approach to solving SMDPs
Reinaldo Uribe M
Universidad de los Andes — Oita University
Colorado State University
Nov. 11, 2013
3. Snakes & Ladders
The player advances the number of steps indicated by a die. Boring! (No skill required, only luck.)
Landing on a snake’s mouth sends the player back to the tail.
Landing on a ladder’s bottom moves the player forward to the top.
Goal: reaching state 100.
4. Variation: Decision Snakes and Ladders
Sets of “win” and “loss” terminal states.
Actions: either “advance” or “go back,” to be decided before throwing the die.
5. Reinforcement Learning: Finding an optimal policy.
“Natural” rewards: ±1 on “win”/“lose”, 0 otherwise.
The optimal policy maximizes total expected reward.
Dynamic programming quickly finds the optimal policy (a sketch follows below).
Probability of winning: pw = 0.97222...
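As a concrete illustration of that last point, here is a hedged, generic value-iteration sketch (not the talk's code; the board layout and the transition structure `P` are assumptions of mine):

```python
import numpy as np

def value_iteration(P, n_states, gamma=1.0, tol=1e-10):
    """Return state values and a greedy policy maximizing expected reward.

    P[s] maps each action a to a list of (prob, next_state, reward) tuples;
    terminal states have an empty P[s].
    """
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            if not P[s]:                      # terminal state: value stays 0
                continue
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in P[s]}
            best = max(q.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                             for p, s2, r in P[s][a]))
              for s in range(n_states) if P[s]}
    return V, policy
```

With ±1 terminal rewards and no discounting, the value of the start state equals 2pw − 1, so maximizing expected reward is the same as maximizing the winning probability pw.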
11. We know a lot!
Markov Decision Process: States, Actions, Transition Probabilities, Rewards.
Policies and policy value.
Max winning probability = max earnings.
Taking an action costs (in units different from rewards).
Different actions may have different costs.
Semi-Markov model with average rewards.
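For concreteness, the average reward (gain) of a policy π in such a semi-Markov setting is standardly written as the ratio of expected reward to expected cost per step; this is a textbook formulation, not necessarily the talk's exact notation:

$$\rho^{\pi} \;=\; \lim_{N\to\infty} \frac{\mathbb{E}\!\left[\sum_{t=0}^{N-1} r(s_t, \pi(s_t))\right]}{\mathbb{E}\!\left[\sum_{t=0}^{N-1} c(s_t, \pi(s_t))\right]}$$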
15. Better than optimal?
(Optimal policy) with average reward ρ = 0.08701:
pw = 0.48673 (was 0.97222; 50.06% of the original)
d = 11.17627 (was 84.58333; 13.21% of the original)
This policy maximizes pw / d.
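A quick check of the pw/d ratios with the numbers above:

$$\left.\frac{p_w}{d}\right|_{\text{new}} = \frac{0.48673}{11.17627} \approx 0.0436
\qquad\text{vs.}\qquad
\left.\frac{p_w}{d}\right|_{\text{old}} = \frac{0.97222}{84.58333} \approx 0.0115$$

So the new policy wins roughly 3.8 times more often per unit of expected game length.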
16. So, how are average-reward optimal policies found?
Algorithm 1 Generic SMDP solver
Initialize
repeat forever
Act
Do RL to find value of current π
Update ρ.
Usually 1-step Q-learning
Average-adjusted Q-learning:
$Q_{t+1}(s_t, a_t) \leftarrow (1 - \gamma_t)\, Q_t(s_t, a_t) + \gamma_t \left[\, r_{t+1} - \rho_t c_{t+1} + \max_a Q_t(s_{t+1}, a) \,\right]$
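A minimal sketch of this update in code (assuming a tabular Q indexed as Q[s][a]; the names are mine, not the talk's):

```python
import numpy as np

def avg_adjusted_q_step(Q, s, a, r, c, s_next, rho, gamma_t):
    """One average-adjusted Q-learning step on a tabular Q.

    gamma_t is the step size of the slide's update, not a discount factor.
    """
    target = r - rho * c + np.max(Q[s_next])
    Q[s][a] = (1.0 - gamma_t) * Q[s][a] + gamma_t * target
```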
17. Generic Learning Algorithm
Table of average-reward RL (ARRL) algorithms and their gain updates:

AAC (Jalali and Ferguson 1989): $\rho_{t+1} \leftarrow \frac{1}{t+1} \sum_{i=0}^{t} r(s_i, \pi_i(s_i))$
R-Learning (Schwartz 1993): $\rho_{t+1} \leftarrow (1 - \alpha)\rho_t + \alpha \left[\, r_{t+1} + \max_a Q_t(s_{t+1}, a) - \max_a Q_t(s_t, a) \,\right]$
H-Learning (Tadepalli and Ok 1998): $\rho_{t+1} \leftarrow (1 - \alpha_t)\rho_t + \alpha_t \left[\, r_{t+1} - H_t(s_t) + H_t(s_{t+1}) \,\right]$, with $\alpha_{t+1} \leftarrow \frac{\alpha_t}{\alpha_t + 1}$
SSP Q-Learning (Abounadi et al. 2001): $\rho_{t+1} \leftarrow \rho_t + \alpha_t \min_a Q_t(\hat{s}, a)$
HAR (Ghavamzadeh and Mahadevan 2007): $\rho_{t+1} \leftarrow \frac{1}{t+1} \sum_{i=0}^{t} r(s_i, \pi_i(s_i))$
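As a concrete example, the R-Learning gain update from the table could look roughly like this in code (an illustrative sketch only, with my own variable names):

```python
import numpy as np

def r_learning_gain(rho, alpha, r, Q, s, s_next):
    """Schwartz-style gain update: move rho toward
    r + max_a Q[s_next] - max_a Q[s]."""
    return (1.0 - alpha) * rho + alpha * (r + np.max(Q[s_next]) - np.max(Q[s]))
```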
18. Generic Learning Algorithm
Table of SMDP RL algorithms and their gain updates:

SMART (Das et al. 1999) and MAX-Q (Ghavamzadeh and Mahadevan 2001) update the gain as the ratio of cumulative reward to cumulative cost:
$\rho_{t+1} \leftarrow \dfrac{\sum_{i=0}^{t} r(s_i, \pi_i(s_i))}{\sum_{i=0}^{t} c(s_i, \pi_i(s_i))}$
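A hedged sketch of this ratio-style gain estimate (the running-sum bookkeeping is my own, not the authors' code):

```python
def ratio_gain_update(total_reward, total_cost, r, c):
    """Ratio-of-sums gain estimate: cumulative reward over cumulative cost
    (e.g., total elapsed time). Returns the updated sums and the new rho."""
    total_reward += r
    total_cost += c
    return total_reward, total_cost, total_reward / total_cost
```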
20. Nudging
Algorithm 2 Nudged Learning
Initialize (π, ρ, Q)
repeat
Set reward scheme to (r − ρc).
Solve by any RL method.
Update ρ
until Qπ (sI ) = 0
Note: ‘by any RL method’ refers to a well-studied problem for
which better algorithms (both practical and with theoretical
guarantees) exist.
ρ can (and will) be updated optimally.
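A rough outline of Algorithm 2 in code. Here `solve_mdp` stands in for "any RL method" and `update_rho` for the gain update; both are hypothetical placeholders, not functions from the talk:

```python
def nudged_learning(env, solve_mdp, update_rho, s_init, rho0=0.0, tol=1e-6):
    """Outer loop of nudged learning (sketch).

    solve_mdp(env, reward_fn) is any RL/planning routine returning a tabular
    Q and a policy for the problem with rewards redefined as r - rho * c;
    update_rho(rho, Q, s_init) produces the next gain estimate.
    """
    rho = rho0
    while True:
        Q, policy = solve_mdp(env, reward_fn=lambda r, c: r - rho * c)
        v_start = max(Q[s_init])          # value of the start state under Q
        if abs(v_start) < tol:            # stopping rule: Q_pi(s_I) = 0
            return policy, rho
        rho = update_rho(rho, Q, s_init)
```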
22. The w − l space.
Value and Cost
(Policy π has expected average reward $v^\pi$ and expected average cost $c^\pi$. Let D be a bound on the absolute value of $v^\pi$.)
$w^\pi = \dfrac{D + v^\pi}{2 c^\pi}, \qquad l^\pi = \dfrac{D - v^\pi}{2 c^\pi}$
[Figure: example policies plotted in the w − l plane.]
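A small sketch of this change of coordinates and its inverse (the inverse follows from $w + l = D/c$ and $w - l = v/c$):

```python
def to_wl(v, c, D):
    """Map a policy's (average reward v, average cost c > 0) to (w, l)."""
    return (D + v) / (2.0 * c), (D - v) / (2.0 * c)

def from_wl(w, l, D):
    """Inverse mapping: recover (v, c) from a point in the w-l plane."""
    c = D / (w + l)
    return c * (w - l), c
```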
29. Sample task: two states, continuous actions
Policy Space (Actions)
[Figure: the policy space, with actions a1 and a2 each ranging over [0, 1].]
30. Sample task: two states, continuous actions
Policy Values and Costs
[Figure: two panels showing policy value and policy cost over the policy space.]
31. Sample task: two states, continuous actions
Policy Manifold in w − l
[Figure: the policy manifold in the w − l plane, with both axes marked at D/2.]
33. And the rest...
Neat geometry: the problems become linear in w − l and are easily exploited with straightforward algebra and calculus.
Updating the average reward between iterations can be optimized: it becomes finding the (or rather an) intersection between two conics, which can be solved in O(1) time.
In the worst case the uncertainty is reduced by half per iteration; typically it shrinks much faster (see the sketch below).
Little extra complexity is added to already-PAC methods.
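One way to see the worst-case halving: under rewards r − ρc the optimal start-state value is non-increasing in ρ and crosses zero at the optimal gain, so even plain bisection on ρ halves the bracket at every iteration, and the conic-intersection update can only do better. A hedged sketch of that baseline, not the talk's actual update:

```python
def bisect_rho(value_at_start, lo, hi, iters=50):
    """Bisection on rho. value_at_start(rho) is the optimal start-state value
    under rewards r - rho*c; it is non-increasing in rho and zero at the
    optimal gain, so each iteration halves the bracket [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if value_at_start(mid) > 0.0:
            lo = mid     # rho too small: value still positive
        else:
            hi = mid     # rho at or past the optimum
    return 0.5 * (lo + hi)
```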