Lecture 3: Planning by Dynamic Programming
David Silver
Outline
1 Introduction
2 Policy Evaluation
3 Policy Iteration
4 Value Iteration
5 Extensions to Dynamic Programming
6 Contraction Mapping
Introduction
What is Dynamic Programming?
Dynamic sequential or temporal component to the problem
Programming optimising a “program”, i.e. a policy
c.f. linear programming
A method for solving complex problems
By breaking them down into subproblems
Solve the subproblems
Combine solutions to subproblems
Introduction
Requirements for Dynamic Programming
Dynamic Programming is a very general solution method for
problems which have two properties:
Optimal substructure
Principle of optimality applies
Optimal solution can be decomposed into subproblems
Overlapping subproblems
Subproblems recur many times
Solutions can be cached and reused
Markov decision processes satisfy both properties
Bellman equation gives recursive decomposition
Value function stores and reuses solutions
Introduction
Planning by Dynamic Programming
Dynamic programming assumes full knowledge of the MDP
It is used for planning in an MDP
For prediction:
Input: MDP ⟨S, A, P, R, γ⟩ and policy π
or: MRP ⟨S, Pπ, Rπ, γ⟩
Output: value function vπ
Or for control:
Input: MDP ⟨S, A, P, R, γ⟩
Output: optimal value function v∗
and: optimal policy π∗
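For the sketches added throughout these notes, it is convenient to assume the known MDP is given in a small tabular container (a hypothetical helper, not part of the lecture): transition probabilities P[a, s, s′], expected rewards R[a, s], and the discount γ.

```python
# Minimal tabular MDP container assumed by the later sketches (illustrative only):
# P[a, s, s2] = probability of moving from s to s2 under action a,
# R[a, s]     = expected immediate reward for taking action a in state s.
from dataclasses import dataclass
import numpy as np

@dataclass
class TabularMDP:
    P: np.ndarray      # shape (A, S, S); each row P[a, s, :] sums to 1
    R: np.ndarray      # shape (A, S)
    gamma: float       # discount factor γ
```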
Introduction
Other Applications of Dynamic Programming
Dynamic programming is used to solve many other problems, e.g.
Scheduling algorithms
String algorithms (e.g. sequence alignment)
Graph algorithms (e.g. shortest path algorithms)
Graphical models (e.g. Viterbi algorithm)
Bioinformatics (e.g. lattice models)
Policy Evaluation
Iterative Policy Evaluation
Iterative Policy Evaluation
Problem: evaluate a given policy π
Solution: iterative application of Bellman expectation backup
v1 → v2 → ... → vπ
Using synchronous backups,
At each iteration k + 1
For all states s ∈ S
Update vk+1(s) from vk(s′)
where s′ is a successor state of s
We will discuss asynchronous backups later
Convergence to vπ will be proven at the end of the lecture
Policy Evaluation
Iterative Policy Evaluation
Iterative Policy Evaluation (2)
(Backup diagram: vk+1(s) at the root state s, computed from vk(s′) at each successor state s′ reached via action a and reward r.)
vk+1(s) = Σ_{a∈A} π(a|s) ( R^a_s + γ Σ_{s′∈S} P^a_{ss′} vk(s′) )
vk+1 = R^π + γ P^π vk
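As a rough sketch of this backup (using the assumed TabularMDP container from the introduction, with the policy given as a matrix pi[s, a]), synchronous iterative policy evaluation might look like:

```python
import numpy as np

def policy_evaluation(mdp, pi, tol=1e-8, max_iters=10_000):
    """Synchronous Bellman expectation backups: v_{k+1} = R^pi + gamma * P^pi v_k.
    pi[s, a] is the probability of taking action a in state s."""
    A, S, _ = mdp.P.shape
    R_pi = np.einsum('sa,as->s', pi, mdp.R)        # policy-averaged reward R^pi
    P_pi = np.einsum('sa,ast->st', pi, mdp.P)      # policy-averaged transitions P^pi
    v = np.zeros(S)
    for _ in range(max_iters):
        v_next = R_pi + mdp.gamma * P_pi @ v       # Bellman expectation backup
        if np.max(np.abs(v_next - v)) < tol:       # stop once a sweep barely changes v
            return v_next
        v = v_next
    return v
```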
Policy Evaluation
Example: Small Gridworld
Evaluating a Random Policy in the Small Gridworld
Undiscounted episodic MDP (γ = 1)
Nonterminal states 1, ..., 14
One terminal state (shown twice as shaded squares)
Actions leading out of the grid leave state unchanged
Reward is −1 until the terminal state is reached
Agent follows uniform random policy
π(n|·) = π(e|·) = π(s|·) = π(w|·) = 0.25
Policy Evaluation
Example: Small Gridworld
Iterative Policy Evaluation in Small Gridworld
vk for the random policy (terminal corners have value 0.0), alongside the greedy policy w.r.t. vk:
k = 0: all values 0.0 (the greedy policy here is the random policy)
k = 1: all nonterminal values -1.0
k = 2:
 0.0 -1.7 -2.0 -2.0
-1.7 -2.0 -2.0 -2.0
-2.0 -2.0 -2.0 -1.7
-2.0 -2.0 -1.7  0.0
Policy Evaluation
Example: Small Gridworld
Iterative Policy Evaluation in Small Gridworld (2)
vk for the random policy (terminal corners have value 0.0):
k = 3:
 0.0 -2.4 -2.9 -3.0
-2.4 -2.9 -3.0 -2.9
-2.9 -3.0 -2.9 -2.4
-3.0 -2.9 -2.4  0.0
k = 10:
 0.0 -6.1 -8.4 -9.0
-6.1 -7.7 -8.4 -8.4
-8.4 -8.4 -7.7 -6.1
-9.0 -8.4 -6.1  0.0
k = ∞:
 0.0 -14. -20. -22.
-14. -18. -20. -20.
-20. -20. -18. -14.
-22. -20. -14.  0.0
From k = 3 onwards the greedy policy w.r.t. vk is already the optimal policy.
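As a quick check of these numbers, the gridworld can be reproduced with the TabularMDP and policy_evaluation sketches above (the construction below is an illustrative assumption following the slide's description: reward −1 per step, terminal corners, moves off the grid leave the state unchanged):

```python
import numpy as np

def small_gridworld():
    """4x4 gridworld from the slides; cells 0 and 15 act as the terminal state."""
    S, A = 16, 4                                   # actions: north, east, south, west
    P = np.zeros((A, S, S))
    R = np.full((A, S), -1.0)
    moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]
    for s in range(S):
        r, c = divmod(s, 4)
        for a, (dr, dc) in enumerate(moves):
            if s in (0, 15):                       # terminal: absorbing, zero reward
                P[a, s, s], R[a, s] = 1.0, 0.0
                continue
            nr, nc = r + dr, c + dc
            s2 = nr * 4 + nc if (0 <= nr < 4 and 0 <= nc < 4) else s
            P[a, s, s2] = 1.0
    return TabularMDP(P=P, R=R, gamma=1.0)

pi_random = np.full((16, 4), 0.25)                 # uniform random policy
v = policy_evaluation(small_gridworld(), pi_random)
print(np.round(v.reshape(4, 4), 1))                # matches the k = ∞ grid above
```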
Policy Iteration
How to Improve a Policy
Given a policy π
Evaluate the policy π
vπ(s) = E [Rt+1 + γRt+2 + ...|St = s]
Improve the policy by acting greedily with respect to vπ
π′ = greedy(vπ)
In Small Gridworld improved policy was optimal, π′ = π∗
In general, need more iterations of improvement / evaluation
But this process of policy iteration always converges to π∗
Policy Iteration
Policy Iteration
Policy evaluation: Estimate vπ
Iterative policy evaluation
Policy improvement: Generate π′ ≥ π
Greedy policy improvement
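A compact sketch of this loop, reusing the policy_evaluation sketch above (greedy improvement breaks ties by lowest action index, an arbitrary choice not specified on the slide):

```python
import numpy as np

def policy_iteration(mdp):
    """Alternate full policy evaluation with greedy policy improvement until stable."""
    A, S, _ = mdp.P.shape
    pi = np.full((S, A), 1.0 / A)                  # start from the uniform random policy
    while True:
        v = policy_evaluation(mdp, pi)             # policy evaluation
        q = mdp.R + mdp.gamma * mdp.P @ v          # q[a, s] = R^a_s + gamma * sum_s' P^a_ss' v(s')
        pi_new = np.eye(A)[np.argmax(q, axis=0)]   # greedy (deterministic) improvement
        if np.array_equal(pi_new, pi):
            return pi, v                           # greedy policy is stable: pi = pi*
        pi = pi_new
```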
Policy Iteration
Example: Jack’s Car Rental
Jack’s Car Rental
States: Two locations, maximum of 20 cars at each
Actions: Move up to 5 cars between locations overnight
Reward: $10 for each car rented (must be available)
Transitions: Cars returned and requested randomly
Poisson distribution, n returns/requests with probability (λⁿ / n!) e^(−λ)
1st location: average requests = 3, average returns = 3
2nd location: average requests = 4, average returns = 2
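For concreteness, the request/return probability above is the Poisson pmf; a small sketch (the truncation cap is an assumption one needs when building a finite transition model, not something stated on the slide):

```python
import math

def poisson_pmf(n, lam):
    """P[N = n] = (lam^n / n!) * exp(-lam), the request/return distribution above."""
    return lam ** n / math.factorial(n) * math.exp(-lam)

def truncated_poisson(lam, cap=20):
    """Distribution over 0..cap, lumping the tail probability into the last entry."""
    probs = [poisson_pmf(n, lam) for n in range(cap)]
    probs.append(1.0 - sum(probs))
    return probs
```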
Policy Iteration
Example: Jack’s Car Rental
Policy Iteration in Jack’s Car Rental
Policy Iteration
Policy Improvement
Policy Improvement
Consider a deterministic policy, a = π(s)
We can improve the policy by acting greedily
π′(s) = argmax_{a∈A} qπ(s, a)
This improves the value from any state s over one step,
qπ(s, π′(s)) = max_{a∈A} qπ(s, a) ≥ qπ(s, π(s)) = vπ(s)
It therefore improves the value function, vπ′(s) ≥ vπ(s)
vπ(s) ≤ qπ(s, π′(s)) = Eπ′ [Rt+1 + γ vπ(St+1) | St = s]
≤ Eπ′ [Rt+1 + γ qπ(St+1, π′(St+1)) | St = s]
≤ Eπ′ [Rt+1 + γ Rt+2 + γ² qπ(St+2, π′(St+2)) | St = s]
≤ Eπ′ [Rt+1 + γ Rt+2 + ... | St = s] = vπ′(s)
Policy Iteration
Policy Improvement
Policy Improvement (2)
If improvements stop,
qπ(s, π′(s)) = max_{a∈A} qπ(s, a) = qπ(s, π(s)) = vπ(s)
Then the Bellman optimality equation has been satisfied
vπ(s) = max_{a∈A} qπ(s, a)
Therefore vπ(s) = v∗(s) for all s ∈ S
so π is an optimal policy
Policy Iteration
Extensions to Policy Iteration
Modified Policy Iteration
Does policy evaluation need to converge to vπ?
Or should we introduce a stopping condition
e.g. ε-convergence of value function
Or simply stop after k iterations of iterative policy evaluation?
For example, in the small gridworld k = 3 was sufficient to
achieve optimal policy
Why not update policy every iteration? i.e. stop after k = 1
This is equivalent to value iteration (next section)
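A sketch of this idea over the assumed tabular MDP: improve greedily after only k evaluation sweeps. With k = 1 the inner loop collapses to a single Bellman optimality backup, i.e. value iteration.

```python
import numpy as np

def modified_policy_iteration(mdp, k=3, tol=1e-8, max_iters=10_000):
    """Greedy improvement after k truncated evaluation sweeps (k -> infinity: full policy iteration)."""
    A, S, _ = mdp.P.shape
    v = np.zeros(S)
    for _ in range(max_iters):
        q = mdp.R + mdp.gamma * mdp.P @ v
        pi = np.eye(A)[np.argmax(q, axis=0)]        # greedy improvement
        R_pi = np.einsum('sa,as->s', pi, mdp.R)     # R^pi for the greedy policy
        P_pi = np.einsum('sa,ast->st', pi, mdp.P)   # P^pi for the greedy policy
        v_old = v
        for _ in range(k):                          # truncated policy evaluation
            v = R_pi + mdp.gamma * P_pi @ v
        if np.max(np.abs(v - v_old)) < tol:
            return v, pi
    return v, pi
```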
Policy Iteration
Extensions to Policy Iteration
Generalised Policy Iteration
Policy evaluation: Estimate vπ
Any policy evaluation algorithm
Policy improvement: Generate π′ ≥ π
Any policy improvement algorithm
Value Iteration
Value Iteration in MDPs
Principle of Optimality
Any optimal policy can be subdivided into two components:
An optimal first action A∗
Followed by an optimal policy from successor state S′
Theorem (Principle of Optimality)
A policy π(a|s) achieves the optimal value from state s,
vπ(s) = v∗(s), if and only if
For any state s′ reachable from s
π achieves the optimal value from state s′, vπ(s′) = v∗(s′)
Value Iteration
Value Iteration in MDPs
Deterministic Value Iteration
If we know the solution to subproblems v∗(s′)
Then solution v∗(s) can be found by one-step lookahead
v∗(s) ← max_{a∈A} ( R^a_s + γ Σ_{s′∈S} P^a_{ss′} v∗(s′) )
The idea of value iteration is to apply these updates iteratively
Intuition: start with final rewards and work backwards
Still works with loopy, stochastic MDPs
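A minimal sketch of this update over the assumed tabular MDP, sweeping every state synchronously and reading off a greedy policy at the end:

```python
import numpy as np

def value_iteration(mdp, tol=1e-8, max_iters=10_000):
    """Iterate the Bellman optimality backup v(s) <- max_a [ R^a_s + gamma * sum_s' P^a_ss' v(s') ]."""
    A, S, _ = mdp.P.shape
    v = np.zeros(S)
    for _ in range(max_iters):
        v_next = np.max(mdp.R + mdp.gamma * mdp.P @ v, axis=0)   # optimality backup for all states
        if np.max(np.abs(v_next - v)) < tol:
            break
        v = v_next
    greedy = np.argmax(mdp.R + mdp.gamma * mdp.P @ v, axis=0)    # one greedy action per state
    return v, greedy
```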
Value Iteration
Value Iteration in MDPs
Example: Shortest Path
(Figure: a 4×4 shortest-path gridworld with goal state g in the top-left corner and reward −1 per step. The panels show the problem and the value-iteration estimates V1 ... V7, which converge to minus the number of steps to the goal. The converged grid V7 is:)
 0 -1 -2 -3
-1 -2 -3 -4
-2 -3 -4 -5
-3 -4 -5 -6
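Reusing the value_iteration sketch, the construction below (an illustrative assumption matching the figure's description: goal g in the top-left corner, reward −1 per step, moves off the grid leave the state unchanged) reproduces the converged values:

```python
import numpy as np

def shortest_path_grid(n=4):
    """Deterministic n x n shortest-path gridworld with an absorbing goal at state 0."""
    S, A = n * n, 4
    P = np.zeros((A, S, S))
    R = np.full((A, S), -1.0)
    moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]
    for s in range(S):
        r, c = divmod(s, n)
        for a, (dr, dc) in enumerate(moves):
            if s == 0:                              # goal g: absorbing, zero reward
                P[a, s, s], R[a, s] = 1.0, 0.0
                continue
            nr, nc = r + dr, c + dc
            s2 = nr * n + nc if (0 <= nr < n and 0 <= nc < n) else s
            P[a, s, s2] = 1.0
    return TabularMDP(P=P, R=R, gamma=1.0)

v, _ = value_iteration(shortest_path_grid())
print(v.reshape(4, 4))       # row r, column c converges to -(r + c), as in V7 above
```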
Value Iteration
Value Iteration in MDPs
Value Iteration
Problem: find optimal policy π
Solution: iterative application of Bellman optimality backup
v1 → v2 → ... → v∗
Using synchronous backups
At each iteration k + 1
For all states s ∈ S
Update vk+1(s) from vk(s′)
Convergence to v∗ will be proven later
Unlike policy iteration, there is no explicit policy
Intermediate value functions may not correspond to any policy
Value Iteration
Value Iteration in MDPs
Value Iteration (2)
(Backup diagram: vk+1(s) at the root state s, computed from vk(s′) at each successor state s′ reached via action a and reward r.)
vk+1(s) = max_{a∈A} ( R^a_s + γ Σ_{s′∈S} P^a_{ss′} vk(s′) )
vk+1 = max_{a∈A} ( R^a + γ P^a vk )
Value Iteration
Value Iteration in MDPs
Example of Value Iteration in Practice
http://www.cs.ubc.ca/~poole/demos/mdp/vi.html
Value Iteration
Summary of DP Algorithms
Synchronous Dynamic Programming Algorithms
Problem    | Bellman Equation                                         | Algorithm
Prediction | Bellman Expectation Equation                             | Iterative Policy Evaluation
Control    | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration
Control    | Bellman Optimality Equation                              | Value Iteration
Algorithms are based on state-value function vπ(s) or v∗(s)
Complexity O(mn²) per iteration, for m actions and n states
Could also apply to action-value function qπ(s, a) or q∗(s, a)
Complexity O(m²n²) per iteration
Extensions to Dynamic Programming
Asynchronous Dynamic Programming
Asynchronous Dynamic Programming
DP methods described so far used synchronous backups
i.e. all states are backed up in parallel
Asynchronous DP backs up states individually, in any order
For each selected state, apply the appropriate backup
Can significantly reduce computation
Guaranteed to converge if all states continue to be selected
Extensions to Dynamic Programming
Asynchronous Dynamic Programming
Asynchronous Dynamic Programming
Three simple ideas for asynchronous dynamic programming:
In-place dynamic programming
Prioritised sweeping
Real-time dynamic programming
Extensions to Dynamic Programming
Asynchronous Dynamic Programming
In-Place Dynamic Programming
Synchronous value iteration stores two copies of value function
for all s in S
vnew(s) ← max_{a∈A} ( R^a_s + γ Σ_{s′∈S} P^a_{ss′} vold(s′) )
vold ← vnew
In-place value iteration only stores one copy of value function
for all s in S
v(s) ← max_{a∈A} ( R^a_s + γ Σ_{s′∈S} P^a_{ss′} v(s′) )
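A sketch of the in-place variant over the assumed tabular MDP: a single value array is swept repeatedly, so backups later in a sweep already see the values written earlier in the same sweep.

```python
import numpy as np

def in_place_value_iteration(mdp, tol=1e-8, max_sweeps=10_000):
    """One value array only (Gauss-Seidel style sweeps)."""
    A, S, _ = mdp.P.shape
    v = np.zeros(S)
    for _ in range(max_sweeps):
        delta = 0.0
        for s in range(S):                                    # any sweep order is allowed
            new = np.max(mdp.R[:, s] + mdp.gamma * mdp.P[:, s, :] @ v)
            delta = max(delta, abs(new - v[s]))
            v[s] = new                                        # overwrite immediately
        if delta < tol:
            return v
    return v
```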
Extensions to Dynamic Programming
Asynchronous Dynamic Programming
Prioritised Sweeping
Use magnitude of Bellman error to guide state selection, e.g.
| max_{a∈A} ( R^a_s + γ Σ_{s′∈S} P^a_{ss′} v(s′) ) − v(s) |
Backup the state with the largest remaining Bellman error
Update Bellman error of affected states after each backup
Requires knowledge of reverse dynamics (predecessor states)
Can be implemented efficiently by maintaining a priority queue
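A rough sketch of the idea over the assumed tabular MDP (lazy, possibly stale heap entries and a brute-force predecessor scan are simplifications; a serious implementation would maintain exact priorities through predecessor links):

```python
import heapq
import numpy as np

def prioritised_sweeping_vi(mdp, n_backups=10_000, theta=1e-8):
    """Back up the state with the (approximately) largest Bellman error first."""
    A, S, _ = mdp.P.shape
    v = np.zeros(S)
    backup = lambda s: np.max(mdp.R[:, s] + mdp.gamma * mdp.P[:, s, :] @ v)
    heap = [(-abs(backup(s) - v[s]), s) for s in range(S)]    # max-heap via negated errors
    heapq.heapify(heap)
    for _ in range(n_backups):
        if not heap:
            break
        _, s = heapq.heappop(heap)
        if abs(backup(s) - v[s]) < theta:                     # entry may be stale; recheck
            continue
        v[s] = backup(s)                                      # backup the selected state
        preds = np.nonzero(mdp.P[:, :, s].sum(axis=0))[0]     # states that can reach s
        for s2 in preds:
            heapq.heappush(heap, (-abs(backup(s2) - v[s2]), s2))
    return v
```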
Extensions to Dynamic Programming
Asynchronous Dynamic Programming
Real-Time Dynamic Programming
Idea: only backup states that are relevant to the agent
Use agent’s experience to guide the selection of states
After each time-step St, At, Rt+1
Backup the state St
v(St) ← max_{a∈A} ( R^a_{St} + γ Σ_{s′∈S} P^a_{St s′} v(s′) )
Extensions to Dynamic Programming
Full-width and sample backups
Full-Width Backups
DP uses full-width backups
For each backup (sync or async)
Every successor state and action is
considered
Using knowledge of the MDP transitions
and reward function
DP is effective for medium-sized problems
(millions of states)
For large problems DP suffers Bellman’s
curse of dimensionality
Number of states n = |S| grows
exponentially with number of state
variables
Even one backup can be too expensive
(Backup diagram: full one-step lookahead from s over every action a, reward r, and successor state s′.)
Extensions to Dynamic Programming
Full-width and sample backups
Sample Backups
In subsequent lectures we will consider sample backups
Using sample rewards and sample transitions
⟨S, A, R, S′⟩
Instead of reward function R and transition dynamics P
Advantages:
Model-free: no advance knowledge of MDP required
Breaks the curse of dimensionality through sampling
Cost of backup is constant, independent of n = |S|
Extensions to Dynamic Programming
Approximate Dynamic Programming
Approximate Dynamic Programming
Approximate the value function
Using a function approximator v̂(s, w)
Apply dynamic programming to v̂(·, w)
e.g. Fitted Value Iteration repeats at each iteration k,
Sample states S̃ ⊆ S
For each state s ∈ S̃, estimate target value using Bellman optimality equation,
ṽk(s) = max_{a∈A} ( R^a_s + γ Σ_{s′∈S} P^a_{ss′} v̂(s′, wk) )
Train next value function v̂(·, wk+1) using targets {⟨s, ṽk(s)⟩}
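A sketch of fitted value iteration with a linear approximator v̂(s, w) = φ(s)ᵀw (the feature map φ, the sample size, and the least-squares fit are illustrative assumptions):

```python
import numpy as np

def fitted_value_iteration(mdp, phi, n_samples=64, iters=50, seed=0):
    """Repeat: sample states, form Bellman optimality targets, refit the linear weights."""
    rng = np.random.default_rng(seed)
    A, S, _ = mdp.P.shape
    features = np.stack([phi(s) for s in range(S)])           # (S, d) feature matrix
    w = np.zeros(features.shape[1])
    for _ in range(iters):
        states = rng.choice(S, size=n_samples)                # sampled states S~
        v_hat = features @ w                                  # current approximate values
        q = mdp.R + mdp.gamma * mdp.P @ v_hat                 # (A, S) one-step lookahead
        targets = np.max(q[:, states], axis=0)                # Bellman optimality targets
        w, *_ = np.linalg.lstsq(features[states], targets, rcond=None)  # fit w_{k+1}
    return w
```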
Contraction Mapping
Some Technical Questions
How do we know that value iteration converges to v∗?
Or that iterative policy evaluation converges to vπ?
And therefore that policy iteration converges to v∗?
Is the solution unique?
How fast do these algorithms converge?
These questions are resolved by contraction mapping theorem
Contraction Mapping
Value Function Space
Consider the vector space V over value functions
There are |S| dimensions
Each point in this space fully specifies a value function v(s)
What does a Bellman backup do to points in this space?
We will show that it brings value functions closer
And therefore the backups must converge on a unique solution
Contraction Mapping
Value Function ∞-Norm
We will measure distance between state-value functions u and
v by the ∞-norm
i.e. the largest difference between state values,
||u − v||∞ = max_{s∈S} |u(s) − v(s)|
Contraction Mapping
Bellman Expectation Backup is a Contraction
Define the Bellman expectation backup operator Tπ,
Tπ(v) = Rπ + γ Pπ v
This operator is a γ-contraction, i.e. it makes value functions closer by at least γ,
||Tπ(u) − Tπ(v)||∞ = ||(Rπ + γ Pπ u) − (Rπ + γ Pπ v)||∞
= ||γ Pπ (u − v)||∞
≤ ||γ Pπ ||u − v||∞||∞
≤ γ ||u − v||∞
Contraction Mapping
Contraction Mapping Theorem
Theorem (Contraction Mapping Theorem)
For any metric space V that is complete (i.e. closed) under an
operator T(v), where T is a γ-contraction,
T converges to a unique fixed point
At a linear convergence rate of γ
Contraction Mapping
Convergence of Iter. Policy Evaluation and Policy Iteration
The Bellman expectation operator Tπ has a unique fixed point
vπ is a fixed point of Tπ (by Bellman expectation equation)
By contraction mapping theorem
Iterative policy evaluation converges on vπ
Policy iteration converges on v∗
Contraction Mapping
Bellman Optimality Backup is a Contraction
Define the Bellman optimality backup operator T∗,
T∗(v) = max_{a∈A} ( R^a + γ P^a v )
This operator is a γ-contraction, i.e. it makes value functions
closer by at least γ (similar to previous proof)
||T∗(u) − T∗(v)||∞ ≤ γ ||u − v||∞
Contraction Mapping
Convergence of Value Iteration
The Bellman optimality operator T∗ has a unique fixed point
v∗ is a fixed point of T∗ (by Bellman optimality equation)
By contraction mapping theorem
Value iteration converges on v∗
