1. Introduction to Artificial Intelligence
Reinforcement Learning
Andres Mendez-Vazquez
April 26, 2019
Outline
1 Reinforcement Learning
Introduction
Example
A K-Armed Bandit Problem
Exploration vs Exploitation
The Elements of Reinforcement Learning
Limitations
2 Formalizing Reinforcement Learning
Introduction
Markov Process and Markov Decision Process
Return, Policy and Value function
3 Solution Methods
Multi-armed Bandits
Problems of Implementations
General Formula in Reinforcement Learning
A Simple Bandit Algorithm
Dynamic Programming
Policy Evaluation
Policy Iteration
Policy and Value Improvement
Example
Temporal-Difference Learning
A Small Review of Monte Carlo Methods
The Main Idea
The Monte Carlo Principle
Going Back to Temporal Difference
Q-learning: Off-Policy TD Control
4 The Neural Network Approach
Introduction, TD-Gammon
TD(λ) by Back-Propagation
Conclusions
Reinforcement Learning is learning what to do
What does Reinforcement Learning want?
Maximize a numerical reward signal
The learner is not told which actions to take
It must discover by trial which actions yield the
most reward
These ideas come from Dynamical Systems Theory
Specifically
The optimal control of incompletely known Markov Decision
Processes (MDPs).
Thus, the device must do the following
Interact with the environment to learn by using punishments and
rewards.
A mobile robot
decides whether it should enter a new room
in search of more trash to collect
or go back to charge its battery
It needs to make a decision based on the
current charge level of its battery and
how quickly it can arrive at its base
A K-Armed Bandit Problem
Definition
You are faced repeatedly with a choice among k different options, or
actions.
After each choice you receive a numerical reward drawn
from a stationary probability distribution that depends on the chosen action.
Definition (Stationary Distribution)
A stationary distribution of a Markov chain is a probability distribution
that remains unchanged in the Markov chain as time progresses.
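To make the stationarity definition concrete, here is a small sketch (the 3-state transition matrix P is invented for illustration) that finds a distribution π with π = πP by power iteration:

```python
import numpy as np

# Hypothetical 3-state chain: row i of P gives the transition
# probabilities out of state i (each row sums to 1).
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

# Power iteration: repeatedly pushing any initial distribution through P
# converges to a stationary distribution pi satisfying pi = pi P.
pi = np.full(3, 1.0 / 3.0)
for _ in range(1000):
    pi = pi @ P
```

Because π = πP, the distribution over states no longer changes as time progresses, which is exactly the definition above.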
Therefore
Each of the k actions has an expected value
q∗ (a) = E [Rt|At = a]
Something Notable
If you knew the value of each action,
It would be trivial to solve the k-armed bandit problem.
What do we want?
To find the estimated value of action a at time t, Qt (a)
|Qt (a) − q∗ (a)| < ε, with ε as small as possible
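A sketch of this goal (all numbers here are invented): draw k true action values q∗(a), generate noisy rewards around them, and check that the sample average Qt (a) approaches q∗(a):

```python
import random

random.seed(0)
K = 5
q_star = [random.gauss(0.0, 1.0) for _ in range(K)]  # unknown true values

def pull(a):
    """Reward for action a: its true value plus unit-variance noise."""
    return q_star[a] + random.gauss(0.0, 1.0)

# Average many rewards per action; the gap |Q(a) - q*(a)| shrinks
# roughly like 1/sqrt(n) by the law of large numbers.
n = 20000
Q = [sum(pull(a) for _ in range(n)) / n for a in range(K)]
gap = max(abs(Q[a] - q_star[a]) for a in range(K))
```

With 20,000 pulls per arm the largest estimation gap is well under 0.1.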
Greedy Actions
Intuitions
Choose the action with the greatest E [Rt|At = a]
This is known as Exploitation
You are exploiting your current knowledge of the values of the actions.
Contrast this with Exploration
When you select the non-greedy actions to obtain better knowledge
of their values.
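The standard way to mix the two is ε-greedy selection; a minimal sketch (the function name and signature are my own, not from the slides):

```python
import random

def epsilon_greedy(Q, epsilon, rng=random):
    """Exploit (pick the greedy action) most of the time, but with
    probability epsilon explore a uniformly random action instead."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))                  # exploration
    return max(range(len(Q)), key=Q.__getitem__)      # exploitation
```

Here ε trades off the two regimes: ε = 0 is pure exploitation, ε = 1 is pure exploration.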
An Interesting Conundrum
It has been observed that
Exploitation maximizes the expected reward in a single step,
Exploration may produce the greater total reward in the long run.
Thus, the idea
The value of a state s is the total amount of reward an agent can
expect to accumulate over the future, starting from s.
Thus, Values
They indicate the long-term desirability of states, taking into account
the states that follow and their rewards.
Therefore, the parts of a Reinforcement Learning system
A Policy
It defines the learning agent’s way of behaving at a given time.
The policy is the core of a reinforcement learning system
In general, policies may be stochastic
A Reward Signal
It defines the goal of a reinforcement learning problem.
The system’s sole objective is to maximize the total reward over the
long run
This encourages Exploitation
It is immediate... you might want something more long-term.
A Value Function
It specifies what is good in the long run.
This encourages Exploration
Finally
A Model of the Environment
This simulates the behavior of the environment
Allowing the agent to make inferences about how the environment will behave
Something Notable
Models are used for planning
Meaning we make decisions before they are executed
Thus, Reinforcement Learning Methods using models are called
model-based
Limitations
Given that
Reinforcement learning relies heavily on the concept of state
Therefore, it has the same problem as in Markov Decision Processes
Defining the state to obtain a better representation of the search
space.
This is where the creative part of the problem comes in
As we have seen, solutions in AI rely a lot on the creativity of the
designer...
Markov Process and Markov Decision Process
Definition of Markov Process
A sequence of states is Markov if and only if the probability of moving
to the next state st+1 depends only on the present state st
P [st+1|st] = P [st+1|s1, ..., st]
Something Notable
We always talk about time-homogeneous Markov chains in
Reinforcement Learning:
the probability of the transition is independent of t
P [st+1 = s′ | st = s] = P [st = s′ | st−1 = s]
Formally
Definition (Markov Process)
A Markov Process (or Markov Chain) is a tuple (S, P), where
1 S is a finite set of states
2 P is a state transition probability matrix, P_{ss′} = P [st+1 = s′ | st = s]
Thus, the dynamics of the Markov process are as follows
s0 −(P_{s0s1})→ s1 −(P_{s1s2})→ s2 −(P_{s2s3})→ s3 −→ · · ·
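The dynamics above can be sampled directly; a small sketch with an invented three-state transition table:

```python
import random

# Hypothetical Markov Process (S, P) with three states; each row of the
# transition table sums to 1.
P = {
    "s0": [("s0", 0.5), ("s1", 0.5)],
    "s1": [("s1", 0.3), ("s2", 0.7)],
    "s2": [("s0", 1.0)],
}

def step(s, rng=random):
    """Sample s_{t+1} from P_{ss'}: the draw depends only on the present state."""
    r, acc = rng.random(), 0.0
    for nxt, p in P[s]:
        acc += p
        if r < acc:
            return nxt
    return P[s][-1][0]  # guard against floating-point rounding

random.seed(1)
chain = ["s0"]
for _ in range(10):
    chain.append(step(chain[-1]))
```

Every transition in the sampled chain uses only the current state, which is the Markov property in action.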
Then, we get the following familiar definition
Definition
An MDP is a tuple (S, A, P, R, α), where
1 S is a finite set of states
2 A is a finite set of actions
3 P is a state transition probability matrix,
P^a_{ss′} = P [st+1 = s′ | st = s, At = a]
4 α ∈ [0, 1] is called the discount factor
5 R : S × A → R is a reward function
Thus, the dynamics of the Markov Decision Process are as follows, using
a probability to pick an action
s0 −(a0)→ s1 −(a1)→ s2 −(a2)→ s3 −→ · · ·
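The tuple can be written down and rolled out as data; a sketch with an invented two-state, two-action MDP (all names and numbers are illustrative):

```python
import random

# Hypothetical MDP (S, A, P, R, alpha).
S, A, alpha = ["s0", "s1"], ["stay", "go"], 0.9
P = {  # P[s][a] -> list of (next state, probability)
    "s0": {"stay": [("s0", 1.0)], "go": [("s1", 1.0)]},
    "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]},
}
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "go"): 0.0}

# Roll out s0 -a0-> s1 -a1-> s2 ... picking actions at random.
random.seed(0)
s, trajectory = "s0", []
for _ in range(5):
    a = random.choice(A)            # an action chosen with some probability
    s_next = P[s][a][0][0]          # transitions here happen to be deterministic
    trajectory.append((s, a, R[(s, a)]))
    s = s_next
```

Each step records (state, action, reward), matching the s0 → s1 → s2 → · · · dynamic above.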
The Goal of Reinforcement Learning
We want to maximize the expected value of the return
i.e. To obtain the optimal policy!!!
Thus, as in our previous study of MDP’s
Gt = Rt+1 + αRt+2 + · · · = Σ_{k=0}^{∞} α^k R_{t+k+1}
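The return is a one-line computation over a reward sequence; a minimal sketch:

```python
def discounted_return(rewards, alpha):
    """G_t = R_{t+1} + alpha*R_{t+2} + ... = sum_k alpha^k * R_{t+k+1}."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (alpha ** k) * r
    return g
```

For rewards (1, 1, 1) and α = 0.5 this gives 1 + 0.5 + 0.25 = 1.75.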
Remarks
Interpretation
The discounted future rewards can be interpreted as the current
value of future rewards.
The discount factor α
α close to 0 leads to “short-sighted” evaluation... you are only
looking for immediate rewards!!!
α close to 1 leads to “long-sighted” evaluation... you are willing to
wait for a better reward!!!
Why use this discount factor?
Basically
1 The discount factor avoids infinite returns in cyclic Markov processes,
given the properties of the geometric series.
2 There is inherent uncertainty about future rewards.
3 If the reward is financial, immediate rewards may earn more interest
than delayed rewards.
4 Animal/human behavior shows a preference for immediate reward.
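The first point can be checked numerically: with a constant reward of 1 forever, as in a cyclic process, the discounted return converges to 1/(1 − α) rather than diverging.

```python
alpha = 0.9
g, term = 0.0, 1.0
for _ in range(1000):   # truncation of the infinite sum sum_k alpha^k
    g += term
    term *= alpha

limit = 1.0 / (1.0 - alpha)   # geometric-series limit, here 10.0
```

After 1000 terms the truncated sum is already indistinguishable from the limit.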
A Policy
Using our MDP ideas...
A policy π is a distribution over actions given states
π (a|s) = P [At = a|St = s]
A policy guides the choice of action at a given state.
And it is time-independent
Then, we introduce two value functions
State-value function.
Action-value function.
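A stochastic policy is just one distribution over actions per state; a sketch with an invented two-state policy table:

```python
import random

# Hypothetical stochastic policy pi(a|s): the same distribution is used
# at every time step t, i.e. the policy is time-independent.
pi = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(s, rng=random):
    """Draw A_t ~ pi(.|S_t = s)."""
    r, acc = rng.random(), 0.0
    for a, p in pi[s].items():
        acc += p
        if r < acc:
            return a
    return a  # guard against floating-point rounding
```

Sampling many actions in s0 recovers the 80/20 split the table specifies.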
The state-value function vπ of an MDP
It is defined as
vπ (s) = Eπ [Gt|St = s]
This gives the long-term value of the state s
vπ (s) = Eπ [Gt | St = s]
= Eπ [ Σ_{k=0}^{∞} α^k R_{t+k+1} | St = s ]
= Eπ [Rt+1 + αGt+1 | St = s]
= Eπ [Rt+1 | St = s] (immediate reward) + Eπ [α vπ (St+1) | St = s] (discounted value of the successor state)
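A minimal numerical check of this decomposition (the two-state chain and its rewards are invented): on a chain s0 → s1 → end with rewards 1 then 2, v(s0) must equal the immediate reward plus α times v(s1).

```python
alpha = 0.9

# Deterministic episodic chain under a fixed policy:
# s0 --(reward 1)--> s1 --(reward 2)--> terminal.
rewards_from = {"s0": [1.0, 2.0], "s1": [2.0]}

def v(s):
    """State value = discounted sum of the rewards that follow s."""
    return sum((alpha ** k) * r for k, r in enumerate(rewards_from[s]))
```

Here v(s1) = 2 and v(s0) = 1 + α · 2 = 2.8, matching the last line of the derivation.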
The Action-Value Function qπ (s, a)
It is the expected return obtained when
starting from state s, taking action a, and then following policy π
qπ (s, a) = Eπ [Gt | St = s, At = a]
We can also obtain the following decomposition
qπ (s, a) = Eπ [Rt+1 + αqπ (St+1, At+1) |St = s, At = a]
Then
Substituting qπ (s, a) into the expression for vπ (s) yields
The Bellman Equation
vπ (s) = Σ_{a∈A} π (a|s) [ R^a_s + α Σ_{s′∈S} P^a_{ss′} vπ (s′) ]
The Bellman equation relates the state-value function
of one state with that of other states.
Also
Bellman equation for qπ(s, a):
q_\pi(s, a) = R_s^a + \alpha \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a' \mid s')\, q_\pi(s', a')
How do we solve this problem?
We have the following solutions.
Action-Value Methods
One natural way to estimate this is by averaging the rewards actually received:
Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}
Nevertheless
If the denominator is zero, we define Qt(a) as some default value. As the denominator goes to infinity, by the law of large numbers,
Q_t(a) \longrightarrow q_*(a)
Furthermore
We call this the sample-average method
Not the best, but it allows us to introduce a way to select actions...
A classic one is the greedy estimate:
A_t = \arg\max_a Q_t(a)
Therefore
Greedy action selection always exploits current knowledge to maximize immediate reward; it spends no time on inferior actions to see if they might really be better.
And Here a Heuristic
A simple alternative is to act greedily most of the time, but every once in a while, say with small probability ε, select randomly from among all the actions with equal probability.
This is called the ε-greedy method
In the limit, this method ensures:
Q_t(a) \longrightarrow q_*(a)
Example
In the following example
The q∗(a) were selected according to a Gaussian distribution with mean 0 and variance 1.
Then, when an action was selected, the reward Rt was drawn from a normal distribution with mean q∗(At) and variance 1.
Imagine the following
Let Ri denote the reward received after the ith selection of this action:
Q_n = \frac{R_1 + R_2 + \cdots + R_{n-1}}{n - 1}
The problem with this technique
The amount of memory and computation needed to keep updating the estimates grows as more rewards arrive.
Therefore
We need something more efficient
We can achieve that by realizing the following:
Q_{n+1} = \frac{1}{n} \sum_{i=1}^{n} R_i
        = \frac{1}{n}\left[R_n + \sum_{i=1}^{n-1} R_i\right]
        = \frac{1}{n}\left[R_n + (n - 1) Q_n\right]
        = Q_n + \frac{1}{n}\left[R_n - Q_n\right]
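The last line of the derivation is exactly the memory-efficient update: only the current estimate and the count are stored, yet the full sample average is reproduced. A minimal sketch:

```python
# Incremental sample-average update: Q_{n+1} = Q_n + (1/n)(R_n - Q_n).
# Storing only q and n reproduces the full average without keeping all rewards.
def incremental_mean(rewards):
    q, n = 0.0, 0
    for r in rewards:
        n += 1
        q += (r - q) / n   # new estimate = old + step size * (target - old)
    return q

rewards = [1.0, 3.0, 2.0, 6.0]
assert abs(incremental_mean(rewards) - sum(rewards) / len(rewards)) < 1e-12
```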
General Formula in Reinforcement Learning
This pattern appears almost all the time in RL:
NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]
It looks a lot like gradient descent if we assume a first-order Taylor series:
f(x) \approx f(a) + f'(a)(x - a) \Longrightarrow f'(a) \approx \frac{f(x) - f(a)}{x - a}
Then
x_{n+1} = x_n + \gamma f'(a) = x_n + \gamma \frac{f(x) - f(a)}{x - a} = x_n + \gamma' \left[f(x) - f(a)\right]
where the factor 1/(x − a) has been absorbed into the step size γ'.
A Simple Bandit Algorithm
Initialize, for a = 1 to k:
1 Q(a) ← 0
2 N(a) ← 0
Loop until a stopping criterion is met:
1 A ← argmax_a Q(a) with probability 1 − ε, or a random action with probability ε
2 R ← bandit(A)
3 N(A) ← N(A) + 1
4 Q(A) ← Q(A) + (1/N(A)) [R − Q(A)]
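The algorithm above can be sketched in a few lines. The Gaussian reward arms and their means below are illustrative assumptions, not part of the original slides:

```python
import random

# Sketch of the simple epsilon-greedy bandit, under assumed Gaussian arms.
def run_bandit(means, steps=5000, eps=0.1, seed=0):
    rng = random.Random(seed)
    k = len(means)
    Q = [0.0] * k           # value estimates Q(a)
    N = [0] * k             # action counts N(a)
    for _ in range(steps):
        if rng.random() < eps:                     # explore
            a = rng.randrange(k)
        else:                                      # exploit
            a = max(range(k), key=lambda i: Q[i])
        r = rng.gauss(means[a], 1.0)               # R <- bandit(A)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                  # incremental update
    return Q, N

Q, N = run_bandit([0.2, 1.0, 0.5])
```

With enough steps, the arm with the highest true mean should accumulate most of the pulls.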
Remarks
Remember
This works well for a stationary problem.
For more on non-stationary problems
See “Reinforcement Learning: An Introduction” by Sutton and Barto, Chapter 2.
Dynamic Programming
As with many things in Computer Science
We love Divide et Impera...
Then
In Reinforcement Learning, the Bellman equation gives a recursive decomposition, and the value function stores and reuses solutions.
Policy Evaluation
For this, we assume as always
The problem can be solved with the Bellman Equation, starting from an initial guess v1.
Then, we apply the Bellman equation iteratively:
v_1 \rightarrow v_2 \rightarrow \cdots \rightarrow v_\pi
For this, we use synchronous updates
We update the value functions of the present iteration at the same time, based on those of the previous iteration.
Then
At each iteration k + 1, for all states s ∈ S
We update vk+1(s) from vk(s') according to the Bellman equation, where s' is a successor state of s:
v_{k+1}(s) = \sum_{a \in A} \pi(a \mid s)\left[R_s^a + \alpha \sum_{s' \in S} P_{ss'}^a v_k(s')\right]
Something Notable
The iterative process stops when the maximum difference between the value function at the current step and that at the previous step is smaller than some small positive constant.
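The synchronous sweep can be sketched as follows, assuming a hypothetical two-state problem whose dynamics have already been collapsed under the fixed policy:

```python
import numpy as np

# Iterative policy evaluation sketch: replace v_k by v_{k+1} until the
# largest change falls below a small threshold theta.
P_pi = np.array([[0.7, 0.3],     # P_pi[s][s'] under the fixed policy
                 [0.4, 0.6]])    # (illustrative numbers)
R_pi = np.array([1.0, 0.0])      # expected immediate reward per state
alpha, theta = 0.9, 1e-10

v = np.zeros(2)                          # initial guess v_1
while True:
    v_new = R_pi + alpha * P_pi @ v      # synchronous Bellman backup
    if np.max(np.abs(v_new - v)) < theta:
        break
    v = v_new
```

Because the backup is a contraction with factor α, the sweep converges geometrically to vπ.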
However
Our ultimate goal is to find the optimal policy
For this, we can use the policy iteration process.
Algorithm
1 Initialize π randomly.
2 Repeat until the previous policy and the current policy are the same:
1 Evaluate vπ by policy evaluation.
2 Using synchronous updates, for each state s, let
\pi'(s) = \arg\max_{a \in A} q(s, a) = \arg\max_{a \in A}\left[R_s^a + \alpha \sum_{s' \in S} P_{ss'}^a v_\pi(s')\right]
Now
Something Notable
This policy iteration algorithm will improve the policy at each step.
Assume that
We have a deterministic policy π and after one step we get π'.
We know that the value improves in one step because
q_\pi(s, \pi'(s)) = \max_{a \in A} q_\pi(s, a) \geq q_\pi(s, \pi(s)) = v_\pi(s)
Then
We can see that
v_\pi(s) \leq q_\pi(s, \pi'(s)) = \mathbb{E}_{\pi'}[R_{t+1} + \alpha v_\pi(S_{t+1}) \mid S_t = s]
\leq \mathbb{E}_{\pi'}[R_{t+1} + \alpha q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s]
\leq \mathbb{E}_{\pi'}[R_{t+1} + \alpha R_{t+2} + \alpha^2 q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s]
\leq \mathbb{E}_{\pi'}[R_{t+1} + \alpha R_{t+2} + \cdots \mid S_t = s] = v_{\pi'}(s)
Then
If improvements stop
q_\pi(s, \pi'(s)) = \max_{a \in A} q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s)
This means that the Bellman optimality equation has been satisfied:
v_\pi(s) = \max_{a \in A} q_\pi(s, a)
Therefore, vπ(s) = v∗(s) for all s ∈ S, and π is an optimal policy.
Once the policy has been improved
We can iteratively generate a sequence of monotonically improving policies and values:
\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*
with E = evaluation and I = improvement.
Policy Iteration (using iterative policy evaluation)
Step 1. Initialization
V(s) ∈ R and π(s) ∈ A(s), arbitrarily, for all s ∈ S
Step 2. Policy Evaluation
1 Loop:
2 ∆ ← 0
3 Loop for each s ∈ S:
4 v ← V(s)
5 V(s) ← \sum_{s', r} p(s', r \mid s, \pi(s))\left[r + \alpha V(s')\right]
6 ∆ ← max(∆, |v − V(s)|)
7 until ∆ < θ (a small threshold for accuracy)
Further
Step 3. Policy Improvement
1 policy-stable ← true
2 For each s ∈ S:
3 old-action ← π(s)
4 π(s) ← \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \alpha V(s')\right]
5 If old-action ≠ π(s), then policy-stable ← false
6 If policy-stable, then stop and return V ≈ v∗ and π ≈ π∗; else go to Step 2.
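The three steps can be sketched end to end. The two-state, two-action MDP below is illustrative, and for brevity exact linear-system evaluation replaces the iterative inner loop:

```python
import numpy as np

# Policy iteration sketch on a hypothetical 2-state, 2-action MDP.
# P[a][s][s'] and R[a][s] are illustrative numbers; alpha is the discount.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.9, 0.1]]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
alpha = 0.9

def evaluate(policy):
    # Exact policy evaluation: solve v = R_pi + alpha * P_pi v.
    P_pi = P[policy, np.arange(2)]
    R_pi = R[policy, np.arange(2)]
    return np.linalg.solve(np.eye(2) - alpha * P_pi, R_pi)

policy = np.zeros(2, dtype=int)              # arbitrary initial policy
while True:
    v = evaluate(policy)                     # Step 2: policy evaluation
    q = R + alpha * P @ v                    # q[a][s]
    new_policy = np.argmax(q, axis=0)        # Step 3: greedy improvement
    if np.array_equal(new_policy, policy):   # policy stable -> optimal
        break
    policy = new_policy
```

At termination the policy is greedy with respect to its own value function, so the Bellman optimality equation holds.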
Jack’s Car Rental
Setup
We have a car rental business with two locations (s1, s2).
Each time a car is available, Jack rents it out and is credited $10 by the company.
If Jack is out of cars at a location, that business is lost.
Returned cars become available for renting the next day.
Jack has options
He can move cars between the two locations overnight, at a cost of $2 per car, and at most 5 cars can be moved.
We choose a probability for modeling requests and returns
Renting at each location follows a Poisson distribution:
P(n \mid \lambda) = \frac{\lambda^n}{n!} e^{-\lambda}
We simplify the problem with an upper bound n = 20
Basically, any car beyond this limit simply disappears!!!
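The truncation can be sketched by folding the Poisson tail mass onto the cap; the helper name below is ours, not from the slides:

```python
from math import exp, factorial

# Truncated Poisson sketch: probabilities P(n | lam) for n = 0..cap,
# with all mass beyond the cap collapsed onto n = cap
# (extra cars "disappear" in the simplified model).
def truncated_poisson(lam, cap=20):
    p = [exp(-lam) * lam**n / factorial(n) for n in range(cap)]
    p.append(1.0 - sum(p))    # put the remaining tail mass on the cap
    return p

p = truncated_poisson(3.0)
```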
Finally
We have the following λ’s for requests and returns:
1 Requests at s1: λ_{req,1} = 3
2 Requests at s2: λ_{req,2} = 4
3 Returns at s1: λ_{ret,1} = 3
4 Returns at s2: λ_{ret,2} = 2
Each λ defines how many cars are requested or returned in a time interval (one day).
History
There is original earlier work
By Arthur Samuel, which Sutton used to invent Temporal Difference learning, TD(λ).
It was famously applied by Gerald Tesauro
To build TD-Gammon, a program that learned to play backgammon at the level of expert human players.
Temporal-Difference (TD) Learning
Something Notable
Possibly the central and novel idea of reinforcement learning.
It has similarity with Monte Carlo Methods
They use experience to solve the prediction problem (for more, look at Sutton’s book).
Monte Carlo methods wait until Gt is known
V(S_t) \leftarrow V(S_t) + \gamma\left[G_t - V(S_t)\right], \quad G_t = \sum_{k=0}^{\infty} \alpha^k R_{t+k+1}
Let us call this method constant-α Monte Carlo.
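The constant step-size Monte Carlo update can be sketched as follows, keeping the slides' symbols (γ as step size, α as discount). The episode format (state, following reward) is an assumption for illustration:

```python
# Constant step-size Monte Carlo sketch: after each episode, move V(S_t)
# toward the observed return G_t, computed backward through the episode.
def mc_update(V, episode, gamma=0.1, alpha=0.9):
    # episode: list of (state, reward) pairs in time order,
    # where reward is the one received after visiting state.
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + alpha * G             # return from this time step
        V[state] += gamma * (G - V[state]) # V(S_t) <- V(S_t) + gamma [G_t - V(S_t)]
    return V

V = mc_update({'s0': 0.0, 's1': 0.0}, [('s0', 1.0), ('s1', 2.0)])
```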
Fundamental Theorem of Simulation
Cumulative Distribution Function
With every random variable, we associate a function called the Cumulative Distribution Function (CDF), defined as follows:
F_X(x) = P(X \leq x)
With properties:
F_X(x) \geq 0
F_X(x) is a non-decreasing function of x.
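A standard consequence of the CDF definition is inverse-transform sampling: if U ~ Uniform(0, 1), then F⁻¹(U) has CDF F. A sketch using the exponential distribution, where the inverse CDF has a closed form:

```python
import random
from math import log

# Inverse-transform sampling sketch: the exponential CDF is
# F(x) = 1 - exp(-x), so F^{-1}(u) = -ln(1 - u).
def sample_exponential(rng):
    u = rng.random()          # U ~ Uniform(0, 1)
    return -log(1.0 - u)      # F^{-1}(U) ~ Exponential(1)

rng = random.Random(0)
samples = [sample_exponential(rng) for _ in range(100000)]
mean = sum(samples) / len(samples)   # should approach E[X] = 1
```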
Therefore
One thing made clear by this theorem is that we can generate X in three ways
1 First generate X ∼ f, and then U|X = x ∼ U {u | u ≤ f(x)}; but this is useless, because we already have X.
2 First generate U, and then X|U = u ∼ U {x | u ≤ f(x)}.
3 Generate (X, U) jointly, a smart idea: it allows us to generate data on a larger set where simulation is easier. Not as simple as it looks.
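Option 3 is the idea behind accept-reject sampling: simulate (X, U) uniformly on a larger, easy set and keep only the points that fall under f. A sketch, assuming an illustrative semicircle density:

```python
import random
from math import sqrt, pi

# Accept-reject sketch: draw a candidate x on an easy set (Uniform(lo, hi)),
# draw u ~ Uniform(0, M) with M >= max f, and keep x only when u <= f(x).
def rejection_sample(f, M, rng, lo=-1.0, hi=1.0):
    while True:
        x = rng.uniform(lo, hi)
        u = rng.uniform(0.0, M)
        if u <= f(x):
            return x

# Target: semicircle density f(x) = (2/pi) sqrt(1 - x^2) on [-1, 1].
f = lambda x: (2.0 / pi) * sqrt(1.0 - x * x)
rng = random.Random(1)
xs = [rejection_sample(f, 2.0 / pi, rng) for _ in range(20000)]
```

The acceptance rate is the area under f divided by the area of the enclosing box, so a tight bound M keeps the loop short.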
Markov Chains
The random process Xt ∈ S for t = 1, 2, ..., T has the Markov property
iff
p (XT |XT−1, XT−2, ..., X1) = p (XT |XT−1)
Finite-state Discrete-Time Markov Chains, |S| < ∞
It can be completely specified by the transition matrix
P = [pij] with pij = P [Xt = j|Xt−1 = i]
For irreducible chains, the stationary distribution π is the long-term
proportion of time that the chain spends in each state,
computed by solving π = πP .
82 / 111
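The fixed-point equation π = πP can be solved numerically by power iteration. A short NumPy sketch; the 3-state transition matrix is a made-up example:

```python
import numpy as np

# Transition matrix of a small irreducible 3-state chain (rows sum to 1).
P = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

# Power iteration: repeatedly apply pi <- pi P; for an irreducible,
# aperiodic chain this converges to the stationary distribution.
pi = np.full(3, 1.0 / 3.0)
for _ in range(1000):
    pi = pi @ P
```

Since each row of P sums to 1, every iterate remains a probability vector, and the limit satisfies π = πP.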
174. Images/cinvestav
Outline
1 Reinforcement Learning
Introduction
Example
A K-Armed Bandit Problem
Exploration vs Exploitation
The Elements of Reinforcement Learning
Limitations
2 Formalizing Reinforcement Learning
Introduction
Markov Process and Markov Decision Process
Return, Policy and Value function
3 Solution Methods
Multi-armed Bandits
Problems of Implementations
General Formula in Reinforcement Learning
A Simple Bandit Algorithm
Dynamic Programming
Policy Evaluation
Policy Iteration
Policy and Value Improvement
Example
Temporal-Difference Learning
A Small Review of Monte Carlo Methods
The Main Idea
The Monte Carlo Principle
Going Back to Temporal Difference
Q-learning: Off-Policy TD Control
4 The Neural Network Approach
Introduction, TD-Gammon
TD(λ) by Back-Propagation
Conclusions
83 / 111
175. Images/cinvestav
The Idea
Draw an i.i.d. set of samples x(1), ..., x(N)
from a target density p (x) on a high-dimensional space X
(the set of possible configurations of a system).
Thus, we can use those samples to approximate the target density
pN (x) = (1/N) ∑N
i=1 δx(i) (x)
where δx(i) (x) denotes the Dirac delta mass located at x(i).
84 / 111
180. Images/cinvestav
Consequently
Approximate the integrals
IN (f) = (1/N) ∑N
i=1 f (x(i)) −→ I (f) = ∫X f (x) p (x) dx a.s. as N → ∞
Where the variance of f (x) is assumed finite:
σ2
f = Ep(x) f2 (x) − I2 (f) < ∞
Therefore, the variance of the estimator IN (f) is
var (IN (f)) = σ2
f /N
86 / 111
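The Monte Carlo principle above fits in a few lines: draw samples, average f over them, and estimate the standard error from the empirical variance. A sketch where p is Uniform(0, 1) and f(x) = x², so I(f) = 1/3; both choices are mine, for illustration:

```python
import math
import random

random.seed(0)
N = 100_000

# Draw x(i) ~ p(x) = Uniform(0, 1) and evaluate f(x) = x**2.
values = [random.random() ** 2 for _ in range(N)]

# I_N(f): the Monte Carlo estimate of I(f) = 1/3.
I_N = sum(values) / N

# Empirical sigma^2_f = E[f^2(x)] - I^2(f); the estimator's variance
# is sigma^2_f / N, so the standard error shrinks like 1/sqrt(N).
sigma2_f = sum(v * v for v in values) / N - I_N ** 2
std_error = math.sqrt(sigma2_f / N)
```

The 1/√N rate is dimension-free, which is why Monte Carlo is attractive on high-dimensional spaces X.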
186. Images/cinvestav
What if we do not wait for Gt?
TD methods need to wait only until the next time step
At time t + 1, they use the observed reward Rt+1 and the estimate
V (St+1)
The simplest TD method
V (St) ← V (St) + α [Rt+1 + γV (St+1) − V (St)]
with step size α and discount γ, applied immediately on transition to
St+1 and receiving Rt+1.
89 / 111
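A minimal tabular TD(0) sketch on a toy 5-state random walk; the environment is my own illustrative choice. States 0 and 4 are terminal and reaching state 4 pays reward 1, so V(s) estimates the probability of finishing on the right:

```python
import random

random.seed(0)
alpha, gamma = 0.05, 1.0   # step size and discount
V = [0.0] * 5              # V of terminal states 0 and 4 stays 0

for _ in range(10_000):
    s = 2                  # every episode starts in the middle
    while s not in (0, 4):
        s_next = s + random.choice((-1, 1))
        r = 1.0 if s_next == 4 else 0.0
        # TD(0): move V(s) toward the target R + gamma * V(s_next),
        # immediately on the transition, without waiting for G_t.
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

# True values for states 1, 2, 3 are 0.25, 0.5, 0.75.
```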
188. Images/cinvestav
This method is called TD(0) or one-step TD
Actually
There are also n-step TD methods
Differences with the Monte Carlo Methods
In the Monte Carlo update, the target is Gt
In the TD update, the target is Rt+1 + γV (St+1)
90 / 111
190. Images/cinvestav
TD as a Bootstrapping Method
Quite similar to Dynamic Programming
Additionally, we know that
vπ (s) = Eπ [Gt|St = s]
= Eπ [Rt+1 + γGt+1|St = s]
= Eπ [Rt+1 + γvπ (St+1) |St = s]
Monte Carlo methods use an estimate of vπ (s) = Eπ [Gt|St = s]
Instead, Dynamic Programming methods use an estimate of
vπ (s) = Eπ [Rt+1 + γvπ (St+1) |St = s].
91 / 111
192. Images/cinvestav
Furthermore
The Monte Carlo target is an estimate
because the expected value Eπ [Gt|St = s] is not known; a sample
return is used instead
The DP target is an estimate
because vπ (St+1) is not known and the current estimate, V (St+1), is
used instead.
Here TD goes beyond and combines both ideas
It combines the sampling of Monte Carlo with the bootstrapping of
DP.
92 / 111
195. Images/cinvestav
As in many methods, there is an error
And it is used to improve the policy
δt = Rt+1 + γV (St+1) − V (St)
Because the TD error depends on the next state and next reward
It is not actually available until one time step later:
δt is the error in V (St), known at time t + 1
93 / 111
198. Images/cinvestav
Finally
If the array V does not change during the episode,
The Monte Carlo error can be written as a sum of TD errors
Gt − V (St) = Rt+1 + γGt+1 − V (St) + γV (St+1) − γV (St+1)
= δt + γ (Gt+1 − V (St+1))
= δt + γδt+1 + γ2 (Gt+2 − V (St+2))
= ∑T−1
k=t γk−t δk
94 / 111
199. Images/cinvestav
Optimality of TD(0)
Under batch updating
TD(0) converges deterministically to a single answer, independent of
the step-size parameter, as long as α is chosen to be sufficiently small.
The constant-α MC method, V (St) ← V (St) + α [Gt − V (St)], also
converges deterministically under the same conditions, but to a
different answer.
95 / 111
201. Images/cinvestav
Q-learning: Off-Policy TD Control
One of the early breakthroughs in reinforcement learning
The development of an off-policy TD control algorithm, Q-learning
(Watkins, 1989)
It is defined as
Q (St, At) ← Q (St, At) + α [Rt+1 + γ maxa Q (St+1, a) − Q (St, At)]
Here
The learned action-value function, Q, directly approximates q∗, the
optimal action-value function,
independent of the policy being followed.
97 / 111
204. Images/cinvestav
Furthermore
The policy still has an effect
It determines which state-action pairs are visited and updated.
Therefore
All that is required for correct convergence is that all pairs continue
to be updated
Under this assumption and a variant of the usual stochastic
approximation conditions,
Q has been shown to converge with probability 1 to q∗
98 / 111
207. Images/cinvestav
Q-learning (Off-Policy TD control) for estimating π ≈ π∗
Algorithm parameters
step size α ∈ (0, 1] and small ε > 0
Initialize
Initialize Q(s, a), for all s ∈ S+, a ∈ A, arbitrarily, except that
Q (terminal, ·) = 0
99 / 111
209. Images/cinvestav
Then
Loop for each episode
1 Initialize S
2 Loop for each step of episode:
3 Choose A from S using a policy derived from Q (for example,
ε-greedy)
4 Take action A, observe R, S′
Q (S, A) ← Q (S, A) + α [R + γ maxa Q (S′, a) − Q (S, A)]
5 S ← S′
6 Until S is terminal
100 / 111
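The loop above, written out for a toy 1-D corridor (states 0..5, actions left/right, reward 1 for reaching state 5); the environment is a made-up example, not from the slides:

```python
import random

random.seed(0)
n_states, n_actions = 6, 2          # actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(5, s + 1)
    reward = 1.0 if s2 == 5 else 0.0
    return s2, reward, s2 == 5      # state 5 is terminal

for _ in range(500):
    s, done = 0, False
    while not done:
        # Choose A from S with an epsilon-greedy policy derived from Q
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            best = max(Q[s])
            a = random.choice([i for i in range(n_actions) if Q[s][i] == best])
        s2, r, done = step(s, a)
        # Q-learning update: off-policy max over next actions
        target = r if done else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2

# The greedy policy should now move right in every non-terminal state.
```

Note the off-policy character: behavior is ε-greedy, but the target always uses the greedy maxa Q(S′, a).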
211. Images/cinvestav
One of the earliest successes
Backgammon is a major game
with numerous tournaments and regular world championship matches.
It is a stochastic game with a full description of the world
102 / 111
213. Images/cinvestav
What did they use?
TD-Gammon used a nonlinear form of TD (λ)
An estimated value ˆv (s, w) is used to estimate the probability of
winning from state s.
Then
To achieve this, rewards were defined as zero for all time steps except
those on which the game is won.
103 / 111
216. Images/cinvestav
The interesting part
Tesauro’s representation
For each point on the backgammon board,
four units indicate the number of white pieces on the point
Basically
If there were no white pieces, all four units took the value zero;
if there was one white piece, the first unit took the value one;
And so on...
Therefore, with four units for white and another four for black at each
of the 24 points,
We have 4 × 24 + 4 × 24 = 192 units
105 / 111
219. Images/cinvestav
Then
We have that
Two additional units encoded the number of white and black pieces
on the bar (each took the value n/2, with n the number on the bar)
Two others represented the number of black and white pieces already
borne off (these took the value n/15)
Two units indicated in a binary fashion whether it was white’s or
black’s turn to move.
106 / 111
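The slides' feature scheme can be sketched as follows. A hedge: Tesauro's actual fourth unit encoded (n − 3)/2 when more than three pieces were present; this sketch follows the simplified description above, and all function names are mine:

```python
def encode_point(n_pieces: int) -> list:
    """Simplified unary encoding of one color on one point:
    unit k is 1.0 if at least k + 1 pieces are present (capped at 4)."""
    return [1.0 if n_pieces >= k + 1 else 0.0 for k in range(4)]

def encode_board(white, black, white_bar, black_bar,
                 white_off, black_off, white_turn):
    """4 units x 24 points x 2 colors = 192 units, plus two bar units
    (n/2), two borne-off units (n/15), and two turn units."""
    features = []
    for n in white:                      # 24 point counts for white
        features += encode_point(n)
    for n in black:                      # 24 point counts for black
        features += encode_point(n)
    features += [white_bar / 2.0, black_bar / 2.0]
    features += [white_off / 15.0, black_off / 15.0]
    features += [1.0, 0.0] if white_turn else [0.0, 1.0]
    return features

x = encode_board([0] * 24, [0] * 24, 0, 0, 0, 0, True)  # 198 features
```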
221. Images/cinvestav
TD(λ) by Back-Propagation
TD-Gammon uses a semi-gradient form of TD (λ)
TD (λ) has the following general update rule:
wt+1 = wt + α [Rt+1 + γˆv (St+1, wt) − ˆv (St, wt)] zt
where wt is the vector of all modifiable parameters and
zt is a vector of eligibility traces
Eligibility traces unify and generalize TD and Monte Carlo methods.
They work as a short-term memory:
The rough idea is that when a component of wt participates in
producing an estimate, the corresponding component of zt is bumped
up and then begins to fade away.
108 / 111
224. Images/cinvestav
How is this done?
Update Rule
zt = γλzt−1 + ∇ˆv (St, wt) with z0 = 0
Something Notable
The gradient in this equation can be computed efficiently by the
back-propagation procedure
For this, we set the error in the ANN to
ˆv (St+1, wt) − ˆv (St, wt)
109 / 111
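For a linear value function ˆv(s, w) = w·φ(s), the gradient ∇ˆv is just φ(s), which makes the two update rules above easy to sketch; the 3-state toy problem and all constants are illustrative choices, not TD-Gammon's network:

```python
import numpy as np

np.random.seed(0)
alpha, gamma, lam = 0.05, 0.9, 0.8
n_states = 3
phi = np.eye(n_states)        # one-hot features: grad of v_hat is phi(s)
w = np.zeros(n_states)

for _ in range(2000):
    z = np.zeros(n_states)    # eligibility trace, z_0 = 0
    s = 0
    for _ in range(20):
        s2 = np.random.randint(n_states)   # uniform random transitions
        r = 1.0 if s2 == 2 else 0.0        # reward for entering state 2
        # Trace update: z_t = gamma * lambda * z_{t-1} + grad v_hat(S_t, w)
        z = gamma * lam * z + phi[s]
        # TD error, then the TD(lambda) weight update w += alpha * delta * z
        delta = r + gamma * w @ phi[s2] - w @ phi[s]
        w += alpha * delta * z
        s = s2

# All states share the same true value v = 1/3 + 0.9 * v, i.e. v = 10/3.
```

In TD-Gammon the one-hot φ(s) is replaced by the 198-unit board encoding and ∇ˆv is obtained by back-propagation through the network.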
228. Images/cinvestav
Conclusions
With these new tools, such as Reinforcement Learning,
the new Artificial Intelligence has a lot to do in the world
Thank you for being in this class
As they say, we have a lot to do...
111 / 111