Introduction to Artificial Intelligence
Reinforcement Learning
Andres Mendez-Vazquez
April 26, 2019
Outline
1 Reinforcement Learning
Introduction
Example
A K-Armed Bandit Problem
Exploration vs Exploitation
The Elements of Reinforcement Learning
Limitations
2 Formalizing Reinforcement Learning
Introduction
Markov Process and Markov Decision Process
Return, Policy and Value function
3 Solution Methods
Multi-armed Bandits
Problems of Implementations
General Formula in Reinforcement Learning
A Simple Bandit Algorithm
Dynamic Programming
Policy Evaluation
Policy Iteration
Policy and Value Improvement
Example
Temporal-Difference Learning
A Small Review of Monte Carlo Methods
The Main Idea
The Monte Carlo Principle
Going Back to Temporal Difference
Q-learning: Off-Policy TD Control
4 The Neural Network Approach
Introduction, TD-Gammon
TD(λ) by Back-Propagation
Conclusions
Reinforcement Learning is learning what to do
What does Reinforcement Learning want?
To maximize a numerical reward signal.
The learner is not told which actions to take;
it must discover which actions yield the most reward by trying them.
These ideas come from Dynamical Systems Theory
Specifically
As the optimal control of incompletely-known Markov Decision
Processes (MDPs).
Thus, the agent must do the following
Interact with the environment to learn by using punishments and
rewards.
A mobile robot
It decides whether it should enter a new room
in search of more trash to collect,
or go back to recharge its battery.
It needs to take this decision based on
the current charge level of its battery
and how quickly it can get back to its charging base.
A K-Armed Bandit Problem
Definition
You are faced repeatedly with a choice among k different options, or
actions.
After each choice you receive a numerical reward chosen
from a stationary probability distribution that depends on the selected action.
Definition (Stationary Probability)
A stationary distribution of a Markov chain is a probability distribution
that remains unchanged as the Markov chain progresses in time.
Example
A Two State System
Therefore
Each of the k actions has an expected value
q∗ (a) = E [Rt|At = a]
Something Notable
If you knew the value of each action,
It would be trivial to solve the k-armed bandit problem.
What do we want?
To find the estimated value of action a at time t, Qt (a), such that
|Qt (a) − q∗ (a)| < ε, with ε as small as possible
Greedy Actions
Intuitions
To choose the action with the greatest E [Rt | At = a]
This is known as Exploitation
You are exploiting your current knowledge of the values of the actions.
Contrasting with Exploration
When you select the non-greedy actions to obtain better knowledge
of their values.
An Interesting Conundrum
It has been observed that
Exploitation maximizes the expected reward on the single step,
while Exploration may produce the greater total reward in the long run.
Thus, the idea
The value of a state s is the total amount of reward an agent can
expect to accumulate over the future, starting from s.
Thus, Values
They indicate the long-term profit of states after taking into account
the states that follow and their rewards.
Therefore, the parts of a Reinforcement Learning system
A Policy
It defines the learning agent’s way of behaving at a given time.
The policy is the core of a reinforcement learning system.
In general, policies may be stochastic.
A Reward Signal
It defines the goal of a reinforcement learning problem.
The system’s sole objective is to maximize the total reward over the
long run.
This encourages Exploitation
It is immediate... you might want something more long-term.
A Value Function
It specifies what is good in the long run.
This encourages Exploration
Finally
A Model of the Environment
This simulates the behavior of the environment,
allowing the agent to make inferences about how the environment will behave.
Something Notable
Models are used for planning,
meaning decisions are made before actions are actually executed.
Thus, Reinforcement Learning methods that use models are called
model-based.
Limitations
Given that
Reinforcement learning relies heavily on the concept of state,
it has the same problem as Markov Decision Processes:
defining the state so as to obtain a good representation of the search
space.
This is where the creative part of the problem comes in
As we have seen, solutions in AI rely a lot on the creativity of the
solution designer...
The System–Environment Interaction
In a Markov Decision Process
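The slide's figure shows the agent–environment interaction loop of an MDP. As a rough sketch of that loop (env_step, policy, and the horizon below are placeholder assumptions, not defined in the slides):

```python
def interaction_loop(env_step, policy, s0, horizon=100):
    """Generic agent-environment loop: the agent acts, the environment answers
    with a reward and a next state, and the triple is recorded.

    env_step(s, a) -> (next_state, reward) and policy(s) -> action are
    placeholders the caller must supply.
    """
    s, trajectory = s0, []
    for _ in range(horizon):
        a = policy(s)                # agent selects an action
        s_next, r = env_step(s, a)   # environment responds
        trajectory.append((s, a, r))
        s = s_next
    return trajectory
```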
Markov Process and Markov Decision Process
Definition of Markov Process
A sequence of states is Markov if and only if the probability of moving
to the next state st+1 depends only on the present state st:
P [st+1 | st] = P [st+1 | s1, ..., st]
Something Notable
We always talk about a time-homogeneous Markov chain in
Reinforcement Learning:
the probability of the transition is independent of t,
P [st+1 = s′ | st = s] = P [st = s′ | st−1 = s]
Formally
Definition (Markov Process)
A Markov Process (or Markov Chain) is a tuple (S, P), where
1 S is a finite set of states
2 P is a state transition probability matrix, P_{ss′} = P [st+1 = s′ | st = s]
Thus, the dynamic of the Markov process is as follows:
s0 −→ s1 −→ s2 −→ s3 −→ · · · , with transition probabilities P_{s0 s1}, P_{s1 s2}, P_{s2 s3}, . . .
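As a small illustration of such a chain (the two-state transition matrix below is an invented example, not from the slides), a trajectory can be sampled directly from P:

```python
import random

# Hypothetical 2-state chain: states 0 and 1, each row of P sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

def sample_chain(P, s0, steps=10):
    """Sample s0 -> s1 -> s2 -> ... using the transition matrix P."""
    s, path = s0, [s0]
    for _ in range(steps):
        # Draw the next state s' with probability P[s][s']
        s = random.choices(range(len(P)), weights=P[s])[0]
        path.append(s)
    return path

print(sample_chain(P, s0=0))
```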
Then, we get the following familiar definition
Definition
An MDP is a tuple ⟨S, A, P, R, α⟩
1 S is a finite set of states
2 A is a finite set of actions
3 P is a state transition probability matrix,
P^a_{ss′} = P [st+1 = s′ | st = s, At = a]
4 α ∈ [0, 1] is called the discount factor
5 R : S × A → R is a reward function
Thus, the dynamic of the Markov Decision Process is as follows, using
a probability to pick an action:
s0 −a0−→ s1 −a1−→ s2 −a2−→ s3 −→ · · ·
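A minimal way to hold such an MDP in code, assuming a tabular representation (the two-state, two-action numbers below are invented for illustration; they are not from the slides):

```python
# Hypothetical 2-state, 2-action MDP: P[s][a] is a list of (s_next, prob),
# R[s][a] is the expected immediate reward, alpha is the discount factor.
P = {
    0: {0: [(0, 0.8), (1, 0.2)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)],           1: [(0, 0.3), (1, 0.7)]},
}
R = {0: {0: 1.0, 1: 0.0},
     1: {0: 5.0, 1: 2.0}}
alpha = 0.9

def expected_next_value(s, a, v):
    """One-step lookahead: R(s, a) + alpha * sum_s' P(s'|s, a) * v(s')."""
    return R[s][a] + alpha * sum(p * v[s2] for s2, p in P[s][a])
```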
The Goal of Reinforcement Learning
We want to maximize the expected value of the return
i.e. to obtain the optimal policy!!!
Thus, as in our previous study of MDP’s:
Gt = Rt+1 + αRt+2 + · · · = Σ_{k=0}^{∞} α^k Rt+k+1
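For a finite episode the same sum can be computed directly; a tiny sketch (the reward sequence below is made up):

```python
def discounted_return(rewards, alpha=0.9):
    """G_t = sum_k alpha^k * R_{t+k+1} for a finite reward sequence."""
    return sum((alpha ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0, 5.0]))  # 1 + 0 + 0.81*2 + 0.729*5
```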
Remarks
Interpretation
The discounted future rewards can be interpreted as the current
value of future rewards.
The discount factor α
α close to 0 leads to “short-sighted” evaluation... you are only
looking for immediate rewards!!!
α close to 1 leads to “long-sighted” evaluation... you are willing to
wait for a better reward!!!
Why use this discount factor?
Basically
1 The discount factor avoids infinite returns in cyclic Markov processes,
given the properties of the geometric series.
2 There is some uncertainty about future rewards.
3 If the reward is financial, immediate rewards may earn more interest
than delayed rewards.
4 Animal/human behavior shows a preference for immediate reward.
A Policy
Using our MDP ideas...
A policy π is a distribution over actions given states
π (a|s) = P [At = a|St = s]
A policy guides the choice of action at a given state.
And it is time independent
Then, we introduce two value functions
State-value function.
Action-value function.
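A stochastic tabular policy can be represented and sampled directly; a minimal sketch (the states, action names, and probabilities below are illustrative, not from the slides):

```python
import random

# pi[s][a] = probability of taking action a in state s (each row sums to 1).
pi = {0: {"left": 0.7, "right": 0.3},
      1: {"left": 0.1, "right": 0.9}}

def sample_action(pi, s):
    """Draw an action according to pi(.|s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs)[0]

print(sample_action(pi, 0))
```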
The state-value function vπ of an MDP
It is defined as
vπ (s) = Eπ [Gt | St = s]
This gives the long-term value of the state s:
vπ (s) = Eπ [Gt | St = s]
= Eπ [ Σ_{k=0}^{∞} α^k Rt+k+1 | St = s ]
= Eπ [Rt+1 + αGt+1 | St = s]
= Eπ [Rt+1 | St = s]  (immediate reward)  +  Eπ [αvπ (St+1) | St = s]  (discounted value of the successor state)
The Action-Value Function qπ (s, a)
It is the expected return starting from state s,
taking action a, and then following policy π:
qπ (s, a) = Eπ [Gt | St = s, At = a]
We can also obtain the following decomposition
qπ (s, a) = Eπ [Rt+1 + αqπ (St+1, At+1) | St = s, At = a]
Now...
To simplify notation, we define
R^a_s = E [Rt+1 | St = s, At = a]
Then, we have
vπ (s) = Σ_{a∈A} π (a|s) qπ (s, a)
qπ (s, a) = R^a_s + α Σ_{s′∈S} P^a_{ss′} vπ (s′)
Then
Expressing qπ (s, a) in terms of vπ (s) in the expression of vπ (s) gives
the Bellman Equation:
vπ (s) = Σ_{a∈A} π (a|s) [ R^a_s + α Σ_{s′∈S} P^a_{ss′} vπ (s′) ]
The Bellman equation relates the state-value function
of one state with that of other states.
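For a fixed policy the Bellman equation is linear in vπ, so for small state spaces it can be solved directly as v = (I − αPπ)⁻¹ Rπ. A sketch under that assumption (the policy-averaged matrix Pπ and reward vector Rπ below are invented toy numbers):

```python
import numpy as np

alpha = 0.9
# Policy-averaged dynamics: P_pi[s, s'] = sum_a pi(a|s) P^a_{ss'},
# R_pi[s] = sum_a pi(a|s) R^a_s  (toy 2-state values).
P_pi = np.array([[0.8, 0.2],
                 [0.4, 0.6]])
R_pi = np.array([1.0, 2.0])

# v_pi = (I - alpha * P_pi)^{-1} R_pi, the exact solution of the Bellman equation.
v_pi = np.linalg.solve(np.eye(2) - alpha * P_pi, R_pi)
print(v_pi)
```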
Also
Bellman equation for qπ (s, a):
qπ (s, a) = R^a_s + α Σ_{s′∈S} P^a_{ss′} Σ_{a′∈A} π (a′|s′) qπ (s′, a′)
How do we solve this problem?
We have the following solutions
Action-Value Methods
One natural way to estimate this is by averaging the rewards actually
received:
Qt (a) = Σ_{i=1}^{t−1} Ri · I{Ai = a}  /  Σ_{i=1}^{t−1} I{Ai = a}
Nevertheless
If the denominator is zero, we define Qt (a) by some default value.
Now, as the denominator goes to infinity, by the law of large numbers,
Qt (a) −→ q∗ (a)
Furthermore
We call this the sample-average method
Not the best, but it allows us to introduce a way to select actions...
A classic one is the greedy selection
At = arg max_a Qt (a)
Therefore
Greedy action selection always exploits current knowledge to
maximize immediate reward.
It spends no time on apparently inferior actions to see if they might really be
better.
And Here a Heuristic
A simple alternative is to act greedily most of the time
Every once in a while, say with small probability ε, select randomly
from among all the actions with equal probability.
This is called the ε-greedy method
In the limit to infinity, this method ensures:
Qt (a) −→ q∗ (a)
It is like the Mean Filter
We have a nice result
Example
In the following example,
the q∗ (a) were selected according to a Gaussian distribution with
mean 0 and variance 1.
Then, when an action is selected, the reward Rt was drawn from a
normal distribution with mean q∗ (At) and variance 1.
This gives a testbed of k reward distributions.
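A sketch of this testbed, under the Gaussian assumptions just described (k = 10 and the function name bandit are choices made here, echoing the pseudocode later in these slides):

```python
import random

k = 10
# True action values q*(a), each drawn from N(0, 1).
q_star = [random.gauss(0, 1) for _ in range(k)]

def bandit(a):
    """Reward for selecting action a: drawn from N(q*(a), 1)."""
    return random.gauss(q_star[a], 1)
```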
Now, as the sampling goes up
We get
Imagine the following
Let Ri denote the reward received after the ith selection of this action, then
Qn = (R1 + R2 + · · · + Rn−1) / (n − 1)
The problem of this technique
The amount of memory and computational power needed to keep updating
the estimates as more rewards arrive.
Therefore
We need something more efficient
We can do something like that by realizing the following:
Qn+1 = (1/n) Σ_{i=1}^{n} Ri
= (1/n) [ Rn + Σ_{i=1}^{n−1} Ri ]
= (1/n) [Rn + (n − 1) Qn]
= Qn + (1/n) [Rn − Qn]
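This incremental form needs only the previous estimate and the count; a minimal sketch of the update (the variable names are mine):

```python
def update_estimate(Q, n, reward):
    """Incremental sample average: Q_{n+1} = Q_n + (1/n) * (R_n - Q_n).

    Q is the estimate after n-1 rewards, n is the index of the new reward.
    """
    return Q + (reward - Q) / n

Q, rewards = 0.0, [2.0, 4.0, 6.0]
for n, r in enumerate(rewards, start=1):
    Q = update_estimate(Q, n, r)
print(Q)  # 4.0, the plain average of the rewards
```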
General Formula in Reinforcement Learning
This pattern appears almost all the time in RL:
New Estimate = Old Estimate + Step Size [Target − Old Estimate]
It looks a lot like gradient descent if we assume a first-order Taylor series:
f (x) = f (a) + f′ (a) (x − a) =⇒ f′ (a) = (f (x) − f (a)) / (x − a)
Then
xn+1 = xn + γ f′ (a) = xn + γ (f (x) − f (a)) / (x − a) = xn + γ′ [f (x) − f (a)]
(absorbing the factor (x − a) into the step size)
A Simple Bandit Algorithm
Initialize, for a = 1 to k:
1 Q (a) ← 0
2 N (a) ← 0
Loop until a certain threshold γ is reached:
1 A ← arg max_a Q (a) with probability 1 − ε, or a random action with probability ε
2 R ← bandit (A)
3 N (A) ← N (A) + 1
4 Q (A) ← Q (A) + (1 / N (A)) [R − Q (A)]
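A runnable sketch of this loop, reusing the bandit() reward function sketched earlier (the values of ε and the step budget, used here in place of the threshold γ, are arbitrary assumptions):

```python
import random

def simple_bandit(bandit_fn, k, eps=0.1, steps=1000):
    """epsilon-greedy action selection with incremental sample-average updates."""
    Q = [0.0] * k   # value estimates Q(a)
    N = [0] * k     # selection counts N(a)
    for _ in range(steps):
        if random.random() < eps:
            A = random.randrange(k)                 # explore: random action
        else:
            A = max(range(k), key=lambda a: Q[a])   # exploit: greedy action
        R = bandit_fn(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]                   # Q(A) <- Q(A) + (1/N(A))[R - Q(A)]
    return Q
```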
Remarks
Remember
This works well for a stationary problem.
For more on non-stationary problems, see
“Reinforcement Learning: An Introduction” by Sutton and Barto,
chapter 2.
Dynamic Programming
As in many things in Computer Science
We love Divide et Impera...
Then
In Reinforcement Learning, the Bellman equation gives a recursive
decomposition, and the value function stores and reuses sub-solutions.
Policy Evaluation
For this, we assume as always
that the problem can be solved with the Bellman Equation,
and we start from an initial guess v1.
Then, we apply the Bellman Equation iteratively
v1 → v2 → · · · → vπ
For this, we use synchronous updates
We update the value functions of the present iteration at the same
time, based on those of the previous iteration.
Then
At each iteration k + 1, for all states s ∈ S,
we update vk+1 (s) from vk (s′) according to the Bellman Equation, where
s′ is a successor state of s:
vk+1 (s) = Σ_{a∈A} π (a|s) [ R^a_s + α Σ_{s′∈S} P^a_{ss′} vk (s′) ]
Something Notable
The iterative process stops when the maximum difference between the
value function at the current step and that at the previous step is
smaller than some small positive constant.
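In code, one synchronous sweep per iteration looks roughly like this (a sketch assuming the tabular dictionaries P, R, alpha from the earlier MDP sketch and a stochastic policy pi[s][a] over the same action keys):

```python
def policy_evaluation(pi, P, R, alpha=0.9, theta=1e-6):
    """Sweep v_{k+1}(s) = sum_a pi(a|s) [R(s,a) + alpha sum_s' P(s'|s,a) v_k(s')]."""
    v = {s: 0.0 for s in P}
    while True:
        delta, v_new = 0.0, {}
        for s in P:
            v_new[s] = sum(
                pi[s][a] * (R[s][a] + alpha * sum(p * v[s2] for s2, p in P[s][a]))
                for a in P[s]
            )
            delta = max(delta, abs(v_new[s] - v[s]))
        v = v_new                      # synchronous update of all states
        if delta < theta:
            return v
```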
However
Our ultimate goal is to find the optimal policy
For this, we can use the policy iteration process.
Algorithm
1 Initialize π randomly
2 Repeat until the previous policy and the current policy are the same:
1 Evaluate vπ by policy evaluation.
2 Using synchronous updates, for each state s, let
π (s) ← arg max_{a∈A} q (s, a) = arg max_{a∈A} [ R^a_s + α Σ_{s′∈S} P^a_{ss′} vπ (s′) ]
Now
Something Notable
This policy iteration algorithm improves the policy at each step.
Assume that
We have a deterministic policy π and after one improvement step we get π′.
We know that the value improves in one step because
qπ (s, π′ (s)) = max_{a∈A} qπ (s, a) ≥ qπ (s, π (s)) = vπ (s)
Then
We can see that
vπ (s) ≤ qπ (s, π′ (s)) = Eπ′ [Rt+1 + αvπ (St+1) | St = s]
≤ Eπ′ [Rt+1 + αqπ (St+1, π′ (St+1)) | St = s]
≤ Eπ′ [Rt+1 + αRt+2 + α² qπ (St+2, π′ (St+2)) | St = s]
...
≤ Eπ′ [Rt+1 + αRt+2 + · · · | St = s] = vπ′ (s)
Then
If improvements stop,
qπ (s, π′ (s)) = max_{a∈A} qπ (s, a) = qπ (s, π (s)) = vπ (s)
This means that the Bellman optimality equation has been satisfied:
vπ (s) = max_{a∈A} qπ (s, a)
Therefore, vπ (s) = v∗ (s) for all s ∈ S, and π is an optimal policy.
Once the policy has been improved
We can iteratively generate a sequence of monotonically improving policies and
value functions:
π0 −E→ vπ0 −I→ π1 −E→ vπ1 −I→ π2 −E→ · · · −I→ π∗ −E→ v∗
with E = evaluation and I = improvement.
Policy Iteration (using iterative policy evaluation)
Step 1. Initialization
V (s) ∈ R and π (s) ∈ A (s) arbitrarily for all s ∈ S
Step 2. Policy Evaluation
1 Loop:
2   ∆ ← 0
3   Loop for each s ∈ S:
4     v ← V (s)
5     V (s) ← Σ_{s′,r} p (s′, r | s, π (s)) [r + αV (s′)]
6     ∆ ← max (∆, |v − V (s)|)
7 until ∆ < θ (a small threshold for accuracy)
Further
Step 3. Policy Improvement
1 policy-stable ← True
2 For each s ∈ S:
3   old_action ← π (s)
4   π (s) ← arg max_a Σ_{s′,r} p (s′, r | s, a) [r + αV (s′)]
5   If old_action ≠ π (s), then policy-stable ← False
6 If policy-stable, then stop and return V ≈ v∗ and π ≈ π∗; else go to Step 2.
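Putting the steps together, a compact sketch of policy iteration on the toy dictionary MDP used in the earlier sketches (deterministic greedy policies; P, R, alpha are the assumed tabular objects, not from the slides):

```python
def policy_iteration(P, R, alpha=0.9, theta=1e-6):
    """Alternate policy evaluation and greedy policy improvement until stable."""
    states = list(P)
    pi = {s: next(iter(P[s])) for s in states}      # arbitrary initial policy
    V = {s: 0.0 for s in states}

    def q(s, a):                                    # one-step lookahead
        return R[s][a] + alpha * sum(p * V[s2] for s2, p in P[s][a])

    while True:
        # Step 2: policy evaluation for the current deterministic policy
        while True:
            delta = 0.0
            for s in states:
                v_old = V[s]
                V[s] = q(s, pi[s])
                delta = max(delta, abs(v_old - V[s]))
            if delta < theta:
                break
        # Step 3: greedy policy improvement
        stable = True
        for s in states:
            best = max(P[s], key=lambda a: q(s, a))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return V, pi
```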
Jack’s Car Rental
Setup
We have a car rental business with two locations (s1, s2).
Each time a car is available, Jack rents it out and is credited $10 by
the company.
If Jack is out of cars at a location, then that business is lost.
Cars become available for renting the day after they are returned.
Jack has options
He can move cars between the two locations overnight, at a cost of
$2 per car moved, and at most 5 of them can be moved.
We choose a probability for modeling requests and returns
Additionally, renting at each location follows a Poisson distribution:
P (n | λ) = (λ^n / n!) e^{−λ}
We simplify the problem by having an upper bound n = 20
Basically, any extra car generated by the returns simply disappears!!!
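A small helper for the truncated Poisson probabilities used in this model (the truncation at n = 20 follows the slide; lumping the remaining tail mass into the last bucket is my assumption):

```python
import math

def poisson_pmf(n, lam):
    """P(n | lambda) = lambda^n / n! * exp(-lambda)."""
    return (lam ** n) / math.factorial(n) * math.exp(-lam)

def truncated_poisson(lam, n_max=20):
    """Probabilities for n = 0..n_max, with the tail mass placed on n_max."""
    probs = [poisson_pmf(n, lam) for n in range(n_max + 1)]
    probs[-1] += 1.0 - sum(probs)   # lump the tail so the probabilities sum to 1
    return probs

print(sum(truncated_poisson(3)))    # 1.0
```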
Finally
We have the following λ’s for requests and returns
1 Requests at s1: λ = 3
2 Requests at s2: λ = 4
3 Returns at s1: λ = 3
4 Returns at s2: λ = 2
Each λ defines how many cars are requested or returned in a time
interval (one day).
69 / 111
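
To make these modeling choices concrete, here is a small sketch (not from the lecture; dropping the Poisson tail beyond the bound is an assumption of this note) that builds the truncated Poisson pmf for the λ values above and reports the expected number of daily requests at each location.

import math

def truncated_poisson(lam, n_max=20):
    # P(n | lam) = lam^n e^{-lam} / n!  for n = 0..n_max; the tail beyond n_max is
    # simply dropped, mirroring the simplification made on the slide.
    return [lam ** n * math.exp(-lam) / math.factorial(n) for n in range(n_max + 1)]

# Lambda values taken from the slide (events per day)
LAM_REQUEST = {"s1": 3, "s2": 4}
LAM_RETURN = {"s1": 3, "s2": 2}

for loc in ("s1", "s2"):
    pmf = truncated_poisson(LAM_REQUEST[loc])
    expected_requests = sum(n * p for n, p in enumerate(pmf))
    print(loc, "expected daily requests ~", round(expected_requests, 3))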
Images/cinvestav
Thus, we have the following run of policy iteration
With a discount factor α = 0.9
70 / 111
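
To make the structure of that computation concrete, here is a minimal, scaled-down sketch of the evaluate-and-improve loop behind such a run. The reduced car limit, the move limit of 3, the coarse Poisson truncation, and the small number of sweeps are assumptions made here only so the sketch runs quickly; it is not the original experiment.

import math
from functools import lru_cache
from itertools import product

# Scaled-down sizes (assumptions of this sketch) so it finishes in a few seconds
MAX_CARS, MAX_MOVE, DISCOUNT = 6, 3, 0.9
LAM = {"req1": 3, "ret1": 3, "req2": 4, "ret2": 2}
CUT = 8                                   # truncate the Poisson sums at this many events

def poisson(n, lam):
    return lam ** n * math.exp(-lam) / math.factorial(n)

@lru_cache(maxsize=None)
def day_outcomes(cars, lam_req, lam_ret):
    """All (probability, reward, next_cars) outcomes for one location for one day."""
    out = []
    for req, ret in product(range(CUT), repeat=2):
        p = poisson(req, lam_req) * poisson(ret, lam_ret)
        rented = min(req, cars)
        out.append((p, 10 * rented, min(cars - rented + ret, MAX_CARS)))
    return out

def q_value(state, action, V):
    """Expected return of moving `action` cars from location 1 to location 2 overnight."""
    n1, n2 = state
    a = max(-min(n2, MAX_MOVE), min(action, n1, MAX_MOVE))   # cannot move unavailable cars
    n1, n2 = min(n1 - a, MAX_CARS), min(n2 + a, MAX_CARS)
    total = -2 * abs(a)                                      # $2 per car moved
    for p1, r1, s1 in day_outcomes(n1, LAM["req1"], LAM["ret1"]):
        for p2, r2, s2 in day_outcomes(n2, LAM["req2"], LAM["ret2"]):
            total += p1 * p2 * (r1 + r2 + DISCOUNT * V[(s1, s2)])
    return total

states = [(i, j) for i in range(MAX_CARS + 1) for j in range(MAX_CARS + 1)]
V = {s: 0.0 for s in states}
policy = {s: 0 for s in states}
for _ in range(3):                         # a few rounds of policy iteration
    for _ in range(6):                     # truncated policy evaluation sweeps
        for s in states:
            V[s] = q_value(s, policy[s], V)
    for s in states:                       # greedy policy improvement
        policy[s] = max(range(-MAX_MOVE, MAX_MOVE + 1), key=lambda a: q_value(s, a, V))

print(policy[(MAX_CARS, 0)], policy[(0, MAX_CARS)])   # moves chosen in two extreme states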
Images/cinvestav
History
There is earlier original work
By Arthur Samuel, which Sutton built upon to invent Temporal
Difference learning, TD(λ)
It was famously applied by Gerald Tesauro
To build TD-Gammon
A program that learned to play backgammon at the level of expert
human players
72 / 111
Images/cinvestav
Temporal-Difference (TD) Learning
Something Notable
It is possibly the central and novel idea of reinforcement learning
It has similarities with Monte Carlo Methods
They use experience to solve the prediction problem (for more, look
at Sutton’s book)
Monte Carlo methods wait until Gt is known
V (St) ← V (St) + γ [Gt − V (St)] , with Gt = Σ_{k=0}^{∞} α^k R_{t+k+1}
Let us call this method constant-α Monte Carlo.
73 / 111
Images/cinvestav
Fundamental Theorem of Simulation
Cumulative Distribution Function
With every random variable, we associate a function called the
Cumulative Distribution Function (CDF), which is defined as follows:
FX(x) = P(X ≤ x)
With properties:
FX(x) ≥ 0
FX(x) is a non-decreasing function of x.
75 / 111
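
One standard way the CDF is used in simulation, added here as an illustrative aside rather than something spelled out on the slide, is inverse-transform sampling: if U ∼ Uniform(0, 1), then F^{−1}(U) has CDF F. A minimal sketch for the exponential distribution, whose inverse CDF has a closed form:

import math, random

def sample_exponential(lam, n=10000, seed=0):
    # Inverse-transform sampling: F(x) = 1 - exp(-lam * x), so F^{-1}(u) = -ln(1 - u) / lam
    rng = random.Random(seed)
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

samples = sample_exponential(lam=2.0)
print("empirical mean:", sum(samples) / len(samples), "(the true mean is 1/lam = 0.5)")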
Images/cinvestav
Graphically
Example Uniform Distribution
76 / 111
Images/cinvestav
Graphically
Example Gaussian Distribution
77 / 111
Images/cinvestav
Fundamental Theorem of Simulation
Theorem
Simulating
X ∼ f (x)
It is equivalent to simulating
(X, U) ∼ U ((x, u) |0 < u < f (x))
79 / 111
Images/cinvestav
Geometrically
We have the region under the graph of f (x)
Proof
We have that f (x) = ∫_0^{f(x)} du, i.e., f is the marginal pdf of (X, U).
80 / 111
Images/cinvestav
Therefore
One thing that is made clear by this theorem is that we can generate
X in three ways
1 First generate X ∼ f, and then U|X = x ∼ U {u|u ≤ f (x)}, but
this is useless here because it already requires being able to sample X ∼ f.
2 First generate U, and then X|U = u ∼ U {x|u ≤ f (x)}.
3 Generate (X, U) jointly, a smart idea:
It allows us to generate the pair uniformly on a larger set where
simulation is easier.
Not as simple as it looks.
81 / 111
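
Option 3 is essentially the accept–reject idea: draw (x, u) uniformly on a larger, easy set (a box containing the region under f) and keep the pairs that fall under f. A minimal sketch, assuming a Beta(2, 2) target on [0, 1] chosen purely for illustration:

import random

def accept_reject(f, m, n_samples=10000, seed=0):
    """Draw (x, u) uniformly on the box [0, 1] x [0, m] (a set containing the region
    under f) and keep pairs with u < f(x); the accepted x values are samples from f."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n_samples:
        x, u = rng.random(), m * rng.random()
        if u < f(x):
            samples.append(x)
    return samples

f = lambda x: 6.0 * x * (1.0 - x)      # the Beta(2, 2) density on [0, 1]; its maximum is 1.5
xs = accept_reject(f, m=1.5)
print("empirical mean:", sum(xs) / len(xs), "(the theoretical mean is 0.5)")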
Images/cinvestav
Markov Chains
The random process Xt ∈ S for t = 1, 2, ..., T has the Markov property
iff
p (XT |XT−1, XT−2, ..., X1) = p (XT |XT−1)
Finite-state Discrete Time Markov Chains, |S| < ∞
They can be completely specified by the transition matrix
P = [pij] with pij = P [Xt = j|Xt−1 = i]
For irreducible chains, the stationary distribution π is the long-term
proportion of time that the chain spends in each state.
It is computed by solving π = πP.
82 / 111
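
A small sketch (illustrative only; the transition matrix below is made up) of approximating the stationary distribution by repeatedly applying π ← πP:

def stationary_distribution(P, iters=500):
    """Approximate pi = pi P by repeatedly applying the transition matrix to an
    arbitrary starting distribution (valid for irreducible, aperiodic chains)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# A made-up 3-state transition matrix (each row sums to 1), purely for illustration
P = [[0.5, 0.4, 0.1],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]
print([round(p, 4) for p in stationary_distribution(P)])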
Images/cinvestav
The Idea
Draw an i.i.d. set of samples {x^(i)}_{i=1}^N
From a target density p (x) defined on a high-dimensional space X,
the set of possible configurations of a system.
Thus, we can use those samples to approximate the target density
pN (x) = (1/N) Σ_{i=1}^N δ_{x^(i)} (x)
where δ_{x^(i)} (x) denotes the Dirac delta mass located at x^(i).
84 / 111
Images/cinvestav
Thus
Given the Dirac delta
δ (t) = 0 for t ≠ 0 and δ (t) = ∞ for t = 0
with ∫_{t1}^{t2} δ(t) dt = 1 for any interval containing 0
Therefore
δ (t) = lim_{σ→0} (1/(√(2π) σ)) e^{−t²/(2σ²)}
85 / 111
Images/cinvestav
Consequently
Approximate the integrals
IN (f) = (1/N) Σ_{i=1}^N f (x^(i)) → I (f) = ∫_X f (x) p (x) dx almost surely as N → ∞
Where the variance of f (x) is
σ²_f = E_{p(x)} [f² (x)] − I² (f) < ∞
Therefore, the variance of the estimator IN (f) is
var (IN (f)) = σ²_f / N
86 / 111
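
A quick numerical illustration of these two facts, with a target and test function chosen here only for convenience (X ∼ Uniform(0, 1) and f(x) = x², so I(f) = 1/3):

import random, statistics

def mc_estimate(f, sampler, n, seed=0):
    rng = random.Random(seed)
    values = [f(sampler(rng)) for _ in range(n)]
    # I_N(f) and an estimate of var(I_N(f)) = sigma_f^2 / N
    return statistics.mean(values), statistics.variance(values) / n

f = lambda x: x * x                      # I(f) = E[X^2] = 1/3 for X ~ Uniform(0, 1)
sampler = lambda rng: rng.random()
for n in (100, 10000):
    est, var = mc_estimate(f, sampler, n)
    print(f"N={n:6d}  I_N(f)={est:.4f}  var(I_N)~{var:.2e}")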
Images/cinvestav
Then
By Robert & Casella 1999, Section 3.2
√N (IN (f) − I (f)) ⇒ N (0, σ²_f) as N → ∞
87 / 111
Images/cinvestav
What if we do not wait for Gt
TD methods need to wait only until the next time step
At time t + 1, it uses the observed reward Rt+1 and the estimate
V (St+1)
The simplest TD method
V (St) ← V (St) + γ [Rt+1 + αV (St+1) − V (St)]
This update is made immediately on transition to St+1, upon receiving Rt+1.
89 / 111
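
A tabular sketch of this update on the classic five-state random walk (the example is borrowed from Sutton's book, not from these slides). Note that the sketch keeps the slide's convention that γ plays the role of the step size and α the role of the discount factor:

import random

def td0_random_walk(episodes=200, step=0.1, discount=1.0, seed=0):
    """Tabular TD(0) on the 5-state random walk (states 1..5, terminals 0 and 6).
    The reward is +1 when terminating on the right and 0 otherwise, so the true values
    of states 1..5 are 1/6, ..., 5/6.  Following the slide's notation, `step` plays the
    role of gamma and `discount` the role of alpha."""
    rng = random.Random(seed)
    V = [0.0] * 7                          # V[0] and V[6] are terminal and stay at 0
    for _ in range(episodes):
        s = 3                              # every episode starts in the middle state
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == 6 else 0.0
            # TD(0):  V(St) <- V(St) + step * [R_{t+1} + discount * V(St+1) - V(St)]
            V[s] += step * (r + discount * V[s_next] - V[s])
            s = s_next
    return V[1:6]

print([round(v, 2) for v in td0_random_walk()])   # approaches [0.17, 0.33, 0.5, 0.67, 0.83]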
Images/cinvestav
This method is called TD(0) or one-step TD
Actually
There are also n-step TD methods
Differences with the Monte Carlo Methods
In the Monte Carlo update, the target is Gt
In the TD update, the target is Rt+1 + αV (St+1)
90 / 111
Images/cinvestav
TD as a Bootstrapping Method
Quite similar to Dynamic Programming
Additionally, we know that
vπ (s) = Eπ [Gt|St = s]
= Eπ [Rt+1 + αGt+1|St = s]
= Eπ [Rt+1 + αvπ (St+1) |St = s]
Monte Carlo methods use an estimate of vπ (s) = Eπ [Gt|St = s] as a target
Instead, Dynamic Programming methods use an estimate of
vπ (s) = Eπ [Rt+1 + αvπ (St+1) |St = s] as their target.
91 / 111
Images/cinvestav
Furthermore
The Monte Carlo target is an estimate
Because the expected value Eπ [Gt|St = s] is not known; a sample
return is used instead
The DP target is an estimate
Because vπ (St+1) is not known and the current estimate, V (St+1), is
used instead.
Here, TD goes beyond and combines both ideas
It combines the sampling of Monte Carlo with the bootstrapping of
DP.
92 / 111
Images/cinvestav
As in many methods, there is an error
And it is used to improve the value estimates
δt = Rt+1 + αV (St+1) − V (St)
Because the TD error depends on the next state and next reward
It is not actually available until one time step later.
Thus, δt is the error in V (St), available only at time t + 1
93 / 111
Images/cinvestav
Finally
If the array V does not change during the episode,
the Monte Carlo error can be written as a sum of TD errors
Gt − V (St) = Rt+1 + αGt+1 − V (St) + αV (St+1) − αV (St+1)
= δt + α (Gt+1 − V (St+1))
= δt + αδt+1 + α² (Gt+2 − V (St+2))
= ...
= Σ_{k=t}^{T−1} α^{k−t} δk
94 / 111
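
A tiny numerical check of this identity, with random rewards and a frozen value table (all numbers are arbitrary; the point is only that the two sides agree):

import random

rng = random.Random(1)
T, alpha = 6, 0.9                                   # alpha is the discount, as on the slide
R = [rng.random() for _ in range(T)]                # rewards R_1, ..., R_T of one episode
V = [rng.random() for _ in range(T)] + [0.0]        # a frozen value table, with V(S_T) = 0

G = [0.0] * (T + 1)                                 # returns: G_t = R_{t+1} + alpha * G_{t+1}
for t in reversed(range(T)):
    G[t] = R[t] + alpha * G[t + 1]

delta = [R[k] + alpha * V[k + 1] - V[k] for k in range(T)]   # TD errors
lhs = G[0] - V[0]                                            # Monte Carlo error at t = 0
rhs = sum(alpha ** k * delta[k] for k in range(T))           # discounted sum of TD errors
print(abs(lhs - rhs) < 1e-12)                                # True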
Images/cinvestav
Optimality of TD(0)
Under batch updating
TD(0) converges deterministically to a single answer, independent of
the step size γ, as long as γ is chosen to be sufficiently small.
Under the same conditions, the constant-α Monte Carlo method,
V (St) ← V (St) + γ [Gt − V (St)] ,
also converges deterministically, but to a different answer.
95 / 111
Images/cinvestav
Q-learning: Off-Policy TD Control
One of the early breakthroughs in reinforcement learning
The development of an off-policy TD control algorithm, Q-learning
(Watkins, 1989)
It is defined as
Q (St, At) ← Q (St, At) + γ [Rt+1 + α max_a Q (St+1, a) − Q (St, At)]
Here
The learned action-value function, Q, directly approximates q∗, the
optimal action-value function,
independent of the policy being followed.
97 / 111
Images/cinvestav
Furthermore
The policy still has an effect
It determines which state–action pairs are visited and updated.
Therefore
All that is required for correct convergence is that all pairs continue
to be updated.
Under this assumption and a variant of the usual stochastic
approximation conditions
Q has been shown to converge with probability 1 to q∗
98 / 111
Images/cinvestav
Q-learning (Off-Policy TD control) for estimating π ≈ π∗
Algorithm parameters
Step size γ ∈ (0, 1], discount factor α, and a small ε > 0
Initialize
Initialize Q(s, a), for all s ∈ S+, a ∈ A, arbitrarily, except that
Q (terminal, ·) = 0
99 / 111
Images/cinvestav
Then
Loop for each episode
1 Initialize S
2 Loop for each step of the episode:
3 Choose A from S using a policy derived from Q (for example, ε-greedy)
4 Take action A, observe R, S′
Q (S, A) ← Q (S, A) + γ [R + α max_a Q (S′, a) − Q (S, A)]
5 S ← S′
6 Until S is terminal
100 / 111
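
Putting the pieces together, here is a self-contained sketch of Q-learning with ε-greedy exploration on a toy corridor environment (the environment and constants are assumptions made for illustration; the update follows the slide, with γ as the step size and α as the discount):

import random

N_STATES, ACTIONS = 6, (-1, +1)            # a corridor; state N_STATES - 1 is the goal
STEP, DISCOUNT, EPS = 0.5, 0.9, 0.1        # STEP plays the role of gamma in the slide's update

def step_env(s, a):
    s2 = max(0, min(N_STATES - 1, s + a))
    reward = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, reward, s2 == N_STATES - 1  # next state, reward, terminal?

def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection (ties broken at random)
            if rng.random() < EPS:
                a = rng.choice(ACTIONS)
            else:
                best = max(Q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if Q[(s, b)] == best])
            s2, r, done = step_env(s, a)
            target = r if done else r + DISCOUNT * max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += STEP * (target - Q[(s, a)])     # off-policy TD control update
            s = s2
    return Q

Q = q_learning()
print([round(max(Q[(s, a)] for a in ACTIONS), 2) for s in range(N_STATES)])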
Images/cinvestav
One of the earliest successes
Backgammon is a major game
with numerous tournaments and regular world championship matches.
It is a stochastic game with a full description of the world
102 / 111
Images/cinvestav
What did they use?
TD-Gammon used a nonlinear form of TD (λ)
An estimated value ˆv (s, w) is used to estimate the probability of
winning from state s.
Then
To achieve this, rewards were defined as zero for all time steps except
those on which the game is won.
103 / 111
Images/cinvestav
Implementation of the value function in TD-Gammon
They took the decision to use a standard multilayer ANN
104 / 111
Images/cinvestav
The interesting part
Tesauro’s representation
For each point on the backgammon board
Four units indicate the number of white pieces on the point
Basically
If there were no white pieces, all four units took the value zero,
if there was one white piece, the first unit took the value one
And so on...
Therefore, with four units for white and another four for black at
each of the 24 points
We have 4*24 + 4*24 = 192 units
105 / 111
Images/cinvestav
Then
We have that
Two additional units encoded the number of white and black pieces
on the bar (each took the value n/2, with n the number on the bar)
Two other units represented the number of black and white pieces
already removed from the board (these took the value n/15)
Two units indicated in a binary fashion whether it was white’s or
black’s turn to move.
This gives 192 + 2 + 2 + 2 = 198 input units in total.
106 / 111
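
A sketch of this 198-unit encoding (the exact thresholds of the four-units-per-point scheme vary slightly between accounts, so treat the helper below as an approximation rather than Tesauro's exact code):

def encode_point(n):
    """Four units per colour per point: thresholds at 1, 2 and 3 checkers, plus a
    scaled unit for any excess (a common description of the scheme; details may differ)."""
    return [float(n >= 1), float(n >= 2), float(n >= 3), max(0.0, (n - 3) / 2.0)]

def encode_position(white_pts, black_pts, white_bar, black_bar,
                    white_off, black_off, white_to_move):
    """white_pts / black_pts: 24 integers, the number of checkers of each colour per point."""
    x = []
    for n in white_pts:
        x += encode_point(n)                          # 4 * 24 = 96 units for white
    for n in black_pts:
        x += encode_point(n)                          # 96 units for black
    x += [white_bar / 2.0, black_bar / 2.0]           # 2 units: checkers on the bar (n / 2)
    x += [white_off / 15.0, black_off / 15.0]         # 2 units: checkers removed (n / 15)
    x += [1.0, 0.0] if white_to_move else [0.0, 1.0]  # 2 units: whose turn it is
    return x                                          # 192 + 2 + 2 + 2 = 198 inputs

# A placeholder position (not a real backgammon layout), just to check the length
white = [2] + [0] * 23
black = [0] * 23 + [2]
print(len(encode_position(white, black, 0, 0, 0, 0, True)))   # 198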
Images/cinvestav
TD(λ) by Back-Propagation
TD-Gammon uses a semi-gradient form of TD (λ)
TD (λ) has the following general update rule:
wt+1 = wt + γ [Rt+1 + α ˆv (St+1, wt) − ˆv (St, wt)] zt
where wt is the vector of all modifiable parameters
and zt is a vector of eligibility traces
Eligibility traces unify and generalize TD and Monte Carlo methods.
They work as a kind of short-term memory
The rough idea is that when a component of wt participates in producing
an estimate, the corresponding component of zt is bumped up and then
begins to fade away.
108 / 111
Images/cinvestav
How is this done?
Update Rule
zt = αλ zt−1 + ∇ˆv (St, wt) with z0 = 0
Something Notable
The gradient in this equation can be computed efficiently by the
back-propagation procedure
For this, we set the error at the output of the ANN to
ˆv (St+1, wt) − ˆv (St, wt)
109 / 111
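
To make the two coupled updates concrete, here is a sketch of semi-gradient TD(λ) for a linear value function, where the gradient ∇ˆv(St, wt) is simply the feature vector; TD-Gammon used a multilayer network and back-propagation instead, so this is only a structural illustration, again with γ as step size, α as discount, and λ as the trace-decay parameter:

def td_lambda_episode(features, rewards, w, step=0.1, discount=1.0, lam=0.8):
    """One episode of semi-gradient TD(lambda) with a linear value function
    v_hat(s, w) = w . x(s).  `features` lists the feature vectors x(S_0), ..., x(S_T)
    (the last state is terminal, so its value is treated as 0) and `rewards` is
    [R_1, ..., R_T]."""
    z = [0.0] * len(w)                                    # eligibility trace, z_0 = 0
    for t in range(len(rewards)):
        x_t, x_next = features[t], features[t + 1]
        v_t = sum(wi * xi for wi, xi in zip(w, x_t))
        v_next = 0.0 if t + 1 == len(rewards) else sum(wi * xi for wi, xi in zip(w, x_next))
        delta = rewards[t] + discount * v_next - v_t      # TD error
        z = [discount * lam * zi + xi for zi, xi in zip(z, x_t)]  # z_t = alpha*lambda*z_{t-1} + grad v_hat
        w = [wi + step * delta * zi for wi, zi in zip(w, z)]      # w_{t+1} = w_t + gamma*delta_t*z_t
    return w

# A toy 3-state episode with one-hot features; the numbers are purely illustrative
feats = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(td_lambda_episode(feats, rewards=[0.0, 1.0], w=[0.0, 0.0, 0.0]))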
Images/cinvestav
Conclusions
With new tools such as Reinforcement Learning
The new Artificial Intelligence has a lot to do in the world
Thanks for being in this class
As they say, we have a lot to do...
111 / 111

Más contenido relacionado

La actualidad más candente

Deep Reinforcement Learning: Q-Learning
Deep Reinforcement Learning: Q-LearningDeep Reinforcement Learning: Q-Learning
Deep Reinforcement Learning: Q-LearningKai-Wen Zhao
 
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsBill Liu
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningDongHyun Kwak
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratchJie-Han Chen
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement LearningUsman Qayyum
 
Rl chapter 1 introduction
Rl chapter 1 introductionRl chapter 1 introduction
Rl chapter 1 introductionConnorShorten2
 
Reinforcement Learning 4. Dynamic Programming
Reinforcement Learning 4. Dynamic ProgrammingReinforcement Learning 4. Dynamic Programming
Reinforcement Learning 4. Dynamic ProgrammingSeung Jae Lee
 
Reinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo MethodsReinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo MethodsSeung Jae Lee
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningNAVER Engineering
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningCloudxLab
 
Dueling Network Architectures for Deep Reinforcement Learning
Dueling Network Architectures for Deep Reinforcement LearningDueling Network Architectures for Deep Reinforcement Learning
Dueling Network Architectures for Deep Reinforcement LearningYoonho Lee
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learningSubrat Panda, PhD
 
Reinforcement Learning 3. Finite Markov Decision Processes
Reinforcement Learning 3. Finite Markov Decision ProcessesReinforcement Learning 3. Finite Markov Decision Processes
Reinforcement Learning 3. Finite Markov Decision ProcessesSeung Jae Lee
 
Control as Inference.pptx
Control as Inference.pptxControl as Inference.pptx
Control as Inference.pptxssuserbd1647
 
Temporal difference learning
Temporal difference learningTemporal difference learning
Temporal difference learningJie-Han Chen
 
Reinforcement Learning Q-Learning
Reinforcement Learning   Q-Learning Reinforcement Learning   Q-Learning
Reinforcement Learning Q-Learning Melaku Eneayehu
 

La actualidad más candente (20)

Deep Q-Learning
Deep Q-LearningDeep Q-Learning
Deep Q-Learning
 
Deep Reinforcement Learning: Q-Learning
Deep Reinforcement Learning: Q-LearningDeep Reinforcement Learning: Q-Learning
Deep Reinforcement Learning: Q-Learning
 
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its Applications
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratch
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
Rl chapter 1 introduction
Rl chapter 1 introductionRl chapter 1 introduction
Rl chapter 1 introduction
 
Reinforcement Learning 4. Dynamic Programming
Reinforcement Learning 4. Dynamic ProgrammingReinforcement Learning 4. Dynamic Programming
Reinforcement Learning 4. Dynamic Programming
 
Reinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo MethodsReinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo Methods
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Dueling Network Architectures for Deep Reinforcement Learning
Dueling Network Architectures for Deep Reinforcement LearningDueling Network Architectures for Deep Reinforcement Learning
Dueling Network Architectures for Deep Reinforcement Learning
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learning
 
Reinforcement Learning 3. Finite Markov Decision Processes
Reinforcement Learning 3. Finite Markov Decision ProcessesReinforcement Learning 3. Finite Markov Decision Processes
Reinforcement Learning 3. Finite Markov Decision Processes
 
Control as Inference.pptx
Control as Inference.pptxControl as Inference.pptx
Control as Inference.pptx
 
Temporal difference learning
Temporal difference learningTemporal difference learning
Temporal difference learning
 
Deep Q-learning explained
Deep Q-learning explainedDeep Q-learning explained
Deep Q-learning explained
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Reinforcement Learning Q-Learning
Reinforcement Learning   Q-Learning Reinforcement Learning   Q-Learning
Reinforcement Learning Q-Learning
 

Similar a 25 introduction reinforcement_learning

Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning Chandra Meena
 
reinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfreinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfVaishnavGhadge1
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningDongHyun Kwak
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysYasutoTamura1
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperDataScienceLab
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017MLconf
 
Shanghai deep learning meetup 4
Shanghai deep learning meetup 4Shanghai deep learning meetup 4
Shanghai deep learning meetup 4Xiaohu ZHU
 
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...Universitat Politècnica de Catalunya
 
REINFORCEMENT LEARNING
REINFORCEMENT LEARNINGREINFORCEMENT LEARNING
REINFORCEMENT LEARNINGpradiprahul
 
An efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game LearningAn efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game LearningPrabhu Kumar
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningDing Li
 
Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2Flavian Vasile
 
Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design Dan Elton
 
TensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and TricksTensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and TricksBen Ball
 
14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptxRithikRaj25
 
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...Ashwin Rao
 
Hibridization of Reinforcement Learning Agents
Hibridization of Reinforcement Learning AgentsHibridization of Reinforcement Learning Agents
Hibridization of Reinforcement Learning Agentsbutest
 
Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017mooopan
 

Similar a 25 introduction reinforcement_learning (20)

Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning
 
reinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfreinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdf
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative ways
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine Sweeper
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
 
Shanghai deep learning meetup 4
Shanghai deep learning meetup 4Shanghai deep learning meetup 4
Shanghai deep learning meetup 4
 
Dp
DpDp
Dp
 
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
 
Introduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement LearningIntroduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement Learning
 
REINFORCEMENT LEARNING
REINFORCEMENT LEARNINGREINFORCEMENT LEARNING
REINFORCEMENT LEARNING
 
An efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game LearningAn efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game Learning
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2
 
Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design
 
TensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and TricksTensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and Tricks
 
14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx
 
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
 
Hibridization of Reinforcement Learning Agents
Hibridization of Reinforcement Learning AgentsHibridization of Reinforcement Learning Agents
Hibridization of Reinforcement Learning Agents
 
Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017
 

Más de Andres Mendez-Vazquez

01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectorsAndres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issuesAndres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiationAndres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep LearningAndres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusAndres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusAndres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesAndres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variationsAndres Mendez-Vazquez
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusAndres Mendez-Vazquez
 

Más de Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems Syllabus
 

Último

Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaOmar Fathy
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startQuintin Balsdon
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityMorshed Ahmed Rahath
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxJuliansyahHarahap1
 
Air Compressor reciprocating single stage
Air Compressor reciprocating single stageAir Compressor reciprocating single stage
Air Compressor reciprocating single stageAbc194748
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptxJIT KUMAR GUPTA
 
Learn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksLearn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksMagic Marks
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...Health
 
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...HenryBriggs2
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadhamedmustafa094
 
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...soginsider
 
Bridge Jacking Design Sample Calculation.pptx
Bridge Jacking Design Sample Calculation.pptxBridge Jacking Design Sample Calculation.pptx
Bridge Jacking Design Sample Calculation.pptxnuruddin69
 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxSCMS School of Architecture
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdfKamal Acharya
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.Kamal Acharya
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxmaisarahman1
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationBhangaleSonal
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"mphochane1998
 

Último (20)

Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
Air Compressor reciprocating single stage
Air Compressor reciprocating single stageAir Compressor reciprocating single stage
Air Compressor reciprocating single stage
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Learn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksLearn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic Marks
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
 
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
 
Bridge Jacking Design Sample Calculation.pptx
Bridge Jacking Design Sample Calculation.pptxBridge Jacking Design Sample Calculation.pptx
Bridge Jacking Design Sample Calculation.pptx
 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 

25 introduction reinforcement_learning

  • 1. Introduction to Artificial Intelligence Reinforcement Learning Andres Mendez-Vazquez April 26, 2019 1 / 111
  • 2. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 2 / 111
  • 4. Images/cinvestav Reinforcement Learning is learning what to do What does Reinforcement Learning want? Maximize a numerical reward signal The learner is not told which actions to take A discovery must be performed to obtain the actions that yield the most rewards 4 / 111
  • 6. Images/cinvestav These ideas come from Dynamical Systems Theory Specifically As the optimal control of incompletely-known Markov Decision Processes (MDP). Thus, the device must do the following Interact with the environment to learn by using punishments and rewards. 5 / 111
  • 8. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 6 / 111
  • 9. Images/cinvestav A mobile robot decides whether it should enter a new room in search of more trash to collect or go back to charge its battery It needs to take a decision based on the current charge level of its battery and on how quickly it can get back to its base 7 / 111
  • 11. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 8 / 111
  • 12. Images/cinvestav A K-Armed Bandit Problem Definition You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action. Definition Stationary Probability A stationary distribution of a Markov chain is a probability distribution that remains unchanged in the Markov chain as time progresses. 9 / 111
  • 16. Images/cinvestav Therefore Each of the k actions has an expected value q∗ (a) = E [Rt|At = a] Something Notable If you knew the value of each action, it would be trivial to solve the k-armed bandit problem. 11 / 111
  • 18. Images/cinvestav What do we want? To find the estimated value of action a at time t, Qt (a), such that |Qt (a) − q∗ (a)| < ε with ε as small as possible 12 / 111
  • 19. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 13 / 111
  • 20. Images/cinvestav Greedy Actions Intuitions To choose the action with the greatest E [Rt|At = a] This is known as Exploitation You are exploiting your current knowledge of the values of the actions. Contrasting to Exploration When you select the non-greedy actions to obtain a better knowledge of their values. 14 / 111
  • 23. Images/cinvestav An Interesting Conundrum It has been observed that Exploitation maximizes the expected reward on the one step, Exploration may produce the greater total reward in the long run. Thus, the idea The value of a state s is the total amount of reward an agent can expect to accumulate over the future, starting from s. Thus, Values They indicate the long-term profit of states after taking into account the states that follow and their rewards. 15 / 111
  • 26. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 16 / 111
  • 27. Images/cinvestav Therefore, the parts of a Reinforcement Learning system A Policy It defines the learning agent’s way of behaving at a given time. The policy is the core of a reinforcement learning system In general, policies may be stochastic A Reward Signal It defines the goal of a reinforcement learning problem. The system’s sole objective is to maximize the total reward over the long run This encourages Exploitation It is immediate... you might want something with a longer-term view. A Value Function It specifies what is good in the long run. This encourages Exploration 17 / 111
  • 36. Images/cinvestav Finally A Model of the Environment This simulates the behavior of the environment Allowing us to make inferences on how the environment may behave Something Notable Models are used for planning Meaning we make decisions before they are executed Thus, Reinforcement Learning methods using models are called model-based 18 / 111
  • 38. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 19 / 111
  • 39. Images/cinvestav Limitations Given that Reinforcement learning relies heavily on the concept of state Therefore, it has the same problem as Markov Decision Processes Defining the state to obtain a better representation of the search space. This is where the creative part of the problem comes in As we have seen, solutions in AI rely a lot on the creativity of the solution... 20 / 111
  • 42. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 21 / 111
  • 43. Images/cinvestav The System–Environment Interaction In a Markov Decision Process 22 / 111
  • 44. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 23 / 111
  • 45. Images/cinvestav Markov Process and Markov Decision Process Definition of Markov Process A sequence of states is Markov if and only if the probability of moving to the next state st+1 depends only on the present state st P [st+1|st] = P [st+1|s1, ..., st] Something Notable We always talk about a time-homogeneous Markov chain in Reinforcement Learning: the probability of the transition is independent of t P [st+1 = s'|st = s] = P [st = s'|st−1 = s] 24 / 111
  • 47. Images/cinvestav Formally Definition (Markov Process) A Markov Process (or Markov Chain) is a tuple (S, P), where 1 S is a finite set of states 2 P is a state transition probability matrix, Pss' = P [st+1 = s'|st = s] Thus, the dynamic of the Markov process is as follows s0 −(Ps0s1)→ s1 −(Ps1s2)→ s2 −(Ps2s3)→ s3 −→ · · · 25 / 111
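To make the (S, P) formulation concrete, here is a minimal Python sketch (not part of the original slides; the three states and the entries of P are invented for illustration) that samples a trajectory s0 → s1 → s2 → · · · from a transition matrix:

    import numpy as np

    # Hypothetical 3-state Markov chain; each row of P sums to 1.
    states = ["A", "B", "C"]
    P = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.5, 0.3]])

    def sample_trajectory(P, start, steps, seed=0):
        """Follow the chain for `steps` transitions from state index `start`."""
        rng = np.random.default_rng(seed)
        s, path = start, [start]
        for _ in range(steps):
            s = rng.choice(len(P), p=P[s])  # next state drawn from row P[s, :]
            path.append(int(s))
        return path

    print([states[i] for i in sample_trajectory(P, start=0, steps=5)])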
  • 49. Images/cinvestav Then, we get the following familiar definition Definition An MDP is a tuple (S, A, P, R, α) 1 S is a finite set of states 2 A is a finite set of actions 3 P is a state transition probability matrix, P^a_ss' = P [st+1 = s'|st = s, At = a] 4 α ∈ [0, 1] is called the discount factor 5 R : S × A → R is a reward function Thus, the dynamic of the Markov Decision Process is as follows, using a probability to pick an action s0 −(a0)→ s1 −(a1)→ s2 −(a2)→ s3 −→ · · · 26 / 111
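For later reference, one minimal way to hold such a tuple (S, A, P, R, α) in Python is sketched below; the two states, two actions, and all numbers are assumptions made only for this sketch, not values from the slides.

    # A tiny, made-up MDP. P[s][a] maps each successor state s' to P^a_ss';
    # R[s][a] is the expected immediate reward for taking action a in state s.
    S = ["low", "high"]                 # e.g., battery levels of the cleaning robot
    A = ["search", "recharge"]
    P = {
        "low":  {"search": {"low": 0.6, "high": 0.4}, "recharge": {"high": 1.0}},
        "high": {"search": {"low": 0.3, "high": 0.7}, "recharge": {"high": 1.0}},
    }
    R = {
        "low":  {"search": 1.0, "recharge": -1.0},
        "high": {"search": 2.0, "recharge": -1.0},
    }
    alpha = 0.9                         # discount factor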
  • 56. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 27 / 111
  • 57. Images/cinvestav The Goal of Reinforcement Learning We want to maximize the expected value of the return i.e. To obtain the optimal policy!!! Thus, as in our previous study of MDP’s Gt = Rt+1 + αRt+2 + · · · = Σ_{k=0}^{∞} α^k Rt+k+1 28 / 111
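A short sketch of the return Gt for a finite list of rewards (the reward values and the discount factor below are arbitrary numbers, only to show the recursion Gt = Rt+1 + αGt+1):

    def discounted_return(rewards, alpha):
        """G_t = R_{t+1} + alpha * R_{t+2} + ... computed backwards as g <- r + alpha * g."""
        g = 0.0
        for r in reversed(rewards):
            g = r + alpha * g
        return g

    print(discounted_return([1.0, 0.0, 2.0], alpha=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62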
  • 59. Images/cinvestav Remarks Interpretation The discounted future rewards can be interpreted as the current value of future rewards. The discount factor α α close to 0 leads to “short-sighted” evaluation... you are only looking for immediate rewards!!! α close to 1 leads to “long-sighted” evaluation... you are willing to wait for a better reward!!! 29 / 111
  • 61. Images/cinvestav Why do we use this discount factor? Basically 1 The discount factor avoids infinite returns in cyclic Markov processes, given the properties of the geometric series. 2 There is a certain uncertainty about future rewards. 3 If the reward is financial, immediate rewards may earn more interest than delayed rewards. 4 Animal/human behavior shows preference for immediate reward. 30 / 111
  • 65. Images/cinvestav A Policy Using our MDP ideas... A policy π is a distribution over actions given states π (a|s) = P [At = a|St = s] A policy guides the choice of action at a given state, and it is time independent. Then, we introduce two value functions State-value function. Action-value function. 31 / 111
  • 67. Images/cinvestav The state-value function vπ of an MDP It is defined as vπ (s) = Eπ [Gt|St = s] This gives the long-term value of the state s vπ (s) = Eπ [Gt|St = s] = Eπ [Σ_{k=0}^{∞} α^k Rt+k+1|St = s] = Eπ [Rt+1 + αGt+1|St = s] = Eπ [Rt+1|St = s] (Immediate Reward) + Eπ [αvπ (St+1) |St = s] (Discounted value of successor state) 32 / 111
  • 71. Images/cinvestav The Action-Value Function qπ (s, a) It is the expected return starting from state s Starting from state s, taking action a, and then following policy π qπ (s, a) = Eπ [Gt|St = s, At = a] We can also obtain the following decomposition qπ (s, a) = Eπ [Rt+1 + αqπ (St+1, At+1) |St = s, At = a] 33 / 111
  • 73. Images/cinvestav Now... To simplify notation, we define R^a_s = E [Rt+1|St = s, At = a] Then, we have vπ (s) = Σ_{a∈A} π (a|s) qπ (s, a) and qπ (s, a) = R^a_s + α Σ_{s'∈S} P^a_ss' vπ (s') 34 / 111
  • 75. Images/cinvestav Then Expressing qπ (s, a) in terms of vπ (s) in the expression of vπ (s) - The Bellman Equation vπ (s) = Σ_{a∈A} π (a|s) [R^a_s + α Σ_{s'∈S} P^a_ss' vπ (s')] The Bellman equation relates the state-value function of one state with that of other states. 35 / 111
  • 77. Images/cinvestav Also Bellman equation for qπ (s, a) qπ (s, a) = R^a_s + α Σ_{s'∈S} P^a_ss' Σ_{a'∈A} π (a'|s') qπ (s', a') How do we solve this problem? We have the following solutions 36 / 111
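For a fixed policy the Bellman equation above is linear in vπ, so for a small MDP it can be solved directly. The sketch below uses a made-up 2-state example in which the transition matrix and rewards have already been averaged over the policy (P_pi[s, s'] = Σ_a π(a|s) P^a_ss' and R_pi[s] = Σ_a π(a|s) R^a_s); these numbers are assumptions for illustration only.

    import numpy as np

    P_pi = np.array([[0.8, 0.2],
                     [0.4, 0.6]])   # policy-averaged transition matrix
    R_pi = np.array([1.0, -0.5])    # policy-averaged expected rewards
    alpha = 0.9

    # Bellman expectation equation: v = R_pi + alpha * P_pi v  =>  (I - alpha * P_pi) v = R_pi
    v = np.linalg.solve(np.eye(2) - alpha * P_pi, R_pi)
    print(v)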
  • 79. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 37 / 111
  • 80. Images/cinvestav Action-Value Methods One natural way to estimate this is by averaging the rewards actually received Qt (a) = [Σ_{i=1}^{t−1} Ri · 1{Ai=a}] / [Σ_{i=1}^{t−1} 1{Ai=a}] Nevertheless If the denominator is zero, we define Qt (a) as some default value Now, as the denominator goes to infinity, by the law of large numbers, Qt (a) −→ q∗ (a) 38 / 111
  • 83. Images/cinvestav Furthermore We call this the sample-average method Not the best, but it allows us to introduce a way to select actions... A classic one is the greedy selection At = arg max_a Qt (a) Therefore Greedy action selection always exploits current knowledge to maximize immediate reward. It spends no time on inferior actions to see if they might really be better. 39 / 111
  • 87. Images/cinvestav And Here a Heuristic A Simple Alternative is to act greedily most of the time Every once in a while, say with small probability ε, select randomly from among all the actions with equal probability. This is called the ε-greedy method In the limit to infinity, this method ensures: Qt (a) −→ q∗ (a) 40 / 111
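A minimal sketch of ε-greedy action selection over a vector of estimates Q; the function name and the use of NumPy are my own choices, not notation from the slides.

    import numpy as np

    def epsilon_greedy(Q, epsilon, rng):
        """Return argmax_a Q[a] with probability 1 - epsilon, a random arm otherwise."""
        if rng.random() < epsilon:
            return int(rng.integers(len(Q)))   # explore
        return int(np.argmax(Q))               # exploit

    rng = np.random.default_rng(0)
    print(epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.1, rng=rng))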
  • 89. Images/cinvestav It is like the Mean Filter We have a nice result 41 / 111
  • 90. Images/cinvestav Example In the following example The q∗ (a) were selected according to a Gaussian Distribution with mean 0 and variance 1. Then, when an action is selected, the reward Rt was selected from a normal distribution with mean q∗ (At) and variance 1. We have then 42 / 111
  • 92. Images/cinvestav Now, as the sampling goes up We get 43 / 111
  • 93. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 44 / 111
  • 94. Images/cinvestav Imagine the following Let Ri denote the reward received after the ith selection of this action Qn = (R1 + R2 + · · · + Rn−1) / (n − 1) The problem with this technique The amount of memory and computational power needed to keep the estimates updated as more rewards arrive 45 / 111
  • 96. Images/cinvestav Therefore We need something more efficient We can do something like that by realizing the following: Qn+1 = (1/n) Σ_{i=1}^{n} Ri = (1/n) [Rn + Σ_{i=1}^{n−1} Ri] = (1/n) [Rn + (n − 1) Qn] = Qn + (1/n) [Rn − Qn] 46 / 111
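The last line of the derivation, Qn+1 = Qn + (1/n)[Rn − Qn], is all that has to be stored and computed. A small sketch (the reward values are arbitrary):

    def update_mean(Q, R, n):
        """Incremental sample average: Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)."""
        return Q + (R - Q) / n

    Q = 0.0
    for n, r in enumerate([2.0, 4.0, 6.0], start=1):
        Q = update_mean(Q, r, n)
    print(Q)  # 4.0, the mean of the three rewards, using O(1) memory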
  • 100. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 47 / 111
  • 101. Images/cinvestav General Formula in Reinforcement Learning The Pattern Appears almost all the time in RL New Estimate = Old Estimate + Step Size [Target − Old Estimate] It looks a lot like Gradient Descent if we decide to Assume a first-order Taylor expansion f (x) ≈ f (a) + f'(a) (x − a) =⇒ f'(a) ≈ [f (x) − f (a)] / (x − a) Then xn+1 = xn + γ f'(a) = xn + γ [f (x) − f (a)] / (x − a) = xn + γ' [f (x) − f (a)], absorbing the 1/(x − a) factor into the step size 48 / 111
  • 104. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 49 / 111
  • 105. Images/cinvestav A Simple Bandit Algorithm Initialize for a = 1 to k 1 Q (a) ← 0 2 N (a) ← 0 Loop until a certain threshold γ 1 A ← arg max_a Q (a) with probability 1 − ε, or a random action with probability ε 2 R ← bandit (A) 3 N (A) ← N (A) + 1 4 Q (A) ← Q (A) + (1/N (A)) [R − Q (A)] 50 / 111
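The pseudocode above can be made runnable with a few lines of Python. The Gaussian reward model below mirrors the earlier testbed example (q∗(a) drawn from N(0, 1), rewards from N(q∗(A), 1)); the number of arms, the number of steps, and ε are assumptions chosen only for this sketch.

    import numpy as np

    rng = np.random.default_rng(0)
    k = 10
    q_star = rng.normal(0.0, 1.0, size=k)            # true action values
    bandit = lambda a: rng.normal(q_star[a], 1.0)    # reward R ~ N(q*(a), 1)

    def run_bandit(steps=1000, epsilon=0.1):
        Q = np.zeros(k)                               # estimates Q(a)
        N = np.zeros(k)                               # counts N(a)
        total = 0.0
        for _ in range(steps):
            if rng.random() < epsilon:
                A = int(rng.integers(k))              # explore
            else:
                A = int(np.argmax(Q))                 # exploit
            R = bandit(A)
            N[A] += 1
            Q[A] += (R - Q[A]) / N[A]                 # incremental sample average
            total += R
        return Q, total / steps

    Q, avg_reward = run_bandit()
    print(np.round(Q, 2), round(avg_reward, 3))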
  • 107. Images/cinvestav Remarks Remember This works well for a stationary problem For more on non-stationary problems, see “Reinforcement Learning: An Introduction” by Sutton et al., Chapter 2. 51 / 111
  • 109. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 52 / 111
  • 110. Images/cinvestav Dynamic Programming As in many things in Computer Science We love the Divide Et Impera... Then In Reinforcement Learning, the Bellman equation gives recursive decompositions, and the value function stores and reuses solutions. 53 / 111
  • 112. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 54 / 111
  • 113. Images/cinvestav Policy Evaluation For this, we assume as always The problem can be solved with the Bellman Equation We start from an initial guess v1 Then, we apply the Bellman equation iteratively v1 → v2 → · · · → vπ For this, we use synchronous updates We update the value functions of the present iteration all at the same time, based on those of the previous iteration. 55 / 111
  • 116. Images/cinvestav Then At each iteration k + 1, for all states s ∈ S We update vk+1 from vk (s') according to the Bellman Equation, where s' is a successor state of s: vk+1 (s) = Σ_{a∈A} π (a|s) [R^a_s + α Σ_{s'∈S} P^a_ss' vk (s')] Something Notable The iterative process stops when the maximum difference between the value function at the current step and that at the previous step is smaller than some small positive constant. 56 / 111
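A sketch of this synchronous update in Python, using the dictionary-based MDP layout suggested earlier (P[s][a] maps successors to probabilities, R[s][a] is the expected reward, pi[s] maps actions to probabilities); this layout is an assumption made for illustration, not the slides' notation.

    def policy_evaluation(S, P, R, pi, alpha, theta=1e-6):
        """Iterative policy evaluation with synchronous updates."""
        v = {s: 0.0 for s in S}
        while True:
            v_new = {}
            for s in S:
                v_new[s] = sum(
                    p_a * (R[s][a] + alpha * sum(p * v[s2] for s2, p in P[s][a].items()))
                    for a, p_a in pi[s].items()
                )
            delta = max(abs(v_new[s] - v[s]) for s in S)
            v = v_new
            if delta < theta:   # stop when the maximum change is small enough
                return v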
  • 118. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 57 / 111
  • 119. Images/cinvestav However Our ultimate goal is to find the optimal policy For this, we can use the policy iteration process Algorithm 1 Initialize π randomly 2 Repeat until the previous policy and the current policy are the same. 1 Evaluate vπ by policy evaluation. 2 Using synchronous updates, for each state s, let π' (s) = arg max_{a∈A} q (s, a) = arg max_{a∈A} [R^a_s + α Σ_{s'∈S} P^a_ss' vπ (s')] 58 / 111
  • 121. Images/cinvestav Now Something Notable This policy iteration algorithm will improve the policy at each step. Assume that We have a deterministic policy π and after one improvement step we get π'. We know that the value improves in one step because qπ (s, π' (s)) = max_{a∈A} qπ (s, a) ≥ qπ (s, π (s)) = vπ (s) 59 / 111
  • 124. Images/cinvestav Then We can see that vπ (s) ≤ qπ (s, π' (s)) = Eπ' [Rt+1 + αvπ (St+1) |St = s] ≤ Eπ' [Rt+1 + αqπ (St+1, π' (St+1)) |St = s] ≤ Eπ' [Rt+1 + αRt+2 + α^2 qπ (St+2, π' (St+2)) |St = s] ≤ Eπ' [Rt+1 + αRt+2 + · · · |St = s] = vπ' (s) 60 / 111
  • 128. Images/cinvestav Then If improvements stop qπ (s, π' (s)) = max_{a∈A} qπ (s, a) = qπ (s, π (s)) = vπ (s) This means that the Bellman optimality equation has been satisfied vπ (s) = max_{a∈A} qπ (s, a) Therefore, vπ (s) = v∗ (s) for all s ∈ S and π is an optimal policy. 61 / 111
  • 130. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 62 / 111
  • 131. Images/cinvestav Once the policy has been improved We can iteratively obtain a sequence of monotonically improving policies and values π0 −E→ vπ0 −I→ π1 −E→ vπ1 −I→ π2 −E→ · · · −I→ π∗ −E→ v∗ with E = evaluation and I = improvement. 63 / 111
  • 132. Images/cinvestav Policy Iteration (using iterative policy evaluation) Step 1. Initialization V (s) ∈ R and π (s) ∈ A (s) arbitrarily for all s ∈ S Step 2. Policy Evaluation 1 Loop: 2 ∆ ← 0 3 Loop for each s ∈ S 4 v ← V (s) 5 V (s) ← Σ_{s',r} p (s', r|s, π (s)) [r + αV (s')] 6 ∆ ← max (∆, |v − V (s)|) 7 until ∆ < θ (a small threshold for accuracy) 64 / 111
  • 134. Images/cinvestav Further Step 3. Policy Improvement 1 policy-stable ← True 2 For each s ∈ S: 3 old_action ← π (s) 4 π (s) ← arg max_a Σ_{s',r} p (s', r|s, a) [r + αV (s')] 5 if old_action ≠ π (s) then policy-stable ← False 6 If policy-stable then stop and return V ≈ v∗ and π ≈ π∗, else go to Step 2. 65 / 111
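Putting the three steps together, here is a compact sketch of policy iteration; it reuses the policy_evaluation function and the dictionary MDP layout sketched above, both of which are my own assumed conventions rather than the slides' notation.

    def greedy_policy(S, A, P, R, v, alpha):
        """Policy improvement: pick argmax_a R[s][a] + alpha * sum_s' P[s][a][s'] v[s']."""
        pi = {}
        for s in S:
            best = max(A, key=lambda a: R[s][a] + alpha * sum(p * v[s2] for s2, p in P[s][a].items()))
            pi[s] = {best: 1.0}          # deterministic policy stored as a one-hot distribution
        return pi

    def policy_iteration(S, A, P, R, alpha):
        pi = {s: {A[0]: 1.0} for s in S}                  # arbitrary initial policy
        while True:
            v = policy_evaluation(S, P, R, pi, alpha)     # Step 2: evaluation
            new_pi = greedy_policy(S, A, P, R, v, alpha)  # Step 3: improvement
            if new_pi == pi:                              # policy stable: optimal
                return pi, v
            pi = new_pi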
  • 135. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 66 / 111
  • 136. Images/cinvestav Jack’s Car Rental Setup We have a car rental business with two locations (s1, s2). Each time Jack rents out a car, he is credited $10 by the company. If Jack is out of cars at that location, then the business is lost. Returned cars are available for renting the next day Jack has options He can move cars between the two locations overnight, at a cost of $2 per car, and only 5 of them can be moved. 67 / 111
  • 141. Images/cinvestav We choose a probability for modeling requests and returns Additionally, the number of cars requested and returned at each location follows a Poisson distribution $P(n \mid \lambda) = \frac{\lambda^{n}}{n!} e^{-\lambda}$ We simplify the problem with an upper bound of n = 20 cars per location Basically, any additional cars beyond that limit simply disappear from the problem!!! 68 / 111
  • 143. Images/cinvestav Finally We have the following λ’s for requests and returns 1 Requests at s1: λ = 3 2 Requests at s2: λ = 4 3 Returns at s1: λ = 3 4 Returns at s2: λ = 2 Each λ defines how many cars are requested or returned, on average, in a time interval (one day); a small sketch of this truncated Poisson model is shown below. 69 / 111
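As an illustration of the request/return model, the sketch below tabulates the Poisson probabilities with the n = 20 cutoff described above; the function names and the way the tail mass is lumped at the cutoff are my own illustrative choices under that simplification.

```python
from math import exp, factorial

def poisson_pmf(n, lam):
    """P(n | lambda) = lambda^n / n! * e^(-lambda)."""
    return (lam ** n) / factorial(n) * exp(-lam)

def truncated_poisson(lam, n_max=20):
    """Probabilities for n = 0..n_max, with the remaining tail mass lumped
    at n_max, mirroring the simplification that extra cars 'disappear'."""
    probs = [poisson_pmf(n, lam) for n in range(n_max)]
    probs.append(1.0 - sum(probs))   # tail mass assigned to the cutoff
    return probs

# Example: rental requests at location s1 (lambda = 3)
request_s1 = truncated_poisson(3)
expected = sum(n * p for n, p in enumerate(request_s1))
print(round(expected, 3))   # close to 3, slightly changed by the truncation
```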
  • 148. Images/cinvestav Thus, we have the following result when running policy iteration on this problem With a discount factor α = 0.9 (the sequence of improving policies and the final value function are shown as figures in the original slides) 70 / 111
  • 149. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 71 / 111
  • 150. Images/cinvestav History There is original earlier work By Arthur Samuel, which Sutton built upon to develop Temporal Difference learning, TD(λ) It was famously applied by Gerald Tesauro To build TD-Gammon A program that learned to play backgammon at the level of expert human players 72 / 111
  • 152. Images/cinvestav Temporal-Difference (TD) Learning Something Notable Possibly the central and novel idea of reinforcement learning It has similarities with Monte Carlo Methods They use experience to solve the prediction problem (for more, look at Sutton’s book) Monte Carlo methods wait until the return Gt is known $V(S_t) \leftarrow V(S_t) + \gamma\left[G_t - V(S_t)\right]$ with $G_t = \sum_{k=0}^{\infty} \alpha^{k} R_{t+k+1}$ Let us call this method constant-α Monte Carlo. 73 / 111
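A minimal sketch of this constant-step-size Monte Carlo update in Python; the episode format (a list of (state, reward) pairs) and the default step size are assumptions for illustration, and the step size is written gamma_step to match the slides' notation.

```python
def constant_alpha_mc_update(V, episode, gamma_step=0.1, alpha_discount=0.9):
    """Every-visit constant-step-size Monte Carlo prediction.

    'episode' is assumed to be a list of (state, reward) pairs
    (S_t, R_{t+1}) collected until termination.  After the episode ends,
    the return G_t = sum_k alpha^k R_{t+k+1} is accumulated backwards and
    each V(S_t) is moved a fraction gamma_step toward it.
    """
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + alpha_discount * G          # G_t = R_{t+1} + alpha * G_{t+1}
        V[state] = V[state] + gamma_step * (G - V[state])
    return V
```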
  • 155. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 74 / 111
  • 156. Images/cinvestav Fundamental Theorem of Simulation Cumulative Distribution Function With every random variable X, we associate a function called the Cumulative Distribution Function (CDF), defined as follows: $F_X(x) = P(X \leq x)$ With properties: $F_X(x) \geq 0$ and $F_X(x)$ is a non-decreasing function of x. 75 / 111
  • 161. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 78 / 111
  • 162. Images/cinvestav Fundamental Theorem of Simulation Theorem Simulating $X \sim f(x)$ It is equivalent to simulating $(X, U) \sim \mathcal{U}\left(\{(x, u) : 0 < u < f(x)\}\right)$ 79 / 111
  • 164. Images/cinvestav Geometrically We have the pair (X, U) distributed uniformly over the region under the graph of f (see the figure in the original slide) Proof We have that $f(x) = \int_{0}^{f(x)} du$, so f is the marginal pdf of (X, U). 80 / 111
  • 166. Images/cinvestav Therefore One thing made clear by this theorem is that we can generate X in three ways 1 First generate X ∼ f, and then $U \mid X = x \sim \mathcal{U}\{u : u \leq f(x)\}$, but this is useless because we already have X and do not need U. 2 First generate U, and then $X \mid U = u \sim \mathcal{U}\{x : u \leq f(x)\}$. 3 Generate (X, U) jointly, a smart idea: it allows us to generate data on a larger set where simulation is easier (see the sketch after this list). Not as simple as it looks. 81 / 111
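A small sketch of option 3, generating the pair jointly via the standard accept-reject trick: propose x uniformly on a bounding interval and u uniformly below a bound on f, and keep the pair only when u < f(x). The bounding box and the example density are my own illustrative choices.

```python
import random

def sample_under_curve(f, x_lo, x_hi, f_max, n=1):
    """Draw points uniformly from {(x, u) : x_lo <= x <= x_hi, 0 < u < f(x)}.
    By the Fundamental Theorem of Simulation, the accepted x's are then
    distributed according to f (up to normalization)."""
    samples = []
    while len(samples) < n:
        x = random.uniform(x_lo, x_hi)
        u = random.uniform(0.0, f_max)   # propose on the bounding box
        if u < f(x):                     # keep only the region under f
            samples.append(x)
    return samples

# Example: a triangular density on [0, 1], f(x) = 2x, whose maximum is 2
xs = sample_under_curve(lambda x: 2.0 * x, 0.0, 1.0, 2.0, n=1000)
print(sum(xs) / len(xs))   # should be near 2/3, the mean under f(x) = 2x
```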
  • 170. Images/cinvestav Markov Chains The random process $X_t \in S$ for t = 1, 2, ..., T has the Markov property iff $p(X_T \mid X_{T-1}, X_{T-2}, \ldots, X_1) = p(X_T \mid X_{T-1})$ Finite-state Discrete-Time Markov Chains, |S| < ∞ The chain can be completely specified by the transition matrix $P = [p_{ij}]$ with $p_{ij} = P[X_t = j \mid X_{t-1} = i]$ For irreducible chains, the stationary distribution π is the long-term proportion of time that the chain spends in each state, computed from $\pi = \pi P$. 82 / 111
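As a quick illustration of π = πP, the sketch below approximates the stationary distribution of a small transition matrix by repeatedly multiplying an initial distribution by P (power iteration); the two-state matrix is a made-up example of mine.

```python
import numpy as np

def stationary_distribution(P, iterations=1000):
    """Approximate the stationary distribution of an irreducible chain by
    iterating pi <- pi P until it stops changing."""
    n = P.shape[0]
    pi = np.ones(n) / n            # start from the uniform distribution
    for _ in range(iterations):
        pi = pi @ P
    return pi

# Hypothetical two-state chain: state 0 stays with prob 0.9, state 1 with 0.5
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
print(stationary_distribution(P))   # approximately [0.833, 0.167]
```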
  • 174. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 83 / 111
  • 175. Images/cinvestav The Idea Draw an i.i.d. set of samples $\{x^{(i)}\}_{i=1}^{N}$ From a target density p (x) on a high-dimensional space X (the set of possible configurations of a system). Thus, we can use those samples to approximate the target density $p_N(x) = \frac{1}{N}\sum_{i=1}^{N} \delta_{x^{(i)}}(x)$ where $\delta_{x^{(i)}}(x)$ denotes the Dirac delta mass located at $x^{(i)}$. 84 / 111
  • 178. Images/cinvestav Thus Given the Dirac delta $\delta(t) = \begin{cases} 0 & t \neq 0 \\ \infty & t = 0 \end{cases}$ with $\int_{t_1}^{t_2} \delta(t)\, dt = 1$ Therefore $\delta(t) = \lim_{\sigma \to 0} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{t^2}{2\sigma^2}}$ 85 / 111
  • 180. Images/cinvestav Consequently Approximate the integrals $I_N(f) = \frac{1}{N}\sum_{i=1}^{N} f\left(x^{(i)}\right) \xrightarrow[N \to \infty]{a.s.} I(f) = \int_{X} f(x)\, p(x)\, dx$ Where the variance of f (x) is $\sigma_f^2 = \mathbb{E}_{p(x)}\left[f^2(x)\right] - I^2(f) < \infty$ Therefore the variance of the estimator $I_N(f)$ is $\operatorname{var}\left(I_N(f)\right) = \frac{\sigma_f^2}{N}$ 86 / 111
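A minimal Monte Carlo integration sketch illustrating the estimator I_N(f) and the 1/N behaviour of its variance; the target density (a standard normal) and the test function are illustrative choices of mine.

```python
import random

def mc_estimate(f, sample_p, N=10_000):
    """I_N(f) = (1/N) sum_i f(x^(i)) with x^(i) drawn i.i.d. from p."""
    xs = [sample_p() for _ in range(N)]
    fs = [f(x) for x in xs]
    I_N = sum(fs) / N
    # Empirical sigma_f^2 / N, the estimator's variance
    var_IN = (sum(v * v for v in fs) / N - I_N ** 2) / N
    return I_N, var_IN

# Example: E[X^2] under a standard normal is 1
est, var = mc_estimate(lambda x: x * x, lambda: random.gauss(0.0, 1.0))
print(est, var)   # est near 1.0, variance shrinking like 1/N
```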
  • 183. Images/cinvestav Then By Robert & Casella 1999, Section 3.2 $\sqrt{N}\left(I_N(f) - I(f)\right) \underset{N \to \infty}{\Longrightarrow} \mathcal{N}\left(0, \sigma_f^2\right)$ 87 / 111
  • 185. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 88 / 111
  • 186. Images/cinvestav What if we do not wait for Gt TD methods need to wait only until the next time step At time t + 1, they use the observed reward Rt+1 and the estimate V (St+1) The simplest TD method $V(S_t) \leftarrow V(S_t) + \gamma\left[R_{t+1} + \alpha V(S_{t+1}) - V(S_t)\right]$ applied immediately on transition to St+1 and on receiving Rt+1 (see the sketch below). 89 / 111
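A tabular TD prediction sketch corresponding to the update above; the environment interface (reset/step) and the policy function are hypothetical, and as in the slides the step size is written gamma_step while the discount is alpha_discount.

```python
from collections import defaultdict

def td0_prediction(env, policy, episodes=1000, gamma_step=0.1, alpha_discount=0.9):
    """Tabular one-step TD: V(S) <- V(S) + gamma [R + alpha V(S') - V(S)].

    'env' is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done); 'policy(state)' returns an action.
    """
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else alpha_discount * V[s_next])
            V[s] += gamma_step * (target - V[s])   # one-step bootstrapped update
            s = s_next
    return V
```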
  • 188. Images/cinvestav This method is called TD(0) or one-step TD Actually There are also n-step TD methods Differences with the Monte Carlo Methods In the Monte Carlo update, the target is Gt In the TD update, the target is $R_{t+1} + \alpha V(S_{t+1})$ 90 / 111
  • 190. Images/cinvestav TD as a Bootstrapping Method Quite similar to Dynamic Programming Additionally, we know that $v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[R_{t+1} + \alpha G_{t+1} \mid S_t = s\right] = \mathbb{E}_\pi\left[R_{t+1} + \alpha v_\pi(S_{t+1}) \mid S_t = s\right]$ Monte Carlo methods use an estimate of $v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$ Instead, Dynamic Programming methods use an estimate of $v_\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \alpha v_\pi(S_{t+1}) \mid S_t = s\right]$. 91 / 111
  • 192. Images/cinvestav Furthermore The Monte Carlo target is an estimate because the expected value $\mathbb{E}_\pi\left[G_t \mid S_t\right]$ is not known; a sample return is used instead The DP target is an estimate because $v_\pi(S_{t+1})$ is not known and the current estimate, V (St+1), is used instead. Here TD goes beyond and combines both ideas It combines the sampling of Monte Carlo with the bootstrapping of DP. 92 / 111
  • 195. Images/cinvestav As in many methods there is an error And it is used to improve the policy $\delta_t = R_{t+1} + \alpha V(S_{t+1}) - V(S_t)$ Because the TD error depends on the next state and next reward It is not actually available until one time step later: δt is the error in V (St) that becomes available at time t + 1 93 / 111
  • 198. Images/cinvestav Finally If the array V does not change during the episode, The Monte Carlo error can be written as a sum of TD errors $G_t - V(S_t) = R_{t+1} + \alpha G_{t+1} - V(S_t) + \alpha V(S_{t+1}) - \alpha V(S_{t+1}) = \delta_t + \alpha\left(G_{t+1} - V(S_{t+1})\right) = \delta_t + \alpha\delta_{t+1} + \alpha^2\left(G_{t+2} - V(S_{t+2})\right) = \cdots = \sum_{k=t}^{T-1} \alpha^{k-t}\delta_k$ 94 / 111
  • 199. Images/cinvestav Optimality of TD(0) Under batch updating, TD(0) converges deterministically to a single answer, independent of the step size, as long as γ is chosen to be sufficiently small. The constant-α Monte Carlo method $V(S_t) \leftarrow V(S_t) + \gamma\left[G_t - V(S_t)\right]$ also converges deterministically under batch updating, but in general to a different answer. 95 / 111
  • 200. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 96 / 111
  • 201. Images/cinvestav Q-learning: Off-Policy TD Control One of the early breakthroughs in reinforcement learning The development of an off-policy TD control algorithm, Q-learning (Watkins, 1989) It is defined as $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \gamma\left[R_{t+1} + \alpha \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\right]$ Here The learned action-value function, Q, directly approximates q∗, the optimal action-value function, independent of the policy being followed. 97 / 111
  • 204. Images/cinvestav Furthermore The policy still has an effect in that it determines which state-action pairs are visited and updated. Therefore All that is required for correct convergence is that all pairs continue to be updated Under this assumption and a variant of the usual stochastic approximation conditions Q has been shown to converge with probability 1 to q∗ 98 / 111
  • 207. Images/cinvestav Q-learning (Off-Policy TD control) for estimating π ≈ π∗ Algorithm parameters step size γ ∈ (0, 1] and small ε > 0 Initialize Initialize Q(s, a), for all s ∈ S+, a ∈ A, arbitrarily except that Q (terminal, ·) = 0 99 / 111
  • 209. Images/cinvestav Then Loop for each episode 1 Initialize S 2 Loop for each step of episode: 3 Choose A from S using a policy derived from Q (for example, ε-greedy) 4 Take action A, observe R, S′ and update $Q(S, A) \leftarrow Q(S, A) + \gamma\left[R + \alpha \max_{a} Q(S', a) - Q(S, A)\right]$ 5 S ← S′ 6 Until S is terminal 100 / 111
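The pseudocode above translates almost line by line into the Python sketch below; the environment interface and the ε-greedy helper are assumptions of mine, with step size gamma_step and discount alpha_discount as in the slides.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, gamma_step=0.1, alpha_discount=0.9, eps=0.1):
    """Off-policy TD control sketch.  'env' is assumed to expose reset() and
    step(action) -> (next_state, reward, done); 'actions' is the action list."""
    Q = defaultdict(float)   # Q[(state, action)], terminal pairs stay at 0

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)               # explore
        return max(actions, key=lambda a: Q[(s, a)])    # exploit

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
            # Q(S,A) <- Q(S,A) + gamma [R + alpha max_a Q(S',a) - Q(S,A)]
            Q[(s, a)] += gamma_step * (r + alpha_discount * best_next - Q[(s, a)])
            s = s_next
    return Q
```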
  • 210. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 101 / 111
  • 211. Images/cinvestav One of the earliest successes Backgammon is a major game with numerous tournaments and regular world championship matches. It is a stochastic game with a full description of the world 102 / 111
  • 213. Images/cinvestav What do they used? TD-Gammon used a nonlinear form of TD (λ) An estimated value ˆv (s, w) is used to estimate the probability of winning from state s. Then To achieve this, rewards were defined as zero for all time steps except those on which the game is won. 103 / 111
  • 215. Images/cinvestav Implementation of the value function in TD-Gammon They took the decision to use a standard multilayer ANN 104 / 111
  • 216. Images/cinvestav The interesting part Tesauro’s representation For each point on the backgammon board Four units indicate the number of white pieces on the point Basically If there are no white pieces, all four units take zero values; if there is one white piece, the first unit takes the value one And so on... Therefore, with four units for white pieces and another four for black pieces at each of the 24 points We have 4·24 + 4·24 = 192 units 105 / 111
  • 219. Images/cinvestav Then We have that Two additional units encoded the number of white and black pieces on the bar (each took the value n/2, with n the number on the bar) Two other units represented the number of black and white pieces already removed from the board (these took the value n/15) Two final units indicated in a binary fashion whether it was white’s or black’s turn to move, for a total of 198 input units. 106 / 111
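A sketch of this 198-unit encoding, assuming the board is given as a list of 24 signed piece counts (positive for white, negative for black) plus bar/off counts; the exact data layout and the handling of more than three pieces on a point (here a commonly cited (n - 3)/2 excess code) are assumptions of mine, while the unit counts follow the description above.

```python
def encode_board(points, white_bar, black_bar, white_off, black_off, white_to_move):
    """Encode a backgammon position into 198 input units (Tesauro-style sketch).

    'points' is assumed to be a list of 24 ints: +n for n white pieces on the
    point, -n for n black pieces.  Each point contributes 4 units per color.
    """
    def point_units(n):
        # Truncated unary code: units 1..3 flag 1, 2 and 3 pieces; the 4th unit
        # encodes any excess beyond 3 (assumed here as (n - 3) / 2).
        return [1.0 if n >= 1 else 0.0,
                1.0 if n >= 2 else 0.0,
                1.0 if n >= 3 else 0.0,
                (n - 3) / 2.0 if n > 3 else 0.0]

    units = []
    for p in points:
        units += point_units(p if p > 0 else 0)       # white units for this point
        units += point_units(-p if p < 0 else 0)      # black units for this point
    units += [white_bar / 2.0, black_bar / 2.0]       # pieces on the bar (n/2)
    units += [white_off / 15.0, black_off / 15.0]     # pieces removed (n/15)
    units += [1.0, 0.0] if white_to_move else [0.0, 1.0]  # whose turn it is
    assert len(units) == 4 * 24 * 2 + 2 + 2 + 2       # = 198
    return units
```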
  • 220. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 107 / 111
  • 221. Images/cinvestav TD(λ) by Back-Propagation TD-Gammon uses a semi-gradient form of TD (λ) TD (λ) has the following general update rule: $w_{t+1} = w_t + \gamma\left[R_{t+1} + \alpha\hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t)\right] z_t$ with wt the vector of all modifiable parameters and zt a vector of eligibility traces Eligibility traces unify and generalize TD and Monte Carlo methods. They work as a kind of short-term memory The rough idea is that when a component of wt participates in producing an estimate, the corresponding component of zt is bumped up and then begins to fade away. 108 / 111
  • 224. Images/cinvestav How is this done? Update Rule $z_t = \alpha\lambda z_{t-1} + \nabla \hat{v}(S_t, w_t)$ with $z_0 = 0$ Something Notable The gradient in this equation can be computed efficiently by the back-propagation procedure For this, we set the error at the output of the ANN to $\hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t)$ 109 / 111
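A minimal semi-gradient TD(λ) sketch with a linear value function, showing the two update rules above (weight vector and eligibility trace); for a linear approximator the gradient of the value estimate is simply the feature vector, whereas in TD-Gammon that gradient comes from back-propagation through the network. The environment and feature interfaces are assumptions of mine, and the policy is assumed fixed (prediction setting).

```python
import numpy as np

def td_lambda_linear(env, features, n_features, episodes=500,
                     gamma_step=0.01, alpha_discount=1.0, lam=0.7):
    """Semi-gradient TD(lambda) with linear approximation v_hat(s, w) = w . x(s).

    'env' is assumed to expose reset() -> state and
    step() -> (next_state, reward, done); 'features(s)' returns a
    length-n_features numpy vector x(s).
    """
    w = np.zeros(n_features)
    for _ in range(episodes):
        s = env.reset()
        z = np.zeros(n_features)        # eligibility trace, z_0 = 0
        done = False
        while not done:
            s_next, r, done = env.step()
            x = features(s)
            v = w @ x
            v_next = 0.0 if done else w @ features(s_next)
            delta = r + alpha_discount * v_next - v     # TD error
            z = alpha_discount * lam * z + x            # z_t = alpha*lambda*z_{t-1} + grad v_hat
            w = w + gamma_step * delta * z              # w_{t+1} = w_t + gamma*delta*z_t
            s = s_next
    return w
```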
  • 227. Images/cinvestav Outline 1 Reinforcement Learning Introduction Example A K-Armed Bandit Problem Exploration vs Exploitation The Elements of Reinforcement Learning Limitations 2 Formalizing Reinforcement Learning Introduction Markov Process and Markov Decision Process Return, Policy and Value function 3 Solution Methods Multi-armed Bandits Problems of Implementations General Formula in Reinforcement Learning A Simple Bandit Algorithm Dynamic Programming Policy Evaluation Policy Iteration Policy and Value Improvement Example Temporal-Difference Learning A Small Review of Monte Carlo Methods The Main Idea The Monte Carlo Principle Going Back to Temporal Difference Q-learning: Off-Policy TD Control 4 The Neural Network Approach Introduction, TD-Gammon TD(λ) by Back-Propagation Conclusions 110 / 111
  • 228. Images/cinvestav Conclusions With these new tools, such as Reinforcement Learning The new Artificial Intelligence has a lot to do in the world Thanks for being in this class As they say, we have a lot to do... 111 / 111