Reinforcement Learning (RL) is a branch of Machine Learning in which an agent learns to choose optimal actions in different states in order to reach a specified goal, solely by interacting with the environment through trial and error. Unlike supervised learning, the agent does not get ground-truth examples of "correct" actions in given states. Instead, it has to use feedback from the environment (which can be sparse and delayed) to improve its policy over time. The formulation of the RL problem closely resembles the way human beings learn to act in different situations; hence RL is often considered a gateway to achieving the goal of Artificial General Intelligence.
This talk introduces the audience to key theoretical concepts: the formulation of the RL problem as a Markov Decision Process (MDP), and the solution of MDPs using dynamic programming and policy-gradient-based algorithms. State-of-the-art deep reinforcement learning algorithms are also covered, along with a case study of the application of reinforcement learning in robotics.
An Introduction to Reinforcement Learning - The Doors to AGI
1. Introduction to Reinforcement Learning - The Doors to AGI
IDLI (Indian Deep Learning Initiative) talk session
9-10 PM IST, 29 Apr 2017
Anirban Santara
santara.github.io
2. About me
• Anirban Santara
• Google India Ph.D. Fellow at IIT Kharagpur (2015-Present)
• Graduated with a B.Tech. in Electronics and Electrical Communication Engineering from IIT Kharagpur in 2015
• Working in Deep Learning for 3 years
3. Contents
1. Description of the Reinforcement Learning Problem
2. Algorithms for policy optimization
3. Highlights of recent developments
5. Reinforcement Learning
Reinforcement Learning refers to learning through trial and error using feedback from the environment.
[Diagram: the Agent sends an Action to the Environment and receives a Reward and a New State in return.]
6. Driving a Racing Car on TORCS
An example RL task
Reference: https://github.com/ugo-nama-kun/gym_torcs
State variables (X):
• Position on the track
• Distance from the track edges along different directions
• Direction of heading
• Current speed

Action variables (Y):
• Steering
• Acceleration
• Brake
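To make the interaction loop concrete, here is a minimal sketch in Python against a Gym-style environment. The TorcsEnv constructor flags, the env.end() cleanup call, and the random placeholder policy are assumptions for illustration; check the gym_torcs repository above for the exact API.

```python
import numpy as np
from gym_torcs import TorcsEnv  # assumed import, per the repository above

# Assumed constructor flags; consult the gym_torcs README for the exact API.
env = TorcsEnv(vision=False, throttle=True)
state = env.reset()             # X: track position, edge distances, heading, speed

done = False
while not done:
    # Y: [steering, acceleration, brake]; a random placeholder "policy"
    action = np.random.uniform(-1.0, 1.0, size=3)
    state, reward, done, _ = env.step(action)

env.end()                       # assumed cleanup call
```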
7. Comparison of ML paradigms
Supervised Learning
• Would require training examples of the form {(X_i, Y_i)}_{i=1}^N, where Y_i is the true/correct action that must be taken in state X_i

Unsupervised Learning
• Works only with the input state information X_i
• Does not use any kind of feedback from the environment regarding the performance of the agent

Reinforcement Learning
• Requires feedback from the environment in the form of reward signals
• Reward signals might be sparse and delayed
• But they should indicate the quality of the actions taken by the agent in different states
e.g. +1 if the car makes progress, -1 if it comes to a halt, -10 if it bumps into an obstacle, +100 if it finishes the race (a toy encoding of this reward signal is sketched below)
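A minimal sketch of how such a reward signal could be written in code; the boolean predicates here are hypothetical helpers, not part of gym_torcs:

```python
def reward(made_progress, halted, hit_obstacle, finished_race):
    """Toy reward shaping for the racing task, using the values above."""
    if finished_race:
        return 100.0   # large terminal bonus
    if hit_obstacle:
        return -10.0   # strong penalty for collisions
    if halted:
        return -1.0    # mild penalty for stopping
    return 1.0 if made_progress else 0.0
```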
8. Mathematical Formulation
Markov Decision Process (MDP)
RL problems are often specified in terms of a Markov Decision Process (MDP). An MDP is defined as M = (S, A, T, r, ρ_0, γ):
• State space S: the set of all possible states/configurations of the environment
• Action space A: the set of all possible actions
• Transition probability T: T(s_t, a_t) = P(s_{t+1} | s_t, a_t), the distribution over next states
• Reward function r: S × A → ℝ; we write r(s_t, a_t) = r_t
• Initial state distribution ρ_0: ρ_0(s) = P(s_0 = s)
• Temporal discount factor γ

"Markov" because it assumes:
P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0) = P(s_{t+1} | s_t, a_t) = T(s_t, a_t)
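As a concrete illustration, a tiny 2-state MDP (invented here purely for the example) can be written down directly as Python data, mirroring the tuple M = (S, A, T, r, ρ_0, γ):

```python
# Toy 2-state MDP written as plain Python data structures.
S = [0, 1]                            # state space
A = ["stay", "go"]                    # action space
T = {                                 # T[(s, a)] = {s_next: probability}
    (0, "stay"): {0: 1.0},
    (0, "go"):   {0: 0.2, 1: 0.8},    # "go" succeeds 80% of the time
    (1, "stay"): {1: 1.0},
    (1, "go"):   {1: 1.0},
}
r = {(s, a): (1.0 if s == 1 else 0.0) for s in S for a in A}  # reward in state 1
rho_0 = {0: 1.0, 1: 0.0}              # always start in state 0
gamma = 0.9                           # temporal discount factor
```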
9. Some more definitions
• Policy π: S → A: a function that predicts the action to take in a given state
• Trajectory τ: a sequence of (s_t, a_t, r_t) tuples that describes an episode of experience of an agent as it executes a policy:
τ = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_t, a_t, r_t, ..., s_T)
• Reward of a trajectory R(τ): a function of all the rewards received in a trajectory, e.g. R(τ) = Σ_t r_t or R(τ) = Σ_t γ^t r_t
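As a quick sketch, the discounted variant R(τ) = Σ_t γ^t r_t can be computed from the reward sequence of a recorded trajectory like this:

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r_t over one trajectory's rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A short episode whose only reward arrives at the end:
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.9^2 * 1.0 = 0.81
```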
10. Goal of RL
Find a policy π* that maximizes the expectation of the reward function R(τ) over trajectories τ:

π* = argmax_π E_τ[R(τ)]
12. Exploration-Exploitation Dilemma
• Goal: π* = argmax_π E_τ[R(τ)]
• At any particular point of time t during training, the agent has two options:
1. Exploit: act according to the policy π_t it has learned so far
2. Explore: take some random actions and check if there is a better alternative
• If the agent always
1. Exploits: it will remain stuck with the initial policy (usually random) and learn nothing!
2. Explores: it will keep acting randomly forever and learn nothing!
• The right tradeoff between exploration and exploitation is necessary for learning a successful policy (a standard ε-greedy implementation is sketched below)
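The most common way to implement this tradeoff is the ε-greedy rule used in the rest of this talk: explore with probability ε, exploit otherwise. A minimal sketch, where greedy_action is a hypothetical helper standing in for the current policy:

```python
import random

def epsilon_greedy(state, greedy_action, n_actions, eps=0.1):
    """Pick a random action with probability eps, else exploit the policy."""
    if random.random() < eps:
        return random.randrange(n_actions)  # explore
    return greedy_action(state)             # exploit
```

In practice ε is often annealed from a high value towards a small one over training, so the agent explores heavily at first and exploits more as its policy improves.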
13. Policy Gradient
• Objective of RL: π* = argmax_π E_τ[R(τ)]
• Policy gradient algorithms use a parameterized model (e.g. a neural network) of the policy, π_θ, where θ represents the set of parameters
• They perform gradient ascent on E_τ[R(τ)] to find the optimal set of parameters θ* such that:

θ* = argmax_θ E_τ[R(τ) | π_θ]
14. Pseudocode of Policy Gradient
1. Initialize policy π_θ with θ = θ_0; set t = 0
2. Generate N trajectories {τ_i}_{i=1}^N by acting randomly with probability ε (exploration) and according to π_{θ_t} with probability 1 − ε (exploitation)
3. Compute E_{τ_i}[R(τ)]
4. Update θ_{t+1} ← θ_t + η ∇_θ E_{τ_i}[R(τ)]
5. Go to step 2
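A self-contained sketch of this loop on an invented 3-state chain environment, using the REINFORCE estimator (∇_θ E_τ[R(τ)] = E_τ[R(τ) ∇_θ log π_θ]) to realize step 4 with a tabular softmax policy:

```python
import numpy as np

# Invented toy environment: a 3-state chain; moving right reaches a rewarding goal.
N_STATES, N_ACTIONS = 3, 2
GAMMA, LR, EPS = 0.99, 0.1, 0.1

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward, s_next == N_STATES - 1

theta = np.zeros((N_STATES, N_ACTIONS))     # logits of a softmax policy pi_theta

def policy(s):
    p = np.exp(theta[s] - theta[s].max())   # numerically stable softmax
    return p / p.sum()

for episode in range(500):
    # Step 2: roll out one trajectory with epsilon-greedy exploration.
    s, traj, done = 0, [], False
    for _ in range(100):                    # cap episode length
        a = np.random.randint(N_ACTIONS) if np.random.rand() < EPS \
            else np.random.choice(N_ACTIONS, p=policy(s))
        s_next, reward, done = step(s, a)
        traj.append((s, a, reward))
        s = s_next
        if done:
            break
    # Steps 3-4: gradient ascent using grad log pi(a|s) times the return-to-go.
    G = 0.0
    for s, a, reward in reversed(traj):
        G = reward + GAMMA * G
        grad_log = -policy(s)               # d log softmax / d theta[s]
        grad_log[a] += 1.0
        theta[s] += LR * G * grad_log
```

In practice π_θ is a neural network and the update is computed with an automatic-differentiation framework; the tabular softmax above keeps the gradient explicit.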
15. Value Functions
• State value function V^π(s): expected future reward obtained by starting at state s and acting according to policy π:

V^π(s) = E_τ[R(τ) | s_0 = s, π]

• State-action value function Q^π(s, a): expected future reward obtained by taking action a in state s and acting according to policy π thereafter:

Q^π(s, a) = E_τ[R(τ) | s_0 = s, a_0 = a, π]
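Both definitions suggest a simple Monte Carlo estimator: roll the policy out many times from the given starting point and average the discounted returns. A sketch, where env_step(s, a) -> (s_next, r, done) and policy(s) -> a are hypothetical helpers:

```python
import numpy as np

def mc_value_estimate(env_step, policy, s0, gamma=0.99, n_rollouts=100, horizon=100):
    """Monte Carlo estimate of V^pi(s0) as the mean discounted return."""
    returns = []
    for _ in range(n_rollouts):
        s, G, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            s, r, done = env_step(s, policy(s))
            G += discount * r
            discount *= gamma
            if done:
                break
        returns.append(G)
    return float(np.mean(returns))
```

Estimating Q^π(s, a) works the same way, with the first action forced to a.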
16. Optimum Value Functions
• Optimum state value function V*(s): V*(s) = max_π V^π(s)
• Optimum state-action value function Q*(s, a): Q*(s, a) = max_π Q^π(s, a)
17. Bellman Equations
• If R(τ) = Σ_t γ^t r_t, then the following identity holds:

Q*(s_t, a_t) = E_{s_{t+1}}[r_t + γ max_{a_{t+1}} Q*(s_{t+1}, a_{t+1})]   --(B1)

• Given Q*(s, a), the optimal policy can be evaluated simply by greedily choosing the action that maximizes Q* at every state s_t (Bellman optimality):

π*(s_t) = argmax_a Q*(s_t, a)   --(B2)
18. Value Iteration Algorithms
• Value iteration algorithms like Q-learning use the Bellman equation B1 to estimate Q*(s, a) iteratively:

Q_{i+1}(s_t, a_t) = E_{s_{t+1}}[r_t + γ max_{a_{t+1}} Q_i(s_{t+1}, a_{t+1})]

• As i → ∞, Q_i → Q*
19. Tabular Q-learning
• When the state and action spaces are discrete/finite, Q_i(s, a) can be represented as an |S| × |A| table
• This table can be updated iteratively using the Bellman equation B1
• Pseudocode (a runnable sketch follows the steps below):
1. Initialize the Q-table Q_0(s, a)
2. Sample N trajectories {τ_i}_{i=1}^N by acting randomly with probability ε (exploration) and according to π_i(s) = argmax_a Q_i(s, a) with probability 1 − ε (exploitation)
3. Update the Q-table as Q_{i+1}(s, a) = E_{s'}[r + γ max_{a'} Q_i(s', a')] for (s, a, r, s') ∈ τ_i, i = 1, 2, ..., N
4. Go to step 2
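A runnable sketch on the same kind of invented chain environment as before. One practical difference from step 3: instead of the exact expectation over s', the standard Q-learning update nudges each entry towards a single sampled target with a learning rate α:

```python
import numpy as np

# Invented toy chain environment: action 1 moves right towards a rewarding goal.
N_STATES, N_ACTIONS = 3, 2
GAMMA, ALPHA, EPS = 0.99, 0.5, 0.1

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward, s_next == N_STATES - 1

Q = np.zeros((N_STATES, N_ACTIONS))         # the |S| x |A| Q-table
for episode in range(500):
    s = 0
    for _ in range(100):                    # cap episode length
        # Step 2: epsilon-greedy behaviour policy.
        a = np.random.randint(N_ACTIONS) if np.random.rand() < EPS \
            else int(Q[s].argmax())
        s_next, reward, done = step(s, a)
        # Step 3: sample-based Bellman backup towards r + gamma * max_a' Q(s', a').
        target = reward + GAMMA * Q[s_next].max() * (not done)
        Q[s, a] += ALPHA * (target - Q[s, a])
        s = s_next
        if done:
            break

print(Q.argmax(axis=1))  # greedy action per state: states 0 and 1 prefer "right"
```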
20. Q-learning with Function Approximation
• When the state and action spaces are not discrete/finite, Q_i(s, a) can no longer be represented as an |S| × |A| table. Hence function approximation is used.
• This family of algorithms represents Q_i(s, a) as a function Q_θ: S × A → ℝ, where θ is the set of parameters that has to be learned.
• θ is learned by gradient descent on the error function defined as:

ℰ = E_{(s,a,r,s') ∈ {τ_i}_{i=1}^N}[(r + γ max_{a'} Q_θ(s', a') − Q_θ(s, a))²]

θ_{i+1} ← θ_i − η ∇_θ ℰ

• Deep Q-Network (DQN) uses a neural network as the function approximator, along with some tricks like Experience Replay to make the samples i.i.d.
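A minimal PyTorch sketch of one DQN-style update, assuming (purely for illustration) a 4-dimensional state, 2 discrete actions, and a replay buffer of (s, a, r, s', done) tuples with done stored as 0/1:

```python
import random
from collections import deque

import torch
import torch.nn as nn

GAMMA = 0.99
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)   # experience replay buffer of (s, a, r, s2, done)

def dqn_update(batch_size=32):
    """One gradient step on the squared Bellman error over a replayed batch."""
    batch = random.sample(replay, batch_size)
    s = torch.tensor([b[0] for b in batch], dtype=torch.float32)
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s2 = torch.tensor([b[3] for b in batch], dtype=torch.float32)
    done = torch.tensor([b[4] for b in batch], dtype=torch.float32)
    # TD target r + gamma * max_a' Q_theta(s', a'), held constant during the step.
    with torch.no_grad():
        target = r + GAMMA * q_net(s2).max(dim=1).values * (1.0 - done)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_theta(s, a)
    loss = ((target - q_sa) ** 2).mean()                   # squared Bellman error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A full DQN additionally keeps a periodically synced copy of q_net (the target network) for computing the TD target; that is omitted here for brevity.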
21. Policy Gradient vs. Value Iteration
Policy Gradient:
• Pros:
• Works well in conjunction with function approximation and continuous features
• Scales well in large state-action spaces
• Cons:
• Usually only a local optimum, not the global one, can be found

Value Iteration:
• Pros:
• If a complete optimal value function is known, the optimal policy can be followed simply by greedily choosing actions to optimize it
• Cons:
• Total coverage of the state-action space is necessary; if that does not happen, the method becomes brittle
• Unstable under function approximation in high-dimensional continuous state and action spaces
Reference: Kober et al., Reinforcement Learning in Robotics: A Survey, in Reinforcement Learning: State-of-the-Art, Springer, 2012.
22. Actor-Critic Algorithms
• Policy gradient algorithms are called actor-only algorithms because they directly try to deduce the optimal policy.
• Value iteration algorithms are called critic-only algorithms because they first observe and estimate the performance of choosing controls on the system (through the value function) and then derive a policy from it.
• Actor-critic algorithms incorporate the advantages of both:
• They have a policy gradient component called the actor, which computes policy gradients
• They also have a value function component called the critic, which observes the performance of the actor and decides when the policy needs to be updated and which action should be preferred
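A minimal PyTorch sketch of a one-step actor-critic update, again assuming a 4-dimensional state and 2 discrete actions (all sizes are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

GAMMA = 0.99
actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # policy logits
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))  # V(s)
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def actor_critic_update(s, a, r, s_next, done):
    """One transition: the critic scores it (TD error), the actor follows the score."""
    s = torch.tensor(s, dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)
    v = critic(s).squeeze()
    v_next = critic(s_next).squeeze().detach()
    td_error = r + GAMMA * v_next * (1.0 - done) - v       # critic's evaluation
    log_pi = torch.log_softmax(actor(s), dim=-1)[a]        # log pi(a | s)
    actor_loss = -log_pi * td_error.detach()               # policy gradient step
    critic_loss = td_error ** 2                            # value regression step
    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()
```

The TD error plays exactly the critic's role described above: a positive error reinforces the action just taken, a negative one suppresses it.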
26. OpenAI's agents that develop their own language to interact and achieve goals in a common world