Reinforcement Learning (RL) is a branch of Machine Learning in which an agent learns to choose optimal actions in different states in order to reach a specified goal, solely by interacting with the environment through trial and error. Unlike supervised learning, the agent does not get ground-truth examples of "correct" actions in given states. Instead, it has to use feedback from the environment (which can be sparse and delayed) to improve its policy over time. The formulation of the RL problem closely resembles the way human beings learn to act in different situations; hence RL is often considered a gateway to achieving the goal of Artificial General Intelligence.
This talk introduces the audience to key theoretical concepts: the formulation of the RL problem as a Markov Decision Process (MDP), and the solution of MDPs using dynamic programming and policy-gradient-based algorithms. State-of-the-art deep reinforcement learning algorithms are also covered, along with a case study of the application of reinforcement learning in robotics.
An Introduction to Reinforcement Learning - The Doors to AGI
1. Introduction to Reinforcement Learning - The Doors to AGI
IDLI (Indian Deep Learning Initiative) talk session
9-10 PM IST, 29 Apr 2017
Anirban Santara
santara.github.io
2. About me
• Anirban Santara
• Google India Ph.D. Fellow at IIT Kharagpur (2015-Present)
• Graduated with a B.Tech. in Electronics and Electrical Communication Engineering from IIT Kharagpur in 2015
• Working in Deep Learning for 3 years
3. Contents
1. Description of the Reinforcement Learning Problem
2. Algorithms for policy optimization
3. Highlights of recent developments
5. Reinforcement Learning
Reinforcement Learning refers to learning through trial and error using feedback from the environment.
[Diagram: the Agent sends an Action to the Environment and receives a Reward and a New State in return.]
6. Driving a Racing Car on TORCS
An example RL task
Reference: https://github.com/ugo-nama-kun/gym_torcs
State variables (X):
• Position on the track
• Distance from the track edges along different directions
• Direction of heading
• Current speed

Action variables (Y):
• Steering
• Acceleration
• Brake
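To make the interaction loop concrete, here is a minimal sketch in Python against a Gym-style environment. The TorcsEnv constructor flags, the env.end() cleanup call, and the random placeholder policy are assumptions for illustration; check the gym_torcs repository above for the exact API.

```python
import numpy as np
from gym_torcs import TorcsEnv  # assumed import, per the repository above

# Assumed constructor flags; consult the gym_torcs README for the exact API.
env = TorcsEnv(vision=False, throttle=True)
state = env.reset()             # X: track position, edge distances, heading, speed

done = False
while not done:
    # Y: [steering, acceleration, brake]; a random placeholder "policy"
    action = np.random.uniform(-1.0, 1.0, size=3)
    state, reward, done, _ = env.step(action)

env.end()                       # assumed cleanup call
```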
7. Comparison of ML paradigms
Supervised Learning
• Would require training examples of the form {(X_i, Y_i)}_{i=1}^N, where Y_i is the true/correct action that must be taken in state X_i

Unsupervised Learning
• Works only with the input state information X_i
• Does not use any kind of feedback from the environment regarding the performance of the agent

Reinforcement Learning
• Requires feedback from the environment in the form of reward signals
• Reward signals might be sparse and delayed
• But they should indicate the quality of the actions taken by the agent in different states
e.g. +1 if the car makes progress, -1 if it comes to a halt, -10 if it bumps into an obstacle, +100 if it finishes the race (a toy encoding of this reward signal is sketched below)
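A minimal sketch of how such a reward signal could be written in code; the boolean predicates here are hypothetical helpers, not part of gym_torcs:

```python
def reward(made_progress, halted, hit_obstacle, finished_race):
    """Toy reward shaping for the racing task, using the values above."""
    if finished_race:
        return 100.0   # large terminal bonus
    if hit_obstacle:
        return -10.0   # strong penalty for collisions
    if halted:
        return -1.0    # mild penalty for stopping
    return 1.0 if made_progress else 0.0
```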
8. Mathematical Formulation
Markov Decision Process (MDP)
RL problems are often specified in terms of a Markov Decision Process (MDP). An MDP is defined as M = (S, A, T, r, ρ_0, γ):
• State space S: the set of all possible states/configurations of the environment
• Action space A: the set of all possible actions
• Transition probability T: T(s_t, a_t) = P(s_{t+1} | s_t, a_t), the distribution over next states
• Reward function r: S × A → ℝ; we write r(s_t, a_t) = r_t
• Initial state distribution ρ_0: ρ_0(s) = P(s_0 = s)
• Temporal discount factor γ

"Markov" because it assumes:
P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0) = P(s_{t+1} | s_t, a_t) = T(s_t, a_t)
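As a concrete illustration, a tiny 2-state MDP (invented here purely for the example) can be written down directly as Python data, mirroring the tuple M = (S, A, T, r, ρ_0, γ):

```python
# Toy 2-state MDP written as plain Python data structures.
S = [0, 1]                            # state space
A = ["stay", "go"]                    # action space
T = {                                 # T[(s, a)] = {s_next: probability}
    (0, "stay"): {0: 1.0},
    (0, "go"):   {0: 0.2, 1: 0.8},    # "go" succeeds 80% of the time
    (1, "stay"): {1: 1.0},
    (1, "go"):   {1: 1.0},
}
r = {(s, a): (1.0 if s == 1 else 0.0) for s in S for a in A}  # reward in state 1
rho_0 = {0: 1.0, 1: 0.0}              # always start in state 0
gamma = 0.9                           # temporal discount factor
```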
9. Some more definitions
• Policy π: S → A: a function that predicts the action to take in a given state
• Trajectory τ: a sequence of (s_t, a_t, r_t) tuples that describes an episode of experience of an agent as it executes a policy:
τ = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_t, a_t, r_t, ..., s_T)
• Reward of a trajectory R(τ): a function of all the rewards received in a trajectory, e.g. R(τ) = Σ_t r_t or R(τ) = Σ_t γ^t r_t
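As a quick sketch, the discounted variant R(τ) = Σ_t γ^t r_t can be computed from the reward sequence of a recorded trajectory like this:

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r_t over one trajectory's rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A short episode whose only reward arrives at the end:
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.9^2 * 1.0 = 0.81
```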
10. Goal of RL
Find a policy π* that maximizes the expectation of the reward function R(τ) over trajectories τ:

π* = argmax_π E_τ[R(τ)]
12. Exploration-Exploitation Dilemma
• Goal: π* = argmax_π E_τ[R(τ)]
• At any particular point of time t during training, the agent has two options:
1. Exploit: act according to the policy π_t it has learned so far
2. Explore: take some random actions and check if there is a better alternative
• If the agent always
1. Exploits: it will remain stuck with the initial policy (usually random) and learn nothing!
2. Explores: it will keep acting randomly forever and learn nothing!
• The right tradeoff between exploration and exploitation is necessary for learning a successful policy (a standard ε-greedy implementation is sketched below)
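The most common way to implement this tradeoff is the ε-greedy rule used in the rest of this talk: explore with probability ε, exploit otherwise. A minimal sketch, where greedy_action is a hypothetical helper standing in for the current policy:

```python
import random

def epsilon_greedy(state, greedy_action, n_actions, eps=0.1):
    """Pick a random action with probability eps, else exploit the policy."""
    if random.random() < eps:
        return random.randrange(n_actions)  # explore
    return greedy_action(state)             # exploit
```

In practice ε is often annealed from a high value towards a small one over training, so the agent explores heavily at first and exploits more as its policy improves.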
13. Policy Gradient
• Objective of RL: π* = argmax_π E_τ[R(τ)]
• Policy gradient algorithms use a parameterized model (e.g. a neural network) of the policy, π_θ, where θ represents the set of parameters
• They perform gradient ascent on E_τ[R(τ)] to find the optimal set of parameters θ* such that:

θ* = argmax_θ E_τ[R(τ) | π_θ]
14. Pseudocode of Policy Gradient
1. Initialize policy π_θ with θ = θ_0; set t = 0
2. Generate N trajectories {τ_i}_{i=1}^N by acting randomly with probability ε (exploration) and according to π_{θ_t} with probability 1 − ε (exploitation)
3. Compute E_{τ_i}[R(τ)]
4. Update θ_{t+1} ← θ_t + η ∇_θ E_{τ_i}[R(τ)]
5. Go to step 2
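A self-contained sketch of this loop on an invented 3-state chain environment, using the REINFORCE estimator (∇_θ E_τ[R(τ)] = E_τ[R(τ) ∇_θ log π_θ]) to realize step 4 with a tabular softmax policy:

```python
import numpy as np

# Invented toy environment: a 3-state chain; moving right reaches a rewarding goal.
N_STATES, N_ACTIONS = 3, 2
GAMMA, LR, EPS = 0.99, 0.1, 0.1

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward, s_next == N_STATES - 1

theta = np.zeros((N_STATES, N_ACTIONS))     # logits of a softmax policy pi_theta

def policy(s):
    p = np.exp(theta[s] - theta[s].max())   # numerically stable softmax
    return p / p.sum()

for episode in range(500):
    # Step 2: roll out one trajectory with epsilon-greedy exploration.
    s, traj, done = 0, [], False
    for _ in range(100):                    # cap episode length
        a = np.random.randint(N_ACTIONS) if np.random.rand() < EPS \
            else np.random.choice(N_ACTIONS, p=policy(s))
        s_next, reward, done = step(s, a)
        traj.append((s, a, reward))
        s = s_next
        if done:
            break
    # Steps 3-4: gradient ascent using grad log pi(a|s) times the return-to-go.
    G = 0.0
    for s, a, reward in reversed(traj):
        G = reward + GAMMA * G
        grad_log = -policy(s)               # d log softmax / d theta[s]
        grad_log[a] += 1.0
        theta[s] += LR * G * grad_log
```

In practice π_θ is a neural network and the update is computed with an automatic-differentiation framework; the tabular softmax above keeps the gradient explicit.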
15. Value Functions
• State value function V^π(s): expected future reward obtained by starting at state s and acting according to policy π:

V^π(s) = E_τ[R(τ) | s_0 = s, π]

• State-action value function Q^π(s, a): expected future reward obtained by taking action a in state s and acting according to policy π thereafter:

Q^π(s, a) = E_τ[R(τ) | s_0 = s, a_0 = a, π]
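Both definitions suggest a simple Monte Carlo estimator: roll the policy out many times from the given starting point and average the discounted returns. A sketch, where env_step(s, a) -> (s_next, r, done) and policy(s) -> a are hypothetical helpers:

```python
import numpy as np

def mc_value_estimate(env_step, policy, s0, gamma=0.99, n_rollouts=100, horizon=100):
    """Monte Carlo estimate of V^pi(s0) as the mean discounted return."""
    returns = []
    for _ in range(n_rollouts):
        s, G, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            s, r, done = env_step(s, policy(s))
            G += discount * r
            discount *= gamma
            if done:
                break
        returns.append(G)
    return float(np.mean(returns))
```

Estimating Q^π(s, a) works the same way, with the first action forced to a.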
16. Optimum Value Functions
• Optimum state value function V*(s): V*(s) = max_π V^π(s)
• Optimum state-action value function Q*(s, a): Q*(s, a) = max_π Q^π(s, a)
17. Bellman Equations
• If R(τ) = Σ_t γ^t r_t, then the following identity holds:

Q*(s_t, a_t) = E_{s_{t+1}}[r_t + γ max_{a_{t+1}} Q*(s_{t+1}, a_{t+1})]   --(B1)

• Given Q*(s, a), the optimal policy can be evaluated simply by greedily choosing the action that maximizes Q* at every state s_t (Bellman optimality):

π*(s_t) = argmax_a Q*(s_t, a)   --(B2)
18. Value Iteration Algorithms
• Value iteration algorithms like Q-learning use the Bellman equation B1 to estimate Q*(s, a) iteratively:

Q_{i+1}(s_t, a_t) = E_{s_{t+1}}[r_t + γ max_{a_{t+1}} Q_i(s_{t+1}, a_{t+1})]

• As i → ∞, Q_i → Q*
19. Tabular Q-learning
• When the state and action spaces are discrete/finite, Q_i(s, a) can be represented as an |S| × |A| table
• This table can be updated iteratively using the Bellman equation B1
• Pseudocode (a runnable sketch follows the steps below):
1. Initialize the Q-table Q_0(s, a)
2. Sample N trajectories {τ_i}_{i=1}^N by acting randomly with probability ε (exploration) and according to π_i(s) = argmax_a Q_i(s, a) with probability 1 − ε (exploitation)
3. Update the Q-table as Q_{i+1}(s, a) = E_{s'}[r + γ max_{a'} Q_i(s', a')] for (s, a, r, s') ∈ τ_i, i = 1, 2, ..., N
4. Go to step 2
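A runnable sketch on the same kind of invented chain environment as before. One practical difference from step 3: instead of the exact expectation over s', the standard Q-learning update nudges each entry towards a single sampled target with a learning rate α:

```python
import numpy as np

# Invented toy chain environment: action 1 moves right towards a rewarding goal.
N_STATES, N_ACTIONS = 3, 2
GAMMA, ALPHA, EPS = 0.99, 0.5, 0.1

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward, s_next == N_STATES - 1

Q = np.zeros((N_STATES, N_ACTIONS))         # the |S| x |A| Q-table
for episode in range(500):
    s = 0
    for _ in range(100):                    # cap episode length
        # Step 2: epsilon-greedy behaviour policy.
        a = np.random.randint(N_ACTIONS) if np.random.rand() < EPS \
            else int(Q[s].argmax())
        s_next, reward, done = step(s, a)
        # Step 3: sample-based Bellman backup towards r + gamma * max_a' Q(s', a').
        target = reward + GAMMA * Q[s_next].max() * (not done)
        Q[s, a] += ALPHA * (target - Q[s, a])
        s = s_next
        if done:
            break

print(Q.argmax(axis=1))  # greedy action per state: states 0 and 1 prefer "right"
```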
20. Q-learning with Function Approximation
• When the state and action spaces are not discrete/finite, Q_i(s, a) can no longer be represented as an |S| × |A| table. Hence function approximation is used.
• This family of algorithms represents Q_i(s, a) as a function Q_θ: S × A → ℝ, where θ is the set of parameters that has to be learned.
• θ is learned by gradient descent on the error function defined as:

ℰ = E_{(s,a,r,s') ∈ {τ_i}_{i=1}^N}[(r + γ max_{a'} Q_θ(s', a') − Q_θ(s, a))²]

θ_{i+1} ← θ_i − η ∇_θ ℰ

• Deep Q-Network (DQN) uses a neural network as the function approximator, along with some tricks like Experience Replay to make the samples i.i.d.
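A minimal PyTorch sketch of one DQN-style update, assuming (purely for illustration) a 4-dimensional state, 2 discrete actions, and a replay buffer of (s, a, r, s', done) tuples with done stored as 0/1:

```python
import random
from collections import deque

import torch
import torch.nn as nn

GAMMA = 0.99
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)   # experience replay buffer of (s, a, r, s2, done)

def dqn_update(batch_size=32):
    """One gradient step on the squared Bellman error over a replayed batch."""
    batch = random.sample(replay, batch_size)
    s = torch.tensor([b[0] for b in batch], dtype=torch.float32)
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s2 = torch.tensor([b[3] for b in batch], dtype=torch.float32)
    done = torch.tensor([b[4] for b in batch], dtype=torch.float32)
    # TD target r + gamma * max_a' Q_theta(s', a'), held constant during the step.
    with torch.no_grad():
        target = r + GAMMA * q_net(s2).max(dim=1).values * (1.0 - done)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_theta(s, a)
    loss = ((target - q_sa) ** 2).mean()                   # squared Bellman error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A full DQN additionally keeps a periodically synced copy of q_net (the target network) for computing the TD target; that is omitted here for brevity.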
21. Policy Gradient vs. Value Iteration
Policy Gradient:
• Pros:
• Works well in conjunction with function approximation and continuous features
• Scales well in large state-action spaces
• Cons:
• Usually only a local optimum, not the global one, can be found

Value Iteration:
• Pros:
• If a complete optimal value function is known, the optimal policy can be followed simply by greedily choosing actions to optimize it
• Cons:
• Total coverage of the state-action space is necessary; if that does not happen, the method becomes brittle
• Unstable under function approximation in high-dimensional continuous state and action spaces
Reference: Kober et al., Reinforcement Learning in Robotics: A Survey, in Reinforcement Learning: State-of-the-Art, Springer, 2012.
22. Actor-Critic Algorithms
• Policy gradient algorithms are called actor-only algorithms because they directly try to deduce the optimal policy.
• Value iteration algorithms are called critic-only algorithms because they first observe and estimate the performance of choosing controls on the system (through the value function) and then derive a policy from it.
• Actor-critic algorithms incorporate the advantages of both:
• They have a policy gradient component called the actor, which computes policy gradients
• They also have a value function component called the critic, which observes the performance of the actor and decides when the policy needs to be updated and which action should be preferred
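A minimal PyTorch sketch of a one-step actor-critic update, again assuming a 4-dimensional state and 2 discrete actions (all sizes are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

GAMMA = 0.99
actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # policy logits
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))  # V(s)
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def actor_critic_update(s, a, r, s_next, done):
    """One transition: the critic scores it (TD error), the actor follows the score."""
    s = torch.tensor(s, dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)
    v = critic(s).squeeze()
    v_next = critic(s_next).squeeze().detach()
    td_error = r + GAMMA * v_next * (1.0 - done) - v       # critic's evaluation
    log_pi = torch.log_softmax(actor(s), dim=-1)[a]        # log pi(a | s)
    actor_loss = -log_pi * td_error.detach()               # policy gradient step
    critic_loss = td_error ** 2                            # value regression step
    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()
```

The TD error plays exactly the critic's role described above: a positive error reinforces the action just taken, a negative one suppresses it.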
26. OpenAI's agents that develop their own language to interact and achieve goals in a common world