Introduction to Reinforcement Learning - the Doors to AGI
IDLI (Indian Deep Learning Initiative) talk session
9-10 PM IST, 29 Apr 2017
Anirban Santara
Department of Computer Science and Engineering, IIT Kharagpur
santara.github.io
About me
• Anirban Santara
• Google India Ph.D. Fellow at IIT Kharagpur (2015-present)
• Graduated with a B.Tech. in Electronics and Electrical Communication Engineering from IIT Kharagpur in 2015
• Working in Deep Learning for 3 years
Contents
1. Description of the Reinforcement Learning Problem
2. Algorithms for policy optimization
3. Highlights of recent developments
Description of the RL Problem
Reinforcement Learning
Reinforcement Learning refers to learning through trial and error using feedback from the environment.
[Figure: the agent-environment loop - the Agent sends an Action to the Environment, which returns a Reward and a New State.]
Driving a Racing Car on TORCS
An example RL task
Reference: https://github.com/ugo-nama-kun/gym_torcs
State variables (X):
• Position on the track
• Distance from the track edges along different directions
• Direction of heading
• Current speed
Action variables (Y):
• Steering
• Acceleration
• Brake
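To make the state/action interface concrete, here is a minimal sketch of the agent-environment loop for such a task. The reset()/step() interface follows the Gym convention that gym_torcs also uses, but the agent class and the exact fields are illustrative assumptions, not the package's actual API.

```python
import numpy as np

class RandomDrivingAgent:
    """Placeholder agent: picks steering/acceleration/brake at random."""
    def __init__(self, rng=None):
        self.rng = rng or np.random.default_rng(0)

    def act(self, state):
        steering = self.rng.uniform(-1.0, 1.0)   # full left ... full right
        acceleration = self.rng.uniform(0.0, 1.0)
        brake = self.rng.uniform(0.0, 1.0)
        return np.array([steering, acceleration, brake])

def run_episode(env, agent, max_steps=1000):
    """One episode of the agent-environment loop from the figure above."""
    state = env.reset()                      # X: track position, speed, ...
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)            # Y: steering, acceleration, brake
        state, reward, done, info = env.step(action)
        total_reward += reward               # feedback from the environment
        if done:
            break
    return total_reward
```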
Comparison of ML paradigms
Supervised Learning
• Would require training examples of the form $\{(X_i, Y_i)\}_{i=1}^{N}$, where $Y_i$ is the true/correct action that must be taken in state $X_i$
Unsupervised Learning
• Works only with the input state information $X_i$
• Does not use any kind of feedback from the environment regarding the performance of the agent
Reinforcement Learning
• Requires feedback from the environment in the form of reward signals
• Reward signals might be sparse and delayed
• But they should indicate the quality of the actions taken by the agent in different states
e.g. +1 if the car makes progress, -1 if it comes to a halt, -10 if it bumps into an obstacle, +100 if it finishes the race
Mathematical Formulation
Markov Decision Process (MDP)
RL problems are often specified in terms of a Markov Decision Process (MDP). An MDP is defined as $\mathcal{M} = (S, A, T, r, \rho_0, \gamma)$:
• State space $S$: set of all possible states/configurations of the environment
• Action space $A$: set of all possible actions
• Transition probability $T$: $T(s_t, a_t) = P(s_{t+1} \mid s_t, a_t)$, a distribution over next states given the current state and action
• Reward function $r: S \times A \to \mathbb{R}$; we write $r(s_t, a_t) = r_t$
• Initial state distribution $\rho_0$; $\rho_0(s) = P(s_0 = s)$
• Temporal discount factor $\gamma$
"Markov" because it assumes:
$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0) = P(s_{t+1} \mid s_t, a_t) = T(s_t, a_t)$
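As a concrete illustration, a small finite MDP can be written down directly as numpy arrays. The 3-state "chain" below is invented purely for illustration; it is not from the slides.

```python
import numpy as np

n_states, n_actions = 3, 2          # S = {0, 1, 2}, A = {0: stay, 1: advance}
gamma = 0.9                          # temporal discount factor

# T[s, a, s'] = P(s' | s, a)
T = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    T[s, 0, s] = 1.0                          # "stay" keeps the state
    T[s, 1, min(s + 1, n_states - 1)] = 1.0   # "advance" moves right

# r(s, a): reward only for advancing out of the last-but-one state
r = np.zeros((n_states, n_actions))
r[n_states - 2, 1] = 1.0

rho0 = np.array([1.0, 0.0, 0.0])     # always start in state 0

# Sanity check: each T[s, a, :] must be a probability distribution
assert np.allclose(T.sum(axis=2), 1.0)
```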
Some more definitions
• Policy $\pi: S \to A$: a function that predicts actions for a given state
• Trajectory $\tau$: a sequence of $(s_t, a_t, r_t)$ tuples that describes an episode of experience of an agent as it executes a policy:
$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_t, a_t, r_t, \dots, s_T)$
• Reward of a trajectory $R(\tau)$: a function of all the rewards received in a trajectory,
e.g. $R(\tau) = \sum_t r_t$ or $R(\tau) = \sum_t \gamma^t r_t$
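A short sketch of generating a trajectory and computing both example forms of $R(\tau)$. The T, r, rho0 arrays are assumed to be the toy MDP from the previous sketch.

```python
import numpy as np

def rollout(T, r, rho0, policy, horizon=10, rng=np.random.default_rng(0)):
    """Returns a trajectory [(s_0, a_0, r_0), ..., (s_T, a_T, r_T)]."""
    s = rng.choice(len(rho0), p=rho0)        # sample s_0 from rho0
    tau = []
    for _ in range(horizon):
        a = policy(s)
        tau.append((s, a, r[s, a]))
        s = rng.choice(T.shape[2], p=T[s, a])  # sample s_{t+1} from T
    return tau

def undiscounted_return(tau):
    return sum(r_t for (_, _, r_t) in tau)           # R(tau) = sum_t r_t

def discounted_return(tau, gamma=0.9):
    return sum((gamma ** t) * r_t                    # R(tau) = sum_t gamma^t r_t
               for t, (_, _, r_t) in enumerate(tau))
```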
Goal of RL
Find a policy $\pi^*$ that maximizes the expectation of the reward function $R(\tau)$ over trajectories $\tau$:
$\pi^* = \arg\max_\pi \mathbb{E}_\tau[R(\tau)]$
Algorithms for Policy Learning
Exploration-Exploitation Dilemma
• Goal: $\pi^* = \arg\max_\pi \mathbb{E}_\tau[R(\tau)]$
• At any particular point of time $t$ during training, the agent has two options:
1. Exploit: act according to whatever policy $\pi_t$ it has learned so far
2. Explore: take some random actions and check if there is a better alternative
• If the agent always
1. Exploits - then it remains stuck with the initial (usually random) policy it had → learns nothing!
2. Explores - then it keeps acting randomly forever → learns nothing!
• The right tradeoff between exploration and exploitation is necessary for learning a successful policy (see the ε-greedy sketch below)
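The ε-greedy rule below is the simplest way to implement this tradeoff and is the one the later pseudocode on these slides assumes. The annealing schedule in the final comment is a common choice, not one prescribed by the slides.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=np.random.default_rng()):
    """q_values: 1-D array of Q(s, a) for the current state s."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: random action
    return int(np.argmax(q_values))              # exploit: greedy action

# Epsilon is often annealed over training so the agent explores early and
# exploits later, e.g. epsilon_t = max(0.05, 1.0 - t / 10000).
```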
Policy Gradient
• Objective of RL: $\pi^* = \arg\max_\pi \mathbb{E}_\tau[R(\tau)]$
• Policy gradient algorithms use a parameterized model (e.g. a neural network) of the policy, $\pi_\theta$, where $\theta$ represents the set of parameters
• They perform gradient ascent on $\mathbb{E}_\tau[R(\tau)]$ to find the optimal set of parameters $\theta^*$ such that:
$\theta^* = \arg\max_\theta \mathbb{E}_\tau[R(\tau) \mid \pi_\theta]$
Pseudocode for Policy Gradient (a concrete sketch of the gradient step follows the steps below)
1. Initialize policy $\pi_\theta$ with $\theta = \theta_0$; set $t = 0$
2. Generate $N$ trajectories $\{\tau_i\}_{i=1}^{N}$ by acting randomly with probability $\epsilon$ (exploration) and according to $\pi_{\theta_t}$ with probability $1 - \epsilon$ (exploitation)
3. Compute $\mathbb{E}_{\tau_i}[R(\tau)]$
4. Update $\theta_{t+1} \leftarrow \theta_t + \eta \, \nabla_\theta \mathbb{E}_{\tau_i}[R(\tau)]$
5. Go to step 2
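Step 4 leaves the gradient $\nabla_\theta \mathbb{E}_\tau[R(\tau)]$ abstract. A standard estimator, not spelled out on the slide, is the REINFORCE / score-function gradient: $\nabla_\theta \mathbb{E}_\tau[R(\tau)] \approx \frac{1}{N} \sum_i R(\tau_i) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$. Below is a minimal sketch for a softmax policy over a table of logits theta[s, a].

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def policy_gradient_update(theta, trajectories, eta=0.1):
    """theta: logits of shape [S, A]; trajectories: list of [(s, a, r), ...]."""
    grad = np.zeros_like(theta)
    for tau in trajectories:
        R = sum(r for (_, _, r) in tau)      # R(tau) = sum_t r_t
        for (s, a, _) in tau:
            probs = softmax(theta[s])
            grad_log = -probs                # grad of log softmax:
            grad_log[a] += 1.0               #   one-hot(a) - probs
            grad[s] += R * grad_log          # weight by trajectory reward
    grad /= len(trajectories)
    return theta + eta * grad                # gradient *ascent* on E[R]
```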
Value Functions
• State value function $V^\pi(s)$: expected future reward obtained by starting at state $s$ and acting according to policy $\pi$:
$V^\pi(s) = \mathbb{E}_\tau[R(\tau) \mid s_0 = s, \pi]$
• State-action value function $Q^\pi(s, a)$: expected future reward obtained by taking action $a$ in state $s$ and acting according to policy $\pi$ thereafter:
$Q^\pi(s, a) = \mathbb{E}_\tau[R(\tau) \mid s_0 = s, a_0 = a, \pi]$
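Since both definitions are expectations over trajectories, they can be estimated by simple Monte Carlo averaging. In this sketch, rollout_from(s, pi) is a hypothetical helper (not from the slides) that runs one episode from state s under policy π and returns the list of rewards received.

```python
import numpy as np

def mc_state_value(rollout_from, s, pi, gamma=0.9, n_samples=1000):
    """Monte Carlo estimate of V_pi(s): average the discounted return over
    many trajectories that start at s and then follow pi."""
    returns = []
    for _ in range(n_samples):
        rewards = rollout_from(s, pi)                       # [r_0, r_1, ...]
        returns.append(sum(gamma**t * r for t, r in enumerate(rewards)))
    return float(np.mean(returns))                          # ~ V_pi(s)
```

Estimating $Q^\pi(s, a)$ works the same way, except the first action is pinned to $a$ before following $\pi$.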
Optimum Value Functions
• Optimum state value function $V^*(s)$: $V^*(s) = \max_\pi V^\pi(s)$
• Optimum state-action value function $Q^*(s, a)$: $Q^*(s, a) = \max_\pi Q^\pi(s, a)$
Bellman Equations
• If $R(\tau) = \sum_t \gamma^t r_t$, the following identity holds:
$Q^*(s_t, a_t) = \mathbb{E}_{s_{t+1}}[\, r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \,]$ -- (B1)
• Given $Q^*(s, a)$, the optimal policy can be evaluated simply by greedily choosing the action that maximizes $Q^*$ at every state $s_t$ (Bellman optimality):
$\pi^*(s_t) = \arg\max_a Q^*(s_t, a)$ -- (B2)
Value Iteration Algorithms
• Value iteration algorithms like Q-learning use the Bellman equation (B1) to estimate $Q^*(s, a)$ iteratively:
$Q_{i+1}(s_t, a_t) = \mathbb{E}_{s_{t+1}}[\, r_t + \gamma \max_{a_{t+1}} Q_i(s_{t+1}, a_{t+1}) \,]$
• As $i \to \infty$, $Q_i \to Q^*$ (a model-based sketch of this recursion follows below)
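When the transition model T and reward r are known (as in the toy MDP sketched earlier), the expectation over $s_{t+1}$ can be computed exactly as a sum over next states. A minimal sketch:

```python
import numpy as np

def q_value_iteration(T, r, gamma=0.9, n_iters=100):
    """T: [S, A, S'] transition probabilities; r: [S, A] rewards."""
    n_states, n_actions = r.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # Q_{i+1}(s,a) = r(s,a) + gamma * sum_{s'} T(s,a,s') * max_{a'} Q_i(s',a')
        Q = r + gamma * (T @ Q.max(axis=1))
    return Q   # approaches Q* as n_iters grows
```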
Tabular Q-learning
• When the state and action spaces are discrete/finite, $Q_i(s, a)$ can be represented as a $|S| \times |A|$ table
• This table can be updated iteratively using the Bellman equation (B1)
• Pseudocode (a sample-based sketch follows the steps below):
1. Initialize the Q-table, $Q_0(s, a)$
2. Sample $N$ trajectories $\{\tau_i\}_{i=1}^{N}$ by acting randomly with probability $\epsilon$ (exploration) and according to $\pi_i(s) = \arg\max_a Q_i(s, a)$ with probability $1 - \epsilon$ (exploitation)
3. Update the Q-table as $Q_{i+1}(s, a) = \mathbb{E}_{s'}[\, r + \gamma \max_{a'} Q_i(s', a') \,]$ for $(s, a, r, s') \in \tau_i$, $i = 1, 2, \dots, N$
4. Go to step 2
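A sketch of sample-based tabular Q-learning. The exact expectation over $s'$ in step 3 is replaced here by the standard stochastic approximation with a learning rate alpha; the Gym-style reset()/step() interface with integer states and actions is an assumption, not a fixed API.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, n_episodes=5000,
                       alpha=0.1, gamma=0.99, epsilon=0.1,
                       rng=np.random.default_rng(0)):
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy (exploration vs. exploitation)
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, reward, done, _ = env.step(a)
            # sampled Bellman backup towards r + gamma * max_a' Q(s', a')
            target = reward + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```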
Q-learning with Function Approximation
• When the state and action spaces are not discrete/finite, $Q_i(s, a)$ can no longer be represented as a $|S| \times |A|$ table. Hence function approximation is used.
• This family of algorithms represents $Q_i(s, a)$ as a function $Q_\theta: S \times A \to \mathbb{R}$, where $\theta$ is the set of parameters that has to be learned.
• $\theta$ is learned by gradient descent on the error function defined as:
$\mathcal{E} = \mathbb{E}_{(s,a,r,s') \in \{\tau_i\}_{i=1}^{N}} \big[ \big( r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \big)^2 \big]$
$\theta_{i+1} \leftarrow \theta_i - \eta \, \nabla_\theta \mathcal{E}$
• Deep Q-Network (DQN) uses a neural network as the function approximator, along with some hacks like Experience Replay to make the samples i.i.d.
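A minimal sketch of one such gradient step, using a linear approximator $Q_\theta(s, a) = \theta_a^\top \phi(s)$ in place of a deep network so the update fits in a few lines. Treating the bootstrap target as a constant when differentiating (the usual semi-gradient practice, also used in DQN) is an assumption made explicit in the comments; phi is a hypothetical feature map.

```python
import numpy as np

def q_learning_fa_step(theta, batch, phi, gamma=0.99, eta=1e-2):
    """theta: [n_actions, n_features]; batch: list of (s, a, r, s') tuples
    sampled e.g. from a replay buffer; phi(s): feature vector of state s."""
    grad = np.zeros_like(theta)
    for (s, a, r, s_next) in batch:
        x, x_next = phi(s), phi(s_next)
        # bootstrap target, treated as a constant w.r.t. theta (semi-gradient)
        target = r + gamma * np.max(theta @ x_next)
        td_error = target - theta[a] @ x
        grad[a] += -2.0 * td_error * x        # d/dtheta of the squared error
    return theta - eta * grad / len(batch)    # gradient descent on E
```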
Policy Gradient vs. Value Iteration
Policy Gradient:
• Pros:
• Works well in conjunction with function approximation and continuous features
• Scales well to large state-action spaces
• Cons:
• Usually only a local optimum, not the global one, can be found
Value Iteration:
• Pros:
• If a complete optimal value function is known, the optimal policy can be followed simply by greedily choosing actions to optimize it
• Cons:
• Total coverage of the state-action space is necessary - if that does not happen, the method becomes brittle
• Unstable under function approximation in high-dimensional continuous state and action spaces
Reference: Kober et al., "Reinforcement Learning in Robotics: A Survey", Reinforcement Learning: State-of-the-Art, Springer, 2012
Actor-Critic Algorithms
• Policy gradient algorithms are called actor-only algorithms because they directly try to deduce the optimal policy.
• Value iteration algorithms are called critic-only algorithms because they first observe and estimate the performance of choosing controls on the system (through the value function) and then derive a policy from it.
• Actor-critic algorithms combine the advantages of the two:
• They have a policy gradient component called the actor, which computes policy gradients
• They also have a value function component called the critic, which observes the performance of the actor and decides when the policy needs to be updated and which action should be preferred (a one-step sketch follows below)
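A minimal sketch of a one-step actor-critic update with tabular parameters, as one concrete instance of this idea (the slides do not prescribe a specific variant): the critic learns $V(s)$ by TD(0), and its TD error tells the actor how much better or worse than expected the chosen action turned out to be.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      gamma=0.99, eta_actor=0.1, eta_critic=0.1):
    """theta: actor logits [S, A]; V: critic value estimates [S]."""
    # Critic: TD(0) error and value update
    td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
    V[s] += eta_critic * td_error
    # Actor: policy-gradient step weighted by the critic's TD error
    probs = softmax(theta[s])
    grad_log = -probs                 # grad of log softmax: one-hot(a) - probs
    grad_log[a] += 1.0
    theta[s] += eta_actor * td_error * grad_log
    return theta, V
```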
Recent Breakthroughs
Google DeepMind beats humans at Atari games with Deep Q-learning (2015)
Google DeepMind AlphaGo beats legendary Go player Lee Sedol
OpenAI's robots that develop their own language to interact and achieve goals in a common world
Thank You