Actor-Critic Algorithm
Jie-Han Chen
NetDB, National Cheng Kung University
5/29, 2018 @ National Cheng Kung University, Taiwan
1
Some content and images in these slides were borrowed from:
1. Sergey Levine’s Deep Reinforcement Learning course at UC Berkeley
2. David Silver’s Reinforcement Learning course at UCL
Disclaimer
2
Outline
● Recap policy gradient
● Actor-Critic algorithm
● Recommended Papers
3
Recap: policy gradient
REINFORCE algorithm:
1. Sample trajectories τⁱ from πθ(a|s)
2. Estimate the gradient: ∇θ J(θ) ≈ Σᵢ Σₜ ∇θ log πθ(aₜⁱ|sₜⁱ) R(τⁱ)
3. Update the parameters: θ ← θ + α ∇θ J(θ)
4
Recap: policy gradient
REINFORCE algorithm:
1. Sample trajectories τⁱ from πθ(a|s)
2. Estimate the gradient: ∇θ J(θ) ≈ Σᵢ Σₜ ∇θ log πθ(aₜⁱ|sₜⁱ) R(τⁱ)
3. Update the parameters: θ ← θ + α ∇θ J(θ)
Issue: we need to sample the whole trajectory to obtain the return term (Monte Carlo).
This makes policy gradient learn slowly.
5
Recap: policy gradient
REINFORCE algorithm:
1. Sample trajectories τⁱ from πθ(a|s)
2. Estimate the gradient: ∇θ J(θ) ≈ Σᵢ Σₜ ∇θ log πθ(aₜⁱ|sₜⁱ) R(τⁱ)
3. Update the parameters: θ ← θ + α ∇θ J(θ)
Can we learn step by step?
6
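The three steps above can be sketched with a tabular softmax policy on a toy environment. Everything here — the environment, its reward, and the hyperparameters — is an illustrative assumption, not something from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))  # logits of a tabular softmax policy

def policy(s):
    # pi_theta(a|s): softmax over the action logits of state s
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def step(s, a):
    # toy environment: action 1 moves right and always gives reward 1
    return min(s + a, n_states - 1), float(a)

alpha, gamma, horizon = 0.1, 0.99, 8
for episode in range(200):
    s, traj = 0, []
    for t in range(horizon):                 # 1. sample a trajectory from pi_theta
        a = rng.choice(n_actions, p=policy(s))
        s_next, r = step(s, a)
        traj.append((s, a, r))
        s = s_next
    G = sum(gamma**t * r for t, (_, _, r) in enumerate(traj))
    for s_t, a_t, _ in traj:                 # 2. grad log pi(a|s) * R(tau)
        grad_logp = -policy(s_t)
        grad_logp[a_t] += 1.0                # d log softmax / d logits
        theta[s_t] += alpha * grad_logp * G  # 3. gradient ascent on J(theta)

print(policy(0))  # action 1 should dominate after training
```

Note how the update waits for the whole trajectory before computing G — exactly the slowness the slide points out.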
Actor-Critic algorithm
In vanilla policy gradient, we can only evaluate our policy after finishing the whole
episode.
7
Why don't you tell me how good
the policy is at an earlier
stage?
Actor-Critic algorithm
In vanilla policy gradient, we can only evaluate our policy after finishing the whole
episode.
Hello, the critic is here. I’ll give
you a score at every step.
Why don't you tell me how good
the policy is at an earlier
stage?
8
Actor-Critic algorithm
Objective of vanilla policy gradient:
The return at step t in the policy gradient with causality can
be replaced by the expected action-value function.
If we can estimate the action-value function at each step, we
can improve learning efficiency with TD learning.
9
Actor-Critic algorithm
Policy gradient with causality:
Question: How do we get the action-value function Q?
10
Actor-Critic algorithm
Policy gradient with causality:
Question: How do we get the action-value function Q?
Use another neural network, called the critic, to
approximate the value function. This is the so-called
Actor-Critic method. With a critic network, we can update
the policy step by step. However, it also introduces
bias.
Policy Network
Critic Network
11
Actor-Critic algorithm
What kind of value function should the critic network approximate?
12
Actor-Critic algorithm
What kind of value function should the critic network approximate?
We usually fit the state-value function V; I’ll show you the reason soon. (Other choices also work.)
13
Actor-Critic algorithm
The objective of Actor-Critic
This version of the objective has lower variance
but higher bias than REINFORCE when we learn by
TD.
Can we also subtract a baseline to reduce the variance?
14
Actor-Critic algorithm
The objective of Actor-Critic:
This version of the objective has lower variance
but higher bias than REINFORCE when we learn by
TD.
Can we also subtract a baseline to reduce the variance?
Yes! We can subtract the state-value function V(s):
15
Actor-Critic algorithm
The objective of Actor-Critic with a value-function baseline:
Q(sₜ, aₜ): how good the action we take from the current state is.
V(sₜ): the expected return when an agent faces the same state.
A(sₜ, aₜ) = Q(sₜ, aₜ) − V(sₜ): the advantage function, which reflects how
much better the action we’ve taken is compared to the other
candidates.
16
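As a small numeric illustration of the advantage, here is a one-step TD estimate Â(s, a) = r + γV(s′) − V(s). The state names and critic values are hypothetical, chosen just to make the arithmetic visible:

```python
gamma = 0.99
V = {"s0": 1.0, "s1": 2.0}   # hypothetical critic values for two states

def advantage(r, s, s_next, done=False):
    # one-step TD advantage: A(s, a) ~= r + gamma * V(s') - V(s)
    target = r + (0.0 if done else gamma * V[s_next])
    return target - V[s]

print(advantage(0.5, "s0", "s1"))             # 0.5 + 0.99*2.0 - 1.0 = 1.48
print(advantage(1.0, "s1", "s0", done=True))  # terminal: 1.0 - 2.0 = -1.0
```

A positive advantage means the action did better than the critic expected from that state; a negative one means it did worse.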
Actor-Critic algorithm
17
Actor-Critic algorithm
18
Actor-Critic algorithm
19
Actor-Critic algorithm
20
Actor-Critic algorithm
21
Actor-Critic algorithm
We just fit the value function!
22
Actor-Critic algorithm: fit value function
Monte Carlo evaluation:
We sample multiple trajectories and use the empirical return yᵢ,ₜ = Σₜ′≥ₜ r(sₜ′, aₜ′) as the target. Then, compute the loss by supervised regression: L(φ) = ½ Σᵢ ‖V̂φ(sᵢ) − yᵢ‖²
23
Actor-Critic algorithm: fit value function
TD evaluation:
Training sample: yᵢ,ₜ = r(sₜ, aₜ) + γV̂φ(sₜ₊₁). Then, compute the loss by supervised regression: L(φ) = ½ Σᵢ ‖V̂φ(sᵢ) − yᵢ‖²
24
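The two regression targets can be sketched side by side. The reward values are made up; in practice V̂φ would be a neural network fit by minimizing the squared error against these targets:

```python
gamma = 0.99

def mc_targets(rewards):
    # Monte Carlo targets: y_t = sum_{t' >= t} gamma**(t' - t) * r_t'
    G, ys = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        ys.append(G)
    return ys[::-1]

def td_target(r, v_next, done=False):
    # TD(0) target: y = r + gamma * V(s'), bootstrapped from the critic
    return r + (0.0 if done else gamma * v_next)

print(mc_targets([1.0, 0.0, 2.0]))   # ~ [2.9602, 1.98, 2.0]
print(td_target(1.0, v_next=1.5))    # ~ 2.485
```

The MC target needs the whole episode; the TD target needs only one step plus the critic’s own estimate — that is what allows step-by-step learning, at the cost of bias.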
Actor-Critic algorithm
Online actor-critic algorithm:
1. Take an action a ~ πθ(a|s), get one-step experience (s, a, s′, r)
2. Fit the value function V̂φ(s)
3. Evaluate the advantage: Â(s, a) = r + γV̂φ(s′) − V̂φ(s)
4. Compute the gradient: ∇θ J(θ) ≈ ∇θ log πθ(a|s) Â(s, a)
5. Update: θ ← θ + α ∇θ J(θ)
25
Referenced from CS294 in UCB.
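The five steps can be sketched as one loop, with a tabular actor and a tabular critic on a toy environment (environment, reset scheme, and all hyperparameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))  # actor: tabular softmax policy
V = np.zeros(n_states)                   # critic: tabular value function

def policy(s):
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def step(s, a):
    # toy environment: action 1 always gives reward 1, action 0 gives 0
    return min(s + a, n_states - 1), float(a)

alpha_pi, alpha_v, gamma = 0.05, 0.1, 0.9
s = 0
for t in range(5000):
    # 1. take an action, get one-step experience (s, a, s', r)
    a = rng.choice(n_actions, p=policy(s))
    s_next, r = step(s, a)
    # 2. fit the value function with a TD(0) update
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * td_error
    # 3. the TD error estimates the advantage A(s, a)
    # 4.-5. policy gradient step: theta += alpha * grad log pi(a|s) * A
    grad_logp = -policy(s)
    grad_logp[a] += 1.0
    theta[s] += alpha_pi * grad_logp * td_error
    s = s_next if (t + 1) % 10 else 0    # periodic reset keeps early states visited

print(policy(0), policy(3))
```

Unlike the REINFORCE sketch, every environment step produces an update — no waiting for the episode to end.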
Actor-Critic algorithm
26
Network Architecture
Neural architecture plays an important role in deep learning. In the Actor-Critic
algorithm, there are two common network architectures:
● Separate policy and critic networks
● Two-head network
27
Network Architecture
Separate policy and critic networks
● More parameters
● More stable training
policy network
critic network
28
Two-head network
● Shared features, fewer parameters
● Hard to find a good coefficient to balance the actor loss and the critic loss
Network Architecture
Value head
Policy head
29
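A minimal forward-pass sketch of a two-head network and its combined loss. All dimensions, weight shapes, and the coefficient name c_v are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hidden, n_actions = 8, 16, 4

# shared trunk plus two heads: policy logits and a scalar value
W_trunk = rng.normal(0, 0.1, (obs_dim, hidden))
W_pi = rng.normal(0, 0.1, (hidden, n_actions))
W_v = rng.normal(0, 0.1, (hidden, 1))

def forward(obs):
    h = np.tanh(obs @ W_trunk)        # shared features
    logits = h @ W_pi                 # policy head
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    value = (h @ W_v).item()          # value head
    return probs, value

def joint_loss(obs, action, target_v, advantage, c_v=0.5):
    # total loss = actor loss + c_v * critic loss; c_v balances the two heads
    probs, value = forward(obs)
    actor_loss = -np.log(probs[action]) * advantage
    critic_loss = (target_v - value) ** 2
    return actor_loss + c_v * critic_loss

obs = rng.normal(size=obs_dim)
print(joint_loss(obs, action=0, target_v=1.0, advantage=0.5))
```

The single scalar loss is what makes the balancing coefficient tricky: one gradient step updates the trunk on behalf of both heads at once.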
AlphaGO
● MCTS, Actor-Critic algorithm
● Separate network architecture
30
AlphaGo Zero
● MCTS, Policy Iteration
● Shared ResNet
31
Correlation Issue
Online actor-critic algorithm:
1. Take an action a ~ πθ(a|s), get one-step experience (s, a, s′, r)
2. Fit the value function V̂φ(s)
3. Evaluate the advantage: Â(s, a) = r + γV̂φ(s′) − V̂φ(s)
4. Compute the gradient: ∇θ J(θ) ≈ ∇θ log πθ(a|s) Â(s, a)
5. Update: θ ← θ + α ∇θ J(θ)
In the online actor-critic algorithm, there is still a
correlation problem: consecutive samples are highly correlated.
Policy gradient is an on-policy algorithm,
so we cannot use a replay buffer to
break the correlation.
Think about how to solve the correlation
problem in Actor-Critic.
32
Advanced Actor-Critic algorithms
Currently, many state-of-the-art RL algorithms are built on top of the
Actor-Critic algorithm:
● Asynchronous Advantage Actor-Critic (A3C)
● Synchronous Advantage Actor-Critic (A2C)
● Trust Region Policy Optimization (TRPO)
● Proximal Policy Optimization (PPO)
● Deep Deterministic Policy Gradient (DDPG)
33
Advanced Actor-Critic algorithms
Asynchronous Advantage Actor-Critic (A3C)
V. Mnih et al. (ICML 2016) proposed a parallel version of the Actor-Critic algorithm that not only mitigates the
correlation problem but also speeds up the learning process: the so-called A3C (Asynchronous Advantage
Actor-Critic). A3C became the state-of-the-art baseline in 2016 and 2017.
34
Asynchronous Advantage Actor-Critic
A3C uses multiple workers to sample n-step experience.
Each worker keeps a local copy of a shared global
network.
Upon collecting enough experience, each worker
computes the gradients of its local network, applies
those gradients to the shared global network
asynchronously, and then syncs its local copy with the
updated global parameters.
35
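The worker loop can be caricatured with Python threads and a shared parameter vector. The "gradient" below is a made-up contraction toward a fixed target, not a real policy gradient, and real A3C applies updates lock-free (Hogwild!-style); the lock here just keeps the sketch tidy:

```python
import threading
import numpy as np

global_params = np.zeros(4)   # shared global network parameters
lock = threading.Lock()

def worker(worker_id, n_updates=100):
    global global_params
    rng = np.random.default_rng(worker_id)
    for _ in range(n_updates):
        local = global_params.copy()   # sync local network with the global one
        # stand-in "gradient" from n-step experience: pull params toward ones
        grad = 0.1 * (np.ones(4) - local) + rng.normal(0, 0.01, 4)
        with lock:
            global_params += grad      # apply the gradient to the global network

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(global_params)  # pulled close to the target vector of ones
```

Each worker computes its gradient against a possibly stale local copy, which is exactly the asynchrony that decorrelates the updates.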
Asynchronous Advantage Actor-Critic
● Asynchronous Advantage Actor-Critic
○ proposed by DeepMind
○ uses n-step bootstrapping
○ easier to implement (no locks needed)
○ some variability in exploration between
workers
○ uses only CPUs, poor GPU usage
36
Synchronous Advantage Actor-Critic
● Synchronous Advantage Actor-Critic
○ proposed by OpenAI
○ synchronous workers
○ uses n-step returns
○ better GPU usage
37
The difference between A3C and A2C
38
Applications of the Actor-Critic algorithm
Gaming, model research · Robotics, continuous control
39
The devil is hidden in the details
When we do research on reinforcement
learning, computational resources need to
be considered.
A3C/A2C use a lot of memory to cache the
forward-pass state for each worker, especially
with n-step bootstrapping.
40
From StarCraft II: A New Challenge for Reinforcement Learning
Related Papers
1. Volodymyr Mnih et al. (ICML 2016): A3C, Asynchronous Methods for Deep Reinforcement Learning
2. John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel (ICML 2015): TRPO,
Trust Region Policy Optimization
3. John Schulman et al. (2017): PPO, Proximal Policy Optimization Algorithms
4. Timothy P. Lillicrap et al. (ICLR 2016): DDPG, Continuous Control with Deep Reinforcement Learning
5. David Silver, Aja Huang et al. (Nature 2016): AlphaGo, Mastering the Game of Go with Deep Neural
Networks and Tree Search
6. David Silver et al. (Nature 2017): AlphaGo Zero, Mastering the Game of Go without Human Knowledge
7. Emilio Parisotto, Jimmy Lei Ba, Ruslan Salakhutdinov (ICLR 2016): Actor-Mimic: Deep Multitask and
Transfer Reinforcement Learning
41

Más contenido relacionado

La actualidad más candente

An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learningJie-Han Chen
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningKhaled Saleh
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningDing Li
 
Deep Reinforcement Learning: Q-Learning
Deep Reinforcement Learning: Q-LearningDeep Reinforcement Learning: Q-Learning
Deep Reinforcement Learning: Q-LearningKai-Wen Zhao
 
An introduction to deep reinforcement learning
An introduction to deep reinforcement learningAn introduction to deep reinforcement learning
An introduction to deep reinforcement learningBig Data Colombia
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningSalem-Kabbani
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learningbutest
 
Reinforcement Learning Q-Learning
Reinforcement Learning   Q-Learning Reinforcement Learning   Q-Learning
Reinforcement Learning Q-Learning Melaku Eneayehu
 
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex FridmanMIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex FridmanPeerasak C.
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learningSubrat Panda, PhD
 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Dongmin Lee
 
Reinforcement learning:policy gradient (part 1)
Reinforcement learning:policy gradient (part 1)Reinforcement learning:policy gradient (part 1)
Reinforcement learning:policy gradient (part 1)Bean Yen
 
Wrapper feature selection method
Wrapper feature selection methodWrapper feature selection method
Wrapper feature selection methodAmir Razmjou
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksChristian Perone
 
Reinforcement Learning 6. Temporal Difference Learning
Reinforcement Learning 6. Temporal Difference LearningReinforcement Learning 6. Temporal Difference Learning
Reinforcement Learning 6. Temporal Difference LearningSeung Jae Lee
 
Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)Taehoon Kim
 
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed BanditsDongmin Lee
 

La actualidad más candente (20)

An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learning
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement Learning
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Deep Reinforcement Learning: Q-Learning
Deep Reinforcement Learning: Q-LearningDeep Reinforcement Learning: Q-Learning
Deep Reinforcement Learning: Q-Learning
 
An introduction to deep reinforcement learning
An introduction to deep reinforcement learningAn introduction to deep reinforcement learning
An introduction to deep reinforcement learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Reinforcement Learning Q-Learning
Reinforcement Learning   Q-Learning Reinforcement Learning   Q-Learning
Reinforcement Learning Q-Learning
 
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex FridmanMIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learning
 
ddpg seminar
ddpg seminarddpg seminar
ddpg seminar
 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)
 
Reinforcement learning:policy gradient (part 1)
Reinforcement learning:policy gradient (part 1)Reinforcement learning:policy gradient (part 1)
Reinforcement learning:policy gradient (part 1)
 
Wrapper feature selection method
Wrapper feature selection methodWrapper feature selection method
Wrapper feature selection method
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural Networks
 
Reinforcement Learning 6. Temporal Difference Learning
Reinforcement Learning 6. Temporal Difference LearningReinforcement Learning 6. Temporal Difference Learning
Reinforcement Learning 6. Temporal Difference Learning
 
Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)
 
CNN Tutorial
CNN TutorialCNN Tutorial
CNN Tutorial
 
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed Bandits
 
Human Action Recognition
Human Action RecognitionHuman Action Recognition
Human Action Recognition
 

Similar a Actor critic algorithm

Movie recommendation Engine using Artificial Intelligence
Movie recommendation Engine using Artificial IntelligenceMovie recommendation Engine using Artificial Intelligence
Movie recommendation Engine using Artificial IntelligenceHarivamshi D
 
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...Seldon
 
Comparative Analysis of Tuning Hyperparameters in Policy-Based DRL Algorithm ...
Comparative Analysis of Tuning Hyperparameters in Policy-Based DRL Algorithm ...Comparative Analysis of Tuning Hyperparameters in Policy-Based DRL Algorithm ...
Comparative Analysis of Tuning Hyperparameters in Policy-Based DRL Algorithm ...IRJET Journal
 
Online learning & adaptive game playing
Online learning & adaptive game playingOnline learning & adaptive game playing
Online learning & adaptive game playingSaeid Ghafouri
 
Building a deep learning ai.pptx
Building a deep learning ai.pptxBuilding a deep learning ai.pptx
Building a deep learning ai.pptxDaniel Slater
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysYasutoTamura1
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratchJie-Han Chen
 
DC02. Interpretation of predictions
DC02. Interpretation of predictionsDC02. Interpretation of predictions
DC02. Interpretation of predictionsAnton Kulesh
 
Movie Recommendation engine
Movie Recommendation engineMovie Recommendation engine
Movie Recommendation engineJayesh Lahori
 
Summer internship 2014 report by Rishabh Misra, Thapar University
Summer internship 2014 report by Rishabh Misra, Thapar UniversitySummer internship 2014 report by Rishabh Misra, Thapar University
Summer internship 2014 report by Rishabh Misra, Thapar UniversityRishabh Misra
 
Introduction: Asynchronous Methods for Deep Reinforcement Learning
Introduction: Asynchronous Methods for  Deep Reinforcement LearningIntroduction: Asynchronous Methods for  Deep Reinforcement Learning
Introduction: Asynchronous Methods for Deep Reinforcement LearningTakashi Nagata
 
Model based rl
Model based rlModel based rl
Model based rlSeolhokim
 
Introduction to simulation and modeling
Introduction to simulation and modelingIntroduction to simulation and modeling
Introduction to simulation and modelingantim19
 
[CIKM 2014] Deviation-Based Contextual SLIM Recommenders
[CIKM 2014] Deviation-Based Contextual SLIM Recommenders[CIKM 2014] Deviation-Based Contextual SLIM Recommenders
[CIKM 2014] Deviation-Based Contextual SLIM RecommendersYONG ZHENG
 
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...PingCAP
 
Rokach-GomaxSlides.pptx
Rokach-GomaxSlides.pptxRokach-GomaxSlides.pptx
Rokach-GomaxSlides.pptxJadna Almeida
 
Rokach-GomaxSlides (1).pptx
Rokach-GomaxSlides (1).pptxRokach-GomaxSlides (1).pptx
Rokach-GomaxSlides (1).pptxJadna Almeida
 
IRJET- Identification of Scene Images using Convolutional Neural Networks - A...
IRJET- Identification of Scene Images using Convolutional Neural Networks - A...IRJET- Identification of Scene Images using Convolutional Neural Networks - A...
IRJET- Identification of Scene Images using Convolutional Neural Networks - A...IRJET Journal
 

Similar a Actor critic algorithm (20)

Movie recommendation Engine using Artificial Intelligence
Movie recommendation Engine using Artificial IntelligenceMovie recommendation Engine using Artificial Intelligence
Movie recommendation Engine using Artificial Intelligence
 
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...
 
Comparative Analysis of Tuning Hyperparameters in Policy-Based DRL Algorithm ...
Comparative Analysis of Tuning Hyperparameters in Policy-Based DRL Algorithm ...Comparative Analysis of Tuning Hyperparameters in Policy-Based DRL Algorithm ...
Comparative Analysis of Tuning Hyperparameters in Policy-Based DRL Algorithm ...
 
Online learning & adaptive game playing
Online learning & adaptive game playingOnline learning & adaptive game playing
Online learning & adaptive game playing
 
Building a deep learning ai.pptx
Building a deep learning ai.pptxBuilding a deep learning ai.pptx
Building a deep learning ai.pptx
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative ways
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratch
 
DC02. Interpretation of predictions
DC02. Interpretation of predictionsDC02. Interpretation of predictions
DC02. Interpretation of predictions
 
Movie Recommendation engine
Movie Recommendation engineMovie Recommendation engine
Movie Recommendation engine
 
C3 w5
C3 w5C3 w5
C3 w5
 
Summer internship 2014 report by Rishabh Misra, Thapar University
Summer internship 2014 report by Rishabh Misra, Thapar UniversitySummer internship 2014 report by Rishabh Misra, Thapar University
Summer internship 2014 report by Rishabh Misra, Thapar University
 
Introduction: Asynchronous Methods for Deep Reinforcement Learning
Introduction: Asynchronous Methods for  Deep Reinforcement LearningIntroduction: Asynchronous Methods for  Deep Reinforcement Learning
Introduction: Asynchronous Methods for Deep Reinforcement Learning
 
Model based rl
Model based rlModel based rl
Model based rl
 
Introduction to simulation and modeling
Introduction to simulation and modelingIntroduction to simulation and modeling
Introduction to simulation and modeling
 
[CIKM 2014] Deviation-Based Contextual SLIM Recommenders
[CIKM 2014] Deviation-Based Contextual SLIM Recommenders[CIKM 2014] Deviation-Based Contextual SLIM Recommenders
[CIKM 2014] Deviation-Based Contextual SLIM Recommenders
 
Entity2rec recsys
Entity2rec recsysEntity2rec recsys
Entity2rec recsys
 
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...
 
Rokach-GomaxSlides.pptx
Rokach-GomaxSlides.pptxRokach-GomaxSlides.pptx
Rokach-GomaxSlides.pptx
 
Rokach-GomaxSlides (1).pptx
Rokach-GomaxSlides (1).pptxRokach-GomaxSlides (1).pptx
Rokach-GomaxSlides (1).pptx
 
IRJET- Identification of Scene Images using Convolutional Neural Networks - A...
IRJET- Identification of Scene Images using Convolutional Neural Networks - A...IRJET- Identification of Scene Images using Convolutional Neural Networks - A...
IRJET- Identification of Scene Images using Convolutional Neural Networks - A...
 

Más de Jie-Han Chen

Frontier in reinforcement learning
Frontier in reinforcement learningFrontier in reinforcement learning
Frontier in reinforcement learningJie-Han Chen
 
Temporal difference learning
Temporal difference learningTemporal difference learning
Temporal difference learningJie-Han Chen
 
Deep reinforcement learning
Deep reinforcement learningDeep reinforcement learning
Deep reinforcement learningJie-Han Chen
 
Markov decision process
Markov decision processMarkov decision process
Markov decision processJie-Han Chen
 
Multi armed bandit
Multi armed banditMulti armed bandit
Multi armed banditJie-Han Chen
 
Discrete sequential prediction of continuous actions for deep RL
Discrete sequential prediction of continuous actions for deep RLDiscrete sequential prediction of continuous actions for deep RL
Discrete sequential prediction of continuous actions for deep RLJie-Han Chen
 
BiCNet presentation (multi-agent reinforcement learning)
BiCNet presentation (multi-agent reinforcement learning)BiCNet presentation (multi-agent reinforcement learning)
BiCNet presentation (multi-agent reinforcement learning)Jie-Han Chen
 
Data science-toolchain
Data science-toolchainData science-toolchain
Data science-toolchainJie-Han Chen
 
The artofreadablecode
The artofreadablecodeThe artofreadablecode
The artofreadablecodeJie-Han Chen
 

Más de Jie-Han Chen (9)

Frontier in reinforcement learning
Frontier in reinforcement learningFrontier in reinforcement learning
Frontier in reinforcement learning
 
Temporal difference learning
Temporal difference learningTemporal difference learning
Temporal difference learning
 
Deep reinforcement learning
Deep reinforcement learningDeep reinforcement learning
Deep reinforcement learning
 
Markov decision process
Markov decision processMarkov decision process
Markov decision process
 
Multi armed bandit
Multi armed banditMulti armed bandit
Multi armed bandit
 
Discrete sequential prediction of continuous actions for deep RL
Discrete sequential prediction of continuous actions for deep RLDiscrete sequential prediction of continuous actions for deep RL
Discrete sequential prediction of continuous actions for deep RL
 
BiCNet presentation (multi-agent reinforcement learning)
BiCNet presentation (multi-agent reinforcement learning)BiCNet presentation (multi-agent reinforcement learning)
BiCNet presentation (multi-agent reinforcement learning)
 
Data science-toolchain
Data science-toolchainData science-toolchain
Data science-toolchain
 
The artofreadablecode
The artofreadablecodeThe artofreadablecode
The artofreadablecode
 

Último

Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jisc
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxDr. Ravikiran H M Gowda
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxJisc
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structuredhanjurrannsibayan2
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the ClassroomPooky Knightsmith
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfDr Vijay Vishwakarma
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSCeline George
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...Nguyen Thanh Tu Collection
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Pooja Bhuva
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsKarakKing
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfSherif Taha
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibitjbellavia9
 

Último (20)

Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 

Actor-Critic Algorithm

  • 1. Actor-Critic Algorithm Jie-Han Chen NetDB, National Cheng Kung University 5/29, 2018 @ National Cheng Kung University, Taiwan 1
  • 2. Some content and images in these slides were borrowed from: 1. Sergey Levine’s Deep Reinforcement Learning class at UCB 2. David Silver’s Reinforcement Learning class at UCL Disclaimer 2
  • 3. Outline ● Recap policy gradient ● Actor-Critic algorithm ● Recommended Papers 3
  • 4. Recap: policy gradient REINFORCE algorithm: 1. sample trajectory from 2. 3. 4
  • 5. Recap: policy gradient REINFORCE algorithm: 1. sample trajectory from 2. 3. Issue: We need to sample the whole trajectory to get this term (Monte Carlo), which makes the policy gradient learn slowly. 5
  • 6. Recap: policy gradient REINFORCE algorithm: 1. sample trajectory from 2. 3. Can we learn step by step? 6
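The REINFORCE update recapped above can be sketched in code. This is a minimal, self-contained example on a hypothetical two-armed bandit; the names (`policy`, `sample_episode`) and the learning rate are illustrative assumptions, not from the slides.

```python
import numpy as np

# REINFORCE sketch on a toy one-state, two-action bandit:
# sample an episode, then update theta with grad log pi(a) * return.
rng = np.random.default_rng(0)
theta = np.zeros(2)              # policy parameters: one logit per action

def policy(theta):
    """Softmax policy over two actions."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def sample_episode(theta):
    """One-step episode: action 1 pays reward 1, action 0 pays 0."""
    p = policy(theta)
    a = rng.choice(2, p=p)
    return a, float(a == 1)

alpha = 0.5
for _ in range(200):
    a, r = sample_episode(theta)
    p = policy(theta)
    grad_logp = -p               # d/d theta of log softmax ...
    grad_logp[a] += 1.0          # ... equals one_hot(a) - p
    theta += alpha * r * grad_logp   # REINFORCE step

print(policy(theta))             # should strongly prefer the rewarded action
```

Note that the update only happens after the episode ends, which is exactly the inefficiency the following slides address.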
  • 7. Actor-Critic algorithm In vanilla policy gradient, we can only evaluate our policy after we finish the whole episode. 7 Why don’t you tell me how good the policy is at an earlier stage?
  • 8. Actor-Critic algorithm In vanilla policy gradient, we can only evaluate our policy after we finish the whole episode. Hello, the critic is here. I’ll give you a score at every step. Why don’t you tell me how good the policy is at an earlier stage? 8
  • 9. Actor-Critic algorithm Objective of vanilla policy gradient: The return in the policy gradient with causality at step t can be replaced by the expected action-value function. If we can find the action-value function at each step, we can improve learning efficiency with TD learning. 9
  • 10. Actor-Critic algorithm Policy gradient with causality: Question: How do we get the action-value function Q? 10
  • 11. Actor-Critic algorithm Policy gradient with causality: Question: How do we get the action-value function Q? We use another neural network, called the critic, to approximate the value function; this is the so-called Actor-Critic. With the critic network, we can update the networks step by step. However, it also introduces bias. Policy Network Critic Network 11
  • 12. Actor-Critic algorithm Which kind of function should the critic network approximate? 12
  • 13. Actor-Critic algorithm Which kind of function should the critic network approximate? We usually fit the value function; I’ll show you the reason soon. (Other choices are fine.) 13
  • 14. Actor-Critic algorithm The objective of Actor-Critic: This version of the objective has lower variance and higher bias than REINFORCE when we learn with TD learning. Can we also subtract a baseline to reduce the variance? 14
  • 15. Actor-Critic algorithm The objective of Actor-Critic: This version of the objective has lower variance and higher bias than REINFORCE when we learn with TD learning. Can we also subtract a baseline to reduce the variance? Yes! We can subtract this term: 15
  • 16. Actor-Critic algorithm The objective of Actor-Critic with a value function baseline: : : the average return the agent obtains when facing the same state. : We call this the advantage function, which reflects how good the action we’ve taken is compared to the other candidates. : how good the action we take from the current state is. 16
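The advantage baseline described above can be illustrated with a one-step TD estimate of Q. The numbers below are hypothetical, chosen only to show the arithmetic A(s, a) ≈ r + γV(s') − V(s):

```python
# Hypothetical one-step numbers to illustrate the advantage estimate.
gamma = 0.99
V_s, V_s_next = 2.0, 3.0      # critic's value estimates for s and s'
r = 1.0                       # reward observed for the action taken

# One-step TD estimate of Q(s, a): r + gamma * V(s')
Q_sa = r + gamma * V_s_next
# Advantage: how much better this action was than the state's average
A_sa = Q_sa - V_s
print(A_sa)   # 1.97: the action did better than the critic expected
```

A positive advantage pushes the policy toward the action; a negative one pushes it away, which is why subtracting the baseline reduces variance without changing the expected gradient.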
  • 22. Actor-Critic algorithm We just fit the value function! 22
  • 23. Actor-Critic algorithm: fit value function Monte Carlo evaluation: we could sample multiple trajectories like this: Then, compute the loss by supervised regression: 23
  • 24. Actor-Critic algorithm: fit value function TD evaluation: training sample: Then, compute the loss by supervised regression: 24
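The two evaluation targets for fitting the critic (Monte Carlo return vs. one-step TD target) can be sketched with a simple linear value function. The toy transition, features, and learning rate are illustrative assumptions, not from the slides:

```python
import numpy as np

# Fit V by supervised regression toward either an MC or a TD target.
w = np.zeros(3)                       # linear critic: V(s) = w . s
gamma, lr = 0.99, 0.1

def V(s):
    return w @ s

def mc_target(rewards, t):
    """Monte Carlo target: discounted return from step t to episode end."""
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))

def td_target(r, s_next):
    """TD target: bootstrap from the critic's own estimate of s'."""
    return r + gamma * V(s_next)

def fit(s, target):
    """One regression step on the squared error 0.5*(target - V(s))^2."""
    global w
    w += lr * (target - V(s)) * s

# Tiny fabricated transition just to exercise the code path.
s, s_next, r = np.ones(3), np.zeros(3), 1.0
fit(s, td_target(r, s_next))
print(V(np.ones(3)))
print(mc_target([0.0, 0.0, 1.0], 0))   # gamma^2
```

The MC target needs the whole episode but is unbiased; the TD target is available after one step but inherits the critic's bias, matching the trade-off on the slides.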
  • 25. Actor-Critic algorithm Online actor-critic algorithm: 1. Take action, get one-step experience (s, a, s’, r) 2. Fit Value function 3. Evaluate advantage function 4. 5. 25 Referenced from CS294 in UCB.
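The five steps of the online actor-critic loop above can be sketched end to end. The corridor environment, tabular actor/critic, and learning rates below are illustrative stand-ins for the neural networks on the slides:

```python
import numpy as np

# Online actor-critic on a toy 4-state corridor (reach state 3 for reward 1).
rng = np.random.default_rng(0)
n_states, n_actions = 4, 2                 # actions: 0 = left, 1 = right
logits = np.zeros((n_states, n_actions))   # actor parameters
V = np.zeros(n_states)                     # critic parameters
gamma, lr_actor, lr_critic = 0.99, 0.2, 0.2

def pi(s):
    z = np.exp(logits[s] - logits[s].max())
    return z / z.sum()

def step(s, a):
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3

for _ in range(800):
    s, done = 0, False
    while not done:
        p = pi(s)
        a = rng.choice(n_actions, p=p)           # 1. take action (s, a, s', r)
        s2, r, done = step(s, a)
        target = r + (0.0 if done else gamma * V[s2])
        adv = target - V[s]                      # 3. evaluate advantage
        V[s] += lr_critic * adv                  # 2. fit value function
        grad = -p
        grad[a] += 1.0                           # grad log pi(a|s)
        logits[s] += lr_actor * adv * grad       # 4-5. policy gradient step
        s = s2

print([pi(s)[1] for s in range(3)])  # learned preference for "right"
```

Unlike REINFORCE, every transition triggers an update, which is the step-by-step learning the slides motivate.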
  • 27. Network Architecture Neural architecture plays an important role in Deep Learning. In Actor-Critic algorithm, there are two kinds of network architecture: ● Separate policy network and critic network ● Two-head network 27
  • 28. Network Architecture Separate policy and critic network ● More parameters ● More stable in training policy network critic network 28
  • 29. Two-head network ● Share features, fewer parameters ● Hard to find a good coefficient to balance the actor loss and the critic loss Network Architecture Value head Policy head 29
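A two-head forward pass can be sketched as follows; the layer sizes, random weights, and loss coefficient `c` are illustrative assumptions:

```python
import numpy as np

# Two-head network sketch: one shared feature layer feeds both a policy
# head (softmax over actions) and a value head (scalar state value).
rng = np.random.default_rng(0)
obs_dim, hidden, n_actions = 8, 16, 4
W_shared = rng.normal(0, 0.1, (hidden, obs_dim))
W_policy = rng.normal(0, 0.1, (n_actions, hidden))
W_value = rng.normal(0, 0.1, (1, hidden))

def forward(obs):
    h = np.tanh(W_shared @ obs)          # shared features
    logits = W_policy @ h                # policy head
    z = np.exp(logits - logits.max())
    probs = z / z.sum()
    value = (W_value @ h).item()         # value head
    return probs, value

obs = rng.normal(size=obs_dim)
probs, value = forward(obs)

# The total loss balances the two heads with a coefficient c; finding a
# good c is the tuning difficulty mentioned on the slide.
c = 0.5
# total_loss = actor_loss + c * critic_loss   (sketch only)
print(probs, value)
```

Because both heads backpropagate into `W_shared`, a poorly chosen `c` lets one head's gradients dominate the shared features.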
  • 30. AlphaGO ● MCTS, Actor-Critic algorithm ● Separate network architecture 30
  • 31. AlphaGo Zero ● MCTS, Policy Iteration ● Shared ResNet 31
  • 32. Correlation Issue Online actor-critic algorithm: 1. Take action, get one-step experience (s, a, s’, r) 2. Fit the value function 3. Evaluate the advantage function 4. 5. In the online actor-critic algorithm, the correlation problem still exists. Policy gradient is an on-policy algorithm, so we cannot use a replay buffer to solve the correlation problem. Think about how to solve the correlation problem in Actor-Critic. 32
  • 33. Advanced Actor-Critic algorithms Currently, many state-of-the-art RL algorithms are developed on the basis of the Actor-Critic algorithm: ● Asynchronous Advantage Actor-Critic (A3C) ● Synchronous Advantage Actor-Critic (A2C) ● Trust Region Policy Optimization (TRPO) ● Proximal Policy Optimization (PPO) ● Deep Deterministic Policy Gradient (DDPG) 33
  • 34. Advanced Actor-Critic algorithms Asynchronous Advantage Actor-Critic (A3C) V. Mnih et al. (ICML 2016) proposed a parallel version of the Actor-Critic algorithm which not only solves the correlation problem but also speeds up the learning process. This is the so-called A3C (Asynchronous Advantage Actor-Critic). A3C became the state-of-the-art baseline in 2016 and 2017. 34
  • 35. Asynchronous Advantage Actor-Critic They use multiple workers to sample n-step experience. Each worker has access to the shared global network and its own local network. Upon collecting enough experience, each worker computes the gradients of its local network, copies them to the shared global network, and updates the global network asynchronously. 35
  • 36. Asynchronous Advantage Actor-Critic ● Asynchronous Advantage Actor-Critic ○ proposed by DeepMind ○ use n-step bootstrapping ○ easier to implement (without lock) ○ some variability in exploration between workers ○ only use CPUs, poor GPU usage 36
  • 37. Synchronous Advantage Actor-Critic ● Synchronous Actor-Critic ○ proposed by OpenAI ○ synchronous workers ○ use n-step tricks ○ better GPU usage 37
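The synchronous update can be contrasted with A3C in a toy sketch: every worker computes a gradient in lockstep, and a single averaged update is applied to the global parameters. The scalar regression loss below is a hypothetical stand-in for the actor-critic loss:

```python
import numpy as np

# A2C-style synchronous update sketch: workers step in lockstep, their
# gradients are averaged, and one global update is applied per round
# (A3C would instead let each worker apply its gradient asynchronously).
rng = np.random.default_rng(0)
theta = 0.0
target = 3.0

def worker_grad(theta):
    """Gradient of 0.5*(theta - noisy_target)^2 on one worker's sample."""
    sample = target + rng.normal(0, 0.1)
    return theta - sample

n_workers, lr = 8, 0.5
for _ in range(50):
    grads = [worker_grad(theta) for _ in range(n_workers)]  # lockstep workers
    theta -= lr * np.mean(grads)                            # one global update
print(theta)   # converges near the target
```

Batching all workers' data into one update is what lets A2C run the forward passes on a GPU efficiently.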
  • 38. The difference between A3C and A2C 38
  • 39. The usage of Actor-Critic algorithm Gaming, model research Robotics, continuous control 39
  • 40. The devil is hidden in the details When we do research on reinforcement learning, computational resources need to be considered. A3C/A2C use a lot of memory to cache the state for each worker in the forward step, especially with n-step bootstrapping. 40 From StarCraft II: A New Challenge for Reinforcement Learning
  • 41. Related Papers 1. Volodymyr Mnih et al. (ICML 2016): A3C, Asynchronous Methods for Deep Reinforcement Learning 2. John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel (ICML 2015): TRPO, Trust Region Policy Optimization 3. John Schulman et al. (2017): PPO, Proximal Policy Optimization Algorithms 4. Timothy P. Lillicrap et al. (ICLR 2016): DDPG, Continuous Control with Deep Reinforcement Learning 5. David Silver, Aja Huang et al. (Nature 2016): AlphaGo, Mastering the Game of Go with Deep Neural Networks and Tree Search 6. David Silver et al. (Nature 2017): AlphaGo Zero, Mastering the Game of Go Without Human Knowledge 7. Emilio Parisotto, Jimmy Lei Ba, Ruslan Salakhutdinov (ICLR 2016): Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning 41