Trust Region Policy Optimization, Schulman et al, 2015.
옥찬호
utilForever@gmail.com
Introduction
• Most algorithms for policy optimization can be classified into three broad categories:
• (1) Policy iteration methods (Alternate between estimating the value
function under the current policy and improving the policy)
• (2) Policy gradient methods (Use an estimator of the gradient of the
expected return)
• (3) Derivative-free optimization methods (Treat the return as a black box
function to be optimized in terms of the policy parameters) such as
• The Cross-Entropy Method (CEM)
• The Covariance Matrix Adaptation (CMA)
Introduction
• General derivative-free stochastic optimization methods
such as CEM and CMA are preferred on many problems,
• They achieve good results while being simple to understand and implement.
• For example, while Tetris is a classic benchmark problem for
approximate dynamic programming (ADP) methods, stochastic
optimization methods are difficult to beat on this task.
Introduction
• For continuous control problems, methods like CMA have been
successful at learning control policies for challenging tasks like
locomotion when provided with hand-engineered policy
classes with low-dimensional parameterizations.
• The inability of ADP and gradient-based methods to
consistently beat gradient-free random search is unsatisfying,
• Gradient-based optimization algorithms enjoy much better sample
complexity guarantees than gradient-free methods.
Introduction
• Continuous gradient-based optimization has been very
successful at learning function approximators for supervised
learning tasks with huge numbers of parameters, and
extending their success to reinforcement learning would allow
for efficient training of complex and powerful policies.
Preliminaries
• Markov Decision Process (MDP): (𝒮, 𝒜, 𝑃, 𝑟, 𝜌₀, 𝛾)
• 𝒮 is a finite set of states
• 𝒜 is a finite set of actions
• 𝑃 ∶ 𝒮 × 𝒜 × 𝒮 → ℝ is the transition probability
• 𝑟 ∶ 𝒮 → ℝ is the reward function
• 𝜌₀ ∶ 𝒮 → ℝ is the distribution of the initial state 𝑠₀
• 𝛾 ∈ (0, 1) is the discount factor
• Policy
• 𝜋 ∶ 𝒮 × 𝒜 → [0, 1]
Preliminaries
How do we measure a policy’s performance?
Preliminaries
• Expected cumulative discounted reward
• Action value, Value/Advantage functions
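For reference, these quantities are defined in Schulman et al. (2015) as:

\[
\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t)\Big], \qquad
s_0 \sim \rho_0,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)
\]
\[
Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\Big[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\Big], \qquad
V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\Big[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\Big], \qquad
A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)
\]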
Preliminaries
• Lemma 1. Kakade & Langford (2002)
• Proof
Preliminaries
• Discounted (unnormalized) visitation frequencies
• Rewrite Lemma 1
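For reference, Lemma 1 and its rewrite in terms of the discounted visitation frequencies, as stated in the paper:

\[
\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\Big]
\]
\[
\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots, \qquad
\eta(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)
\]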
Preliminaries
• This equation implies that any policy update 𝜋 → 𝜋̃ that has a nonnegative expected advantage at every state 𝑠, i.e., Σ_𝑎 𝜋̃(𝑎|𝑠) 𝐴_𝜋(𝑠, 𝑎) ≥ 0, is guaranteed to increase the policy performance 𝜂, or leave it constant in the case that the expected advantage is zero everywhere.
Preliminaries
• However, in the approximate setting, it will typically be unavoidable, due to estimation and approximation error, that there will be some states 𝑠 for which the expected advantage is negative, that is, Σ_𝑎 𝜋̃(𝑎|𝑠) 𝐴_𝜋(𝑠, 𝑎) < 0.
• The complex dependency of 𝜌_𝜋̃(𝑠) on 𝜋̃ makes the equation difficult to optimize directly.
Preliminaries
• Instead, we introduce the following local approximation to 𝜂:
• Note that 𝐿_𝜋 uses the visitation frequency 𝜌_𝜋 rather than 𝜌_𝜋̃, ignoring changes in state visitation density due to changes in the policy.
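The local approximation introduced above is, as defined in the paper:

\[
L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)
\]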
Preliminaries
• However, if we have a parameterized policy 𝜋_𝜃, where 𝜋_𝜃(𝑎|𝑠) is a differentiable function of the parameter vector 𝜃, then 𝐿_𝜋 matches 𝜂 to first order.
• This equation implies that a sufficiently small step 𝜋_{𝜃_0} → 𝜋̃ that improves 𝐿_{𝜋_{𝜃_old}} will also improve 𝜂, but does not give us any guidance on how big of a step to take.
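For reference, the first-order matching conditions from the paper are:

\[
L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad
\nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta = \theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta = \theta_0}
\]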
Preliminaries
• Conservative Policy Iteration
• Provide explicit lower bounds on the improvement of 𝜂.
• Let 𝜋_old denote the current policy, and let 𝜋′ = argmax_{𝜋′} 𝐿_{𝜋_old}(𝜋′).
The new policy 𝜋_new was defined to be the following mixture:
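The mixture policy used by conservative policy iteration, as given in the paper, is:

\[
\pi_{\mathrm{new}}(a \mid s) = (1 - \alpha)\,\pi_{\mathrm{old}}(a \mid s) + \alpha\,\pi'(a \mid s)
\]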
Preliminaries
• Conservative Policy Iteration
• The following lower bound:
• However, this policy class is unwieldy and restrictive in practice,
and it is desirable for a practical policy update scheme to be applicable to
all general stochastic policy classes.
Monotonic Improvement Guarantee
• The policy improvement bound in the above equation can be extended to general stochastic policies by
• (1) Replacing 𝛼 with a distance measure between 𝜋 and 𝜋̃
• (2) Changing the constant 𝜖 appropriately
Monotonic Improvement Guarantee
• The particular distance measure we use is the total variation divergence for discrete probability distributions 𝑝, 𝑞: 𝐷_TV(𝑝 ∥ 𝑞) = (1/2) Σ_𝑖 |𝑝_𝑖 − 𝑞_𝑖|
• Define 𝐷_TV^max(𝜋, 𝜋̃) as the maximum of this divergence over states: 𝐷_TV^max(𝜋, 𝜋̃) = max_𝑠 𝐷_TV(𝜋(∙|𝑠) ∥ 𝜋̃(∙|𝑠))
Monotonic Improvement Guarantee
• Theorem 1. Let 𝛼 = 𝐷_TV^max(𝜋_old, 𝜋_new). Then the following bound holds:
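The bound stated in Theorem 1 of the paper is:

\[
\eta(\pi_{\mathrm{new}}) \ge L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2,
\qquad \text{where } \epsilon = \max_{s,a} \big|A_{\pi_{\mathrm{old}}}(s, a)\big|
\]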
Monotonic Improvement Guarantee
• We note the following relationship between the total variation divergence and the KL divergence: 𝐷_TV(𝑝 ∥ 𝑞)² ≤ 𝐷_KL(𝑝 ∥ 𝑞)
• Let 𝐷_KL^max(𝜋, 𝜋̃) = max_𝑠 𝐷_KL(𝜋(∙|𝑠) ∥ 𝜋̃(∙|𝑠)). The following bound then follows directly from Theorem 1:
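The resulting bound, with the penalty coefficient 𝐶 used in the rest of the paper, is:

\[
\eta(\tilde{\pi}) \ge L_\pi(\tilde{\pi}) - C\,D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}),
\qquad \text{where } C = \frac{4\epsilon\gamma}{(1-\gamma)^2}
\]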
Monotonic Improvement Guarantee
• Algorithm 1
Monotonic Improvement Guarantee
• Algorithm 1 is guaranteed to generate a monotonically improving sequence of policies 𝜂(𝜋₀) ≤ 𝜂(𝜋₁) ≤ 𝜂(𝜋₂) ≤ ⋯.
• Let 𝑀_𝑖(𝜋) = 𝐿_{𝜋_𝑖}(𝜋) − 𝐶𝐷_KL^max(𝜋_𝑖, 𝜋). Then,
• Thus, by maximizing 𝑀𝑖 at each iteration, we guarantee that
the true objective 𝜂 is non-decreasing.
• This algorithm is a type of minorization-maximization (MM) algorithm.
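The chain of inequalities behind this guarantee, as given in the paper, is:

\[
\eta(\pi_{i+1}) \ge M_i(\pi_{i+1}), \qquad \eta(\pi_i) = M_i(\pi_i)
\quad\Longrightarrow\quad
\eta(\pi_{i+1}) - \eta(\pi_i) \ge M_i(\pi_{i+1}) - M_i(\pi_i)
\]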
Optimization of Parameterized Policies
• Overload our previous notation to use functions of 𝜃.
• 𝜂(𝜃) ≔ 𝜂(𝜋_𝜃)
• 𝐿_𝜃(𝜃̃) ≔ 𝐿_{𝜋_𝜃}(𝜋_{𝜃̃})
• 𝐷_KL(𝜃 ∥ 𝜃̃) ≔ 𝐷_KL(𝜋_𝜃 ∥ 𝜋_{𝜃̃})
• Use 𝜃_old to denote the previous policy parameters.
Optimization of Parameterized Policies
• The preceding section showed that 𝜂(𝜃) ≥ 𝐿_{𝜃_old}(𝜃) − 𝐶𝐷_KL^max(𝜃_old, 𝜃)
• Thus, by performing the following maximization, we are guaranteed to improve the true objective 𝜂:
maximize_𝜃 [𝐿_{𝜃_old}(𝜃) − 𝐶𝐷_KL^max(𝜃_old, 𝜃)]
Optimization of Parameterized Policies
• In practice, if we used the penalty coefficient 𝐶 recommended
by the theory above, the step sizes would be very small.
• One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint:
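The trust region problem referred to above is:

\[
\underset{\theta}{\text{maximize}}\; L_{\theta_{\mathrm{old}}}(\theta)
\quad \text{subject to} \quad D_{\mathrm{KL}}^{\max}(\theta_{\mathrm{old}}, \theta) \le \delta
\]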
Optimization of Parameterized Policies
• This problem imposes a constraint that the KL divergence is
bounded at every point in the state space. While it is motivated
by the theory, this problem is impractical to solve due to the
large number of constraints.
• Instead, we can use a heuristic approximation which considers
the average KL divergence:
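The average KL divergence and the resulting heuristic problem, as defined in the paper, are:

\[
\overline{D}_{\mathrm{KL}}^{\rho}(\theta_1, \theta_2) := \mathbb{E}_{s \sim \rho}\big[D_{\mathrm{KL}}\big(\pi_{\theta_1}(\cdot \mid s) \,\|\, \pi_{\theta_2}(\cdot \mid s)\big)\big],
\qquad
\underset{\theta}{\text{maximize}}\; L_{\theta_{\mathrm{old}}}(\theta)
\;\; \text{subject to} \;\; \overline{D}_{\mathrm{KL}}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}}, \theta) \le \delta
\]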
Optimization of Parameterized Policies
• We therefore propose solving the following optimization
problem to generate a policy update:
Sample-Based Estimation
• How can the objective and constraint functions be approximated using Monte Carlo simulation?
• We seek to solve the following optimization problem, obtained by expanding 𝐿_{𝜃_old} in the previous equation:
Sample-Based Estimation
• We replace
• Σ_𝑠 𝜌_{𝜃_old}(𝑠)[…] → (1/(1−𝛾)) 𝔼_{𝑠~𝜌_{𝜃_old}}[…]
• 𝐴_{𝜃_old} → 𝑄_{𝜃_old}
• Σ_𝑎 𝜋_{𝜃_old}(𝑎|𝑠) 𝐴_{𝜃_old} → 𝔼_{𝑎~𝑞}[(𝜋_𝜃(𝑎|𝑠)/𝑞(𝑎|𝑠)) 𝐴_{𝜃_old}]
(We replace the sum over the actions by an importance sampling estimator.)
Sample-Based Estimation
• Now, our optimization problem in previous equation is exactly
equivalent to the following one, written in terms of
expectations:
• Constrained Optimization → Sample-based Estimation
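Written in terms of expectations, the optimization problem from the paper is:

\[
\underset{\theta}{\text{maximize}}\;
\mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim q}\!\left[\frac{\pi_\theta(a \mid s)}{q(a \mid s)}\, Q_{\theta_{\mathrm{old}}}(s, a)\right]
\quad \text{subject to} \quad
\mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\!\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big)\big] \le \delta
\]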
Sample-Based Estimation
• All that remains is to replace the expectations by sample
averages and replace the 𝑄 value by an empirical estimate.
The following sections describe two different schemes for
performing this estimation.
• Single Path: Typically used for policy gradient estimation (Bartlett & Baxter,
2011), and is based on sampling individual trajectories.
• Vine: Constructing a rollout set and then performing multiple actions from
each state in the rollout set.
Practical Algorithm
1. Use the single path or vine procedures to collect a set of state-
action pairs along with Monte Carlo estimates of their 𝑄-values.
2. By averaging over samples, construct the estimated objective and constraint in the preceding equation.
3. Approximately solve this constrained optimization problem to
update the policy’s parameter vector 𝜃. We use the conjugate
gradient algorithm followed by a line search, which is altogether
only slightly more expensive than computing the gradient itself.
See Appendix C for details.
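As a rough illustration of step 3, the sketch below runs conjugate gradient against Fisher-vector products and then a backtracking line search. This is a minimal sketch only: the helpers surrogate, surrogate_grad, kl, and fvp (returning the product of the FIM with a vector) are assumed to be supplied by the caller, and the names are illustrative rather than taken from the paper or any particular codebase.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g with CG, where fvp(v) returns F @ v."""
    x = np.zeros_like(g)
    r = g.copy()   # residual g - F x (x starts at zero)
    p = g.copy()   # search direction
    rdotr = r @ r
    for _ in range(iters):
        z = fvp(p)
        alpha = rdotr / (p @ z)
        x += alpha * p
        r -= alpha * z
        new_rdotr = r @ r
        if new_rdotr < tol:
            break
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x

def trpo_step(theta, surrogate, surrogate_grad, kl, fvp, delta=0.01, backtracks=10):
    """One approximate update: natural-gradient direction from CG, scaled to the
    trust region boundary, then a backtracking line search that accepts the first
    candidate improving the surrogate while satisfying the KL constraint."""
    g = surrogate_grad(theta)
    step_dir = conjugate_gradient(fvp, g)
    # Scale the step so that 0.5 * s^T F s = delta under the quadratic approximation.
    sFs = step_dir @ fvp(step_dir)
    full_step = np.sqrt(2.0 * delta / (sFs + 1e-8)) * step_dir
    old_obj = surrogate(theta)
    for i in range(backtracks):
        theta_new = theta + (0.5 ** i) * full_step
        if surrogate(theta_new) > old_obj and kl(theta, theta_new) <= delta:
            return theta_new
    return theta  # no acceptable step found; keep the old parameters
```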
Practical Algorithm
• With regard to (3), we construct the Fisher information matrix
(FIM) by analytically computing the Hessian of KL divergence,
rather than using the covariance matrix of the gradients.
• That is, we estimate 𝐴_𝑖𝑗 from the analytically computed second derivatives of the KL divergence, rather than from the outer products of the policy gradients (see the expressions below).
• This analytic estimator has computational benefits in the
large-scale setting, since it removes the need to store a dense
Hessian or all policy gradients from a batch of trajectories.
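The two estimates of 𝐴_𝑖𝑗 being contrasted, as given in the paper, are:

\[
\frac{1}{N}\sum_{n=1}^{N} \frac{\partial^2}{\partial\theta_i\,\partial\theta_j}
D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_n) \,\|\, \pi_\theta(\cdot \mid s_n)\big)
\;\;\text{(analytic KL Hessian)}
\qquad \text{vs.} \qquad
\frac{1}{N}\sum_{n=1}^{N} \frac{\partial}{\partial\theta_i}\log \pi_\theta(a_n \mid s_n)\,
\frac{\partial}{\partial\theta_j}\log \pi_\theta(a_n \mid s_n)
\;\;\text{(covariance of gradients)}
\]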
Experiments
We designed experiments to investigate the following questions:
1. What are the performance characteristics of the single path and vine
sampling procedures?
2. TRPO is related to prior methods but makes several changes, most notably by
using a fixed KL divergence rather than a fixed penalty coefficient. How does
this affect the performance of the algorithm?
3. Can TRPO be used to solve challenging large-scale problems? How does
TRPO compare with other methods when applied to large-scale problems,
with regard to final performance, computation time, and sample complexity?
Experiments
• Simulated Robotic Locomotion
Experiments
• Playing Games from Images
• To evaluate TRPO on a partially observed task with complex observations, we
trained policies for playing Atari games, using raw images as input.
• Challenging points
• In addition to the high dimensionality, challenging elements of these games include delayed rewards (no immediate penalty is incurred when a life is lost in Breakout or Space Invaders)
• Complex sequences of behavior (Q*bert requires a character to hop on 21 different platforms)
• Non-stationary image statistics (Enduro involves a changing and flickering background)
Discussion
• We proposed and analyzed trust region methods for optimizing
stochastic control policies.
• We proved monotonic improvement for an algorithm that
repeatedly optimizes a local approximation to the expected
return of the policy with a KL divergence penalty.
Discussion
• In the domain of robotic locomotion, we successfully learned
controllers for swimming, walking and hopping in a physics
simulator, using general purpose neural networks and
minimally informative rewards.
• At the intersection of the two experimental domains we
explored, there is the possibility of learning robotic control
policies that use vision and raw sensory data as input, providing
a unified scheme for training robotic controllers that perform
both perception and control.
Discussion
• By combining our method with model learning, it would also be
possible to substantially reduce its sample complexity, making
it applicable to real-world settings where samples are
expensive.
Summary
References
• https://www.slideshare.net/WoongwonLee/trpo-87165690
• https://reinforcement-learning-kr.github.io/2018/06/24/5_trpo/
• https://www.youtube.com/watch?v=XBO4oPChMfI
• https://www.youtube.com/watch?v=CKaN5PgkSBc
Thank you!