I reviewed the PEARL paper.
PEARL (Probabilistic Embeddings for Actor-critic RL) is an off-policy meta-RL algorithm that achieves both meta-training and adaptation efficiency. It performs probabilistic filtering of latent task variables, which enables posterior sampling for structured and efficient exploration.
Outline
- Abstract
- Introduction
- Probabilistic Latent Context
- Off-Policy Meta-Reinforcement Learning
- Experiments
Link: https://arxiv.org/abs/1903.08254
1. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
International Conference on Machine Learning (ICML), 2019
Kate Rakelly*, Aurick Zhou*, Deirdre Quillen
Chelsea Finn, Sergey Levine
Dongmin Lee
March, 2020
6. Abstract
Challenges of meta-reinforcement learning (meta-RL)
• Current methods rely heavily on on-policy experience, limiting their sample efficiency
• They also lack mechanisms to reason about task uncertainty when adapting to new tasks, limiting their effectiveness in sparse-reward problems
PEARL (Probabilistic Embeddings for Actor-critic RL)
• An off-policy meta-RL algorithm that achieves both meta-training and adaptation efficiency
• Performs probabilistic filtering of latent task variables, enabling posterior sampling for structured and efficient exploration
Experiments
• Six continuous-control meta-learning environments (MuJoCo simulator)
• A 2-D navigation environment with sparse rewards
8. Introduction
Meta-learning problem statement
• The goal of meta-learning is to learn to adapt quickly to a new task given a small amount of experience from previous tasks
[Figure: meta-training efficiency vs. adaptation efficiency, illustrated for both supervised learning and reinforcement learning]
9. Introduction
Meta-RL problem statement
• The goal of meta-RL is to learn to adapt quickly to a new task given a small amount of experience from previous tasks
Regular RL: learn a policy for a single task (one MDP ℳ)
θ* = arg max_θ E_{π_θ(τ)}[R(τ)], i.e., θ* = f_RL(ℳ)
Meta-RL: learn an adaptation rule across MDPs ℳ_1, ℳ_2, ℳ_3, …, ℳ_test
θ* = arg max_θ Σ_i E_{π_{φ_i}(τ)}[R(τ)], where φ_i = f_θ(ℳ_i)
(optimizing θ is meta-training, the outer loop; computing φ_i = f_θ(ℳ_i) is adaptation, the inner loop)
11. Introduction
General meta-RL algorithm outline
• While training:
1. Sample task i, collect data 𝒟_i
2. Adapt the policy by computing φ_i = f(θ, 𝒟_i)
3. Collect data 𝒟_i′ with the adapted policy π_{φ_i}
4. Update θ according to the loss ℒ(𝒟_i′, φ_i)
→ Algorithms differ in:
• the choice of adaptation function f
• the choice of loss function ℒ
(see the loop sketch after this list)
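To make the outline concrete, here is a minimal Python sketch of this loop. It is a generic skeleton, not PEARL's implementation: the callables `collect`, `adapt` (the function f), and `meta_update` (which applies ℒ) are hypothetical placeholders that a specific algorithm would supply.

    import random

    def meta_train(theta, tasks, collect, adapt, meta_update, iters=1000):
        """Generic meta-RL training loop (steps 1-4 above)."""
        for _ in range(iters):
            task = random.choice(tasks)         # 1. sample task i
            data = collect(theta, task)         #    ... and collect D_i
            phi = adapt(theta, data)            # 2. phi_i = f(theta, D_i)
            new_data = collect(phi, task)       # 3. collect D_i' with pi_{phi_i}
            theta = meta_update(theta, new_data, phi)  # 4. update theta via L
        return theta

Solutions I and II below are two instances of this skeleton with different choices of f and ℒ.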
12. Introduction
Solution I: recurrent policies
• Implement the policy as a recurrent network (RNN) and train it across a set of tasks with policy gradients (PG)
θ* = arg max_θ Σ_i E_{π_{φ_i}(τ)}[R(τ)], where φ_i = f_θ(ℳ_i)
Here the adaptation f_θ is the RNN's hidden-state update as it reads experience from task i
Y. Duan et al., “RL²: Fast Reinforcement Learning via Slow Reinforcement Learning”, ICLR 2017
13. Introduction
Solution II: optimization
• Learn a parameter initialization from which fine-tuning for a new task works, using policy gradients (PG) in both the inner and outer loop
θ* = arg max_θ Σ_i E_{π_{φ_i}(τ)}[R(τ)], where φ_i = f_θ(ℳ_i)
Here the adaptation f_θ is one or more gradient steps on task i starting from the initialization θ
C. Finn et al., “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks”, ICML 2017
17. Introduction
Challenge of meta-RL: sample efficiency
• Most current meta-RL methods require on-policy data during both meta-training and adaptation, which makes them exceedingly inefficient during meta-training
• As in image classification, meta-learning typically operates on the principle that meta-training time should match meta-test time
• In meta-RL, this makes it inherently difficult to meta-train a policy to adapt using off-policy data, which is systematically different from the data the policy would see when it explores on-policy in a new task at meta-test time
θ* = arg max_θ Σ_i E_{π_{φ_i}(τ)}[R(τ)], where φ_i = f_θ(ℳ_i)
Train with off-policy data, but then f_{θ*} is on-policy...
→ Can we meta-train with off-policy data and meta-test with on-policy data?
18. Introduction
Definition
• p(𝒯): distribution over tasks
• S: set of states
• A: set of actions
• p(s_0): initial state distribution
• p(s_{t+1} | s_t, a_t): transition distribution
• r(s_t, a_t): reward function
• 𝒯 = {p(s_0), p(s_{t+1} | s_t, a_t), r(s_t, a_t)}: a task
• context c: history of past transitions
• c_n^𝒯 = (s_n, a_n, r_n, s_n′): one transition in task 𝒯
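As a tiny illustration of the last two definitions, a context can be represented as a growing collection of transition tuples; the field names here are illustrative, not from the PEARL code:

    from collections import namedtuple

    # One transition c_n = (s_n, a_n, r_n, s_n') in a task T.
    Transition = namedtuple("Transition", ["s", "a", "r", "s_next"])

    # The context c is simply the transitions gathered so far in the task;
    # PEARL will later treat it as an unordered set.
    context = [
        Transition(s=0.0, a=1.0, r=-0.5, s_next=0.2),
        Transition(s=0.2, a=0.0, r=0.1, s_next=0.3),
    ]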
30. Introduction
Can we achieve both meta-training and adaptation efficiency?
• Yes: build an off-policy meta-RL algorithm that disentangles task inference and control
Solution III: partially observed RL
• Meta-training efficiency: an off-policy RL algorithm (SAC)
• Adaptation efficiency: a probabilistic encoder to estimate the task belief
θ* = arg max_θ Σ_i E_{π_{φ_i}(τ)}[R(τ)], where φ_i = f_θ(ℳ_i)
Here f is a probabilistic encoder, and the policy is trained with SAC
36. Probabilistic Latent Context
Modeling and learning latent contexts
• Adopt an amortized variational inference (VAE-style) approach to learn to infer a latent probabilistic context variable Z
→ An inference network q_φ(z|c) estimates the true posterior p(z|c)
• Learn a predictive model of reward and dynamics by optimizing q_φ(z|c) to reconstruct the MDP
• Taking the objective to be a log-likelihood, the resulting variational lower bound is:
E_𝒯 [ E_{z ∼ q_φ(z|c^𝒯)} [ R(𝒯, z) + β D_KL( q_φ(z|c^𝒯) ∥ p(z) ) ] ]
• p(z) is a unit Gaussian prior over Z
• β = 0.1 is the weight on the KL divergence term
• R(𝒯, z) is the “likelihood” term, implemented as a Bellman error; q_φ and p are the variational approximation to the posterior and the prior
• The KL term is a “regularization” term: an information bottleneck that constrains the mutual information between Z and c
(a loss sketch follows below)
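A minimal PyTorch-style sketch of this bound as a loss to minimize, assuming the encoder outputs a diagonal Gaussian and that the “likelihood” term arrives as a precomputed critic (Bellman-error) loss; the names are illustrative, not from the official PEARL code:

    import torch
    import torch.distributions as D

    def latent_context_loss(mu, log_std, bellman_loss, beta=0.1):
        # q_phi(z|c): diagonal Gaussian produced by the inference network
        posterior = D.Normal(mu, log_std.exp())
        # p(z): unit Gaussian prior over the latent task variable Z
        prior = D.Normal(torch.zeros_like(mu), torch.ones_like(mu))
        # KL(q_phi(z|c) || p(z)): the information-bottleneck regularizer
        kl = D.kl_divergence(posterior, prior).sum(dim=-1).mean()
        # "likelihood" term R(T, z) is the Bellman error from the critic
        return bellman_loss + beta * kl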
40. Probabilistic Latent Context
Designing the architecture of the inference network
• To infer what the task is, i.e., to identify the MDP, it is enough to have access to a collection of transitions, without regard for the order of the transitions
• So, use a permutation-invariant encoder for simplicity and speed
• To keep the method tractable, model the posterior as a product of independent Gaussian factors, one per transition, which again yields a Gaussian posterior (see the sketch after this list)
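Here is a minimal sketch of that Gaussian-product step, assuming a hypothetical encoder has already mapped each context transition to the mean and variance of one Gaussian factor:

    import torch

    def product_of_gaussians(mus, sigmas_sq):
        # mus, sigmas_sq: [num_transitions, latent_dim] per-transition factors
        # N(mu_n, sigma_n^2) predicted by the encoder. Summing precisions is
        # permutation-invariant: the posterior depends only on the set of
        # transitions, not on their order.
        precisions = 1.0 / torch.clamp(sigmas_sq, min=1e-7)  # numerical stability
        post_var = 1.0 / precisions.sum(dim=0)
        post_mu = post_var * (precisions * mus).sum(dim=0)
        return post_mu, post_var  # parameters of the Gaussian posterior q(z|c)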
41. Off-Policy Meta-RL
Aside: Soft Actor-Critic (SAC)
• “Soft”: maximize both rewards and the entropy of the policy (higher-entropy policies explore better)
J(π) = Σ_{t} E_{(s_t, a_t)} [ r(s_t, a_t) + α H(π(·|s_t)) ]
• “Actor-critic”: model both the actor (the policy) and the critic (the Q-function)
J_Q(θ) = E_{(s_t, a_t)} [ ½ ( Q_θ(s_t, a_t) − ( r(s_t, a_t) + γ E_{s_{t+1}}[ V_θ̄(s_{t+1}) ] ) )² ]
J_π(φ) = E_{(s_t, a_t)} [ Q_θ(s_t, a_t) + α H(π_φ(·|s_t)) ]
(V_θ̄ is the value function with target-network parameters θ̄; a sketch of both losses follows below)
• Much more sample-efficient than on-policy algorithms!
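A minimal PyTorch-style sketch of the two objectives, written as losses to minimize (maximizing entropy is equivalent to penalizing log-probability); `q_net`, `target_v_net`, and `policy` are hypothetical modules, not PEARL's actual classes:

    import torch
    import torch.nn.functional as F

    def critic_loss(q_net, target_v_net, batch, gamma=0.99):
        s, a, r, s_next = batch
        with torch.no_grad():
            target = r + gamma * target_v_net(s_next)   # r + gamma * E[V(s')]
        return F.mse_loss(q_net(s, a), target)          # (Q - target)^2, up to the 1/2

    def actor_loss(q_net, policy, s, alpha=0.2):
        a, log_prob = policy.sample(s)                  # reparameterized action sample
        # minimizing alpha * log pi - Q maximizes E[Q + alpha * H(pi)]
        return (alpha * log_prob - q_net(s, a)).mean()

In PEARL these networks additionally condition on a latent sample z ∼ q_φ(z|c), which couples task inference to control.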
46. Experiments
Ablations I: inference network architecture
• This result demonstrates the importance of decorrelating the samples used for the RL objective
47. Experiments
Ablations II: data sampling strategies
• This result demonstrates the importance of careful data sampling in off-policy meta-RL
48. Experiments
Ablations III: deterministic context
• With no stochasticity in the latent context variable, the only stochasticity comes from the policy and is thus time-invariant, hindering temporally extended exploration