I reviewed the PEARL paper.
PEARL (Probabilistic Embeddings for Actor-critic RL) is an off-policy meta-RL algorithm that achieves both meta-training and adaptation efficiency. It performs probabilistic filtering of latent task variables, which enables posterior sampling for structured and efficient exploration.
Outline
- Abstract
- Introduction
- Probabilistic Latent Context
- Off-Policy Meta-Reinforcement Learning
- Experiments
Link: https://arxiv.org/abs/1903.08254
1. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
International Conference on Machine Learning (ICML), 2019
Kate Rakelly*, Aurick Zhou*, Deirdre Quillen
Chelsea Finn, Sergey Levine
Dongmin Lee
March, 2020
6. Abstract
Challenges of meta-reinforcement learning (meta-RL)
• Current methods rely heavily on on-policy experience, limiting their sample efficiency
• They also lack mechanisms to reason about task uncertainty when adapting to new tasks, limiting their effectiveness in sparse-reward problems
PEARL (Probabilistic Embeddings for Actor-critic RL)
• An off-policy meta-RL algorithm that achieves both meta-training and adaptation efficiency
• Performs probabilistic filtering of latent task variables, enabling posterior sampling for structured and efficient exploration
Experiments
• Six continuous-control meta-learning environments (MuJoCo simulator)
• A 2-D navigation environment with sparse rewards
8. Introduction
Meta-learning problem statement
• The goal of meta-learning is to learn to adapt quickly to a new task given a small amount of experience from previous tasks
[Figure: meta-training efficiency vs. adaptation efficiency, illustrated for both supervised learning and reinforcement learning]
9. Introduction
Meta-RL problem statement
• The goal of meta-RL is to learn to adapt quickly to a new task given a small amount of experience from previous tasks
Regular RL: learn a policy for a single task (one MDP ℳ)
θ* = arg max_θ E_{π_θ(τ)}[R(τ)], i.e., θ* = f_RL(ℳ)
Meta-RL: learn an adaptation rule across MDPs ℳ_1, ℳ_2, ℳ_3, …, ℳ_test
θ* = arg max_θ Σ_i E_{π_{φ_i}(τ)}[R(τ)], where φ_i = f_θ(ℳ_i)
(optimizing θ is meta-training, the outer loop; computing φ_i = f_θ(ℳ_i) is adaptation, the inner loop)
11. Introduction
General meta-RL algorithm outline
• While training:
1. Sample task i, collect data 𝒟_i
2. Adapt the policy by computing φ_i = f(θ, 𝒟_i)
3. Collect data 𝒟_i′ with the adapted policy π_{φ_i}
4. Update θ according to the loss ℒ(𝒟_i′, φ_i)
→ Algorithms differ in:
• the choice of adaptation function f
• the choice of loss function ℒ
(see the loop sketch after this list)
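To make the outline concrete, here is a minimal Python sketch of this loop. It is a generic skeleton, not PEARL's implementation: the callables `collect`, `adapt` (the function f), and `meta_update` (which applies ℒ) are hypothetical placeholders that a specific algorithm would supply.

    import random

    def meta_train(theta, tasks, collect, adapt, meta_update, iters=1000):
        """Generic meta-RL training loop (steps 1-4 above)."""
        for _ in range(iters):
            task = random.choice(tasks)         # 1. sample task i
            data = collect(theta, task)         #    ... and collect D_i
            phi = adapt(theta, data)            # 2. phi_i = f(theta, D_i)
            new_data = collect(phi, task)       # 3. collect D_i' with pi_{phi_i}
            theta = meta_update(theta, new_data, phi)  # 4. update theta via L
        return theta

Solutions I and II below are two instances of this skeleton with different choices of f and ℒ.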
12. Introduction
Solution I: recurrent policies
• Implement the policy as a recurrent network (RNN) and train it across a set of tasks with policy gradients (PG)
θ* = arg max_θ Σ_i E_{π_{φ_i}(τ)}[R(τ)], where φ_i = f_θ(ℳ_i)
Here the adaptation f_θ is the RNN's hidden-state update as it reads experience from task i
Y. Duan et al., “RL²: Fast Reinforcement Learning via Slow Reinforcement Learning”, ICLR 2017
13. Introduction
Solution II: optimization
• Learn a parameter initialization from which fine-tuning for a new task works, using policy gradients (PG) in both the inner and outer loop
θ* = arg max_θ Σ_i E_{π_{φ_i}(τ)}[R(τ)], where φ_i = f_θ(ℳ_i)
Here the adaptation f_θ is one or more gradient steps on task i starting from the initialization θ
C. Finn et al., “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks”, ICML 2017
17. Introduction
Challenge of meta-RL: sample efficiency
• Most current meta-RL methods require on-policy data during both meta-training and adaptation, which makes them exceedingly inefficient during meta-training
• As in image classification, meta-learning typically operates on the principle that meta-training time should match meta-test time
• In meta-RL, this makes it inherently difficult to meta-train a policy to adapt using off-policy data, which is systematically different from the data the policy would see when it explores on-policy in a new task at meta-test time
θ* = arg max_θ Σ_i E_{π_{φ_i}(τ)}[R(τ)], where φ_i = f_θ(ℳ_i)
Train with off-policy data, but then f_{θ*} is on-policy...
→ Can we meta-train with off-policy data and meta-test with on-policy data?
18. Introduction
Definition
• p(𝒯): distribution over tasks
• S: set of states
• A: set of actions
• p(s_0): initial state distribution
• p(s_{t+1} | s_t, a_t): transition distribution
• r(s_t, a_t): reward function
• 𝒯 = {p(s_0), p(s_{t+1} | s_t, a_t), r(s_t, a_t)}: a task
• context c: history of past transitions
• c_n^𝒯 = (s_n, a_n, r_n, s_n′): one transition in task 𝒯
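As a tiny illustration of the last two definitions, a context can be represented as a growing collection of transition tuples; the field names here are illustrative, not from the PEARL code:

    from collections import namedtuple

    # One transition c_n = (s_n, a_n, r_n, s_n') in a task T.
    Transition = namedtuple("Transition", ["s", "a", "r", "s_next"])

    # The context c is simply the transitions gathered so far in the task;
    # PEARL will later treat it as an unordered set.
    context = [
        Transition(s=0.0, a=1.0, r=-0.5, s_next=0.2),
        Transition(s=0.2, a=0.0, r=0.1, s_next=0.3),
    ]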
30. Introduction
Can we achieve both meta-training and adaptation efficiency?
• Yes: build an off-policy meta-RL algorithm that disentangles task inference and control
Solution III: partially observed RL
• Meta-training efficiency: an off-policy RL algorithm (SAC)
• Adaptation efficiency: a probabilistic encoder to estimate the task belief
θ* = arg max_θ Σ_i E_{π_{φ_i}(τ)}[R(τ)], where φ_i = f_θ(ℳ_i)
Here f is a probabilistic encoder, and the policy is trained with SAC
36. Probabilistic Latent Context
Modeling and learning latent contexts
• Adopt an amortized variational inference (VAE-style) approach to learn to infer a latent probabilistic context variable Z
→ An inference network q_φ(z|c) estimates the true posterior p(z|c)
• Learn a predictive model of reward and dynamics by optimizing q_φ(z|c) to reconstruct the MDP
• Taking the objective to be a log-likelihood, the resulting variational lower bound is:
E_𝒯 [ E_{z ∼ q_φ(z|c^𝒯)} [ R(𝒯, z) + β D_KL( q_φ(z|c^𝒯) ∥ p(z) ) ] ]
• p(z) is a unit Gaussian prior over Z
• β = 0.1 is the weight on the KL divergence term
• R(𝒯, z) is the “likelihood” term, implemented as a Bellman error; q_φ and p are the variational approximation to the posterior and the prior
• The KL term is a “regularization” term: an information bottleneck that constrains the mutual information between Z and c
(a loss sketch follows below)
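A minimal PyTorch-style sketch of this bound as a loss to minimize, assuming the encoder outputs a diagonal Gaussian and that the “likelihood” term arrives as a precomputed critic (Bellman-error) loss; the names are illustrative, not from the official PEARL code:

    import torch
    import torch.distributions as D

    def latent_context_loss(mu, log_std, bellman_loss, beta=0.1):
        # q_phi(z|c): diagonal Gaussian produced by the inference network
        posterior = D.Normal(mu, log_std.exp())
        # p(z): unit Gaussian prior over the latent task variable Z
        prior = D.Normal(torch.zeros_like(mu), torch.ones_like(mu))
        # KL(q_phi(z|c) || p(z)): the information-bottleneck regularizer
        kl = D.kl_divergence(posterior, prior).sum(dim=-1).mean()
        # "likelihood" term R(T, z) is the Bellman error from the critic
        return bellman_loss + beta * kl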
40. Probabilistic Latent Context
Designing the architecture of the inference network
• To infer what the task is, i.e., to identify the MDP, it is enough to have access to a collection of transitions, without regard for the order of the transitions
• So, use a permutation-invariant encoder for simplicity and speed
• To keep the method tractable, model the posterior as a product of independent Gaussian factors, one per transition, which again yields a Gaussian posterior (see the sketch after this list)
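Here is a minimal sketch of that Gaussian-product step, assuming a hypothetical encoder has already mapped each context transition to the mean and variance of one Gaussian factor:

    import torch

    def product_of_gaussians(mus, sigmas_sq):
        # mus, sigmas_sq: [num_transitions, latent_dim] per-transition factors
        # N(mu_n, sigma_n^2) predicted by the encoder. Summing precisions is
        # permutation-invariant: the posterior depends only on the set of
        # transitions, not on their order.
        precisions = 1.0 / torch.clamp(sigmas_sq, min=1e-7)  # numerical stability
        post_var = 1.0 / precisions.sum(dim=0)
        post_mu = post_var * (precisions * mus).sum(dim=0)
        return post_mu, post_var  # parameters of the Gaussian posterior q(z|c)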
41. Off-Policy Meta-RL
Aside: Soft Actor-Critic (SAC)
• “Soft”: maximize both rewards and the entropy of the policy (higher-entropy policies explore better)
J(π) = Σ_{t} E_{(s_t, a_t)} [ r(s_t, a_t) + α H(π(·|s_t)) ]
• “Actor-critic”: model both the actor (the policy) and the critic (the Q-function)
J_Q(θ) = E_{(s_t, a_t)} [ ½ ( Q_θ(s_t, a_t) − ( r(s_t, a_t) + γ E_{s_{t+1}}[ V_θ̄(s_{t+1}) ] ) )² ]
J_π(φ) = E_{(s_t, a_t)} [ Q_θ(s_t, a_t) + α H(π_φ(·|s_t)) ]
(V_θ̄ is the value function with target-network parameters θ̄; a sketch of both losses follows below)
• Much more sample-efficient than on-policy algorithms!
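A minimal PyTorch-style sketch of the two objectives, written as losses to minimize (maximizing entropy is equivalent to penalizing log-probability); `q_net`, `target_v_net`, and `policy` are hypothetical modules, not PEARL's actual classes:

    import torch
    import torch.nn.functional as F

    def critic_loss(q_net, target_v_net, batch, gamma=0.99):
        s, a, r, s_next = batch
        with torch.no_grad():
            target = r + gamma * target_v_net(s_next)   # r + gamma * E[V(s')]
        return F.mse_loss(q_net(s, a), target)          # (Q - target)^2, up to the 1/2

    def actor_loss(q_net, policy, s, alpha=0.2):
        a, log_prob = policy.sample(s)                  # reparameterized action sample
        # minimizing alpha * log pi - Q maximizes E[Q + alpha * H(pi)]
        return (alpha * log_prob - q_net(s, a)).mean()

In PEARL these networks additionally condition on a latent sample z ∼ q_φ(z|c), which couples task inference to control.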
46. Experiments
Ablations I: inference network architecture
• This result demonstrates the importance of decorrelating the samples used for the RL objective
47. Experiments
Ablations II: data sampling strategies
• This result demonstrates the importance of careful data sampling in off-policy meta-RL
48. Experiments
Ablations III: deterministic context
• With no stochasticity in the latent context variable, the only stochasticity comes from the policy and is thus time-invariant, hindering temporally extended exploration