Efficient Off-Policy Meta-Reinforcement
Learning via Probabilistic Context Variables
International Conference on Machine Learning (ICML), 2019
Kate Rakelly*, Aurick Zhou*, Deirdre Quillen
Chelsea Finn, Sergey Levine
Dongmin Lee
March, 2020
Outline
• Abstract
• Introduction
• Probabilistic Latent Context
• Off-Policy Meta-Reinforcement Learning
• Experiments
1
Abstract
5
Challenges of meta-reinforcement learning (meta-RL)
• Current methods rely heavily on on-policy experience, limiting their sample
efficiency
• There is also a lack of mechanisms to reason about task uncertainty when
adapting to new tasks, limiting their effectiveness in sparse reward
problems
PEARL (Probabilistic Embeddings for Actor-critic RL)
• An off-policy meta-RL algorithm to achieve both meta-training and
adaptation efficiency
• Performs probabilistic encoder-based filtering of latent task variables to enable
posterior sampling for structured and efficient exploration
Experiments
• Six continuous control meta-learning environments (MuJoCo simulator)
• 2-D navigation environment with sparse rewards
Introduction
7
Meta-learning problem statement
• The goal of meta-learning is to learn to adapt quickly to a new task given
a small amount of experience from previous tasks
(figure: supervised learning vs. reinforcement learning settings, annotated with meta-training efficiency and adaptation efficiency)
Meta-RL: learn adaptation rule
$\theta^* = \arg\max_\theta \sum_{i=1}^{T} \mathbb{E}_{\pi_{\phi_i}(\tau)}[R(\tau)]$ where $\phi_i = f_\theta(\mathcal{M}_i)$
(figure: MDP for task $i$; tasks $\mathcal{M}_1, \mathcal{M}_2, \mathcal{M}_3, \dots, \mathcal{M}_{\text{test}}$)
Introduction
8
Meta-RL problem statement
• The goal of meta-RL is to learn to adapt quickly to a new task given
a small amount of experience from previous tasks
Regular RL: learn policy for single task
$\theta^* = \arg\max_\theta \mathbb{E}_{\pi_\theta(\tau)}[R(\tau)]$, i.e. $\theta^* = f_{RL}(\mathcal{M})$
(figure: a single MDP)
meta-training / outer loop
adaptation / inner loop
Introduction
10
General meta-RL algorithm outline
• While training:
1. Sample task $i$, collect data $\mathcal{D}_i$;
2. Adapt the policy by computing $\phi_i = f(\theta, \mathcal{D}_i)$;
3. Collect data $\mathcal{D}_i'$ with the adapted policy $\pi_{\phi_i}$;
4. Update $\theta$ according to $\mathcal{L}(\mathcal{D}_i', \phi_i)$;
→ Different algorithms differ in:
• Choice of adaptation function $f$
• Choice of loss function $\mathcal{L}$
(a minimal code sketch of this loop is given below)
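To make the outline concrete, here is a minimal Python sketch of the generic loop, assuming the pieces (`tasks.sample`, `collect`, `adapt`, `meta_loss`, `optimizer.step`) are supplied by the specific algorithm; all names are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of the generic meta-RL loop above. Every callable here is a
# hypothetical placeholder for whatever a specific algorithm
# (RL^2, MAML, PEARL, ...) plugs in.

def meta_train(tasks, theta, collect, adapt, meta_loss, optimizer, n_iters=1000):
    """Generic meta-RL training (outer) loop.

    collect(task, params)       -> data gathered in `task` with those params
    adapt(theta, D_i)           -> phi_i, the adapted parameters (inner loop, rule f)
    meta_loss(D_i_prime, phi_i) -> scalar loss L used to update theta
    """
    for _ in range(n_iters):
        task = tasks.sample()                  # 1. sample task i, collect data D_i
        D_i = collect(task, theta)
        phi_i = adapt(theta, D_i)              # 2. adapt: phi_i = f(theta, D_i)
        D_i_prime = collect(task, phi_i)       # 3. collect D_i' with adapted policy pi_phi_i
        loss = meta_loss(D_i_prime, phi_i)     # 4. compute L(D_i', phi_i)
        theta = optimizer.step(theta, loss)    #    update theta
    return theta
```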
Introduction
11
Solution I: recurrent policies
• Implement the policy as a recurrent network, train across a set of tasks
$\theta^* = \arg\max_\theta \sum_{i=1}^{T} \mathbb{E}_{\pi_{\phi_i}(\tau)}[R(\tau)]$ where $\phi_i = f_\theta(\mathcal{M}_i)$
(figure: the adaptation rule $f_\theta$ is an RNN; training uses policy gradients (PG))
Y. Duan et al., “RL²: Fast Reinforcement Learning via Slow Reinforcement Learning”, ICLR 2017
Introduction
12
Solution II: optimization problem
• Learn a parameter initialization from which fine-tuning for a new task works
$\theta^* = \arg\max_\theta \sum_{i=1}^{T} \mathbb{E}_{\pi_{\phi_i}(\tau)}[R(\tau)]$ where $\phi_i = f_\theta(\mathcal{M}_i)$
(figure: the adaptation rule $f_\theta$ is one or more policy-gradient (PG) steps starting from the learned initialization)
C. Finn et al., “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks”, ICML 2017
(a gradient-based adaptation sketch follows)
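As an illustration of Solution II, a minimal sketch of one MAML-style inner-loop adaptation step, assuming a PyTorch policy and a hypothetical `policy_loss(policy, batch)` that returns a scalar loss on a batch of task data; this is a sketch of the idea, not the authors' implementation.

```python
# Sketch of one MAML-style inner-loop step (Solution II), assuming a PyTorch
# policy (torch.nn.Module) and a hypothetical `policy_loss(policy, batch)`.
import torch

def adapted_params(policy, task_batch, policy_loss, inner_lr=0.1):
    """One gradient step of adaptation: phi_i = theta - alpha * grad_theta L_i(theta)."""
    params = list(policy.parameters())
    loss = policy_loss(policy, task_batch)                        # L_i(theta) on D_i
    grads = torch.autograd.grad(loss, params, create_graph=True)  # keep graph for the outer loop
    # The outer loop evaluates the policy with these adapted parameters on new
    # data D_i' and backpropagates through the adaptation step into theta.
    return [p - inner_lr * g for p, g in zip(params, grads)]
```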
Introduction
16
Challenge of meta-RL: sample efficiency
• Most current meta-RL methods require on-policy data during both meta-
training and adaptation, which makes them exceedingly inefficient during
meta-training
• In the image-classification case, meta-learning typically operates on the principle
that meta-training time should match meta-test time
• In meta-RL, this makes it inherently difficult to meta-train a policy to adapt
using off-policy data, which is systematically different from the data that the
policy would see when it explores on-policy in a new task at meta-test time
→ Can we meta-train with off-policy data and meta-test with on-policy data?
$\theta^* = \arg\max_\theta \sum_{i=1}^{T} \mathbb{E}_{\pi_{\phi_i}(\tau)}[R(\tau)]$ where $\phi_i = f_\theta(\mathcal{M}_i)$
Train with off-policy data, but then $f_{\theta^*}$ is on-policy...
17
Introduction
Definition
• $p(\mathcal{T})$: distribution of tasks
• $S$: set of states
• $A$: set of actions
• $p(s_0)$: initial state distribution
• $p(s_{t+1} \mid s_t, a_t)$: transition distribution
• $r(s_t, a_t)$: reward function
• $\mathcal{T} = \{p(s_0),\, p(s_{t+1} \mid s_t, a_t),\, r(s_t, a_t)\}$: a task
• context $c$: history of past transitions
• $c_n^{\mathcal{T}} = (s_n, a_n, r_n, s_n')$: one transition in task $\mathcal{T}$
(a small data-structure sketch of these objects is given below)
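A small illustrative sketch of the objects defined above; the class names are placeholders and are not taken from the paper's code.

```python
# Illustrative data structures for the definitions above.
from typing import List, NamedTuple
import numpy as np

class Transition(NamedTuple):
    """One transition c_n = (s_n, a_n, r_n, s_n') observed in a task."""
    state: np.ndarray
    action: np.ndarray
    reward: float
    next_state: np.ndarray

# The context c is simply the history of transitions gathered so far in the
# current task; the encoder later treats it as an unordered collection.
Context = List[Transition]
```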
Introduction
18
Aside: POMDPs (Partially-Observable Markov Decision Processes)
(figure: the observation gives incomplete information about the state; the state itself is unobserved (hidden). Illustration: “That Way We Go” by Matt Spangler)
Introduction
20
The POMDP view of meta-RL
→ Can we leverage this connection to design a new meta-RL algorithm?
Introduction
21
Model belief over latent task variables
a belief distribution
Introduction
22
Model belief over latent task variables
(figure: three MDPs (MDP 0, MDP 1, MDP 2), each with a different goal location)
Introduction
24
RL with task-belief states
• “Task” can be supervised by reconstructing the MDP
Introduction
29
Can we achieve both meta-training and adaptation efficiency?
• Yes, let's make an off-policy meta-RL algorithm that disentangles task
inference and control
Solution III: partially observed RL
• Meta-training efficiency: off-policy RL algorithm
• Adaptation efficiency: probabilistic encoder to estimate the task belief
$\theta^* = \arg\max_\theta \sum_{i=1}^{T} \mathbb{E}_{\pi_{\phi_i}(\tau)}[R(\tau)]$ where $\phi_i = f_\theta(\mathcal{M}_i)$
(figure: the adaptation rule $f_\theta$ is a probabilistic encoder; the policy is trained with SAC)
Probabilistic Latent Context
35
Modeling and learning latent contexts
• Adopt an amortized variational inference (VAE-style) approach to learn to infer a
latent probabilistic context variable $Z$
→ Inference network $q_\phi(z \mid c)$ that estimates the true posterior $p(z \mid c)$
• Learn a predictive model of reward and dynamics by optimizing $q_\phi(z \mid c)$
to reconstruct the MDP
• Taking the objective to be a log-likelihood, the resulting variational lower bound is:
$\mathbb{E}_{\mathcal{T}}\!\left[\mathbb{E}_{z \sim q_\phi(z \mid c^{\mathcal{T}})}\!\left[R(\mathcal{T}, z) + \beta\, D_{KL}\!\left(q_\phi(z \mid c^{\mathcal{T}}) \,\|\, p(z)\right)\right]\right]$
• $p(z)$ is a unit Gaussian prior over $Z$
• $\beta = 0.1$ is the weight on the KL divergence term
(annotations: $R(\mathcal{T}, z)$ is the “likelihood” term, in practice the Bellman error;
the KL term is the “regularization” term / information bottleneck that constrains the
mutual information between $Z$ and $c$; $q_\phi(z \mid c^{\mathcal{T}})$ and $p(z)$ are the
variational approximation to the posterior and the prior)
(a sketch of this objective in code is given below)
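A minimal PyTorch-style sketch of this objective, assuming a Gaussian inference network whose mean and standard deviation are already computed, and a stand-in `reconstruction_loss` for the likelihood term (in PEARL the critic's Bellman error plays this role); the function name and arguments are illustrative.

```python
# Sketch of the variational objective above: a "likelihood" term plus a
# beta-weighted KL between q_phi(z|c) and the unit Gaussian prior p(z).
import torch
import torch.distributions as D

def context_objective(q_mean, q_std, reconstruction_loss, beta=0.1):
    """q_mean, q_std: parameters of q_phi(z|c) for one task, shape [latent_dim].
    reconstruction_loss: scalar tensor (e.g. the Bellman error)."""
    posterior = D.Normal(q_mean, q_std)
    prior = D.Normal(torch.zeros_like(q_mean), torch.ones_like(q_std))
    kl = D.kl_divergence(posterior, prior).sum()   # D_KL(q_phi(z|c) || p(z))
    return reconstruction_loss + beta * kl
```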
Designing architecture of inference network
• If we would like to infer what the task is or identify the MDP, it is enough
to have access to a collection of transitions, without regard for the order of
the transitions
• So, use a permutation-invariant encoder for simplicity and speed
• To keep the method tractable, we use a product of Gaussians, which results in
a Gaussian posterior (see the sketch below)
39
Probabilistic Latent Context
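A small sketch of the product-of-Gaussians idea, under the assumption that each transition contributes an independent Gaussian factor $N(\mu_n, \sigma_n^2)$ predicted by the encoder (the factor network itself is omitted); the function name is illustrative.

```python
# Permutation-invariant posterior as a product of per-transition Gaussian
# factors: the product of N(mu_n, sigma_n^2) over transitions n is again a
# Gaussian, with summed precisions and a precision-weighted mean.
import torch

def product_of_gaussians(mus, sigmas_squared, eps=1e-7):
    """mus, sigmas_squared: [num_transitions, latent_dim] tensors holding the
    per-transition factor parameters. Returns (mean, variance) of the product."""
    sigmas_squared = torch.clamp(sigmas_squared, min=eps)
    precisions = 1.0 / sigmas_squared            # 1 / sigma_n^2
    var = 1.0 / precisions.sum(dim=0)            # variance of the product
    mean = var * (precisions * mus).sum(dim=0)   # precision-weighted mean
    return mean, var
```

Summing over transitions is what makes the posterior invariant to their order.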
Aside: Soft Actor-Critic (SAC)
• “Soft”: maximize rewards and the entropy of the policy
(higher-entropy policies explore better)
$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t)}\!\left[r(s_t, a_t) + \alpha H\!\left(\pi(\cdot \mid s_t)\right)\right]$
• “Actor-Critic”: model both the actor (a.k.a. the policy) and the critic (a.k.a. the
Q-function)
$J_Q(\theta) = \mathbb{E}_{(s_t, a_t)}\!\left[\tfrac{1}{2}\left(Q_\theta(s_t, a_t) - \left(r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\!\left[V_{\bar{\theta}}(s_{t+1})\right]\right)\right)^2\right]$
$J_\pi(\phi) = \mathbb{E}_{(s_t, a_t)}\!\left[Q_\theta(s_t, a_t) + \alpha H\!\left(\pi_\phi(\cdot \mid s_t)\right)\right]$
• Much more sample efficient than on-policy algorithms!
40
Off-Policy Meta-RL
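A compact sketch of the two SAC objectives above, expressing the entropy term through $-\log\pi$ (since $H(\pi) = -\mathbb{E}[\log\pi]$). The networks, the `actor.sample` method, and the batch layout are assumptions for illustration, not a specific library's API.

```python
# Sketch of the SAC losses above. `critic`, `target_value`, and `actor` are
# assumed torch.nn.Module networks; `batch` holds replay-buffer tensors.
import torch
import torch.nn.functional as F

def sac_losses(critic, target_value, actor, batch, alpha=0.2, gamma=0.99):
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]

    # Critic objective J_Q: squared Bellman error against the target value network
    with torch.no_grad():
        q_target = r + gamma * target_value(s_next)
    critic_loss = 0.5 * F.mse_loss(critic(s, a), q_target)

    # Actor objective J_pi: maximize Q plus policy entropy (so minimize the negative);
    # `actor.sample` is assumed to return a reparameterized action and its log-prob.
    new_a, log_prob = actor.sample(s)
    actor_loss = (alpha * log_prob - critic(s, new_a)).mean()
    return critic_loss, actor_loss
```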
41
Off-Policy Meta-RL
$\theta^* = \arg\max_\theta \sum_{i=1}^{T} \mathbb{E}_{\pi_{\phi_i}(\tau)}[R(\tau)]$ where $\phi_i = f_\theta(\mathcal{M}_i)$
(figure: the adaptation rule $f_\theta$ is the probabilistic encoder; the policy is trained with SAC)
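To show how the pieces combine at meta-test time, a hedged sketch of PEARL-style adaptation via posterior sampling: start from the prior over $z$, act with the $z$-conditioned policy, and re-infer the posterior from the growing context. `encoder`, `policy`, and `run_episode` are illustrative placeholders, not the authors' exact API.

```python
# Sketch of PEARL-style adaptation at meta-test time. `encoder` is assumed to
# map a context to the (mean, std) of q(z|c); `run_episode` rolls out the
# z-conditioned policy in the environment and returns the transitions.
import torch

def adapt_to_new_task(env, policy, encoder, run_episode, latent_dim, num_episodes=3):
    context = []                                    # transitions gathered so far
    mean = torch.zeros(latent_dim)                  # prior p(z) = N(0, I)
    std = torch.ones(latent_dim)
    for _ in range(num_episodes):
        z = mean + std * torch.randn(latent_dim)    # posterior sampling for exploration
        transitions = run_episode(env, policy, z)   # act with pi(a | s, z)
        context.extend(transitions)
        mean, std = encoder(context)                # updated belief q(z | c)
    return mean, std
```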
42
Off-Policy Meta-RL
43
Experiments
Experimental setup
• variable reward function (locomotion direction, velocity, or goal)
• variable dynamics (joint parameters)
44
Experiments
Experimental results
Ablations I: inference network architecture
• This result demonstrates the importance of decorrelating the samples used
for the RL objective
45
Experiments
Ablations II: data sampling strategies
• This result demonstrates the importance of careful data sampling in off-
policy meta-RL
46
Experiments
Ablations III: deterministic context
• With no stochasticity in the latent context variable, the only stochasticity
comes from the policy and is thus time-invariant, hindering temporally
extended exploration
47
Experiments
Thank You!
Any Questions?