Deep Reinforcement Learning Through
Policy Optimization
John Schulman
October 20, 2016
Introduction and Overview
What is Reinforcement Learning?
Branch of machine learning concerned with taking
sequences of actions
Usually described in terms of an agent interacting with a
previously unknown environment, trying to maximize
cumulative reward
[Diagram: the agent sends actions to the environment; the environment returns observations and rewards]
What Is Deep Reinforcement Learning?
Reinforcement learning using neural networks to approximate
functions
Policies (select next action)
Value functions (measure goodness of states or
state-action pairs)
Models (predict next states and rewards)
Motor Control and Robotics
Robotics:
Observations: camera images, joint angles
Actions: joint torques
Rewards: stay balanced, navigate to target locations,
serve and protect humans
Business Operations
Inventory Management
Observations: current inventory levels
Actions: number of units of each item to purchase
Rewards: profit
In Other ML Problems
Hard Attention1
Observation: current image window
Action: where to look
Reward: classification accuracy
Sequential/structured prediction, e.g., machine
translation2
Observations: words in source language
Actions: emit word in target language
Rewards: sentence-level metric, e.g. BLEU score
1. V. Mnih et al. “Recurrent models of visual attention”. In: Advances in Neural Information Processing Systems. 2014, pp. 2204–2212.
2. H. Daumé III, J. Langford, and D. Marcu. “Search-based structured prediction”. In: Machine Learning 75.3 (2009), pp. 297–325; S. Ross, G. J. Gordon, and D. Bagnell. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning.” In: AISTATS. Vol. 1. 2. 2011, p. 6; M. Ranzato et al. “Sequence level training with recurrent neural networks”. In: arXiv preprint arXiv:1511.06732 (2015).
How Does RL Relate to Other ML Problems?
How Does RL Relate to Other ML Problems?
Supervised learning:
Environment samples input-output pair (xt, yt) ∼ ρ
Agent predicts ˆyt = f (xt)
Agent receives loss ℓ(yt, ˆyt)
Environment asks agent a question, and then tells her the
right answer
How Does RL Relate to Other ML Problems?
Contextual bandits:
Environment samples input xt ∼ ρ
Agent takes action ˆyt = f (xt)
Agent receives cost ct ∼ P(ct | xt, ˆyt) where P is an
unknown probability distribution
Environment asks agent a question, and gives her a noisy
score on her answer
Application: personalized recommendations
How Does RL Relate to Other ML Problems?
Reinforcement learning:
Environment samples input xt ∼ P(xt | xt−1, yt−1)
Input depends on your previous actions!
Agent takes action ˆyt = f (xt)
Agent receives cost ct ∼ P(ct | xt, ˆyt), where P is a
probability distribution unknown to the agent.
How Does RL Relate to Other Machine Learning
Problems?
Summary of differences between RL and supervised learning:
You don’t have full access to the function you’re trying to
optimize—must query it through interaction.
Interacting with a stateful world: inputs xt depend on your
previous actions
Should I Use Deep RL On My Practical Problem?
Might be overkill
Other methods worth investigating first
Derivative-free optimization (simulated annealing, cross
entropy method, SPSA)
Is it a contextual bandit problem?
Non-deep RL methods developed by Operations
Research community3
3. W. B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Vol. 703. John Wiley & Sons, 2007.
Recent Success Stories in Deep RL
ATARI using deep Q-learning4, policy gradients5, DAGGER6
Superhuman Go using supervised learning + policy
gradients + Monte Carlo tree search + value functions7
Robotic manipulation using guided policy search8
Robotic locomotion using policy gradients9
3D games using policy gradients10
4. V. Mnih et al. “Playing Atari with Deep Reinforcement Learning”. In: arXiv preprint arXiv:1312.5602 (2013).
5. J. Schulman et al. “Trust Region Policy Optimization”. In: arXiv preprint arXiv:1502.05477 (2015).
6. X. Guo et al. “Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning”. In: Advances in Neural Information Processing Systems. 2014, pp. 3338–3346.
7. D. Silver et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529.7587 (2016), pp. 484–489.
8. S. Levine et al. “End-to-end training of deep visuomotor policies”. In: arXiv preprint arXiv:1504.00702 (2015).
9. J. Schulman et al. “High-dimensional continuous control using generalized advantage estimation”. In: arXiv preprint arXiv:1506.02438 (2015).
10. V. Mnih et al. “Asynchronous Methods for Deep Reinforcement Learning”. In: arXiv preprint arXiv:1602.01783 (2016).
Markov Decision Processes
Definition
Markov Decision Process (MDP) defined by (S, A, P),
where
S: state space
A: action space
P(r, s′ | s, a): transition + reward probability distribution
Extra objects defined depending on problem setting
µ: Initial state distribution
Optimization problem: maximize expected cumulative
reward
Episodic Setting
In each episode, the initial state is sampled from µ, and
the agent acts until the terminal state is reached. For
example:
Taxi robot reaches its destination (termination = good)
Waiter robot finishes a shift (fixed time)
Walking robot falls over (termination = bad)
Goal: maximize expected reward per episode
Policies
Deterministic policies: a = π(s)
Stochastic policies: a ∼ π(a | s)
Episodic Setting
s0 ∼ µ(s0)
a0 ∼ π(a0 | s0)
s1, r0 ∼ P(s1, r0 | s0, a0)
a1 ∼ π(a1 | s1)
s2, r1 ∼ P(s2, r1 | s1, a1)
. . .
aT−1 ∼ π(aT−1 | sT−1)
sT , rT−1 ∼ P(sT , rT−1 | sT−1, aT−1)
Objective:
maximize η(π), where
η(π) = E[r0 + r1 + · · · + rT−1 | π]
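To make the sampling process above concrete, here is a minimal rollout sketch in Python. It assumes a generic env object with reset()/step() methods and a stochastic policy function; these names are illustrative placeholders, not part of the lecture.

```python
def run_episode(env, policy, max_steps=1000):
    """Sample one episode s0, a0, r0, s1, ... by following a ~ pi(a | s)."""
    states, actions, rewards = [], [], []
    s = env.reset()                       # s0 ~ mu(s0)
    for _ in range(max_steps):
        a = policy(s)                     # a_t ~ pi(a_t | s_t)
        s_next, r, done = env.step(a)     # s_{t+1}, r_t ~ P(s_{t+1}, r_t | s_t, a_t)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
        if done:                          # terminal state s_T reached
            break
    return states, actions, rewards

# eta(pi) is estimated by averaging sum(rewards) over many such episodes.
```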
Episodic Setting
[Diagram: the policy π produces actions a0, . . . , aT−1 from states s0, . . . , sT−1; the environment dynamics P produce states s1, . . . , sT and rewards r0, . . . , rT−1, starting from s0 ∼ µ]
Objective:
maximize η(π), where
η(π) = E[r0 + r1 + · · · + rT−1 | π]
Parameterized Policies
A family of policies indexed by parameter vector θ ∈ Rd
Deterministic: a = π(s, θ)
Stochastic: π(a | s, θ)
Analogous to classification or regression with input s,
output a.
Discrete action space: network outputs vector of
probabilities
Continuous action space: network outputs mean and
diagonal covariance of Gaussian
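As a rough sketch of what such parameterized policies look like in code (plain NumPy, one hidden layer; the parameter layout and the state-independent log-std for the Gaussian head are simplifying assumptions):

```python
import numpy as np

def categorical_policy(s, theta):
    """Discrete action space: the network outputs a vector of action probabilities."""
    W1, b1, W2, b2 = theta
    h = np.tanh(s @ W1 + b1)
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max())     # softmax, shifted for numerical stability
    return p / p.sum()                    # pi(. | s, theta)

def gaussian_policy(s, theta):
    """Continuous action space: the network outputs the mean of a diagonal Gaussian."""
    W1, b1, W2, b2, log_std = theta       # here log_std is a free parameter vector
    h = np.tanh(s @ W1 + b1)
    mean = h @ W2 + b2
    return mean, np.exp(log_std)          # mean and per-dimension standard deviation
```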
Black-Box Optimization Methods
Derivative Free Optimization Approach
Objective:
maximize E[R | π(·, θ)]
View the mapping θ → R as a black box
Ignore all information other than R collected during the
episode
Cross-Entropy Method
Evolutionary algorithm
Works embarrassingly well
I. Szita and A. Lőrincz. “Learning Tetris using the noisy cross-entropy method”. In: Neural Computation 18.12 (2006), pp. 2936–2941.
V. Gabillon, M. Ghavamzadeh, and B. Scherrer. “Approximate Dynamic Programming Finally Performs Well in the Game of Tetris”. In: Advances in Neural Information Processing Systems. 2013.
Cross-Entropy Method
Evolutionary algorithm
Works embarrassingly well
A similar algorithm, Covariance Matrix Adaptation, has
become standard in graphics.
Cross-Entropy Method
Initialize µ ∈ Rd , σ ∈ Rd
for iteration = 1, 2, . . . do
Collect n samples of θi ∼ N(µ, diag(σ))
Perform a noisy evaluation Ri ∼ θi
Select the top p% of samples (e.g. p = 20), which we’ll call the elite set
Fit a Gaussian distribution, with diagonal covariance, to the elite set, obtaining a new µ, σ
end for
Return the final µ.
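A compact NumPy sketch of the loop above, assuming a black-box evaluate(theta) function that runs an episode and returns a noisy return R (that function, and treating σ as a per-coordinate standard deviation, are assumptions for illustration):

```python
import numpy as np

def cross_entropy_method(evaluate, dim, n_samples=100, elite_frac=0.2,
                         n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)                          # initialize mu, sigma in R^d
    n_elite = int(n_samples * elite_frac)
    for _ in range(n_iters):
        thetas = mu + sigma * rng.standard_normal((n_samples, dim))  # theta_i ~ N(mu, diag)
        returns = np.array([evaluate(th) for th in thetas])          # noisy evaluations R_i
        elite = thetas[np.argsort(returns)[-n_elite:]]               # top p% of samples
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6     # refit diagonal Gaussian
    return mu                                                        # final mu
```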
Cross-Entropy Method
Analysis: a very similar algorithm is a
minorization-maximization (MM) algorithm, guaranteed
to monotonically increase expected reward
Recall that the Monte-Carlo EM algorithm collects samples,
reweights them, and then maximizes their logprob
We can derive an MM algorithm where each iteration
maximizes Σi Ri log p(θi )
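For the MM view, each iteration reduces to a reward-weighted maximum-likelihood fit of the sampling distribution; a minimal sketch for a diagonal Gaussian, assuming the returns Ri have been shifted to be nonnegative:

```python
import numpy as np

def reward_weighted_refit(thetas, returns):
    """thetas: (n, d) sampled parameters; returns: (n,) nonnegative returns R_i."""
    w = np.asarray(returns, dtype=float)
    w = w / w.sum()                                      # normalized weights
    mu = (w[:, None] * thetas).sum(axis=0)               # maximizes sum_i R_i log p(theta_i)
    var = (w[:, None] * (thetas - mu) ** 2).sum(axis=0)
    return mu, np.sqrt(var) + 1e-6                       # new mu, sigma
```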
Policy Gradient Methods
Policy Gradient Methods: Overview
Problem:
maximize E[R | πθ]
Intuitions: collect a bunch of trajectories, and ...
1. Make the good trajectories more probable
2. Make the good actions more probable
3. Push the actions towards good actions (DPG11, SVG12)
11. D. Silver et al. “Deterministic policy gradient algorithms”. In: ICML. 2014.
12. N. Heess et al. “Learning continuous control policies by stochastic value gradients”. In: Advances in Neural Information Processing Systems. 2015, pp. 2926–2934.
Score Function Gradient Estimator
Consider an expectation Ex∼p(x | θ)[f (x)]. Want to compute the
gradient wrt θ:
∇θ Ex [f (x)] = ∇θ ∫ dx p(x | θ) f (x)
= ∫ dx ∇θ p(x | θ) f (x)
= ∫ dx p(x | θ) (∇θ p(x | θ) / p(x | θ)) f (x)
= ∫ dx p(x | θ) ∇θ log p(x | θ) f (x)
= Ex [f (x) ∇θ log p(x | θ)].
The last expression gives us an unbiased gradient estimator. Just
sample xi ∼ p(x | θ), and compute ˆgi = f (xi ) ∇θ log p(xi | θ).
Need to be able to compute and differentiate the density p(x | θ)
wrt θ
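A toy illustration of the estimator on a 1-D Gaussian x ∼ N(θ, 1), where ∇θ log p(x | θ) = x − θ; the function f here is deliberately discontinuous to show the estimator does not need ∇f:

```python
import numpy as np

def score_function_gradient(f, theta, n_samples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    x = theta + rng.standard_normal(n_samples)   # x_i ~ p(x | theta) = N(theta, 1)
    grad_log_p = x - theta                       # d/dtheta log N(x; theta, 1)
    return np.mean(f(x) * grad_log_p)            # unbiased estimate of grad_theta E[f(x)]

# Example: f(x) = 1{x > 0}; the true gradient at theta = 0 is the standard
# normal density at 0, roughly 0.399.
print(score_function_gradient(lambda x: (x > 0).astype(float), theta=0.0))
```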
Score Function Gradient Estimator: Intuition
ˆgi = f (xi ) ∇θ log p(xi | θ)
Let’s say that f (x) measures how good the sample x is.
Moving in the direction ˆgi pushes up the logprob of the
sample, in proportion to how good it is
Valid even if f (x) is discontinuous or unknown, or the
sample space (containing x) is a discrete set
Score Function Gradient Estimator: Intuition
ˆgi = f (xi ) ∇θ log p(xi | θ)
Score Function Gradient Estimator for Policies
Now the random variable x is a whole trajectory
τ = (s0, a0, r0, s1, a1, r1, . . . , sT−1, aT−1, rT−1, sT )
∇θ Eτ [R(τ)] = Eτ [∇θ log p(τ | θ) R(τ)]
Just need to write out p(τ | θ):
p(τ | θ) = µ(s0) ∏_{t=0}^{T−1} [π(at | st, θ) P(st+1, rt | st, at)]
log p(τ | θ) = log µ(s0) + Σ_{t=0}^{T−1} [log π(at | st, θ) + log P(st+1, rt | st, at)]
∇θ log p(τ | θ) = ∇θ Σ_{t=0}^{T−1} log π(at | st, θ)
∇θ Eτ [R] = Eτ [R ∇θ Σ_{t=0}^{T−1} log π(at | st, θ)]
Interpretation: using good trajectories (high R) as supervised
examples in classification / regression
Policy Gradient: Use Temporal Structure
Previous slide:
∇θ Eτ [R] = Eτ [(Σ_{t=0}^{T−1} rt ) (Σ_{t=0}^{T−1} ∇θ log π(at | st, θ))]
We can repeat the same argument to derive the gradient
estimator for a single reward term rt :
∇θ E [rt ] = E [rt Σ_{t′=0}^{t} ∇θ log π(at′ | st′ , θ)]
Summing this formula over t, we obtain
∇θ E [R] = E [Σ_{t=0}^{T−1} rt Σ_{t′=0}^{t} ∇θ log π(at′ | st′ , θ)]
= E [Σ_{t=0}^{T−1} ∇θ log π(at | st, θ) Σ_{t′=t}^{T−1} rt′ ]
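In code, the reweighting above is just a reward-to-go computation per trajectory; a minimal sketch, assuming the per-timestep score vectors ∇θ log π(at | st, θ) have already been computed:

```python
import numpy as np

def reward_to_go(rewards):
    """Return [sum_{t'=t}^{T-1} r_{t'}] for t = 0, ..., T-1."""
    return np.cumsum(np.asarray(rewards, dtype=float)[::-1])[::-1]

def policy_gradient_estimate(grad_log_pi, rewards):
    """grad_log_pi: (T, d) array of grad_theta log pi(a_t | s_t, theta)."""
    weights = reward_to_go(rewards)
    return (weights[:, None] * grad_log_pi).sum(axis=0)   # single-trajectory estimate
```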
Policy Gradient: Introduce Baseline
Further reduce variance by introducing a baseline b(s):
∇θ Eτ [R] = Eτ [Σ_{t=0}^{T−1} ∇θ log π(at | st, θ) (Σ_{t′=t}^{T−1} rt′ − b(st))]
For any choice of b, the gradient estimator is unbiased.
A near-optimal choice is the expected return,
b(st) ≈ E [rt + rt+1 + rt+2 + · · · + rT−1]
Interpretation: increase the logprob of action at proportionally
to how much the returns Σ_{t′=t}^{T−1} rt′ are better than expected
Discounts for Variance Reduction
Introduce discount factor γ, which ignores delayed effects
between actions and rewards:
∇θ Eτ [R] ≈ Eτ [Σ_{t=0}^{T−1} ∇θ log π(at | st, θ) (Σ_{t′=t}^{T−1} γ^{t′−t} rt′ − b(st))]
Now, we want
b(st) ≈ E [rt + γ rt+1 + γ² rt+2 + · · · + γ^{T−1−t} rT−1]
Write the gradient estimator more generally as
∇θ Eτ [R] ≈ Eτ [Σ_{t=0}^{T−1} ∇θ log π(at | st, θ) Ât]
where Ât is the advantage estimate
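The discounted advantage estimate above can be computed per trajectory in a single backward pass; a sketch, with the baseline values b(st) supplied as a vector:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """R_t = sum_{t'=t}^{T-1} gamma^(t'-t) r_{t'} for each t."""
    out, acc = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        acc = rewards[t] + gamma * acc
        out[t] = acc
    return out

def advantage_estimates(rewards, baselines, gamma=0.99):
    """A_t = R_t - b(s_t), with baselines giving b(s_t) per timestep."""
    return discounted_returns(np.asarray(rewards, float), gamma) - np.asarray(baselines, float)
```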
“Vanilla” Policy Gradient Algorithm
Initialize policy parameter θ, baseline b
for iteration = 1, 2, . . . do
Collect a set of trajectories by executing the current policy
At each timestep in each trajectory, compute
the return Rt = Σ_{t′=t}^{T−1} γ^{t′−t} rt′ , and
the advantage estimate Ât = Rt − b(st).
Re-fit the baseline, by minimizing ‖b(st) − Rt‖², summed over all trajectories and timesteps.
Update the policy, using a policy gradient estimate ˆg, which is a sum of terms ∇θ log π(at | st, θ) Ât
end for
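Putting the pieces together, one iteration of this loop might look as follows. This is a sketch, not the exact implementation from the talk: run_episode and discounted_returns are the helpers sketched earlier, while grad_log_pi (the score of the policy) and the baseline object with fit/predict methods are assumed to exist.

```python
import numpy as np

def vanilla_pg_iteration(env, policy, theta, baseline, grad_log_pi,
                         gamma=0.99, n_episodes=20, lr=1e-2):
    grads, all_states, all_returns = [], [], []
    for _ in range(n_episodes):
        states, actions, rewards = run_episode(env, lambda s: policy(s, theta))
        returns = discounted_returns(rewards, gamma)          # R_t
        advantages = returns - baseline.predict(states)       # A_t = R_t - b(s_t)
        for s, a, adv in zip(states, actions, advantages):
            grads.append(adv * grad_log_pi(s, a, theta))      # grad log pi(a|s,theta) * A_t
        all_states += list(states); all_returns += list(returns)
    baseline.fit(all_states, all_returns)                     # re-fit b by regression on returns
    return theta + lr * np.mean(grads, axis=0)                # gradient ascent step on theta
```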
Extension: Step Sizes and Trust Regions
Why are step sizes a big deal in RL?
Supervised learning
Step too far → next update will fix it
Reinforcement learning
Step too far → bad policy
Next batch: collected under bad policy
Can’t recover, collapse in performance!
Extension: Step Sizes and Trust Regions
Trust Region Policy Optimization: limit the KL divergence
between the action distributions of the pre-update and
post-update policy13
Es [DKL(πold(· | s) ‖ π(· | s))] ≤ δ
Closely related to previous natural policy gradient
methods14
13. J. Schulman et al. “Trust Region Policy Optimization”. In: arXiv preprint arXiv:1502.05477 (2015).
14. S. Kakade. “A Natural Policy Gradient.” In: NIPS. Vol. 14. 2001, pp. 1531–1538; J. A. Bagnell and J. Schneider. “Covariant policy search”. In: IJCAI. 2003; J. Peters and S. Schaal. “Natural actor-critic”. In: Neurocomputing 71.7 (2008), pp. 1180–1190.
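One way to see the constraint in code, for a discrete action space: estimate the mean KL over visited states and shrink the step until it is within δ. This is only a back-off sketch, not the conjugate-gradient procedure used in the TRPO paper; action_probs_fn is an assumed helper returning π(· | s) for a batch of states.

```python
import numpy as np

def mean_kl(old_probs, new_probs, eps=1e-8):
    """Mean over states of KL(pi_old(.|s) || pi_new(.|s)) for categorical policies."""
    return np.mean(np.sum(old_probs * np.log((old_probs + eps) / (new_probs + eps)), axis=1))

def trust_region_step(theta, step, action_probs_fn, states, delta=0.01, max_backtracks=10):
    old = action_probs_fn(states, theta)
    for _ in range(max_backtracks):                  # halve the step until the KL constraint holds
        if mean_kl(old, action_probs_fn(states, theta + step)) <= delta:
            return theta + step
        step = 0.5 * step
    return theta                                     # give up and keep the old parameters
```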
Extension: Further Variance Reduction
Use value functions for more variance reduction (at the
cost of bias): actor-critic methods15
Reparameterization trick: instead of increasing the
probability of the good actions, push the actions towards
(hopefully) better actions16
15. J. Schulman et al. “High-dimensional continuous control using generalized advantage estimation”. In: arXiv preprint arXiv:1506.02438 (2015); V. Mnih et al. “Asynchronous Methods for Deep Reinforcement Learning”. In: arXiv preprint arXiv:1602.01783 (2016).
16. D. Silver et al. “Deterministic policy gradient algorithms”. In: ICML. 2014; N. Heess et al. “Learning continuous control policies by stochastic value gradients”. In: Advances in Neural Information Processing Systems. 2015, pp. 2926–2934.
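A rough sketch of the reparameterization idea behind these methods: write a = µ(s; θ) + σ·ε with ε ∼ N(0, I), so the gradient of a learned critic Q(s, a) can flow through the action into θ. Here mu, grad_theta_mu, and grad_a_Q are hypothetical differentiable components, not code from the talk.

```python
import numpy as np

def reparameterized_policy_gradient(states, theta, mu, grad_theta_mu, grad_a_Q,
                                    sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    g = 0.0
    for s in states:
        a = mu(s, theta) + sigma * rng.standard_normal()      # a = mu(s) + sigma * eps
        # chain rule: dQ/dtheta = dQ/da * da/dtheta, pushing actions toward higher Q
        g = g + grad_a_Q(s, a) * grad_theta_mu(s, theta)
    return g / len(states)
```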
Demo
Fin
Thank you. Questions?
