Distributed Deep Q-Learning
Hao Yi Ong
joint work with K. Chavez, A. Hong
Stanford University
Box, 6/3/15
Outline
Introduction
Reinforcement learning
Serial algorithm
Distributed algorithm
Numerical experiments
Conclusion
Introduction 2/39
Motivation
long-standing challenge of reinforcement learning (RL)
– control with high-dimensional sensory inputs (e.g., vision, speech)
– shift away from reliance on hand-crafted features
utilize breakthroughs in deep learning for RL [M+13, M+15]
– extract high-level features from raw sensory data
– learn better representations than handcrafted features with neural
network architectures used in supervised and unsupervised learning
create fast learning algorithm
– train efficiently with stochastic gradient descent (SGD)
– distribute training process to accelerate learning [DCM+12]
Introduction 3/39
Success with Atari games
Introduction 4/39
Theoretical complications
deep learning algorithms require
huge training datasets
– sparse, noisy, and delayed reward signal in RL
– delay of ∼10^3 time steps between actions and resulting rewards
– cf. direct association between inputs and targets in supervised
learning
independence between samples
– sequences of highly correlated states in RL problems
fixed underlying data distribution
– distribution changes as RL algorithm learns new behaviors
Introduction 5/39
Goals
distributed deep RL algorithm
robust neural network agent
– must succeed in challenging test problems
control policies with high-dimensional sensory input
– obtain better internal representations than handcrafted features
fast training algorithm
– efficiently produce, use, and process training data
Introduction 6/39
Outline
Introduction
Reinforcement learning
Serial algorithm
Distributed algorithm
Numerical experiments
Conclusion
Reinforcement learning 7/39
Playing games
(figure) agent interacting with the game emulator environment:
– Environment: game emulator
– Action: game input
– State: series of screens and inputs
– Reward: game score change
objective: learned policy maximizes future rewards
R_t = Σ_{t'=t}^{T} γ^(t'−t) r_{t'},
with
– discount factor γ
– reward change r_{t'} at time t'
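As an aside (not from the slides), the return above can be computed from a reward sequence in a few lines; `rewards` and `gamma` are illustrative names:

```python
def discounted_return(rewards, gamma):
    """R_t = sum over t' from t to T of gamma^(t'-t) * r_t', computed for t = 0."""
    R = 0.0
    for r in reversed(rewards):   # accumulate from the end of the episode backward
        R = r + gamma * R
    return R

# three steps of reward with discount factor 0.9
print(discounted_return([1.0, 0.0, 2.0], 0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```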
Reinforcement learning 8/39
State-action value function
basic idea behind RL is to estimate
Q*(s, a) = max_π E[ R_t | s_t = s, a_t = a, π ],
where π maps states to actions (or distributions over actions)
optimal value function obeys Bellman equation
Q*(s, a) = E_{s'∼E}[ r + γ max_{a'} Q*(s', a') | s, a ],
where E is the MDP environment
Reinforcement learning 9/39
Value approximation
typically, a linear function approximator is used to estimate Q
Q(s, a; θ) ≈ Q*(s, a),
which is parameterized by θ
we introduce the Q-network
– nonlinear neural network state-action value function approximator
– “Q” for Q-learning
Reinforcement learning 10/39
Q-network
trained by minimizing a sequence of loss functions
L_i(θ_i) = E_{s,a∼ρ(·)}[ (y_i − Q(s, a; θ_i))^2 ],
with
– iteration number i
– target y_i = E_{s'∼E}[ r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a ]
– “behavior distribution” (exploration policy) ρ (s, a)
architecture varies according to application
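A minimal NumPy sketch of the per-sample target and loss above (the deck's Q-network lives in Caffe; `q_vals`, `q_next`, and the discount value 0.99 are illustrative assumptions):

```python
import numpy as np

def td_target(r, q_next, terminal, gamma=0.99):
    """y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}); just r at terminal states."""
    return r if terminal else r + gamma * np.max(q_next)

def squared_td_error(q_vals, a, y):
    """Per-sample loss (y_i - Q(s, a; theta_i))^2 for the action actually taken."""
    return (y - q_vals[a]) ** 2

# toy 4-action example; q_vals stands in for Q(s, .; theta_i)
q_vals = np.array([0.1, 0.5, -0.2, 0.0])
y = td_target(r=1.0, q_next=None, terminal=True)
print(squared_td_error(q_vals, a=1, y=y))  # (1.0 - 0.5)^2 = 0.25
```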
Reinforcement learning 11/39
Outline
Introduction
Reinforcement learning
Serial algorithm
Distributed algorithm
Numerical experiments
Conclusion
Serial algorithm 12/39
Preprocessing
(figure) raw screen → downsample + grayscale → final input
Serial algorithm 13/39
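A possible NumPy sketch of the preprocessing pipeline above; the factor-of-2 striding and channel-mean grayscale are assumptions, since the slides do not specify the exact crop or resize:

```python
import numpy as np

def preprocess(frame, factor=2):
    """Convert an RGB frame (H, W, 3) to grayscale and subsample by `factor`."""
    gray = frame.mean(axis=2)              # simple luminance approximation
    return gray[::factor, ::factor]        # naive downsampling by striding

# toy 84x84 RGB frame -> 42x42 grayscale input
frame = np.random.randint(0, 256, size=(84, 84, 3)).astype(np.float32)
print(preprocess(frame).shape)  # (42, 42)
```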
Network architecture
Serial algorithm 14/39
Convolutional neural network
biologically inspired by the visual cortex
CNN example: single layer, single frame to single filter, stride = 1
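A direct, unoptimized NumPy version of the single-layer, single-frame, single-filter, stride-1 convolution mentioned above (illustrative only):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Single-channel, single-filter, stride-1 'valid' convolution (no padding)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0            # a 3x3 box (averaging) filter
print(conv2d_valid(image, kernel).shape)  # (3, 3)
```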
Serial algorithm 15/39
Stochastic gradient descent
optimize Q-network loss function by gradient descent
Q(s, a; θ) := Q(s, a; θ) + α ∇_θ Q(s, a; θ),
with
– learning rate α
for computational expedience
– update weights after every time step
– avoid computing full expectations
– replace with single samples from ρ and E
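To make the single-sample update concrete, here is a sketch with a linear-in-features Q (an assumption for readability; the deck's Q-network is convolutional, and `phi`, `y` are hypothetical):

```python
import numpy as np

def sgd_step(theta, phi, y, alpha=0.01):
    """One single-sample step on the squared TD error for a linear Q = theta . phi.
    Gradient of (y - theta.phi)^2 w.r.t. theta is -2 * (y - theta.phi) * phi."""
    q = theta @ phi
    theta += alpha * 2.0 * (y - q) * phi   # move the prediction toward the target y
    return theta

theta = np.zeros(4)
phi = np.array([1.0, 0.0, 0.5, 0.0])      # features of a single sampled (s, a)
print(sgd_step(theta, phi, y=1.0))
```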
Serial algorithm 16/39
Q-learning
Q(s, a) := Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
model-free RL
– avoids estimating E
off-policy
– learns policy a = argmax_a Q(s, a; θ)
– uses behavior distribution selected by an ε-greedy strategy
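A tabular sketch of this update together with the ε-greedy behavior policy (a dictionary-backed `Q` is an illustration, not the network-based approach used in the deck):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Behavior policy: random action w.p. epsilon, else argmax_a Q[s, a]."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Q(s,a) := Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```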
Serial algorithm 17/39
Experience replay
a kind of short-term memory
trains optimal policy using “behavior policy” (off-policy)
– learns policy π(s) = argmax_a Q(s, a; θ)
– uses an ε-greedy strategy (behavior policy) for state-space exploration
store agent’s experiences at each time step
e_t = (s_t, a_t, r_t, s_{t+1})
– experiences form a replay memory dataset with fixed capacity
– execute Q-learning updates with random samples of experience
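A minimal replay-memory sketch with fixed capacity and uniform minibatch sampling, matching the description above (the class and method names are assumptions):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of experiences e_t = (s_t, a_t, r_t, s_{t+1})."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Uniform sampling breaks correlations between consecutive time steps."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```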
Serial algorithm 18/39
Serial deep Q-learning
given replay memory D with capacity N
initialize Q-networks Q, ˆQ with same random weights θ
repeat until timeout
initialize frame sequence s_1 = {x_1} and preprocessed state φ_1 = φ(s_1)
for t = 1, . . . , T
1. select action a_t = argmax_a Q(φ(s_t), a; θ) w.p. 1 − ε, a random action otherwise
2. execute action a_t and observe reward r_t and frame x_{t+1}
3. append s_{t+1} = (s_t, a_t, x_{t+1}) and preprocess φ_{t+1} = φ(s_{t+1})
4. store experience (φ_t, a_t, r_t, φ_{t+1}) in D
5. uniformly sample minibatch (φ_j, a_j, r_j, φ_{j+1}) ∼ D
6. set y_j = r_j if φ_{j+1} terminal, else y_j = r_j + γ max_{a'} Q̂(φ_{j+1}, a'; θ)
7. perform gradient descent step for Q on minibatch
8. every C steps reset Q̂ = Q
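A compact Python skeleton of the loop above. The `env`, `q_net`, and `target_net` objects and their methods are hypothetical stand-ins for the game emulator and the Caffe-backed networks, `preprocess` is any frame preprocessor (e.g., the earlier sketch), the minibatch size of 32 is an assumption, and a terminal flag is stored with each experience here:

```python
import random

def train(env, q_net, target_net, memory, episodes, T, epsilon, gamma, C):
    step = 0
    for _ in range(episodes):
        phi = preprocess(env.reset())
        for t in range(T):
            # 1. epsilon-greedy action selection
            a = env.random_action() if random.random() < epsilon \
                else q_net.best_action(phi)
            # 2-4. act, observe reward and next frame, store the experience
            frame, r, terminal = env.step(a)
            phi_next = preprocess(frame)
            memory.add(phi, a, r, phi_next, terminal)
            # 5-6. sample a minibatch and form targets with the frozen network
            batch = memory.sample(32)
            targets = [rj if done else rj + gamma * target_net.max_q(pj1)
                       for (pj, aj, rj, pj1, done) in batch]
            # 7. one gradient descent step on the minibatch
            q_net.gradient_step(batch, targets)
            # 8. periodically refresh the target network
            step += 1
            if step % C == 0:
                target_net.copy_weights_from(q_net)
            if terminal:
                break
            phi = phi_next
```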
Serial algorithm 19/39
Theoretical complications
deep learning algorithms require
huge training datasets
independence between samples
fixed underlying data distribution
Serial algorithm 20/39
Deep Q-learning
avoids theoretical complications
greater data efficiency
– each experience potentially used in many weight updates
reduce correlations between samples
– randomizing samples breaks correlations from consecutive samples
experience replay averages behavior distribution over states
– smooths out learning
– avoids oscillations or divergence in gradient descent
Serial algorithm 21/39
Cat video
Mini-break 22/39
Outline
Introduction
Reinforcement learning
Serial algorithm
Distributed algorithm
Numerical experiments
Conclusion
Distributed algorithm 23/39
Data parallelism
downpour SGD: generic asynchronous distributed SGD
θ := θ + α Δθ
(figure) parameter server holds θ; workers asynchronously fetch θ and push updates Δθ
Distributed algorithm 24/39
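An in-process sketch of this asynchronous pattern using threads; the real system runs workers and the parameter server on separate machines, so everything here (class, names, the toy `compute_update`) is illustrative:

```python
import threading
import numpy as np

class ParameterServer:
    """Holds the shared parameters; workers fetch theta and push updates asynchronously."""
    def __init__(self, dim, alpha=0.01):
        self.theta = np.zeros(dim)
        self.alpha = alpha
        self.lock = threading.Lock()

    def fetch(self):
        with self.lock:
            return self.theta.copy()          # workers may act on a slightly stale copy

    def push(self, delta):
        with self.lock:
            self.theta += self.alpha * delta  # theta := theta + alpha * delta_theta

def worker(server, n_updates, compute_update):
    for _ in range(n_updates):
        theta = server.fetch()
        server.push(compute_update(theta))    # e.g., a (negative) loss gradient

server = ParameterServer(dim=10)
threads = [threading.Thread(target=worker,
                            args=(server, 100, lambda th: np.ones_like(th) - th))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(server.theta[:3])
```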
Model parallelism
on each worker machine
computation of gradient is pushed down to hardware
– parallelized according to available CPU/GPU resources
– uses the Caffe deep learning framework
complexity scales linearly with number of parameters
– GPU provides speedup, but limits model size
– CPU slower, but model can be much larger
Distributed algorithm 25/39
Implementation
data shards are generated locally on each model worker in real time
– data is stored independently for each worker
– since game emulation is simple, generating data is fast
– simple fault tolerance approach: regenerate data if worker dies
algorithm scales very well with data
– since data lives locally on workers, no data is sent
update parameter with gradients using RMSprop or AdaGrad
communication pattern: multiple asynchronous all-reduces
– one-to-all and all-to-one, but asynchronous for every minibatch
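A NumPy sketch of an AdaGrad-style server update (hyperparameters are assumptions); the per-parameter accumulator makes the O(p) memory and time cost of each update explicit:

```python
import numpy as np

class AdaGradUpdater:
    """Applies pushed gradients on the server with AdaGrad; keeps O(p) accumulator state."""
    def __init__(self, dim, alpha=0.01, eps=1e-8):
        self.alpha = alpha
        self.eps = eps
        self.g2 = np.zeros(dim)               # running sum of squared gradients

    def apply(self, theta, grad):
        self.g2 += grad ** 2                  # O(p) work per update
        return theta - self.alpha * grad / (np.sqrt(self.g2) + self.eps)

theta = np.zeros(3)
updater = AdaGradUpdater(dim=3)
print(updater.apply(theta, np.array([0.1, -0.2, 0.0])))
```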
Distributed algorithm 26/39
Implementation
bottleneck is parameter update time on parameter server
– e.g., if parameter server gradient update takes 10 ms, then we can
only do up to 100 updates per second (using buffers, etc.)
trade-off between parallel updates and model staleness
– because a worker is likely using a stale model, its updates are “noisy” and not
of the same quality as in the serial implementation
Distributed algorithm 27/39
Outline
Introduction
Reinforcement learning
Serial algorithm
Distributed algorithm
Numerical experiments
Conclusion
Numerical experiments 28/39
Evaluation
Numerical experiments 29/39
Snake
parameters
– snake length grows with number of apples eaten (+1 reward)
– one apple at any time, regenerated once eaten
– n × n array, with walled-off world (−1 if snake dies)
– want to maximize score, equal to apples eaten (minus 1)
complexity
– four possible states for each cell: {empty, head, body, apple}
– state space cardinality is O(n^4 · 2^(n^2)) (-ish)
– four possible actions: {north, south, east, west}
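A toy Snake step function reflecting the reward structure above (+1 for an apple, −1 for dying); the grid encoding and apple respawn rule are assumptions, not the emulator used in the experiments:

```python
import random

class Snake:
    """Toy n-by-n Snake: +1 for eating the apple, -1 (terminal) for hitting a wall or itself."""
    MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

    def __init__(self, n=8):
        self.n = n
        self.body = [(n // 2, n // 2)]                 # head is body[0]
        self.apple = (random.randrange(n), random.randrange(n))

    def step(self, action):
        dr, dc = self.MOVES[action]
        head = (self.body[0][0] + dr, self.body[0][1] + dc)
        if not (0 <= head[0] < self.n and 0 <= head[1] < self.n) or head in self.body:
            return -1, True                            # hit a wall or itself: episode ends
        self.body.insert(0, head)
        if head == self.apple:
            self.apple = (random.randrange(self.n), random.randrange(self.n))
            return 1, False                            # ate the apple: snake grows by one
        self.body.pop()                                # otherwise the tail advances
        return 0, False

game = Snake(n=8)
print(game.step("east"))   # e.g. (0, False)
```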
Numerical experiments 30/39
Software
at initialization, broadcast neural network architecture
– each worker spawns Caffe with architecture
– populates replay dataset with experiences via random policy
for some number of iterations:
– workers fetch latest parameters for Q network from server
– compute and send gradient update
– parameters updated on server with RMSprop or AdaGrad (requires
O(p) memory and time)
lightweight use of Spark
– shipping required files and serialized code to worker machines
– partitioning and scheduling number of updates to do on each worker
– coordinating identities of worker/server machines
– partial implementation of generic interface between Caffe and Spark
ran on dual core Intel i7 clocked at 2.2 GHz, 12 GB RAM
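A heavily simplified PySpark sketch of this lightweight coordination role, partitioning a number of updates across workers; the `run_updates` body and any interaction with Caffe or the parameter server are hypothetical:

```python
from pyspark import SparkContext

def run_updates(num_updates):
    """Hypothetical per-worker routine: fetch parameters, compute gradients with the
    local Caffe net and emulator, and push them back, num_updates times."""
    return num_updates

sc = SparkContext(appName="distributed-dqn")
num_workers, total_updates = 4, 10000
per_worker = [total_updates // num_workers] * num_workers
# Spark only ships serialized code and schedules the work; no training data is moved
done = sc.parallelize(per_worker, numSlices=num_workers).map(run_updates).collect()
print(sum(done), "updates scheduled")
```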
Numerical experiments 31/39
Complexity analysis
model complexity
– determined by architecture; roughly on the order of number of
parameters
gradient calculation via backpropagation
– distributed across worker’s CPU/GPU, linear with model size
communication time and cost
– for each update, linear with model size
Numerical experiments 32/39
Compute/communicate times
compute/communicate time scales linearly with model size
(plot) experimental times vs. number of parameters (0 to 8×10^6); legend: comms (×1 ms), gradient (×100 ms), latency (×1 ms)
– process is compute-bound by gradient calculations
– upper bound on update rate inversely proportional to model size
– with many workers in parallel, independent of batch size
Numerical experiments 33/39
Serial vs. distributed
performance scales linearly with number of workers
(plot) average reward vs. wall clock time (min); curves: serial, double
Numerical experiments 34/39
Example game play
Figure: Dumb snake. Figure: Smart snake.
Numerical experiments 35/39
Outline
Introduction
Reinforcement learning
Serial algorithm
Distributed algorithm
Numerical experiments
Conclusion
Conclusion 36/39
Summary
deep Q-learning [M+13, M+15] scales well via DistBelief [DCM+12]
asynchronous model updates accelerate training despite lower
update quality (vs. serial)
Conclusion 37/39
Contact
questions, code, ideas, go-karting, swing dancing, . . .
haoyi.ong@gmail.com
Conclusion 38/39
References
[DCM+12] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
[M+13] V. Mnih et al. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[M+15] V. Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
39/39
