Practical RL with
TensorFlow
Illia Polosukhin, XIX.ai
Reinforcement Learning Problem
OpenAI Gym
- Library of environments
Control, Atari, Doom, etc.
- Same API
- Provides a way to share and compare results
https://gym.openai.com/
Acting in an Environment
Random Agent
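For illustration, a minimal random agent written against the classic Gym API of the talk's era (reset returning an observation, step returning a 4-tuple); the CartPole environment is our choice, not the slides':

```python
# A minimal random agent: sample actions uniformly and track the return.
# Uses the classic Gym API (env.reset() -> obs, env.step() -> 4-tuple).
import gym

env = gym.make('CartPole-v0')  # illustrative environment choice
for episode in range(5):
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()             # random action
        observation, reward, done, info = env.step(action)
        total_reward += reward
    print('Episode %d: return %.1f' % (episode, total_reward))
```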
Let’s review some theory
Markov Decision Process
MDP < S, A, P, R, 𝛾 >
- S: set of states
- A: set of actions
- P(s, a, s’): probability of transitioning from s to s’ under action a
- R(s): reward function
- 𝛾: discount factor
Trace: {<s0,a0,r0>, …, <sn,an,rn>}
Definitions
- Return: total discounted reward: Gt = rt + 𝛾rt+1 + 𝛾²rt+2 + … = ∑k=0..∞ 𝛾k rt+k
- Policy: Agent’s behavior
- Deterministic policy: π(s) = a
- Stochastic policy: π(a | s) = P[At = a | St = s]
- Value function: Expected return starting from state s:
- State-value function: Vπ(s) = Eπ[R | St = s]
- Action-value function: Qπ(s, a) = Eπ[R | St = s, At = a]
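To make the return concrete: given a trace of rewards, the discounted return at every step can be computed in a single backward pass (a pure-Python sketch; the function name is ours):

```python
# Discounted return Gt = rt + 𝛾·rt+1 + 𝛾²·rt+2 + ..., computed backwards:
# the return at step t is the reward at t plus the discounted return at t+1.
def discounted_returns(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]
```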
Deep Q Learning
- Model-free, off-policy technique to learn optimal Q(s, a):
- Qi+1(s, a) ← Qi(s, a) + 𝛼(R + 𝛾 maxa’ Qi(s’, a’) - Qi(s, a))
- The optimal policy is then π(s) = argmaxa’ Q(s, a’)
- Requires exploration (ε-greedy) to cover transitions from a variety of states.
- Take a random action with probability ε; start ε high and decay it to a low value as training progresses.
- Deep Q Learning: approximate Q(s, a) with neural network: Q(s, a, 𝜃)
- Do stochastic gradient descent on the loss: L(𝜃) = (R + 𝛾 maxa’ Q(s’, a’, 𝜃) − Q(s, a, 𝜃))²
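In tabular form, the update and ε-greedy exploration above fit in a few lines (an illustrative sketch; all names are ours):

```python
# Tabular Q-learning: Q(s, a) ← Q(s, a) + 𝛼(R + 𝛾 max_a' Q(s', a') − Q(s, a)),
# with ε-greedy exploration over a discrete action set.
import random
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)] -> estimated value
alpha, gamma = 0.1, 0.99

def epsilon_greedy(state, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit

def q_update(state, action, reward, next_state, actions):
    target = reward + gamma * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```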
Q-network
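A minimal TF 1.x-style sketch of such a Q-network (layer sizes, STATE_DIM and NUM_ACTIONS are assumptions; an Atari agent would use convolutional layers over screen pixels instead):

```python
import tensorflow as tf  # TF 1.x-style API, matching the talk's era

STATE_DIM, NUM_ACTIONS = 4, 2  # assumptions, e.g. CartPole

states = tf.placeholder(tf.float32, [None, STATE_DIM], name='states')
hidden = tf.layers.dense(states, 64, activation=tf.nn.relu)
q_values = tf.layers.dense(hidden, NUM_ACTIONS)  # Q(s, a, 𝜃): one output per action
```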
Run Optimization
Full example: https://github.com/ilblackdragon/tensorflow-rl/blob/master/examples/atari-rl.py
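The linked example contains the full training loop; continuing the sketch above, the loss and optimization step could look as follows (the separate target network from Mnih et al., 2015 is omitted for brevity):

```python
# Squared TD error between Q(s, a, 𝜃) and the target R + 𝛾 max_a' Q(s', a').
actions = tf.placeholder(tf.int32, [None], name='actions')
targets = tf.placeholder(tf.float32, [None], name='targets')  # R + 𝛾 max Q(s', ·)

action_mask = tf.one_hot(actions, NUM_ACTIONS)
q_selected = tf.reduce_sum(q_values * action_mask, axis=1)    # Q(s, a, 𝜃)
loss = tf.reduce_mean(tf.square(targets - q_selected))
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)
```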
Monitored Session
- Handles pitfalls of distributed training.
- Saving and restoring checkpoints.
- Hooks are a general interface for injecting computation into the TensorFlow training loop.
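A sketch of how this fits together in TF 1.x (the checkpoint directory and step count are placeholders; tf.assign_add stands in for a real training op):

```python
import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
train_op = tf.assign_add(global_step, 1)  # stand-in for a real training op

hooks = [tf.train.StopAtStepHook(last_step=1000)]  # hooks run inside the loop
with tf.train.MonitoredTrainingSession(checkpoint_dir='/tmp/rl-train',
                                       hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)  # checkpoints are saved/restored automatically
```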
Original Results on Atari Games
Mnih et al., 2013
Beating Human Level Mnih et al., 2015
Policy Gradient
- Given a policy π𝜃(a | s), find 𝜃 that maximizes the expected return:
J(𝜃) = ∑s dπ(s) V(s)
- In Deep RL, we approximate π𝜃(a | s) with a neural network.
- Usually with a softmax layer on top to estimate the probability of each action.
- We can estimate J(𝜃) from samples of observed behavior: ∑k=0..T p𝜃(𝜏k | π) R(𝜏k)
- Do stochastic gradient descent using the update:
𝜃i+1 = 𝜃i + 𝛼 (1/T) ∑k=0..T ∇log p𝜃(𝜏k | π) R(𝜏k)
Policy Network
Run Optimization
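Combining the two slides above, a minimal REINFORCE-style sketch in TF 1.x (sizes and names are assumptions; sparse_softmax_cross_entropy_with_logits computes −log π𝜃(a | s), so weighting it by the return and minimizing implements the update from the slide):

```python
import tensorflow as tf

STATE_DIM, NUM_ACTIONS = 4, 2  # assumptions

states = tf.placeholder(tf.float32, [None, STATE_DIM])
actions = tf.placeholder(tf.int32, [None])
returns = tf.placeholder(tf.float32, [None])   # R(𝜏) for each sampled step

hidden = tf.layers.dense(states, 64, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, NUM_ACTIONS)  # softmax layer on top
probs = tf.nn.softmax(logits)                  # π𝜃(a | s), used for sampling

neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=actions, logits=logits)             # −log π𝜃(a | s)
loss = tf.reduce_mean(neg_log_prob * returns)  # minimizing ascends ∇log π · R
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```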
Async Advantage Actor-Critic (A3C)
- Asynchronous: uses multiple instances of environments and networks.
- Actor-Critic: uses both a policy and an estimate of the value function.
- Advantage: estimates how different the outcome was from what was expected.
Image by Arthur Juliani
Policy and Value Networks
Run Optimization
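In the same TF 1.x style, a sketch of a shared-trunk network with policy and value heads and an advantage-weighted loss, in the spirit of A3C (sizes are assumptions; the real A3C additionally uses an entropy bonus and asynchronous workers, omitted here):

```python
import tensorflow as tf

STATE_DIM, NUM_ACTIONS = 4, 2  # assumptions

states = tf.placeholder(tf.float32, [None, STATE_DIM])
actions = tf.placeholder(tf.int32, [None])
returns = tf.placeholder(tf.float32, [None])

trunk = tf.layers.dense(states, 64, activation=tf.nn.relu)  # shared layers
logits = tf.layers.dense(trunk, NUM_ACTIONS)                # policy head π(a | s)
value = tf.squeeze(tf.layers.dense(trunk, 1), axis=1)       # value head V(s)

advantage = returns - value  # how much better the outcome was than expected
neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=actions, logits=logits)
policy_loss = tf.reduce_mean(neg_log_prob * tf.stop_gradient(advantage))
value_loss = tf.reduce_mean(tf.square(advantage))
train_op = tf.train.AdamOptimizer(1e-3).minimize(policy_loss + 0.5 * value_loss)
```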
A3C Results on Atari Games
Mnih et al., 2016
Practical use cases
- Robotics
- Finance
- Industrial optimization
- Predictive assistant
Illia Polosukhin
XIX.ai
@ilblackdragon, illia@xix.ai
Questions?
Full code will be available soon at
https://github.com/ilblackdragon/tensorflow-rl/


Editor's Notes

  1. Let’s start by defining the problem we are trying to solve. ... Agents divide into model-based and model-free agents. A model-based agent tries to simulate the environment internally and makes decisions based on that simulation. A model-free agent just takes an observation and chooses an action. This is interesting because it is very close to how animals and people learn: from limited feedback given by the environment or a teacher, like animals getting positive reinforcement when developing reflexes, or children getting positive or negative reinforcement from their parents for their behaviour.
  2. Let’s review some theory around RL. The set of states and actions, together with the rules for transitioning from one state to another, makes up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards. One additional term: a sequence [(s, a), ..] is called a trajectory.
  3. Model-free means there is no MDP approximation or learning inside the agent. Observations are stored in a replay buffer and used as training data for the model. Off-policy means that learning the optimal policy is independent of the agent’s actions. Because acting greedily would be deterministic, we force the agent to explore by taking a random action with probability ε, where ε starts high and slowly decays as training progresses. For an Atari game there is a huge number of possible states (the number of colors raised to the number of pixels): e.g. a Breakout screen of 84x84 pixels with 256 colors gives at least 256^(84·84) states, so it would take far too long to even visit each state once. Instead, we approximate Q with a neural network, which can learn how to deal with states based on their similarity. Deep Q-Learning, popularized by DeepMind, was the first Deep RL model that worked.
  4. The expected return can be defined in a few ways. One way is to define it as the sum of the state-value function over states, each weighted by how likely we are to end up in that state under the current policy (this weighting is also called the stationary distribution). It can be estimated from observations (trajectories) as a sum, over trajectories, of the probability of the trajectory under the policy multiplied by the reward from that trajectory.
  5. Asynchronous: Unlike DQN, where a single agent represented by a single neural network interacts with a single environment, A3C utilizes multiple incarnations of the above in order to learn more efficiently. In A3C there is a global network, and multiple worker agents which each have their own set of network parameters. Each of these agents interacts with its own copy of the environment at the same time as the other agents are interacting with theirs. The reason this works better than having a single agent (beyond the speedup of getting more work done) is that the experience of each agent is independent of the experience of the others, so the overall experience available for training becomes more diverse. Actor-Critic: Actor-Critic combines the benefits of both value-based and policy-based approaches. In the case of A3C, the network estimates both a value function V(s) (how good a certain state is to be in) and a policy π(s) (a set of action probability outputs). These are separate fully-connected layers sitting at the top of the network. Critically, the agent uses the value estimate (the critic) to update the policy (the actor) more intelligently than traditional policy gradient methods. Advantage: The insight of using advantage estimates rather than just discounted returns is to allow the agent to determine not just how good its actions were, but how much better they turned out to be than expected.
  6. Mean and median human-normalized scores on 57 Atari games using the human starts evaluation metric. D-DQN - double DQN. A3C paper - https://arxiv.org/pdf/1602.01783.pdf