Deep Reinforcement Learning
from scratch
Jie-Han Chen
NetDB, National Cheng Kung University
1
Disclaimer
The content and images in these slides were borrowed from:
1. Rich Sutton’s textbook
2. David Silver’s Reinforcement Learning class at UCL
3. Sergey Levine’s Deep Reinforcement Learning class at UC Berkeley
4. Deep Reinforcement Learning and Control at CMU (CMU 10703)
2
Outline
3
1. Introduction to RL and MDP
2. Q-Learning
3. Deep Q Network
4. Discussion
Introduction - RL
4 (Figure from Sutton & Barto, RL textbook)
Reinforcement Learning vs. Supervised Learning
Supervised Learning:
Input data are independent; the current output does not affect the next input.
5
Reinforcement Learning vs. Supervised Learning
Reinforcement Learning:
The agent’s actions affect the data it will receive in the future. (from CMU 10703)
6 (Figure from Wikipedia, made by waldoalvarez)
When do we use
Reinforcement Learning?
7
If the problem can be modeled as MDP,
we can try RL to solve it!
8
Types of RL tasks
1. Episodic task: the task terminates after a finite number of steps.
eg: games, chess
2. Continuing task: the task never terminates.
9
Markov Decision Process
Defined by:
1. S: set of states
2. A: set of actions
3. R: reward model R(s)/ R(s, a)/ R(s, a, s’)
4. P: dynamics model and its transition probabilities
5. γ: the discount factor
10
Define agent-environment boundary
Before defining the set of states, we should define the boundary between the agent and the environment.
According to Richard Sutton’s textbook:
1. “The agent-environment boundary represents the limit of the agent’s absolute control, not of its knowledge.”
2. “The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment.”
11
Markov Property
● A state is Markov if and only if the next state depends only on the current state, not on the full history (see the equation below).
● A state should summarize past sensations so as to retain all “essential” information.
● We should be able to throw away the history once the state is known.
13 (from CMU 10703)
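The equation for the first bullet is not reproduced in this text export; in the standard notation of Sutton & Barto and Silver's lectures it reads:

```latex
% Markov property: the next state depends only on the current state,
% not on the whole history.
\mathbb{P}\left[ S_{t+1} \mid S_t \right] \;=\; \mathbb{P}\left[ S_{t+1} \mid S_1, S_2, \ldots, S_t \right]
```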
Define State (Observation)
14
Atari 2600: Space Invaders Go
Define Action
1. Discrete Action Space
2. Continuous Action Space
15 Atari 2600: Breakout / Robotic Arm
Markov Decision Process
Defined by:
1. S: set of states ✓
2. A: set of actions ✓
3. R: reward model R(s)/ R(s, a)/ R(s, a, s’)
4. P: dynamics model and its transition probabilities
5. γ: the discount factor
16
Define Rewards
Rewards specify WHAT the agent needs to achieve, NOT HOW to achieve it.
17
S: start state
G: Goal
Define Rewards
Rewards specify WHAT the agent needs to achieve, NOT HOW to achieve it.
18
Markov Decision Process
Defined by:
1. S: set of states ✓
2. A: set of actions ✓
3. R: reward model R(s)/ R(s, a)/ R(s, a, s’) ✓
4. P: dynamics model and its transition probabilities
5. γ: the discount factor
19
Markov Decision Process
Definition: A policy π is a distribution over actions given states (formalized below).
MDP policies depend only on the current state; they are stationary (time-independent).
20
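The slide's equation image is missing from this export; the standard form of the definition is:

```latex
% A policy maps each state to a distribution over actions.
\pi(a \mid s) \;=\; \mathbb{P}\left[ A_t = a \mid S_t = s \right]
```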
Markov Decision Process
The objective in RL is to maximize long-term future reward.
Definition: The return G_t is the total discounted reward from timestep t (see below).
In episodic tasks, we can also consider undiscounted future rewards.
21
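Written out in the usual notation (the slide's own equation is not in this export), the discounted return from timestep t is:

```latex
G_t \;=\; R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
     \;=\; \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
```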
Markov Decision Process
Definition: The state-value function v_π(s) of an MDP is the expected return starting from state s and then following policy π.
Definition: The action-value function q_π(s, a) is the expected return starting from state s, taking action a, and then following policy π (both formalized below).
22
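In standard notation (the slide's equation images are not in this export), the two definitions above are:

```latex
v_\pi(s)    \;=\; \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]
\qquad
q_\pi(s, a) \;=\; \mathbb{E}_\pi\left[ G_t \mid S_t = s,\; A_t = a \right]
```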
Bellman Expectation Equation
The state-value function can be decomposed into the immediate reward plus the discounted value of the successor state.
The action-value function can be decomposed similarly (see the equations below).
23
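The decompositions referred to above, written in their standard form:

```latex
v_\pi(s)    \;=\; \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]

q_\pi(s, a) \;=\; \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s,\; A_t = a \right]
```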
Optimal Value Functions
Definition: The optimal state-value function v_*(s) is the maximum state-value function over all policies.
Definition: The optimal action-value function q_*(s, a) is the maximum action-value function over all policies (see below).
24
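In symbols (the slide's own equations are not reproduced here):

```latex
v_*(s)    \;=\; \max_\pi \, v_\pi(s)
\qquad
q_*(s, a) \;=\; \max_\pi \, q_\pi(s, a)
```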
We can use a backup diagram to explain the relationship between the state-value and action-value functions, and how each can be computed from the other.
Backup Diagram
25
We can use a backup diagram to explain the relationship between the state-value and action-value functions, and how each can be computed from the other.
Backup Diagram
26
Backup Diagram
27
Backup Diagram
28
Optimal Policy
Define a partial ordering over policies: π ≥ π′ if v_π(s) ≥ v_π′(s) for all states s.
Theorem: For any Markov Decision Process
1. There exists an optimal policy π_* that is better than or equal to all other policies, π_* ≥ π for all π.
2. All optimal policies achieve the optimal state-value function, v_{π_*}(s) = v_*(s).
3. All optimal policies achieve the optimal action-value function, q_{π_*}(s, a) = q_*(s, a).
29
How to get Optimal Policies?
An optimal policy can be found by maximizing over q_*(s, a).
There is always a deterministic optimal policy for any MDP.
If we know q_*(s, a), we immediately have the optimal policy.
30
Optimal action-value function in MDP
31
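The equation on this slide (the Bellman optimality equation for the action-value function) is not reproduced in the export; its standard form is:

```latex
q_*(s, a) \;=\; \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s,\; A_t = a \right]
```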
Solving Markov Decision Process
● Find the optimal policy
● Prediction: for a given policy, estimate the value functions of states and state-action pairs.
● Control: estimate the value functions of states and state-action pairs for the optimal policy.
32
Solving the Bellman Optimality Equation
Solving the equation requires the following:
1. accurate knowledge of the environment’s dynamics
2. we have enough space and time to do the computation
3. the Markov Property
33
Markov Decision Process
Defined by:
1. S: set of states ✓
2. A: set of actions ✓
3. R: reward model R(s)/ R(s, a)/ R(s, a, s’) ✓
4. P: dynamics model and its transition probabilities
5. γ: the discount factor
34
Outline
35
1. Introduction to RL and MDP
2. Q-Learning
3. Deep Q Network
4. Discussion
The category of RL
36
Value-based: select actions according to a value function, trained with SGD on the Bellman error.
Policy-based: use SGD directly on the discounted expected return of a parameterized policy.
Model-based: learn a model of the environment from interaction, or simulate trajectories with the learned model. eg: Dyna, MCTS
The category of RL
Model-based method: learn the model of the MDP (transition probabilities and rewards) and try to solve the MDP concurrently.
Model-free method: learn how to act without explicitly learning the transition probabilities.
37
Model-free Reinforcement Learning
Using sample backups: learn from sampled transition experience.
38
Q-Learning
Proposed by Watkins, 1989
● A model-free algorithm
● Tabular method: uses a large table to store each action value Q(s, a)
● Learns from one-step experience (s, a, r, s’)
● Off-policy
● Online learning
Update Q table:
39
Q-Learning
Learning by sample: update Q toward a target (an estimate of the return), scaled by the step size α.
Bootstrapping: using the estimate of the return as the target to update the old value function.
40
Q-Learning
Update Q table (see the sketch below):
41
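The update shown on slides 39-41 is the standard one-step Q-learning rule, Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]. A minimal tabular sketch of that update (variable names are illustrative, not the author's code):

```python
from collections import defaultdict

NUM_ACTIONS = 4                     # example value; depends on the task
alpha, gamma = 0.1, 0.99            # step size and discount factor (example values)

# Q table: maps each state to a list of action values, initialized to zero.
Q = defaultdict(lambda: [0.0] * NUM_ACTIONS)

def q_learning_update(s, a, r, s_next, done):
    """One-step Q-learning update from a single transition (s, a, r, s')."""
    # Bootstrapped target: reward plus discounted value of the best next action.
    target = r if done else r + gamma * max(Q[s_next])
    # Move Q(s, a) a fraction alpha of the way toward the target (the TD error).
    Q[s][a] += alpha * (target - Q[s][a])
```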
Q-Learning vs. Optimal MDP
Q-Learning:
Optimal MDP:
42
Off-policy
Off-policy: if the agent learns its policy from experience that was generated by another policy (not its current policy), we call the algorithm off-policy.
Why Q-Learning is off-policy?
● given experience:
● update Q:
43
On-policy
The agent can only learn its policy from experience generated by its current policy. If the experience is not generated by the current policy, the learning process may not converge.
44
But there is still a problem
If we always act greedily (exploit the current Q values), most of the Q table won’t be updated, and we will find the learned policy is NOT OPTIMAL.
45
Exploration vs. Exploitation
Exploration: gather more information
Exploitation: make the best decision given current information
Q-Learning uses an ε-greedy strategy (see the sketch below):
● With probability 1 − ε, select argmax_a Q(s, a)
● With probability ε, select a random action.
46
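A minimal sketch of ε-greedy action selection over the tabular Q above (illustrative names only):

```python
import random

def epsilon_greedy(Q, s, epsilon, num_actions):
    """With probability epsilon explore; otherwise exploit the current Q table."""
    if random.random() < epsilon:
        return random.randrange(num_actions)                   # explore: random action
    return max(range(num_actions), key=lambda a: Q[s][a])      # exploit: argmax_a Q(s, a)
```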
Q-Learning Algorithm
47
Q-Learning Algorithm
The tabular method needs tremendous memory to store the action-value pairs; when facing a large or high-dimensional state space it suffers from the curse of dimensionality.
It can only be used in discrete-action tasks, because it selects the optimal action by maximizing Q(s, a) over all actions.
48
We need Function Approximator!
49
Function Approximator
There are many kinds of function approximator:
● Linear combination of features
● Neural networks
● Decision Tree
● Nearest neighbour
● Fourier/wavelet bases
● ...
50
Function Approximator
51
Deep Q Network
1. Proposed by V. Mnih, K. Kavukcuoglu, David Silver et al., DeepMind [1][2]
2. Uses a neural network as a non-linear function approximator
3. DQN = Q-Learning + Deep Network
4. Testbed: 49 Atari games
52
[1]V Mnih et al., Playing Atari with Deep Reinforcement Learning
[2]V Mnih et al., Human-level control through deep reinforcement learning (2015 Nature)
Deep Q Network - Define MDP
Is it an episodic task or a continuing task?
Is the action space discrete or continuous?
How to define state? Is it Markov?
How to define rewards?
53
Deep Q Network - Define MDP
1. The game is an episodic task
a. if the game has multiple lives, they treat losing a life as a terminal state during training.
2. The action space is discrete
3. They use a stack of frames as the state (4 frames here), because object motion cannot be detected from a single frame; a 1-frame state is not Markov.
4. Clip the rewards to [-1, 1]
a. limits the scale of the error derivatives
b. makes it easier to use the same learning rate across multiple games
54
Deep Q Network - State in details
1. The original screen size is 210x160x3 (RGB)
2. They transform the original screen to grayscale (210x160x1)
3. Resize the screen to 84x84 to train faster
4. Stack the 4 most recent frames together as the state (see the sketch below)
55
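A sketch of the preprocessing described above, assuming OpenCV and NumPy are available (the exact cropping/resizing details differ slightly between the 2013 and 2015 papers; this only follows the grayscale, 84x84, 4-frame-stack recipe on the slide):

```python
import collections
import cv2
import numpy as np

def preprocess(frame_rgb):
    """210x160x3 RGB frame -> 84x84 grayscale image."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)                 # 210x160
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)    # 84x84

frames = collections.deque(maxlen=4)    # the 4 most recent processed frames

def make_state(frame_rgb):
    """Stack the 4 most recent frames into one state of shape (4, 84, 84)."""
    frames.append(preprocess(frame_rgb))
    while len(frames) < 4:              # at episode start, repeat the first frame
        frames.append(frames[-1])
    return np.stack(frames, axis=0)
```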
Deep Q Network - Architecture
56
DQN !
Deep Q Network - Architecture (2013)
1. 2 convolutional layers
a. 16 filters, 8x8, stride 4
b. 32 filters, 4x4, stride 2
2. 2 fully connected layers
a. flatten, then 256 neurons
b. 256 to # of actions (output layer)
57
3. Without:
a. pooling
b. batch normalization
c. dropout
Deep Q Network - Architecture (2015)
1. 3 convolutional layers
a. 32 filters, 8x8, stride 4
b. 64 filters, 4x4, stride 2
c. 64 filters, 3x3, stride 1
2. 2 fully connected layers
a. flatten, then 512 neurons
b. 512 to # of actions (output layer)
58
3. Again without:
a. pooling
b. batch normalization
c. dropout
(A code sketch of this architecture follows.)
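A PyTorch sketch of the 2015 architecture listed above, assuming the input is 4 stacked 84x84 frames; the 64·7·7 flatten size follows from the filter sizes and strides on the slide:

```python
import torch.nn as nn

class DQN(nn.Module):
    """Nature-2015 DQN: 3 conv layers + 2 fully connected layers, no pooling/BN/dropout."""
    def __init__(self, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9  -> 7x7
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),                            # one Q-value per action
        )

    def forward(self, x):               # x: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.head(self.features(x))
```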
Deep Q Network - preliminary summary
Currently, we have:
1. a Markov Decision Process
2. a non-linear function approximator to estimate Q(s, a)
With these we can already run the agent (initially behaving randomly), but we want our agent to perform better and better.
59
Deep Q Network - Algorithm
In previous slides, we defined the optimal action-value function of an MDP (the Bellman optimality equation).
We can iteratively update the action values by value iteration: Q_{i+1}(s, a) ← E[ r + γ max_{a'} Q_i(s', a') ].
As i → ∞, Q_i → Q*, which means it converges.
60
Deep Q Network - Algorithm
However, because we estimate the action value with a non-linear function approximator, we cannot directly update the action values with the formula on the previous slide; that kind of update only works with a linear function approximator.
61
Deep Q Network - Algorithm
The good news: with a neural network, we can use Stochastic Gradient Descent (SGD) to approach Q* (an estimate, not an exact solution).
As in supervised learning, we model this as a regression problem, e.g. minimizing L_i(θ_i) = E[ (y_i − Q(s, a; θ_i))² ], where θ_i are the weights of the neural network at iteration i and y_i is the target.
62
Deep Q Network - Algorithm
Recap: in supervised learning, the target is fixed! No gradient flows through the fixed target. In Q-learning the target itself depends on the network weights.
How do we fix this?
63
Deep Q Network - Algorithm
Use a separate network to stabilize SGD:
● evaluation network: estimates the current action value
● target network: provides a fixed target
We initialize the target network with the same weights as the evaluation network.
The gradient of the loss function (see the sketch below):
64
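A sketch of the loss computation with evaluation and target networks, assuming the DQN module sketched earlier and PyTorch tensors sampled from a replay buffer (names are illustrative; the paper used an MSE loss with error clipping, and a Huber loss is substituted here as a common equivalent):

```python
import torch
import torch.nn.functional as F

def dqn_loss(eval_net, target_net, batch, gamma=0.99):
    """TD loss: the evaluation network is trained, the target network is held fixed."""
    s, a, r, s_next, done = batch                    # tensors from the replay buffer
    # Q(s, a) from the evaluation (online) network.
    q_sa = eval_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # Target: r + gamma * max_a' Q_target(s', a'); no gradient flows through it.
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next
    return F.smooth_l1_loss(q_sa, target)

# Periodically synchronize the target network with the evaluation network:
# target_net.load_state_dict(eval_net.state_dict())
```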
Deep Q Network - Algorithm
Update neural weights
65
Deep Q Network - Algorithm
We use online learning in DQN, just like Q-learning:
step 1: observe the environment and get an observation
step 2: take an action according to the current observation
step 3: update the neural network weights
66
This is called sampling: we sample experience (s, a, r, s’).
Wait, there still exists another problem!
67
Correlation
68
Deep Q Network - Algorithm
There still exists another problem: correlation between consecutive samples.
They use experience replay to solve it!
69
Deep Q Network - Algorithm
Experience replay: as the agent interacts with the environment under its behavior policy, it stores each transition experience (s, a, r, s’) in a replay buffer.
When learning with SGD, the agent samples mini-batches of experience from the replay buffer and learns batch by batch (see the sketch below).
70
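A minimal replay buffer sketch (capacity and batch size are illustrative values):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions (s, a, r, s', done) and samples de-correlated mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)          # oldest transitions are evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        return [list(column) for column in zip(*batch)]   # lists of s, a, r, s', done

    def __len__(self):
        return len(self.buffer)
```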
71
Experiment settings
SGD optimizer: RMSProp
Learning rate: 2.5e-4 (0.00025)
batch size: 32
Loss function: MSE Loss, clip loss within [-1, 1]
Decay epsilon (exploration rate) from 1.0 to 0.1 in 1M steps
72
Deep Q Network - Result
The human performance is the average reward
achieved from around 20 episodes of each game
lasting a maximum of 5 min each, following around
2 h of practice playing each game.
73
You can see the figure at p.3:
https://storage.googleapis.com/deepmind-media/dqn/DQNNat
urePaper.pdf
Experiments on Space Invaders (Atari2600)
74
Space Invaders
1. We have 3 lives (episodic task)
2. We also have 3 shields
3. We need to beat all the invaders
4. The bullets blink at some frequency
75
Huber Loss
76
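The Huber loss used in the following experiments (its equation is not in this export) has the standard definition below, with δ usually set to 1 for DQN; PyTorch's SmoothL1Loss is the δ = 1 case.

```latex
L_\delta(x) \;=\;
\begin{cases}
  \frac{1}{2}\, x^2 & \text{if } |x| \le \delta \\
  \delta \left( |x| - \frac{1}{2}\,\delta \right) & \text{otherwise}
\end{cases}
```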
DQN: Huber Loss + Adam
Learning rate: 2.5e-4
77
DQN: MSE clamp loss + RMSProp
Learning rate: 2.5e-4
Total steps: 1e+7
78
DQN: MSE clamp loss + RMSProp
79
DQN: Paper settings
● MSE Loss, clamp loss within [-1, 1]
● using RMSProp as optimizer, LR=2.5e-4
80
750 !!
DQN: Huber Loss (without clamp loss) + RMSProp
81
DQN: MSE clamp loss + Adam
82
DQN: Huber Loss + Adam
83
The content not covered in these slides
The proof of convergence for linear and non-linear function approximators; you can find it in Rich Sutton’s textbook, Ch. 9 - Ch. 11.
84
