2. Disclaimer
The content and images in these slides were borrowed from:
1. Rich Sutton's textbook
2. David Silver's Reinforcement Learning class at UCL
3. Sergey Levine's Deep Reinforcement Learning class at UC Berkeley
4. Deep Reinforcement Learning and Control at CMU (CMU 10703)
5. Reinforcement Learning vs. Supervised Learning
Supervised Learning:
Input data points are independent; the current output does not affect the next input.
6. Reinforcement Learning vs. Supervised Learning
Reinforcement Learning:
The agent's actions affect the data it will receive in the future. (from CMU 10703)
Figure from Wikipedia, made by waldoalvarez
8. If the problem can be modeled as an MDP, we can try RL to solve it!
9. Types of RL tasks
1. Episodic task: the task terminates after a number of steps. e.g., games, chess
2. Continuing task: the task never terminates.
10. Markov Decision Process
Defined by:
1. S: set of states
2. A: set of actions
3. R: reward model R(s) / R(s, a) / R(s, a, s')
4. P: dynamics model and its transition probabilities
5. γ: the discount factor
11. Define the agent-environment boundary
Before defining the set of states, we should define the boundary between agent and environment.
According to Richard Sutton's textbook:
1. "The agent-environment boundary represents the limit of the agent's absolute control, not of its knowledge."
2. "The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment."
13. Markov Property
● A state S_t is Markov if and only if
P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
● A state should summarize past sensations so as to retain all "essential" information.
● We should be able to throw away the history once the state is known.
(from CMU 10703)
15. Define Action
1. Discrete action space (e.g., Atari 2600: Breakout)
2. Continuous action space (e.g., a robotic arm)
16. Markov Decision Process
Defined by:
1. S: set of states ✓
2. A: set of actions ✓
3. R: reward model R(s) / R(s, a) / R(s, a, s')
4. P: dynamics model and its transition probabilities
5. γ: the discount factor
19. Markov Decision Process
Defined by:
1. S: set of states ✓
2. A: set of actions ✓
3. R: reward model R(s) / R(s, a) / R(s, a, s') ✓
4. P: dynamics model and its transition probabilities
5. γ: the discount factor
20. Markov Decision Process
Definition: A policy π is a distribution over actions given states,
π(a|s) = P[A_t = a | S_t = s]
MDP policies depend only on the current state; they are stationary (time-independent).
21. Markov Decision Process
The objective in RL is to maximize long-term future reward.
Definition: The return G_t is the total discounted reward from timestep t,
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}
In episodic tasks, we can also consider undiscounted future rewards (γ = 1).
22. Markov Decision Process
Definition: The state-value function v_π(s) of an MDP is the expected return starting from state s and then following policy π,
v_π(s) = E_π[G_t | S_t = s]
Definition: The action-value function q_π(s, a) is the expected return starting from state s, taking action a, and then following policy π,
q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
23. Bellman Expectation Equation
The state-value function can be decomposed into the immediate reward plus the discounted value of the successor state,
v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]
The action-value function can similarly be decomposed,
q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
24. Optimal Value Functions
Definition: The optimal state-value function v_*(s) is the maximum value function over all policies,
v_*(s) = max_π v_π(s)
Definition: The optimal action-value function q_*(s, a) is the maximum action-value function over all policies,
q_*(s, a) = max_π q_π(s, a)
25. Backup Diagram
We can use a backup diagram to explain the relationship between v_π(s) and q_π(s, a), and how each can be updated from the other.
29. Optimal Policy
Define a partial ordering over policies: π ≥ π' if v_π(s) ≥ v_π'(s) for all s
Theorem: For any Markov Decision Process,
1. There exists an optimal policy π_* that is better than or equal to all other policies, π_* ≥ π, ∀π
2. All optimal policies achieve the optimal value function, v_{π_*}(s) = v_*(s)
3. All optimal policies achieve the optimal action-value function, q_{π_*}(s, a) = q_*(s, a)
30. How to get Optimal Policies?
An optimal policy can be found by maximizing over q_*(s, a):
π_*(a|s) = 1 if a = argmax_{a'∈A} q_*(s, a'), and 0 otherwise
There is always a deterministic optimal policy for any MDP.
If we know q_*(s, a), we immediately have the optimal policy.
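As a concrete illustration, here is a minimal sketch of extracting the greedy (deterministic) policy from a tabular action-value function; the dictionary layout of Q is an assumption made for illustration:

```python
# Q: dict mapping (state, action) -> estimated action-value (illustrative layout)
def greedy_policy(Q, state, actions):
    """Return the action that maximizes the estimated action-value in `state`."""
    return max(actions, key=lambda a: Q[(state, a)])
```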
32. Solving a Markov Decision Process
● Find the optimal policy
● Prediction: for a given policy, estimate the value functions of states and state-action pairs.
● Control: estimate the value functions of states and state-action pairs for the optimal policy.
33. Solving the Bellman Optimality Equation
Solving the equation requires the following:
1. accurate knowledge of the environment's dynamics
2. enough space and time to do the computation
3. the Markov property
34. Markov Decision Process
Defined by:
1. S: set of states ✓
2. A: set of actions ✓
3. R: reward model R(s) / R(s, a) / R(s, a, s') ✓
4. P: dynamics model and its transition probabilities
5. γ: the discount factor
36. The categories of RL
Value-based: select actions according to a value function; SGD on the Bellman error.
Policy-based: apply SGD directly to the discounted expected return of a parameterized policy.
Model-based: learn the environment model from interaction with the environment, or simulate trajectories to estimate it. e.g., Dyna, MCTS
37. The categories of RL
Model-based methods: learn the model of the MDP (transition probabilities and rewards) and try to solve the MDP concurrently.
Model-free methods: learn how to act without explicitly learning the transition probabilities.
39. Q-Learning
Proposed by Watkins, 1989
● A model-free algorithm
● Tabular method: uses a large table to store each action-value Q(s, a)
● Learns from one-step experience (s, a, r, s')
● Off-policy
● Online learning
Update of the Q table:
Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
40. Q-Learning
Learning by sample (s, a, r, s'); update Q:
Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
where α is the step size and r + γ max_{a'} Q(s', a') is the target, an estimate of the return.
Bootstrapping: using an estimate of the return as the target to update the old value function.
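A minimal sketch of this tabular update in Python; the defaultdict Q-table layout and the fixed hyperparameters are assumptions for illustration:

```python
from collections import defaultdict

# Q-table that defaults every unseen (state, action) pair to 0.0
Q = defaultdict(float)

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One-step Q-learning update: move Q(s, a) toward the bootstrapped target."""
    target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```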
43. Off-policy
Off-policy: if the agent learns its policy from experience that was generated by another policy (not the current policy), we call the algorithm off-policy.
Why is Q-Learning off-policy?
● given experience (s, a, r, s') collected by any behavior policy (e.g., ε-greedy),
● the update Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)] bootstraps from the greedy action max_{a'} Q(s', a'), regardless of which action the behavior policy actually takes next.
44. On-policy
The agent can only learn the policy from experience that was generated by the current policy. If the experience is not generated by the current policy, the learning process won't converge.
45. But there is still a problem
If we always follow the current greedy (seemingly optimal) policy, most of the Q table won't be updated, and the policy we find will be NOT OPTIMAL.
46. Exploration vs. Exploitation
Exploration: gather more information.
Exploitation: make the best decision given current information.
Q-Learning uses an ε-greedy strategy:
● With probability 1 − ε, select a = argmax_a Q(s, a)
● With probability ε, select a random action.
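A minimal ε-greedy action-selection sketch, reusing the illustrative Q-table layout from the earlier snippets:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore randomly; otherwise exploit the current Q estimates."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```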
48. Q-Learning Algorithm
The tabular method needs tremendous memory to store the action-value pairs; when facing a large or high-dimensional state space it suffers from the curse of dimensionality.
It can only be used in discrete-action tasks, because it selects the optimal action by argmax_a Q(s, a).
50. Function Approximator
There are many kinds of function approximators:
● Linear combinations of features
● Neural networks
● Decision trees
● Nearest neighbour
● Fourier/wavelet bases
● ...
52. Deep Q Network
1. Proposed by V. Mnih, K. Kavukcuoglu, David Silver et al., DeepMind [1][2]
2. Uses a neural network as a non-linear function approximator
3. DQN = Q-Learning + deep network
4. Testbed: 49 Atari games
[1] V. Mnih et al., Playing Atari with Deep Reinforcement Learning
[2] V. Mnih et al., Human-level control through deep reinforcement learning (Nature, 2015)
53. Deep Q Network - Define MDP
Is it an episodic or a continuing task?
Is the action space discrete or continuous?
How do we define the state? Is it Markov?
How do we define the rewards?
54. Deep Q Network - Define MDP
1. The game is an episodic task
a. If there are multiple lives per game, they mark a terminal state whenever a life is lost.
2. The action space is discrete
3. They use multiple frames as the state (4 frames here), because object motion cannot be detected from a single frame; a 1-frame state is not Markov.
4. Clip the rewards to [-1, 1]
a. limits the scale of the error derivatives
b. makes it easier to use the same learning rate across multiple games
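For concreteness, the reward clipping step is a one-liner (a sketch, not the authors' exact code):

```python
def clip_reward(r):
    """Clip the raw game reward to [-1, 1], as described above."""
    return max(-1.0, min(1.0, float(r)))
```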
55. Deep Q Network - State in detail
1. The original screen size is 210x160x3 (RGB)
2. They transform the original screen into grayscale (210x160x1)
3. Resize the screen to 84x84 to train faster
4. Stack the 4 most recent frames together as the state
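A minimal preprocessing sketch following these steps, using OpenCV and NumPy (assumed dependencies, not the authors' original pipeline):

```python
from collections import deque

import cv2
import numpy as np

def preprocess_frame(rgb_frame):
    """Convert a 210x160x3 RGB frame to an 84x84 grayscale image."""
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)                  # 210x160
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)     # 84x84

# Keep the 4 most recent processed frames; stacking them forms the state.
frames = deque(maxlen=4)

def make_state(frame_deque):
    """Stack the last 4 frames into a single 4x84x84 state array."""
    return np.stack(frame_deque, axis=0)
```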
57. Deep Q Network - Architecture (2013)
1. 2 convolutional layers
a. 16 filters, 8x8 each, stride 4
b. 32 filters, 4x4 each, stride 2
2. 2 fully connected layers
a. flatten to 256 neurons
b. 256 to # of actions (output layer)
3. Without:
a. pooling
b. batch normalization
c. dropout
58. Deep Q Network - Architecture (2015)
1. 3 convolutional layers
a. 32 filters, 8x8 each, stride 4
b. 64 filters, 4x4 each, stride 2
c. 64 filters, 3x3 each, stride 1
2. 2 fully connected layers
a. flatten to 512 neurons
b. 512 to # of actions (output layer)
3. Again without:
a. pooling
b. batch normalization
c. dropout
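A sketch of the 2015 architecture in PyTorch; the framework choice and layer naming are mine, not the authors' (the original was not implemented in PyTorch):

```python
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network mapping a 4x84x84 stacked-frame state to one Q-value per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),                            # 64 * 7 * 7 = 3136 features
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),               # one Q-value per discrete action
        )

    def forward(self, x):
        return self.head(self.features(x))
```

Note that, as the slide says, there is no pooling, batch normalization, or dropout.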
59. Deep Q Network - preliminary summary
Currently, we have:
1. a Markov Decision Process
2. a non-linear function approximator to estimate Q(s, a)
With these we can already apply random control.
But we want our agent to perform better and better.
60. Deep Q Network - Algorithm
In previous slides, we defined the optimal action-value function of an MDP, which satisfies:
Q_*(s, a) = E[r + γ max_{a'} Q_*(s', a') | s, a]
We can iteratively update the action-value by:
Q_{i+1}(s, a) = E[r + γ max_{a'} Q_i(s', a') | s, a]
and Q_i → Q_* as i → ∞, which means it converges.
61. Deep Q Network - Algorithm
However, because we estimate the action-value with a non-linear function approximator, we cannot directly update the action-value by the formula on the previous slide.
It only works with a linear function approximator.
62. Deep Q Network - Algorithm
The good news: with a neural network, we can use Stochastic Gradient Descent (SGD) to approach Q* (an estimate, not an exact equality).
In supervised learning, we often model this as a regression problem, e.g.:
L_i(θ_i) = E[(y_i − Q(s, a; θ_i))²]
where θ_i are the weights of the neural network at iteration i and y_i = r + γ max_{a'} Q(s', a'; θ_{i−1}) is the target.
63. Deep Q Network - Algorithm
Recap: in supervised learning with a neural network, the target is fixed! A fixed target receives no gradient.
Here, however, the target is computed from the same network we are training, so it moves with every update. How do we fix it?
64. Deep Q Network - Algorithm
Use a separate network to fix the target for SGD:
● evaluation network: estimates the current action-value
● target network: serves as a fixed target.
We initialize the target network with the same weights as the evaluation network.
The gradient of the loss function:
∇_{θ_i} L_i(θ_i) = E[(r + γ max_{a'} Q(s', a'; θ⁻) − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i)]
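A sketch of computing this loss on a sampled batch with an evaluation network and a frozen target network, in PyTorch; the batch layout and the use of MSE follow the settings quoted later but are still illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def dqn_loss(eval_net, target_net, batch, gamma=0.99):
    """TD loss: the evaluation network is trained, the target network is held fixed."""
    s, a, r, s_next, done = batch                              # tensors: states, long actions, rewards, next states, float 0/1 done flags
    q = eval_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; theta_i)
    with torch.no_grad():                                      # no gradient flows into the target
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next             # fixed bootstrapped target
    return F.mse_loss(q, target)
```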
66. Deep Q Network - Algorithm
We use online learning in DQN, just like Q-learning:
step 1: observe the environment, get the observation
step 2: take an action according to the current observation
step 3: update the neural network weights
This is called sampling: we sample the experience (s, a, r, s').
69. Deep Q Network - Algorithm
There is still another problem: consecutive samples are highly correlated.
They use experience replay to solve it!
70. Deep Q Network - Algorithm
Experience replay: as the agent interacts with the environment under its behavior policy, it stores each transition experience (s, a, r, s') in a replay buffer.
When learning with SGD, the agent samples mini-batches of experience from the replay buffer and learns batch by batch.
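A minimal replay buffer sketch: a deque with uniform random sampling, an assumption consistent with the uniform replay described here:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s', done) transitions with uniform sampling."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)     # oldest transitions are dropped first

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)
```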
72. Experiment settings
SGD optimizer: RMSProp
Learning rate: 2.5e-4 (0.00025)
Batch size: 32
Loss function: MSE loss, with the error clipped to [-1, 1]
Decay ε (the exploration rate) linearly from 1.0 to 0.1 over the first 1M steps
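These settings as a small PyTorch sketch; only the values quoted on the slide are set, and the helper names are assumptions:

```python
import torch

BATCH_SIZE = 32

def make_optimizer(eval_net):
    """RMSProp with the quoted learning rate; other RMSProp constants left at PyTorch defaults."""
    return torch.optim.RMSprop(eval_net.parameters(), lr=2.5e-4)

def epsilon_at(step, start=1.0, end=0.1, decay_steps=1_000_000):
    """Linearly anneal the exploration rate from 1.0 to 0.1 over the first 1M steps."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```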
73. Deep Q Network - Result
The human performance is the average reward achieved over around 20 episodes of each game, each lasting a maximum of 5 minutes, following around 2 hours of practice on each game.
You can see the figure at p. 3:
https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
75. Space Invaders
1. We have 3 lives (episodic task)
2. We also have 3 shields
3. We need to beat all the invaders
4. The bullets blink at some frequency
84. Content not covered in these slides
The proof of convergence for linear and non-linear function approximators; you can find it in Rich Sutton's textbook, Ch. 9 - Ch. 11.