7. Markov Decision Process
MDP < S, A, P, R, 𝛾 >
- S: set of states
- A: set of actions
- P(s, a, s’): probability of transitioning from state s to state s’ after action a
- R(s): reward function
- 𝛾: discount factor
Trace: {<s0,a0,r0>, …, <sn,an,rn>}
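The tuple and trace above can be sketched in code. A minimal toy MDP with hypothetical states and actions (not from the lecture), sampling a trace of ⟨s, a, r⟩ triples:

```python
import random

# Toy MDP sketch: P maps (s, a) -> list of (next_state, probability),
# R maps each state to its reward. All names here are illustrative.
P = {
    ("s0", "go"): [("s1", 0.8), ("s0", 0.2)],
    ("s1", "go"): [("s1", 1.0)],
}
R = {"s0": 0.0, "s1": 1.0}

def sample_trace(s, policy, steps, rng=random.Random(0)):
    """Roll out <s, a, r> triples, as in the trace {<s0,a0,r0>, ...}."""
    trace = []
    for _ in range(steps):
        a = policy(s)
        trace.append((s, a, R[s]))
        next_states, probs = zip(*P[(s, a)])
        s = rng.choices(next_states, weights=probs)[0]
    return trace

trace = sample_trace("s0", lambda s: "go", steps=3)
```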
8. Definitions
- Return: total discounted reward: G = r0 + 𝛾r1 + 𝛾²r2 + …
- Policy: Agent’s behavior
- Deterministic policy: π(s) = a
- Stochastic policy: π(a | s) = P[At = a | St = s]
- Value function: Expected return starting from state s:
- State-value function: Vπ(s) = Eπ[R | St = s]
- Action-value function: Qπ(s, a) = Eπ[R | St = s, At = a]
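The return definition above can be computed directly from a reward sequence. A minimal sketch (reward values are illustrative):

```python
# Discounted return: G = r0 + gamma*r1 + gamma^2*r2 + ...
# Accumulating from the back avoids explicit powers of gamma.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```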
9. Deep Q Learning
- Model-free, off-policy technique to learn optimal Q(s, a):
- Q_{i+1}(s, a) ← Q_i(s, a) + 𝛼(R + 𝛾 max_{a'} Q_i(s', a') − Q_i(s, a))
- Optimal policy then: π(s) = argmax_{a'} Q(s, a')
- Requires exploration (ε-greedy) to visit a variety of transitions from the states.
- Take a random action with probability ε; start ε high and decay it to a low value as training progresses.
- Deep Q Learning: approximate Q(s, a) with neural network: Q(s, a, 𝜃)
- Do stochastic gradient descent on the loss L(𝜃) = (R + 𝛾 max_{a'} Q(s', a', 𝜃) − Q(s, a, 𝜃))²
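Before approximating Q with a network, the update rule above can be instantiated in tabular form. A minimal sketch with hypothetical states and actions:

```python
import collections

# Tabular Q-learning update, matching:
# Q(s,a) <- Q(s,a) + alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))
Q = collections.defaultdict(float)  # unseen (s, a) pairs default to 0.0
alpha, gamma = 0.5, 0.9

def q_update(s, a, r, s_next, actions):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# One observed transition (illustrative): reward 1.0 going s0 -> s1.
q_update("s0", "right", 1.0, "s1", actions=["left", "right"])
```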
12. Monitored Session
- Handles pitfalls of distributed training.
- Saving and restoring checkpoints.
- Hooks are a general interface for injecting computation into the TensorFlow training loop.
15. Policy Gradient
- Given a policy π_𝜃(a | s), find 𝜃 that maximizes the expected return:
  J(𝜃) = ∑_s d^π(s) V(s)
- In Deep RL, we approximate π 𝜃(a | s) with neural network.
- Usually with softmax layer on top to estimate probabilities of each action.
- We can estimate J(𝜃) from samples of observed behavior: ∑_{k=0..T} p_𝜃(𝜏_k | π) R(𝜏_k)
- Do stochastic gradient ascent using the update:
  𝜃_{i+1} = 𝜃_i + 𝛼 (1/T) ∑_{k=0..T} ∇ log p_𝜃(𝜏_k | π) R(𝜏_k)
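The update above can be sketched for the simplest case: a softmax policy over discrete actions whose logits are the parameters 𝜃 directly (a bandit-style simplification that ignores the state; all values are illustrative). For a softmax policy, ∇ log π(a) = onehot(a) − softmax(𝜃).

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(theta, trajectories, alpha=0.1):
    """theta <- theta + alpha * (1/T) * sum_k grad log p(tau_k) * R(tau_k)."""
    grad = np.zeros_like(theta)
    for actions, ret in trajectories:
        for a in actions:
            g = -softmax(theta)   # grad log pi(a) = onehot(a) - softmax(theta)
            g[a] += 1.0
            grad += g * ret
    return theta + alpha * grad / len(trajectories)

theta = np.zeros(2)
# One trajectory that took action 1 and received return 1.0:
theta = reinforce_step(theta, [([1], 1.0)])
```

After the step, the logit of the rewarded action rises, making it more probable.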
18. Async Advantage Actor-Critic (A3C)
- Asynchronous: uses multiple instances of environments and networks.
- Actor-Critic: uses both a policy and an estimate of the value function.
- Advantage: estimates how much the outcome differed from what was expected.
Image by Arthur Juliani
Let’s start by defining a problem that we are trying to solve.
...
Agents divide into model-based and model-free agents.
Model-based agents try to simulate the environment internally and make decisions based on that simulation.
Model-free agents just take an observation and choose an action.
This is interesting because it is very close to how animals and people learn: from limited feedback provided by the environment or a teacher. Animals receive positive reinforcement when developing reflexes, and children receive positive or negative reinforcement from parents for their behaviour.
Let’s review some theory around RL.
The set of states and actions, together with rules for transitioning from one state to another, make up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards.
One more term: a sequence [(s, a), ..] is called a trajectory.
Model free - meaning there is no MDP approximation or learning inside the agent.
Observations are stored into replay buffers and used as training data for the model.
Off-policy means that learning the optimal policy is independent of the actions the agent actually takes.
Because a greedy policy would be deterministic, we force the agent to explore by taking a random action with probability ε, where ε starts high and slowly decays as training progresses.
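A minimal linear decay schedule for ε, with illustrative constants:

```python
# Linear epsilon decay: start exploratory, end mostly greedy.
def epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10000):
    frac = min(step / decay_steps, 1.0)  # clamp so epsilon never goes below eps_end
    return eps_start + frac * (eps_end - eps_start)
```

At step 0 the agent acts randomly with probability 1.0; past `decay_steps` it explores only 5% of the time.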
For example, an Atari game has an enormous number of possible states (the number of colors raised to the number of pixels).
E.g. the Breakout game on an 84x84-pixel screen with 256 colors has at least 256^(84x84) states.
Visiting each state even once would take far too long, so we approximate Q with a neural network that can learn to handle states based on their similarity.
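To see how hopeless enumerating states is, we can count the decimal digits of 256^(84x84) without ever computing the number itself:

```python
import math

# digits of N = floor(log10(N)) + 1; log10(256**7056) = 7056 * log10(256)
digits = int(84 * 84 * math.log10(256)) + 1  # about 17 thousand digits
```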
Deep Q Learning - popularized by DeepMind - was the first Deep RL model that worked.
Expected return can be defined in a few ways.
One way is to define it as the sum of the state-value function over states, each weighted by how likely we are to end up in that state under the current policy (also called the stationary distribution).
This can be estimated from observations - trajectories - as a sum of the probability of each trajectory under the policy multiplied by the reward from that trajectory.
Asynchronous: Unlike DQN, where a single agent represented by a single neural network interacts with a single environment, A3C utilizes multiple incarnations of the above in order to learn more efficiently. In A3C there is a global network, and multiple worker agents which each have their own set of network parameters. Each of these agents interacts with its own copy of the environment at the same time as the other agents interact with theirs. The reason this works better than having a single agent (beyond the speedup of getting more work done) is that the experience of each agent is independent of the experience of the others. In this way the overall experience available for training becomes more diverse.
Actor-Critic: Actor-Critic combines the benefits of both approaches. In the case of A3C, our network will estimate both a value function V(s) (how good a certain state is to be in) and a policy π(s) (a set of action probability outputs). These will each be separate fully-connected layers sitting at the top of the network. Critically, the agent uses the value estimate (the critic) to update the policy (the actor) more intelligently than traditional policy gradient methods.
The insight of using advantage estimates rather than just discounted returns is to allow the agent to determine not just how good its actions were, but how much better they turned out to be than expected.
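The advantage idea above can be sketched directly: compare the discounted return actually observed from each step with the critic's value estimate for that step. All numbers here are illustrative:

```python
# Advantage estimate: A_t = G_t - V(s_t), where G_t is the discounted
# return from step t and V(s_t) is the critic's prediction.
def advantages(rewards, values, gamma):
    g, out = 0.0, []
    for r, v in zip(reversed(rewards), reversed(values)):
        g = r + gamma * g  # discounted return from this step onward
        out.append(g - v)
    return list(reversed(out))

# Critic predicted 0.5 at both steps; actual returns were higher.
adv = advantages([1.0, 1.0], [0.5, 0.5], gamma=0.9)
```

Positive advantages mean the actions turned out better than the critic expected, so the policy update reinforces them.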
Mean and median human-normalized scores on 57 Atari games using the human starts evaluation metric.
D-DQN - double DQN.
A3C paper - https://arxiv.org/pdf/1602.01783.pdf