Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
2. What is Reinforcement Learning?
- Much like how biological agents learn by interaction
- No supervisor, only a reward
- Data is time-dependent (non-i.i.d.)
- Feedback is delayed
- Agent actions affect the data it receives
3. Examples
- Play checkers (1959)
- Defeat the world champion at Backgammon (1992)
- Control a helicopter (2008)
- Make a robot walk
- Robocup Soccer
- Play ATARI games better than humans (2014)
- Defeat the world champion at Go (2016)
4. Reward Hypothesis
All goals can be described by the maximisation of expected cumulative reward
- Defeat the world champion at Go: +R / -R for winning/losing a game
- Make a robot walk: +R for forward motion, -R for falling over
- Play ATARI games: +R / -R for increasing/decreasing score
- Control a helicopter: +R / -R for following the trajectory / crashing
6. Fully Observable Environments
Fully Observable Environments (agent state = environment state):
- Agent directly observes environment
- Example: chess board
Partially Observable Environments (agent state ≠ environment state):
- Agent indirectly observes environment
- Example: a robot with a motion sensor or camera
- Agent must construct its own state representation (e.g. from recent observations, as sketched below)
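One common way to build an agent state under partial observability is to stack the last k raw observations into a single state, as in the ATARI work. The sketch below is illustrative, not from the slides; the class name and parameters are my own.

```python
from collections import deque

import numpy as np

class StackedState:
    """Keeps the k most recent observations and exposes them as one agent state."""

    def __init__(self, k: int):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, obs: np.ndarray) -> np.ndarray:
        # Fill the buffer with the first observation so the state has a fixed shape.
        for _ in range(self.k):
            self.frames.append(obs)
        return np.stack(self.frames)

    def step(self, obs: np.ndarray) -> np.ndarray:
        # Drop the oldest observation, append the newest, and return the stack.
        self.frames.append(obs)
        return np.stack(self.frames)
```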
7. RL components: Policy and Value Function
Policy is the agent's behaviour function
- Maps from state to action
- Deterministic policy: $a = \pi(s)$
- Stochastic policy: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
Value function is a prediction of future reward
- Used to evaluate states and to select between actions
- $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$
A toy policy example is sketched below.
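A minimal sketch contrasting the two kinds of policy on a made-up 2-state, 2-action problem; the tables `pi_det` and `pi_stoch` are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Deterministic policy: a table mapping state -> action, i.e. a = pi(s).
pi_det = {0: 1, 1: 0}

# Stochastic policy: pi(a|s) as a row-stochastic matrix, one row per state.
pi_stoch = np.array([[0.9, 0.1],   # in state 0: P(a=0)=0.9, P(a=1)=0.1
                     [0.2, 0.8]])  # in state 1: P(a=0)=0.2, P(a=1)=0.8

def act_deterministic(s: int) -> int:
    return pi_det[s]

def act_stochastic(s: int) -> int:
    return int(rng.choice(2, p=pi_stoch[s]))

print(act_deterministic(0), act_stochastic(0))
```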
12. Math: Markov Decision Process (MDP)
Almost all RL problems can be formalised as MDPs
An MDP is a tuple $\langle S, A, P, R, \gamma \rangle$:
- $S$ is a finite set of states
- $A$ is a finite set of actions
- $P$ is a state transition probability matrix: $P_{ss'}^a = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
- $R$ is a reward function: $R_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
- $\gamma \in [0, 1]$ is a discount factor
A toy MDP written out as plain arrays is sketched below.
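A minimal sketch of an MDP $\langle S, A, P, R, \gamma \rangle$ as plain arrays; the 2-state, 2-action numbers are made up purely for illustration.

```python
import numpy as np

n_states, n_actions = 2, 2

# P[a, s, s'] = probability of moving to s' when taking action a in state s.
P = np.array([
    [[0.8, 0.2],    # action 0, from states 0 and 1
     [0.1, 0.9]],
    [[0.5, 0.5],    # action 1, from states 0 and 1
     [0.3, 0.7]],
])

# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

gamma = 0.9  # discount factor in [0, 1]

# Sanity check: each transition distribution sums to 1.
assert np.allclose(P.sum(axis=-1), 1.0)
```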
13. State-Value and Action-Value functions, Bellman eq.
The return is the discounted sum of rewards: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$
State-value function, the expected return starting from state s and then following policy $\pi$: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
Action-value function, the expected return starting from state s, taking action a, and then following policy $\pi$: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
Bellman expectation equation: $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$ (a policy-evaluation sketch follows below)
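A hedged sketch of iterative policy evaluation: compute $v_\pi$ by repeatedly applying the Bellman expectation equation. The arrays `P`, `R`, `gamma`, and the uniform random policy `pi` are the same illustrative toy numbers as above, not from the slides.

```python
import numpy as np

P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[a, s, s']
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[s, a]
gamma = 0.9
pi = np.full((2, 2), 0.5)                  # pi[s, a]: uniform random policy

v = np.zeros(2)
for _ in range(500):
    # q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) * v(s')
    q = R + gamma * np.einsum("asn,n->sa", P, v)
    # v(s) = sum_a pi(a|s) * q(s, a)
    v_new = (pi * q).sum(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:  # stop once the backup converges
        break
    v = v_new
print("v_pi:", v)
```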
14. Finding an Optimal Policy
- There is always an optimal policy for any MDP
- All optimal policies achieve the optimal value function $v_*(s)$
- All optimal policies achieve the optimal action-value function $q_*(s, a)$
All you need is to find $q_*(s, a)$: acting greedily, $\pi_*(s) = \arg\max_a q_*(s, a)$, is then optimal (a small example follows below)
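A small sketch of extracting a greedy policy from $q_*(s, a)$, assuming `q_star` is a `[n_states, n_actions]` table; the numbers are hypothetical.

```python
import numpy as np

q_star = np.array([[1.0, 3.0],
                   [2.5, 0.5]])   # hypothetical optimal action-values

# pi*(s) = argmax_a q*(s, a): pick the best action in each state.
pi_star = np.argmax(q_star, axis=1)
print("optimal action per state:", pi_star)  # -> [1 0]
```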
15. Bellman Opt Equation for state-value function
$v_*(s) = \max_a q_*(s, a)$
[David Silver. Advanced Topics: RL]
16. Bellman Opt Equation for action-value function
$q_*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_*(s')$
[David Silver. Advanced Topics: RL]
17. Bellman Opt Equation for state-value function
$v_*(s) = \max_a \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_*(s') \right)$
[David Silver. Advanced Topics: RL]
18. Bellman Opt Equation for action-value function
$q_*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \max_{a'} q_*(s', a')$
[David Silver. Advanced Topics: RL]
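A hedged sketch of value iteration, which iterates the Bellman optimality backup $v(s) \leftarrow \max_a (R_s^a + \gamma \sum_{s'} P_{ss'}^a v(s'))$. The arrays `P`, `R`, `gamma` are the same illustrative toy MDP as above, not from the slides.

```python
import numpy as np

P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[a, s, s']
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[s, a]
gamma = 0.9

v = np.zeros(2)
for _ in range(500):
    # q[s, a] = R(s, a) + gamma * expected value of the successor state
    q = R + gamma * np.einsum("asn,n->sa", P, v)
    v_new = q.max(axis=1)                  # Bellman optimality backup
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print("v*:", v, "greedy policy:", q.argmax(axis=1))
```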
20. Q-Learning - model-free off-policy control algorithm
Model-free (vs Model-based):
- MDP model is unknown, but experience can be sampled from it
- Or the model is known but too big to use directly, except through samples
Off-policy (vs On-policy):
- Can learn about the target policy from experience sampled from some other (behaviour) policy
Control (vs Prediction):
- Find the best policy, rather than just evaluate a given one (a tabular Q-learning sketch follows below)
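A compact sketch of the tabular Q-learning update, $Q(s,a) \leftarrow Q(s,a) + \alpha (r + \gamma \max_{a'} Q(s',a') - Q(s,a))$, with an epsilon-greedy behaviour policy. `env` is assumed to follow the Gymnasium-style `reset`/`step` API; all hyperparameters are illustrative.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy w.r.t. the current Q.
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Off-policy target uses the greedy max, not the action actually taken.
            target = r + (0.0 if terminated else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```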
24. Issues with Q-learning with neural network
- Data is sequential (non-iid)
- Policy changes rapidly with slight changes to Q-values
- Policy may oscillate
- Distribution of experience swings from one extreme to another
- Scale of rewards and Q-values is unknown
- Unstable backpropagation due to large gradients
25. DQN solutions
- Use experience replay
- Breaks correlations in data
- Learn from all past policies
- Using off-policy Q-learning
- Freeze target Q-network
- Avoid policy oscillations
- Break correlations between Q-network and target
- Clip rewards and gradients (see the combined sketch below)
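A hedged PyTorch sketch of the stabilisers listed above: an experience replay buffer (breaks correlations in the data), a frozen target network (breaks the correlation between the Q-network and its target), and reward/gradient clipping. Network sizes, hyperparameters, and the transition format are illustrative assumptions, not the original DQN implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())   # frozen copy of the Q-network
optim = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)                   # replay buffer of (s, a, r, s', done)

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)    # sampling breaks time correlations
    s = torch.stack([torch.as_tensor(b[0], dtype=torch.float32) for b in batch])
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s2 = torch.stack([torch.as_tensor(b[3], dtype=torch.float32) for b in batch])
    done = torch.tensor([b[4] for b in batch], dtype=torch.float32)

    r = r.clamp(-1.0, 1.0)                       # reward clipping, as on the slide

    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a)
    with torch.no_grad():                                # target net stays frozen
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values

    loss = nn.functional.smooth_l1_loss(q, target)       # Huber loss limits gradients
    optim.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(q_net.parameters(), 10.0)   # gradient clipping
    optim.step()

# Periodically (e.g. every N steps) refresh the frozen target network:
# target_net.load_state_dict(q_net.state_dict())
```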