This presentation contains an introduction to reinforcement learning, a comparison with other learning paradigms, an introduction to Q-learning, and some applications of reinforcement learning to video games.
3. Introduction
Supervised learning: a situation in which sample (input, output) pairs of the function to be learned can be perceived or are given.
Unsupervised learning: data-driven (e.g., clustering).
Reinforcement learning: close to human learning.
The algorithm learns a policy of how to act in a given environment.
Every action has some effect on the environment, and the environment provides rewards that guide the learning algorithm.
4. Supervised Learning vs Reinforcement Learning
Supervised Learning
Step: 1
Teacher: Does picture 1 show a car or a flower?
Learner: A flower.
Teacher: No, it’s a car.
Step: 2
Teacher: Does picture 2 show a car or a flower?
Learner: A car.
Teacher: Yes, it’s a car.
Step: 3 ....
5. Reinforcement Learning
Step: 1
World: You are in state 9. Choose action A or C.
Learner: Action A.
World: Your reward is 100.
Step: 2
World: You are in state 32. Choose action B or E.
Learner: Action B.
World: Your reward is 50.
Step: 3 ....
8. Introduction (Cont.)
Meaning of Reinforcement: Occurrence of an
event, in the proper relation to a response, that tends
to increase the probability that the response will
occur again in the same situation.
Reinforcement learning is the problem faced by an
agent that learns behavior through trial-and-error
interactions with a dynamic environment.
Reinforcement Learning is learning how to act in
order to maximize a numerical reward.
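To make the trial-and-error loop concrete, here is a minimal sketch in Python of the agent-environment interaction described above. The environment, its states, and its reward rule are hypothetical stand-ins, not from the original slides.

import random

# Hypothetical toy environment: the agent picks an action,
# receives a reward, and moves to a new state.
def step(state, action):
    reward = 1.0 if action == state % 2 else -1.0  # made-up reward rule
    next_state = random.randint(0, 1)              # made-up transition
    return next_state, reward

state = 0
for t in range(3):
    action = random.choice([0, 1])       # the learner picks an action
    state, reward = step(state, action)  # the environment responds
    print(f"step {t}: action={action}, reward={reward}, state={state}")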
9. Introduction …
Reinforcement learning is not a type of neural network, nor is it an alternative to neural networks. Rather, it is an area of machine learning.
Reinforcement learning provides delayed feedback that evaluates the learner's performance; the learner is never told which action is the correct one for achieving its goal.
10. Reward Hypothesis
All goals can be described by the maximization of expected cumulative reward.
Make a robot walk: +R for moving forward, -R for falling over.
Play Atari games: +R / -R for increasing / decreasing the score.
Control a helicopter: +R for following the trajectory, -R for crashing.
11. Q-Learning
There are many different ways a reinforcement learning agent can be trained, but a common one is called Q-learning.
Before we talk about Q-learning, we need to cover some background material:
Markov decision processes
Value functions
12. Q-Learning …
Model-free (vs model-based):
The MDP model is unknown, but experience can be sampled.
Or the MDP model is known, but is too big to use, except by samples.
Off-policy (vs on-policy):
Can learn about a policy from experience sampled from some other policy.
13. Markov Decision Process
A set of possible world states $S$
A set of possible actions $A$
A real-valued reward function $R(s, a)$
A transition function $T(s, a, s') = P(s' \mid s, a)$: the probability of transitioning from $s$ to $s'$ given action $a$
A policy $\pi$ is a mapping from $S$ to $A$
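As an illustration, a finite MDP can be stored directly as lookup tables. The sketch below uses a hypothetical three-state, two-action MDP; the states, rewards, and transition probabilities are made up for the example.

# A finite MDP as plain dictionaries: S, A, R(s, a), and T(s, a, s') = P(s' | s, a).
S = [0, 1, 2]                  # states
A = ["left", "right"]          # actions

# R[(s, a)]: real-valued reward for taking action a in state s (hypothetical values)
R = {(s, a): 1.0 if s == 2 else 0.0 for s in S for a in A}

# T[(s, a)]: dict mapping s' -> P(s' | s, a) (hypothetical dynamics)
T = {
    (0, "left"):  {0: 1.0},
    (0, "right"): {1: 1.0},
    (1, "left"):  {0: 0.5, 1: 0.5},
    (1, "right"): {2: 1.0},
    (2, "left"):  {1: 1.0},
    (2, "right"): {2: 1.0},
}

# A deterministic policy is just a mapping from S to A.
policy = {0: "right", 1: "right", 2: "right"}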
14. Value functions
Q function: $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$
It is a prediction of future reward:
the next reward plus the best we can do from the next state,
$Q(s, a) = R(s, a, s') + \gamma \max_{a'} Q(s', a')$
$\gamma \in [0, 1]$ is a discount factor that gives later rewards less effect.
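In code, the recursion above is just "immediate reward plus discounted best next value". A minimal sketch, assuming Q is stored as a table indexed by (state, action) pairs; all values are hypothetical.

GAMMA = 0.9  # discount factor, gamma in [0, 1]

def q_target(Q, reward, next_state, actions):
    # R(s, a, s') + gamma * max over a' of Q(s', a')
    return reward + GAMMA * max(Q[(next_state, a)] for a in actions)

# Tiny usage example with made-up values:
Q = {(0, "left"): 0.0, (0, "right"): 2.0}
print(q_target(Q, reward=1.0, next_state=0, actions=["left", "right"]))  # 1.0 + 0.9 * 2.0 = 2.8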
16. Getting the Policy
We are looking for the optimal policy: the one such that no other policy generates more reward.
$Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$
Deterministic policy: $a = \operatorname{argmax}_{a' \in A} Q^*(s, a')$
Bellman equation: $Q^*(s, a) = \mathbb{E}_{s'}\big[\, r + \gamma \max_{a'} Q^*(s', a') \mid s, a \,\big]$
It can be solved recursively with dynamic programming.
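The Bellman equation turns directly into a dynamic-programming update: repeatedly replace Q(s, a) by the right-hand side until the values stop changing. A minimal Q-value-iteration sketch, assuming a tabular MDP with R and T stored as lookup tables as in the earlier sketch (sweep count is a hypothetical choice):

GAMMA = 0.9  # discount factor

def q_value_iteration(S, A, R, T, sweeps=100):
    # Repeatedly apply the Bellman optimality backup to every (s, a) pair.
    Q = {(s, a): 0.0 for s in S for a in A}
    for _ in range(sweeps):
        Q = {
            (s, a): R[(s, a)] + GAMMA * sum(
                p * max(Q[(s2, a2)] for a2 in A) for s2, p in T[(s, a)].items()
            )
            for s in S for a in A
        }
    return Q

def greedy_policy(Q, S, A):
    # Deterministic policy: a = argmax over a' of Q*(s, a')
    return {s: max(A, key=lambda a: Q[(s, a)]) for s in S}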
17. Exploration - Exploitation dilemma
We want to pick good actions most of the time, but also do some exploration:
Exploring means that we can learn better policies.
But we want to balance known good actions with exploratory ones.
This is called the exploration/exploitation problem; a common strategy, ε-greedy, is sketched below.
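A standard way to strike this balance is the ε-greedy rule used in the Q-learning algorithm later in the deck: act greedily most of the time, but explore with a small probability ε. A minimal sketch, assuming a table-based Q and a hypothetical value of ε:

import random

EPSILON = 0.1  # exploration probability (hypothetical value)

def epsilon_greedy(Q, state, actions):
    # With probability epsilon, explore: pick a uniformly random action.
    if random.random() < EPSILON:
        return random.choice(actions)
    # Otherwise, exploit: pick the action with the highest Q-value.
    return max(actions, key=lambda a: Q[(state, a)])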
23. Theoretical complications
Deep learning algorithms require:
Huge training datasets
Independence between samples
A fixed underlying data distribution
24. Deep Q-learning …
Experience replay avoids these theoretical complications:
Greater data efficiency: each experience is potentially used in many weight updates.
Reduced correlations between samples: randomizing the samples breaks the correlations that come from consecutive samples.
Experience replay averages the behavior distribution over states, which smooths out learning and avoids oscillations or divergence in gradient descent.
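A minimal sketch of an experience-replay buffer, assuming transitions are stored as (s, a, r, s') tuples and sampled uniformly at random; the capacity and batch size are hypothetical choices:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between
        # consecutive transitions and reuses each experience many times.
        return random.sample(self.buffer, batch_size)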
32. References
• Mnih et al. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
• Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
• Udacity course, Machine Learning: Reinforcement Learning. https://www.youtube.com/playlist?list=PLAwxTw4SYaPnidDwo9e2c7ixIsu_pdSNp
Q-Learning Algorithm
1. Initialize Q(s, a) to small random values, ∀s, a
2. Observe the state, s
3. Pick an action, a, and do it
4. Observe the next state, s′, and the reward, r
5. $Q(s, a) \leftarrow (1 - \alpha)\,Q(s, a) + \alpha\,(r + \gamma \max_{a'} Q(s', a'))$
6. Go to 2
$0 \le \alpha \le 1$ is the learning rate
And use ε-greedy when picking actions:
• Pick the best (greedy) action with probability 1 − ε
• Otherwise, pick a random action
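Putting the six steps together, a minimal tabular Q-learning sketch in Python. The environment interface (reset/step) and all hyperparameter values are hypothetical stand-ins, not from the original slides.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount factor, exploration rate

def epsilon_greedy(Q, s, actions):
    # Explore with probability EPSILON, otherwise act greedily.
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning(env, actions, episodes=500):
    Q = defaultdict(lambda: random.uniform(-0.01, 0.01))  # 1. initialize Q(s, a) to small random values
    for _ in range(episodes):
        s = env.reset()                        # 2. observe the state s
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions)  # 3. pick an action and do it
            s2, r, done = env.step(a)          # 4. observe next state s' and reward r
            best_next = max(Q[(s2, a2)] for a2 in actions)
            # 5. Q(s, a) <- (1 - alpha) Q(s, a) + alpha (r + gamma max_a' Q(s', a'))
            Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)
            s = s2                             # 6. go to step 2
    return Q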
There is always an optimal policy for any MDP.
All optimal policies achieve the optimal value function.
All optimal policies achieve the optimal action-value function.
All you need is to find $Q^*$.