This presentation contains an introduction to reinforcement learning, a comparison with other learning paradigms, an introduction to Q-learning, and some applications of reinforcement learning to video games.
3. Introduction
Supervised learning: a situation in which sample (input, output) pairs of the function to be learned can be perceived or are given.
Unsupervised learning: data-driven (e.g., clustering).
Reinforcement learning: close to human learning.
The algorithm learns a policy of how to act in a given environment.
Every action has some effect on the environment, and the environment provides rewards that guide the learning algorithm.
4. Supervised Learning vs Reinforcement Learning
Supervised Learning
Step: 1
Teacher: Does picture 1 show a car or a flower?
Learner: A flower.
Teacher: No, it’s a car.
Step: 2
Teacher: Does picture 2 show a car or a flower?
Learner: A car.
Teacher: Yes, it’s a car.
Step: 3 ....
5. Reinforcement Learning
Step: 1
World: You are in state 9. Choose action A or C.
Learner: Action A.
World: Your reward is 100.
Step: 2
World: You are in state 32. Choose action B or E.
Learner: Action B.
World: Your reward is 50.
Step: 3 ....
8. Introduction (Cont.)
Meaning of Reinforcement: Occurrence of an
event, in the proper relation to a response, that tends
to increase the probability that the response will
occur again in the same situation.
Reinforcement learning is the problem faced by an
agent that learns behavior through trial-and-error
interactions with a dynamic environment.
Reinforcement Learning is learning how to act in
order to maximize a numerical reward.
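To make the trial-and-error loop concrete, here is a minimal sketch in Python of the agent-environment interaction described above. The environment, its states, and its reward rule are hypothetical stand-ins, not from the original slides.

import random

# Hypothetical toy environment: the agent picks an action,
# receives a reward, and moves to a new state.
def step(state, action):
    reward = 1.0 if action == state % 2 else -1.0  # made-up reward rule
    next_state = random.randint(0, 1)              # made-up transition
    return next_state, reward

state = 0
for t in range(3):
    action = random.choice([0, 1])       # the learner picks an action
    state, reward = step(state, action)  # the environment responds
    print(f"step {t}: action={action}, reward={reward}, state={state}")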
9. Introduction …
Reinforcement learning is not a type of neural network, nor is it an alternative to neural networks. Rather, it is an area of machine learning.
Reinforcement learning provides delayed feedback that evaluates the learner's performance; the learner is never told which action is the correct one for achieving its goal.
10. Reward Hypothesis
All goals can be described by the maximization of expected cumulative reward.
Make a robot walk: +R for moving forward, -R for falling over.
Play Atari games: +R / -R for increasing / decreasing the score.
Control a helicopter: +R for following the trajectory, -R for crashing.
11. Q-Learning
There are many different ways a reinforcement learning agent can be trained, but a common one is called Q-learning.
Before we talk about Q-learning, we need to cover some background material:
Markov decision processes
Value functions
12. Q-Learning …
Model-free (vs model-based):
The MDP model is unknown, but experience can be sampled.
Or the MDP model is known, but is too big to use, except by samples.
Off-policy (vs on-policy):
Can learn about a policy from experience sampled from some other policy.
13. Markov Decision Process
A set of possible world states $S$
A set of possible actions $A$
A real-valued reward function $R(s, a)$
A transition function $T(s, a, s') = P(s' \mid s, a)$: the probability of transitioning from $s$ to $s'$ given action $a$
A policy $\pi$ is a mapping from $S$ to $A$
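As an illustration, a finite MDP can be stored directly as lookup tables. The sketch below uses a hypothetical three-state, two-action MDP; the states, rewards, and transition probabilities are made up for the example.

# A finite MDP as plain dictionaries: S, A, R(s, a), and T(s, a, s') = P(s' | s, a).
S = [0, 1, 2]                  # states
A = ["left", "right"]          # actions

# R[(s, a)]: real-valued reward for taking action a in state s (hypothetical values)
R = {(s, a): 1.0 if s == 2 else 0.0 for s in S for a in A}

# T[(s, a)]: dict mapping s' -> P(s' | s, a) (hypothetical dynamics)
T = {
    (0, "left"):  {0: 1.0},
    (0, "right"): {1: 1.0},
    (1, "left"):  {0: 0.5, 1: 0.5},
    (1, "right"): {2: 1.0},
    (2, "left"):  {1: 1.0},
    (2, "right"): {2: 1.0},
}

# A deterministic policy is just a mapping from S to A.
policy = {0: "right", 1: "right", 2: "right"}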
14. Value functions
Q function: $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$
It is a prediction of future reward:
the next reward plus the best we can do from the next state,
$Q(s, a) = R(s, a, s') + \gamma \max_{a'} Q(s', a')$
$\gamma \in [0, 1]$ is a discount factor that gives later rewards less effect.
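In code, the recursion above is just "immediate reward plus discounted best next value". A minimal sketch, assuming Q is stored as a table indexed by (state, action) pairs; all values are hypothetical.

GAMMA = 0.9  # discount factor, gamma in [0, 1]

def q_target(Q, reward, next_state, actions):
    # R(s, a, s') + gamma * max over a' of Q(s', a')
    return reward + GAMMA * max(Q[(next_state, a)] for a in actions)

# Tiny usage example with made-up values:
Q = {(0, "left"): 0.0, (0, "right"): 2.0}
print(q_target(Q, reward=1.0, next_state=0, actions=["left", "right"]))  # 1.0 + 0.9 * 2.0 = 2.8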
16. Getting the Policy
We are looking for the optimal policy: the one such that no other policy generates more reward.
$Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$
Deterministic policy: $a = \operatorname{argmax}_{a' \in A} Q^*(s, a')$
Bellman equation: $Q^*(s, a) = \mathbb{E}_{s'}\big[\, r + \gamma \max_{a'} Q^*(s', a') \mid s, a \,\big]$
It can be solved recursively with dynamic programming.
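The Bellman equation turns directly into a dynamic-programming update: repeatedly replace Q(s, a) by the right-hand side until the values stop changing. A minimal Q-value-iteration sketch, assuming a tabular MDP with R and T stored as lookup tables as in the earlier sketch (sweep count is a hypothetical choice):

GAMMA = 0.9  # discount factor

def q_value_iteration(S, A, R, T, sweeps=100):
    # Repeatedly apply the Bellman optimality backup to every (s, a) pair.
    Q = {(s, a): 0.0 for s in S for a in A}
    for _ in range(sweeps):
        Q = {
            (s, a): R[(s, a)] + GAMMA * sum(
                p * max(Q[(s2, a2)] for a2 in A) for s2, p in T[(s, a)].items()
            )
            for s in S for a in A
        }
    return Q

def greedy_policy(Q, S, A):
    # Deterministic policy: a = argmax over a' of Q*(s, a')
    return {s: max(A, key=lambda a: Q[(s, a)]) for s in S}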
17. Exploration - Exploitation dilemma
We want to pick good actions most of the time, but also do some exploration:
Exploring means that we can learn better policies.
But we want to balance known good actions with exploratory ones.
This is called the exploration/exploitation problem; a common strategy, ε-greedy, is sketched below.
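A standard way to strike this balance is the ε-greedy rule used in the Q-learning algorithm later in the deck: act greedily most of the time, but explore with a small probability ε. A minimal sketch, assuming a table-based Q and a hypothetical value of ε:

import random

EPSILON = 0.1  # exploration probability (hypothetical value)

def epsilon_greedy(Q, state, actions):
    # With probability epsilon, explore: pick a uniformly random action.
    if random.random() < EPSILON:
        return random.choice(actions)
    # Otherwise, exploit: pick the action with the highest Q-value.
    return max(actions, key=lambda a: Q[(state, a)])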
23. Theoretical complications
Deep learning algorithms require:
Huge training datasets
Independence between samples
A fixed underlying data distribution
24. Deep Q-learning …
Experience replay avoids these theoretical complications:
Greater data efficiency: each experience is potentially used in many weight updates.
Reduced correlations between samples: randomizing the samples breaks the correlations that come from consecutive samples.
Experience replay averages the behavior distribution over states, which smooths out learning and avoids oscillations or divergence in gradient descent.
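A minimal sketch of an experience-replay buffer, assuming transitions are stored as (s, a, r, s') tuples and sampled uniformly at random; the capacity and batch size are hypothetical choices:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between
        # consecutive transitions and reuses each experience many times.
        return random.sample(self.buffer, batch_size)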
32. References
• Mnih et al. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
• Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
• Udacity course, Machine Learning: Reinforcement Learning. https://www.youtube.com/playlist?list=PLAwxTw4SYaPnidDwo9e2c7ixIsu_pdSNp
Q-Learning Algorithm
1. Initialize Q(s, a) to small random values, ∀s, a
2. Observe the state, s
3. Pick an action, a, and do it
4. Observe the next state, s′, and the reward, r
5. $Q(s, a) \leftarrow (1 - \alpha)\,Q(s, a) + \alpha\,(r + \gamma \max_{a'} Q(s', a'))$
6. Go to 2
$0 \le \alpha \le 1$ is the learning rate
And use ε-greedy when picking actions:
• Pick the best (greedy) action with probability 1 − ε
• Otherwise, pick a random action
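Putting the six steps together, a minimal tabular Q-learning sketch in Python. The environment interface (reset/step) and all hyperparameter values are hypothetical stand-ins, not from the original slides.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount factor, exploration rate

def epsilon_greedy(Q, s, actions):
    # Explore with probability EPSILON, otherwise act greedily.
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning(env, actions, episodes=500):
    Q = defaultdict(lambda: random.uniform(-0.01, 0.01))  # 1. initialize Q(s, a) to small random values
    for _ in range(episodes):
        s = env.reset()                        # 2. observe the state s
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions)  # 3. pick an action and do it
            s2, r, done = env.step(a)          # 4. observe next state s' and reward r
            best_next = max(Q[(s2, a2)] for a2 in actions)
            # 5. Q(s, a) <- (1 - alpha) Q(s, a) + alpha (r + gamma max_a' Q(s', a'))
            Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)
            s = s2                             # 6. go to step 2
    return Q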
There is always an optimal policy for any MDP.
All optimal policies achieve the optimal value function.
All optimal policies achieve the optimal action-value function.
All you need is to find $Q^*$.