Reinforcement Learning
Örjan Ekeberg, Machine Learning
1 Defining the Problem
  Reward Maximization
  Simplifying Assumptions
  Bellman’s Equation
2 Learning
  Monte-Carlo Method
  Temporal-Difference
  Learning to Act
  Q-Learning
3 Improvements
  Importance of Making Mistakes
  Eligibility Trace
1 Defining the Problem
Reinforcement Learning
Learning a behavior without explicit information about correct actions
A reward gives information about success
Agent–Environment Abstraction:
(Diagram: the agent sends an Action to the Environment; the Environment returns a State and a Reward to the agent.)
Policy — Choice of action, depending on current state
Learning Objective:
Develop a policy which maximizes the reward
... the total reward over the agent’s lifetime!
Credit Assignment Problems
The reward does not necessarily arrive when you do something good:
temporal credit assignment
The reward does not say what was good:
structural credit assignment
Consider a minimal behavior selection situation:
What is the best action to take?
(Figure: a small two-step decision tree; each branch carries an immediate reward such as −1, 2, 0, 1, or 3.)
Reward — Immediate consequence of our decision
Planning — Taking future reward into account
Optimal Behavior — Maximize total reward
Can the agent make the right decision without explicitly modeling
the future?
Yes! If the Sum of Future Rewards for each situation is known.
(Figure: the same decision tree, with each state annotated by its sum of future rewards, e.g. 8, 4, 2, −2, −4.)
Call this the State Value
State Values are subjective:
They depend on the agent’s own behavior
They must be re-adjusted as the agent’s behavior improves
Planning Horizon — How long is the future?
(Figure: two reward-versus-time plots contrasting reward received early with reward postponed.)
Infinite future allows for infinite postponing
Discounting — Make early reward more valuable
Discount factor (γ) — Time scale of planning
Finite Horizon:

    max Σ_{t=0}^{h} r_t

Infinite Horizon:

    max Σ_{t=0}^{∞} γ^t · r_t

Requires discounting of future reward (0 < γ < 1)
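A minimal sketch of what the discounted sum means in practice; the reward sequences and γ = 0.9 are invented for illustration:

```python
# Discounted return: the sum of rewards weighted by gamma^t.

def discounted_return(rewards, gamma=0.9):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# A constant reward of 1 forever would sum to 1 / (1 - gamma) = 10;
# truncating after 50 steps already gets close to that limit.
print(discounted_return([1] * 50))        # ~9.95
print(discounted_return([-1, 2, 0, 3]))   # -1 + 2*0.9 + 0 + 3*0.9**3 = 2.987
```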
The Reward function controls which task should be solved
Game (Chess, Backgammon)
Reward only at the end: +1 when winning, −1 when losing
Avoiding mistakes (cycling, balancing, ...)
Reward −1 at the end (when failing)
Find a short/fast/cheap path to a goal
Reward −1 at each step
Simplifying Assumptions
Discrete time
Finite number of actions: a ∈ {a1, a2, a3, . . . , an}
Finite number of states: s ∈ {s1, s2, s3, . . . , sm}
Environment is a stationary Markov Decision Process:
reward and next state depend only on s, a, and chance
Classical model problem: Grid World
Each state is represented by a position in a grid
The agent acts by moving to other positions
(Figure: two small grid labyrinths with goal states marked G.)
Trivial labyrinth
Reward: −1 at each step until a goal state (G) is reached
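The grid world is simple enough to sketch directly. In the sketch below, the grid size, the action names, and the goal position are invented for illustration, not taken from the slides:

```python
# A minimal grid-world environment in the spirit of the slides:
# states are grid positions, actions move the agent, and the reward
# is -1 per step until the goal G is reached.

GOAL = (3, 3)
SIZE = 4
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """delta(s, a) and r(s, a) for the trivial labyrinth."""
    if state == GOAL:                          # goal is absorbing
        return state, 0
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < SIZE and 0 <= c < SIZE):  # bump into a wall: stay put
        r, c = state
    return (r, c), -1
```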
The value of a state depends on the current policy.
V with an optimal policy:

     0  −1  −2  −3
    −1  −2  −3  −2
    −2  −3  −2  −1
    −3  −2  −1   0

V with a random policy:

      0  −14  −20  −22
    −14  −18  −20  −20
    −20  −20  −18  −14
    −22  −20  −14    0
The Agent’s Internal Representation
Policy
The action chosen by the agent for each state:
    π(s) → a
Value Function
Expected total future reward from s when following policy π:
    V^π(s)
Representation of the Environment
Where does an action take us?
    δ(s, a) → s′
How much reward do we receive?
    r(s, a) → r
The values of different states are interrelated
Bellman’s Equation:
    V^π(s) = r(s, π(s)) + γ · V^π(δ(s, π(s)))
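Bellman’s Equation doubles as an update rule: sweeping it over all states until the values stop changing is iterative policy evaluation. A minimal sketch on an invented three-state chain (states, rewards, and γ are arbitrary choices):

```python
# Iterative policy evaluation: apply Bellman's equation
#   V(s) = r(s, pi(s)) + gamma * V(delta(s, pi(s)))
# as an update rule until V (approximately) stops changing.

GAMMA = 0.9
DELTA = {("A", "go"): "B", ("B", "go"): "C", ("C", "go"): "C"}  # transitions
R     = {("A", "go"): -1,  ("B", "go"):  2,  ("C", "go"):  0}   # rewards
PI    = {"A": "go", "B": "go", "C": "go"}                       # fixed policy

V = {s: 0.0 for s in PI}
for _ in range(100):                  # repeated sweeps over all states
    for s in V:
        a = PI[s]
        V[s] = R[(s, a)] + GAMMA * V[DELTA[(s, a)]]
print(V)   # V["C"] = 0, V["B"] = 2, V["A"] = -1 + 0.9 * 2 = 0.8
```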
2 Learning
Normal scenario: r(s, a) and δ(s, a) are not known
V^π must be estimated from experience
Monte-Carlo Method
Start at a random s
Follow π, storing the rewards and the visited states s_t
When the goal is reached, update the V^π(s) estimate for all
visited states with the future reward we actually received
Painfully slow!
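A sketch of first-visit Monte-Carlo estimation on an invented corridor task (five states, reward −1 per step, rightmost state is the goal); the learning rate η and γ are arbitrary choices:

```python
import random

# First-visit Monte-Carlo estimation of V^pi: run an episode under pi,
# then move each visited state's estimate toward the discounted return
# actually received from that point on.

GAMMA, ETA = 0.9, 0.1

def run_episode():
    """Random walk in a 5-state corridor: start in the middle,
    step left/right at random, reward -1 per step, state 4 is the goal."""
    s, trajectory = 2, []
    while s != 4:
        s2 = max(0, s + random.choice([-1, 1]))
        trajectory.append((s, -1))
        s = s2
    return trajectory

V = [0.0] * 5
for _ in range(2000):
    trajectory = run_episode()
    G, returns = 0.0, {}
    for s, r in reversed(trajectory):   # accumulate the discounted return
        G = r + GAMMA * G
        returns[s] = G                  # last write (earliest visit) wins
    for s, G in returns.items():
        V[s] += ETA * (G - V[s])        # move estimate toward observed return
print([round(v, 2) for v in V])
```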
Temporal Difference Learning
(Figure: the decision-tree example, where a state’s stored value is compared with the immediate reward plus the next state’s value.)
Temporal Difference — the difference between:
Experienced value
(immediate reward + value of next state)
Expected value
Measure of unexpected success
Two sources:
Higher immediate reward than expected
Reached better situation than expected
Expected Value:
    V^π(s_t)
One-step Experienced Value:
    r_{t+1} + γ · V^π(s_{t+1})
TD-signal — measure of surprise / disappointment:
    TD = r_{t+1} + γ · V^π(s_{t+1}) − V^π(s_t)
TD Learning:
    V^π(s_t) ← V^π(s_t) + η · TD
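The same invented corridor task as in the Monte-Carlo sketch, now learned online with the TD update above; again a sketch with illustrative parameter values:

```python
import random

# Tabular TD(0): after each step, compute the TD error
#   TD = r + gamma * V[s'] - V[s]
# and nudge V[s] by eta * TD. Corridor: states 0..4, reward -1
# per step, state 4 terminal (value 0 by definition).

GAMMA, ETA = 0.9, 0.1
V = [0.0] * 5

for _ in range(2000):
    s = 2
    while s != 4:
        s2 = max(0, s + random.choice([-1, 1]))
        r = -1
        target = r + GAMMA * (V[s2] if s2 != 4 else 0.0)
        V[s] += ETA * (target - V[s])   # learn from the surprise, online
        s = s2
print([round(v, 2) for v in V])
```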
Learning to Act
How do we get a policy from the TD signal?
Many possibilities...
Actor–Critic Model
(Barto, Sutton, Anderson IEEE Trans. Syst. Man & Cybern. 1983)
Q-Learning
(Watkins, 1989; Watkins & Dayan Machine Learning 1992)
Actor–Critic Model
(Diagram: the Critic receives State and Reward and emits the TD signal; the Actor receives State and produces the Action.)
Actor — Associates states with actions π(·)
Critic — Associates states with their value V (·)
The TD-signal is used as a reinforcement signal to update both!
High TD (unexpected success):
Increase value of preceding state
Increase tendency to take the same action again
Low TD (unexpected failure):
Decrease value of preceding state
Decrease tendency to take the same action again
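A minimal tabular actor–critic sketch in this spirit, on the same invented corridor task; the softmax actor and all learning rates are illustrative choices, not the 1983 formulation:

```python
import math, random

# Actor-critic: the critic learns V(s) with TD; the actor keeps
# preferences pref[s][a], turned into probabilities by softmax.
# The same TD signal updates both, as on the slide.

GAMMA, ETA_V, ETA_P = 0.9, 0.1, 0.1
ACTIONS = [-1, 1]
V = [0.0] * 5
pref = [[0.0, 0.0] for _ in range(5)]    # actor's action preferences

def choose(s):
    """Softmax over the actor's preferences."""
    e = [math.exp(p) for p in pref[s]]
    z = sum(e)
    return random.choices([0, 1], weights=[x / z for x in e])[0]

for _ in range(2000):
    s = 2
    while s != 4:
        a = choose(s)
        s2 = max(0, s + ACTIONS[a])
        r = -1
        td = r + GAMMA * (V[s2] if s2 != 4 else 0.0) - V[s]
        V[s] += ETA_V * td               # critic: update the state value
        pref[s][a] += ETA_P * td         # actor: reinforce or discourage a
        s = s2
```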
Q-Learning
Estimate Q(s, a) instead of V (s)
Q(s, a): expected total reward when doing a from s
    π(s) = argmax_a Q(s, a)
    V (s) = max_a Q(s, a)
The Q-function can also be learned using Temporal-Difference:
    Q(s, a) ← Q(s, a) + η · (r + γ · max_{a′} Q(s′, a′) − Q(s, a))
where s′ is the next state.
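The update above, written out as tabular Q-learning on the invented corridor task; the ε-greedy exploration used here anticipates the next section, and all parameter values are arbitrary:

```python
import random

# Tabular Q-learning with the update from the slide:
#   Q(s,a) <- Q(s,a) + eta * (r + gamma * max_a' Q(s',a') - Q(s,a))

GAMMA, ETA, EPS = 0.9, 0.1, 0.1
ACTIONS = [-1, 1]
Q = [[0.0, 0.0] for _ in range(5)]

for _ in range(2000):
    s = 2
    while s != 4:
        if random.random() < EPS:                     # explore
            a = random.randrange(2)
        else:                                         # exploit
            a = max(range(2), key=lambda i: Q[s][i])
        s2 = max(0, s + ACTIONS[a])
        r = -1
        best_next = 0.0 if s2 == 4 else max(Q[s2])    # max over next actions
        Q[s][a] += ETA * (r + GAMMA * best_next - Q[s][a])
        s = s2
print([[round(q, 2) for q in row] for row in Q])
```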
3 Improvements
What do we do when...
The environment is not deterministic
The environment is not fully observable
There are way too many states
The states are not discrete
The agent is acting in continuous time
The Exploration–Exploitation dilemma
If an agent strictly follows a greedy policy based on the current
estimate of Q, learning is not guaranteed to converge to the optimal Q
Simple solution:
Use a policy which has a certain probability of "making mistakes"
ε-greedy
Sometimes (with probability ε) take a random action instead
of the one that seems best (greedy)
Softmax
Assign a probability to choose each action depending on how
good they seem
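Both exploration strategies as standalone helpers; ε and the softmax temperature τ are tuning knobs, and the values here are arbitrary:

```python
import math, random

# Two ways to inject "mistakes" into action selection.
# q_values is any sequence of estimated action values for one state.

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon act at random, otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax(q_values, tau=1.0):
    """Choose each action with probability proportional to exp(Q / tau)."""
    e = [math.exp(q / tau) for q in q_values]
    z = sum(e)
    return random.choices(range(len(q_values)),
                          weights=[x / z for x in e])[0]
```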
Accelerated learning
Idea: TD updates can be used to improve not only the last state’s
value, but also states we have visited earlier.
    ∀s, a : Q(s, a) ← Q(s, a) + η · [r_{t+1} + γ · Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)] · e(s, a)
e(s, a) is a remaining trace (eligibility trace) encoding how long ago we
were in s doing a.
Often denoted TD(λ), where λ is the time constant of the trace e
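Because the update uses Q(s_{t+1}, a_{t+1}), this is the SARSA(λ) variant; a sketch on the invented corridor task, with λ, the accumulating-trace choice, and all parameter values as assumptions:

```python
import random

# SARSA(lambda): every step, one TD error updates *all* (s, a) pairs in
# proportion to their eligibility trace e(s, a), which is bumped when
# (s, a) is visited and decays by gamma * lam per step.

GAMMA, ETA, EPS, LAM = 0.9, 0.1, 0.1, 0.8
ACTIONS = [-1, 1]
Q = [[0.0, 0.0] for _ in range(5)]

def policy(s):
    if random.random() < EPS:
        return random.randrange(2)
    return max(range(2), key=lambda i: Q[s][i])

for _ in range(1000):
    e = [[0.0, 0.0] for _ in range(5)]   # traces reset at episode start
    s, a = 2, policy(2)
    while s != 4:
        s2 = max(0, s + ACTIONS[a])
        r = -1
        a2 = policy(s2) if s2 != 4 else 0
        q_next = 0.0 if s2 == 4 else Q[s2][a2]
        td = r + GAMMA * q_next - Q[s][a]
        e[s][a] += 1.0                   # mark (s, a) as recently visited
        for si in range(5):
            for ai in range(2):
                Q[si][ai] += ETA * td * e[si][ai]
                e[si][ai] *= GAMMA * LAM # traces fade with time
        s, a = s2, a2
```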