Reinforcement Learning
Dr. P. Kuppusamy
Prof / CSE
Reinforcement learning
• In Reinforcement Learning, the agent senses the environment and
learns to behave (act) in it by performing actions and observing
the results (reinforcement).
• Task
- Learn how to behave successfully to achieve a goal while
interacting with an external environment.
- The goal of the agent is to learn an action policy that
maximizes the total reward it will receive from any starting
state.
• Examples
– Game playing: the player knows whether it wins or loses, but
not how to move at each step
Applications
• A robot cleaning the room and recharging its battery
• Robot-soccer
• Investing in shares
• Modeling the economy through rational agents
• Learning how to fly a helicopter
• Scheduling planes to their destinations
Reinforcement Learning Process
• RL contains two primary components:
1. Agent (A) – RL algorithm that learns from trial and error
2. Environment – World space in which the agent moves (interacts and takes action)
• State (S) – Current situation returned by the environment
• Reward (R) – An immediate return from the environment to appraise the last action
• Policy (π) – Approach the agent uses to decide the next action based on the current state
• Value (V) – Expected long-term return with discount, as opposed to the short-term reward (R)
• Action-Value (Q) – Similar to value, except it takes an additional parameter, the current
action (A)
The figure shows that RL learns from interaction with the environment.
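The loop below is a minimal sketch of how these components fit together; GridEnv and RandomAgent are hypothetical stand-ins written for illustration (a toy 5-state corridor), not part of any library or of these slides.

# Minimal agent-environment interaction loop (illustrative sketch only).
import random

class GridEnv:
    """Toy environment: states 0..4, state 4 is the goal."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); reward 1 only when the goal is reached
        self.state = max(0, min(4, self.state + action))
        reward = 1 if self.state == 4 else 0
        done = self.state == 4
        return self.state, reward, done

class RandomAgent:
    def act(self, state):
        # a (poor) policy pi: state -> action
        return random.choice([-1, +1])

env, agent = GridEnv(), RandomAgent()
state, done, total_reward = env.reset(), False, 0
while not done:                              # one episode of interaction
    action = agent.act(state)                # agent acts on the environment
    state, reward, done = env.step(action)   # environment returns new state and reward
    total_reward += reward
print("return:", total_reward)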
RL Approaches
• Two approaches
– Model-based RL:
• Learn the model, and use it to derive the optimal policy.
e.g. Adaptive Dynamic Programming (ADP)
– Model-free RL:
• Derive the optimal policy without learning the model.
e.g. LMS and Temporal Difference approaches
• Passive learning
– The agent simply watches the world during transitions and tries to
learn the utilities of various states
• Active learning
– The agent does not simply watch, but also acts on the environment
Example
• Immediate reward is worth more than future reward.
• Reward Maximization – The agent is trained to take the best (optimal)
action to get the maximum reward
Reward Maximization
• Exploration – Search and capture more information about the
environment
• Exploitation – Use the already known information to maximize the
rewards
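A common way to balance the two is an epsilon-greedy rule: with a small probability the agent explores, otherwise it exploits. The sketch below is illustrative only and assumes a hypothetical q_values[state][action] table of current estimates.

import random

def epsilon_greedy(q_values, state, actions, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit the best-known action.
    if random.random() < epsilon:
        return random.choice(actions)                       # exploration
    return max(actions, key=lambda a: q_values[state][a])   # exploitation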
Reinforcement learning model
• Each percept (e) is enough to determine the state (the state is
accessible)
• Agent’s task: find an optimal policy, mapping states of the
environment to actions of the agent, that maximizes a long-run
measure of the reward (reinforcement)
• This can be modeled as a Markov Decision Process (MDP).
• A Markov decision process (MDP) is a mathematical
framework for modeling decision making, i.e. mapping out a
solution in reinforcement learning.
MDP model
• MDP model <S,T,A,R>
[Figure: agent–environment interaction — the environment returns a state and a reward, the agent responds with an action, producing the trajectory s0, a0, r0, s1, a1, r1, s2, a2, r2, s3]
• S – set of states
• A – set of actions
• Transition Function: T(s,a,s’) = P(s’|s,a) – the probability of
transitioning from s to s’ given action a; T(s,a) → s’
• Reward Function: r(s,a) → r – the
expected reward for taking action a in
state s
R(s, a) = Σ_{s’} P(s’ | s, a) · r(s, a, s’) = Σ_{s’} T(s, a, s’) · r(s, a, s’)
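As an illustration (not part of the original slides), the tuple <S, T, A, R> can be written out as plain Python data for a made-up two-state MDP; every name and number below is invented for the example.

# Toy MDP <S, A, T, R> as plain Python data (illustrative values only).
S = ["s0", "s1"]                     # set of states
A = ["stay", "move"]                 # set of actions

# T[(s, a)] gives the distribution over next states s': P(s' | s, a)
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.9, "s1": 0.1},
}

# r[(s, a, s')] is the immediate reward for one particular transition
r = {("s0", "move", "s1"): 10.0}

def R(s, a):
    # Expected reward: R(s, a) = sum over s' of P(s' | s, a) * r(s, a, s')
    return sum(p * r.get((s, a, s2), 0.0) for s2, p in T[(s, a)].items())

print(R("s0", "move"))               # 0.8 * 10.0 = 8.0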
MDP - Example I
• Consider the graph, and find the shortest path from a node S to a
goal node G.
• Set of states {S, T, U, V}
• Action – Traversal from one state to another state
• Reward – Traversing an edge yields the edge length in dollars.
• Policy – Path taken to reach the destination, {S → T → V}
[Figure: graph with nodes S, T, U, V, G and edge weights 14, 51, 25, 15, −5, −22]
Q - Learning
• Q-Learning is a value-based reinforcement learning algorithm that uses Q-
values (action values) to iteratively improve the behavior of the learning
agent.
• The goal is to maximize the Q value to find the optimal action-selection policy.
• The Q table helps to find the best action for each state and maximize the
expected reward.
• Q-Values / Action-Values: Q-values are defined for states and actions.
• Q(s, a) denotes an estimate of the value of action a at state s.
• This estimate of Q(s, a) is iteratively computed using the TD-
Update rule.
• Reward: At every transition, the agent observes a reward for the action it
takes, and then transits to another state.
• Episode: If at any point the agent ends up in one of the terminating
states, i.e. no further transitions are possible, the episode is
complete.
Q-Learning
• Initially the agent explores the environment and updates the Q-Table. When the
Q-Table is ready, the agent starts to exploit the environment and take
better actions.
• It is an off-policy control algorithm, i.e. the updated policy is different
from the behavior policy. It estimates the reward for future actions and
appends a value to the new state without following any greedy policy.
Temporal Difference or TD-Update:
• The estimate of Q is updated at every time step of the agent's
interaction with the environment (see the update rule written out below).
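For reference, the TD-Update rule for Q-learning is usually written with a learning rate α; the worked example later in these slides uses the simplified special case with α = 1 and deterministic transitions:

Q(s, a) ← Q(s, a) + α * [ r + Gamma * max[Q(s’, all actions)] − Q(s, a) ]

where r is the reward observed after taking action a in state s and moving to s’.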
Advantage:
• Converges to an optimal policy in both deterministic and nondeterministic
MDPs.
Disadvantage:
• Suitable only for small problems.
Understanding the Q – Learning
• The building environment contains 5 rooms connected by doors.
• Each room is numbered from 0 to 4. The outside of the building is numbered 5.
• Doors from rooms 1 and 4 lead to the outside of the building (5).
• Problem: The agent can be placed in any one of the rooms (0, 1, 2, 3, 4). The agent’s
goal is to reach the outside of the building (room 5).
Understanding the Q – Learning
• Represent the rooms as a graph.
• Each room number is a state, and each door is an edge.
Understanding the Q – Learning
• Assign a reward value to each
door.
• Doors that lead directly to the
target are assigned an instant reward
of 100.
• Other doors, not directly connected
to the target room, have zero
reward.
• Doors are two-way (0
leads to 4, and 4 leads back to 0), so
two edges are assigned to each
door.
• Each edge carries an instant
reward value
Understanding the Q – Learning
• Consider that the agent starts from state s (Room) 2.
• The agent’s movement from one state to another is an action a.
• The agent traverses from state 2 to state 5 (the target):
– Initial state = current state, i.e. state 2
– Transition: State 2 → State 3
– Transition: State 3 → State (2, 1, or 4)
– Transition: State 4 → State 5
Understanding the Q – Learning
Understanding the Q – Learning: Prepare matrix Q
• Matrix Q is the memory of the agent in which learned information
from experience is stored.
• Row denotes the current state of the agent
• Column denotes the possible actions leading to the next state
Compute Q matrix:
Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)]
• Gamma is the discounting factor for future rewards. Its range is 0 to 1,
i.e. 0 < Gamma < 1.
• Future rewards are less valuable than current rewards, so they must
be discounted.
• If Gamma is closer to 0, the agent tends to consider only the
immediate rewards.
• If Gamma is closer to 1, the agent gives more weight to future
rewards.
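A quick numeric illustration (not from the slides) of how Gamma weights future rewards; the reward sequence below is made up.

# Discounted return for a made-up reward stream, comparing a small and a large Gamma.
rewards = [0, 0, 0, 100]                 # reward received 1, 2, 3, 4 steps in the future

def discounted_return(rewards, gamma):
    # G = r1 + gamma*r2 + gamma^2*r3 + ...
    return sum(r * gamma**t for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.1))   # 0.1  -> Gamma near 0: the distant 100 is almost ignored
print(discounted_return(rewards, 0.9))   # 72.9 -> Gamma near 1: the distant 100 still matters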
Q – Learning Algorithm
• Set the gamma parameter
• Set the environment rewards in matrix R
• Initialize matrix Q to zero
– Select a random initial (source) state
• Set initial state s = current state
– Select one action a among all possible actions using an exploratory policy
• Take this action a, going to the next state s’
• Observe reward r
– Get the maximum Q value for the next state over all possible
actions
• Compute:
– Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)]
• Repeat the above steps until the goal state is reached, i.e. current state =
goal state
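The following is a runnable sketch of these steps for the 6-room example, assuming the reward matrix shown on the next slides and Gamma = 0.8; action selection here is purely random exploration, and the code is an illustration rather than the author's original implementation.

# Q-learning sketch for the 6-room example (Gamma = 0.8).
import random

GOAL = 5
GAMMA = 0.8

# Reward matrix R[state][action]; -1 marks an impossible move (no door).
R = [
    [-1, -1, -1, -1,  0, -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1, -1],
    [-1,  0,  0, -1,  0, -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]

Q = [[0] * 6 for _ in range(6)]              # initialize matrix Q to zero

for episode in range(1000):
    state = random.randrange(6)              # random initial (source) state
    while state != GOAL:
        # select one action among all possible actions (exploratory policy)
        actions = [a for a in range(6) if R[state][a] != -1]
        action = random.choice(actions)
        next_state = action                  # in this example the action is the room moved to
        # Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)]
        Q[state][action] = R[state][action] + GAMMA * max(Q[next_state])
        state = next_state

for row in Q:
    print(row)                               # e.g. Q(1,5) -> 100, Q(3,1) -> 80, Q(0,4) -> 80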
Example: Q – Learning
[Matrix layout: rows = states 0–5, columns = actions 0–5]
Example: Q – Learning
[Figure: room graph showing the states and possible actions]
Reward matrix R (rows = states 0–5, columns = actions 0–5; −1 marks a missing door):
        0    1    2    3    4    5
  0    −1   −1   −1   −1    0   −1
  1    −1   −1   −1    0   −1  100
  2    −1   −1   −1    0   −1   −1
  3    −1    0    0   −1    0   −1
  4     0   −1   −1    0   −1  100
  5    −1    0   −1   −1    0  100
Matrix Q (rows = states 0–5, columns = actions 0–5): all entries are 0 except Q(1,5) = 100, highlighted on the slide.
• Update the matrix Q.
• For the next episode, state 3 is chosen randomly and becomes
the current state.
• State 3 has 3 possible actions: go to state 1, 2, or 4.
• Let’s choose state 1.
• Compute the max Q value for the next state over all possible
actions.
Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)]
• Q(3,1) = R(3,1) + 0.8 * max[Q(1,3), Q(1,5)]
= 0 + 0.8 * max[0, 100] = 0 + 80 = 80
• Update the matrix Q.
[Figure: room graph showing the current episode]
The reward matrix R is unchanged. Matrix Q now holds Q(1,5) = 100 and Q(3,1) = 80 (highlighted on the slide); all other entries are 0.
• For the next episode, state 1 becomes the current state.
• Repeat the inner loop, since 1 is not the target state.
• From state 1, the agent can go to either 3 or 5.
• Let’s choose state 5.
• Compute the max Q value for the next state over all possible
actions.
• Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)]
• Q(1,5) = R(1,5) + 0.8 * max[Q(5,1), Q(5,4), Q(5,5)]
= 100 + 0.8 * max[0, 0, 0] = 100 + 0 = 100
• Q remains the same because Q(1,5) = 100 is already stored. Stop the process.
[Figure: room graph showing the current episode]
The matrices R and Q are unchanged: Q(1,5) = 100, Q(3,1) = 80, and all other entries 0.
References
• Tom Markiewicz & Josh Zheng, Getting Started with Artificial Intelligence, O’Reilly Media, 2017
• Stuart J. Russell and Peter Norvig, Artificial Intelligence: A Modern Approach
• Richard Szeliski, Computer Vision: Algorithms and Applications, Springer, 2010