Introduction to Reinforcement Learning
- Utkarsh Garg
How do we learn to do Stuff?
• When a living organism is exposed to a specific stimulus (or situation), the behaviour it performs in response can be strengthened, so that the next time it meets that stimulus it is more likely to execute the learned behaviour.
• The organism’s behaviour is shaped by detectable changes in the environment, i.e. something external that influences an activity. For example, our bodies can detect touch, sound, vision, etc.
• The organism’s brain uses reinforcement or punishment to modify the likelihood of a behaviour. This involves voluntary behaviour, as in the following example of animal behaviour:
• A dog can be trained to jump higher when rewarded with dog treats, meaning its behaviour was reinforced by treats to perform specific actions.
With advancements in robotic arm manipulation, Google DeepMind beating a professional Go player with AlphaGo, and more recently the OpenAI team beating professional Dota 2 players, the field of reinforcement learning has exploded in recent years.
Before we look at how these systems accomplished feats like the ones above, let’s first learn the building blocks of reinforcement learning.
Let’s learn to crawl before we run!
 A maze-like problem
 The agent lives in a grid
 Walls block the agent’s path
 Noisy movement: actions do not always go as planned
 80% of the time, the action North takes the agent North
(if there is no wall there)
 10% of the time, North takes the agent West; 10% East
 If there is a wall in the direction the agent would have
been taken, the agent stays put
 The agent receives rewards each time step
 Small “living” reward each step (can be negative)
 Big rewards come at the end (good or bad)
 Goal: maximize sum of rewards
Grid World
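A minimal Python sketch of the noisy movement above, assuming a small coordinate grid with a hypothetical wall set; with probability 0.8 the intended action is executed, otherwise the agent slips to one of the two perpendicular directions, and a move into a wall or off the grid leaves the agent where it is:

import random

WIDTH, HEIGHT = 4, 3        # hypothetical grid size
WALLS = {(1, 1)}            # hypothetical wall cell

MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
# Perpendicular "slip" directions for each intended action (10% each).
SLIPS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

def step(state, action):
    """Noisy transition: 80% intended direction, 10% each perpendicular slip."""
    r = random.random()
    if r < 0.8:
        direction = action
    elif r < 0.9:
        direction = SLIPS[action][0]
    else:
        direction = SLIPS[action][1]
    dx, dy = MOVES[direction]
    nx, ny = state[0] + dx, state[1] + dy
    # Stay put if the move would leave the grid or hit a wall.
    if not (0 <= nx < WIDTH and 0 <= ny < HEIGHT) or (nx, ny) in WALLS:
        return state
    return (nx, ny)

print(step((0, 0), "N"))   # usually (0, 1), sometimes (1, 0) or unchanged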
Deterministic Grid World vs. Stochastic Grid World
• We need to get from point A to point B
• Each segment shows travel time in minutes; A to C takes 4 mins
• The shortest path in this problem is ACDEGHB
• This is a deterministic problem
• Let’s say we introduce traffic on each segment with some probability
• For example, there is a 25% chance it will take 10 mins and a 75% chance it will take 3 mins to reach point C from point A. Similarly, the other segments get their own probabilities
• Now, if we run the simulation multiple times, the shortest-time path will differ from iteration to iteration because of the randomness the traffic introduces into the system. This is called a stochastic process
• Finding the shortest-time route is not straightforward anymore. In the real world we may not even know these probabilities. Our goal is now to find the path that is most likely to be the shortest.
Another Example
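A small Python sketch of this stochastic shortest-path idea; the routes and the per-segment probabilities below are made up for illustration, and repeating the simulation shows the fastest route changing between runs:

import random

# Hypothetical (probability, minutes) outcomes for each segment.
SEGMENTS = {
    "A-C": [(0.75, 3), (0.25, 10)],   # 75% chance of 3 mins, 25% chance of 10 mins
    "C-D": [(0.80, 5), (0.20, 12)],
    "D-B": [(0.90, 4), (0.10, 15)],
    "A-E": [(0.60, 6), (0.40, 9)],
    "E-B": [(0.70, 7), (0.30, 11)],
}
ROUTES = {"A-C-D-B": ["A-C", "C-D", "D-B"], "A-E-B": ["A-E", "E-B"]}

def sample_time(segment):
    (p_fast, fast), (_, slow) = SEGMENTS[segment]
    return fast if random.random() < p_fast else slow

def simulate(route):
    return sum(sample_time(segment) for segment in ROUTES[route])

# The fastest route can differ from run to run because of the randomness.
for trial in range(3):
    times = {route: simulate(route) for route in ROUTES}
    print(trial, times, "fastest:", min(times, key=times.get))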
Reinforcement
Learning
• Reinforcement learning (RL) is an area of machine learning
concerned with how software agents ought to take actions in
an environment so as to maximize some notion of cumulative
reward.
A simple example of the above system:
 Imagine a baby is given a TV remote control at home (the environment)
 The baby (agent) first observes the TV and its state (whether it is on/off, which channel is playing, etc.)
 The curious baby then takes certain actions, like hitting the remote control (action), and observes how the TV responds (next state)
 As a non-responding TV is dull, the baby dislikes it (receiving a negative reward) and will take fewer of the actions that lead to such a result (updating the policy), and vice versa
 The baby repeats the process until he/she finds a policy (what to do under different circumstances) that he/she is happy with (maximizing the total (discounted) reward).
BREAKOUT
Reward and Policy
• The reward structure of our system depends on how and what we want our system to learn
[Figure: the same grid world under four different living rewards: R(s) = -2.0, R(s) = -0.4, R(s) = -0.03, R(s) = -0.01]
• We do not only want the system to greedily grab the highest reward available right now; we also want it to consider future rewards.
Why?
It leads to better strategies!
• Therefore, we want to:
• Maximize the sum of rewards
• Prefer rewards now over rewards later, since we are dealing with a stochastic process and we never know whether the action we take will actually lead to the target state with the reward
Calculating Rewards
In the picture on the left,
• the two paths are policies
• Each circle is a state and each diamond a reward
• The agent needs to decide the optimal path (or policy) so
that it maximizes its total reward
• If it were a deterministic process, both paths would lead to an equal sum of rewards
• But since we are dealing with a stochastic process, we cannot simply wait for the reward at the 4th circle, as the policy may not actually take us to the maximum reward
One way to model this is to exponentially decay future rewards:
Total discounted reward = r_1 + γ·r_2 + γ²·r_3 + γ³·r_4 + γ⁴·r_5 + …
where γ (gamma), 0 ≤ γ ≤ 1, is the discount factor.
This equation gives us a quantitative basis to say that the agent would prefer path 1, since its total discounted reward is higher than that of the second path.
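The discounted sum above is easy to compute in code; a quick Python sketch with made-up reward sequences for the two paths (γ = 0.9):

def discounted_return(rewards, gamma=0.9):
    """Total discounted reward: r_1 + gamma*r_2 + gamma^2*r_3 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical reward sequences: the same reward, arriving one step earlier vs. later.
path_1 = [0, 0, 2, 0]
path_2 = [0, 0, 0, 2]
print(discounted_return(path_1))   # 2 * 0.9**2 = 1.62
print(discounted_return(path_2))   # 2 * 0.9**3 = 1.458 -> path 1 is preferred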
Done with basics.
Let’s go Deeper
Q - Learning
What is Q?
• Q-value: Q(s,a) is the total discounted reward the agent obtains when it takes action a in state s and then follows the optimal path afterwards; that is why there is a max over all actions in the equation below:
Q(s,a) = r(s,a) + γ · max_a' Q(s',a')
• Q*(s,a) is this value under the optimal policy, and the best action at state s is the one with the highest Q*(s,a).
• By having this value for all combinations of states and actions, the agent can simply look up, in every state, the action with the highest Q-value.
Q table
Reward    Value
1 Step    -0.04
Power     +0.5
Mines     -10
End       +1 or -1
γ = 0.9
Learned Q Values
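A minimal tabular Q-learning sketch of how such a table of learned Q-values can be built up; the environment interface (reset(), step(), actions) and the hyperparameters are assumptions for illustration:

import random
from collections import defaultdict

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning; env is assumed to expose reset(), step(a) and a list env.actions."""
    Q = defaultdict(float)   # Q[(state, action)] -> estimated total discounted reward
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection (see the next slide on exploration vs. exploitation).
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q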
Exploration Vs Exploitation
• There is an important concept of the exploration and
exploitation trade off in reinforcement learning.
• Exploration is all about finding more information about
an environment, whereas exploitation is exploiting
already known information to maximize the rewards.
• Real-life example: say you go to the same restaurant (which you like) every day. You are basically exploiting. On the other hand, if you search for a new restaurant every time before going to one, that is exploration. Exploration is very important in the search for future rewards which might be higher than the near-term rewards, i.e. you may find a new restaurant even better than the one you were exploiting.
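A common way to balance the two is epsilon-greedy action selection with a decaying epsilon: explore at random with probability epsilon, otherwise exploit the current Q-estimates. The decay schedule below is an arbitrary illustrative choice:

import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice(actions)                               # exploration
    return max(actions, key=lambda a: Q.get((state, a), 0.0))       # exploitation

# Typical pattern: explore a lot at first, exploit more as learning progresses.
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... run one episode, choosing actions with epsilon_greedy(...) ...
    epsilon = max(eps_min, epsilon * eps_decay)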
Generalization across States
• Basic Q-Learning keeps a table of all q-values
• In realistic situations, we cannot possibly learn about every
single state!
• Too many states to visit them all in training
• Too many states to hold the q-tables in memory
• Instead, we want to generalize:
• Learn about some small number of training states from
experience
• Generalize that experience to new, similar situations
• This is a fundamental idea in machine learning, and we’ll see it
over and over again
State space
• Discretized vertical distance from lower pipe
• Discretized horizontal distance from next pair of pipes
• Life: Dead or Living
Actions
• Click
• Do nothing
Rewards
• +1 if Flappy Bird still alive
• -1000 if Flappy Bird is dead
• 6-7 hours of Q-learning
Generalization Example 1
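A sketch of how the Flappy Bird state and rewards above could be encoded for a Q-table; the pixel bucket size used for discretization is an assumption:

ACTIONS = ["click", "do_nothing"]

def discretize_state(dist_to_lower_pipe, dist_to_next_pipes, alive, bucket=10):
    """Map raw pixel distances to coarse buckets so the Q-table stays small."""
    dy = int(dist_to_lower_pipe // bucket)    # discretized vertical distance from lower pipe
    dx = int(dist_to_next_pipes // bucket)    # discretized horizontal distance to next pipe pair
    return (dy, dx, alive)                    # alive: True (living) or False (dead)

def reward(alive):
    return 1 if alive else -1000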
Let’s say we discover
through experience
that this state is bad:
In naïve q-learning,
we know nothing
about this state:
Or even this one!
Generalization Example 2
• Solution: describe a state using a vector of features
(properties)
• Features are functions from states to real numbers (often
0/1) that capture important properties of the state
• Example features:
• Distance to closest ghost
• Distance to closest dot
• Number of ghosts
• 1 / (dist to dot)²
• Is Pacman in a tunnel? (0/1)
• …… etc.
• Is it the exact state on this slide?
• Can also describe a q-state (s, a) with features (e.g. action
moves closer to food)
• Now, instead of a Q-table, we have these features, on which we can train any supervised learning algorithm to learn the Q-values and hence the right actions
Feature Based Representation
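With such a feature representation, Q(s,a) can be approximated as a weighted sum of feature values, and the weights can be updated with the same TD error used in tabular Q-learning. A minimal sketch, where the feature names and numbers are hypothetical stand-ins for the Pacman features above:

def q_value(weights, features):
    """Linear approximation: Q(s, a) ~ sum_i w_i * f_i(s, a)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def update(weights, features, reward, max_next_q, q_sa, alpha=0.01, gamma=0.9):
    """Nudge every weight in the direction that reduces the TD error."""
    td_error = (reward + gamma * max_next_q) - q_sa
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + alpha * td_error * value
    return weights

# Hypothetical features for one (state, action) pair.
features = {"dist_to_closest_ghost": 0.3, "dist_to_closest_dot": 0.1, "in_tunnel": 1.0}
weights = {}
weights = update(weights, features, reward=-1.0, max_next_q=0.5,
                 q_sa=q_value(weights, features))
print(weights)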
Generalization Example 3 (play video)
4 Actions available:
• The average angle of the blades
• Difference in angle
between front and back
• Difference in angle
between left and right
• Angle for the tail rotor
Task:
Learn to hover
States:
• Data from various sensors
Note! The most efficient policy it
found was to fly inverted!
Going even
Deeper…
Deep Q Networks (DQN)
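A bare-bones PyTorch sketch of the DQN idea: a small network maps a state vector to one Q-value per action and is trained to match the TD target r + γ·max Q(s',·) computed with a separate target network. The layer sizes, hyperparameters and batch handling are assumptions; a full DQN also needs an experience-replay buffer and periodic target-network updates:

import torch
import torch.nn as nn

n_states, n_actions, gamma = 4, 2, 0.99   # hypothetical sizes

def make_net():
    # Maps a state vector to one Q-value per action.
    return nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net, target_net = make_net(), make_net()
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(states, actions, rewards, next_states, dones):
    """One gradient step on a batch of transitions (normally sampled from a replay buffer)."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rewards + gamma * target_net(next_states).max(dim=1).values * (1 - dones)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()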
Alpha Go
• In 2016, the initial version, AlphaGo Lee, beat 18-time world champion Lee Sedol.
• Just a year later, AlphaGo Zero, unlike its predecessor, was trained without any data from real human games.
• It learned only by playing against itself. The 2016 version was defeated 100-0 by AlphaGo Zero.
• Go has shown us that AI has started to move beyond what humans can tell it to do.
• This was shown when AlphaGo played move 37. To humans, even the world champion, it looked like a bad move, but it turned out to be a game-changing move that led to AlphaGo’s victory.
Architecture link: https://applied-data.science/static/main/res/alpha_go_zero_cheat_sheet.png
Alpha Go Training Graph
Self Driving Cars
A supervised-learning-based self-driving car (with a simulator):
https://www.youtube.com/watch?v=EaY5QiZwSP4&t=1111s
The reinforcement learning way to do this!
https://wayve.ai/blog/learning-to-drive-in-a-day-with-reinforcement-learning
Landing SpaceX Rockets
https://www.youtube.com/watch?v=4_igzo4qNmQ
Thank You