SlideShare una empresa de Scribd logo
1 de 30
Reinforcement Learning
Dr. Subrat Panda
Head of AI and Data Sciences,
Capillary Technologies
Introduction about me
● BTech ( 2002) , PhD (2009) – CSE, IIT Kharagpur
● Synopsys (EDA), IBM (CPU), NVIDIA (GPU), Taro (Full Stack Engineer), Capillary (Principal
Architect - AI)
● Applying AI to Retail
● Co-Founded IDLI (for social good) with Prof. Amit Sethi (IIT Bombay), Jacob Minz (Synopsys)
and Biswa Gourav Singh (Capillary)
● https://www.facebook.com/groups/idliai/
● Linked In - https://www.linkedin.com/in/subratpanda/
● Facebook - https://www.facebook.com/subratpanda
● Twitter - @subratpanda Email - subrat.panda@capillarytech.com
Overview
• Supervised Learning: Immediate feedback (labels provided for every input).
• Unsupervised Learning: No feedback (No labels provided).
• Reinforcement Learning: Delayed scalar feedback (a number called reward).
• RL deals with agents that must sense & act upon their environment. This combines
classical AI and machine learning techniques. It is probably the most comprehensive problem
setting.
• Examples:
• Robot-soccer
• Share Investing
• Learning to walk/fly/ride a vehicle
• Scheduling
• AlphaGo / Super Mario
Machine Learning – http://techleer.com
Some Definitions and assumptions
MDP - Markov Decision Problem
The framework of the MDP has the following elements:
1. state of the system,
2. actions,
3. transition probabilities,
4. transition rewards,
5. a policy, and
6. a performance metric.
We assume that the system is modeled by a so-called abstract stochastic process called the Markov
chain.
RL is generally used to solve the so-called Markov decision problem (MDP).
The theory of RL relies on dynamic programming (DP) and artificial intelligence (AI).
The Big Picture
Your action influences the state of the world which determines its reward
Some Complications
• The outcome of your actions may be uncertain
• You may not be able to perfectly sense the state of the world
• The reward may be stochastic.
• Reward is delayed (i.e. finding food in a maze)
• You may have no clue (model) about how the world responds to your actions.
• You may have no clue (model) of how rewards are being paid off.
• The world may change while you try to learn it.
• How much time do you need to explore uncharted territory before you exploit what you have learned?
• For large-scale systems with millions of states, it is impractical to store these values. This is called the curse of
dimensionality. DP breaks down on problems which suffer from any one of these curses because it needs all these
values.
The Task
• To learn an optimal policy that maps states of the world to actions of the
agent.
i.e., if this patch of room is dirty, I clean it. If my battery is empty, I recharge it.
• What is it that the agent tries to optimize?
Answer: the total future discounted reward:
Note: immediate reward is worth more than future reward.
What would happen to mouse in a maze with gamma = 0 ? - Greedy
Value Function
• Let’s say we have access to the optimal value function that computes the
total future discounted reward
• What would be the optimal policy ?
• Answer: we choose the action that maximizes:
• We assume that we know what the reward will be if we perform action “a” in
state “s”:
• We also assume we know what the next state of the world will be if we perform
action “a” in state “s”:
Example I
• Consider some complicated graph, and we would like to find the
shortest
path from a node Si to a goal node G.
• Traversing an edge will cost you “length edge” dollars.
• The value function encodes the total remaining
distance to the goal node from any node s, i.e.
V(s) = “1 / distance” to goal from s.
• If you know V(s), the problem is trivial. You simply
choose the node that has highest V(s).
Si
G
Example II
Find your way to the goal.
Q-Function
• One approach to RL is then to try to estimate V*(s).
• However, this approach requires you to know r(s,a) and delta(s,a).
• This is unrealistic in many real problems. What is the reward if a robot is exploring mars
and decides to take a right turn?
• Fortunately we can circumvent this problem by exploring and experiencing how the world
reacts to our actions. We need to learn r & delta.
• We want a function that directly learns good state-action pairs, i.e. what action should I
take in this state. We call this Q(s,a).
• Given Q(s,a) it is now trivial to execute the optimal policy, without knowing
r(s,a) and delta(s,a). We have:
Bellman Equation:
Example II
Check that
Q-Learning
• This still depends on r(s,a) and delta(s,a).
• However, imagine the robot is exploring its environment, trying new actions as
it goes.
• At every step it receives some reward “r”, and it observes the environment
change into a new state s’ for action a. How can we use these observations,
(s,a,s’,r) to learn a model?
s’=st+1
Q-Learning
• This equation continually estimates Q at state s consistent with an estimate
of Q at state s’, one step in the future: temporal difference (TD) learning.
• Note that s’ is closer to goal, and hence more “reliable”, but still an estimate itself.
• Updating estimates based on other estimates is called bootstrapping.
• We do an update after each state-action pair. I.e., we are learning online!
• We are learning useful things about explored state-action pairs. These are typically most
useful because they are likely to be encountered again.
• Under suitable conditions, these updates can actually be proved to converge to the real
answer.
s’=st+1
Example Q-Learning
Q-learning propagates Q-estimates 1-step backwards
Exploration / Exploitation
• It is very important that the agent does not simply follow the current policy
when learning Q. (off-policy learning).The reason is that you may get stuck
in a suboptimal solution. i.e. there may be other solutions out there that you
have never seen.
• Hence it is good to try new things so now and then, e.g. If T is large lots of
exploring, if T is small follow current policy. One can decrease T over time.
Improvements
• One can trade-off memory and computation by caching (s,s’,r) for observed
transitions. After a while, as Q(s’,a’) has changed, you can “replay” the update:
•One can actively search for state-action pairs for which Q(s,a) is expected to
change a lot (prioritized sweeping).
• One can do updates along the sampled path much further back than just
one step ( learning).
Extensions
• To deal with stochastic environments, we need to maximize
expected future discounted reward:
• Often the state space is too large to deal with all states. In this case we
need to learn a function:
• Neural network with back-propagation have been quite successful.
• For instance, TD-Gammon is a back-gammon program that plays at expert
level.
state-space very large, trained by playing against itself, uses NN to approximate
value function, uses TD(lambda) for learning.
Real Life Examples Using RL
Deep RL - Deep Q Learning
Tools and EcoSystem
● OpenAI Gym- Python based rich AI simulation environment
● TensorFlow - TensorLayer providing popular RL modules
● Pytorch - Open Sourced by Facebook
● Keras - On top of TensorFlow
● DeepMind Lab - Google 3D platform
Small Demo - on Youtube
Application Areas of RL
- Resources management in computer clusters
- Traffic Light Control
- Robotics
- Personalized Recommendations
- Chemistry
- Bidding and Advertising
- Games
What you need to know to apply RL?
- Understanding your problem
- A simulated environment
- MDP
- Algorithms
Learning how to run? - Good example by
deepsense.ai
Conclusion
• Reinforcement learning addresses a very broad and relevant question: How can we learn
to survive in our environment?
• We have looked at Q-learning, which simply learns from experience. No model of the world
is needed.
• We made simplifying assumptions: e.g. state of the world only depends on last state and
action. This is the Markov assumption. The model is called a Markov Decision Process
(MDP).
• We assumed deterministic dynamics, reward function, but the world really is stochastic.
• There are many extensions to speed up learning.
• There have been many successful real world applications.
References
• Thanks to Google and AI research Community
• https://www.slideshare.net/AnirbanSantara/an-introduction-to-reinforcement-learning-the-doors-to-agi - Good Intro PPT
• https://skymind.ai/wiki/deep-reinforcement-learning - Good blog
• https://medium.com/@BonsaiAI/why-reinforcement-learning-might-be-the-best-ai-technique-for-complex-industrial-systems-
fde8b0ebd5fb
• https://deepsense.ai/learning-to-run-an-example-of-reinforcement-learning/
• https://en.wikipedia.org/wiki/Reinforcement_learning
• https://www.ics.uci.edu/~welling/teaching/ICS175winter12/RL.ppt - Good Intro PPT which I have adopted in whole
• https://people.eecs.berkeley.edu/~jordan/MLShortCourse/reinforcement-learning.ppt
• https://www.cse.iitm.ac.in/~ravi/courses/Reinforcement%20Learning.html - Detailed Course
• https://web.mst.edu/~gosavia/tutorial.pdf - short Tutorial
• Book - Reinforcement Learning: An Introduction (English, Hardcover, Andrew G Barto Richard S Sutton Barto
Sutton)
• https://www.techleer.com/articles/203-machine-learning-algorithm-backbone-of-emerging-technologies/
• http://ruder.io/transfer-learning/
• Exploration and Exploitation - https://artint.info/html/ArtInt_266.html ,
http://www.cs.cmu.edu/~rsalakhu/10703/Lecture_Exploration.pdf
• https://hub.packtpub.com/tools-for-reinforcement-learning/
• http://adventuresinmachinelearning.com/reinforcement-learning-tensorflow/
• https://towardsdatascience.com/applications-of-reinforcement-learning-in-real-world-1a94955bcd12
Acknowledgements
- Anirban Santara - For his Intro Session on IDLI and support
- Prof. Ravindran - Inspiration
- Palash Arora from Capillary, Ashwin Krishna.
- Techgig Team for the opportunity
- Sumandeep and Biswa from Capillary for review and discussions.
- https://www.ics.uci.edu/~welling/teaching/ICS175winter12/RL.ppt - Good Intro PPT
which I have adopted in whole
Thank You!!

Más contenido relacionado

La actualidad más candente

An introduction to deep reinforcement learning
An introduction to deep reinforcement learningAn introduction to deep reinforcement learning
An introduction to deep reinforcement learningBig Data Colombia
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningDing Li
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningKhaled Saleh
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningDongHyun Kwak
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningDongHyun Kwak
 
Actor critic algorithm
Actor critic algorithmActor critic algorithm
Actor critic algorithmJie-Han Chen
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningCloudxLab
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learningbutest
 
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsBill Liu
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningNAVER Engineering
 
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Simplilearn
 
Unsupervised learning
Unsupervised learningUnsupervised learning
Unsupervised learningamalalhait
 
Rl chapter 1 introduction
Rl chapter 1 introductionRl chapter 1 introduction
Rl chapter 1 introductionConnorShorten2
 
Markov decision process
Markov decision processMarkov decision process
Markov decision processHamed Abdi
 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkYan Xu
 
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex FridmanMIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex FridmanPeerasak C.
 

La actualidad más candente (20)

An introduction to deep reinforcement learning
An introduction to deep reinforcement learningAn introduction to deep reinforcement learning
An introduction to deep reinforcement learning
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement Learning
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Actor critic algorithm
Actor critic algorithmActor critic algorithm
Actor critic algorithm
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Deep Q-Learning
Deep Q-LearningDeep Q-Learning
Deep Q-Learning
 
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its Applications
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
 
Machine learning
Machine learningMachine learning
Machine learning
 
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
 
Unsupervised learning
Unsupervised learningUnsupervised learning
Unsupervised learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Rl chapter 1 introduction
Rl chapter 1 introductionRl chapter 1 introduction
Rl chapter 1 introduction
 
Markov decision process
Markov decision processMarkov decision process
Markov decision process
 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural Network
 
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex FridmanMIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
 

Similar a An introduction to reinforcement learning

24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptxManiMaran230751
 
Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement LearningNatan Katz
 
Q-Learning Algorithm: A Concise Introduction [Shakeeb A.]
Q-Learning Algorithm: A Concise Introduction [Shakeeb A.]Q-Learning Algorithm: A Concise Introduction [Shakeeb A.]
Q-Learning Algorithm: A Concise Introduction [Shakeeb A.]Shakeeb Ahmad Mohammad Mukhtar
 
Intro to Reinforcement Learning
Intro to Reinforcement LearningIntro to Reinforcement Learning
Intro to Reinforcement LearningUtkarsh Garg
 
Demystifying deep reinforement learning
Demystifying deep reinforement learningDemystifying deep reinforement learning
Demystifying deep reinforement learning재연 윤
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperDataScienceLab
 
Review :: Demystifying deep reinforcement learning (written by Tambet Matiisen)
Review :: Demystifying deep reinforcement learning (written by Tambet Matiisen)Review :: Demystifying deep reinforcement learning (written by Tambet Matiisen)
Review :: Demystifying deep reinforcement learning (written by Tambet Matiisen)Hogeon Seo
 
ODSC East 2020 : Continuous_learning_systems
ODSC East 2020 : Continuous_learning_systemsODSC East 2020 : Continuous_learning_systems
ODSC East 2020 : Continuous_learning_systemsAnuj Gupta
 
Continuous Learning Systems: Building ML systems that learn from their mistakes
Continuous Learning Systems: Building ML systems that learn from their mistakesContinuous Learning Systems: Building ML systems that learn from their mistakes
Continuous Learning Systems: Building ML systems that learn from their mistakesAnuj Gupta
 
Building Continuous Learning Systems
Building Continuous Learning SystemsBuilding Continuous Learning Systems
Building Continuous Learning SystemsAnuj Gupta
 
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...Srinath Perera
 
cs330_2021_lifelong_learning.pdf
cs330_2021_lifelong_learning.pdfcs330_2021_lifelong_learning.pdf
cs330_2021_lifelong_learning.pdfKuan-Tsae Huang
 
reiniforcement learning.ppt
reiniforcement learning.pptreiniforcement learning.ppt
reiniforcement learning.pptcharusharma165
 
semi supervised Learning and Reinforcement learning (1).pptx
 semi supervised Learning and Reinforcement learning (1).pptx semi supervised Learning and Reinforcement learning (1).pptx
semi supervised Learning and Reinforcement learning (1).pptxDr.Shweta
 
Introduction to Deep learning and H2O for beginner's
Introduction to Deep learning and H2O for beginner'sIntroduction to Deep learning and H2O for beginner's
Introduction to Deep learning and H2O for beginner'sVidyasagar Bhargava
 
Horizon: Deep Reinforcement Learning at Scale
Horizon: Deep Reinforcement Learning at ScaleHorizon: Deep Reinforcement Learning at Scale
Horizon: Deep Reinforcement Learning at ScaleDatabricks
 
李俊良/Feature Engineering in Machine Learning
李俊良/Feature Engineering in Machine Learning李俊良/Feature Engineering in Machine Learning
李俊良/Feature Engineering in Machine Learning台灣資料科學年會
 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Balázs Hidasi
 

Similar a An introduction to reinforcement learning (20)

24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx
 
Introduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement LearningIntroduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement Learning
 
Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement Learning
 
Q-Learning Algorithm: A Concise Introduction [Shakeeb A.]
Q-Learning Algorithm: A Concise Introduction [Shakeeb A.]Q-Learning Algorithm: A Concise Introduction [Shakeeb A.]
Q-Learning Algorithm: A Concise Introduction [Shakeeb A.]
 
Intro to Reinforcement Learning
Intro to Reinforcement LearningIntro to Reinforcement Learning
Intro to Reinforcement Learning
 
Demystifying deep reinforement learning
Demystifying deep reinforement learningDemystifying deep reinforement learning
Demystifying deep reinforement learning
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine Sweeper
 
Review :: Demystifying deep reinforcement learning (written by Tambet Matiisen)
Review :: Demystifying deep reinforcement learning (written by Tambet Matiisen)Review :: Demystifying deep reinforcement learning (written by Tambet Matiisen)
Review :: Demystifying deep reinforcement learning (written by Tambet Matiisen)
 
TD Learning Webinar
TD Learning WebinarTD Learning Webinar
TD Learning Webinar
 
ODSC East 2020 : Continuous_learning_systems
ODSC East 2020 : Continuous_learning_systemsODSC East 2020 : Continuous_learning_systems
ODSC East 2020 : Continuous_learning_systems
 
Continuous Learning Systems: Building ML systems that learn from their mistakes
Continuous Learning Systems: Building ML systems that learn from their mistakesContinuous Learning Systems: Building ML systems that learn from their mistakes
Continuous Learning Systems: Building ML systems that learn from their mistakes
 
Building Continuous Learning Systems
Building Continuous Learning SystemsBuilding Continuous Learning Systems
Building Continuous Learning Systems
 
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
 
cs330_2021_lifelong_learning.pdf
cs330_2021_lifelong_learning.pdfcs330_2021_lifelong_learning.pdf
cs330_2021_lifelong_learning.pdf
 
reiniforcement learning.ppt
reiniforcement learning.pptreiniforcement learning.ppt
reiniforcement learning.ppt
 
semi supervised Learning and Reinforcement learning (1).pptx
 semi supervised Learning and Reinforcement learning (1).pptx semi supervised Learning and Reinforcement learning (1).pptx
semi supervised Learning and Reinforcement learning (1).pptx
 
Introduction to Deep learning and H2O for beginner's
Introduction to Deep learning and H2O for beginner'sIntroduction to Deep learning and H2O for beginner's
Introduction to Deep learning and H2O for beginner's
 
Horizon: Deep Reinforcement Learning at Scale
Horizon: Deep Reinforcement Learning at ScaleHorizon: Deep Reinforcement Learning at Scale
Horizon: Deep Reinforcement Learning at Scale
 
李俊良/Feature Engineering in Machine Learning
李俊良/Feature Engineering in Machine Learning李俊良/Feature Engineering in Machine Learning
李俊良/Feature Engineering in Machine Learning
 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017
 

Más de Subrat Panda, PhD

Más de Subrat Panda, PhD (9)

Role of technology in agriculture courses by srmist & the hindu
Role of technology in agriculture courses by srmist & the hinduRole of technology in agriculture courses by srmist & the hindu
Role of technology in agriculture courses by srmist & the hindu
 
Journey so far
Journey so farJourney so far
Journey so far
 
AI in security
AI in securityAI in security
AI in security
 
AI in Retail
AI in RetailAI in Retail
AI in Retail
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial Industry
 
ML Workshop at SACON 2018
ML Workshop at SACON 2018ML Workshop at SACON 2018
ML Workshop at SACON 2018
 
AI and Deep Learning
AI and Deep Learning AI and Deep Learning
AI and Deep Learning
 
AI and The future of work
AI and The future of work AI and The future of work
AI and The future of work
 
AI in retail
AI in retailAI in retail
AI in retail
 

Último

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 

Último (20)

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

An introduction to reinforcement learning

  • 1. Reinforcement Learning Dr. Subrat Panda Head of AI and Data Sciences, Capillary Technologies
  • 2. Introduction about me ● BTech ( 2002) , PhD (2009) – CSE, IIT Kharagpur ● Synopsys (EDA), IBM (CPU), NVIDIA (GPU), Taro (Full Stack Engineer), Capillary (Principal Architect - AI) ● Applying AI to Retail ● Co-Founded IDLI (for social good) with Prof. Amit Sethi (IIT Bombay), Jacob Minz (Synopsys) and Biswa Gourav Singh (Capillary) ● https://www.facebook.com/groups/idliai/ ● Linked In - https://www.linkedin.com/in/subratpanda/ ● Facebook - https://www.facebook.com/subratpanda ● Twitter - @subratpanda Email - subrat.panda@capillarytech.com
  • 3. Overview • Supervised Learning: Immediate feedback (labels provided for every input). • Unsupervised Learning: No feedback (No labels provided). • Reinforcement Learning: Delayed scalar feedback (a number called reward). • RL deals with agents that must sense & act upon their environment. This combines classical AI and machine learning techniques. It is probably the most comprehensive problem setting. • Examples: • Robot-soccer • Share Investing • Learning to walk/fly/ride a vehicle • Scheduling • AlphaGo / Super Mario
  • 4. Machine Learning – http://techleer.com
  • 5.
  • 6. Some Definitions and assumptions MDP - Markov Decision Problem The framework of the MDP has the following elements: 1. state of the system, 2. actions, 3. transition probabilities, 4. transition rewards, 5. a policy, and 6. a performance metric. We assume that the system is modeled by a so-called abstract stochastic process called the Markov chain. RL is generally used to solve the so-called Markov decision problem (MDP). The theory of RL relies on dynamic programming (DP) and artificial intelligence (AI).
  • 7. The Big Picture Your action influences the state of the world which determines its reward
  • 8. Some Complications • The outcome of your actions may be uncertain • You may not be able to perfectly sense the state of the world • The reward may be stochastic. • Reward is delayed (i.e. finding food in a maze) • You may have no clue (model) about how the world responds to your actions. • You may have no clue (model) of how rewards are being paid off. • The world may change while you try to learn it. • How much time do you need to explore uncharted territory before you exploit what you have learned? • For large-scale systems with millions of states, it is impractical to store these values. This is called the curse of dimensionality. DP breaks down on problems which suffer from any one of these curses because it needs all these values.
  • 9. The Task • To learn an optimal policy that maps states of the world to actions of the agent. i.e., if this patch of room is dirty, I clean it. If my battery is empty, I recharge it. • What is it that the agent tries to optimize? Answer: the total future discounted reward: Note: immediate reward is worth more than future reward. What would happen to mouse in a maze with gamma = 0 ? - Greedy
  • 10. Value Function • Let’s say we have access to the optimal value function that computes the total future discounted reward • What would be the optimal policy ? • Answer: we choose the action that maximizes: • We assume that we know what the reward will be if we perform action “a” in state “s”: • We also assume we know what the next state of the world will be if we perform action “a” in state “s”:
  • 11. Example I • Consider some complicated graph, and we would like to find the shortest path from a node Si to a goal node G. • Traversing an edge will cost you “length edge” dollars. • The value function encodes the total remaining distance to the goal node from any node s, i.e. V(s) = “1 / distance” to goal from s. • If you know V(s), the problem is trivial. You simply choose the node that has highest V(s). Si G
  • 12. Example II Find your way to the goal.
  • 13. Q-Function • One approach to RL is then to try to estimate V*(s). • However, this approach requires you to know r(s,a) and delta(s,a). • This is unrealistic in many real problems. What is the reward if a robot is exploring mars and decides to take a right turn? • Fortunately we can circumvent this problem by exploring and experiencing how the world reacts to our actions. We need to learn r & delta. • We want a function that directly learns good state-action pairs, i.e. what action should I take in this state. We call this Q(s,a). • Given Q(s,a) it is now trivial to execute the optimal policy, without knowing r(s,a) and delta(s,a). We have: Bellman Equation:
  • 15. Q-Learning • This still depends on r(s,a) and delta(s,a). • However, imagine the robot is exploring its environment, trying new actions as it goes. • At every step it receives some reward “r”, and it observes the environment change into a new state s’ for action a. How can we use these observations, (s,a,s’,r) to learn a model? s’=st+1
  • 16. Q-Learning • This equation continually estimates Q at state s consistent with an estimate of Q at state s’, one step in the future: temporal difference (TD) learning. • Note that s’ is closer to goal, and hence more “reliable”, but still an estimate itself. • Updating estimates based on other estimates is called bootstrapping. • We do an update after each state-action pair. I.e., we are learning online! • We are learning useful things about explored state-action pairs. These are typically most useful because they are likely to be encountered again. • Under suitable conditions, these updates can actually be proved to converge to the real answer. s’=st+1
  • 17. Example Q-Learning Q-learning propagates Q-estimates 1-step backwards
  • 18. Exploration / Exploitation • It is very important that the agent does not simply follow the current policy when learning Q. (off-policy learning).The reason is that you may get stuck in a suboptimal solution. i.e. there may be other solutions out there that you have never seen. • Hence it is good to try new things so now and then, e.g. If T is large lots of exploring, if T is small follow current policy. One can decrease T over time.
  • 19. Improvements • One can trade-off memory and computation by caching (s,s’,r) for observed transitions. After a while, as Q(s’,a’) has changed, you can “replay” the update: •One can actively search for state-action pairs for which Q(s,a) is expected to change a lot (prioritized sweeping). • One can do updates along the sampled path much further back than just one step ( learning).
  • 20. Extensions • To deal with stochastic environments, we need to maximize expected future discounted reward: • Often the state space is too large to deal with all states. In this case we need to learn a function: • Neural network with back-propagation have been quite successful. • For instance, TD-Gammon is a back-gammon program that plays at expert level. state-space very large, trained by playing against itself, uses NN to approximate value function, uses TD(lambda) for learning.
  • 21. Real Life Examples Using RL
  • 22. Deep RL - Deep Q Learning
  • 23. Tools and EcoSystem ● OpenAI Gym- Python based rich AI simulation environment ● TensorFlow - TensorLayer providing popular RL modules ● Pytorch - Open Sourced by Facebook ● Keras - On top of TensorFlow ● DeepMind Lab - Google 3D platform
  • 24. Small Demo - on Youtube
  • 25. Application Areas of RL - Resources management in computer clusters - Traffic Light Control - Robotics - Personalized Recommendations - Chemistry - Bidding and Advertising - Games
  • 26. What you need to know to apply RL? - Understanding your problem - A simulated environment - MDP - Algorithms
  • 27. Learning how to run? - Good example by deepsense.ai
  • 28. Conclusion • Reinforcement learning addresses a very broad and relevant question: How can we learn to survive in our environment? • We have looked at Q-learning, which simply learns from experience. No model of the world is needed. • We made simplifying assumptions: e.g. state of the world only depends on last state and action. This is the Markov assumption. The model is called a Markov Decision Process (MDP). • We assumed deterministic dynamics, reward function, but the world really is stochastic. • There are many extensions to speed up learning. • There have been many successful real world applications.
  • 29. References • Thanks to Google and AI research Community • https://www.slideshare.net/AnirbanSantara/an-introduction-to-reinforcement-learning-the-doors-to-agi - Good Intro PPT • https://skymind.ai/wiki/deep-reinforcement-learning - Good blog • https://medium.com/@BonsaiAI/why-reinforcement-learning-might-be-the-best-ai-technique-for-complex-industrial-systems- fde8b0ebd5fb • https://deepsense.ai/learning-to-run-an-example-of-reinforcement-learning/ • https://en.wikipedia.org/wiki/Reinforcement_learning • https://www.ics.uci.edu/~welling/teaching/ICS175winter12/RL.ppt - Good Intro PPT which I have adopted in whole • https://people.eecs.berkeley.edu/~jordan/MLShortCourse/reinforcement-learning.ppt • https://www.cse.iitm.ac.in/~ravi/courses/Reinforcement%20Learning.html - Detailed Course • https://web.mst.edu/~gosavia/tutorial.pdf - short Tutorial • Book - Reinforcement Learning: An Introduction (English, Hardcover, Andrew G Barto Richard S Sutton Barto Sutton) • https://www.techleer.com/articles/203-machine-learning-algorithm-backbone-of-emerging-technologies/ • http://ruder.io/transfer-learning/ • Exploration and Exploitation - https://artint.info/html/ArtInt_266.html , http://www.cs.cmu.edu/~rsalakhu/10703/Lecture_Exploration.pdf • https://hub.packtpub.com/tools-for-reinforcement-learning/ • http://adventuresinmachinelearning.com/reinforcement-learning-tensorflow/ • https://towardsdatascience.com/applications-of-reinforcement-learning-in-real-world-1a94955bcd12
  • 30. Acknowledgements - Anirban Santara - For his Intro Session on IDLI and support - Prof. Ravindran - Inspiration - Palash Arora from Capillary, Ashwin Krishna. - Techgig Team for the opportunity - Sumandeep and Biswa from Capillary for review and discussions. - https://www.ics.uci.edu/~welling/teaching/ICS175winter12/RL.ppt - Good Intro PPT which I have adopted in whole Thank You!!