Introduction to Deep Reinforcement Learning
Moustafa Alzantot
PhD Student, Networked and Embedded Systems Lab, UCLA
Oct 22, 2017
Machine Learning
Computer programs can improve their performance on a given task
without being explicitly programmed for it, just by analyzing data!
Types of Machine Learning
• Supervised Learning
• Given a set of labeled examples, predict the output label for new, unseen inputs.
• Unsupervised Learning
• Given an unlabeled dataset, understand the structure of the data (e.g., clustering, dimensionality reduction).
• Reinforcement Learning
• Branch of machine learning concerned with acting optimally in the face of uncertainty (i.e., learning to act!)
Reinforcement Learning
• The agent observes the environment state and performs an action.
• In response, the environment state changes and the agent receives a reward.
• The agent’s goal is to pick actions that maximize the total reward received from the environment.
[Diagram: the Agent sends actions a to the Environment; the Environment returns state s and reward r to the Agent.]
Source: Pieter Abbeel, UC Berkeley CS 188
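The observe, act, receive-reward loop described above can be sketched in a few lines of Python. This is an illustration only, not code from the talk: the `CoinFlipEnv` environment, its `reset`/`step` methods, and the random policy are hypothetical names chosen for the example.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the next coin flip for a few steps."""

    def __init__(self, horizon=10, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is simply the current time step

    def step(self, action):
        coin = self.rng.choice([0, 1])
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done


def run_episode(env, policy):
    """Run one episode and return the total reward collected."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                  # agent picks an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward                  # agent accumulates reward
    return total_reward


total = run_episode(CoinFlipEnv(), policy=lambda s: random.choice([0, 1]))
```

A learning agent would replace the random `policy` with one that improves from the rewards it observes.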
Examples
Ex: Grid World
• A maze-like problem
• The agent lives in a grid
• Walls block the agent’s path
• Noisy movement: actions do not always go as planned
• 80% of the time, the action North takes the agent North (if there is no wall there)
• 10% of the time, North takes the agent West; 10% East
• If there is a wall in the direction the agent would have been taken, the agent stays put
• The agent receives rewards each time step
• Small “living” reward each step (can be negative)
• Big rewards come at the end (good or bad)
• Goal: maximize sum of rewards
Source: Pieter Abbeel, UC Berkeley CS 188
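The noisy movement rule can be written as a small transition function. The sketch below assumes a grid given as a list of strings with `#` marking walls, treats the grid boundary as a wall, and generalizes the 80/10/10 rule from North to the other three directions; the grid layout and all function names are made up for illustration.

```python
import random

# For each intended action: the two perpendicular "slip" directions.
SLIPS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}


def transition_probs(grid, pos, action):
    """Return {next_pos: probability} for taking `action` at `pos`."""
    probs = {}
    left, right = SLIPS[action]
    for direction, p in ((action, 0.8), (left, 0.1), (right, 0.1)):
        dr, dc = MOVES[direction]
        r, c = pos[0] + dr, pos[1] + dc
        in_bounds = 0 <= r < len(grid) and 0 <= c < len(grid[0])
        blocked = (not in_bounds) or grid[r][c] == "#"
        nxt = pos if blocked else (r, c)  # blocked moves leave the agent in place
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs


def sample_next(grid, pos, action, rng=random):
    """Sample one next position according to the transition probabilities."""
    probs = transition_probs(grid, pos, action)
    states = list(probs)
    return rng.choices(states, weights=[probs[s] for s in states])[0]


GRID = ["....",
        ".#..",
        "...."]
probs_open = transition_probs(GRID, (2, 0), "N")     # the cell to the north is free
probs_blocked = transition_probs(GRID, (2, 1), "N")  # a wall lies to the north
```

In `probs_blocked` the 80% mass stays on the current cell, matching the "agent stays put" rule.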
Ex: Grid World
[Figure panels: Deterministic Grid World; Stochastic Grid World]
Markov Decision Process
• An MDP is used to describe an RL environment.
• An MDP is defined by:
• A set of states s ∈ S
• A set of actions a ∈ A
• A transition function T(s, a, s’)
• Probability that a from s leads to s’, i.e., P(s’ | s, a)
• Also called the model or the dynamics
• A reward function R(s, a, s’)
• Sometimes just R(s) or R(s’)
• A discount factor γ
Source: Pieter Abbeel, UC Berkeley CS 188
Discounting
It’s reasonable to maximize the sum of rewards
It’s also reasonable to prefer rewards now to rewards later
One solution: values of rewards decay exponentially, with discount factor 0 < γ < 1
Worth now: 1; worth next step: γ; worth in two steps: γ²
Why discount?
• Sooner rewards probably have higher utility than later rewards.
• Controls the preference between different solutions.
• Avoids numerical issues (total reward going to infinity).
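A short worked example makes the exponential decay concrete. The helper below is a sketch with a made-up name; it computes the discounted sum r_0 + γ r_1 + γ² r_2 + … for a finite reward sequence.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, accumulated from the last reward backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1.75
```

With γ < 1 and rewards bounded by some R_max, the infinite sum is bounded by R_max / (1 - γ), which is why discounting avoids unbounded totals.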
Optimal policy
• No penalty at each step
• Reward for each step: -0.1
• Reward for each step: -2
• Reward for each step: +0.1
Remember MDPs
• An MDP is defined by:
• A set of states s ∈ S
• A set of actions a ∈ A
• A transition function T(s, a, s’)
• Probability that a from s leads to s’, i.e., P(s’ | s, a)
• Also called the model or the dynamics
• A reward function R(s, a, s’)
• Sometimes just R(s) or R(s’)
• A discount factor γ
Solving MDPs
• If the MDP (the environment model) is known, there are algorithms that are guaranteed to find the optimal policy.
Value function
The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally
The value (utility) of a q-state (s, a):
Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
The optimal policy:
π*(s) = optimal action from state s
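These three quantities are tied together by the standard Bellman optimality relations, written here with the transition function T, the reward function R, and the discount factor γ of the MDP. They are textbook identities added for reference, not formulas recovered from the slide:

```latex
\begin{align*}
V^*(s)    &= \max_a Q^*(s, a) \\
Q^*(s, a) &= \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right] \\
\pi^*(s)  &= \arg\max_a Q^*(s, a)
\end{align*}
```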
GridWorld: Q-Values
Noise = 0.2
Discount = 0.9
Living reward = 0
Source: Pieter Abbeel, UC Berkeley CS 188
Value Iteration
• Theorem: value iteration will converge to unique optimal values
• Basic idea: approximations get refined towards optimal values
• Policy may converge long before values do
• Alpaydin: Introduction to Machine Learning, 3rd edition
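A minimal sketch of value iteration for a known MDP follows. It repeatedly applies the update V(s) ← max_a Σ_s' T(s, a, s') [R(s, a, s') + γ V(s')] until the values stop changing. The dictionary layout for `T` and the three-state example MDP are assumptions made for this illustration.

```python
def value_iteration(states, actions, T, gamma=0.9, tol=1e-8, max_iters=10_000):
    """T[(s, a)] is a list of (prob, next_state, reward) triples.
    States with no entries in T are treated as terminal (value 0)."""
    V = {s: 0.0 for s in states}
    for _ in range(max_iters):
        new_V = {}
        delta = 0.0
        for s in states:
            q_values = [
                sum(p * (r + gamma * V[s2]) for p, s2, r in T[(s, a)])
                for a in actions if (s, a) in T
            ]
            new_V[s] = max(q_values) if q_values else 0.0
            delta = max(delta, abs(new_V[s] - V[s]))
        V = new_V
        if delta < tol:  # values have stopped changing
            break
    return V


def greedy_policy(states, actions, T, V, gamma=0.9):
    """Extract the policy that acts greedily with respect to V."""
    policy = {}
    for s in states:
        legal = [a for a in actions if (s, a) in T]
        if legal:
            policy[s] = max(
                legal,
                key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in T[(s, a)]),
            )
    return policy


# Hypothetical chain A -> B -> END, plus a risky shortcut from A.
states = ["A", "B", "END"]
actions = ["safe", "risky"]
T = {
    ("A", "safe"): [(1.0, "B", 0.0)],
    ("A", "risky"): [(0.5, "END", 1.0), (0.5, "END", -1.0)],
    ("B", "safe"): [(1.0, "END", 1.0)],
}
V = value_iteration(states, actions, T)
policy = greedy_policy(states, actions, T, V)
```

In this example the safe route is worth 0.9 from A while the shortcut has expected value 0, so the greedy policy chooses `safe` in both A and B.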
Policy Iteration
• Value iteration repeatedly refines the value function estimates until they converge.
• The optimal policy often converges before the value function does.
• The final goal is to get an optimal policy.
• Policy iteration: iterates to improve the policy directly at each step.
• Alpaydin: Introduction to Machine Learning, 3rd edition
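Policy iteration can be sketched as two alternating phases: evaluate the current policy, then improve it greedily, stopping when the policy no longer changes. The code below is one common formulation under assumed data structures (the same `(prob, next_state, reward)` layout as a toy MDP), not necessarily the variant shown on the slide.

```python
def q_value(T, V, s, a, gamma):
    """Expected one-step return of taking a in s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in T[(s, a)])


def policy_iteration(states, actions, T, gamma=0.9, eval_tol=1e-8):
    legal = {s: [a for a in actions if (s, a) in T] for s in states}
    # Start from an arbitrary policy: the first legal action in each state.
    policy = {s: acts[0] for s, acts in legal.items() if acts}
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate the fixed-policy Bellman update.
        while True:
            delta = 0.0
            for s, a in policy.items():
                v = q_value(T, V, s, a, gamma)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < eval_tol:
                break
        # Policy improvement: switch to a strictly better action if one exists.
        stable = True
        for s in policy:
            best = max(legal[s], key=lambda a: q_value(T, V, s, a, gamma))
            if q_value(T, V, s, best, gamma) > q_value(T, V, s, policy[s], gamma) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, V


# Hypothetical chain A -> B -> END, plus a risky shortcut from A.
states = ["A", "B", "END"]
actions = ["risky", "safe"]  # "risky" first, so the initial policy is suboptimal
T = {
    ("A", "safe"): [(1.0, "B", 0.0)],
    ("A", "risky"): [(0.5, "END", 1.0), (0.5, "END", -1.0)],
    ("B", "safe"): [(1.0, "END", 1.0)],
}
policy, V = policy_iteration(states, actions, T)
```

Here a single improvement step is enough: the initial `risky` choice in A is replaced by `safe`, and the next pass finds nothing left to change.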
Reinforcement Learning ?!
Model-Based Learning
Model-Based Idea:
Learn an approximate model based on experiences
Solve for values as if the learned model were correct
Step 1: Learn empirical MDP model
Count outcomes s’ for each s, a
Normalize to give an estimate of T(s, a, s’)
Discover each R(s, a, s’) when we experience (s, a, s’)
Step 2: Solve the learned MDP
For example, use value iteration, as before
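Step 1 amounts to counting and normalizing. The sketch below assumes experience is logged as `(s, a, s_next, r)` tuples and averages the observed rewards for each transition; the function name and the sample data are made up for illustration.

```python
from collections import Counter, defaultdict


def estimate_model(experiences):
    """Build an empirical model from (s, a, s_next, r) tuples.

    Returns (T_hat, R_hat) where T_hat[(s, a)][s_next] is the observed
    frequency of s_next and R_hat[(s, a, s_next)] is the mean observed reward.
    """
    counts = defaultdict(Counter)
    reward_sums = defaultdict(float)
    for s, a, s2, r in experiences:
        counts[(s, a)][s2] += 1           # count outcomes s' for each (s, a)
        reward_sums[(s, a, s2)] += r
    T_hat, R_hat = {}, {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        T_hat[(s, a)] = {s2: n / total for s2, n in outcomes.items()}  # normalize
        for s2, n in outcomes.items():
            R_hat[(s, a, s2)] = reward_sums[(s, a, s2)] / n
    return T_hat, R_hat


experiences = [
    ("A", "north", "B", -0.1),
    ("A", "north", "B", -0.1),
    ("A", "north", "A", -0.1),
    ("A", "north", "B", -0.1),
]
T_hat, R_hat = estimate_model(experiences)
```

With three of the four samples landing in B, the estimate for that transition is 0.75. The learned `T_hat` and `R_hat` can then be handed to a planner such as value iteration (Step 2).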
Model-Free Learning
• Directly learn the V and Q value functions without estimating T and R.
• Remember: the value update is a weighted average over next states s’, with weights given by T.
Key question: how can we do this update to V without knowing T and R?
In other words, how do we take a weighted average without knowing the weights?
Q-Learning
• Use temporal-difference (TD) learning to learn Q(s, a) from observed samples.
• After convergence, extract the optimal policy!
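The standard tabular Q-learning update is Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)], where α is a learning rate. The sketch below applies it on a hypothetical five-cell corridor with a reward of 1 at the right end; the environment, the purely random behaviour policy, and the hyperparameters are all assumptions chosen to keep the example small.

```python
import random
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, done=False):
    """One temporal-difference update toward the sampled target."""
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def extract_policy(Q, states, actions):
    """Greedy policy with respect to the learned Q-values."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}


def train_corridor(episodes=500, n_states=5, seed=0):
    """Learn Q-values on a corridor 0..n_states-1 with the goal at the right end."""
    rng = random.Random(seed)
    actions = ["left", "right"]
    Q = defaultdict(float)
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            a = rng.choice(actions)  # random behaviour; Q-learning is off-policy
            s_next = max(s - 1, 0) if a == "left" else s + 1
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0
            q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, done=done)
            s = s_next
    return Q, actions


Q, actions = train_corridor()
policy = extract_policy(Q, range(4), actions)
```

Even though the agent behaves randomly during training, the greedy policy extracted afterwards moves right in every cell, because the update bootstraps from the best next action rather than the one actually taken.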
How to Explore?
Several schemes for forcing exploration
Simplest: random actions (ε-greedy)
Every time step, flip a coin
With (small) probability ε, act randomly
With (large) probability 1-ε, act on current policy
Problems with random actions?
You do eventually explore the space, but keep thrashing around once learning is done
One solution: lower ε over time
Another solution: exploration functions
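The coin-flip rule and the "lower ε over time" fix can be sketched as two small functions. The linear annealing schedule and its default values are assumptions; other decay schedules are equally common.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from a dict mapping action -> estimated value."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))   # explore: act randomly
    return max(q_values, key=q_values.get)  # exploit: act on current estimates


def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from `start` to `end`, then hold it at `end`."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)


q_values = {"left": 0.2, "right": 0.7}
action = epsilon_greedy(q_values, epsilon=decayed_epsilon(step=2_500))
```

Early in training ε is close to 1 and the agent mostly explores; as ε shrinks the agent increasingly follows its learned values, which reduces the thrashing mentioned above.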
Demo: MountainCar using Q-Learning
https://www.youtube.com/watch?v=ByOdncJE5bE
Approximate Q Learning
• Basic Q-Learning keeps a table of all q-values
• In realistic situations, we cannot possibly learn about every single state!
• Too many states to visit them all in training
• Too many states to hold the q-tables in memory
Approximate Q-Learning
• Using a feature representation, we can write a q function (or value function) for any state using a few weights:
Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + … + w_n f_n(s, a)
• Use optimization to find the weights that minimize the MSE between predicted and observed Q-values.
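Assuming a linear form, Q(s, a) = Σ_i w_i f_i(s, a), minimizing the squared error by gradient descent gives the update w_i ← w_i + α [r + γ max_a' Q(s', a') - Q(s, a)] f_i(s, a). The sketch below implements one such step; the function names and the feature values are made up for illustration.

```python
def q_value(weights, features):
    """Linear Q-value: dot product of weights and features f(s, a)."""
    return sum(w * f for w, f in zip(weights, features))


def approx_q_update(weights, features, reward, next_feature_sets, alpha=0.01, gamma=0.9):
    """One gradient step on the squared error between Q(s, a) and the sampled target.

    next_feature_sets holds one feature vector f(s', a') per action available
    in the next state; pass an empty list when the next state is terminal.
    """
    best_next = max((q_value(weights, f) for f in next_feature_sets), default=0.0)
    td_error = reward + gamma * best_next - q_value(weights, features)
    return [w + alpha * td_error * f for w, f in zip(weights, features)]


weights = [0.0, 0.0]
weights = approx_q_update(
    weights,
    features=[1.0, 2.0],   # f(s, a) for the transition just observed
    reward=1.0,
    next_feature_sets=[],  # terminal next state
    alpha=0.1,
)
```

Because the weights are shared across states, one update changes the estimate for every state with similar features, which is what lets the agent generalize to states it has never visited.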
Questions:
How to approximate the Q(s, a) function?
How to compute these features?
Deep Q Networks
Remember:
• Universal approximation theorem: a neural network with one hidden layer can approximate any continuous function on a bounded (compact) domain to arbitrary accuracy, given enough hidden units.
• Deep neural networks are good feature extractors!
Deep Q-Network: Atari
Deep Q-Network training
Experience Replay Trick
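The slide title alone does not describe the mechanism. As commonly described for DQN, experience replay stores past transitions in a fixed-size buffer and trains on random minibatches drawn from it, which reduces the correlation between consecutive training samples. A minimal sketch follows; the class and method names are hypothetical, not taken from the talk.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a uniform random minibatch and return it as five parallel tuples."""
        batch = self.rng.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=(t == 4))
states, actions, rewards, next_states, dones = buffer.sample(batch_size=2)
```

In a full DQN training loop, each sampled minibatch would be used for one gradient step on the Q-network.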
DQN Results in Atari
Resources
• Pieter Abbeel, UC Berkeley CS 188
• Alpaydin: Introduction to Machine Learning, 3rd edition
• David Silver, UCL Reinforcement Learning Course
• Yandex: Practical RL
• MIT: Deep Learning for Self-Driving Cars
• Stanford CS234: Reinforcement Learning
Thanks
Send any questions to
malzantot@ucla.edu

Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
WomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneWomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyone
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 

Introduction to Deep Reinforcement Learning

  • 1. Introduction to Deep Reinforcement Learning Moustafa Alzantot PhD Student, Networked and Embedded Systems Lab, UCLA Oct 22, 2017
  • 2. Machine Learning Computer programs can improve their performance on a given task without being explicitly programmed for it, just by analyzing data!
  • 3. Types of Machine Learning • Supervised Learning • Given a set of labeled examples, predict the output label for new unseen inputs. • Unsupervised Learning • Given an unlabeled dataset, understand the structure of the data (e.g. clustering, dimensionality reduction). • Reinforcement Learning • Branch of machine learning concerned with acting optimally in the face of uncertainty (i.e. learning to do!)
  • 4. Reinforcement Learning • The agent observes the environment state and performs some action. • In response, the environment state changes and the agent receives a reward. • The goal of the agent is to pick actions that maximize the total reward received from the environment. (Diagram: the Agent sends Actions a to the Environment, which returns State s and Reward r.) Source: Pieter Abbeel, UC Berkeley CS 188
  • 6. Ex: Grid World • A maze-like problem • The agent lives in a grid • Walls block the agent’s path • Noisy movement: actions do not always go as planned • 80% of the time, the action North takes the agent North (if there is no wall there) • 10% of the time, North takes the agent West; 10% East • If there is a wall in the direction the agent would have been taken, the agent stays put • The agent receives rewards each time step • Small “living” reward each step (can be negative) • Big rewards come at the end (good or bad) • Goal: maximize sum of rewards Source: Pieter Abbeel, UC Berkeley CS 188
  • 7. Ex: Grid World Deterministic Grid World Stochastic Grid World
  • 8. Markov Decision Process • MDP is used to describe RL environments. • MDP is defined by: • A set of states s ∈ S • A set of actions a ∈ A • A transition function T(s, a, s’) • Probability that a from s leads to s’, i.e., P(s’| s, a) • Also called the model or the dynamics • A reward function R(s, a, s’) • Sometimes just R(s) or R(s’) • Discount factor γ (Diagram: the Agent sends Actions a to the Environment, which returns State s and Reward r.) Source: Pieter Abbeel, UC Berkeley CS 188
  • 9. Discounting • It’s reasonable to maximize the sum of rewards • It’s also reasonable to prefer rewards now to rewards later • One solution: values of rewards decay exponentially by a discount factor γ, with 0 < γ < 1 • A reward is worth 1 now, γ at the next step, and γ² in two steps • Why discount? • Sooner rewards will probably have higher utility than later rewards • It controls the preference between different solutions • It avoids numerical issues (total rewards going to infinity)
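The exponential decay on the discounting slide can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name `discounted_return` and the sample reward list are assumptions, not part of the original deck.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence.

    Hypothetical helper for illustration. Working backwards lets each step
    reuse the discounted total of everything that comes after it.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5 are worth 1 + 0.5 + 0.25.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

With γ close to 1 the agent weighs distant rewards almost as much as immediate ones; with γ close to 0 it becomes short-sighted.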
  • 10. Optimal policy under different per-step rewards • No penalty at each step • Reward for each step: -0.1 • Reward for each step: -2 • Reward for each step: +0.1
  • 11. Remember MDPs • MDP is defined by: • A set of states s ∈ S • A set of actions a ∈ A • A transition function T(s, a, s’) • Probability that a from s leads to s’, i.e., P(s’| s, a) • Also called the model or the dynamics • A reward function R(s, a, s’) • Sometimes just R(s) or R(s’) • Discount factor γ (Diagram: the Agent sends Actions a to the Environment, which returns State s and Reward r.)
  • 12. Solving MDPs • If the MDP (environment model) is known, there are ways that are guaranteed to find the optimal policy.
  • 13. Value-function • The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally • The value (utility) of a q-state (s, a): Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally • The optimal policy: π*(s) = optimal action from state s
  • 14. GridWorld: Q-Values Noise = 0.2 Discount = 0.9 Living reward = 0 Source: Pieter Abbeel, UC Berkeley CS 188
  • 15. Value Iteration • Theorem: value iteration will converge to unique optimal values • Basic idea: approximations get refined towards optimal values • Policy may converge long before values do • Source: Alpaydin, Introduction to Machine Learning, 3rd edition
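A minimal sketch of value iteration on a known MDP may help here. The two-state `TOY_MDP`, the helper names, and the stopping tolerance are all hypothetical choices made for illustration; the update itself is the standard Bellman backup, applied until the values stop changing.

```python
# Hypothetical toy MDP: transitions[state][action] is a list of
# (probability, next_state, reward) outcomes.
TOY_MDP = {
    "A": {
        "stay": [(1.0, "A", 0.0)],
        "go": [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    },
    "B": {"stay": [(1.0, "B", 2.0)]},
}


def backup(outcomes, values, gamma):
    """Expected reward plus discounted value of the successor state."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)


def value_iteration(transitions, gamma, tol=1e-10, max_iters=10_000):
    """Repeat Bellman backups until the largest change falls below tol."""
    values = {s: 0.0 for s in transitions}
    for _ in range(max_iters):
        new_values = {
            s: max(backup(o, values, gamma) for o in actions.values())
            for s, actions in transitions.items()
        }
        delta = max(abs(new_values[s] - values[s]) for s in transitions)
        values = new_values
        if delta < tol:
            break
    return values


def greedy_policy(transitions, values, gamma):
    """Pick, in every state, the action with the best one-step backup."""
    return {
        s: max(actions, key=lambda a: backup(actions[a], values, gamma))
        for s, actions in transitions.items()
    }


values = value_iteration(TOY_MDP, gamma=0.9)
print(values, greedy_policy(TOY_MDP, values, gamma=0.9))
```

In this toy problem the greedy policy settles on "go" in state A well before the values themselves finish converging, which is the behaviour the slide's last bullet points out.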
  • 16. Policy Iteration • Value iteration iterates to refine the value function estimates until they converge. • The optimal policy often converges before the value function. • The final goal is to get an optimal policy. • Policy iteration: iterates to redefine the policy at each step. • Source: Alpaydin, Introduction to Machine Learning, 3rd edition
  • 18. Model-Based Learning • Model-Based Idea: • Learn an approximate model based on experiences • Solve for values as if the learned model were correct • Step 1: Learn empirical MDP model • Count outcomes s’ for each s, a • Normalize to give an estimate of T̂(s, a, s’) • Discover each R̂(s, a, s’) when we experience (s, a, s’) • Step 2: Solve the learned MDP • For example, use value iteration, as before
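Step 1 of the model-based recipe (count outcomes, then normalize) could be sketched as follows. The function name `estimate_model` and the experience tuple layout `(s, a, s_next, reward)` are assumptions made for this example.

```python
from collections import Counter, defaultdict


def estimate_model(experiences):
    """Build empirical transition and reward estimates from samples.

    experiences: iterable of (s, a, s_next, reward) tuples (hypothetical
    layout). Returns (t_hat, r_hat) where t_hat[(s, a)][s_next] is the
    observed frequency and r_hat[(s, a, s_next)] is the last reward seen.
    """
    counts = defaultdict(Counter)
    r_hat = {}
    for s, a, s_next, reward in experiences:
        counts[(s, a)][s_next] += 1          # count outcomes s' for each (s, a)
        r_hat[(s, a, s_next)] = reward       # record the reward when experienced
    t_hat = {}
    for state_action, outcome_counts in counts.items():
        total = sum(outcome_counts.values())
        t_hat[state_action] = {
            s_next: n / total for s_next, n in outcome_counts.items()
        }
    return t_hat, r_hat


samples = [("A", "go", "B", 1.0)] * 3 + [("A", "go", "A", 0.0)]
t_hat, r_hat = estimate_model(samples)
print(t_hat[("A", "go")])
```

The learned `t_hat` and `r_hat` can then be handed to any planner for a known MDP, such as value iteration, which is Step 2 on the slide.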
  • 19. Model-Free Learning • Directly learn the V and Q value functions without estimating T and R. • Remember: Key question: how can we do this update to V without knowing T and R? In other words, how do we take a weighted average without knowing the weights?
  • 20. Q-Learning • Use temporal difference to learn Q(s, a) from observed samples. • After convergence, extract the optimal policy!
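The temporal-difference update behind Q-learning can be illustrated on a tiny problem. Everything concrete below is an assumption made for the sketch: a hypothetical 5-cell corridor with a reward of 1 at the right end, a purely random behaviour policy (which Q-learning tolerates because it is off-policy), and the chosen learning rate and discount.

```python
import random

N_CELLS = 5              # hypothetical corridor: cells 0..4, goal at cell 4
GOAL = N_CELLS - 1
ACTIONS = (-1, +1)       # move left or move right


def step(state, action):
    """Deterministic corridor dynamics; the walls clamp the position."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning from random-walk samples."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_CELLS) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Sample-based target: reward plus discounted best next Q-value.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            target = reward + gamma * best_next
            # Move the old estimate a fraction alpha towards the target.
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q


def extract_policy(q):
    """Greedy policy: the best-valued action in each non-terminal cell."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}


q = q_learning()
print(extract_policy(q))
```

Note that the update never consults T or R directly; each observed sample (s, a, r, s’) stands in for the expectation, and the running average does the weighting.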
  • 21. How to Explore? • Several schemes for forcing exploration • Simplest: random actions (ε-greedy) • Every time step, flip a coin • With (small) probability ε, act randomly • With (large) probability 1-ε, act on current policy • Problems with random actions? • You do eventually explore the space, but keep thrashing around once learning is done • One solution: lower ε over time • Another solution: exploration functions
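The coin flip described on the exploration slide, together with one possible way to lower ε over time, might look like this. The linear decay schedule and its default parameters are illustrative assumptions; the slide does not specify a schedule.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values for the current state.

    With probability epsilon act randomly, otherwise act greedily.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=1000):
    """Hypothetical linear schedule: start -> end over decay_steps, then flat."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


# Early on the agent explores almost always; later it mostly exploits.
for t in (0, 500, 1000, 5000):
    eps = decayed_epsilon(t)
    print(t, round(eps, 3), epsilon_greedy([0.1, 0.9, 0.3], eps))
```

Lowering ε addresses the "thrashing around once learning is done" problem, at the cost of having to pick a schedule.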
  • 22. Demo: MountainCar using Q-Learning https://www.youtube.com/watch?v=ByOdncJE5bE
  • 24. Approximate Q Learning • Basic Q-Learning keeps a table of all q-values • In realistic situations, we cannot possibly learn about every single state! • Too many states to visit them all in training • Too many states to hold the q-tables in memory
  • 25. Approximate Q-Learning • Using a feature representation, we can write a q function (or value function) for any state using a few weights: Q(s, a) = w₁f₁(s, a) + w₂f₂(s, a) + … + wₙfₙ(s, a) • Use optimization to find the weights that minimize MSE between predicted and observed Q-values. • Questions: How to approximate the Q(s, a) function? How to compute these features?
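One common way to carry out the optimization on this slide is a gradient step on the squared error between the predicted Q-value and the sampled target. The sketch below assumes a linear Q function over hand-written feature vectors; the function names and the example numbers are hypothetical.

```python
def q_value(weights, features):
    """Linear Q estimate: dot product of weights and the features of (s, a)."""
    return sum(w * f for w, f in zip(weights, features))


def approximate_q_update(weights, features, reward, next_feature_sets,
                         alpha, gamma):
    """One gradient step on the squared TD error for a linear Q function.

    features: feature vector of the (s, a) pair just taken.
    next_feature_sets: one feature vector per action available in s';
    an empty list means s' is terminal.
    """
    best_next = max(
        (q_value(weights, f) for f in next_feature_sets), default=0.0
    )
    difference = reward + gamma * best_next - q_value(weights, features)
    # Each weight moves in proportion to its feature's activation.
    return [w + alpha * difference * f for w, f in zip(weights, features)]


weights = approximate_q_update(
    weights=[0.0, 0.0], features=[1.0, 2.0], reward=1.0,
    next_feature_sets=[], alpha=0.1, gamma=0.9,
)
print(weights)
```

Because the weights are shared across states, one update changes the Q estimate for every state-action pair with similar features, which is what lets the agent generalize to states it has never visited.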
  • 26. Deep Q Networks Remember: Universal approximation theorem: a neural network with one hidden layer and enough hidden units can approximate any continuous function on a bounded domain to arbitrary accuracy!
  • 27. Deep Q Networks Remember: Deep neural networks are good feature extractors!
  • 32. DQN Results in Atari
  • 33. Resources • Pieter Abbeel, UC Berkeley CS 188 • Alpaydin: Introduction to Machine Learning, 3rd edition • David Silver, UCL Reinforcement Learning Course • Yandex: Practical RL • MIT: Deep Learning for Self-Driving Cars • Stanford CS234: Reinforcement Learning
  • 34. Thanks Send any question to malzantot@ucla.edu