Introduction to Deep Reinforcement Learning
2019 · https://deeplearning.mit.edu · For the full list of references visit: https://hcai.mit.edu/references
Deep Reinforcement Learning (Deep RL)
• What is it? Framework for learning to solve sequential decision
making problems.
• How? Trial and error in a world that provides occasional rewards
• Deep? Deep RL = RL + Neural Networks
[306, 307]
[Images: Deep Learning | Deep RL]
Types of Learning
• Supervised Learning
• Semi-Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
It’s all “supervised” by a loss function!
*Someone has to say what’s good and what’s bad (see Socrates, Epictetus, Kant, Nietzsche, etc.)
[Diagram: Input → Neural Network → Output, with a "Good or Bad?" supervision* signal]
Supervised learning is "teach by example":
Here are some examples, now learn patterns in these examples.
Reinforcement learning is "teach by experience":
Here's a world, now learn patterns by exploring it.
Machine Learning:
Supervised vs Reinforcement
Supervised learning is "teach by example":
Here are some examples, now learn patterns in these examples.
Reinforcement learning is "teach by experience":
Here's a world, now learn patterns by exploring it.
[330]
[Images: Failure | Success]
Reinforcement Learning in Humans
• Humans appear to learn to walk through "very few examples" of
trial and error. How they do so is an open question…
• Possible answers:
• Hardware: 230 million years of bipedal movement data.
• Imitation Learning: Observation of other humans walking.
• Algorithms: Better than backpropagation and stochastic gradient descent
[286, 291]
Open Question: What can be learned from data?
[Figure: AI system stack — Environment, Sensors, Sensor Data, Feature Extraction, Representation, Machine Learning, Knowledge, Reasoning, Planning, Action, Effector]
References: [132]
Sensor examples: Lidar, Camera (Visible, Infrared), Radar, Stereo Camera, Microphone, GPS, IMU, Networking (Wired, Wireless)
Image Recognition:
If it looks like a duck
Activity Recognition:
Swims like a duck
Audio Recognition:
Quacks like a duck
Final breakthrough, 358 years after its conjecture:
“It was so indescribably beautiful; it was so simple and
so elegant. I couldn’t understand how I’d missed it and
I just stared at it in disbelief for twenty minutes. Then
during the day I walked around the department, and
I’d keep coming back to my desk looking to see if it
was still there. It was still there. I couldn’t contain
myself, I was so excited. It was the most important
moment of my working life. Nothing I ever do again
will mean as much."
References: [133]
The promise of Deep Learning
The promise of Deep Reinforcement Learning
Reinforcement Learning Framework
At each step, the agent:
• Executes an action
• Observes the new state
• Receives a reward
(see the interaction-loop sketch after this slide)
[80]
Open Questions:
• What cannot be modeled in
this way?
• What are the challenges of
learning in this framework?
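A minimal sketch of this interaction loop, assuming the classic OpenAI Gym API (Gym is not named on this slide; any environment exposing reset/step works the same way):

```python
# Minimal agent-environment loop sketch, assuming the classic gym API
# (newer gym/gymnasium versions return (obs, info) from reset() and a 5-tuple from step()).
import gym

env = gym.make("CartPole-v1")
state = env.reset()
total_reward = 0.0

for t in range(1000):
    action = env.action_space.sample()                  # placeholder policy: random action
    next_state, reward, done, info = env.step(action)   # execute action, observe new state and reward
    total_reward += reward
    state = next_state
    if done:                                            # terminal state: episode ends
        break

print("episode return:", total_reward)
```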
Environment and Actions
• Fully Observable (Chess) vs Partially Observable (Poker)
• Single Agent (Atari) vs Multi Agent (DeepTraffic)
• Deterministic (Cart Pole) vs Stochastic (DeepTraffic)
• Static (Chess) vs Dynamic (DeepTraffic)
• Discrete (Chess) vs Continuous (Cart Pole)
Note: Real-world environments might not technically be stochastic
or partially observable, but due to their complexity they might as well
be treated as such.
The Challenge for RL in Real-World Applications
Reminder:
Supervised learning:
teach by example
Reinforcement learning:
teach by experience
Open Challenges. Two Options:
1. Real world observation + one-shot trial & error
2. Realistic simulation + transfer learning
1. Improve Transfer Learning
2. Improve Simulation
Major Components of an RL Agent
An RL agent may be directly or indirectly trying to learn a:
• Policy: agent’s behavior function
• Value function: how good is each state and/or action
• Model: agent’s representation of the environment
s_0, a_0, r_1, s_1, a_1, r_2, …, s_{n-1}, a_{n-1}, r_n, s_n
(s = state, a = action, r = reward, s_n = terminal state)
Meaning of Life for RL Agent: Maximize Reward
[84]
• Future reward: R_t = r_t + r_{t+1} + r_{t+2} + ⋯ + r_n
• Discounted future reward: R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ⋯ + γ^{n-t} r_n
• A good strategy for an agent would be to always choose
an action that maximizes the (discounted) future reward
• Why “discounted”?
• Math trick to help analyze convergence
• Uncertainty due to environment stochasticity, partial
observability, or that life can end at any moment:
“If today were the last day of my life, would I want
to do what I’m about to do today?” – Steve Jobs
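The discounted future reward above can be computed in a few lines; this is only a restatement of the formula, with an illustrative γ:

```python
# Sketch: compute R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
def discounted_return(rewards, gamma=0.99):
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```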
Robot in a Room
[Grid figure: +1 goal cell, -1 penalty cell, START cell]
Actions: UP, DOWN, LEFT, RIGHT
(Stochastic) model of the world — Action UP: 80% move UP, 10% move LEFT, 10% move RIGHT
• Reward +1 at [4,3], -1 at [4,2]
• Reward -0.04 for each step
• What’s the strategy to achieve max reward?
• We can learn the model and plan
• We can learn the value of (action, state) pairs and act greedily/non-greedily
• We can learn the policy directly while sampling from it
Optimal Policy for a Deterministic World
Reward: -0.04 for each step
actions: UP, DOWN, LEFT, RIGHT
When actions are deterministic:
UP
100% move UP
0% move LEFT
0% move RIGHT
Policy: Shortest path.
[Grid figure: +1, -1]
Optimal Policy for a Stochastic World
Reward: -0.04 for each step
[Grid figure: +1, -1]
actions: UP, DOWN, LEFT, RIGHT
When actions are stochastic:
UP
80% move UP
10% move LEFT
10% move RIGHT
Policy: Shortest path. Avoid UP around the -1 square.
Optimal Policy for a Stochastic World
Reward: -2 for each step
[Grid figure: +1, -1]
actions: UP, DOWN, LEFT, RIGHT
When actions are stochastic:
UP
80% move UP
10% move LEFT
10% move RIGHT
Policy: Shortest path.
Optimal Policy for a Stochastic World
Reward: -0.1 for each step (more urgent) vs. Reward: -0.04 for each step (less urgent)
[Two grid figures: +1, -1]
Optimal Policy for a Stochastic World
Reward: +0.01 for each step
actions: UP, DOWN, LEFT, RIGHT
When actions are stochastic:
UP
80% move UP
10% move LEFT
10% move RIGHT
Policy: Longest path.
[Grid figure: +1, -1]
Lessons from Robot in Room
• Environment model has big impact on optimal policy
• Reward structure has big impact on optimal policy
Reward structure may have
Unintended Consequences
Player gets reward based on:
1. Finishing time
2. Finishing position
3. Picking up “turbos”
[285]
[Videos: Human vs. AI (Deep RL Agent)]
AI Safety
Risk (and thus Human Life) Part of the Loss Function
Examples of Reinforcement Learning
Cart-Pole Balancing
• Goal— Balance the pole on top of a moving cart
• State— Pole angle, angular speed. Cart position, horizontal velocity.
• Actions— horizontal force to the cart
• Reward— 1 at each time step if the pole is upright
[166]
Examples of Reinforcement Learning
Doom*
• Goal:
Eliminate all opponents
• State:
Raw pixels of the game
• Actions:
Up, Down, Left, Right, Shoot, etc.
• Reward:
• Positive when eliminating an opponent,
negative when the agent is eliminated
[166]
* Added for important thought-provoking considerations of AI safety in the context of
autonomous weapons systems (see AGI lectures on the topic).
Examples of Reinforcement Learning
Grasping Objects with Robotic Arm
• Goal - Pick an object of different shapes
• State - Raw pixels from camera
• Actions – Move arm. Grasp.
• Reward - Positive when pickup is successful
[332]
Examples of Reinforcement Learning
Human Life
• Goal - Survival? Happiness?
• State - Sight. Hearing. Taste. Smell. Touch.
• Actions - Think. Move.
• Reward – Homeostasis?
Key Takeaways for Real-World Impact
• Deep Learning:
• Fun part: Good algorithms that learn from data.
• Hard part: Good questions, huge amounts of representative data.
• Deep Reinforcement Learning:
• Fun part: Good algorithms that learn from data.
• Hard part: Defining a useful state space, action space, and reward.
• Hardest part: Getting meaningful data for the above formalization.
3 Types of Reinforcement Learning
Model-based
• Learn the model of
the world, then plan
using the model
• Update model often
• Re-plan often
[331, 333]
Value-based
• Learn the state or
state-action value
• Act by choosing best
action in state
• Exploration is a
necessary add-on
Policy-based
• Learn the stochastic
policy function that
maps state to action
• Act by sampling policy
• Exploration is baked in
Taxonomy of RL Methods
[334]
Link: https://spinningup.openai.com
Taxonomy of RL Methods
[334]
Q-Learning
• State-action value function: Q(s,a)
• Expected return when starting in s,
performing a, and following policy π
• Q-Learning: Use any policy to estimate Q that maximizes future reward:
• Q directly approximates Q* (Bellman optimality equation)
• Independent of the policy being followed
• Only requirement: keep updating each (s,a) pair
Q(s, a) ← Q(s, a) + α [ r + γ max_a′ Q(s′, a′) − Q(s, a) ]
where s = old state, a = action, r = reward, s′ = new state, α = learning rate, γ = discount factor
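A tabular sketch of the update rule above; the dictionary-backed Q-table and the `actions` argument are illustrative, not from the lecture:

```python
# Tabular Q-learning update sketch
from collections import defaultdict

Q = defaultdict(float)            # Q[(state, action)] -> estimated value, default 0

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # TD target: immediate reward plus discounted best estimated value of the next state
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    td_target = r + gamma * best_next
    # Move Q(s, a) a fraction alpha toward the target
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```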
Exploration vs Exploitation
• Deterministic/greedy policy won’t explore all actions
• Don’t know anything about the environment at the beginning
• Need to try all actions to find the optimal one
• ε-greedy policy
• With probability 1-ε perform the optimal/greedy action, otherwise random action
• Slowly move it towards greedy policy: ε -> 0
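A minimal ε-greedy sketch matching the description above (function and variable names are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon explore (random action); otherwise exploit (greedy action).
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Anneal epsilon toward 0 over training, e.g. epsilon = max(0.05, 1.0 - step / 100_000)
```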
Q-Learning: Value Iteration
References: [84]
      A1   A2   A3   A4
S1    +1   +2   -1    0
S2    +2    0   +1   -2
S3    -1   +1    0   -2
S4    -2    0   +1   +1
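For a table this small, acting greedily is just a row-wise argmax; a quick sketch using the (hypothetical) Q-table above:

```python
import numpy as np

Q = np.array([[+1, +2, -1,  0],
              [+2,  0, +1, -2],
              [-1, +1,  0, -2],
              [-2,  0, +1, +1]])

greedy_actions = Q.argmax(axis=1)   # best action index per state (ties -> first)
print(greedy_actions)               # [1 0 1 2] -> A2, A1, A2, A3
```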
Q-Learning: Representation Matters
• In practice, Value Iteration is impractical
• Very limited states/actions
• Cannot generalize to unobserved states
• Think about the Breakout game
• State: screen pixels
• Image size: 84 × 84 (resized)
• 4 consecutive frames
• Grayscale with 256 gray levels
256^(84×84×4) rows in the Q-table!
≈ 10^67,970 ≫ 10^82 atoms in the universe
References: [83, 84]
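The state-count estimate can be checked with a quick log-scale calculation (a sketch, not part of the lecture):

```python
import math

num_pixels = 84 * 84 * 4                     # 4 stacked 84x84 grayscale frames
digits = num_pixels * math.log10(256)        # log10(256 ** num_pixels)
print(f"~10^{digits:,.0f} possible states")  # ~10^67,970
```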
Deep RL = RL + Neural Networks
[20]
DQN: Deep Q-Learning
[83]
Use a neural network to
approximate the Q-function:
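One possible shape of such a Q-network, sketched in PyTorch (the lecture later mentions TensorFlow or PyTorch; the layer sizes follow the common Atari setup of 4 stacked 84×84 frames but are illustrative, not quoted from [83]):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        # Convolutional trunk over 4 stacked 84x84 grayscale frames
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),   # one Q-value per action
        )

    def forward(self, x):
        return self.head(self.features(x / 255.0))

q = QNetwork(num_actions=4)
print(q(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 4])
```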
Deep Q-Network (DQN): Atari
[83]
Mnih et al., "Playing Atari with Deep Reinforcement Learning," 2013.
DQN and Double DQN
• Loss function (squared error): L = ( r + γ max_a′ Q(s′, a′) − Q(s, a) )²,
where the bracketed r + γ max term is the target and Q(s, a) is the prediction
[83]
• DQN: same network for both Q
• Double DQN: separate network for each Q
• Helps reduce bias introduced by the inaccuracies of
Q network at the beginning of training
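A sketch of how the targets differ between DQN and Double DQN (the `q_net`/`target_net` names and the batched tensor shapes are assumptions):

```python
import torch

def dqn_target(reward, next_state, done, target_net, gamma=0.99):
    # DQN: the target network both selects and evaluates the next action
    next_q = target_net(next_state).max(dim=1).values
    return reward + gamma * (1.0 - done) * next_q

def double_dqn_target(reward, next_state, done, q_net, target_net, gamma=0.99):
    # Double DQN: the online network selects the action, the target network evaluates it
    best_action = q_net(next_state).argmax(dim=1, keepdim=True)
    next_q = target_net(next_state).gather(1, best_action).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q
```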
DQN Tricks
• Experience Replay
• Stores experiences (actions, state transitions, and rewards) and creates
mini-batches from them for the training process
• Fixed Target Network
• The error calculation includes the target, which depends on the network
parameters and thus changes quickly. Updating the target network only every 1,000
steps increases the stability of the training process.
[83, 167]
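A minimal uniform experience-replay buffer, as a sketch of the first trick (the prioritized variant appears a few slides later):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experiences are dropped once full

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniformly sample a mini-batch of stored transitions for training
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```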
Atari Breakout
[85]
[Video frames: after 10, 120, and 240 minutes of training]
DQN Results in Atari
[83]
Dueling DQN (DDQN)
• Decompose Q(s,a)
• V(s): the value of being at that state
• A(s,a): the advantage of taking action a in state s versus all other possible
actions at that state
• Use two streams:
• one that estimates the state value V(s)
• one that estimates the advantage for each action A(s,a)
• Useful for states where action choice does not affect Q(s,a)
[336]
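A sketch of the dueling head in PyTorch, combining the two streams as Q(s,a) = V(s) + A(s,a) − meanₐ A(s,a) (subtracting the mean advantage is the common identifiability trick; layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, in_features, num_actions):
        super().__init__()
        self.value = nn.Linear(in_features, 1)                 # V(s) stream
        self.advantage = nn.Linear(in_features, num_actions)   # A(s,a) stream

    def forward(self, x):
        v = self.value(x)
        a = self.advantage(x)
        return v + a - a.mean(dim=1, keepdim=True)             # recombine into Q(s,a)
```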
Prioritized Experience Replay
• Sample experiences based on impact, not frequency of occurrence
[336]
Taxonomy of RL Methods
[334]
Policy Gradient (PG)
• DQN (off-policy): Approximate Q and infer optimal policy
• PG (on-policy): Directly optimize policy space
[63]
Good illustrative explanation:
http://karpathy.github.io/2016/05/31/rl/
“Deep Reinforcement Learning:
Pong from Pixels”
Policy Network
Policy Gradient – Training
[63, 204]
• REINFORCE: a policy gradient method that increases the probability of good actions and
decreases the probability of bad actions:
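The REINFORCE idea from this slide, written as a loss one could backpropagate through (the return normalization is a common stabilization trick, not something stated here):

```python
import torch

def reinforce_loss(log_probs, returns):
    # log_probs: log pi(a_t | s_t) for the actions actually taken, shape [T]
    # returns:   discounted return from each time step, shape [T]
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # common normalization trick
    return -(log_probs * returns).sum()   # minimizing this does gradient ascent on expected return
```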
Policy Gradient
• Pros vs DQN:
• Messy World: If Q function is too complex to be learned, DQN may fail
miserably, while PG will still learn a good policy.
• Speed: Faster convergence
• Stochastic Policies: Capable of learning stochastic policies - DQN can’t
• Continuous actions: Much easier to model continuous action space
• Cons vs DQN:
• Data: Sample inefficient (needs more data)
• Stability: Less stable during training process.
• Poor credit assignment to (state, action) pairs for delayed rewards
[63, 335]
Problem with REINFORCE:
calculating the reward only at the end of the
episode means every action in a high-reward
episode gets credited as good, even the bad ones.
Taxonomy of RL Methods
[334]
Advantage Actor-Critic (A2C)
• Combine DQN (value-based) and REINFORCE (policy-based)
• Two neural networks (Actor and Critic):
• Actor is policy-based: Samples the action from a policy
• Critic is value-based: Measures how good the chosen action is
[335]
• Update at each time step - temporal difference (TD) learning
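A sketch of a single A2C-style update at one time step, with the TD error serving as the advantage estimate (names and shapes are illustrative):

```python
import torch

def a2c_losses(log_prob, value, next_value, reward, done, gamma=0.99):
    td_target = reward + gamma * (1.0 - done) * next_value.detach()
    advantage = td_target - value                # TD error used as the advantage estimate
    actor_loss = -log_prob * advantage.detach()  # critic tells the actor how good the action was
    critic_loss = advantage.pow(2)               # regress V(s) toward the TD target
    return actor_loss, critic_loss
```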
Asynchronous Advantage Actor-Critic (A3C)
[337]
• Both A3C and A2C use parallelism in training
• A2C syncs workers up for a global parameter update and then starts each
iteration with the same policy
Taxonomy of RL Methods
[334]
Deep Deterministic Policy Gradient (DDPG)
• Actor-Critic framework for learning a deterministic policy
• Can be thought of as: DQN for continuous action spaces
• As with DQN, the following tricks are required:
• Experience Replay
• Target network
• Exploration: add noise to actions,
reducing scale of the noise
as training progresses
[341]
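A sketch of the exploration trick described above: take the deterministic action and add noise whose scale decays over training (the `actor` callable and decay schedule are assumptions):

```python
import numpy as np

def noisy_action(actor, state, step, sigma0=0.2, decay=1e-5, low=-1.0, high=1.0):
    a = np.asarray(actor(state), dtype=np.float64)     # deterministic action from the actor
    sigma = sigma0 * np.exp(-decay * step)             # noise scale shrinks as training progresses
    return np.clip(a + np.random.normal(0.0, sigma, size=a.shape), low, high)
```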
Taxonomy of RL Methods
[334]
Policy Optimization
• Progress beyond Vanilla Policy Gradient:
• Natural Policy Gradient
• TRPO
• PPO
• Basic idea in on-policy optimization:
avoid taking bad update steps that collapse training performance.
[339]
Line Search:
First pick direction, then step size
Trust Region:
First pick step size, then direction
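PPO is one concrete instantiation of this idea; a sketch of its clipped surrogate loss (clip_eps = 0.2 is a typical value, not taken from the slide):

```python
import torch

def ppo_clip_loss(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    ratio = torch.exp(new_log_prob - old_log_prob)                   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()                     # maximize the clipped surrogate
```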
Taxonomy of RL Methods
[334]
Game of Go
[170]
AlphaGo (2016) Beat Top Human at Go
[83]
AlphaGo Zero (2017): Beats AlphaGo
[149]
AlphaGo Zero Approach
[170]
• Same as the best before: Monte Carlo Tree Search (MCTS)
• Balance exploitation/exploration (going deep on promising positions or
exploring new underplayed positions)
• Use a neural network as “intuition” for which positions to
expand as part of MCTS (same as AlphaGo)
AlphaGo Zero Approach
[170]
• Same as the best before: Monte Carlo Tree Search (MCTS)
• Balance exploitation/exploration (going deep on promising positions or
exploring new underplayed positions)
• Use a neural network as “intuition” for which positions to
expand as part of MCTS (same as AlphaGo)
• “Tricks”
• Use MCTS intelligent look-ahead (instead of human games) to improve
value estimates of play options
• Multi-task learning: “two-headed” network that outputs (1) move
probability and (2) probability of winning.
• Updated architecture: use residual networks
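The exploitation/exploration balance inside MCTS is usually scored with a PUCT-style rule; a sketch (the constant c_puct and the exact form vary across implementations, and this is not quoted from the AlphaGo Zero paper):

```python
import math

def puct_score(q_value, prior, visit_count, parent_visits, c_puct=1.5):
    # Exploitation (mean value so far) + exploration (prior-weighted, decays with visits)
    return q_value + c_puct * prior * math.sqrt(parent_visits) / (1 + visit_count)
```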
AlphaZero vs Chess, Shogi, Go
[340]
AlphaZero vs Chess, Shogi, Go
[340]
AlphaZero vs Chess, Shogi, Go
[340]
To date, for most successful robots operating in the real world:
Deep RL is not involved
[169]
To date, for most successful robots operating in the real world:
Deep RL is not involved
But… that’s slowly changing:
Learning Control Dynamics
[343]
But… that’s slowly changing:
Learning to Drive: Beyond Pure Imitation (Waymo)
[343]
The Challenge for RL in Real-World Applications
Reminder:
Supervised learning:
teach by example
Reinforcement learning:
teach by experience
Open Challenges. Two Options:
1. Real world observation + one-shot trial & error
2. Realistic simulation + transfer learning
1. Improve Transfer Learning
2. Improve Simulation
Thinking Outside the Box:
Multiverse Theory and the Simulation Hypothesis
• Create an (infinite) set of simulation environments to learn in
so that our reality becomes just another sample from the set.
[342]
Next Steps in Deep RL
• Lectures: https://deeplearning.mit.edu
• Tutorials: https://github.com/lexfridman/mit-deep-learning
• Advice for Research (from Spinning Up as a Deep RL Researcher by Joshua Achiam)
• Background
• Fundamentals in probability, statistics, multivariate calculus.
• Deep learning basics
• Deep RL basics
• TensorFlow (or PyTorch)
• Learn by doing
• Implement core deep RL algorithms (discussed today)
• Look for tricks and details in papers that were key to get it to work
• Iterate fast in simple environments (see Einstein quote on simplicity)
• Research
• Improve on an existing approach
• Focus on an unsolved task / benchmark
• Create a new task / problem that hasn’t been addressed with RL
[339]
Más contenido relacionado

La actualidad más candente

Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
butest
 

La actualidad más candente (20)

Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement Learning
 
Intro to AI STRIPS Planning & Applications in Video-games Lecture6-Part1
Intro to AI STRIPS Planning & Applications in Video-games Lecture6-Part1Intro to AI STRIPS Planning & Applications in Video-games Lecture6-Part1
Intro to AI STRIPS Planning & Applications in Video-games Lecture6-Part1
 
Actor critic algorithm
Actor critic algorithmActor critic algorithm
Actor critic algorithm
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-Learning
 
Reinforcement learning
Reinforcement  learningReinforcement  learning
Reinforcement learning
 
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learning
 
Proximal Policy Optimization
Proximal Policy OptimizationProximal Policy Optimization
Proximal Policy Optimization
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
Reinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo MethodsReinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo Methods
 
Genetic Algorithm in Artificial Intelligence
Genetic Algorithm in Artificial IntelligenceGenetic Algorithm in Artificial Intelligence
Genetic Algorithm in Artificial Intelligence
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners Tutorial
 
Support Vector Machines ( SVM )
Support Vector Machines ( SVM ) Support Vector Machines ( SVM )
Support Vector Machines ( SVM )
 
Hierarchical Reinforcement Learning
Hierarchical Reinforcement LearningHierarchical Reinforcement Learning
Hierarchical Reinforcement Learning
 
Overview on Optimization algorithms in Deep Learning
Overview on Optimization algorithms in Deep LearningOverview on Optimization algorithms in Deep Learning
Overview on Optimization algorithms in Deep Learning
 
Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning
 
Planning
Planning Planning
Planning
 
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
 

Similar a MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman

D u e o n A p r i l 1 3 , 2 0 1 7 Accounting 212 F.docx
 D u e  o n  A p r i l  1 3 ,  2 0 1 7  Accounting 212 F.docx D u e  o n  A p r i l  1 3 ,  2 0 1 7  Accounting 212 F.docx
D u e o n A p r i l 1 3 , 2 0 1 7 Accounting 212 F.docx
aryan532920
 
Measuring Direct Social Media ROI
Measuring Direct Social Media ROIMeasuring Direct Social Media ROI
Measuring Direct Social Media ROI
Kemp Edmonds
 
Baworld adapting to whats happening
Baworld adapting to whats happeningBaworld adapting to whats happening
Baworld adapting to whats happening
Dave Davis PMP, PgMP, PBA
 
Enterprise social what is the real value to the business - collab con - mar...
Enterprise social   what is the real value to the business - collab con - mar...Enterprise social   what is the real value to the business - collab con - mar...
Enterprise social what is the real value to the business - collab con - mar...
Ruven Gotz
 
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
BHANU281672
 
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
lorainedeserre
 

Similar a MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman (20)

MIT Deep Learning Basics: Introduction and Overview by Lex Fridman
MIT Deep Learning Basics: Introduction and Overview by Lex FridmanMIT Deep Learning Basics: Introduction and Overview by Lex Fridman
MIT Deep Learning Basics: Introduction and Overview by Lex Fridman
 
Finding Leverage with System Dynamics
Finding Leverage with System DynamicsFinding Leverage with System Dynamics
Finding Leverage with System Dynamics
 
Blind Spots in Persuasive Design
Blind Spots in Persuasive DesignBlind Spots in Persuasive Design
Blind Spots in Persuasive Design
 
D u e o n A p r i l 1 3 , 2 0 1 7 Accounting 212 F.docx
 D u e  o n  A p r i l  1 3 ,  2 0 1 7  Accounting 212 F.docx D u e  o n  A p r i l  1 3 ,  2 0 1 7  Accounting 212 F.docx
D u e o n A p r i l 1 3 , 2 0 1 7 Accounting 212 F.docx
 
Community Management ROI - CMX Summit East
Community Management ROI - CMX Summit EastCommunity Management ROI - CMX Summit East
Community Management ROI - CMX Summit East
 
An Introduction to Usability
An Introduction to UsabilityAn Introduction to Usability
An Introduction to Usability
 
Emerce GAUC 2020 - Lars Harmsen - Beerwulf - Brewing a culture of data-driven...
Emerce GAUC 2020 - Lars Harmsen - Beerwulf - Brewing a culture of data-driven...Emerce GAUC 2020 - Lars Harmsen - Beerwulf - Brewing a culture of data-driven...
Emerce GAUC 2020 - Lars Harmsen - Beerwulf - Brewing a culture of data-driven...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Measuring Direct Social Media ROI
Measuring Direct Social Media ROIMeasuring Direct Social Media ROI
Measuring Direct Social Media ROI
 
Baworld adapting to whats happening
Baworld adapting to whats happeningBaworld adapting to whats happening
Baworld adapting to whats happening
 
User Interface and User Experience - A Process and Strategy for Small Teams
User Interface and User Experience - A Process and Strategy for Small TeamsUser Interface and User Experience - A Process and Strategy for Small Teams
User Interface and User Experience - A Process and Strategy for Small Teams
 
Enterprise social what is the real value to the business - collab con - mar...
Enterprise social   what is the real value to the business - collab con - mar...Enterprise social   what is the real value to the business - collab con - mar...
Enterprise social what is the real value to the business - collab con - mar...
 
Be Networked, Use Measurement, and Learn from Your Data
Be Networked, Use Measurement, and Learn from Your DataBe Networked, Use Measurement, and Learn from Your Data
Be Networked, Use Measurement, and Learn from Your Data
 
Online Community Practices
Online Community PracticesOnline Community Practices
Online Community Practices
 
Human connectedness and meaningful conversations - how coaching boosts the su...
Human connectedness and meaningful conversations - how coaching boosts the su...Human connectedness and meaningful conversations - how coaching boosts the su...
Human connectedness and meaningful conversations - how coaching boosts the su...
 
YTH Live 2019 (youth+tech+health): Online to Offline Impact Measurement
YTH Live 2019 (youth+tech+health): Online to Offline Impact MeasurementYTH Live 2019 (youth+tech+health): Online to Offline Impact Measurement
YTH Live 2019 (youth+tech+health): Online to Offline Impact Measurement
 
How Nonprofits Open Professional Doors
How Nonprofits Open Professional DoorsHow Nonprofits Open Professional Doors
How Nonprofits Open Professional Doors
 
Collaboration and enterprise social tools-SharePointAlooza - 2015
Collaboration and enterprise social tools-SharePointAlooza - 2015Collaboration and enterprise social tools-SharePointAlooza - 2015
Collaboration and enterprise social tools-SharePointAlooza - 2015
 
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
 
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
3420, 6)48 PMAPUS RCampus - Week 1 - Financial Statements R.docx
 

Más de Peerasak C.

เสวนาเจาะเกณฑ์ระดมทุนและการลงทุนกับ LiVex
เสวนาเจาะเกณฑ์ระดมทุนและการลงทุนกับ LiVexเสวนาเจาะเกณฑ์ระดมทุนและการลงทุนกับ LiVex
เสวนาเจาะเกณฑ์ระดมทุนและการลงทุนกับ LiVex
Peerasak C.
 
Blockchain for Government Services
Blockchain for Government ServicesBlockchain for Government Services
Blockchain for Government Services
Peerasak C.
 
e-Conomy SEA 2020
e-Conomy SEA 2020e-Conomy SEA 2020
e-Conomy SEA 2020
Peerasak C.
 
A Roadmap for CrossBorder Data Flows: Future-Proofing Readiness and Cooperati...
A Roadmap for CrossBorder Data Flows: Future-Proofing Readiness and Cooperati...A Roadmap for CrossBorder Data Flows: Future-Proofing Readiness and Cooperati...
A Roadmap for CrossBorder Data Flows: Future-Proofing Readiness and Cooperati...
Peerasak C.
 
๙ พระสูตร ปฐมโพธิกาล
๙ พระสูตร ปฐมโพธิกาล๙ พระสูตร ปฐมโพธิกาล
๙ พระสูตร ปฐมโพธิกาล
Peerasak C.
 
TAT The Journey
TAT The JourneyTAT The Journey
TAT The Journey
Peerasak C.
 
Artificial Intelligence and Life in 2030. Standford U. Sep.2016
Artificial Intelligence and Life in 2030. Standford U. Sep.2016Artificial Intelligence and Life in 2030. Standford U. Sep.2016
Artificial Intelligence and Life in 2030. Standford U. Sep.2016
Peerasak C.
 
Royal Virtues by Somdet Phra Buddhaghosajahn (P. A. Payutto) translated by Ja...
Royal Virtues by Somdet Phra Buddhaghosajahn (P. A. Payutto) translated by Ja...Royal Virtues by Somdet Phra Buddhaghosajahn (P. A. Payutto) translated by Ja...
Royal Virtues by Somdet Phra Buddhaghosajahn (P. A. Payutto) translated by Ja...
Peerasak C.
 
e-Conomy SEA Report 2019
e-Conomy SEA Report 2019e-Conomy SEA Report 2019
e-Conomy SEA Report 2019
Peerasak C.
 
นิตยสารคิด (Creative Thailand) ฉบับเดือนตุลาคม 2562
นิตยสารคิด (Creative Thailand) ฉบับเดือนตุลาคม 2562นิตยสารคิด (Creative Thailand) ฉบับเดือนตุลาคม 2562
นิตยสารคิด (Creative Thailand) ฉบับเดือนตุลาคม 2562
Peerasak C.
 
เจาะเทรนด์โลก 2020: Positive Power โดย TCDC
เจาะเทรนด์โลก 2020: Positive Power โดย TCDCเจาะเทรนด์โลก 2020: Positive Power โดย TCDC
เจาะเทรนด์โลก 2020: Positive Power โดย TCDC
Peerasak C.
 
คู่มือการใช้ระบบ Biz Portal ส าหรับผู้ยื่นคำขอ (Biz Portal User Manual)
คู่มือการใช้ระบบ Biz Portal ส าหรับผู้ยื่นคำขอ (Biz Portal User Manual)คู่มือการใช้ระบบ Biz Portal ส าหรับผู้ยื่นคำขอ (Biz Portal User Manual)
คู่มือการใช้ระบบ Biz Portal ส าหรับผู้ยื่นคำขอ (Biz Portal User Manual)
Peerasak C.
 
K SME INSPIRED September 2019 Issue 65
K SME INSPIRED September 2019 Issue 65K SME INSPIRED September 2019 Issue 65
K SME INSPIRED September 2019 Issue 65
Peerasak C.
 
หนังสือ BlockChain for Government Services
หนังสือ BlockChain for Government Servicesหนังสือ BlockChain for Government Services
หนังสือ BlockChain for Government Services
Peerasak C.
 
General Population Census of the Kingdom of Cambodia 2019
General Population Census of the Kingdom of Cambodia 2019General Population Census of the Kingdom of Cambodia 2019
General Population Census of the Kingdom of Cambodia 2019
Peerasak C.
 

Más de Peerasak C. (20)

เสวนาเจาะเกณฑ์ระดมทุนและการลงทุนกับ LiVex
เสวนาเจาะเกณฑ์ระดมทุนและการลงทุนกับ LiVexเสวนาเจาะเกณฑ์ระดมทุนและการลงทุนกับ LiVex
เสวนาเจาะเกณฑ์ระดมทุนและการลงทุนกับ LiVex
 
Blockchain for Government Services
Blockchain for Government ServicesBlockchain for Government Services
Blockchain for Government Services
 
e-Conomy SEA 2020
e-Conomy SEA 2020e-Conomy SEA 2020
e-Conomy SEA 2020
 
A Roadmap for CrossBorder Data Flows: Future-Proofing Readiness and Cooperati...
A Roadmap for CrossBorder Data Flows: Future-Proofing Readiness and Cooperati...A Roadmap for CrossBorder Data Flows: Future-Proofing Readiness and Cooperati...
A Roadmap for CrossBorder Data Flows: Future-Proofing Readiness and Cooperati...
 
๙ พระสูตร ปฐมโพธิกาล
๙ พระสูตร ปฐมโพธิกาล๙ พระสูตร ปฐมโพธิกาล
๙ พระสูตร ปฐมโพธิกาล
 
TAT The Journey
TAT The JourneyTAT The Journey
TAT The Journey
 
FREELANCING IN AMERICA 2019
FREELANCING IN AMERICA 2019FREELANCING IN AMERICA 2019
FREELANCING IN AMERICA 2019
 
How to start a business: Checklist and Canvas
How to start a business: Checklist and CanvasHow to start a business: Checklist and Canvas
How to start a business: Checklist and Canvas
 
The Multiple Effects of Business Planning on New Venture Performance
The Multiple Effects of Business Planning on New Venture PerformanceThe Multiple Effects of Business Planning on New Venture Performance
The Multiple Effects of Business Planning on New Venture Performance
 
Artificial Intelligence and Life in 2030. Standford U. Sep.2016
Artificial Intelligence and Life in 2030. Standford U. Sep.2016Artificial Intelligence and Life in 2030. Standford U. Sep.2016
Artificial Intelligence and Life in 2030. Standford U. Sep.2016
 
Testing Business Ideas by David Bland & Alex Osterwalder
Testing Business Ideas by David Bland & Alex Osterwalder Testing Business Ideas by David Bland & Alex Osterwalder
Testing Business Ideas by David Bland & Alex Osterwalder
 
Royal Virtues by Somdet Phra Buddhaghosajahn (P. A. Payutto) translated by Ja...
Royal Virtues by Somdet Phra Buddhaghosajahn (P. A. Payutto) translated by Ja...Royal Virtues by Somdet Phra Buddhaghosajahn (P. A. Payutto) translated by Ja...
Royal Virtues by Somdet Phra Buddhaghosajahn (P. A. Payutto) translated by Ja...
 
e-Conomy SEA Report 2019
e-Conomy SEA Report 2019e-Conomy SEA Report 2019
e-Conomy SEA Report 2019
 
นิตยสารคิด (Creative Thailand) ฉบับเดือนตุลาคม 2562
นิตยสารคิด (Creative Thailand) ฉบับเดือนตุลาคม 2562นิตยสารคิด (Creative Thailand) ฉบับเดือนตุลาคม 2562
นิตยสารคิด (Creative Thailand) ฉบับเดือนตุลาคม 2562
 
เจาะเทรนด์โลก 2020: Positive Power โดย TCDC
เจาะเทรนด์โลก 2020: Positive Power โดย TCDCเจาะเทรนด์โลก 2020: Positive Power โดย TCDC
เจาะเทรนด์โลก 2020: Positive Power โดย TCDC
 
RD Strategies & D2rive
RD Strategies & D2rive RD Strategies & D2rive
RD Strategies & D2rive
 
คู่มือการใช้ระบบ Biz Portal ส าหรับผู้ยื่นคำขอ (Biz Portal User Manual)
คู่มือการใช้ระบบ Biz Portal ส าหรับผู้ยื่นคำขอ (Biz Portal User Manual)คู่มือการใช้ระบบ Biz Portal ส าหรับผู้ยื่นคำขอ (Biz Portal User Manual)
คู่มือการใช้ระบบ Biz Portal ส าหรับผู้ยื่นคำขอ (Biz Portal User Manual)
 
K SME INSPIRED September 2019 Issue 65
K SME INSPIRED September 2019 Issue 65K SME INSPIRED September 2019 Issue 65
K SME INSPIRED September 2019 Issue 65
 
หนังสือ BlockChain for Government Services
หนังสือ BlockChain for Government Servicesหนังสือ BlockChain for Government Services
หนังสือ BlockChain for Government Services
 
General Population Census of the Kingdom of Cambodia 2019
General Population Census of the Kingdom of Cambodia 2019General Population Census of the Kingdom of Cambodia 2019
General Population Census of the Kingdom of Cambodia 2019
 

Último

The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
AnaAcapella
 

Último (20)

Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 

MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman

  • 2. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Deep Reinforcement Learning (Deep RL) • What is it? Framework for learning to solve sequential decision making problems. • How? Trial and error in a world that provides occasional rewards • Deep? Deep RL = RL + Neural Networks [306, 307] Deep Learning Deep RL
  • 3. 2019https://deeplearning.mit.edu Types of Learning • Supervised Learning • Semi-Supervised Learning • Unsupervised Learning • Reinforcement Learning It’s all “supervised” by a loss function! *Someone has to say what’s good and what’s bad (see Socrates, Epictetus, Kant, Nietzsche, etc.) Neural Network Input Output Good or Bad? Supervision*
  • 4. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Supervised learning is “teach by example”: Here’s some examples, now learn patterns in these example. Reinforcement learning is “teach by experience”: Here’s a world, now learn patterns by exploring it.
  • 5. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Machine Learning: Supervised vs Reinforcement Supervised learning is “teach by example”: Here’s some examples, now learn patterns in these example. Reinforcement learning is “teach by experience”: Here’s a world, now learn patterns by exploring it. [330] Failure Success
  • 6. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Reinforcement Learning in Humans • Human appear to learn to walk through “very few examples” of trial and error. How is an open question… • Possible answers: • Hardware: 230 million years of bipedal movement data. • Imitation Learning: Observation of other humans walking. • Algorithms: Better than backpropagation and stochastic gradient descent [286, 291]
  • 7. 2019https://deeplearning.mit.edu Open Question: What can be learned from data? Environment Sensors Sensor Data Feature Extraction Machine Learning Reasoning Knowledge Effector Planning Action Representation
  • 8. 2019https://deeplearning.mit.edu Environment Sensors Sensor Data Feature Extraction Machine Learning Reasoning Knowledge Effector Planning Action Representation References: [132] Lidar Camera (Visible, Infrared) Radar Stereo Camera Microphone GPS IMU Networking (Wired, Wireless)
  • 9. 2019https://deeplearning.mit.edu Environment Sensors Sensor Data Feature Extraction Machine Learning Reasoning Knowledge Effector Planning Action Representation
  • 10. 2019https://deeplearning.mit.edu Environment Sensors Sensor Data Feature Extraction Machine Learning Reasoning Knowledge Effector Planning Action Representation
  • 11. 2019https://deeplearning.mit.edu Environment Sensors Sensor Data Feature Extraction Machine Learning Reasoning Knowledge Effector Planning Action Representation Image Recognition: If it looks like a duck Activity Recognition: Swims like a duck Audio Recognition: Quacks like a duck
  • 12. 2019https://deeplearning.mit.edu Environment Sensors Sensor Data Feature Extraction Machine Learning Reasoning Knowledge Effector Planning Action Representation Final breakthrough, 358 years after its conjecture: “It was so indescribably beautiful; it was so simple and so elegant. I couldn’t understand how I’d missed it and I just stared at it in disbelief for twenty minutes. Then during the day I walked around the department, and I’d keep coming back to my desk looking to see if it was still there. It was still there. I couldn’t contain myself, I was so excited. It was the most important moment of my working life. Nothing I ever do again will mean as much."
  • 13. 2019https://deeplearning.mit.edu Environment Sensors Sensor Data Feature Extraction Machine Learning Reasoning Knowledge Effector Planning Action Representation References: [133]
  • 14. 2019https://deeplearning.mit.edu Environment Sensors Sensor Data Feature Extraction Machine Learning Reasoning Knowledge Effector Planning Action Representation The promise of Deep Learning The promise of Deep Reinforcement Learning
  • 15. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Reinforcement Learning Framework At each step, the agent: • Executes action • Observe new state • Receive reward [80] Open Questions: • What cannot be modeled in this way? • What are the challenges of learning in this framework?
  • 16. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Environment and Actions • Fully Observable (Chess) vs Partially Observable (Poker) • Single Agent (Atari) vs Multi Agent (DeepTraffic) • Deterministic (Cart Pole) vs Stochastic (DeepTraffic) • Static (Chess) vs Dynamic (DeepTraffic) • Discrete (Chess) vs Continuous (Cart Pole) Note: Real-world environment might not technically be stochastic or partially-observable but might as well be treated as such due to their complexity.
  • 17. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references The Challenge for RL in Real-World Applications Reminder: Supervised learning: teach by example Reinforcement learning: teach by experience Open Challenges. Two Options: 1. Real world observation + one-shot trial & error 2. Realistic simulation + transfer learning 1. Improve Transfer Learning 2. Improve Simulation
  • 18. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Major Componentsof anRLAgent An RL agent may be directly or indirectly trying to learn a: • Policy: agent’s behavior function • Value function: how good is each state and/or action • Model: agent’s representation of the environment 𝑠0, 𝑎0, 𝑟1, 𝑠1, 𝑎1, 𝑟2, … , 𝑠 𝑛−1, 𝑎 𝑛−1, 𝑟 𝑛, 𝑠 𝑛 state action reward Terminal state
  • 19. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Meaning of Life for RL Agent: Maximize Reward [84] • Future reward: 𝑅𝑡 = 𝑟𝑡 + 𝑟𝑡+1 + 𝑟𝑡+2 + ⋯ + 𝑟𝑛 • Discounted future reward: 𝑅 𝑡 = 𝑟𝑡 + 𝛾𝑟𝑡+1 + 𝛾2 𝑟 𝑡+2 + ⋯ + 𝛾 𝑛−𝑡 𝑟 𝑛 • A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward • Why “discounted”? • Math trick to help analyze convergence • Uncertainty due to environment stochasticity, partial observability, or that life can end at any moment: “If today were the last day of my life, would I want to do what I’m about to do today?” – Steve Jobs
  • 20. 2019https://deeplearning.mit.edu actions: UP, DOWN, LEFT, RIGHT (Stochastic) model of the world: Action: UP 80% move UP 10% move LEFT 10% move RIGHT Robot in a Room +1 -1 START • Reward +1 at [4,3], -1 at [4,2] • Reward -0.04 for each step • What’s the strategy to achieve max reward? • We can learn the model and plan • We can learn the value of (action, state) pairs and act greed/non-greedy • We can learn the policy directly while sampling from it
  • 21. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Optimal Policy for a Deterministic World Reward: -0.04 for each step actions: UP, DOWN, LEFT, RIGHT When actions are deterministic: UP 100% move UP 0% move LEFT 0% move RIGHT Policy: Shortest path. +1 -1
  • 22. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Optimal Policy for a Stochastic World Reward: -0.04 for each step +1 -1 actions: UP, DOWN, LEFT, RIGHT When actions are stochastic: UP 80% move UP 10% move LEFT 10% move RIGHT Policy: Shortest path. Avoid -UP around -1 square.
  • 23. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Optimal Policy for a Stochastic World Reward: -2 for each step +1 -1 actions: UP, DOWN, LEFT, RIGHT When actions are stochastic: UP 80% move UP 10% move LEFT 10% move RIGHT Policy: Shortest path.
  • 24. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Optimal Policy for a Stochastic World Reward: -0.1 for each step +1 -1 +1 -1 Reward: -0.04 for each step Less urgentMore urgent
  • 25. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Optimal Policy for a Stochastic World Reward: +0.01 for each step actions: UP, DOWN, LEFT, RIGHT When actions are stochastic: UP 80% move UP 10% move LEFT 10% move RIGHT Policy: Longest path. +1 -1
  • 26. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Lessons from Robot in Room • Environment model has big impact on optimal policy • Reward structure has big impact on optimal policy
  • 27. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Reward structure may have Unintended Consequences Player gets reward based on: 1. Finishing time 2. Finishing position 3. Picking up “turbos” [285] Human AI (Deep RL Agent)
  • 28. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references AI Safety Risk (and thus Human Life) Part of the Loss Function
  • 29. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Examples of Reinforcement Learning Cart-Pole Balancing • Goal— Balance the pole on top of a moving cart • State— Pole angle, angular speed. Cart position, horizontal velocity. • Actions— horizontal force to the cart • Reward— 1 at each time step if the pole is upright [166]
  • 30. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Examples of Reinforcement Learning Doom* • Goal: Eliminate all opponents • State: Raw game pixels of the game • Actions: Up, Down, Left, Right, Shoot, etc. • Reward: • Positive when eliminating an opponent, negative when the agent is eliminated [166] * Added for important thought-provoking considerations of AI safety in the context of autonomous weapons systems (see AGI lectures on the topic).
  • 31. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Examples of Reinforcement Learning Grasping Objects with Robotic Arm • Goal - Pick an object of different shapes • State - Raw pixels from camera • Actions – Move arm. Grasp. • Reward - Positive when pickup is successful [332]
  • 32. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references Examples of Reinforcement Learning Human Life • Goal - Survival? Happiness? • State - Sight. Hearing. Taste. Smell. Touch. • Actions - Think. Move. • Reward – Homeostasis?
  • 33. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references Key Takeaways for Real-World Impact • Deep Learning: • Fun part: Good algorithms that learn from data. • Hard part: Good questions, huge amounts of representative data. • Deep Reinforcement Learning: • Fun part: Good algorithms that learn from data. • Hard part: Defining a useful state space, action space, and reward. • Hardest part: Getting meaningful data for the above formalization.
  • 34. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references 3 Types of Reinforcement Learning Model-based • Learn the model of the world, then plan using the model • Update model often • Re-plan often [331, 333] Value-based • Learn the state or state-action value • Act by choosing best action in state • Exploration is a necessary add-on Policy-based • Learn the stochastic policy function that maps state to action • Act by sampling policy • Exploration is baked in
  • 35. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Taxonomy of RL Methods [334] Link: https://spinningup.openai.com
  • 36. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Taxonomy of RL Methods [334]
  • 37. 2019https://deeplearning.mit.edu Q-Learning • State-action value function: Q(s,a) • Expected return when starting in s, performing a, and following  • Q-Learning: Use any policy to estimate Q that maximizes future reward: • Q directly approximates Q* (Bellman optimality equation) • Independent of the policy being followed • Only requirement: keep updating each (s,a) pair s a s’ r New State Old State Reward Learning Rate Discount Factor
  • 38. 2019https://deeplearning.mit.edu Exploration vs Exploitation • Deterministic/greedy policy won’t explore all actions • Don’t know anything about the environment at the beginning • Need to try all actions to find the optimal one • ε-greedy policy • With probability 1-ε perform the optimal/greedy action, otherwise random action • Slowly move it towards greedy policy: ε -> 0
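  A minimal epsilon-greedy selection sketch matching the slide: greedy with probability 1 − ε, random otherwise, with ε decayed toward 0 over training (the decay schedule below is an assumption):

  import numpy as np

  def epsilon_greedy(Q, s, epsilon):
      if np.random.rand() < epsilon:
          return np.random.randint(Q.shape[1])   # explore: random action
      return int(np.argmax(Q[s]))                # exploit: greedy action

  epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
  # after each episode: epsilon = max(epsilon_min, epsilon * decay)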
  • 39. 2019https://deeplearning.mit.edu Q-Learning: Value Iteration References: [84] Example Q-table (rows = states S1–S4, columns = actions A1–A4): S1: +1, +2, -1, 0 | S2: +2, 0, +1, -2 | S3: -1, +1, 0, -2 | S4: -2, 0, +1, +1
  • 40. 2019https://deeplearning.mit.edu Q-Learning: Representation Matters • In practice, Value Iteration is impractical • Very limited states/actions • Cannot generalize to unobserved states • Think about the Breakout game • State: screen pixels • Image size: 84 × 84 (resized) • 4 consecutive images • Grayscale with 256 gray levels → 256^(84×84×4) ≈ 10^67,970 rows in the Q-table, vastly more than the ~10^82 atoms in the observable universe References: [83, 84]
  • 41. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Deep RL = RL + Neural Networks [20]
  • 42. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references DQN: Deep Q-Learning [83] Use a neural network with weights θ to approximate the Q-function: Q(s, a; θ) ≈ Q*(s, a)
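  A sketch of the kind of convolutional Q-network used for Atari in [83], written in PyTorch; the layer sizes follow the commonly cited DQN architecture and should be read as an assumption for illustration, not the exact network behind this slide:

  import torch
  import torch.nn as nn

  class QNetwork(nn.Module):
      def __init__(self, n_actions):
          super().__init__()
          self.features = nn.Sequential(
              nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4 stacked 84x84 grayscale frames
              nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
              nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
          )
          self.head = nn.Sequential(
              nn.Flatten(),
              nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
              nn.Linear(512, n_actions),                              # one Q-value per action
          )

      def forward(self, x):                                           # x: (batch, 4, 84, 84)
          return self.head(self.features(x))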
  • 43. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references Deep Q-Network (DQN): Atari [83] Mnih et al. "Playing Atari with Deep Reinforcement Learning." 2013.
  • 44. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references DQN and Double DQN • Loss function (squared error): L = (r + γ max_a′ Q(s′, a′) − Q(s, a))², where r + γ max_a′ Q(s′, a′) is the target and Q(s, a) is the prediction [83] • DQN: same network for both Q • Double DQN: separate network for each Q • Helps reduce the bias introduced by the inaccuracies of the Q network at the beginning of training
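  A sketch of how the DQN and Double DQN targets differ, in PyTorch; q_net and target_net are assumed to be two QNetwork instances as above, done is assumed to be a 0/1 float tensor, and this is illustrative rather than the exact implementation behind the slide:

  import torch

  gamma = 0.99

  def dqn_target(target_net, r, s_next, done):
      # target network both selects and evaluates the next action
      with torch.no_grad():
          max_q_next = target_net(s_next).max(dim=1).values
      return r + gamma * (1.0 - done) * max_q_next

  def double_dqn_target(q_net, target_net, r, s_next, done):
      # online network selects the action, target network evaluates it
      with torch.no_grad():
          best_a = q_net(s_next).argmax(dim=1, keepdim=True)
          q_next = target_net(s_next).gather(1, best_a).squeeze(1)
      return r + gamma * (1.0 - done) * q_next

  # loss = ((target - q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)) ** 2).mean()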
  • 45. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references DQN Tricks • Experience Replay • Stores experiences (actions, state transitions, and rewards) and creates mini-batches from them for the training process • Fixed Target Network • The target in the error calculation depends on the network parameters and thus changes quickly; updating it only every 1,000 steps increases the stability of the training process. [83, 167]
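  A minimal experience replay sketch matching the slide's description: store (s, a, r, s′, done) transitions and sample uniform random mini-batches from them. The capacity and batch size are illustrative assumptions:

  import random
  from collections import deque

  class ReplayBuffer:
      def __init__(self, capacity=100_000):
          self.buffer = deque(maxlen=capacity)    # oldest experiences are dropped first

      def push(self, s, a, r, s_next, done):
          self.buffer.append((s, a, r, s_next, done))

      def sample(self, batch_size=32):
          batch = random.sample(self.buffer, batch_size)
          return map(list, zip(*batch))           # lists of states, actions, rewards, next states, dones

      def __len__(self):
          return len(self.buffer)

  The fixed target network trick then amounts to periodically copying the online network's weights into a second, frozen copy, e.g. target_net.load_state_dict(q_net.state_dict()) every 1,000 steps (the exact interval is a design choice).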
  • 46. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references Atari Breakout [85] (Videos: gameplay after 10, 120, and 240 minutes of training)
  • 47. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references DQN Results in Atari [83]
  • 48. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Dueling DQN (DDQN) • Decompose Q(s,a) • V(s): the value of being at that state • A(s,a): the advantage of taking action a in state s versus all other possible actions at that state • Use two streams: • one that estimates the state value V(s) • one that estimates the advantage for each action A(s,a) • Useful for states where action choice does not affect Q(s,a) [336]
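  A sketch of the dueling head decomposition Q(s,a) = V(s) + A(s,a) − mean_a A(s,a), in PyTorch; the two streams sit on top of a shared feature extractor, and feature_dim and layer sizes are assumptions for illustration:

  import torch
  import torch.nn as nn

  class DuelingHead(nn.Module):
      def __init__(self, feature_dim, n_actions):
          super().__init__()
          self.value = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, 1))
          self.advantage = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))

      def forward(self, features):
          v = self.value(features)                      # V(s): value of being in this state
          a = self.advantage(features)                  # A(s,a): advantage of each action
          return v + a - a.mean(dim=1, keepdim=True)    # subtract the mean advantage for identifiability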
  • 49. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Prioritized Experience Replay • Sample experiences based on impact, not frequency of occurrence [336]
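  A minimal sketch of proportional prioritized sampling: transitions with larger TD error ("impact") are drawn more often. Real implementations use a sum-tree and importance-sampling weights; the exponent alpha and the small epsilon added to priorities are illustrative assumptions:

  import numpy as np

  def sample_prioritized(priorities, batch_size, alpha=0.6):
      p = np.asarray(priorities, dtype=np.float64) ** alpha
      probs = p / p.sum()
      return np.random.choice(len(priorities), size=batch_size, p=probs)

  # after training on a sampled transition i:
  #   priorities[i] = abs(td_error_i) + 1e-6   # keep every transition sampleable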
  • 50. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Taxonomy of RL Methods [334]
  • 51. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references Policy Gradient (PG) • DQN (off-policy): Approximate Q and infer optimal policy • PG (on-policy): Directly optimize policy space [63] Good illustrative explanation: http://karpathy.github.io/2016/05/31/rl/ “Deep Reinforcement Learning: Pong from Pixels” Policy Network
  • 52. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references Policy Gradient – Training [63, 204] • REINFORCE: Policy gradient that increases the probability of good actions and decreases the probability of bad actions: ∇θ J(θ) ≈ E[ Σt ∇θ log π(at|st; θ) · Rt ]
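  A minimal REINFORCE loss sketch in PyTorch: the log-probabilities of the actions actually taken are weighted by the discounted return. The return normalization is a common variance-reduction trick and, like the variable names, is an assumption for illustration:

  import torch

  def reinforce_loss(log_probs, rewards, gamma=0.99):
      """log_probs: list of log pi(a_t|s_t) tensors from the policy network; rewards: list of floats."""
      returns, g = [], 0.0
      for r in reversed(rewards):                 # discounted return-to-go
          g = r + gamma * g
          returns.insert(0, g)
      returns = torch.tensor(returns)
      returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize to reduce variance
      return -(torch.stack(log_probs) * returns).sum()               # minimize negative = gradient ascent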
  • 53. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Policy Gradient • Pros vs DQN: • Messy World: If the Q-function is too complex to be learned, DQN may fail miserably, while PG will still learn a good policy. • Speed: Faster convergence • Stochastic Policies: Capable of learning stochastic policies - DQN can’t • Continuous actions: Much easier to model continuous action spaces • Cons vs DQN: • Data: Sample inefficient (needs more data) • Stability: Less stable during the training process • Poor credit assignment to (state, action) pairs for delayed rewards [63, 335] Problem with REINFORCE: calculating the reward only at the end of the episode means all actions get credited as good whenever the total reward was high.
  • 54. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Taxonomy of RL Methods [334]
  • 55. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Advantage Actor-Critic (A2C) • Combine DQN (value-based) and REINFORCE (policy-based) • Two neural networks (Actor and Critic): • Actor is policy-based: Samples the action from a policy • Critic is value-based: Measures how good the chosen action is [335] • Update at each time step - temporal difference (TD) learning
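  A sketch of the one-step actor-critic (TD) update the slide describes: the critic's TD error serves as the advantage that scales the actor's log-probability. The critic, a shared optimizer over both networks, and log_prob = log π(a|s) from the actor are assumed to exist; this is illustrative, not the slide's exact implementation:

  import torch

  def a2c_step(critic, optimizer, log_prob, r, s, s_next, done, gamma=0.99):
      v = critic(s)                                     # V(s) from the critic
      with torch.no_grad():
          v_next = torch.zeros_like(v) if done else critic(s_next)
          td_target = r + gamma * v_next                # bootstrapped target for V(s)
      advantage = (td_target - v).detach()              # how much better than expected the action was

      critic_loss = (td_target - v).pow(2).mean()       # move V(s) toward the TD target
      actor_loss = -(log_prob * advantage).mean()       # raise probability of advantageous actions
      optimizer.zero_grad()
      (actor_loss + critic_loss).backward()
      optimizer.step()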
  • 56. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Asynchronous Advantage Actor-Critic (A3C) [337] • Both A3C and A2C use parallelism in training • A2C syncs up for the global parameter update and then starts each iteration with the same policy
  • 57. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Taxonomy of RL Methods [334]
  • 58. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Deep Deterministic Policy Gradient (DDPG) • Actor-Critic framework for learning a deterministic policy • Can be thought of as: DQN for continuous action spaces • As with DQN, the following tricks are required: • Experience Replay • Target network • Exploration: add noise to actions, reducing the scale of the noise as training progresses [341]
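  A sketch of two DDPG ingredients from the slide: Gaussian exploration noise added to the deterministic actor's action, and a soft ("Polyak") target-network update. The names (actor, noise_scale, tau) and bounds are illustrative assumptions:

  import torch

  def noisy_action(actor, s, noise_scale=0.1, low=-1.0, high=1.0):
      with torch.no_grad():
          a = actor(s)
          a = a + noise_scale * torch.randn_like(a)     # exploration noise; decay noise_scale over training
      return a.clamp(low, high)                         # keep the perturbed action within valid bounds

  def soft_update(target_net, net, tau=0.005):
      # Polyak averaging: the target network slowly tracks the online network
      for p_target, p in zip(target_net.parameters(), net.parameters()):
          p_target.data.mul_(1.0 - tau).add_(tau * p.data)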
  • 59. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Taxonomy of RL Methods [334]
  • 60. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Policy Optimization • Progress beyond Vanilla Policy Gradient: • Natural Policy Gradient • TRPO • PPO • Basic idea in on-policy optimization: avoid update steps so large that they collapse training performance. [339] Line Search: First pick direction, then step size Trust Region: First pick step size, then direction
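  A sketch of PPO's clipped surrogate objective, one common realization of the "don't take steps that collapse performance" idea above (TRPO instead enforces a KL-divergence constraint); clip_eps = 0.2 is a typical choice and the variable names are assumptions for illustration:

  import torch

  def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
      ratio = torch.exp(new_log_probs - old_log_probs)           # pi_new(a|s) / pi_old(a|s)
      unclipped = ratio * advantages
      clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
      return -torch.min(unclipped, clipped).mean()                # pessimistic bound, maximized via gradient ascent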
  • 61. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Taxonomy of RL Methods [334]
  • 62. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references Game of Go [170]
  • 63. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references AlphaGo (2016) Beat Top Human at Go [83]
  • 64. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references AlphaGo Zero (2017): Beats AlphaGo [149]
  • 65. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references AlphaGo Zero Approach [170] • Same as the best before: Monte Carlo Tree Search (MCTS) • Balance exploitation/exploration (going deep on promising positions or exploring new underplayed positions) • Use a neural network as “intuition” for which positions to expand as part of MCTS (same as AlphaGo)
  • 66. 2019https://deeplearning.mit.eduFor the full updated list of references visit: https://selfdrivingcars.mit.edu/references AlphaGo Zero Approach [170] • Same as the best before: Monte Carlo Tree Search (MCTS) • Balance exploitation/exploration (going deep on promising positions or exploring new underplayed positions) • Use a neural network as “intuition” for which positions to expand as part of MCTS (same as AlphaGo) • “Tricks” • Use MCTS intelligent look-ahead (instead of human games) to improve value estimates of play options • Multi-task learning: “two-headed” network that outputs (1) move probability and (2) probability of winning. • Updated architecture: use residual networks
  • 67. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references AlphaZero vs Chess, Shogi, Go [340]
  • 68. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references AlphaZero vs Chess, Shogi, Go [340]
  • 69. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references AlphaZero vs Chess, Shogi, Go [340]
  • 70. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references To date, for most successful robots operating in the real world: Deep RL is not involved [169]
  • 71. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references To date, for most successful robots operating in the real world: Deep RL is not involved
  • 72. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references But… that’s slowly changing: Learning Control Dynamics [343]
  • 73. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references But… that’s slowly changing: Learning to Drive: Beyond Pure Imitation (Waymo) [343]
  • 74. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references The Challenge for RL in Real-World Applications Reminder: Supervised learning: teach by example Reinforcement learning: teach by experience Two options: 1. Real-world observation + one-shot trial & error 2. Realistic simulation + transfer learning Open challenges: 1. Improve transfer learning 2. Improve simulation
  • 75. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Thinking Outside the Box: Multiverse Theory and the Simulation Hypothesis • Create an (infinite) set of simulation environments to learn in so that our reality becomes just another sample from the set. [342]
  • 76. 2019https://deeplearning.mit.eduFor the full list of references visit: https://hcai.mit.edu/references Next Steps in Deep RL • Lectures: https://deeplearning.mit.edu • Tutorials: https://github.com/lexfridman/mit-deep-learning • Advice for Research (from Spinning Up as a Deep RL Researcher by Joshua Achiam) • Background • Fundamentals in probability, statistics, multivariate calculus. • Deep learning basics • Deep RL basics • TensorFlow (or PyTorch) • Learn by doing • Implement core deep RL algorithms (discussed today) • Look for tricks and details in papers that were key to get it to work • Iterate fast in simple environments (see Einstein quote on simplicity) • Research • Improve on an existing approach • Focus on an unsolved task / benchmark • Create a new task / problem that hasn’t been addressed with RL [339]