WORLD MODELS
Presentation By Duane Nielsen And Pushkar Merwah
World models
Background: Reinforcement Learning
World Model Architecture
 View
 Model
 Controller
Experiment 1 – OpenAI Gym – Car Racing v0
 Ablation Study
Experiment 2 – VizDoom
 Training Policy From “World Model Dream”
Basics of Reinforcement Learning
Left Brain
What is reinforcement learning?
Learning the expert representation required to achieve an objective, given a success metric, from raw
inputs and without domain knowledge.
This gives it a unique advantage over supervised learning: it does not require the environment to be
static, and it learns to make decisions the way intelligent agents do, such as weighing the value of
new knowledge (exploration vs. exploitation)
Left Brain
© 2018 Copyright: Left Brain
Key concepts in reinforcement learning
Planning an optimal way to achieve an objective
Learning value of being in a state
Learning the policy of how to act in a given state
Learning to act in an environment indirectly, by experiencing a learned representation (model) of that
environment
Left Brain
© 2018 Copyright: Left Brain
General policy iteration
[Diagram: policy and value iteratively refining each other]
Policy evaluation: evaluate the policy to get state values
Policy improvement: act greedily on those values to improve the policy
Alternating the two converges to an optimal policy
Left Brain
© 2018 Copyright: Left Brain
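As a concrete illustration, here is a minimal Python sketch of general policy iteration on a toy, hypothetical 2-state MDP (the transition table P and rewards R are invented for illustration, not taken from the slides):

import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# P[s, a] -> next state, R[s, a] -> reward (toy values for illustration)
P = np.array([[0, 1], [1, 0]])
R = np.array([[0.0, 1.0], [2.0, 0.0]])

policy = np.zeros(n_states, dtype=int)
V = np.zeros(n_states)

for _ in range(100):
    # Policy evaluation: iterate the Bellman expectation backup
    for _ in range(50):
        V = R[np.arange(n_states), policy] + gamma * V[P[np.arange(n_states), policy]]
    # Policy improvement: act greedily on the evaluated values
    new_policy = np.argmax(R + gamma * V[P], axis=1)
    if np.array_equal(new_policy, policy):
        break  # converged to an optimal policy
    policy = new_policy

print("optimal policy:", policy, "state values:", V)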
Markov decision process
The future is independent of the past given the present
(State, action, transition, reward, discount)
[Diagram: the agent/controller and the environment/system in a loop: action out, reward and new state back]
1-2: At a given state, Agent() takes an action
3-4: Env() returns a reward and the next state
© 2018 Copyright: Left Brain
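This four-step loop maps directly onto the OpenAI Gym API; a minimal sketch with a random agent, assuming the classic (pre-0.26) gym step() signature:

import gym

env = gym.make("CarRacing-v0")   # the environment used in Experiment 1
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()             # steps 1-2: at the given state, the agent takes an action
    state, reward, done, info = env.step(action)   # steps 3-4: the env returns a reward and the next state
env.close()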
Interacting Between Planning Acting And Learning
[Diagram: Value/Policy, Model, and Experience linked by planning, acting, model learning, and direct RL]
Left Brain
© 2018 Copyright: Left Brain
Interacting Between Planning Acting And Learning
[The same diagram, annotated "You Are Here"]
Left Brain
CONTRIBUTIONS
Models the environment with an unsupervised, low-dimensional representation
A recurrent mixture model captures the effect of the agent's actions on the environment
stochastically, which helps the controller anticipate the next move
RL is applied to a policy trained inside the model, and that policy transfers to the
“real” environment
WORLD MODELS
D. Ha and J. Schmidhuber
World Model Architecture
• V – VIEW
• M – MODEL
• C – CONTROLLER
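As a preview, a minimal sketch of how V, M, and C could interact at each timestep; vae, rnn, and controller are hypothetical stand-ins for the components described on the following slides:

def world_model_step(obs, h, vae, rnn, controller):
    z = vae.encode(obs)        # V: compress the raw frame into the latent z
    a = controller.act(z, h)   # C: choose an action from [z, h]
    h = rnn.step(z, a, h)      # M: update the recurrent state, predicting the next z
    return a, h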
V – The View
A Typical Random Image
Actual Images
Set Of All Images
[Diagram: real images form a small subset of the space of random images]
Z Latent Space
• If real images are a subset of the entire space of images, then in theory we should be able to encode
them with less information than the full space requires
• This smaller “space” of variables is called the latent space “z”
Autoencoder – Encoder-decoder Network
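A minimal encoder-decoder sketch in PyTorch; note the paper itself uses a convolutional VAE with a 32-dimensional z, and the dense layers and sizes here are only illustrative:

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=64 * 64 * 3, z_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, z_dim))      # image -> z
        self.decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))  # z -> reconstruction

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Training minimizes reconstruction error, e.g. nn.MSELoss()(x_hat, x),
# which forces z to capture the information that real images share.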
Some Examples Of Latent Spaces
Z examples
 http://vecg.cs.ucl.ac.uk/projects/projects_fonts/projects_fonts.html
 https://worldmodels.github.io/
So What Is The Use Of Z ?
• Z is a smaller space, so reduces the “state space” the model needs to deal with
• Z-values should contain more meaningful information
• Z-values, trained with enough data, should generalize to novel, unseen yet similar environments
M – The Model
Basics Of Gaussian Mixture Models
Left Brain
How To Solve Problems That Have Multimodal Solutions
[Plots: y = f(x) has a single solution for each x; the inverse problem x = f(y) is multimodal, with
several valid x for one y]
Left Brain
Proposal: model a probability distribution over the observed state
© 2018 Copyright: Left Brain
MDN-RNN MODEL (M): Predict The Future Z Value
Since the environment is stochastic, we model a distribution rather than discrete values
[Diagram: the MDN-RNN unrolled over three timesteps; at each step the RNN takes (at, zt, ht) and the
MDN head outputs a mixture density over the next latent, p(zt+1 | at, zt, ht), with a temperature
parameter controlling sampling randomness]
Left Brain
© 2018 Copyright: Left Brain
MDN-RNN MODEL (M): Predict The Future Z Value
Since the environment is stochastic, we model a distribution rather than discrete values
[Same MDN-RNN diagram: p(zt+1 | at, zt, ht), with the temperature parameter]
So now we can predict the future image, e.g. prepare for the car to make a turn
The relationship between zt and ht will be trained by the controller
Left Brain
© 2018 Copyright: Left Brain
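A minimal sketch of sampling the next z from an MDN head with a temperature knob, assuming the network has already produced mixture logits, means, and standard deviations; dividing the logits by the temperature and widening each Gaussian by sqrt(temperature) is one common implementation choice, not necessarily the authors' exact one:

import numpy as np

def sample_mdn(logits, mu, sigma, temperature=1.0):
    # Higher temperature flattens the mixture weights and widens each Gaussian,
    # making the predicted future more random; lower temperature is more deterministic.
    pi = np.exp(logits / temperature)
    pi /= pi.sum()
    k = np.random.choice(len(pi), p=pi)   # pick a mixture component
    return np.random.normal(mu[k], sigma[k] * np.sqrt(temperature))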
Notebook Example Of GMM
Left Brain
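A minimal GMM sketch with scikit-learn, fitting a two-component mixture to synthetic bimodal 1-D data, analogous to the multimodal x = f(y) case above:

import numpy as np
from sklearn.mixture import GaussianMixture

data = np.concatenate([np.random.normal(-2, 0.5, 500),
                       np.random.normal(3, 0.8, 500)]).reshape(-1, 1)
gmm = GaussianMixture(n_components=2).fit(data)
print(gmm.means_.ravel(), gmm.weights_)   # recovers the two modes and their weights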
C – The Controller
Basics Of Evolution Strategies
Left Brain
Evolution strategies (ES) are best described as a gradient-descent-like method that uses gradients
estimated from stochastic perturbations around the current parameter value
Parameter-exploring policy gradients (PEPG)
REINFORCE-ES
Natural evolution strategies (NES): weak solutions contain information about what not to do, and this is
valuable information for calculating a better estimate for the next generation
Simple ES: sample a set of solutions from a normal distribution with mean μ and a fixed standard
deviation σ (see the sketch after this slide)
Simple genetic ES: genetic algorithms maintain diversity by keeping track of a diverse set of candidate
solutions to reproduce the next generation
OpenAI's ES: keeps σ constant; it does not require calculating a covariance matrix, so it needs fewer FLOPs
Covariance Matrix Adaptation ES (CMA-ES, used in this paper)
Evolution strategies offer an alternative to backpropagation, making
them easier to scale across machines
Left Brain
© 2018 Copyright: Left Brain
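The promised sketch of the simple ES variant: sample a population around a mean μ with a fixed σ and move the mean toward the fittest sample; the quadratic fitness function is a toy stand-in:

import numpy as np

def fitness(w):                       # toy objective, maximized at w = (3, 3)
    return -np.sum((w - 3.0) ** 2)

mu, sigma, pop_size = np.zeros(2), 0.5, 50
for generation in range(200):
    population = mu + sigma * np.random.randn(pop_size, 2)  # perturb around the mean
    scores = np.array([fitness(p) for p in population])
    mu = population[np.argmax(scores)]                      # keep the best as the new mean
print(mu)   # converges near (3, 3)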
Covariance Matrix Adaptation - ES
The algorithm uses the results of each generation to compute the next iteration. It adaptively increases or
decreases the search space for the next generation, modulating the mean and the sigma of the parameters. This
means calculating an entirely new covariance matrix over the parameter space at every iteration. At each
generation, the algorithm provides a multivariate normal distribution to sample from
References:
 Evolution Strategies as a Scalable Alternative to Reinforcement Learning (arXiv:1703.03864)
 blog.openai.com/evolution-strategies/
 The CMA Evolution Strategy: A Tutorial (arXiv:1604.00772)
Left Brain
© 2018 Copyright: Left Brain
Notebook Example Of CMA-ES
Left Brain
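A minimal CMA-ES sketch using the cma package (pip install cma); the quadratic objective is a toy stand-in for the controller's negative cumulative reward:

import cma

def objective(w):                       # CMA-ES minimizes, so return a cost
    return sum((x - 1.0) ** 2 for x in w)

es = cma.CMAEvolutionStrategy(x0=[0.0] * 5, sigma0=0.5)
while not es.stop():
    solutions = es.ask()                # sample a generation from N(mean, sigma^2 * C)
    es.tell(solutions, [objective(s) for s in solutions])   # adapt mean, sigma, and C
print(es.result.xbest)                  # best solution found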
Explain Covariance Matrix
      A    B    C
A  [ a1   a2   a3 ]
B  [ b1   b2   b3 ]
C  [ c1   c2   c3 ]
The diagonal entries, e.g. cov(B,B), tell us how much noise (variance) exists within a
single parameter
The off-diagonal entries, e.g. cov(C,A), tell us how much change in C is captured in A
(covariance)
Left Brain
© 2018 Copyright: Left Brain
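A minimal NumPy sketch of the matrix above: three parameters where B co-varies with A and C is independent; np.cov puts each parameter's variance on the diagonal and the pairwise covariances off it:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal(1000)
B = 0.8 * A + 0.2 * rng.standard_normal(1000)   # B moves with A
C = rng.standard_normal(1000)                   # C is independent
cov = np.cov(np.vstack([A, B, C]))
print(cov.round(2))   # cov[1, 1] = var(B); cov[2, 0] = cov(C, A) ≈ 0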
(Recap) MDN-RNN Model (M): predict the future z value, p(zt+1 | at, zt, ht)
Results
Experiment 1: Car Race – Ablation
 SO DOES THIS WORK?
 TO TEST, THE OPENAI GYM CAR RACING ENVIRONMENT WAS USED
 THE MODEL TOPPED THE LEADERBOARD
 THE MODEL WAS RUN WITH AND WITHOUT THE M COMPONENT
Car Race Results
Parameter Counts For Car Race
Experiment 2 – The Advantage Of Dreams
• NOTE THAT V AND M ARE TRAINED COMPLETELY BY UNSUPERVISED LEARNING
USING A RANDOM POLICY
• AFTER TRAINING M EFFECTIVELY BECOMES A “SIMULATION” OF THE REAL
ENVIRONMENT
• IF WE TRAIN A POLICY USING M, WILL IT WORK ON THE REAL ENVIRONMENT?
SO WHY USE M AND NOT THE REAL
ENVIRONMENT?
• M RUNS FASTER BECAUSE
• THE DIMENSIONALITY IS REDUCED TO Z
• IT'S VECTORIZED AND THEREFORE OPTIMIZED FOR HARDWARE
ACCELERATION
• M IS NOT DETERMINISTIC, AND HAS A “TEMPERATURE”
PARAMETER
ADDING RANDOMNESS TO REDUCE GAMING OF THE SIMULATION
• GAMING A SYSTEM GENERALLY RELIES UPON EXPLOITING “EDGE CASES”
• ADDING “RANDOMNESS” TO A SIMULATION REDUCES THE RELIABILITY OF EDGE
CASES AND MAKES THE UNDERLYING SIMULATION HARDER TO GAME
• IT ALSO CAUSES THE POLICY TO BECOME MORE REDUNDANT AND ROBUST
• https://blog.openai.com/generalizing-from-simulation/
• M COMES WITH A “RANDOMNESS” SLIDER BUILT IN!
• https://worldmodels.github.io
Further the discussion
Have A Longer Conversation On The State Of AI/ML – Join Us For A Free Demo
http://training.leftbrain.consulting/
Gain Hands-On Experience Programming From Publications
Left Brain