WORLD MODELS
Presentation By Duane Nielsen And Pushkar Merwah
World models
Background: Reinforcement Learning
World Model Architecture
 View
 Model
 Controller
Experiment 1 – OpenAI Gym – Car Racing v0
 Ablation Study
Experiment 2 – VizDoom
 Training Policy From “World Model Dream”
Basics of Reinforcement Learning
Left Brain
What is reinforcement learning?
Learning the expert representation required to achieve an objective, given a success metric, from raw
inputs and without domain knowledge.
This gives it a unique advantage over supervised learning: it does not require the environment to be
static, and it learns to make decisions the way intelligent agents do, such as weighing the value of
new knowledge (exploration vs. exploitation)
Left Brain
© 2018 Copyright: Left Brain
Key concepts in reinforcement learning
Planning an optimal way to achieve an objective
Learning value of being in a state
Learning the policy of how to act in a given state
Learning to act in an environment indirectly, by experiencing a learned representation (model) of that
environment
Left Brain
© 2018 Copyright: Left Brain
General policy iteration
[Diagram: policy and value iteratively refining each other]
Policy evaluation: evaluate the policy to get state values
Policy improvement: act greedily on those values to improve the policy
Alternating the two converges to an optimal policy
Left Brain
© 2018 Copyright: Left Brain
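As a concrete illustration, here is a minimal Python sketch of general policy iteration on a toy, hypothetical 2-state MDP (the transition table P and rewards R are invented for illustration, not taken from the slides):

import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# P[s, a] -> next state, R[s, a] -> reward (toy values for illustration)
P = np.array([[0, 1], [1, 0]])
R = np.array([[0.0, 1.0], [2.0, 0.0]])

policy = np.zeros(n_states, dtype=int)
V = np.zeros(n_states)

for _ in range(100):
    # Policy evaluation: iterate the Bellman expectation backup
    for _ in range(50):
        V = R[np.arange(n_states), policy] + gamma * V[P[np.arange(n_states), policy]]
    # Policy improvement: act greedily on the evaluated values
    new_policy = np.argmax(R + gamma * V[P], axis=1)
    if np.array_equal(new_policy, policy):
        break  # converged to an optimal policy
    policy = new_policy

print("optimal policy:", policy, "state values:", V)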
Markov decision process
The future is independent of the past given the present
(State, action, transition, reward, discount)
[Diagram: the agent/controller and the environment/system in a loop: action out, reward and new state back]
1-2: At a given state, Agent() takes an action
3-4: Env() returns a reward and the next state
© 2018 Copyright: Left Brain
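This four-step loop maps directly onto the OpenAI Gym API; a minimal sketch with a random agent, assuming the classic (pre-0.26) gym step() signature:

import gym

env = gym.make("CarRacing-v0")   # the environment used in Experiment 1
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()             # steps 1-2: at the given state, the agent takes an action
    state, reward, done, info = env.step(action)   # steps 3-4: the env returns a reward and the next state
env.close()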
Interacting Between Planning Acting And Learning
[Diagram: Value/Policy, Model, and Experience linked by planning, acting, model learning, and direct RL]
Left Brain
© 2018 Copyright: Left Brain
Interacting Between Planning Acting And Learning
[The same diagram, annotated "You Are Here"]
Left Brain
CONTRIBUTIONS
Models the environment with an unsupervised, low-dimensional representation
A recurrent mixture model captures the effect of the agent's actions on the environment
stochastically, which helps the controller anticipate the next move
RL is applied to a policy trained inside the model, and that policy transfers to the
“real” environment
WORLD MODELS
D. Ha and J. Schmidhuber
World Model Architecture
• V – VIEW
• M – MODEL
• C – CONTROLLER
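As a preview, a minimal sketch of how V, M, and C could interact at each timestep; vae, rnn, and controller are hypothetical stand-ins for the components described on the following slides:

def world_model_step(obs, h, vae, rnn, controller):
    z = vae.encode(obs)        # V: compress the raw frame into the latent z
    a = controller.act(z, h)   # C: choose an action from [z, h]
    h = rnn.step(z, a, h)      # M: update the recurrent state, predicting the next z
    return a, h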
V – The View
A Typical Random Image
Actual Images
Set Of All Images
[Diagram: real images form a small subset of the space of random images]
Z Latent Space
• If real images are a subset of the entire space of images, then in theory we should be able to encode
them with less information than the full space requires
• This smaller “space” of variables is called the latent space “z”
Autoencoder – Encoder-decoder Network
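A minimal encoder-decoder sketch in PyTorch; note the paper itself uses a convolutional VAE with a 32-dimensional z, and the dense layers and sizes here are only illustrative:

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=64 * 64 * 3, z_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, z_dim))      # image -> z
        self.decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))  # z -> reconstruction

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Training minimizes reconstruction error, e.g. nn.MSELoss()(x_hat, x),
# which forces z to capture the information that real images share.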
Some Examples Of Latent Spaces
Z examples
 http://vecg.cs.ucl.ac.uk/projects/projects_fonts/projects_fonts.html
 https://worldmodels.github.io/
So What Is The Use Of Z ?
• Z is a smaller space, so reduces the “state space” the model needs to deal with
• Z-values should contain more meaningful information
• Z-values, trained with enough data, should generalize to novel, unseen yet similar environments
M – The Model
Basics Of Gaussian Mixture Models
Left Brain
How To Solve Problems That Have Multimodal Solutions
[Plots: y = f(x) has a single solution for each x; the inverse problem x = f(y) is multimodal, with
several valid x for one y]
Left Brain
Proposal: model a probability distribution over the observed state
© 2018 Copyright: Left Brain
MDN-RNN MODEL (M): Predict The Future Z Value
Since the environment is stochastic, we model a distribution rather than discrete values
[Diagram: the MDN-RNN unrolled over three timesteps; at each step the RNN takes (at, zt, ht) and the
MDN head outputs a mixture density over the next latent, p(zt+1 | at, zt, ht), with a temperature
parameter controlling sampling randomness]
Left Brain
© 2018 Copyright: Left Brain
MDN-RNN MODEL (M): Predict The Future Z Value
Since the environment is stochastic, we model a distribution rather than discrete values
[Same MDN-RNN diagram: p(zt+1 | at, zt, ht), with the temperature parameter]
So now we can predict the future image, e.g. prepare for the car to make a turn
The relationship between zt and ht will be trained by the controller
Left Brain
© 2018 Copyright: Left Brain
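A minimal sketch of sampling the next z from an MDN head with a temperature knob, assuming the network has already produced mixture logits, means, and standard deviations; dividing the logits by the temperature and widening each Gaussian by sqrt(temperature) is one common implementation choice, not necessarily the authors' exact one:

import numpy as np

def sample_mdn(logits, mu, sigma, temperature=1.0):
    # Higher temperature flattens the mixture weights and widens each Gaussian,
    # making the predicted future more random; lower temperature is more deterministic.
    pi = np.exp(logits / temperature)
    pi /= pi.sum()
    k = np.random.choice(len(pi), p=pi)   # pick a mixture component
    return np.random.normal(mu[k], sigma[k] * np.sqrt(temperature))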
Notebook Example Of GMM
Left Brain
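A minimal GMM sketch with scikit-learn, fitting a two-component mixture to synthetic bimodal 1-D data, analogous to the multimodal x = f(y) case above:

import numpy as np
from sklearn.mixture import GaussianMixture

data = np.concatenate([np.random.normal(-2, 0.5, 500),
                       np.random.normal(3, 0.8, 500)]).reshape(-1, 1)
gmm = GaussianMixture(n_components=2).fit(data)
print(gmm.means_.ravel(), gmm.weights_)   # recovers the two modes and their weights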
C – The Controller
Basics Of Evolution Strategies
Left Brain
Evolution strategies (ES) are best described as a gradient-descent-like method that uses gradients
estimated from stochastic perturbations around the current parameter value
Parameter-exploring policy gradients (PEPG)
REINFORCE-ES
Natural evolution strategies (NES): weak solutions contain information about what not to do, and this is
valuable information for calculating a better estimate for the next generation
Simple ES: sample a set of solutions from a normal distribution with mean μ and a fixed standard
deviation σ (see the sketch after this slide)
Simple genetic ES: genetic algorithms maintain diversity by keeping track of a diverse set of candidate
solutions to reproduce the next generation
OpenAI's ES: keeps σ constant; it does not require calculating a covariance matrix, so it needs fewer FLOPs
Covariance Matrix Adaptation ES (CMA-ES, used in this paper)
Evolution strategies offer an alternative to backpropagation, making
them easier to scale across machines
Left Brain
© 2018 Copyright: Left Brain
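The promised sketch of the simple ES variant: sample a population around a mean μ with a fixed σ and move the mean toward the fittest sample; the quadratic fitness function is a toy stand-in:

import numpy as np

def fitness(w):                       # toy objective, maximized at w = (3, 3)
    return -np.sum((w - 3.0) ** 2)

mu, sigma, pop_size = np.zeros(2), 0.5, 50
for generation in range(200):
    population = mu + sigma * np.random.randn(pop_size, 2)  # perturb around the mean
    scores = np.array([fitness(p) for p in population])
    mu = population[np.argmax(scores)]                      # keep the best as the new mean
print(mu)   # converges near (3, 3)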
Covariance Matrix Adaptation - ES
The algorithm uses the results of each generation to compute the next iteration. It adaptively increases or
decreases the search space for the next generation, modulating the mean and the sigma of the parameters. This
means calculating an entirely new covariance matrix over the parameter space at every iteration. At each
generation, the algorithm provides a multivariate normal distribution to sample from
References:
 Evolution Strategies as a Scalable Alternative to Reinforcement Learning (arXiv:1703.03864)
 blog.openai.com/evolution-strategies/
 The CMA Evolution Strategy: A Tutorial (arXiv:1604.00772)
Left Brain
© 2018 Copyright: Left Brain
Notebook Example Of CMA-ES
Left Brain
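A minimal CMA-ES sketch using the cma package (pip install cma); the quadratic objective is a toy stand-in for the controller's negative cumulative reward:

import cma

def objective(w):                       # CMA-ES minimizes, so return a cost
    return sum((x - 1.0) ** 2 for x in w)

es = cma.CMAEvolutionStrategy(x0=[0.0] * 5, sigma0=0.5)
while not es.stop():
    solutions = es.ask()                # sample a generation from N(mean, sigma^2 * C)
    es.tell(solutions, [objective(s) for s in solutions])   # adapt mean, sigma, and C
print(es.result.xbest)                  # best solution found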
Explain Covariance Matrix
      A    B    C
A  [ a1   a2   a3 ]
B  [ b1   b2   b3 ]
C  [ c1   c2   c3 ]
The diagonal entries, e.g. cov(B,B), tell us how much noise (variance) exists within a
single parameter
The off-diagonal entries, e.g. cov(C,A), tell us how much change in C is captured in A
(covariance)
Left Brain
© 2018 Copyright: Left Brain
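A minimal NumPy sketch of the matrix above: three parameters where B co-varies with A and C is independent; np.cov puts each parameter's variance on the diagonal and the pairwise covariances off it:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal(1000)
B = 0.8 * A + 0.2 * rng.standard_normal(1000)   # B moves with A
C = rng.standard_normal(1000)                   # C is independent
cov = np.cov(np.vstack([A, B, C]))
print(cov.round(2))   # cov[1, 1] = var(B); cov[2, 0] = cov(C, A) ≈ 0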
(Recap) MDN-RNN Model (M): predict the future z value, p(zt+1 | at, zt, ht)
Results
Experiment 1: Car Race – Ablation
 SO DOES THIS WORK?
 TO TEST, THE OPENAI GYM CAR RACING ENVIRONMENT WAS USED
 THE MODEL TOPPED THE LEADERBOARD
 THE MODEL WAS RUN WITH AND WITHOUT THE M COMPONENT
Car Race Results
Parameter Counts For Car Race
Experiment 2 – The Advantage Of Dreams
• NOTE THAT V AND M ARE TRAINED COMPLETELY BY UNSUPERVISED LEARNING
USING A RANDOM POLICY
• AFTER TRAINING M EFFECTIVELY BECOMES A “SIMULATION” OF THE REAL
ENVIRONMENT
• IF WE TRAIN A POLICY USING M, WILL IT WORK ON THE REAL ENVIRONMENT?
SO WHY USE M AND NOT THE REAL
ENVIRONMENT?
• M RUNS FASTER BECAUSE
• THE DIMENSIONALITY IS REDUCED TO Z
• IT'S VECTORIZED AND THEREFORE OPTIMIZED FOR HARDWARE
ACCELERATION
• M IS NOT DETERMINISTIC, AND HAS A “TEMPERATURE”
PARAMETER
ADDING RANDOMNESS TO REDUCE GAMING OF THE SIMULATION
• GAMING A SYSTEM GENERALLY RELIES UPON EXPLOITING “EDGE CASES”
• ADDING “RANDOMNESS” TO A SIMULATION REDUCES THE RELIABILITY OF EDGE
CASES AND MAKES THE UNDERLYING SIMULATION HARDER TO GAME
• IT ALSO CAUSES THE POLICY TO BECOME MORE REDUNDANT AND ROBUST
• https://blog.openai.com/generalizing-from-simulation/
• M COMES WITH A “RANDOMNESS” SLIDER BUILT IN!
• https://worldmodels.github.io
Further the discussion
Have A Longer Conversation On The State Of AI/ML – Join Us For A Free Demo
http://training.leftbrain.consulting/
Gain Hands-On Experience Programming From Publications
Left Brain