MATT Master thesis defense by Juan José Nieto.
Advised by Víctor Campos and Xavier Giro-i-Nieto.
27th May 2021.
Pre-training Reinforcement Learning (RL) agents in a task-agnostic manner has shown promising results. However, previous works still struggle to discover and learn meaningful skills in high-dimensional state spaces. We approach the problem by leveraging unsupervised skill discovery and self-supervised learning of state representations. In our work, we learn a compact latent representation using variational or contrastive techniques, and demonstrate that both allow learning a set of basic navigation skills by maximizing an information-theoretic objective. We assess our method on Minecraft 3D maps of varying complexity. Our results show that representations and conditioned policies learned from pixels are enough for toy examples, but do not scale to realistic and complex maps. We also explore alternative rewards and input observations to overcome these limitations.
https://imatge.upc.edu/web/publications/discovery-and-learning-navigation-goals-pixels-minecraft
Discovery and Learning of Navigation Goals from Pixels in Minecraft
1. Discovery and Learning of Navigation Goals from Pixels in Minecraft
Juan José Nieto Salas
Master Thesis
May 27th, 2021
Acknowledgements: Xavier Giró (Advisor), Víctor Campos (Advisor), Òscar Mañas, Roger Creus
6. Explore, Discover and Learn (EDL)
Víctor Campos et al., ICML 2020
● Good coverage of the state space
● Independent of how the state distribution is induced
7. Explore, Discover and Learn
● Explore: define the state distribution and how we sample from it
● Discover: learn the mapping from s to z and define the intrinsic rewards
● Learn: learn behaviours by training the z-conditioned policies
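A minimal Python sketch of how the three EDL stages fit together, assuming the classic Gym API (env.step returns obs, reward, done, info); train_vqvae, sample_skill, and policy are hypothetical placeholders, not the thesis code:

```python
def explore(env, num_steps):
    """Stage 1 (Explore): induce a state distribution p(s).
    Here via random actions; the thesis also uses expert plays."""
    states, s = [], env.reset()
    for _ in range(num_steps):
        s, _, done, _ = env.step(env.action_space.sample())
        states.append(s)
        if done:
            s = env.reset()
    return states

def discover(states, num_skills):
    """Stage 2 (Discover): learn the mapping from s to z with a VQ-VAE
    whose codebook has one entry per skill, and define intrinsic rewards."""
    vqvae = train_vqvae(states, codebook_size=num_skills)  # assumed helper

    def intrinsic_reward(s, z):
        # With a Gaussian decoder, log q(s|z) reduces to a negative squared error.
        return -((s - vqvae.decode(z)) ** 2).mean()

    return vqvae, intrinsic_reward

def learn(env, policy, intrinsic_reward, num_skills, episodes=1000):
    """Stage 3 (Learn): train the z-conditioned policy pi(a|s, z)
    with standard RL on the intrinsic reward."""
    for _ in range(episodes):
        z, s, done = sample_skill(num_skills), env.reset(), False  # z ~ p(z)
        while not done:
            s, _, done, _ = env.step(policy.act(s, z))
            policy.update(s, z, intrinsic_reward(s, z))
```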
8. Explore, Discover and Learn
● Reward as reconstruction error using MSE works for coordinates (x, y) but does not scale to pixels (3, H, W)
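To make the scaling issue concrete: in EDL the intrinsic reward is the decoder log-likelihood of the state under the discovered latent model, which for a Gaussian decoder reduces to a negative squared reconstruction error (a sketch of the standard formulation, with $\mu_\phi$ the decoder mean):

$$ r(s, z) \;=\; \log q_\phi(s \mid z) \;\propto\; -\lVert s - \mu_\phi(z) \rVert_2^2 $$

For $s = (x, y)$ this squared error is a meaningful distance on the map, but for $s \in \mathbb{R}^{3 \times H \times W}$ a per-pixel MSE is dominated by low-level appearance and no longer reflects how far apart two states are.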
23. Skill learning
1. Toy map with random trajectories
2. Toy map with expert plays
3. Realistic map with random trajectories, where the input is composed of pixels and coordinates
25. Experiment 1
● Handcrafted map
● Random trajectories
● Contrastive approach
[Figure: map, reward map, trajectories in evaluation, average reward over time]
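A minimal sketch of how a contrastive skill reward can be computed for this experiment, scoring the encoded state against its conditioning skill anchor and normalizing over all skills (InfoNCE-style; function and variable names are assumptions, not the thesis code):

```python
import torch
import torch.nn.functional as F

def contrastive_reward(state_emb, skill_embs, z, temperature=0.1):
    """Reward is high only when the state embedding is closer to its
    conditioning skill z than to the other skill anchors.

    state_emb:  (d,)   encoder output for the current observation
    skill_embs: (K, d) one learned anchor embedding per skill
    z:          index of the conditioning skill
    """
    sims = F.cosine_similarity(state_emb.unsqueeze(0), skill_embs, dim=-1)
    return F.log_softmax(sims / temperature, dim=0)[z].item()

# Toy usage: 8 skills with 32-dimensional embeddings.
r = contrastive_reward(torch.randn(32), torch.randn(8, 32), z=3)
```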
35. Experiment 3
● Real map
● Random trajectories
● Variational approach
● Inputs: pixels and coordinates
[Figure: map and reward map]
36. Experiment 3
● Real map
● Random trajectories
● Variational approach
● Inputs: pixels and coordinates
[Figure: map, reward map, trajectories in evaluation, average reward over time]
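One way to realize the "pixels and coordinates" input of Experiment 3 is to fuse a CNN feature vector with the (x, y) coordinates before projecting to the latent space. A hypothetical architecture sketch (the thesis network may differ):

```python
import torch
import torch.nn as nn

class PixelCoordEncoder(nn.Module):
    """Encodes a (3, H, W) frame plus (x, y) coordinates into one latent."""

    def __init__(self, latent_dim=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (B, 32)
        )
        self.head = nn.Linear(32 + 2, latent_dim)

    def forward(self, pixels, coords):
        # pixels: (B, 3, H, W), coords: (B, 2)
        return self.head(torch.cat([self.cnn(pixels), coords], dim=-1))

# Toy usage: batch of 4 frames at 64x64 with their coordinates.
z = PixelCoordEncoder()(torch.randn(4, 3, 64, 64), torch.randn(4, 2))
```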
39. Conclusions
● We empirically demonstrate that expert trajectories are sufficient for discovering generic skills
● We maximize empowerment with either variational or contrastive approaches
● We successfully learned meaningful skills by using the reverse form of the mutual information (see the bound sketched below)
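For reference, the "reverse form" decomposes the mutual information between states and skills through the skill posterior rather than the state decoder; with a learned classifier $q_\phi(z \mid s)$, the standard variational bound yields the intrinsic reward (a sketch of the usual formulation, not quoted from the thesis):

$$ I(S; Z) \;=\; \mathcal{H}(Z) - \mathcal{H}(Z \mid S) \;\ge\; \mathbb{E}_{z \sim p(z),\, s \sim \pi_z}\!\left[ \log q_\phi(z \mid s) - \log p(z) \right] $$

This contrasts with the forward form $I(S;Z) = \mathcal{H}(S) - \mathcal{H}(S \mid Z)$ used with reconstruction-based rewards.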
-> Goal: mine a diamond block.
-> NeurIPS challenge.
-> This requires a long sequence of actions, impossible to perform by chance.
-> Rather, learn a set of skills to ease training and solve more complex tasks.
-> Mention examples of skills.
-> Learn these skills without supervision, inspired by the success of self-supervised learning.
-> Mention these two examples.
-> In this paradigm we extract features that can be transferred to other downstream tasks.
-> Since it does not require annotating labels, there are no scalability problems.
-> Can we transfer these ideas to RL?
-> What kind of tasks?
-> It is not enough to extract features; we want to transfer behaviours or skills.
-> It is a bit difficult to assess the learned skills since we do not have labels! But these simple examples and plots help with this task.
-> We will also show some differences between discovering skills from random and expert trajectories.
-> For that, we use two different maps.
-> Show the top view of the map!!
-> These index maps are a way of assessing the learned skills.
-> Each dot belongs to an observation from a random trajectory.
-> The observation has been encoded, and we pick the index of the closest embedding from the codebook.
-> Each index is mapped to a different color, forming these plots (see the snippet below).
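A minimal sketch of the nearest-codebook assignment behind the index maps, assuming a VQ-VAE-style codebook tensor (names are illustrative):

```python
import torch

def codebook_index(obs_embedding, codebook):
    """Return the index of the codebook entry closest to an encoded
    observation; this index picks the color of the dot in the index maps.

    obs_embedding: (d,)   encoder output for one observation
    codebook:      (K, d) learned embeddings, one per skill
    """
    dists = torch.cdist(obs_embedding.unsqueeze(0), codebook)  # (1, K)
    return dists.argmin().item()

# Toy usage: 16 skills, 64-dimensional embeddings.
idx = codebook_index(torch.randn(64), torch.randn(16, 64))
```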
-> Explain the results: the variational approach yields more discrete regions, while the contrastive one yields more overlapped regions.
These experiments show the progress made during our work.
Make sure that everyone understands what the observations of the agent are!!
Although, as we have seen, the skills struggle when deployed in realistic environments, since the regions are quite mixed and overlapped.
Mention that the variational approach is common, but the contrastive approach is relatively new for maximizing empowerment.
These skills could be used along with a hierarchical policy on top to perform more complex tasks.