MATT Master thesis defense by Juan José Nieto.
Advised by Víctor Campos and Xavier Giro-i-Nieto.
27th May 2021.
Pre-training Reinforcement Learning (RL) agents in a task-agnostic manner has shown promising results. However, previous works still struggle to discover and learn meaningful skills in high-dimensional state spaces. We approach the problem by leveraging unsupervised skill discovery and self-supervised learning of state representations. In our work, we learn a compact latent representation using variational or contrastive techniques, and demonstrate that both allow learning a set of basic navigation skills by maximizing an information-theoretic objective. We assess our method on Minecraft 3D maps of varying complexity. Our results show that representations and conditioned policies learned from pixels are enough for toy examples, but do not scale to realistic and complex maps. We also explore alternative rewards and input observations to overcome these limitations.
https://imatge.upc.edu/web/publications/discovery-and-learning-navigation-goals-pixels-minecraft
Discovery and Learning of Navigation Goals from Pixels in Minecraft
1. Discovery and Learning of Navigation Goals from Pixels in Minecraft
Juan José Nieto Salas
Master Thesis
May 27th, 2021
Acknowledgements: Xavier Giró (Advisor), Víctor Campos (Advisor), Òscar Mañas, Roger Creus
6. Explore, Discover and Learn (EDL)
Víctor Campos et al., ICML 2020
● Good coverage of the state space
● Independent of how the state distribution is induced
7. Explore, Discover and Learn
● Explore: define the state distribution and how we sample from it
● Discover: learn the mapping from s to z and define the intrinsic rewards
● Learn: learn behaviours by training the z-conditioned policies
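A minimal Python sketch of how the three EDL stages fit together, assuming the classic Gym API (env.step returns obs, reward, done, info); train_vqvae, sample_skill, and policy are hypothetical placeholders, not the thesis code:

```python
def explore(env, num_steps):
    """Stage 1 (Explore): induce a state distribution p(s).
    Here via random actions; the thesis also uses expert plays."""
    states, s = [], env.reset()
    for _ in range(num_steps):
        s, _, done, _ = env.step(env.action_space.sample())
        states.append(s)
        if done:
            s = env.reset()
    return states

def discover(states, num_skills):
    """Stage 2 (Discover): learn the mapping from s to z with a VQ-VAE
    whose codebook has one entry per skill, and define intrinsic rewards."""
    vqvae = train_vqvae(states, codebook_size=num_skills)  # assumed helper

    def intrinsic_reward(s, z):
        # With a Gaussian decoder, log q(s|z) reduces to a negative squared error.
        return -((s - vqvae.decode(z)) ** 2).mean()

    return vqvae, intrinsic_reward

def learn(env, policy, intrinsic_reward, num_skills, episodes=1000):
    """Stage 3 (Learn): train the z-conditioned policy pi(a|s, z)
    with standard RL on the intrinsic reward."""
    for _ in range(episodes):
        z, s, done = sample_skill(num_skills), env.reset(), False  # z ~ p(z)
        while not done:
            s, _, done, _ = env.step(policy.act(s, z))
            policy.update(s, z, intrinsic_reward(s, z))
```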
8. Explore, Discover and Learn
● Reward as reconstruction error using MSE works for coordinates (x, y) but does not scale to pixels (3, H, W)
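To make the scaling issue concrete: in EDL the intrinsic reward is the decoder log-likelihood of the state under the discovered latent model, which for a Gaussian decoder reduces to a negative squared reconstruction error (a sketch of the standard formulation, with $\mu_\phi$ the decoder mean):

$$ r(s, z) \;=\; \log q_\phi(s \mid z) \;\propto\; -\lVert s - \mu_\phi(z) \rVert_2^2 $$

For $s = (x, y)$ this squared error is a meaningful distance on the map, but for $s \in \mathbb{R}^{3 \times H \times W}$ a per-pixel MSE is dominated by low-level appearance and no longer reflects how far apart two states are.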
23. Skill learning
1. Toy map with random trajectories
2. Toy map with expert plays
3. Realistic map with random trajectories, where the input is composed of pixels and coordinates
25. Experiment 1
● Handcrafted map
● Random trajectories
● Contrastive approach
[Figure: map, reward map, trajectories in evaluation, average reward over time]
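A minimal sketch of how a contrastive skill reward can be computed for this experiment, scoring the encoded state against its conditioning skill anchor and normalizing over all skills (InfoNCE-style; function and variable names are assumptions, not the thesis code):

```python
import torch
import torch.nn.functional as F

def contrastive_reward(state_emb, skill_embs, z, temperature=0.1):
    """Reward is high only when the state embedding is closer to its
    conditioning skill z than to the other skill anchors.

    state_emb:  (d,)   encoder output for the current observation
    skill_embs: (K, d) one learned anchor embedding per skill
    z:          index of the conditioning skill
    """
    sims = F.cosine_similarity(state_emb.unsqueeze(0), skill_embs, dim=-1)
    return F.log_softmax(sims / temperature, dim=0)[z].item()

# Toy usage: 8 skills with 32-dimensional embeddings.
r = contrastive_reward(torch.randn(32), torch.randn(8, 32), z=3)
```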
35. Experiment 3
● Real map
● Random trajectories
● Variational approach
● Inputs: pixels and coordinates
[Figure: map and reward map]
36. Experiment 3
● Real map
● Random trajectories
● Variational approach
● Inputs: pixels and coordinates
[Figure: map, reward map, trajectories in evaluation, average reward over time]
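One way to realize the "pixels and coordinates" input of Experiment 3 is to fuse a CNN feature vector with the (x, y) coordinates before projecting to the latent space. A hypothetical architecture sketch (the thesis network may differ):

```python
import torch
import torch.nn as nn

class PixelCoordEncoder(nn.Module):
    """Encodes a (3, H, W) frame plus (x, y) coordinates into one latent."""

    def __init__(self, latent_dim=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (B, 32)
        )
        self.head = nn.Linear(32 + 2, latent_dim)

    def forward(self, pixels, coords):
        # pixels: (B, 3, H, W), coords: (B, 2)
        return self.head(torch.cat([self.cnn(pixels), coords], dim=-1))

# Toy usage: batch of 4 frames at 64x64 with their coordinates.
z = PixelCoordEncoder()(torch.randn(4, 3, 64, 64), torch.randn(4, 2))
```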
39. Conclusions
● We empirically demonstrate that expert trajectories are sufficient for discovering generic skills
● We maximize empowerment with either variational or contrastive approaches
● We successfully learned meaningful skills by using the reverse form of the mutual information (see the bound sketched below)
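For reference, the "reverse form" decomposes the mutual information between states and skills through the skill posterior rather than the state decoder; with a learned classifier $q_\phi(z \mid s)$, the standard variational bound yields the intrinsic reward (a sketch of the usual formulation, not quoted from the thesis):

$$ I(S; Z) \;=\; \mathcal{H}(Z) - \mathcal{H}(Z \mid S) \;\ge\; \mathbb{E}_{z \sim p(z),\, s \sim \pi_z}\!\left[ \log q_\phi(z \mid s) - \log p(z) \right] $$

This contrasts with the forward form $I(S;Z) = \mathcal{H}(S) - \mathcal{H}(S \mid Z)$ used with reconstruction-based rewards.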
-> Goal: mine a diamond block.
-> NeurIPS challenge.
-> This requires a long sequence of actions, impossible to perform by chance.
-> Rather, learn a set of skills to ease training and solve more complex tasks.
-> Mention examples of skills.
-> Learn these skills without supervision, inspired by the success of self-supervised learning.
-> Mention these two examples.
-> In this paradigm we extract features that can be transferred to other downstream tasks.
-> Since it does not require annotating labels, there are no scalability problems.
-> Can we transfer these ideas to RL?
-> What kind of tasks?
-> It is not enough to extract features; we want to transfer behaviours or skills.
-> It is a bit difficult to assess the learned skills since we do not have labels! But these simple examples and plots help with this task.
-> We will also show some differences between discovering skills from random and expert trajectories.
-> For that, we use two different maps.
-> Show the top view of the map!!
-> These index maps are a way of assessing the learned skills.
-> Each dot belongs to an observation from a random trajectory.
-> The observation has been encoded, and we pick the index of the closest embedding from the codebook.
-> Each index is mapped to a different color, forming these plots (see the snippet below).
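A minimal sketch of the nearest-codebook assignment behind the index maps, assuming a VQ-VAE-style codebook tensor (names are illustrative):

```python
import torch

def codebook_index(obs_embedding, codebook):
    """Return the index of the codebook entry closest to an encoded
    observation; this index picks the color of the dot in the index maps.

    obs_embedding: (d,)   encoder output for one observation
    codebook:      (K, d) learned embeddings, one per skill
    """
    dists = torch.cdist(obs_embedding.unsqueeze(0), codebook)  # (1, K)
    return dists.argmin().item()

# Toy usage: 16 skills, 64-dimensional embeddings.
idx = codebook_index(torch.randn(64), torch.randn(16, 64))
```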
-> Explain the results: the variational approach yields more discrete regions, while the contrastive one yields more overlapped regions.
These experiments show the progress made during our work.
Make sure that everyone understands what the observations of the agent are!!
Although, as we have seen, the skills struggle when deployed in realistic environments, since the regions are quite mixed and overlapped.
Mention that the variational approach is common, but the contrastive approach is relatively new for maximizing empowerment.
These skills could be used along with a hierarchical policy on top to perform more complex tasks.