InfoGAIL

Info-Wasserstein-GAIL
Yunzhu Li, Jiaming Song, Stefano Ermon, “Inferring The Latent Structure
of Human Decision-Making from Raw Visual Inputs”, ArXiv, 2017
Sungjoon Choi
(sungjoon.choi@cpslab.snu.ac.kr)

Latent Structure of Human Demos
[Figure: four demonstration panels labeled Pass / code: 0, Pass / code: 1, Turn / code: 0, Turn / code: 1]

Contents
• Introduction
• Background
• Generative Adversarial Imitation Learning (GAIL)
• Policy gradient
• InfoGAN
• Wasserstein GAN
• InfoGAIL
• Experiments

Imitation Learning
• The goal of imitation learning is to match expert behavior.
• However, demonstrations often show significant variability due to latent factors.
• This paper presents InfoGAIL, an algorithm that can infer the latent structure of human decision-making.
• The method not only imitates the expert but also learns interpretable representations.

Introduction
• The goal of this paper is to develop an imitation learning framework that can autonomously discover and disentangle the latent factors of variation underlying human decision-making.
• In short, the paper combines generative adversarial imitation learning (GAIL), InfoGAN, and Wasserstein GAN with some reward heuristics.

GAIL
• We will NOT go into the details of GAIL.
• But we will go over some basics of policy gradient methods.

Policy Gradient
Now we get rid of the expectation over a policy function!

Step-based PG
In other words, we are now considering a dynamics model!

Step-based PG
We do NOT have to care about complex models in an MDP anymore!
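
The slide equations did not survive in this transcript. As a reminder, the standard likelihood-ratio argument behind this statement can be written as below. The notation (trajectory τ, policy π_θ, return R(τ), horizon T) is assumed here and may differ from the original slides. The initial-state and transition terms do not depend on θ, so they vanish from the gradient of the log-likelihood.

```latex
\begin{align}
p_\theta(\tau) &= p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) \\
\nabla_\theta \log p_\theta(\tau) &= \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \\
\nabla_\theta \, \mathbb{E}_{\tau \sim p_\theta}\left[ R(\tau) \right]
  &= \mathbb{E}_{\tau \sim p_\theta}\left[ R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
\end{align}
```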

Step-based PG (REINFORCE)
Now we have the REINFORCE algorithm!
It has been used in many deep learning methods where the objective function is NOT differentiable.
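
To make the "NOT differentiable" point concrete, here is a minimal sketch of the score-function (REINFORCE) estimator on a toy problem. Everything in it is an illustrative assumption and not taken from the slides: a single Bernoulli variable with logit `theta`, and a step-function objective `f`.

```python
import math
import random

def f(x):
    # Non-differentiable objective: a step function of a binary sample.
    return 3.0 if x == 1 else 1.0

def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def reinforce_grad(theta, n_samples, rng):
    """Score-function estimate of d/dtheta E_{x ~ Bernoulli(sigmoid(theta))}[f(x)]."""
    p = sigmoid(theta)
    total = 0.0
    for _ in range(n_samples):
        x = 1 if rng.random() < p else 0
        score = x - p  # d/dtheta log p(x) for a Bernoulli with logit theta
        total += f(x) * score
    return total / n_samples

def exact_grad(theta):
    # E[f] = p * f(1) + (1 - p) * f(0), so dE/dtheta = p * (1 - p) * (f(1) - f(0)).
    p = sigmoid(theta)
    return p * (1.0 - p) * (f(1) - f(0))

estimate = reinforce_grad(0.3, 100_000, random.Random(0))
```

The estimator never differentiates `f`; it only needs samples of `x` and the gradient of the log-probability, which is why it applies to non-differentiable objectives.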

Step-based PG (PG)
Over all trajectories, and over all time steps in a trajectory, the PG is simply a weighted MLE, where the weight is the sum of future rewards, i.e. the Q value.
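
The "weighted MLE" reading can be sketched in a few lines. This is a sketch under assumptions: a softmax policy over discrete actions, one trajectory, and hypothetical helper names (`reward_to_go`, `pg_surrogate`) that do not come from the slides.

```python
import math

def reward_to_go(rewards, gamma=1.0):
    """Sum of (discounted) future rewards at each step of one trajectory."""
    weights, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        weights.append(running)
    return weights[::-1]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def pg_surrogate(logits_per_step, actions, rewards, gamma=1.0):
    """Weighted log-likelihood; its gradient w.r.t. the logits is the PG estimate."""
    weights = reward_to_go(rewards, gamma)
    total = 0.0
    for logits, a, w in zip(logits_per_step, actions, weights):
        total += w * math.log(softmax(logits)[a])
    return total
```

Setting every weight to 1 recovers plain maximum likelihood (behavior cloning) on the sampled actions, which is what makes the weighted-MLE view useful.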

GAIL
• Now we know where (18) came from, right?

Visual InfoGAIL
• Interpretable Imitation Learning
  • Utilizes information-theoretic regularization.
  • Simply adds InfoGAN to GAIL.
• Utilizing Raw Visual Inputs via Transfer Learning
  • Uses a Deep Residual Network.

InfoGAN
• Rather than using a single unstructured noise vector, InfoGAN decomposes the input noise vector into two parts: (1) z, the incompressible noise, and (2) c, the latent code that targets the salient structured semantic features of the data distribution.
• InfoGAN proposes an information-theoretic regularization: the mutual information between the latent code c and the generator distribution G(z, c), I(c; G(z, c)), should be high.
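
In practice InfoGAN does not compute I(c; G(z, c)) directly; it maximizes a variational lower bound L_I = E[log Q(c | x)] + H(c), where Q is an auxiliary posterior network. The sketch below estimates that bound for a discrete code from samples. The function names and the list-based inputs are assumptions made for illustration.

```python
import math

def entropy(prior):
    """Entropy H(c) of a discrete prior over latent codes, in nats."""
    return -sum(p * math.log(p) for p in prior if p > 0.0)

def mi_lower_bound(codes, q_probs, prior):
    """Monte Carlo estimate of L_I = E[log Q(c | x)] + H(c).

    codes:   sampled code index c_i used to generate each sample x_i
    q_probs: Q(. | x_i), the posterior over codes predicted for each x_i
    prior:   p(c), the distribution the codes were sampled from
    """
    avg_log_q = sum(math.log(q[c]) for c, q in zip(codes, q_probs)) / len(codes)
    return avg_log_q + entropy(prior)
```

With a uniform prior over two codes, a posterior that always recovers the code reaches the maximum log 2, while an uninformative posterior gives 0.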

Improved Optimization
• Reward Augmentation
  • A general framework for incorporating prior knowledge into imitation learning by providing additional incentives to the agent without interfering with the imitation learning process.
  • Adds a surrogate state-based reward that reflects our biases over the desired behaviors.
  • Can be seen as
    • a hybrid between imitation and reinforcement learning
    • side information provided to the generator
• Wasserstein GAN (WGAN)
  • The discriminator network in WGAN solves a regression problem instead of a classification problem.
  • Suffers less from the vanishing gradient and mode collapse problems.
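
One way to picture reward augmentation is as a per-step sum of the learned imitation signal and a hand-designed state reward. The sketch below is only an illustration: the weight `lam`, the dictionary-based state, and the `no_collision` rule are hypothetical and are not taken from the paper or the slides.

```python
def augmented_rewards(imitation_rewards, states, surrogate_reward, lam=0.1):
    """Per-step reward = imitation signal + lam * hand-designed state reward."""
    return [r + lam * surrogate_reward(s) for r, s in zip(imitation_rewards, states)]

def no_collision(state):
    # Hypothetical prior knowledge: penalize states flagged as a collision.
    return -1.0 if state["collision"] else 0.0

rewards = augmented_rewards(
    [0.5, 0.2],
    [{"collision": False}, {"collision": True}],
    no_collision,
    lam=0.5,
)
```

Because the surrogate term depends only on the state, it can encode a preference (here, avoiding collisions) without requiring any extra expert demonstrations.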

WGAN?
• Wasserstein Generative Adversarial Learning
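
As a brief reminder of what the WGAN critic optimizes, the sketch below computes the critic loss from raw scores and applies the weight clipping used in the original WGAN. It works on plain lists of numbers; the function names are assumptions, and the sign convention (real minus fake) follows the original WGAN rather than any particular imitation learning setup.

```python
def critic_loss(scores_real, scores_fake):
    """WGAN critic loss: minimizing it maximizes mean(D(real)) - mean(D(fake))."""
    mean_real = sum(scores_real) / len(scores_real)
    mean_fake = sum(scores_fake) / len(scores_fake)
    return mean_fake - mean_real

def clip_weights(weights, c=0.01):
    """Clip each weight to [-c, c] to keep the critic roughly Lipschitz."""
    return [max(-c, min(c, w)) for w in weights]
```

The scores are unbounded real numbers with no sigmoid or log applied, which is the sense in which the critic solves a regression problem rather than a classification problem.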

Improved Optimization
• Variance Reduction
  • Reduces the variance of the policy gradient method.
  • Replay buffer with prioritized replay.
    • Good for cases where rewards are rare.
  • Baseline variance reduction methods.
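
The effect of a baseline is easy to see on a toy problem. The sketch below assumes a Bernoulli variable with probability `p` and two hypothetical reward values with a large common offset. Subtracting a constant baseline leaves the expected gradient unchanged but shrinks its variance.

```python
import random

def grad_samples(n, baseline, rng, p=0.5):
    """Per-sample score-function gradients (f(x) - baseline) * d/dtheta log p(x)."""
    f = {0: 10.0, 1: 11.0}  # toy rewards with a large common offset
    out = []
    for _ in range(n):
        x = 1 if rng.random() < p else 0
        out.append((f[x] - baseline) * (x - p))  # x - p is the score w.r.t. the logit
    return out

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

plain = grad_samples(10_000, baseline=0.0, rng=random.Random(0))
centered = grad_samples(10_000, baseline=10.5, rng=random.Random(0))
```

Both sample sets estimate the same gradient, but with the baseline set to the mean reward every sample equals 0.25, so the variance collapses to zero in this toy case.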

Finally, InfoGAIL
Initialize the policy from behavior cloning.
Sample data similar to InfoGAN.
Update D similar to WGAN.
Update Q similar to GAN or GAIL.
Update the policy with TRPO.
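
The order of the updates can be sketched as a training loop skeleton. This is not the authors' implementation: every callable is a user-supplied stand-in with a hypothetical signature, and the sketch only fixes the order in which the pieces are called.

```python
def infogail_train(policy, n_iters, sample_codes, rollout,
                   update_d, update_q, update_policy):
    """Run n_iters iterations; `policy` is assumed to be pre-trained by behavior cloning."""
    for _ in range(n_iters):
        codes = sample_codes()                 # latent codes c ~ p(c), as in InfoGAN
        batch = rollout(policy, codes)         # state-action pairs from the current policy
        update_d(batch)                        # discriminator step (WGAN-style)
        update_q(batch, codes)                 # posterior step on Q(c | s, a)
        policy = update_policy(policy, batch)  # policy step (TRPO in the slides)
    return policy
```

Passing the updates in as callables keeps the skeleton independent of any deep learning framework, so each step can be swapped out or tested on its own.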

Network Architectures
Latent codes are added to G.
Latent codes are also added to D.
Actions are added to D.
The posterior network Q adopts the same architecture as D, except that the output is a softmax over the discrete latent variables or a factored Gaussian over the continuous latent variables.
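
The two output heads of Q can be sketched as log-likelihood functions. This assumes the network body has already produced `logits` for the discrete code and `means` and `log_stds` for the continuous code. These names are hypothetical, and the network itself is omitted.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def log_q_discrete(logits, code):
    """log Q(c | s, a) for a categorical code, from the softmax head."""
    return math.log(softmax(logits)[code])

def log_q_continuous(means, log_stds, code):
    """log Q(c | s, a) for a continuous code under a factored (diagonal) Gaussian."""
    total = 0.0
    for mu, log_std, c in zip(means, log_stds, code):
        var = math.exp(2.0 * log_std)
        total += -0.5 * math.log(2.0 * math.pi * var) - (c - mu) ** 2 / (2.0 * var)
    return total
```

"Factored" means that each continuous dimension has its own mean and standard deviation, so the log-likelihood is a sum over dimensions.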

[Diagram: three network blocks with their labels, in extraction order]
G (policy): Input Image, Action, Disc. Latent Code, Cont. Latent Code
D (cost): Input Image, Action, Disc. Latent Code, Score
Q (regularizer): Input Image, Action, Disc. Latent Code; Disc. Latent Code, Cont. Latent Code
Train the policy function G with TRPO, and iterate.

Experiments
[Figure: four result panels labeled Pass / code: 0, Pass / code: 1, Turn / code: 0, Turn / code: 1]
