Deep learning is becoming ubiquitous in Machine Learning (ML) research, and it's also finding its place in industry applications. Specifically, deep generative models have proven incredibly useful at generating and remixing realistic content from scratch, making them a very appealing technology in the field of AI-enhanced content authoring. As part of this year's Machine Learning Tutorial at the Game Developers Conference 2019 (GDC), Jorge Del Val from SEED will cover in an accessible manner the fundamentals of deep generative modeling, including some common algorithms and architectures. He will also discuss applications to game development and explore some recent advances in the field.
Attendees will gain a basic understanding of the fundamentals of generative models and how to implement them, as well as a grasp of potential applications in the field of game development to inspire their work and companies. This talk does not require a mathematics or machine learning background, although previous knowledge of either is beneficial.
GDC2019 - SEED - Towards Deep Generative Models in Game Development
1. Towards Deep Generative Models in Game Development
Jorge del Val
Research Engineer / SEED (Electronic Arts)
2.
3. Agenda
1. Motivation and fundamentals
2. Variational autoencoders (VAE)
3. Generative adversarial networks (GAN)
4. Conditional generative models
5. Some applications to game development
25. Latent dimensions of data
Images credit Ward A.D. et al 2007
*Spoiler: generative models learn both the intrinsic geometry and the probability distribution!
29. Random variables and generative modelling
For us, each datapoint 𝑥𝑖 is just a realization of an underlying random variable.
𝐱 ∼ 𝑝(𝑥)
● Unsupervised learning is the field which attempts to infer properties of x from samples.
● Generative modelling is a subset of unsupervised learning which attempts to approximate
x as some parametrized combination of “simple” random variables which you can sample.
𝐱 ≈ f_θ(𝐳₁, 𝐳₂, …, 𝐳_K) ≜ 𝐱̂
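The idea can be sketched in a few lines of NumPy. Here a hypothetical, fixed affine map stands in for the trained network f_θ; the point is only that pushing "simple" random variables through a parametrized function yields a new random variable:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_theta(z):
    """Stand-in for a learned generator: a fixed affine map.
    In a real model this would be a deep neural network."""
    return 2.0 * z + 3.0

# Sample "simple" random variables (standard normals)...
z = rng.standard_normal(100_000)
# ...and push them through the parametrized map to get x_hat.
x_hat = f_theta(z)

# The output is itself a random variable with a new distribution:
print(x_hat.mean(), x_hat.std())  # roughly mean 3, std 2
```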
33. Architectures: neural networks
Neural networks can approximate any function 𝑓(𝑥) to any precision! *
x → f_θ(x), with parameters θ
*G. Cybenko 1989, K. Hornik 1991, Z. Lu et al 2017, B. Hanin 2017
Image credit Sydney Firmin
34. But how do we find the right 𝜃 (training)?
You optimize some loss (error) function!
min_θ L(θ; data)
37. But how do we find the right 𝜃 (training)?
You optimize some loss (error) function!
min_θ L(θ; data)
Stochastic Gradient Descent: θ_{t+1} = θ_t − α ∇_θ L(θ_t; data)
39. But how do we find the right 𝜃 (training)?
You optimize some loss (error) function!
min_θ L(θ; data)
Stochastic Gradient Descent: θ_{t+1} = θ_t − α ∇_θ L(θ_t; data)
Easy to gradient-descend any function with current frameworks!
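The update rule above is a one-liner in practice. A minimal sketch in plain NumPy, using a toy quadratic loss (not a network) so the whole loop fits on a slide:

```python
import numpy as np

# Toy loss: L(theta) = (theta - 5)^2, minimized at theta = 5.
def loss(theta):
    return (theta - 5.0) ** 2

def grad(theta):
    return 2.0 * (theta - 5.0)

theta = 0.0   # initial parameters
alpha = 0.1   # learning rate
for t in range(100):
    theta = theta - alpha * grad(theta)  # theta_{t+1} = theta_t - alpha * grad L

print(theta, loss(theta))  # theta close to 5, loss close to 0
```

In real frameworks (PyTorch, TensorFlow) the gradient comes from automatic differentiation, so the same loop works for any differentiable model.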
42. Training
We want to approximate 𝐱 as 𝐱̂ = G_θ(𝐳). How do we find the optimal θ?
Maximize the likelihood of the data!
max_θ ℒ(θ | x_train) = max_θ ∏_{i=1}^{N} p_θ(x_i)
max_θ log ℒ(θ | x_train) = max_θ ∑_{i=1}^{N} log p_θ(x_i)
But… we need p_θ(x) explicitly!
(p_θ(x_i) is the probability that the model would generate x_i.)
Image credit: Colin Raffel
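Maximum likelihood in miniature: assume (purely for illustration) the model is a Gaussian p_θ(x) = N(x; θ, 1) with one parameter, its mean. Gradient ascent on the log-likelihood recovers the familiar answer, the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=1000)   # "training set" drawn from N(2, 1)

# Model: p_theta(x) = N(x; theta, 1). Log-likelihood of the data:
def log_likelihood(theta):
    return np.sum(-0.5 * (data - theta) ** 2 - 0.5 * np.log(2 * np.pi))

# Gradient of the log-likelihood w.r.t. theta is sum(data - theta).
theta = 0.0
alpha = 1e-4
for _ in range(2000):
    theta += alpha * np.sum(data - theta)   # gradient *ascent*: maximize

# The maximum-likelihood estimate is the sample mean.
print(theta, data.mean())
```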
43. What about deep latent variable models?
𝐳 ∼ p(z) → G_θ(⋅) → 𝐱, with conditional likelihood p_θ(x|z)
What's the total probability of generating x?
p_θ(x) = ∫ p_θ(x|z) p(z) dz
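For a toy linear-Gaussian model the integral has a closed form, so we can check a naive Monte Carlo estimate of it. The model and the numbers below are illustrative, not from the talk; in high dimensions this estimator becomes hopeless, which is why p_θ(x) is called intractable:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)

# Toy latent-variable model: z ~ N(0, 1), x | z ~ N(z, 0.5^2).
# The marginal p(x) = integral of p(x|z) p(z) dz = N(x; 0, 1 + 0.5^2).
x = 0.7
z = rng.standard_normal(1_000_000)
p_x_mc = normal_pdf(x, z, 0.5).mean()          # Monte Carlo estimate of the integral
p_x_exact = normal_pdf(x, 0.0, np.sqrt(1.25))  # closed-form marginal

print(p_x_mc, p_x_exact)
```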
44. Different models – different methods
1. We have p_θ(x) explicitly: maximize the likelihood.
2. p_θ(x) is intractable: we can approximate it instead.
● Markov Chain Monte Carlo (MCMC) methods
● Variational methods (e.g., variational autoencoders)
3. We don't need p_θ(x); it's implicit.
● Adversarial methods (e.g., GANs)
50. Variational autoencoders
Pros:
● Efficient inference for free!
o Great tool for modelling the hidden structure of data.
● Stable to train.
● Good theoretical grounding.
Cons:
● Sample quality is not great (often blurry).
51. Generative adversarial networks (GANs)
𝐳 ∼ p(z) → G_θ(⋅) → 𝐱̂
Why not just sample a bunch of data and see if they look real?
Sample a real batch x₁, x₂, …, x_B and a generated batch x̂₁, x̂₂, …, x̂_B.
Objective: the two batches look similar!
52. But… how do we measure similarity between groups of samples?
53. How to measure similarity of samples
One solution: train a classifier D_φ(x) to discriminate!
● If the classifier cannot tell whether a sample is real or fake, the two distributions are close.
● Trained with the standard cross-entropy loss:
max_φ L_d(φ) = max_φ 𝔼_{x_r ∼ p_real}[log D_φ(x_r)] + 𝔼_{x_f ∼ p_fake}[log(1 − D_φ(x_f))]
It can be shown that the optimal classifier's performance L_d(φ*) is related to the closeness of the two distributions (the Jensen–Shannon divergence).
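Two sanity checks on that loss, with a hypothetical discriminator D(x) = sigmoid(x) on toy 1D data: when real and fake coincide, the best the classifier can do is D = 1/2, giving L_d = −2 log 2 (zero JS divergence); when the distributions are well separated, a confident classifier pushes L_d toward 0:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def discriminator_objective(d_real, d_fake):
    """L_d = E[log D(x_r)] + E[log(1 - D(x_f))], estimated from batches."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Identical distributions: optimal D(x) = 1/2 everywhere, so
# L_d = log(1/2) + log(1/2) = -2 log 2.
d = np.full(1000, 0.5)
print(discriminator_objective(d, d))  # -1.3862...

# Well-separated toy distributions with D(x) = sigmoid(x): L_d near 0.
x_real = rng.normal(5.0, 1.0, 1000)
x_fake = rng.normal(-5.0, 1.0, 1000)
print(discriminator_objective(sigmoid(x_real), sigmoid(x_fake)))
```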
54. The GAN game
We want to minimize the "distance" between the generated and real distributions, as measured by
the discriminator loss:
min_θ "distance" = min_θ max_φ 𝔼_{x_r ∼ p_real}[log D_φ(x_r)] + 𝔼_{x_f ∼ p_fake}[log(1 − D_φ(x_f))]
It's formally a two-player minimax game!
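The game can be played out in plain NumPy on a 1D toy problem. This is a sketch under heavy assumptions, not the talk's implementation: the generator and discriminator are single affine/logistic units with hand-derived gradients, and the generator uses the common non-saturating variant of the minimax update:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Real data the generator should imitate: N(3, 0.5^2).
# Generator G_theta(z) = a*z + b; discriminator D_phi(x) = sigmoid(w*x + c).
a, b = 1.0, 0.0              # generator parameters (theta)
w, c = 0.0, 0.0              # discriminator parameters (phi)
lr_g, lr_d, B = 0.05, 0.2, 256

for step in range(5000):
    x_r = rng.normal(3.0, 0.5, B)      # real batch
    z = rng.standard_normal(B)
    x_f = a * z + b                    # generated (fake) batch

    # Discriminator: gradient *ascent* on E[log D(x_r)] + E[log(1 - D(x_f))].
    d_r, d_f = sigmoid(w * x_r + c), sigmoid(w * x_f + c)
    w += lr_d * (np.mean((1 - d_r) * x_r) - np.mean(d_f * x_f))
    c += lr_d * (np.mean(1 - d_r) - np.mean(d_f))

    # Generator: ascent on E[log D(G(z))] (non-saturating variant).
    d_f = sigmoid(w * (a * z + b) + c)
    a += lr_g * np.mean((1 - d_f) * w * z)
    b += lr_g * np.mean((1 - d_f) * w)

samples = a * rng.standard_normal(10_000) + b
print(samples.mean())  # should drift toward the real mean, 3
```

Even on this toy the alternating updates oscillate rather than converge cleanly, which previews the "unstable training" con on the next slides.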
55. Generative adversarial networks
𝐳 ∼ p(z) → G_θ(⋅) → 𝐱̂
Sample a real batch x₁, …, x_B and a generated batch x̂₁, …, x̂_B and feed both to the discriminator D_φ(⋅).
Objective: the two batches look similar; that is, fool D_φ!
56.
57. GANs
● Pros:
Awesome samples
● Cons:
Unstable training
No explicit probability density
No direct inference
Large Scale GAN Training for High Fidelity Natural Image Synthesis. Brock et al. 2018
58. Bonus: autoregressive methods
p_θ(x) = ∏_{t=1}^{T} p_θ(x_t | x₁, …, x_{t−1})
Generate little by little!
Wavenet: A Generative Model for Raw Audio. Van den Oord et al. 2016.
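The chain-rule factorization in miniature: a toy autoregressive model over binary sequences (here the conditional only looks at the previous symbol, but the factorization is fully general). Sampling proceeds one symbol at a time, and the likelihood of a sequence is the product of the conditionals:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conditional p(x_t = 1 | x_1..x_{t-1}): tends to repeat the last symbol.
def p_next_is_one(prefix):
    if not prefix:
        return 0.5
    return 0.8 if prefix[-1] == 1 else 0.2

# Sampling: generate "little by little", one symbol at a time.
def sample(T):
    x = []
    for _ in range(T):
        x.append(int(rng.random() < p_next_is_one(x)))
    return x

# Likelihood: p_theta(x) = product over t of p_theta(x_t | x_<t).
def likelihood(x):
    p = 1.0
    for t in range(len(x)):
        q = p_next_is_one(x[:t])
        p *= q if x[t] == 1 else 1.0 - q
    return p

print(sample(10))
print(likelihood([1, 1, 0]))  # 0.5 * 0.8 * 0.2 = 0.08
```

Note that, unlike GANs, this family gives p_θ(x) explicitly; the price is sequential (slow) generation.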
61. OK! I can generate stuff.
But how do I influence what I generate?
62. OK! I can generate stuff.
How do I remix existing stuff??
63. Conditional generative models
What if I have information 𝐜 to condition the generation/inference, e.g., class
labels?
● Just introduce them in the networks!
Variational Autoencoder:
𝐱 → E_φ(⋅) → q_φ(z|x) [Encoder (inference)]
𝐳 ∼ p(z) → G_θ(⋅) → p_θ(x|z) [Decoder (generation)]
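The two paths through a VAE can be sketched in NumPy. The affine encoder/decoder below are hypothetical stand-ins for networks (dimensions are arbitrary); the one real ingredient shown is the reparameterization trick, z = μ + σ·ε, which keeps the sampling step differentiable during training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the encoder E_phi and decoder G_theta.
def encoder(x):
    """Map x to the parameters (mu, log_var) of q_phi(z | x)."""
    mu = 0.5 * x[:2]
    log_var = np.full(2, -1.0)
    return mu, log_var

def decoder(z):
    """Map a latent z back to data space (the mean of p_theta(x | z))."""
    return np.tile(z, 3)

x = rng.standard_normal(6)

# Inference path: sample z ~ q_phi(z|x) via reparameterization.
mu, log_var = encoder(x)
eps = rng.standard_normal(2)
z = mu + np.exp(0.5 * log_var) * eps

x_recon = decoder(z)         # reconstruction of x
z_prior = rng.standard_normal(2)
x_gen = decoder(z_prior)     # generation: decode a draw from the prior p(z)
print(x_recon.shape, x_gen.shape)
```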
64. Conditional generative models
What if I have information 𝐜 to condition the generation/inference, e.g., class labels?
● Just introduce them in the networks!
Conditional Variational Autoencoder:
(𝐱, 𝐜) → E_φ(⋅) → q_φ(z|x, c) [Encoder (inference)]
(𝐳 ∼ p(z), 𝐜) → G_θ(⋅) → p_θ(x|z, c) [Decoder (generation)]
65. Conditional generative models
What if I have information 𝐜 to condition the generation/inference, e.g., class
labels?
● Just introduce them in the networks!
GAN:
𝐳 ∼ p(z) → G_θ(⋅) → 𝐱̂
Sample a real batch x₁, …, x_B and a generated batch x̂₁, …, x̂_B and feed both to the discriminator D_φ(⋅).
66. Conditional generative models
What if I have information 𝐜 to condition the generation/inference, e.g., class labels?
● Just introduce them in the networks!
Conditional GAN:
(𝐳 ∼ p(z), 𝐜) → G_θ(⋅) → 𝐱̂
Sample a real batch x₁, …, x_B and a generated batch x̂₁, …, x̂_B; both the generator and the discriminator D_φ(⋅) receive 𝐜.
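"Just introduce them in the networks" usually means concatenation. A minimal sketch, with assumed sizes (10 classes, 32 latent dimensions) and a hypothetical `generator_input` helper showing how a one-hot label rides along with the latent code:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES, LATENT_DIM = 10, 32

def one_hot(label, num_classes=NUM_CLASSES):
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

# Conditioning by concatenation: append a one-hot encoding of c to the
# generator's input z (and likewise to the discriminator's or encoder's input).
def generator_input(z, label):
    return np.concatenate([z, one_hot(label)])

z = rng.standard_normal(LATENT_DIM)
g_in = generator_input(z, label=3)
print(g_in.shape)  # (42,): latent dims plus class dims
```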
67. Conditional GMs are very important!
Pix2Pix: Image-to-Image Translation with Conditional Adversarial Networks. Isola et al.
68. Conditional GMs are very important!
Pose Guided Person Image Generation. Ma et al. 2017.
70. Generation of terrain
Interactive Example-Based Terrain Authoring with Conditional Generative Adversarial Networks. Guérin et al. 2017.
71. 3D Content Generation
DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation.
Park et al. 2019.
Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling.
Wu et al. 2016
72. Face generation
GANFIT: Generative Adversarial Network Fitting for High Fidelity 3D Face Reconstruction. Gecer et al. 2019 (FaceSoft.io)
74. Generation of behaviour policies
Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow. Peng et al. 2018.
75. Generation of behaviour policies
Imitation Learning with Concurrent Actions in 3D Games. Harmer et al. 2018 (SEED)
76. Learn and accelerate physics
Latent-space Physics: Towards Learning the Temporal Evolution of Fluid Flow. Wiewel et al. 2018
77.
78. Thanks for inspiration and insightful
conversations
Anastasia Opara
Camilo Gordillo
Colin Barré-Brisebois
Hector Anadón León
Henrik Johansson
Jack Harmer
Joakim Bergdahl
Johan Andersson
Kristoffer Sjöö
Ken Brown
Linus Gisslen
Martin Singh-Blom
Mattias Teye
Mark Cerny
Mónica Villanueva Aylagas
Magnus Nordin
Olivier Pomarez
Paul Greveson
Roy Harvey
Tomasz Stachowiak
SEED // Search for Extraordinary Experiences Division
Stockholm – Los Angeles – Montréal – Remote
seed.ea.com
We're hiring!
Jorge del Val Santos
jdelvalsantos@ea.com
Editor's notes
Research division within EA which attempts to explore the future of interactive entertainment. Among other areas, we explore deep learning for animation, content creation and reinforcement learning.
Current content creation is time consuming, unintuitive, and hard. Leverage current data to make it easier.
It doesn’t scale!
They also allow for immersion in games through the interaction with the real world.
Or allow us to interact with games, enabling completely new experiences, with content being created dynamically. Dynamic and automatic creation of content enables new, rich experiences.
Trained with dataset of 70K 1024x1024 images!
Trained with dataset of 70K 1024x1024 images!
Let’s think about data. This can be anything, from images of cute dogs
language
Textures…
Meshes..
Rig controls of faces..
https://blog.animationmentor.com/introducing-viktor-and-moya-two-new-animation-mentor-character-rigs/. Justin Owens
Animations of characters..
In the end it's all numbers; more particularly, vectors… (Say M is the number of pixels!)
Think of it as points in a higher dimensional space
But… are those points completely random? Do they have some hidden structure?
Can any pixel have any value? Patterns
https://ml4a.github.io/ml4a/looking_inside_neural_nets/
No! We actually see a rich statistical hidden structure in data. Patterns materialize in things that we can actually learn.
Data is far from random, and there are inherent correlations among pixels; two neighbouring pixels tend to have the same color etc. Then, do we really need 700 numbers to represent one digit? Or what about a face?
Data has a rich statistical structure.
Features of faces. How would you describe a face to the police? – So there’s clearly a hidden reality about this data, an underlying truth (representation?).
The whole space of 1k images; few of them are actually faces.
If we could actually see the points in M dimensions…
Talk about the fundamental characteristic of faces.
… we would see that they actually lie on a lower-dimensional surface, which we call a manifold. This is therefore called the manifold assumption, and in practice it holds very well for natural data. The underlying truth is the geometry!
We could hope to represent a point with the coordinates IN the manifold (what really matters), instead of all the redundant dimensions. Let me elaborate…
Data has less INTRINSIC DIMENSIONALITY.
We can represent the data with far fewer numbers! We don't need 1M numbers for a 1k image! "Like the surface of the earth." Spoiler: learning that underlying geometry and the probability distribution will be the crux of generative models.
Learning that underlying space that REALLY represents the data, and not the thousands of pixels, will be the job of generative models.
Earth analogy. U-V analogy.
So we say that the data has inherent latent dimensions which describe the data much better than all the redundant information of the data space.
At left, the algorithm has discovered, without any supervision, two intrinsic dimensions of the data: pose and expression. Recall that we told the algorithm nothing about what expression or pose mean; no additional information was given.
Trained with dataset of 70K 1024x1024 images!
Go easy here. Put a concrete example of a random variable (dice, etc.). Example in C++ of a random normal variable (rand()?).
21 min – 22
Credit Venkatesh Tata
24:30
Sydney Firmin
Introduce training methods also! Introduce frameworks (pytorch & tensorflow) in different slide.
Reafirm to take it easy here.
Sydney Firmin
26:50 – 28:55
33:20 -
Colin Raffel: https://colinraffel.com/blog/gans-and-divergence-minimization.html
Go easy here and explain what is the idea.
34m – 38:20m = 4,2 m
38:25 – 40:25 -> 2m
40:28
41:47
42:30 – 43:30 = 1m
Showcase awesome examples
- 47:30 = total VAE 5m
47:30 – 49 – 50 = 2,3 m
Particularly, to the Jensen-Shannon divergence
50:10 – 52:10 = 2m
Put detective image?
Training method?
52:15 – 54:15 = 2m
54:30
GANs are among the most active areas of research in deep learning!
54:44
Awesome examples: Large Scale GAN Training for High Fidelity Natural Image Synthesis. Brock et al.
I would love to have that hamburger
55 – 56 = 1m
56
Pix2Pix: Image-to-Image Translation with Conditional Adversarial Networks. Isola et al
58 (since OK is 2m) – 59:30 -> 1:30m
Interactive Example-Based Terrain Authoring with Conditional Generative Adversarial Networks. Guérin et al.
Maybe put this earlier.
DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. Park et al. 2019.
Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. Wu et al. 2016
GANFIT / Genova
GANFIT: Generative Adversarial Network Fitting for High Fidelity 3D Face Reconstruction. Gecer et al. 2019.
Deep Convolutional Priors for Indoor Scene Synthesis. Wang et al. 2018.
Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow. Peng et al. 2018.
Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow. Peng et al. 2018.
TempoGAN, latent space physics / wrinkles
Latent-space Physics: Towards Learning the Temporal Evolution of Fluid Flow. Wiewel et al. 2018
Dynamic content generation – worlds that create themselves.
Intuitive and easy content creation.
Seamless interaction with the real world.
Leverage huge amounts of natural data.