The document discusses different distances for measuring the similarity between probability distributions, including the Total Variation (TV) distance, the Kullback-Leibler (KL) divergence, the Jensen-Shannon (JS) divergence, and the Earth Mover (EM) or Wasserstein-1 distance. It shows that the EM distance behaves more continuously than the other distances as the underlying distributions change. The document also defines couplings between distributions and works through examples that compute the Wasserstein distance between simple distributions. Finally, it presents theorems stating that the EM distance is continuous everywhere and differentiable almost everywhere for neural-network generators, unlike the other distances.
1. Introduction
• Main goal: learn a GAN by minimizing the Wasserstein distance W(Pr, Pg).
• In Section 2, we show how the Earth Mover (EM) distance behaves in comparison to the Total Variation (TV) distance, the Kullback-Leibler (KL) divergence, and the Jensen-Shannon (JS) divergence.
• In Section 3, we define the Wasserstein GAN and an efficient approximation of the EM distance.
• We empirically show that WGANs cure the main training problems of GANs.
2. Different Distances
• A σ-algebra Σ of subsets of X is a collection of subsets of X satisfying the following conditions:
(a) ∅ ∈ Σ
(b) if B ∈ Σ then B^c ∈ Σ
(c) if B1, B2, · · · is a countable collection of sets in Σ, then ∪_{n=1}^∞ Bn ∈ Σ
• Borel algebra: the smallest σ-algebra containing the open sets.
• A probability space consists of a sample space Ω, a set of events F, and a probability measure P, where the set of events F is a σ-algebra.
• A function µ is a probability measure on a measurable space (X, Σ) if
(a) µ(X) = 1, µ(∅) = 0, and µ(A) ∈ [0, 1] for every A ∈ Σ;
(b) countable additivity holds: for every countable collection {Ei} of pairwise disjoint sets in Σ,
µ(∪_i Ei) = Σ_i µ(Ei).
2. Different Distances
• The Total Variation (TV) distance
δ(Pr, Pg) = sup_{A∈Σ} |Pr(A) − Pg(A)|.
• The Kullback-Leibler (KL) divergence
KL(Pr||Pg) = ∫ log(Pr(x)/Pg(x)) Pr(x) dµ(x),
where both Pr and Pg are assumed to admit densities with respect to a measure µ.
• The Jensen-Shannon (JS) divergence
JS(Pr, Pg) = KL(Pr||Pm) + KL(Pg||Pm),
where Pm = (Pr + Pg)/2 is the mixture.
• The Earth-Mover (EM) distance or Wasserstein-1
W(Pr, Pg) = inf_{γ∈Π(Pr,Pg)} E_{(x,y)∼γ}[||x − y||],
where Π(Pr, Pg) denotes the set of all joint distributions γ(x, y) whose marginals are respectively Pr and Pg, that is, γ is a coupling of Pr and Pg.
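As a quick illustration of these definitions, the sketch below computes the four quantities for two small discrete distributions on {0, 1, 2}. It is a minimal example, assuming NumPy and SciPy are available; the names p_r, p_g, and support are placeholders, not from the paper.

    import numpy as np
    from scipy.stats import wasserstein_distance

    # Two toy distributions on the support {0, 1, 2}.
    support = np.array([0.0, 1.0, 2.0])
    p_r = np.array([0.5, 0.3, 0.2])
    p_g = np.array([0.2, 0.3, 0.5])

    # Total Variation: largest difference in probability over all events,
    # which for discrete distributions equals half the L1 distance.
    tv = 0.5 * np.abs(p_r - p_g).sum()

    # KL divergence (finite here because both distributions have full support).
    kl = np.sum(p_r * np.log(p_r / p_g))

    # JS divergence as defined above (sum of the two KLs to the mixture).
    p_m = 0.5 * (p_r + p_g)
    js = np.sum(p_r * np.log(p_r / p_m)) + np.sum(p_g * np.log(p_g / p_m))

    # Wasserstein-1 / Earth-Mover distance on the real line.
    w1 = wasserstein_distance(support, support, u_weights=p_r, v_weights=p_g)

    print(tv, kl, js, w1)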
2. Different Distances Couplings
Couplings
• χ : a compact metric space
• Σ : the set of all Borel subsets of χ
• Prob(χ) : the set of probability measures on χ
Definition
Let µ and ν be probability measures on the same measurable space (S, Σ).
A coupling of µ and ν is a probability measure γ on the product space (S × S, Σ × Σ) such that the marginals of γ coincide with µ and ν, i.e.,
γ(A × S) = µ(A) and γ(S × A) = ν(A) for all A ∈ Σ.
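To make the marginal condition concrete, here is a small sketch that checks whether a candidate joint distribution over {0, 1} × {0, 1} is a coupling of two Bernoulli marginals. It is illustrative only; the array gamma_indep and the helper is_coupling are hypothetical names, not from the paper.

    import numpy as np

    def is_coupling(gamma, mu, nu, tol=1e-12):
        """Check that the 2-D array gamma has row sums mu and column sums nu."""
        return (np.allclose(gamma.sum(axis=1), mu, atol=tol)
                and np.allclose(gamma.sum(axis=0), nu, atol=tol)
                and np.all(gamma >= -tol))

    # Ber(p1) marginal for X (rows), Ber(p2) marginal for Y (columns).
    p1, p2 = 0.3, 0.7
    mu = np.array([1 - p1, p1])
    nu = np.array([1 - p2, p2])

    # The independent coupling gamma(x, y) = mu(x) * nu(y) is always a coupling.
    gamma_indep = np.outer(mu, nu)
    print(is_coupling(gamma_indep, mu, nu))   # True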
2. Different Distances Example of Wasserstein Distance
Example
For the joint distributions f and g from the previous slide, assume (although it is not actually true) that
Π[Ber(p1), Ber(p2)] = {f, g}.
Then we have
W(Ber(p1), Ber(p2)) = min{q1p2 + p1q2, p2 − p1}, where qi = 1 − pi.
Proof.
Since Π[Ber(p1), Ber(p2)] = {f, g}, we only need to consider two cases.
Case 1. γ = f ∈ Π[Ber(p1), Ber(p2)]:
E_{(x,y)∼f}[||x − y||]
= f(0, 0)||0 − 0|| + f(0, 1)||0 − 1|| + f(1, 0)||1 − 0|| + f(1, 1)||1 − 1||
= q1p2 + p1q2.
2. Different Distances Example of Wasserstein Distance
Case 2. γ = g ∈ Π[Ber(p1), Ber(p2)]:
E_{(x,y)∼g}[||x − y||]
= g(0, 0)||0 − 0|| + g(0, 1)||0 − 1|| + g(1, 0)||1 − 0|| + g(1, 1)||1 − 1||
= p2 − p1.
Combining cases 1 and 2, we have
W(Ber(p1), Ber(p2)) = inf_{γ∈Π[Ber(p1),Ber(p2)]} E_{(x,y)∼γ}[||x − y||]
= inf_{γ∈{f,g}} E_{(x,y)∼γ}[||x − y||]
= min{q1p2 + p1q2, p2 − p1}.
2. Different Distances An example of couplings
Lemma
For p1, p2 ∈ [0, 1] with qi = 1 − pi, the set of all couplings Π[Ber(p1), Ber(p2)] of Ber(p1) and Ber(p2) is {pa}, where
pa(0, 0) = a
pa(0, 1) = q1 − a
pa(1, 0) = q2 − a
pa(1, 1) = p2 − q1 + a
and a ranges over the values for which all four entries are nonnegative, i.e., max{0, 1 − p1 − p2} ≤ a ≤ min{q1, q2}.
Proof.
Let γ ∈ Π[Ber(p1), Ber(p2)]. With X ∼ Ber(p1) along the rows and Y ∼ Ber(p2) along the columns, the marginal constraints give the following table:

γ            | Y = 0 | Y = 1 | Σ_y γ(x, y)
X = 0        |       |       | q1
X = 1        |       |       | p1
Σ_x γ(x, y)  | q2    | p2    |
2. Different Distances An example of couplings
For a in the admissible range, if γ(0, 0) = a, then the whole table is determined:

γ            | Y = 0   | Y = 1          | Σ_y γ(x, y)
X = 0        | a       | q1 − a         | q1
X = 1        | q2 − a  | p2 − (q1 − a)  | p1
Σ_x γ(x, y)  | q2      | p2             |

This means that every coupling γ of Ber(p1) and Ber(p2) is of the form pa for some admissible a, and conversely every such pa is a coupling. This completes the proof.
2. Different Distances A computational result of Wasserstein Distance
Theorem
For p1 ≤ p2, we have
W(Ber(p1), Ber(p2)) = p2 − p1.
Proof.
From the previous lemma, every coupling of Ber(p1) and Ber(p2) is of the form pa with pa(0, 0) = a. Then we obtain
E_{(x,y)∼pa}[||x − y||]
= pa(0, 0)||0 − 0|| + pa(0, 1)||0 − 1|| + pa(1, 0)||1 − 0|| + pa(1, 1)||1 − 1||
= (q1 − a) + (q2 − a) = 2 − p1 − p2 − 2a.
Since p1 and p2 are fixed, this expectation is minimized by taking a as large as possible. Because a cannot exceed either marginal probability, a ≤ min{q1, q2}, and the assumption p1 ≤ p2 gives q1 ≥ q2, so min{q1, q2} = q2. Taking a = q2 yields
W(Ber(p1), Ber(p2)) = 2 − p1 − p2 − 2q2 = p2 − p1.
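A quick numerical sanity check of this theorem, sweeping the coupling parameter a over its admissible range and taking the minimum expected cost. This is a small illustrative sketch; the variable names are mine, not from the paper.

    import numpy as np

    p1, p2 = 0.3, 0.7          # requires p1 <= p2
    q1, q2 = 1 - p1, 1 - p2

    # Admissible range of a so that all four entries of p_a are nonnegative.
    a_grid = np.linspace(max(0.0, 1 - p1 - p2), min(q1, q2), 1001)

    # E_{(x,y)~p_a}[|x - y|] = p_a(0,1) + p_a(1,0) = (q1 - a) + (q2 - a).
    costs = (q1 - a_grid) + (q2 - a_grid)

    print(costs.min())   # ~0.4
    print(p2 - p1)       # 0.4, matching the theorem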
2. Different Distances Example 1
Example (1)
• Assume the following:
▷ Z ∼ U[0, 1], the uniform distribution on the unit interval.
▷ P0 : the distribution of (0, Z) ∈ R², uniform on a straight vertical line passing through the origin.
▷ gθ(z) = (θ, z), with θ a single real parameter, and Pθ the distribution of gθ(Z).
Then we obtain the following:
• W(P0, Pθ) = |θ|
• JS(P0, Pθ) = log 2 if θ ≠ 0, and 0 if θ = 0
• KL(Pθ||P0) = KL(P0||Pθ) = +∞ if θ ≠ 0, and 0 if θ = 0
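The closed form W(P0, Pθ) = |θ| can be checked empirically by sampling the two line distributions and solving the discrete optimal transport problem. This is a rough sketch assuming the POT (Python Optimal Transport) package is installed; the sample sizes and names are illustrative, not from the paper.

    import numpy as np
    import ot  # POT: Python Optimal Transport (pip install pot)

    rng = np.random.default_rng(0)
    n, theta = 500, 0.8

    # Samples from P0 = law of (0, Z) and from P_theta = law of (theta, Z).
    z0, z1 = rng.uniform(size=n), rng.uniform(size=n)
    x0 = np.column_stack([np.zeros(n), z0])
    x1 = np.column_stack([np.full(n, theta), z1])

    # Empirical Wasserstein-1 with Euclidean ground cost.
    M = ot.dist(x0, x1, metric='euclidean')
    a = b = np.full(n, 1.0 / n)
    print(ot.emd2(a, b, M))   # close to |theta| = 0.8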
2. Different Distances Theorem 1
Theorem (1)
Let Pr be a fixed distribution over X. Let Z be a random variable (e.g. Gaussian) over another space Z. Let g : Z × R^d → χ be a function, denoted gθ(z), with z the first coordinate and θ the second. Let Pθ denote the distribution of gθ(Z). Then,
1. If g is continuous in θ, so is W(Pr, Pθ).
2. If g is locally Lipschitz and satisfies regularity assumption 1, then W(Pr, Pθ) is continuous everywhere, and differentiable almost everywhere.
3. Statements 1-2 are false for the Jensen-Shannon divergence JS(Pr, Pθ) and all the KLs.
2. Different Distances Theorem 1
The following corollary tells us that learning by minimizing the EM
distance makes sense (at least in theory) with neural networks.
Corollary
Let gθ be any feedforward neural network parameterized by θ, and p(z) a
prior over z such that Ez∼p(z)[||z||] < ∞ (e.g. Gaussian, uniform, etc.).
Then assumption 1 is satisfied and therefore W(Pr, Pθ) is continuous
everywhere and differentiable almost everywhere.
2. Different Distances Theorem 2
Theorem (2)
Let P be a distribution on a compact space X and (Pn)n∈N be a sequence of distributions on X. Then, considering all limits as n → ∞,
1. The following statements are equivalent:
• δ(Pn, P) → 0, with δ the total variation distance.
• JS(Pn, P) → 0, with JS the Jensen-Shannon divergence.
2. The following statements are equivalent:
• W(Pn, P) → 0.
• Pn → P in distribution (i.e., convergence in distribution for random variables).
3. KL(Pn||P) → 0 or KL(P||Pn) → 0 imply the statements in (2).
3. Wasserstein GAN
• Computing W(Pr, Pθ) directly from the definition of the Wasserstein distance is intractable. However, the Kantorovich-Rubinstein duality tells us that
W(Pr, Pθ) = sup_{||f||_L ≤ 1} E_{x∼Pr}[f(x)] − E_{x∼Pθ}[f(x)],
where ||f||_L ≤ 1 means that f satisfies the 1-Lipschitz condition.
• Note that, if we replace ||f||_L ≤ 1 with ||f||_L ≤ K for some K, we have
K · W(Pr, Pθ) = sup_{||f||_L ≤ K} E_{x∼Pr}[f(x)] − E_{x∼Pθ}[f(x)].
• If we have a parametrized family of functions {fw}w∈W that are all K-Lipschitz for some K, then we have (see the sketch below):
max_{w∈W} E_{x∼Pr}[fw(x)] − E_{x∼Pθ}[fw(x)] ≤ sup_{||f||_L ≤ K} E_{x∼Pr}[f(x)] − E_{x∼Pθ}[f(x)] = K · W(Pr, Pθ).
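Below is a minimal sketch of this idea: estimate W(Pr, Pθ), up to the constant K, by maximizing E_{x∼Pr}[fw(x)] − E_{x∼Pθ}[fw(x)] over a small parametrized critic whose weights are clipped so that it stays K-Lipschitz for some K. It assumes PyTorch is available and uses two toy 1-D Gaussians as Pr and Pθ; all names are illustrative, not from the paper.

    import torch
    import torch.nn as nn

    # Toy 1-D "real" and "model" distributions: N(0, 1) and N(2, 1).
    def sample_pr(n): return torch.randn(n, 1)
    def sample_ptheta(n): return torch.randn(n, 1) + 2.0

    # A small critic f_w; weight clipping keeps it K-Lipschitz for some K.
    critic = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.RMSprop(critic.parameters(), lr=5e-3)
    clip = 0.01

    for step in range(2000):
        x_r, x_t = sample_pr(256), sample_ptheta(256)
        # Maximize E[f(x_r)] - E[f(x_t)], i.e. minimize the negative gap.
        loss = -(critic(x_r).mean() - critic(x_t).mean())
        opt.zero_grad()
        loss.backward()
        opt.step()
        for p in critic.parameters():
            p.data.clamp_(-clip, clip)

    # The achieved gap approximates K * W(Pr, Ptheta) for the (unknown) constant
    # K induced by clipping; the true W here is |2 - 0| = 2.
    with torch.no_grad():
        gap = critic(sample_pr(10000)).mean() - critic(sample_ptheta(10000)).mean()
    print(float(gap))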
3. Wasserstein GAN Theorem 3
Theorem (3)
Let Pr be any distribution. Let Pθ be the distribution of gθ(Z), with Z a random variable with density p and gθ a function satisfying assumption 1. Then, there is a solution f : χ → R to the problem
max_{||f||_L ≤ 1} E_{x∼Pr}[f(x)] − E_{x∼Pθ}[f(x)],
and we have
∇θ W(Pr, Pθ) = −E_{z∼p(z)}[∇θ f(gθ(z))]
when both terms are well-defined.
• Objective functions (see the sketch below):
L_D^WGAN = E_{x∼Pr}[fw(x)] − E_{z∼p(z)}[fw(gθ(z))]
L_G^WGAN = E_{z∼p(z)}[fw(gθ(z))]
where the critic weights are clipped, w ← clip(w, −0.01, 0.01), after each update of L_D^WGAN.
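A minimal PyTorch-style sketch of one WGAN training iteration built from these two objectives: several critic updates with weight clipping, followed by one generator update whose gradient matches the formula in Theorem (3). The networks, optimizers, and sample_real are assumed to be defined elsewhere; critic, generator, and the hyperparameter values here are illustrative, not prescribed by the paper.

    import torch

    def wgan_iteration(critic, generator, sample_real, opt_c, opt_g,
                       z_dim=100, batch=64, n_critic=5, c=0.01):
        # Critic: maximize L_D = E[f_w(x)] - E[f_w(g_theta(z))].
        for _ in range(n_critic):
            x = sample_real(batch)
            z = torch.randn(batch, z_dim)
            loss_d = -(critic(x).mean() - critic(generator(z).detach()).mean())
            opt_c.zero_grad(); loss_d.backward(); opt_c.step()
            # Clip critic weights to [-c, c] to enforce a Lipschitz bound.
            for p in critic.parameters():
                p.data.clamp_(-c, c)

        # Generator: descend -E[f_w(g_theta(z))], consistent with
        # grad_theta W(Pr, Ptheta) = -E[grad_theta f(g_theta(z))].
        z = torch.randn(batch, z_dim)
        loss_g = -critic(generator(z)).mean()
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
        return float(loss_d), float(loss_g)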
3. Wasserstein GAN Figure 2
In this paper, the authors call the discriminator a critic. In Figure 2, a GAN discriminator and a WGAN critic are trained until optimality. The discriminator learns very quickly to distinguish between fake and real samples and then saturates. The critic, in contrast, cannot saturate and converges to a linear function, which provides clean gradients everywhere.
Some knowledge to read Appendix
• Let χ ⊂ R^d be a compact set (that is, closed and bounded, by the Heine-Borel theorem), and let Prob(χ) be the set of probability measures over χ.
• We define
Cb(χ) = {f : χ → R | f is continuous and bounded}.
• For f ∈ Cb(χ), we can define a norm ||f||∞ = max_{x∈χ} |f(x)|, since f is bounded.
• Then we have a normed vector space (Cb(χ), || · ||∞).
• The dual space
Cb(χ)* = {ϕ : Cb(χ) → R | ϕ is linear and continuous}
has norm ||ϕ|| = sup_{f∈Cb(χ), ||f||∞≤1} |ϕ(f)|.
Some knowledge to read Appendix
• Let µ be a signed measure over χ, and define the total variation norm
||µ||TV = sup_{A⊂χ} |µ(A)|,
where A ranges over the Borel subsets of χ. For two probability distributions Pr and Pθ, the function
δ(Pr, Pθ) = ||Pr − Pθ||TV
is a distance on Prob(χ), called the Total Variation distance.
• We can consider the map
Φ : (Prob(χ), δ) → (Cb(χ)*, || · ||),
where Φ(P)(f) = E_{x∼P}[f(x)] is a linear functional over Cb(χ).
• By the Riesz Representation Theorem, Φ is an isometric immersion, that is, δ(P, Q) = ||Φ(P) − Φ(Q)||, and Φ is a 1-1 correspondence.