Talk presented at the SIAM IS 2022 conference.
Very often, in the course of uncertainty quantification tasks or
data analysis, one has to deal with high-dimensional random variables (RVs)
(with values in $\Rd$). Just like any other RV,
a high-dimensional RV can be described by its probability density function (\pdf) and/or
by the corresponding probability characteristic function (\pcf),
or by a more general representation as
a function of other, known, random variables.
Here the interest is mainly to compute characterisations like the entropy, the Kullback-Leibler divergence, or more general
$f$-divergences. These are all computed from the \pdf, which is often not available directly,
and it is a computational challenge to even represent it in a numerically
feasible fashion when the dimension $d$ is even moderately large. It
is an even stronger numerical challenge to then actually compute said characterisations
in the high-dimensional case.
In this regard, in order to obtain a computationally feasible task, we propose
to approximate the density by a low-rank tensor.
1. Computing f-Divergences and Distances of
High-Dimensional Probability Density Functions
Alexander Litvinenko (RWTH Aachen, Germany),
joint work with
Youssef Marzouk, Hermann G. Matthies, Marco Scavino, Alessio Spantini
SIAM IS 2022
2. Plan
1. Motivating examples, working diagram
2. Basics: pdf, cdf, FFT
3. Theoretical background
4. Computation of moments and divergences
5. Tensor formats
6. Algorithms
7. Numerics
3. Motivation and settings:
Assume we use the Karhunen-Loève and/or Polynomial Chaos
Expansion methods to solve
1. a PDE with uncertain coefficients,
2. a parameter inference problem,
3. a data assimilation task,
4. an optimal design of experiment task.
As a result, we obtain a functional representation (KLE, PCE) of
our uncertain QoI.
How to compute the KLD and other divergences? (The classical
formulas are given only in terms of pdfs, which are not known.)
How to compute the entropy, KLD and other divergences if the pdfs
are not available or do not exist?
4. Motivation: stochastic PDEs
$$-\nabla\cdot(\kappa(x,\omega)\nabla u(x,\omega)) = f(x,\omega),\qquad x\in G\subset\mathbb{R}^d,\ \omega\in\Omega$$
Write first the Karhunen-Loève Expansion and then, for the
uncorrelated random variables, the Polynomial Chaos
Expansion:
$$u(x,\omega) = \sum_{i=1}^{K}\sqrt{\lambda_i}\,\phi_i(x)\,\xi_i(\omega) = \sum_{i=1}^{K}\sqrt{\lambda_i}\,\phi_i(x)\sum_{\alpha\in\mathcal{J}}\xi_i^{(\alpha)}H_\alpha(\theta(\omega))\qquad(1)$$
where $\xi_i^{(\alpha)}$ is a tensor. Note that $\alpha=(\alpha_1,\alpha_2,\ldots,\alpha_M,\ldots)$ is a
multi-index, and
$$\sum_{\alpha\in\mathcal{J}}\xi_i^{(\alpha)}H_\alpha(\theta(\omega)) := \sum_{\alpha_1=1}^{p_1}\cdots\sum_{\alpha_M=1}^{p_M}\xi_i^{(\alpha_1,\ldots,\alpha_M)}\prod_{j=1}^{M}h_{\alpha_j}(\theta_j)\qquad(2)$$
The same decomposition is used for κ(x,ω).
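As a small illustration of Eq. (1), the following sketch computes a truncated KLE of a Gaussian random field numerically. The exponential covariance kernel, correlation length, grid and truncation level K are assumptions chosen for the example, and quadrature weights on the uniform grid are omitted.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 200)                      # spatial grid on G = [0, 1]
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.2)  # exponential covariance kernel
lam, phi = np.linalg.eigh(C)                        # eigenpairs, ascending order
lam, phi = lam[::-1], phi[:, ::-1]                  # reorder: largest first

K = 10                                              # truncation level
xi = np.random.randn(K)                             # uncorrelated RVs xi_i(omega)
u = phi[:, :K] @ (np.sqrt(lam[:K]) * xi)            # one realisation of u(x, omega)
```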
6. How to compute the KLD and other divergences if the pdf is not available?
There are two ways to compute the f-divergence, KLD, and entropy.
7. Connection of pcf and pdf
The probability characteristic function (pcf) $\varphi_\xi$ can be defined as
$$\varphi_\xi(t) := \mathbb{E}(\exp(\mathrm{i}\langle\xi|t\rangle)) := \int_{\mathbb{R}^d} p_\xi(y)\exp(\mathrm{i}\langle y|t\rangle)\,\mathrm{d}y =: \mathcal{F}^{[d]}(p_\xi)(t),$$
where $t=(t_1,t_2,\ldots,t_d)\in\mathbb{R}^d$ is the dual variable to $y\in\mathbb{R}^d$,
$\langle y|t\rangle = \sum_{j=1}^{d} y_j t_j$ is the canonical inner product on $\mathbb{R}^d$, and $\mathcal{F}^{[d]}$
is the probabilist's $d$-dimensional Fourier transform. Conversely,
$$p_\xi(y) = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d}\exp(-\mathrm{i}\langle t|y\rangle)\,\varphi_\xi(t)\,\mathrm{d}t = \mathcal{F}^{[-d]}(\varphi_\xi)(y).\qquad(3)$$
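A quick one-dimensional sanity check of this Fourier pair (a sketch, not part of the talk): recover the standard Gaussian pdf from its pcf $\varphi(t)=e^{-t^2/2}$ by discretising Eq. (3) with a plain Riemann sum; the grid sizes and the truncation of $\mathbb{R}$ to $[-40,40]$ are assumptions.

```python
import numpy as np

# pcf of N(0, 1) on a truncated dual grid
t = np.linspace(-40.0, 40.0, 4001)
dt = t[1] - t[0]
phi = np.exp(-t**2 / 2)

# inverse transform (3), discretised as a Riemann sum
y = np.linspace(-4.0, 4.0, 9)
pdf = (np.exp(-1j * np.outer(y, t)) * phi).sum(axis=1).real * dt / (2 * np.pi)

exact = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)
print(np.max(np.abs(pdf - exact)))  # near machine precision
```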
8. Discrete low-rank representation of pcf
We try to find an approximation
$$\varphi_\xi(t) \approx \tilde{\varphi}_\xi(t) = \sum_{\ell=1}^{R}\bigotimes_{\nu=1}^{d}\varphi_{\ell,\nu}(t_\nu),\qquad(4)$$
where the $\varphi_{\ell,\nu}(t_\nu)$ are one-dimensional functions.
Then we can get
$$p_\xi(y) \approx \tilde{p}_\xi(y) = \mathcal{F}^{[-d]}(\tilde{\varphi}_\xi)(y) = \sum_{\ell=1}^{R}\bigotimes_{\nu=1}^{d}\mathcal{F}_1^{-1}(\varphi_{\ell,\nu})(y_\nu),$$
where $\mathcal{F}_1^{-1}$ is the one-dimensional inverse Fourier transform.
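To see why this factorisation pays off, here is a rank-1 sketch (assuming, for illustration, the standard Gaussian N(0, I_d), whose pcf is exactly rank one): a single one-dimensional inverse transform gives every factor, so only d·n values are stored instead of n^d.

```python
import numpy as np

d, n = 8, 257
t = np.linspace(-40.0, 40.0, 4001)
dt = t[1] - t[0]
y = np.linspace(-4.0, 4.0, n)

phi_factor = np.exp(-t**2 / 2)  # identical 1D pcf factor exp(-t_nu^2 / 2)
f1 = (np.exp(-1j * np.outer(y, t)) * phi_factor).sum(axis=1).real * dt / (2 * np.pi)

# The pdf at y* = (y[k_1], ..., y[k_d]) is a product of d one-dimensional factors:
k = [n // 2] * d                                 # evaluate at the origin
print(np.prod(f1[k]), (2 * np.pi) ** (-d / 2))   # the two values agree
```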
9. Representation of a 3D tensor in the CP tensor format
A full tensor $w\in\mathbb{R}^{n_1\times n_2\times n_3}$ (on the left) is represented as a sum
of tensor products. The lines on the right denote the vectors
$w_{i,k}\in\mathbb{R}^{n_k}$, $i=1,\ldots,r$, $k=1,2,3$.
10. CP tensor format linear algebra
$$\alpha\cdot w = \sum_{j=1}^{r}\alpha\bigotimes_{\nu=1}^{d}w_{j,\nu} = \sum_{j=1}^{r}\bigotimes_{\nu=1}^{d}\alpha_\nu w_{j,\nu},$$
where $\alpha_\nu := \sqrt[d]{|\alpha|}$ for all $\nu > 1$ and $\alpha_1 := \operatorname{sign}(\alpha)\sqrt[d]{|\alpha|}$.
The sum of two tensors costs only $O(1)$:
$$w = u + v = \sum_{j=1}^{r_u}\bigotimes_{\nu=1}^{d}u_{j,\nu} + \sum_{k=1}^{r_v}\bigotimes_{\mu=1}^{d}v_{k,\mu} = \sum_{j=1}^{r_u+r_v}\bigotimes_{\nu=1}^{d}w_{j,\nu},$$
where $w_{j,\nu} := u_{j,\nu}$ for $j\le r_u$ and $w_{j,\nu} := v_{j-r_u,\nu}$ for $r_u < j\le r_u+r_v$.
The sum $w$ generally has rank $r_u+r_v$.
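A minimal sketch of these two CP operations, with the (assumed) convention that a CP tensor is stored as a list of d factor matrices of shape (n_nu, r):

```python
import numpy as np

def cp_scale(u, alpha):
    """alpha * u: distribute |alpha|^(1/d) over all factors, the sign over the first."""
    d = len(u)
    s = abs(alpha) ** (1.0 / d)
    out = [f * s for f in u]
    out[0] = np.sign(alpha) * out[0]
    return out

def cp_add(u, v):
    """u + v: concatenate factor columns; the result has rank r_u + r_v."""
    return [np.hstack([fu, fv]) for fu, fv in zip(u, v)]
```

No arithmetic on tensor entries is performed in cp_add, which is why the sum costs O(1) flops: only factor matrices are concatenated.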
14. Hadamard product in the CP format
$$u\odot v = \sum_{j=1}^{r_u}\sum_{k=1}^{r_v}\bigotimes_{\nu=1}^{d}\left(u_{j,\nu}\odot v_{k,\nu}\right)$$
The new rank can increase up to $r_u r_v$, and the computational cost
is $O(r_u r_v n d)$.
The scalar product can be computed as follows:
$$\langle u|v\rangle_T = \left\langle\sum_{j=1}^{r_u}\bigotimes_{\nu=1}^{d}u_{j,\nu}\,\middle|\,\sum_{k=1}^{r_v}\bigotimes_{\nu=1}^{d}v_{k,\nu}\right\rangle_T = \sum_{j=1}^{r_u}\sum_{k=1}^{r_v}\prod_{\nu=1}^{d}\langle u_{j,\nu}|v_{k,\nu}\rangle$$
The cost is $O(r_u r_v n d)$.
Rank truncation is done via the ALS method or the Gauss-Newton method.
The scalar product above also yields the Frobenius norm
$\|u\| := \sqrt{\langle u|u\rangle_T}$.
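A sketch of the Hadamard and scalar products in the same list-of-factors representation (again an illustration, not the talk's code); the Frobenius norm is the scalar product of u with itself:

```python
import numpy as np

def cp_hadamard(u, v):
    """Pointwise product; the rank grows to r_u * r_v, cost O(r_u r_v n d)."""
    return [np.einsum('ij,ik->ijk', fu, fv).reshape(fu.shape[0], -1)
            for fu, fv in zip(u, v)]

def cp_inner(u, v):
    """Scalar product <u|v>_T, cost O(r_u r_v n d)."""
    G = np.ones((u[0].shape[1], v[0].shape[1]))  # r_u x r_v accumulator
    for fu, fv in zip(u, v):                     # loop over dimensions nu = 1..d
        G *= fu.T @ fv                           # Gram matrix <u_{j,nu} | v_{k,nu}>
    return G.sum()

def cp_norm(u):
    """Frobenius norm ||u|| = sqrt(<u|u>_T)."""
    return float(np.sqrt(cp_inner(u, u)))
```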
15. Tensor Train data format
$$w := (w_{i_1\ldots i_d}) = \sum_{j_0=1}^{r_0}\cdots\sum_{j_d=1}^{r_d}w^{(1)}_{j_0 j_1}[i_1]\cdots w^{(\nu)}_{j_{\nu-1}j_\nu}[i_\nu]\cdots w^{(d)}_{j_{d-1}j_d}[i_d]$$
Alternatively, each TT core $W^{(\nu)}$ may be seen as a vector of
$r_{\nu-1}\times r_\nu$ matrices $W^{(\nu)}_{i_\nu}$ of length $M_\nu$, i.e.
$W^{(\nu)} = (W^{(\nu)}_{i_\nu}) : 1\le i_\nu\le M_\nu$. Then
$$w := (w_{i_1\ldots i_d}) = \prod_{\nu=1}^{d}W^{(\nu)}_{i_\nu}.$$
Schema of the TT decomposition: the wagons denote the TT
cores and each wheel denotes the index $i_\nu$. Each wagon is
connected with its neighbours by the indices $j_{\nu-1}$ and $j_\nu$.
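A small sketch of how a single entry of a TT tensor is evaluated as a product of core slices; the storage convention (cores as arrays of shape (r_{nu-1}, n_nu, r_nu) with r_0 = r_d = 1) is a common one and an assumption here:

```python
import numpy as np

def tt_entry(cores, idx):
    """w[i_1, ..., i_d] = W^(1)[i_1] @ W^(2)[i_2] @ ... @ W^(d)[i_d]."""
    v = np.ones((1, 1))
    for core, i in zip(cores, idx):   # core has shape (r_prev, n_nu, r_nu)
        v = v @ core[:, i, :]         # multiply by the r_prev x r_nu slice
    return v[0, 0]
```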
17. Computation of moments and divergences
On a grid of $N = n^d$ points covering a volume $V$, with $P$ the discretised pdf
and $F$ a discretised function, an expectation is approximated by
$$\mathbb{E}(F) \approx \frac{V}{N}\langle F|P\rangle_T.$$
Differential entropy, requiring the point-wise logarithm of $P$:
$$h(p) := \mathbb{E}_p(-\log(p)) \approx \mathbb{E}_P(-\log(P)) = -\frac{V}{N}\langle\log(P)|P\rangle.$$
Then the $f$-divergence of $p$ from $q$ and its discrete
approximation are defined as
$$D_f(p\|q) := \mathbb{E}_q\left(f\left(\frac{p}{q}\right)\right) \approx \mathbb{E}_Q\left(f(P\oslash Q)\right) = \frac{V}{N}\left\langle f(P\oslash Q)\,\middle|\,Q\right\rangle_T,$$
where $\oslash$ denotes the point-wise (Hadamard) division.
22. List of some typical divergences and distances

Divergence                 $D_\bullet(p\|q)$
KLD                        $\int \log\left(p(x)/q(x)\right)p(x)\,\mathrm{d}x$
Hellinger dist. (squared)  $\frac{1}{2}\int\left(\sqrt{p(x)}-\sqrt{q(x)}\right)^2\mathrm{d}x$
Bregman div.               $\int\left[\left(\phi(p(x))-\phi(q(x))\right)-\left(p(x)-q(x)\right)\phi'(q(x))\right]\mathrm{d}x$
Bhattacharyya              $-\log\left(\int\sqrt{p(x)\,q(x)}\,\mathrm{d}x\right)$
23. Discrete approximations for divergences

Divergence   Approx. $D_\bullet(p\|q)$
KLD          $\frac{V}{N}\left(\langle\log(P)|P\rangle - \langle\log(Q)|P\rangle\right)$
$(D_H)^2$    $\frac{V}{2N}\langle\sqrt{P}-\sqrt{Q}\,|\,\sqrt{P}-\sqrt{Q}\rangle$
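As a plain-array illustration of these discrete approximations (full numpy arrays stand in for the low-rank tensors here, so this is only feasible for small d):

```python
import numpy as np

def kld_discrete(P, Q, V):
    """(V/N) * (<log(P)|P> - <log(Q)|P>)."""
    N = P.size
    return (V / N) * (np.sum(np.log(P) * P) - np.sum(np.log(Q) * P))

def hellinger2_discrete(P, Q, V):
    """(V/(2N)) * <sqrt(P) - sqrt(Q) | sqrt(P) - sqrt(Q)>."""
    N = P.size
    r = np.sqrt(P) - np.sqrt(Q)
    return (V / (2 * N)) * np.sum(r * r)

# 1D check: N(0,1) vs N(0.5,1) on [-10, 10]; the exact KLD is 0.125
x = np.linspace(-10.0, 10.0, 4001)
P = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
Q = np.exp(-(x - 0.5)**2 / 2) / np.sqrt(2 * np.pi)
print(kld_discrete(P, Q, 20.0))  # ~0.125
```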
50. Series of numerical examples
1. The KLD is computed with the analytical formula and with the
AMEn cross algorithm from the TT-toolbox.
2. The Hellinger distance is computed with well-known analytical
formulas and with the AMEn cross algorithm.
3. For a case where the pdf is not known analytically, the d-variate elliptically
contoured α-stable distributions are chosen and accessed
via their pcfs.
4. KLD and Hellinger distances for different values of d, n and
the parameter α.
51. Example 1: KLD for two Gaussian distributions
$\mathcal{N}_1 := \mathcal{N}(\mu_1,C_1)$ and $\mathcal{N}_2 := \mathcal{N}(\mu_2,C_2)$, where
$C_1 := \sigma_1^2 I$, $C_2 := \sigma_2^2 I$,
$\mu_1 = (1.1,\ldots,1.1)$ and $\mu_2 = (1.4,\ldots,1.4)\in\mathbb{R}^d$,
$d \in \{16,32,64\}$, and $\sigma_1 = 1.5$, $\sigma_2 = 22.1$.
The well-known analytical formula is
$$2\,D_{KL}(\mathcal{N}_1\|\mathcal{N}_2) = \operatorname{tr}(C_2^{-1}C_1) + (\mu_2-\mu_1)^T C_2^{-1}(\mu_2-\mu_1) - d + \log\frac{|C_2|}{|C_1|}$$
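A quick check of this formula with the parameters above (a sketch using numpy for the linear algebra):

```python
import numpy as np

d, s1, s2 = 16, 1.5, 22.1
mu1, mu2 = np.full(d, 1.1), np.full(d, 1.4)
C1, C2 = s1**2 * np.eye(d), s2**2 * np.eye(d)

dkl = 0.5 * (np.trace(np.linalg.solve(C2, C1))
             + (mu2 - mu1) @ np.linalg.solve(C2, mu2 - mu1)
             - d + np.log(np.linalg.det(C2) / np.linalg.det(C1)))
print(dkl)  # ~35.08 for d = 16, matching the table on the next slide
```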
52. Comparison of KLDs computed via two methods
$D_{KL}$ computed via TT tensors (AMEn algorithm) and via the
analytical formula for various values of d.
TT tolerance = 10^{-6}, the stopping difference between
consecutive iterations.

d                   16        32        64
n                   2048      2048      2048
D_KL (exact)        35.08     70.16     140.32
D_KL (TT approx.)   35.08     70.16     140.32
abs. error          4.0e-7    2.43e-5   1.4e-5
rel. error          1.1e-8    3.46e-8   8.1e-8
comp. time, sec.    1.0       5.0       18.7
54. Hellinger distance D_H
computed via TT tensors (AMEn) and via the analytical formula.
TT tolerance = 10^{-6}.

d                   16        32        64
n                   2048      2048      2048
D_H (exact)         0.99999   0.99999   0.99999
D_H (TT approx.)    0.99992   0.99999   0.99999
abs. error          3.5e-5    7.1e-5    1.4e-4
rel. error          2.5e-5    5.0e-5    1.0e-4
comp. time, sec.    1.7       7.5       30.5

The AMEn algorithm is able to compute the Hellinger distance
D_H between two multivariate Gaussian distributions for d ∈ {16,32,64}
and n = 2048. The approximate values agree closely with the exact ones,
and the errors are small.
55. Example 3: α-stable distribution
The pcf of a d-variate elliptically contoured α-stable
distribution is given by
$$\varphi_\xi(t) = \exp\left(\mathrm{i}\langle t|\mu\rangle - \langle t|Ct\rangle^{\alpha/2}\right).$$
AMEn tolerance = 10^{-9}.
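For illustration, this pcf can be evaluated on a batch of points as follows (a sketch; the function name and batching convention are my own):

```python
import numpy as np

def alpha_stable_pcf(T, mu, C, alpha):
    """pcf exp(i<t|mu> - <t|Ct>^(alpha/2)); T is an (m, d) array of points t."""
    quad = np.einsum('ij,jk,ik->i', T, C, T)   # <t|Ct> for each row t of T
    return np.exp(1j * (T @ mu) - quad ** (alpha / 2))
```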
57. Full storage vs. low-rank
d = 32 and n = 256 mean that the amount of data in full
storage mode would be
N = n^d = 256^32 ≈ 1.16E77 entries, i.e. ≈ 1E78 bytes.
In the TT low-rank approximation it is ca. 626 MB and fits on a
laptop.
Assuming a 1 GHz notebook, the KLD computation in full mode
would require ca. 1.2E68 sec,
or more than 3E60 years,
and even with a perfect speed-up on a parallel supercomputer
with, say, 1,000,000 processors,
this would still require more than 3E54 years;
compare this with the estimated age of the universe of ca.
1.4E10 years.
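The quoted counts can be reproduced in a few lines (assuming 8 bytes per entry and one operation per entry at 1 GHz):

```python
n, d = 256, 32
entries = float(n) ** d
print(f"{entries:.2e} entries")                      # ~1.16e77
print(f"{8 * entries:.2e} bytes")                    # ~9.3e77, i.e. ~1e78
secs = entries / 1e9                                 # one op per entry at 1 GHz
print(f"{secs:.2e} s = {secs / 3.15e7:.2e} years")   # ~1.2e68 s, ~3.7e60 years
```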
58. Example 4: D_KL(α1,α2) between two α-stable distributions
for various (α1,α2) and fixed d = 8 and n = 64.
µ1 = µ2 = 0, C1 = C2 = I. AMEn tolerance = 10^{-12}.

(α1,α2)             (2.0,0.5)  (2.0,1.0)  (2.0,1.5)  (2.0,1.9)  (1.5,1.4)  (1.0,0.4)
D_KL(α1,α2)         2.27       0.66       0.3        0.03       0.031      0.6
comp. time, sec.    8.4        7.8        7.5        8.5        11         8.7
max. TT rank        78         74         76         76         80         79
memory, MB          28.5       28.5       27.1       28.5       35         29.5
59. Example 5: Hellinger distance D_H(α1,α2)
for the d-variate elliptically contoured α-stable distribution with
α1 = 1.5 and α2 = 0.9 for different d and n.
µ1 = µ2 = 0, C1 = C2 = I. AMEn tolerance = 10^{-9}.

d                   16     16     16     16     16     16     32     32     32     32
n                   8      16     32     64     128    256    16     32     64     128
D_H(1.5,0.9)        0.218  0.223  0.223  0.223  0.219  0.223  0.180  0.176  0.175  0.176
comp. time, sec.    2.8    3.7    7.5    19     53     156    11     21     62     117
max. TT rank        79     76     76     76     79     76     75     71     75     74
memory, MB          7.7    17     34     71     145    283    34     66     144    285
60. Example 6: D_H vs. TT (AMEn) tolerances
Computation of D_H(α1,α2) between two α-stable distributions
(α1 = 1.5 and α2 = 0.9) for different AMEn tolerances.
n = 128, d = 32, µ1 = µ2 = 0, C1 = C2 = I.

TT (AMEn) tolerance   1e-7    1e-8    1e-9    1e-10   1e-14
D_H(1.5,0.9)          0.1645  0.1817  0.176   0.1761  0.1802
comp. time, sec.      43      86      103     118     241
max. TT rank          64      75      75      78      77
memory, MB            126     255     270     307     322
61. Conclusion
We demonstrated:
1. the numerical computation of characterising statistics of
high-dimensional pdfs, as well as of their divergences and
distances;
2. that pdfs and pcfs, and some functions of them (log(·), √(·), ...),
can be approximated and represented in a low-rank tensor
data format;
3. that the utilisation of low-rank tensor techniques helps to reduce the
complexity and storage from exponential O(n^d) to linear, e.g.
O(d n r^2) for the TT format.
62. Literature
1. A. Litvinenko, Y. Marzouk, H.G. Matthies, M. Scavino, A.
Spantini, Computing f-Divergences and Distances of
High-Dimensional Probability Density Functions – Low-Rank
Tensor Approximations, arXiv: 2111.07164, 2021,
https://arxiv.org/abs/2111.07164
2. M. Espig, W. Hackbusch, A. Litvinenko, H.G. Matthies, E.
Zander, Iterative algorithms for the post-processing of
high-dimensional data, Journal of Computational Physics 410, 109396, 2020,
https://doi.org/10.1016/j.jcp.2020.109396
3. S. Dolgov, A. Litvinenko, D. Liu, Kriging in Tensor Train
data format, Conf. Proc., 3rd Int. Conf. on Uncertainty
Quantification in CSE, pp. 309-329, 2019,
https://files.eccomasproceedia.org/papers/e-books/uncecomp_2019.pdf
63. Literature
4. A. Litvinenko, D. Keyes, V. Khoromskaia, B.N. Khoromskij, H.G.
Matthies, Tucker tensor analysis of Matérn functions in
spatial statistics, Computational Methods in Applied
Mathematics 19 (1), 101-122, 2019,
https://www.degruyter.com/document/doi/10.1515/cmam-2018-0022/pdf
5. A. Litvinenko, R. Kriemann, M.G. Genton, Y. Sun, D.E. Keyes,
HLIBCov: Parallel hierarchical matrix approximation of large
covariance matrices and likelihoods with applications in
parameter identification, MethodsX 7, 100600, 2020,
https://doi.org/10.1016/j.mex.2019.07.001
6. A. Litvinenko, Y. Sun, M.G. Genton, D.E. Keyes, Likelihood
approximation with hierarchical matrices for large spatial
datasets, Computational Statistics & Data Analysis 137, pp.
115-132, 2019,
https://doi.org/10.1016/j.csda.2019.02.002