70. Take a set of radii E = {ε1, . . . , εm}, m < n. For each ε ∈ E compute the pair (Xε, Yε), and from the regression data {(Xε, Yε)}ε∈E minimize

    R = (1/m) Σ_{ε∈E} (Yε − f(z) − C Xε)²    (11)

with respect to f(z) and C. The fitted intercept f(z) is the density estimate at z, written ˆfs(z).
71. Estimating the density ˆfs(z) at each sample point and applying the leave-one-out principle gives

    ˆHs(D) = −(1/n) Σ_{i=1}^n ln ˆfs,i(xi),    (12)

where ˆfs,i(xi) is the density estimate at xi computed from D with xi removed. ˆHs(D) is the Simple Regression Entropy Estimator (SRE) [Hino+, 2015].
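The SRE recipe above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: the radius grid, the choice Xε = ε², and all function names are my own, and the least-squares fit of Eq. (11) is done with ordinary `lstsq`.

```python
import numpy as np
from math import gamma, pi

def unit_ball_volume(p):
    """Volume c_p of the unit ball in R^p."""
    return pi ** (p / 2) / gamma(p / 2 + 1)

def sre_density(z, data, eps_grid):
    """Fitted intercept of the regression Y_eps = f(z) + C * X_eps (Eq. (11))."""
    n, p = data.shape
    c_p = unit_ball_volume(p)
    dists = np.linalg.norm(data - z, axis=1)
    X = eps_grid ** 2                                    # X_eps = eps^2
    Y = np.array([(dists <= e).sum() for e in eps_grid]) \
        / (n * c_p * eps_grid ** p)                      # Y_eps = k_eps / (n c_p eps^p)
    A = np.column_stack([np.ones_like(X), X])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return coef[0]                                       # intercept = f_hat_s(z)

def sre_entropy(data, eps_grid):
    """Leave-one-out average of -ln f_hat_s over the sample (Eq. (12))."""
    logs = []
    for i in range(len(data)):
        loo = np.delete(data, i, axis=0)                 # remove x_i from D
        logs.append(np.log(sre_density(data[i], loo, eps_grid)))
    return -np.mean(logs)
```

For data uniform on [0, 1] (true entropy 0), `sre_entropy` should return a value near zero, up to boundary bias from balls extending outside the support.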
72. SRE: how it works
[Figure: left, standard Normal density (x vs. density); right, Yε plotted against ε² with the fitted regression line; the fitted intercept gives ˆfs(z = 0.5).]
73. SRE: how it works
[Figure: left, Bimodal density (x vs. density); right, Yε plotted against ε² with the fitted regression line; the fitted intercept gives ˆfs(z = 0.5).]
74. For each ε and each xi ∈ D,

    Yε ≃ f(xi) + C Xε,  with Yε = kε / (n cp ε^p) and C = p tr∇²f(xi) / (4(p/2 + 1)).

Since both quantities depend on xi, write them as Yε^i and C^i:

    Yε^i ≃ f(xi) + C^i Xε.
75. Starting from Yε^i = f(xi) + C^i Xε and averaging over xi ∈ D:

    −(1/n) Σ_{i=1}^n ln Yε^i = −(1/n) Σ_{i=1}^n ln( f(xi) + C^i Xε )
        = −(1/n) Σ_{i=1}^n ln[ f(xi) (1 + C^i Xε / f(xi)) ]
        = −(1/n) Σ_{i=1}^n ln f(xi) − (1/n) Σ_{i=1}^n ln( 1 + C^i Xε / f(xi) )
        ≃ −(1/n) Σ_{i=1}^n ln f(xi) − (1/n) Σ_{i=1}^n (C^i / f(xi)) Xε,

where the last step uses ln(1 + t) ≃ t for small t.
76. In summary,

    −(1/n) Σ_{i=1}^n ln Yε^i ≃ −(1/n) Σ_{i=1}^n ln f(xi) − (1/n) Σ_{i=1}^n (C^i / f(xi)) Xε.

Define

    ¯Yε = −(1/n) Σ_{i=1}^n ln Yε^i,  H(D) = −(1/n) Σ_{i=1}^n ln f(xi),  ¯C = −(1/n) Σ_{i=1}^n C^i / f(xi).

Then for small ε > 0,

    ¯Yε = H(D) + ¯C Xε.    (13)
77. For the radii ε ∈ E, fit the linear relation (13) by minimizing

    Rd = (1/m) Σ_{ε∈E} ( ¯Yε − H(D) − ¯C Xε )²

with respect to H(D) and ¯C. The fitted intercept is the Direct Regression Entropy Estimator (DRE) [Hino+, 2015].
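DRE regresses the averaged response ¯Yε directly on Xε = ε², so the entropy is read off as a single intercept. A minimal numpy sketch, under the same assumptions as before (user-chosen radius grid, ordinary least squares, my own function names):

```python
import numpy as np
from math import gamma, pi

def dre_entropy(data, eps_grid):
    """Intercept of the regression Ybar_eps = H(D) + Cbar * eps^2 (Eq. (13))."""
    n, p = data.shape
    c_p = pi ** (p / 2) / gamma(p / 2 + 1)   # volume of the unit p-ball
    # Pairwise distances; the diagonal is set to inf so that each x_i is
    # excluded from its own neighbor count (leave-one-out).
    D = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    Ybar = []
    for e in eps_grid:
        k = (D <= e).sum(axis=1)                      # k_eps^i for every x_i
        Yi = k / ((n - 1) * c_p * e ** p)             # Y_eps^i
        Ybar.append(-np.mean(np.log(Yi)))             # Ybar_eps
    X = eps_grid ** 2
    A = np.column_stack([np.ones_like(X), X])
    coef, *_ = np.linalg.lstsq(A, np.asarray(Ybar), rcond=None)
    return coef[0]                                    # intercept = H_hat(D)
```

Unlike SRE, no per-point density fit is needed: one regression over the m radii yields the estimate.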
82. The number of samples within radius ε of z admits the expansion

    kε ≃ cp n f(z) ε^p + cp n (p / (4(p/2 + 1))) tr∇²f(z) ε^{p+2}.

Set X = (ε^p, ε^{p+2}) and Y = kε, and fit the linear model Y = β⊤X. Since kε is a count, it is naturally modeled as Poisson-distributed.
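The two-term expansion above can be checked numerically. A small sketch under assumed conditions (standard normal in one dimension, z = 0, so f(0) = 1/√(2π) and tr∇²f(0) = f″(0) = −f(0)):

```python
import numpy as np
from math import gamma, pi, sqrt

rng = np.random.default_rng(1)
p, n = 1, 20000
data = rng.standard_normal((n, p))
c_p = pi ** (p / 2) / gamma(p / 2 + 1)   # volume of the unit p-ball (= 2 for p = 1)
f0 = 1 / sqrt(2 * pi)                    # f(0) for the standard normal
tr_hess = -f0                            # tr grad^2 f(0) = f''(0) = -f(0)

rel_errors = []
for eps in (0.1, 0.3, 0.5):
    # empirical count of samples within radius eps of the origin
    k_emp = np.sum(np.linalg.norm(data, axis=1) <= eps)
    # two-term expansion: c_p n f eps^p + c_p n (p/(4(p/2+1))) tr_hess eps^(p+2)
    k_th = c_p * n * f0 * eps ** p \
         + c_p * n * (p / (4 * (p / 2 + 1))) * tr_hess * eps ** (p + 2)
    rel_errors.append(abs(k_emp - k_th) / k_th)
```

With n = 20000 samples, the empirical counts track the expansion to within a few percent on this radius range.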
83. Maximize the Poisson likelihood

    L(β) = Π_{i=1}^m [ e^{−Xi⊤β} (Xi⊤β)^{Yi} / Yi! ].

The coefficient of ε^p is β1, and with its estimate ˆβ1 the density at z is ˆβ1/(cp n). Combining this with the leave-one-out scheme used for SRE gives the Entropy Estimator with Poisson-noise structure and Identity-link regression (EPI) [Hino+, under review].
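The identity-link Poisson fit can be sketched with iteratively reweighted least squares: the score equation X⊤ diag(1/μ)(Y − μ) = 0 is solved by repeatedly refitting a weighted least-squares problem with weights 1/μ. This is my own minimal implementation (function name, radius grid, and iteration count are assumptions, not from the slides):

```python
import numpy as np
from math import gamma, pi

def epi_density(z, data, eps_grid, n_iter=20):
    """Identity-link Poisson regression of k_eps on (eps^p, eps^{p+2});
    the eps^p coefficient divided by c_p * n estimates f(z)."""
    n, p = data.shape
    c_p = pi ** (p / 2) / gamma(p / 2 + 1)
    dists = np.linalg.norm(data - z, axis=1)
    Y = np.array([(dists <= e).sum() for e in eps_grid], dtype=float)  # counts k_eps
    X = np.column_stack([eps_grid ** p, eps_grid ** (p + 2)])
    # initialize from ordinary least squares, then IRLS for the Poisson model
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    for _ in range(n_iter):
        mu = np.clip(X @ beta, 1e-9, None)   # fitted means, kept positive
        XtW = X.T * (1.0 / mu)               # Poisson weights for identity link
        beta = np.linalg.solve(XtW @ X, XtW @ Y)
    return beta[0] / (c_p * n)               # beta_1 / (c_p n) = f_hat(z)
```

Feeding the resulting densities into the same leave-one-out average as SRE yields the EPI entropy estimate.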
86. Univariate Case
15 distributions
[Figure: density curves for Normal, Skewed, Strongly Skewed, Kurtotic, Bimodal, and Skewed Bimodal (x vs. density).]
87. Univariate Case
15 distributions
[Figure: density curves for Trimodal, 10 Claw, Standard Power Exponential, Standard Logistic, Standard Classical Laplace, and t(df=5) (x vs. density).]
88. Univariate Case
15 distributions
[Figure: density curves for Mixed t, Standard Exponential, and Cauchy (x vs. density).]
89. [Figure: results plotted as points over the density curves of Normal, Skewed, Strongly Skewed, Kurtotic, and Bimodal.]
90. [Figure: results plotted as points over the density curves of Skewed Bimodal, Trimodal, 10 Claw, Standard Power Exponential, and Standard Logistic.]
91. [Figure: results plotted as points over the density curves of Standard Classical Laplace, t(df=5), Mixed t, Standard Exponential, and Cauchy.]
95. References I
[Faivishevsky&Goldberger, 2010] Faivishevsky, L. and Goldberger, J. (2010). A Nonparametric Information Theoretic Clustering Algorithm. In ICML 2010.
[Hino+, 2015] Hino, H., Koshijima, K., and Murata, N. (2015). Non-parametric entropy estimators based on simple linear regression. Computational Statistics & Data Analysis, 89:72–84.
[Hino&Murata, 2010] Hino, H. and Murata, N. (2010). A conditional entropy minimization criterion for dimensionality reduction and multiple kernel learning. Neural Computation, 22(11):2887–2923.
[Hyvärinen&Oja, 2000] Hyvärinen, A. and Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411–430.
[Koshijima+, 2015] Koshijima, K., Hino, H., and Murata, N. (2015). Change-point detection in a sequence of bags-of-data. IEEE Transactions on Knowledge and Data Engineering, 27(10):2632–2644.
96. References II
[Murata+, 2013] Murata, N., Koshijima, K., and Hino, H. (2013). Distance-based change-point detection with entropy estimation. In Proceedings of the Sixth Workshop on Information Theoretic Methods in Science and Engineering, pages 22–25.