Introduction to Machine Learning
1. Introduction to Machine Learning
Bernhard Schölkopf
Empirical Inference Department
Max Planck Institute for Intelligent Systems
Tübingen, Germany
http://www.tuebingen.mpg.de/bs
2. Empirical Inference
• Drawing conclusions from empirical data (observations, measurements)
• Example 1: scientific inference
[Figure: observed data points in the (x, y) plane with a linear fit y = a·x and a nonlinear kernel fit y = Σ_i a_i k(x, x_i) + b]
Leibniz, Weyl, Chaitin
3. Empirical Inference
• Drawing conclusions from empirical data (observations, measurements)
• Example 1: scientific inference
“If your experiment needs statistics [inference],
you ought to have done a better experiment.” (Rutherford)
4. Empirical Inference, II
• Example 2: perception
“The brain is nothing but a statistical decision organ”
(H. Barlow)
5. Hard Inference Problems
Sonnenburg, Rätsch, Schäfer, Schölkopf, 2006, Journal of Machine Learning Research
Task: classify human DNA sequence locations into {acceptor splice site, decoy} using 15 million sequences of length 141 and multiple-kernel Support Vector Machines.
PRC = Precision-Recall Curve; precision is the fraction of correct positive predictions among all positively predicted cases.
• High dimensionality – consider many factors simultaneously to find the regularity
• Complex regularities – nonlinear, nonstationary, etc.
• Little prior knowledge – e.g., no mechanistic models for the data
• Need large data sets – processing requires computers and automatic inference methods
6. Hard Inference Problems, II
• We can solve scientific inference problems that humans can’t solve
• Even if it’s just because of data set size / dimensionality, this is a
quantum leap
7. Generalization (thanks to O. Bousquet)
• observe 1, 2, 4, 7,..
• What’s next?
(successive differences: +1, +2, +3)
• 1, 2, 4, 7, 11, 16, …: a_{n+1} = a_n + n (“lazy caterer’s sequence”)
• 1, 2, 4, 7, 12, 20, …: a_{n+2} = a_{n+1} + a_n + 1
• 1, 2, 4, 7, 13, 24, …: “Tribonacci” sequence
• 1, 2, 4, 7, 14, 28: set of divisors of 28
• 1, 2, 4, 7, 1, 1, 5, …: decimal expansions of π = 3.14159… and e = 2.718… interleaved
• The On-Line Encyclopedia of Integer Sequences: >600 hits…
8. Generalization, II
• Question: which continuation is correct (“generalizes”)?
• Answer: there’s no way to tell (“induction problem”)
• Question of statistical learning theory: how to come up
with a law that is (probably) correct (“demarcation problem”)
(more accurately: a law that is probably as correct on the test data as it is on the training data)
9. 2-class classification
Learn f based on m observations (x1, y1), . . . , (xm, ym) generated from some distribution P(x, y).
Goal: minimize expected error (“risk”).
V. Vapnik
Problem: P is unknown.
Induction principle: minimize training error (“empirical risk”) over some class of functions. Q: is this “consistent”?
10. The law of large numbers
For all f and all ε > 0, Remp[f] → R[f] in probability as m → ∞.
Does this imply “consistency” of empirical risk minimization (optimality in the limit)?
No – need a uniform law of large numbers:
for all ε > 0, lim_{m→∞} P{ sup_{f∈F} (R[f] − Remp[f]) > ε } = 0.
13. Support Vector Machines
[Figure: two classes (class 1, class 2) mapped into a feature space where k(x, x′) = ⟨Φ(x), Φ(x′)⟩; a separating hyperplane with margin is found]
• sparse expansion of solution in terms of SVs (Boser, Guyon, Vapnik 1992):
  representer theorem (Kimeldorf & Wahba 1971, Schölkopf et al. 2000)
• unique solution found by convex QP
15. Applications in Computational Geometry / Graphics
Steinke, Walder, Blanz et al., Eurographics ’05, ’06, ’08, ICML ’05, ’08, NIPS ’07
17. Kernel Quiz
18. Kernel Methods
Bernhard Schölkopf
Max Planck Institute for Intelligent Systems
19. Statistical Learning Theory
1. started by Vapnik and Chervonenkis in the Sixties
2. model: we observe data generated by an unknown stochastic
regularity
3. learning = extraction of the regularity from the data
4. the analysis of the learning problem leads to notions of capacity
of the function classes that a learning machine can implement.
5. support vector machines use a particular type of function class:
classifiers with large “margins” in a feature space induced by a
kernel.
[47, 48]
20. Example: Regression Estimation
[Figure: regression data in the (x, y) plane with a fitted function]
• Data: input-output pairs (xi, yi) ∈ R × R
• Regularity: (x1, y1), . . . (xm, ym) drawn from P(x, y)
• Learning: choose a function f : R → R such that the error,
averaged over P, is minimized.
• Problem: P is unknown, so the average cannot be computed
— need an “induction principle”
21. Pattern Recognition
Learn f : X → {±1} from examples
(x1, y1), . . . , (xm, ym) ∈ X×{±1}, generated i.i.d. from P(x, y),
such that the expected misclassification error on a test set, also
drawn from P(x, y),
R[f] = ∫ (1/2) |f(x) − y| dP(x, y),
is minimal (Risk Minimization (RM)).
Problem: P is unknown. −→ need an induction principle.
Empirical risk minimization (ERM): replace the average over P(x, y) by an average over the training sample, i.e., minimize the training error
Remp[f] = (1/m) Σ_{i=1}^m (1/2) |f(xi) − yi|
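As a concrete illustration of the two quantities above (an editorial sketch, not part of the original slides), the training error Remp[f] of a fixed classifier f can be computed directly; the threshold classifier and the synthetic distribution below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data: labels y uniform on {-1, +1}, inputs x ~ N(y, 1)
m = 200
y = rng.choice([-1, 1], size=m)
x = y + rng.normal(size=m)

def f(x):
    # a fixed classifier: sign thresholding at 0
    return np.where(x >= 0, 1, -1)

# empirical risk Remp[f] = (1/m) sum_i (1/2)|f(x_i) - y_i|, i.e. the 0/1 training error
R_emp = np.mean(0.5 * np.abs(f(x) - y))
print("training error:", R_emp)
```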
22. Convergence of Means to Expectations
Law of large numbers:
Remp[f ] → R[f ]
as m → ∞.
Does this imply that empirical risk minimization will give us the
optimal result in the limit of infinite sample size (“consistency”
of empirical risk minimization)?
No.
Need a uniform version of the law of large numbers. Uniform over
all functions that the learning machine can implement.
23. Consistency and Uniform Convergence
[Figure: risk R[f] and empirical risk Remp[f] plotted over the function class, with the minimizers f_opt of R and f_m of Remp]
24. The Importance of the Set of Functions
What about allowing all functions from X to {±1}?
Training set (x1, y1), . . . , (xm, ym) ∈ X × {±1},
test patterns x̄1, . . . , x̄m ∈ X,
such that {x̄1, . . . , x̄m} ∩ {x1, . . . , xm} = {}.
For any f there exists an f∗ s.t.:
1. f∗(xi) = f(xi) for all i,
2. f∗(x̄j) = −f(x̄j) for all j.
Based on the training set alone, there is no means of choosing which one is better. On the test set, however, they give opposite results. There is ’no free lunch’ [24, 56].
−→ a restriction must be placed on the functions that we allow
25. Restricting the Class of Functions
Two views:
1. Statistical Learning (VC) Theory: take into account the ca-
pacity of the class of functions that the learning machine can
implement
2. The Bayesian Way: place Prior distributions P(f ) over the
class of functions
26. Detailed Analysis
• loss ξi := (1/2) |f(xi) − yi| ∈ {0, 1}
• the ξi are independent Bernoulli trials
• empirical mean (1/m) Σ_{i=1}^m ξi (by definition equals Remp[f])
• expected value E[ξ] (equals R[f])
27. Chernoff ’s Bound
P{ |(1/m) Σ_{i=1}^m ξi − E[ξ]| ≥ ε } ≤ 2 exp(−2mε²)
• here, P refers to the probability of getting a sample ξ1, . . . , ξm with the property |(1/m) Σ_{i=1}^m ξi − E[ξ]| ≥ ε (P is a product measure)
Useful corollary: given a 2m-sample of Bernoulli trials, we have
P{ |(1/m) Σ_{i=1}^m ξi − (1/m) Σ_{i=m+1}^{2m} ξi| ≥ ε } ≤ 4 exp(−mε²/2).
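A quick numerical sanity check of the first bound (my own illustration, not from the slides): simulate many m-samples of Bernoulli trials and compare the observed frequency of large deviations with 2 exp(−2mε²).

```python
import numpy as np

rng = np.random.default_rng(1)
m, p, eps, trials = 100, 0.3, 0.1, 100_000

# empirical means of `trials` independent m-samples of Bernoulli(p) variables
means = rng.binomial(1, p, size=(trials, m)).mean(axis=1)

observed = np.mean(np.abs(means - p) >= eps)
chernoff = 2 * np.exp(-2 * m * eps**2)
print(f"observed P(|mean - p| >= eps) = {observed:.4f}  <=  bound {chernoff:.4f}")
```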
28. Chernoff ’s Bound, II
Translate this back into machine learning terminology: the probability of obtaining an m-sample where the training error and test error differ by more than ε > 0 is bounded by
P{ |Remp[f] − R[f]| ≥ ε } ≤ 2 exp(−2mε²).
• refers to one fixed f
• not allowed to look at the data before choosing f, hence not suitable as a bound on the test error of a learning algorithm using empirical risk minimization
29. Uniform Convergence (Vapnik & Chervonenkis)
Necessary and sufficient conditions for nontrivial consistency of empirical risk minimization (ERM):
one-sided convergence, uniformly over all functions that can be implemented by the learning machine,
lim_{m→∞} P{ sup_{f∈F} (R[f] − Remp[f]) > ε } = 0
for all ε > 0.
• note that this takes into account the whole set of functions that can be implemented by the learning machine
• this is hard to check for a learning machine
Are there properties of learning machines (≡ sets of functions) which ensure uniform convergence of risk?
30. How to Prove a VC Bound
Take a closer look at P{ sup_{f∈F} (R[f] − Remp[f]) > ε }.
Plan:
• if the function class F contains only one function, then Chernoff’s bound suffices:
  P{ sup_{f∈F} (R[f] − Remp[f]) > ε } ≤ 2 exp(−2mε²).
• if there are finitely many functions, we use the ’union bound’
• even if there are infinitely many, then on any finite sample there are effectively only finitely many (use symmetrization and capacity concepts)
31. The Case of Two Functions
Suppose F = {f1, f2}. Rewrite
P{ sup_{f∈F} (R[f] − Remp[f]) > ε } = P(C_ε^1 ∪ C_ε^2),
where
C_ε^i := {(x1, y1), . . . , (xm, ym) | (R[fi] − Remp[fi]) > ε}
denotes the event that the risks of fi differ by more than ε.
The RHS equals
P(C_ε^1 ∪ C_ε^2) = P(C_ε^1) + P(C_ε^2) − P(C_ε^1 ∩ C_ε^2) ≤ P(C_ε^1) + P(C_ε^2).
Hence by Chernoff’s bound
P{ sup_{f∈F} (R[f] − Remp[f]) > ε } ≤ P(C_ε^1) + P(C_ε^2) ≤ 2 · 2 exp(−2mε²).
32. The Union Bound
Similarly, if F = {f1, . . . , fn}, we have
P{ sup_{f∈F} (R[f] − Remp[f]) > ε } = P(C_ε^1 ∪ · · · ∪ C_ε^n),
and
P(C_ε^1 ∪ · · · ∪ C_ε^n) ≤ Σ_{i=1}^n P(C_ε^i).
Use Chernoff for each summand, to get an extra factor n in the bound.
Note: this becomes an equality if and only if all the events C_ε^i involved are disjoint.
33. Infinite Function Classes
• Note: empirical risk only refers to m points. On these points, the functions of F can take at most 2^m values
• for Remp, the function class thus “looks” finite
• how about R?
• need to use a trick
34. Symmetrization
Lemma 1 (Vapnik & Chervonenkis (e.g., [46, 12]))
For mε² ≥ 2 we have
P{ sup_{f∈F} (R[f] − Remp[f]) > ε } ≤ 2 P{ sup_{f∈F} (Remp[f] − R′emp[f]) > ε/2 }.
Here, the first P refers to the distribution of iid samples of size m, while the second one refers to iid samples of size 2m. In the latter case, Remp measures the loss on the first half of the sample, and R′emp on the second half.
35. Shattering Coefficient
• Hence, we only need to consider the maximum size of F on 2m
points. Call it N(F, 2m).
• N(F, 2m) = max. number of different outputs (y1, . . . , y2m)
that the function class can generate on 2m points — in other
words, the max. number of different ways the function class can
separate 2m points into two classes.
• N(F, 2m) ≤ 2^{2m}
• if N(F, 2m) = 2^{2m}, then the function class is said to shatter 2m points.
36. Putting Everything Together
We now use (1) symmetrization, (2) the shattering coefficient, and
(3) the union bound, to get
P{ sup_{f∈F} (R[f] − Remp[f]) > ε }
≤ 2 P{ sup_{f∈F} (Remp[f] − R′emp[f]) > ε/2 }
= 2 P{ (Remp[f1] − R′emp[f1]) > ε/2 ∨ . . . ∨ (Remp[f_{N(F,2m)}] − R′emp[f_{N(F,2m)}]) > ε/2 }
≤ Σ_{n=1}^{N(F,2m)} 2 P{ (Remp[fn] − R′emp[fn]) > ε/2 }.
37. ctd.
Use Chernoff’s bound for each term:*
P{ |(1/m) Σ_{i=1}^m ξi − (1/m) Σ_{i=m+1}^{2m} ξi| ≥ ε } ≤ 2 exp(−mε²/2).
This yields
P{ sup_{f∈F} (R[f] − Remp[f]) > ε } ≤ 4 N(F, 2m) exp(−mε²/8).
• provided that N(F, 2m) does not grow exponentially in m, this is nontrivial
• such bounds are called VC type inequalities
• two types of randomness: (1) the P refers to the drawing of the training examples, and (2) R[f] is an expectation over the drawing of test examples.
* Note that the fi depend on the 2m-sample. A rigorous treatment would need to use a second randomization over permutations of the 2m-sample, see [36].
38. Confidence Intervals
Rewrite the bound: specify the probability with which we want R to be close to Remp, and solve for ε:
With a probability of at least 1 − δ,
R[f] ≤ Remp[f] + sqrt( (8/m) ( ln N(F, 2m) + ln(4/δ) ) ).
This bound holds independent of f; in particular, it holds for the function fm minimizing the empirical risk.
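To get a feeling for the size of the capacity term, here is a small sketch (mine, with made-up numbers) that evaluates the confidence term above, plugging in the growth-function bound GF(2m) ≤ h(ln(2m/h) + 1) from the later slides in place of ln N(F, 2m). For small m the term exceeds 1 and the bound is vacuous; it shrinks roughly like sqrt(h ln m / m).

```python
import numpy as np

def confidence_term(m, h, delta):
    # ln N(F, 2m) bounded via the growth function: G_F(2m) <= h (ln(2m/h) + 1) for 2m > h
    ln_N = h * (np.log(2 * m / h) + 1)
    return np.sqrt(8.0 / m * (ln_N + np.log(4.0 / delta)))

for m in (100, 1_000, 10_000, 100_000):
    print(m, confidence_term(m, h=10, delta=0.05))
```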
39. Discussion
• tighter bounds are available (better constants etc.)
• cannot minimize the bound over f
• other capacity concepts can be used
40. VC Entropy
On an example (x, y), f causes a loss
ξ(x, y, f(x)) = (1/2) |f(x) − y| ∈ {0, 1}.
For a larger sample (x1, y1), . . . , (xm, ym), the different functions f ∈ F lead to a set of loss vectors
ξf = (ξ(x1, y1, f(x1)), . . . , ξ(xm, ym, f(xm))),
whose cardinality we denote by
N(F, (x1, y1), . . . , (xm, ym)).
The VC entropy is defined as
HF(m) = E[ ln N(F, (x1, y1), . . . , (xm, ym)) ],
where the expectation is taken over the random generation of the m-sample (x1, y1), . . . , (xm, ym) from P.
HF(m)/m → 0 ⇐⇒ uniform convergence of risks (hence consistency)
41. Further PR Capacity Concepts
• exchange ’E’ and ’ln’: annealed entropy.
  H^ann_F(m)/m → 0 ⇐⇒ exponentially fast uniform convergence
• take ’max’ instead of ’E’: growth function.
  Note that GF(m) = ln N(F, m).
  GF(m)/m → 0 ⇐⇒ exponential convergence for all underlying distributions P.
  GF(m) = m · ln(2) for all m ⇐⇒ for any m, all loss vectors can be generated, i.e., the m points can be chosen such that by using functions of the learning machine, they can be separated in all 2^m possible ways (shattered).
42. Structure of the Growth Function
Either GF(m) = m · ln(2) for all m ∈ N,
or there exists some maximal m for which the above is possible. Call this number the VC-dimension, and denote it by h. For m > h,
GF(m) ≤ h ( ln(m/h) + 1 ).
Nothing “in between” linear growth and logarithmic growth is possible.
43. VC-Dimension: Example
Half-spaces in R2:
f (x, y) = sgn(a + bx + cy), with parameters a, b, c ∈ R
• Clearly, we can shatter three non-collinear points.
• But we can never shatter four points.
• Hence the VC dimension is h = 3 (in this case, equal to the
number of parameters)
[Figure: all 2³ = 8 labelings of three non-collinear points in R², each separated by a half-space]
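The shattering claim for three non-collinear points can also be checked mechanically. Below is a small sketch (mine, using a plain perceptron rather than anything from the slides) that finds a separating half-space sgn(a + bx + cy) for each of the 2³ labelings.

```python
import itertools
import numpy as np

# three non-collinear points in R^2, augmented with a constant 1 for the offset a
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X = np.hstack([np.ones((3, 1)), pts])          # rows: (1, x, y)

def perceptron(X, y, epochs=1000):
    """Return w = (a, b, c) with sign(X @ w) == y if the labeling is linearly separable."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:             # misclassified (or on the boundary)
                w += yi * xi
                errors += 1
        if errors == 0:
            return w
    return None

for labels in itertools.product([-1, 1], repeat=3):
    y = np.array(labels)
    w = perceptron(X, y)
    assert w is not None and np.all(np.sign(X @ w) == y)
    print(labels, "separated by (a, b, c) =", w)
```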
44. A Typical Bound for Pattern Recognition
For any f ∈ F and m > h, with a probability of at least 1 − δ,
R[f] ≤ Remp[f] + sqrt( ( h (log(2m/h) + 1) − log(δ/4) ) / m )
holds.
• does this mean that we can learn anything?
• The study of the consistency of ERM has thus led to concepts and results which let us formulate another induction principle (structural risk minimization)
45. SRM
[Figure: structural risk minimization. Over a nested structure S_{n−1} ⊂ S_n ⊂ S_{n+1}, the bound on the test error is the sum of the training error and a capacity term; the bound (and the error R(f*)) is minimized at an intermediate capacity h]
46. Finding a Good Function Class
• recall: separating hyperplanes in R² have a VC dimension of 3.
• more generally: separating hyperplanes in R^N have a VC dimension of N + 1.
• hence: separating hyperplanes in high-dimensional feature spaces have extremely large VC dimension, and may not generalize well
• however, margin hyperplanes can still have a small VC dimension
47. Kernels and Feature Spaces
Preprocess the data with
Φ:X → H
x → Φ(x),
where H is a dot product space, and learn the mapping from Φ(x)
to y [6].
• usually, dim(X) ≪ dim(H)
• “Curse of Dimensionality”?
• crucial issue: capacity, not dimensionality
48. Example: All Degree 2 Monomials
Φ : R² → R³
(x1, x2) → (z1, z2, z3) := (x1², √2 x1x2, x2²)
[Figure: the map Φ takes the two-dimensional data (x1, x2) into the three-dimensional space (z1, z2, z3), where the classes become linearly separable]
49. General Product Feature Space
How about patterns x ∈ R^N and product features of order d? Here, dim(H) grows like N^d.
E.g. N = 16 × 16 and d = 5 −→ dimension 10^10
50. The Kernel Trick, N = d = 2
⟨Φ(x), Φ(x′)⟩ = (x1², √2 x1x2, x2²) (x1′², √2 x1′x2′, x2′²)⊤
= ⟨x, x′⟩²
=: k(x, x′)
−→ the dot product in H can be computed in R²
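A direct numerical check of this identity (my own sketch): compare ⟨Φ(x), Φ(x′)⟩ for the explicit degree-2 monomial map of the previous slide with ⟨x, x′⟩².

```python
import numpy as np

def phi(x):
    # explicit degree-2 monomial feature map R^2 -> R^3
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k(x, xp):
    # homogeneous polynomial kernel of degree 2
    return np.dot(x, xp) ** 2

rng = np.random.default_rng(2)
for _ in range(5):
    x, xp = rng.normal(size=2), rng.normal(size=2)
    assert np.allclose(np.dot(phi(x), phi(xp)), k(x, xp))
print("feature-space dot products match the kernel values")
```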
51. The Kernel Trick, II
More generally: x, x′ ∈ R^N, d ∈ N:
⟨x, x′⟩^d = ( Σ_{j=1}^N xj · xj′ )^d
= Σ_{j1,...,jd=1}^N xj1 · · · xjd · xj1′ · · · xjd′ = ⟨Φ(x), Φ(x′)⟩,
where Φ maps into the space spanned by all ordered products of d input directions
52. Mercer’s Theorem
If k is a continuous kernel of a positive definite integral operator on L2(X) (where X is some compact space), i.e.,
∫∫_X k(x, x′) f(x) f(x′) dx dx′ ≥ 0,
it can be expanded as
k(x, x′) = Σ_{i=1}^∞ λi ψi(x) ψi(x′)
using eigenfunctions ψi and eigenvalues λi ≥ 0 [30].
53. The Mercer Feature Map
In that case
Φ(x) := (√λ1 ψ1(x), √λ2 ψ2(x), . . .)⊤
satisfies ⟨Φ(x), Φ(x′)⟩ = k(x, x′).
Proof:
⟨Φ(x), Φ(x′)⟩ = ⟨(√λ1 ψ1(x), √λ2 ψ2(x), . . .)⊤, (√λ1 ψ1(x′), √λ2 ψ2(x′), . . .)⊤⟩
= Σ_{i=1}^∞ λi ψi(x) ψi(x′) = k(x, x′)
54. Positive Definite Kernels
It can be shown that the admissible class of kernels coincides with the one of positive definite (pd) kernels: kernels which are symmetric (i.e., k(x, x′) = k(x′, x)), and for
• any set of training points x1, . . . , xm ∈ X and
• any a1, . . . , am ∈ R
satisfy
Σ_{i,j} ai aj Kij ≥ 0, where Kij := k(xi, xj).
K is called the Gram matrix or kernel matrix.
If for pairwise distinct points Σ_{i,j} ai aj Kij = 0 =⇒ a = 0, call it strictly positive definite.
55. The Kernel Trick — Summary
• any algorithm that only depends on dot products can benefit
from the kernel trick
• this way, we can apply linear methods to vectorial as well as
non-vectorial data
• think of the kernel as a nonlinear similarity measure
• examples of common kernels:
  Polynomial: k(x, x′) = (⟨x, x′⟩ + c)^d
  Gaussian: k(x, x′) = exp(−‖x − x′‖² / (2σ²))
• Kernels are also known as covariance functions [54, 52, 55, 29]
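As a sketch (mine, not from the slides), the two kernels above take a few lines each, and positive definiteness can be checked empirically on a random Gram matrix by inspecting its eigenvalues.

```python
import numpy as np

def polynomial_kernel(X, Y, c=1.0, d=3):
    # k(x, x') = (<x, x'> + c)^d
    return (X @ Y.T + c) ** d

def gaussian_kernel(X, Y, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
for K in (polynomial_kernel(X, X), gaussian_kernel(X, X)):
    eigvals = np.linalg.eigvalsh(K)      # Gram matrices of pd kernels are positive semidefinite
    print("smallest eigenvalue:", eigvals.min())
```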
56. Properties of PD Kernels, 1
Assumption: Φ maps X into a dot product space H; x, x′ ∈ X
Kernels from Feature Maps.
k(x, x′) := ⟨Φ(x), Φ(x′)⟩ is a pd kernel on X × X.
Kernels from Feature Maps, II.
K(A, B) := Σ_{x∈A, x′∈B} k(x, x′), where A, B are finite subsets of X, is also a pd kernel
(Hint: use the feature map Φ̃(A) := Σ_{x∈A} Φ(x))
57. Properties of PD Kernels, 2 [36, 39]
Assumption: k, k1, k2, . . . are pd; x, x′ ∈ X
k(x, x) ≥ 0 for all x (Positivity on the Diagonal)
k(x, x′)² ≤ k(x, x) k(x′, x′) (Cauchy-Schwarz Inequality)
(Hint: compute the determinant of the Gram matrix)
k(x, x) = 0 for all x =⇒ k(x, x′) = 0 for all x, x′ (Vanishing Diagonals)
The following kernels are pd:
• αk, provided α ≥ 0
• k1 + k2
• k(x, x′) := lim_{n→∞} kn(x, x′), provided it exists
• k1 · k2
• tensor products, direct sums, convolutions [22]
58. The Feature Space for PD Kernels [4, 1, 35]
• define a feature map
Φ : X → R^X
x → k(., x).
[Figure: for the Gaussian kernel, each point x is mapped to a bump Φ(x) = k(., x) centered at x]
Next steps:
• turn Φ(X) into a linear space
• endow it with a dot product satisfying
  ⟨Φ(x), Φ(x′)⟩ = k(x, x′), i.e., ⟨k(., x), k(., x′)⟩ = k(x, x′)
• complete the space to get a reproducing kernel Hilbert space
59. Turn it Into a Linear Space
Form linear combinations
f(.) = Σ_{i=1}^m αi k(., xi),
g(.) = Σ_{j=1}^{m′} βj k(., xj′)
(m, m′ ∈ N, αi, βj ∈ R, xi, xj′ ∈ X).
60. Endow it With a Dot Product
⟨f, g⟩ := Σ_{i=1}^m Σ_{j=1}^{m′} αi βj k(xi, xj′)
= Σ_{i=1}^m αi g(xi) = Σ_{j=1}^{m′} βj f(xj′)
• This is well-defined, symmetric, and bilinear (more later).
• So far, it also works for non-pd kernels
61. The Reproducing Kernel Property
Two special cases:
• Assume
  f(.) = k(., x).
  In this case, we have
  ⟨k(., x), g⟩ = g(x).
• If moreover
  g(.) = k(., x′),
  we have
  ⟨k(., x), k(., x′)⟩ = k(x, x′).
k is called a reproducing kernel
(up to here, we have not used positive definiteness)
62. Endow it With a Dot Product, II
• It can be shown that ⟨., .⟩ is a p.d. kernel on the set of functions {f(.) = Σ_{i=1}^m αi k(., xi) | αi ∈ R, xi ∈ X}:
  Σ_{ij} γi γj ⟨fi, fj⟩ = ⟨Σ_i γi fi, Σ_j γj fj⟩ =: ⟨f, f⟩
  = ⟨Σ_i αi k(., xi), Σ_i αi k(., xi)⟩ = Σ_{ij} αi αj k(xi, xj) ≥ 0
• furthermore, it is strictly positive definite:
  f(x)² = ⟨f, k(., x)⟩² ≤ ⟨f, f⟩ ⟨k(., x), k(., x)⟩,
  hence ⟨f, f⟩ = 0 implies f = 0.
• Complete the space in the corresponding norm to get a Hilbert space Hk.
63. The Empirical Kernel Map
Recall the feature map
Φ : X → R^X
x → k(., x).
• each point is represented by its similarity to all other points
• how about representing it by its similarity to a sample of points?
Consider
Φm : X → R^m
x → k(., x)|_{(x1,...,xm)} = (k(x1, x), . . . , k(xm, x))⊤
64. ctd.
• Φm(x1), . . . , Φm(xm) contain all necessary information about Φ(x1), . . . , Φ(xm)
• the Gram matrix Gij := ⟨Φm(xi), Φm(xj)⟩ satisfies G = K², where Kij = k(xi, xj)
• modify Φm to
  Φm^w : X → R^m
  x → K^{−1/2} (k(x1, x), . . . , k(xm, x))⊤
• this “whitened” map (“kernel PCA map”) satisfies
  ⟨Φm^w(xi), Φm^w(xj)⟩ = k(xi, xj)
  for all i, j = 1, . . . , m.
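A sketch of the whitened map (my own code, assuming a well-conditioned Gram matrix): build K, form K^{−1/2} via its eigendecomposition, and verify that dot products of the mapped training points reproduce the kernel.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=0.5):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 2))
K = gaussian_kernel(X, X)

# K^{-1/2} via the eigendecomposition K = U D U^T (assumes K is strictly pd / well-conditioned)
lam, U = np.linalg.eigh(K)
K_inv_sqrt = U @ np.diag(lam ** -0.5) @ U.T

# row i of Phi_w is the whitened map of x_i: K^{-1/2} (k(x_1, x_i), ..., k(x_m, x_i))^T
Phi_w = (K_inv_sqrt @ K).T
print(np.allclose(Phi_w @ Phi_w.T, K))   # <Phi_w(x_i), Phi_w(x_j)> = k(x_i, x_j)
```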
65. An Example of a Kernel Algorithm
Idea: classify points x := Φ(x) in feature space according to which
of the two class means is closer.
c+ := (1/m+) Σ_{i: yi=+1} Φ(xi),   c− := (1/m−) Σ_{i: yi=−1} Φ(xi)
[Figure: the two class means c+ and c− in feature space, the vector w := c+ − c−, the midpoint c, and a test point x]
Compute the sign of the dot product between w := c+ − c− and x − c.
66. An Example of a Kernel Algorithm, ctd. [36]
f(x) = sgn( (1/m+) Σ_{i: yi=+1} ⟨Φ(x), Φ(xi)⟩ − (1/m−) Σ_{i: yi=−1} ⟨Φ(x), Φ(xi)⟩ + b )
     = sgn( (1/m+) Σ_{i: yi=+1} k(x, xi) − (1/m−) Σ_{i: yi=−1} k(x, xi) + b ),
where
b = (1/2) ( (1/m−²) Σ_{(i,j): yi=yj=−1} k(xi, xj) − (1/m+²) Σ_{(i,j): yi=yj=+1} k(xi, xj) ).
• provides a geometric interpretation of Parzen windows
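Here is a compact sketch (mine) of this class-means classifier; the Gaussian kernel and the toy data are placeholders chosen for the example.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def fit_predict(Xtr, ytr, Xte, kernel=gaussian_kernel):
    pos, neg = Xtr[ytr == +1], Xtr[ytr == -1]
    # offset b from the slide: (1/2) (mean of k over negative pairs - mean over positive pairs)
    b = 0.5 * (kernel(neg, neg).mean() - kernel(pos, pos).mean())
    # decision value: mean similarity to the positive class minus mean similarity to the negative class
    f = kernel(Xte, pos).mean(axis=1) - kernel(Xte, neg).mean(axis=1) + b
    return np.sign(f)

rng = np.random.default_rng(5)
Xp = rng.normal(loc=+1.0, size=(50, 2))
Xn = rng.normal(loc=-1.0, size=(50, 2))
Xtr = np.vstack([Xp, Xn])
ytr = np.array([1] * 50 + [-1] * 50)
print("training accuracy:", np.mean(fit_predict(Xtr, ytr, Xtr) == ytr))
```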
67. An Example of a Kernel Algorithm, ctd.
• Demo
• Exercise: derive the Parzen windows classifier by computing the
distance criterion directly
• SVMs (ppt)
68. An example of a kernel algorithm, revisited
[Figure: positive points (+) and negative points (o), their RKHS means µ(X) and µ(Y), and the vector w between the means]
X compact subset of a separable metric space, m, n ∈ N.
Positive class X := {x1, . . . , xm} ⊂ X
Negative class Y := {y1, . . . , yn} ⊂ X
RKHS means µ(X) = (1/m) Σ_{i=1}^m k(xi, ·), µ(Y) = (1/n) Σ_{i=1}^n k(yi, ·).
Get a problem if µ(X) = µ(Y)!
69. When do the means coincide?
k(x, x′) = ⟨x, x′⟩: the means coincide
k(x, x′) = (⟨x, x′⟩ + 1)^d: all empirical moments up to order d coincide
k strictly pd: X = Y.
The mean “remembers” each point that contributed to it.
70. Proposition 2 Assume X, Y are defined as above, k is strictly pd, and for all i ≠ j, xi ≠ xj and yi ≠ yj.
If for some αi, βj ∈ R − {0}, we have
Σ_{i=1}^m αi k(xi, .) = Σ_{j=1}^n βj k(yj, .),     (1)
then X = Y.
71. Proof (by contradiction)
W.l.o.g., assume that x1 ∉ Y. Subtract Σ_{j=1}^n βj k(yj, .) from (1), and make it a sum over pairwise distinct points, to get
0 = Σ_i γi k(zi, .),
where z1 = x1, γ1 = α1 ≠ 0, and
z2, · · · ∈ X ∪ Y − {x1}, γ2, · · · ∈ R.
Take the RKHS dot product with Σ_j γj k(zj, .) to get
0 = Σ_{ij} γi γj k(zi, zj),
with γ ≠ 0, hence k cannot be strictly pd.
72. The mean map
µ : X = (x1, . . . , xm) → (1/m) Σ_{i=1}^m k(xi, ·)
satisfies
⟨µ(X), f⟩ = (1/m) Σ_{i=1}^m ⟨k(xi, ·), f⟩ = (1/m) Σ_{i=1}^m f(xi)
and
‖µ(X) − µ(Y)‖ = sup_{‖f‖≤1} |⟨µ(X) − µ(Y), f⟩| = sup_{‖f‖≤1} | (1/m) Σ_{i=1}^m f(xi) − (1/n) Σ_{i=1}^n f(yi) |.
Note: large distance = can find a function distinguishing the samples
73. Witness function
f = (µ(X) − µ(Y)) / ‖µ(X) − µ(Y)‖, thus f(x) ∝ ⟨µ(X) − µ(Y), k(x, .)⟩:
[Figure: “Witness f for Gauss and Laplace data” — the two probability densities and the witness function f plotted against X]
This function is in the RKHS of a Gaussian kernel, but not in the RKHS of the linear kernel.
74. The mean map for measures
p, q Borel probability measures,
Ex,x′∼p[k(x, x′)], Ex,x′∼q[k(x, x′)] < ∞ (‖k(x, .)‖ ≤ M < ∞ is sufficient)
Define
µ : p → Ex∼p[k(x, ·)].
Note
⟨µ(p), f⟩ = Ex∼p[f(x)]
and
‖µ(p) − µ(q)‖ = sup_{‖f‖≤1} |Ex∼p[f(x)] − Ex∼q[f(x)]|.
Recall that in the finite sample case, for strictly p.d. kernels, µ was injective — how about now?
[43, 17]
75. Theorem 3 [15, 13]
p = q ⇐⇒ sup_{f∈C(X)} |Ex∼p(f(x)) − Ex∼q(f(x))| = 0,
where C(X) is the space of continuous bounded functions on X.
Combine this with
‖µ(p) − µ(q)‖ = sup_{‖f‖≤1} |Ex∼p[f(x)] − Ex∼q[f(x)]|.
Replace C(X) by the unit ball in an RKHS that is dense in C(X) — universal kernel [45], e.g., Gaussian.
Theorem 4 [19] If k is universal, then
p = q ⇐⇒ ‖µ(p) − µ(q)‖ = 0.
76. • µ is invertible on its image
M = {µ(p) | p is a probability distribution}
(the “marginal polytope”, [53])
• generalization of the moment generating function of a RV x with distribution p:
Mp(.) = Ex∼p[ e^{⟨x, ·⟩} ].
This provides us with a convenient metric on probability distributions, which can be used to check whether two distributions are different — provided that µ is invertible.
77. Fourier Criterion
Assume we have densities, the kernel is shift invariant (k(x, y) = k(x − y)), and all Fourier transforms below exist.
Note that µ is invertible iff
∫ k(x − y) p(y) dy = ∫ k(x − y) q(y) dy =⇒ p = q,
i.e.,
k̂ · (p̂ − q̂) = 0 =⇒ p = q
(Sriperumbudur et al., 2008)
E.g., µ is invertible if k̂ has full support. Restricting the class of distributions, weaker conditions suffice (e.g., if k̂ has non-empty interior, µ is invertible for all distributions with compact support).
78. Fourier Optics
Application: p source of incoherent light, I indicator of a finite aperture. In Fraunhofer diffraction, the intensity image is ∝ p ∗ |Î|².
Set k = |Î|²; then this equals µ(p).
This k̂ does not have full support, thus the imaging process is not invertible for the class of all light sources (Abbe), but it is if we restrict the class (e.g., to compact support).
79. Application 1: Two-sample problem [19]
X, Y i.i.d. m-samples from p, q, respectively.
‖µ(p) − µ(q)‖² = Ex,x′∼p[k(x, x′)] − 2 Ex∼p,y∼q[k(x, y)] + Ey,y′∼q[k(y, y′)]
= Ex,x′∼p, y,y′∼q[h((x, y), (x′, y′))]
with
h((x, y), (x′, y′)) := k(x, x′) − k(x, y′) − k(y, x′) + k(y, y′).
Define
D(p, q)² := Ex,x′∼p, y,y′∼q[h((x, y), (x′, y′))]
D̂(X, Y)² := (1/(m(m−1))) Σ_{i≠j} h((xi, yi), (xj, yj)).
D̂(X, Y)² is an unbiased estimator of D(p, q)².
It’s easy to compute, and works on structured data.
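A direct NumPy sketch (mine) of the unbiased estimator D̂(X, Y)² above, pairing the i-th points of the two samples as z_i = (x_i, y_i) as in the slide; the Gaussian kernel and the toy distributions are placeholders.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_unbiased(X, Y, kernel=gaussian_kernel):
    """Unbiased estimate of D(p, q)^2 = ||mu(p) - mu(q)||^2 from equal-size samples X, Y."""
    m = len(X)
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    # sum of h((x_i, y_i), (x_j, y_j)) over i != j; diagonal terms are excluded
    h_sum = (Kxx.sum() - np.trace(Kxx)) + (Kyy.sum() - np.trace(Kyy)) \
            - (Kxy.sum() - np.trace(Kxy)) - (Kxy.T.sum() - np.trace(Kxy))
    return h_sum / (m * (m - 1))

rng = np.random.default_rng(6)
X = rng.normal(0.0, 1.0, size=(500, 1))
print("same distribution:     ", mmd2_unbiased(X, rng.normal(0.0, 1.0, size=(500, 1))))
print("different distribution:", mmd2_unbiased(X, rng.normal(1.0, 1.0, size=(500, 1))))
```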
80. Theorem 5 Assume k is bounded.
D̂(X, Y)² converges to D(p, q)² in probability with rate O(m^{−1/2}).
This could be used as a basis for a test, but uniform convergence bounds are often loose.
Theorem 6 We assume E[h²] < ∞. When p ≠ q, then √m (D̂(X, Y)² − D(p, q)²) converges in distribution to a zero mean Gaussian with variance
σu² = 4 ( Ez[(Ez′ h(z, z′))²] − (Ez,z′ h(z, z′))² ).
When p = q, then m (D̂(X, Y)² − D(p, q)²) = m D̂(X, Y)² converges in distribution to
Σ_{l=1}^∞ λl (ql² − 2),     (2)
where ql ∼ N(0, 2) i.i.d., λi are the solutions to the eigenvalue equation
∫_X k̃(x, x′) ψi(x) dp(x) = λi ψi(x′),
and k̃(xi, xj) := k(xi, xj) − Ex k(xi, x) − Ex k(x, xj) + Ex,x′ k(x, x′) is the centred RKHS kernel.
81. Application 2: Dependence Measures
Assume that (x, y) are drawn from pxy , with marginals px, py .
Want to know whether pxy factorizes.
[2, 16]: kernel generalized variance
[20, 21]: kernel constrained covariance, HSIC
Main idea [25, 34]:
x and y independent ⇐⇒ ∀ bounded continuous functions f, g,
we have Cov(f (x), g(y)) = 0.
82. k kernel on X × Y.
µ(pxy) := E(x,y)∼pxy[k((x, y), ·)]
µ(px × py) := Ex∼px, y∼py[k((x, y), ·)].
Use ∆ := ‖µ(pxy) − µ(px × py)‖ as a measure of dependence.
For k((x, y), (x′, y′)) = kx(x, x′) ky(y, y′):
∆² equals the Hilbert-Schmidt norm of the covariance operator between the two RKHSs (HSIC), with empirical estimate m^{−2} tr(H Kx H Ky), where H = I − (1/m) 11⊤ [20, 44].
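A sketch (mine) of the empirical HSIC estimate m⁻² tr(H Kx H Ky), evaluated on toy data where y is a noisy function of x versus an independent y; kernels and data are placeholders.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def hsic(x, y, kernel=gaussian_kernel):
    m = len(x)
    Kx, Ky = kernel(x, x), kernel(y, y)
    H = np.eye(m) - np.ones((m, m)) / m        # centering matrix H = I - (1/m) 11^T
    return np.trace(H @ Kx @ H @ Ky) / m**2

rng = np.random.default_rng(7)
x = rng.normal(size=(300, 1))
y_dep = np.sin(3 * x) + 0.1 * rng.normal(size=(300, 1))   # dependent on x
y_ind = rng.normal(size=(300, 1))                          # independent of x
print("dependent:  ", hsic(x, y_dep))
print("independent:", hsic(x, y_ind))
```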
83. Witness function of the equivalent optimisation problem:
[Figure: “Dependence witness and sample” — the witness function plotted over the (X, Y) plane together with the sample points]
Application: learning causal structures (Sun et al., ICML 2007; Fukumizu et al., NIPS 2007)
84. Application 3: Covariate Shift Correction and Local Learning
training set X = {(x1, y1), . . . , (xm, ym)} drawn from p,
test set X′ = {(x1′, y1′), . . . , (xn′, yn′)} drawn from p′ ≠ p.
Assume py|x = p′y|x.
[40]: reweight training set
85. Minimize
‖ Σ_{i=1}^m βi k(xi, ·) − µ(X′) ‖² + λ Σ_i βi²   subject to βi ≥ 0, Σ_i βi = 1.
Equivalent QP:
minimize_β (1/2) β⊤ (K + λ1) β − β⊤ l
subject to βi ≥ 0 and Σ_i βi = 1,
where Kij := k(xi, xj), li = ⟨k(xi, ·), µ(X′)⟩.
Experiments show that in underspecified situations (e.g., large kernel widths), this helps [23].
X′ = {x′} leads to a local sample weighting scheme.
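A rough sketch (mine) of this reweighting using scipy.optimize.minimize with the simplex constraints; a dedicated QP solver would be the more usual choice, and the Gaussian kernel, regularization value, and data are placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def kmm_weights(Xtr, Xte, lam=1e-3):
    m = len(Xtr)
    K = gaussian_kernel(Xtr, Xtr)                  # K_ij = k(x_i, x_j)
    l = gaussian_kernel(Xtr, Xte).mean(axis=1)     # l_i = <k(x_i, .), mu(X')>
    obj = lambda b: 0.5 * b @ (K + lam * np.eye(m)) @ b - b @ l
    cons = ({'type': 'eq', 'fun': lambda b: b.sum() - 1},)
    res = minimize(obj, np.ones(m) / m, bounds=[(0, None)] * m, constraints=cons)
    return res.x

rng = np.random.default_rng(8)
Xtr = rng.normal(0.0, 1.0, size=(100, 1))
Xte = rng.normal(0.5, 0.5, size=(100, 1))          # covariate shift: test inputs concentrated near 0.5
beta = kmm_weights(Xtr, Xte)
print("weights sum to", beta.sum(), "- larger weights where the test density is higher")
```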
86. The Representer Theorem
Theorem 7 Given: a p.d. kernel k on X × X, a training set (x1, y1), . . . , (xm, ym) ∈ X × R, a strictly monotonically increasing real-valued function Ω on [0, ∞[, and an arbitrary cost function c : (X × R²)^m → R ∪ {∞}.
Any f ∈ Hk minimizing the regularized risk functional
c((x1, y1, f(x1)), . . . , (xm, ym, f(xm))) + Ω(‖f‖)     (3)
admits a representation of the form
f(.) = Σ_{i=1}^m αi k(xi, .).
87. Remarks
• significance: many learning algorithms have solutions that can be expressed as expansions in terms of the training examples
• original form, with mean squared loss
  c((x1, y1, f(x1)), . . . , (xm, ym, f(xm))) = (1/m) Σ_{i=1}^m (yi − f(xi))²
  and Ω(‖f‖) = λ‖f‖² (λ > 0): [27]
• generalization to non-quadratic cost functions: [10]
• present form: [36]
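For the squared-loss case cited above, the representer theorem reduces the problem to a linear system for the expansion coefficients α; here is a small sketch (mine) of that kernel ridge regression solution. The closed form α = (K + λmI)⁻¹ y assumes the (1/m)-normalized squared loss written above; other normalizations rescale λ.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_ridge_fit(X, y, lam=1e-2, kernel=gaussian_kernel):
    m = len(X)
    K = kernel(X, X)
    # minimizer of (1/m) sum_i (y_i - f(x_i))^2 + lam ||f||^2 over f = sum_i alpha_i k(x_i, .)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def kernel_ridge_predict(alpha, Xtr, Xte, kernel=gaussian_kernel):
    return kernel(Xte, Xtr) @ alpha            # f(x) = sum_i alpha_i k(x_i, x)

rng = np.random.default_rng(9)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)
alpha = kernel_ridge_fit(X, y)
Xte = np.linspace(-3, 3, 5)[:, None]
print(kernel_ridge_predict(alpha, X, Xte))     # approximates sin at the 5 test inputs
```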
88. Proof
Decompose f ∈ H into a part in the span of the k(xi, .) and an orthogonal one:
f = Σ_i αi k(xi, .) + f⊥,
where for all j
⟨f⊥, k(xj, .)⟩ = 0.
Application of f to an arbitrary training point xj yields
f(xj) = ⟨f, k(xj, .)⟩
= ⟨Σ_i αi k(xi, .) + f⊥, k(xj, .)⟩
= Σ_i αi ⟨k(xi, .), k(xj, .)⟩,
independent of f⊥.
89. Proof: second part of (3)
Since f⊥ is orthogonal to Σ_i αi k(xi, .), and Ω is strictly monotonic, we get
Ω(‖f‖) = Ω( ‖Σ_i αi k(xi, .) + f⊥‖ )
= Ω( sqrt( ‖Σ_i αi k(xi, .)‖² + ‖f⊥‖² ) )
≥ Ω( ‖Σ_i αi k(xi, .)‖ ),     (4)
with equality occurring if and only if f⊥ = 0.
Hence, any minimizer must have f⊥ = 0. Consequently, any solution takes the form
f = Σ_i αi k(xi, .).
90. Application: Support Vector Classification
Here, yi ∈ {±1}. Use
c((xi, yi, f(xi))_i) = (1/λ) Σ_i max(0, 1 − yi f(xi)),
and the regularizer Ω(‖f‖) = ‖f‖².
λ → 0 leads to the hard margin SVM
91. Further Applications
Bayesian MAP Estimates. Identify (3) with the negative log posterior (cf. Kimeldorf & Wahba, 1970; Poggio & Girosi, 1990), i.e.,
• exp(−c((xi, yi, f(xi))_i)) — likelihood of the data
• exp(−Ω(‖f‖)) — prior over the set of functions; e.g., Ω(‖f‖) = λ‖f‖² — Gaussian process prior [55] with covariance function k
• minimizer of (3) = MAP estimate
Kernel PCA (see below) can be shown to correspond to the case of
c((xi, yi, f(xi))_{i=1,...,m}) = 0 if (1/m) Σ_i ( f(xi) − (1/m) Σ_j f(xj) )² = 1, and ∞ otherwise,
with the regularizer an arbitrary strictly monotonically increasing function.
92. Conclusion
• the kernel corresponds to
– a similarity measure for the data, or
– a (linear) representation of the data, or
– a hypothesis space for learning,
• kernels allow the formulation of a multitude of geometrical algo-
rithms (Parzen windows, 2-sample tests, SVMs, kernel PCA,...)
93. Kernel PCA [37]
[Figure: linear PCA with k(x, y) = ⟨x, y⟩ extracts a linear principal direction of the data in R²; kernel PCA with k(x, y) = ⟨x, y⟩^d extracts nonlinear principal directions, corresponding to linear PCA in the feature space H reached via Φ]
94. Kernel PCA, II
x1, . . . , xm ∈ X, Φ : X → H,   C = (1/m) Σ_{j=1}^m Φ(xj) Φ(xj)⊤
Eigenvalue problem
λV = CV = (1/m) Σ_{j=1}^m ⟨Φ(xj), V⟩ Φ(xj).
For λ ≠ 0, V ∈ span{Φ(x1), . . . , Φ(xm)}, thus
V = Σ_{i=1}^m αi Φ(xi),
and the eigenvalue problem can be written as
λ ⟨Φ(xn), V⟩ = ⟨Φ(xn), CV⟩ for all n = 1, . . . , m
95. Kernel PCA in Dual Variables
In terms of the m × m Gram matrix
Kij := ⟨Φ(xi), Φ(xj)⟩ = k(xi, xj),
this leads to
mλKα = K²α,
where α = (α1, . . . , αm)⊤.
Solve
mλα = Kα
−→ (λn, αn)
⟨Vn, Vn⟩ = 1 ⇐⇒ λn ⟨αn, αn⟩ = 1,
thus divide αn by √λn
96. Feature extraction
Compute projections onto the eigenvectors
Vn = Σ_{i=1}^m αi^n Φ(xi)
in H:
for a test point x with image Φ(x) in H we get the features
⟨Vn, Φ(x)⟩ = Σ_{i=1}^m αi^n ⟨Φ(xi), Φ(x)⟩
= Σ_{i=1}^m αi^n k(xi, x)
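Putting the last three slides together, here is a compact sketch (mine) of kernel PCA feature extraction; it follows the eigenvalue problem mλα = Kα as stated and, for simplicity, omits centering of the data in feature space.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_pca(X, n_components=2, kernel=gaussian_kernel):
    m = len(X)
    K = kernel(X, X)
    eigval, eigvec = np.linalg.eigh(K)              # K alpha = (m lambda) alpha
    idx = np.argsort(eigval)[::-1][:n_components]   # leading eigenpairs
    lam = eigval[idx] / m                           # lambda_n
    alpha = eigvec[:, idx] / np.sqrt(eigval[idx])   # rescale so that <V_n, V_n> = 1
    return alpha, lam

def transform(alpha, Xtr, Xte, kernel=gaussian_kernel):
    # features <V_n, Phi(x)> = sum_i alpha_i^n k(x_i, x)
    return kernel(Xte, Xtr) @ alpha

rng = np.random.default_rng(10)
X = rng.normal(size=(100, 3))
alpha, lam = kernel_pca(X, n_components=2)
print(transform(alpha, X, X[:5]).shape)             # (5, 2) feature matrix
```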
97. The Kernel PCA Map
Recall
Φm^w : X → R^m
x → K^{−1/2} (k(x1, x), . . . , k(xm, x))⊤
If K = U D U⊤ is K’s diagonalization, then K^{−1/2} = U D^{−1/2} U⊤. Thus we have
Φm^w(x) = U D^{−1/2} U⊤ (k(x1, x), . . . , k(xm, x))⊤.
We can drop the leading U (since it leaves the dot product invariant) to get a map
Φ^w_KPCA(x) = D^{−1/2} U⊤ (k(x1, x), . . . , k(xm, x))⊤.
The rows of U⊤ are the eigenvectors αn of K, and the entries of the diagonal matrix D^{−1/2} equal λi^{−1/2}.
98. Toy Example with Gaussian Kernel
k(x, x′) = exp(−‖x − x′‖²)
99. Super-Resolution (Kim, Franz, Schölkopf, 2004)
[Figure: comparison between different super-resolution methods — a. original image of resolution 528 × 396; b. low resolution image (264 × 198) stretched to the original scale; c. bicubic interpolation; d. supervised example-based learning based on nearest neighbor classifier; f. unsupervised KPCA reconstruction; g. enlarged portions of a–d and f (from left to right)]
100. Support Vector Classifiers
[Figure: two classes in input space are mapped by Φ into a feature space, where they become linearly separable]
[6]
101. Separating Hyperplane
[Figure: a separating hyperplane {x | ⟨w, x⟩ + b = 0} with normal vector w; points with ⟨w, x⟩ + b > 0 on one side and ⟨w, x⟩ + b < 0 on the other]
103. Eliminating the Scaling Freedom [47]
Note: if c ≠ 0, then
{x | ⟨w, x⟩ + b = 0} = {x | ⟨cw, x⟩ + cb = 0}.
Hence (cw, cb) describes the same hyperplane as (w, b).
Definition: The hyperplane is in canonical form w.r.t. X∗ = {x1, . . . , xr} if min_{xi∈X∗} |⟨w, xi⟩ + b| = 1.
104. Canonical Optimal Hyperplane
[Figure: canonical optimal hyperplane {x | ⟨w, x⟩ + b = 0} with margin hyperplanes {x | ⟨w, x⟩ + b = +1} and {x | ⟨w, x⟩ + b = −1}]
Note: for x1 with ⟨w, x1⟩ + b = +1 (yi = +1) and x2 with ⟨w, x2⟩ + b = −1 (yi = −1),
⟨w, x1 − x2⟩ = 2, i.e., ⟨w/‖w‖, x1 − x2⟩ = 2/‖w‖.
105. Canonical Hyperplanes [47]
Note: if c ≠ 0, then
{x | ⟨w, x⟩ + b = 0} = {x | ⟨cw, x⟩ + cb = 0}.
Hence (cw, cb) describes the same hyperplane as (w, b).
Definition: The hyperplane is in canonical form w.r.t. X∗ = {x1, . . . , xr} if min_{xi∈X∗} |⟨w, xi⟩ + b| = 1.
Note that for canonical hyperplanes, the distance of the closest point to the hyperplane (“margin”) is 1/‖w‖:
min_{xi∈X∗} | ⟨w/‖w‖, xi⟩ + b/‖w‖ | = 1/‖w‖.
106. Theorem 8 (Vapnik [46]) Consider hyperplanes ⟨w, x⟩ = 0 where w is normalized such that they are in canonical form w.r.t. a set of points X∗ = {x1, . . . , xr}, i.e.,
min_{i=1,...,r} |⟨w, xi⟩| = 1.
The set of decision functions fw(x) = sgn⟨x, w⟩ defined on X∗ and satisfying the constraint ‖w‖ ≤ Λ has a VC dimension satisfying
h ≤ R²Λ².
Here, R is the radius of the smallest sphere around the origin containing X∗.
107. [Figure: points contained in a sphere of radius R around the origin, with two different margins γ1 and γ2]
108. Proof Strategy (Gurvits, 1997)
Assume that x1, . . . , xr are shattered by canonical hyperplanes with ‖w‖ ≤ Λ, i.e., for all y1, . . . , yr ∈ {±1},
yi ⟨w, xi⟩ ≥ 1 for all i = 1, . . . , r.     (5)
Two steps:
• prove that the more points we want to shatter (5), the larger ‖Σ_{i=1}^r yi xi‖ must be
• upper bound the size of ‖Σ_{i=1}^r yi xi‖ in terms of R
Combining the two tells us how many points we can at most shatter.
109. Part I
Summing (5) over i = 1, . . . , r yields
⟨w, Σ_{i=1}^r yi xi⟩ ≥ r.
By the Cauchy-Schwarz inequality, on the other hand, we have
⟨w, Σ_{i=1}^r yi xi⟩ ≤ ‖w‖ ‖Σ_{i=1}^r yi xi‖ ≤ Λ ‖Σ_{i=1}^r yi xi‖.
Combine both:
r/Λ ≤ ‖Σ_{i=1}^r yi xi‖.     (6)
110. Part II
Consider independent random labels yi ∈ {±1}, uniformly distributed (Rademacher variables).
E[ ‖Σ_{i=1}^r yi xi‖² ] = E[ ⟨Σ_{i=1}^r yi xi, Σ_{j=1}^r yj xj⟩ ]
= E[ Σ_{i=1}^r ⟨yi xi, Σ_{j≠i} yj xj + yi xi⟩ ]
= Σ_{i=1}^r E[ ⟨yi xi, Σ_{j≠i} yj xj⟩ ] + Σ_{i=1}^r E[ ⟨yi xi, yi xi⟩ ]
= Σ_{i=1}^r E[ ‖yi xi‖² ] = Σ_{i=1}^r ‖xi‖²
111. Part II, ctd.
Since ‖xi‖ ≤ R, we get
E[ ‖Σ_{i=1}^r yi xi‖² ] ≤ rR².
• This holds for the expectation over the random choices of the labels, hence there must be at least one set of labels for which it also holds true. Use this set.
Hence
‖Σ_{i=1}^r yi xi‖² ≤ rR².
112. Part I and II Combined
Part I: r²/Λ² ≤ ‖Σ_{i=1}^r yi xi‖²
Part II: ‖Σ_{i=1}^r yi xi‖² ≤ rR²
Hence
r²/Λ² ≤ rR²,
i.e.,
r ≤ R²Λ²,
completing the proof.