Introduction to Machine Learning


           Bernhard Schölkopf
     Empirical Inference Department
 Max Planck Institute for Intelligent Systems
           Tübingen, Germany

      http://www.tuebingen.mpg.de/bs

                                                1
Empirical Inference

• Drawing conclusions from empirical data (observations, measurements)

• Example 1: scientific inference

       [Figure: scattered observations (x, y) with two candidate fits: a linear law
        y = a·x and a kernel expansion y = Σ_i a_i k(x, x_i) + b.]
     Leibniz, Weyl, Chaitin
                                                                              2
Empirical Inference

• Drawing conclusions from empirical data (observations, measurements)

• Example 1: scientific inference
       “If your experiment needs statistics [inference],
        you ought to have done a better experiment.” (Rutherford)




                                                                     3
Empirical Inference, II



• Example 2: perception




   “The brain is nothing but a statistical decision organ”
                  (H. Barlow)
                                                             4
Hard Inference Problems
                                                                  Sonnenburg, Rätsch, Schäfer,
                                                                  Schölkopf, 2006, Journal of Machine
                                                                  Learning Research

                                                                  Task: classify human DNA
                                                                  sequence locations into {acceptor
                                                                  splice site, decoy} using 15
                                                                  million sequences of length 141
                                                                  and a multiple-kernel support
                                                                  vector machine.

                                                                  PRC = precision-recall curve
                                                                  (precision: the fraction of correct
                                                                  positive predictions among all
                                                                  positively predicted cases)

   •   High dimensionality – consider many factors simultaneously to find the regularity
   •   Complex regularities – nonlinear, nonstationary, etc.
   •   Little prior knowledge – e.g., no mechanistic models for the data
   •   Need large data sets – processing requires computers and automatic inference methods
                                                                                                    5
Hard Inference Problems, II

  • We can solve scientific inference problems that humans can’t solve
  • Even if it’s just because of data set size / dimensionality, this is a
    quantum leap




                                                                             6
Generalization                              (thanks to O. Bousquet)



• observe 1, 2, 4, 7, …    (differences: +1, +2, +3)
• What’s next?


•   1, 2, 4, 7, 11, 16, …:  a_{n+1} = a_n + n  (“lazy caterer’s sequence”)
•   1, 2, 4, 7, 12, 20, …:  a_{n+2} = a_{n+1} + a_n + 1
•   1, 2, 4, 7, 13, 24, …:  “Tribonacci” sequence
•   1, 2, 4, 7, 14, 28:     set of divisors of 28
•   1, 2, 4, 7, 1, 1, 5, …: decimal expansions of π = 3.14159… and e = 2.718… interleaved
•   The On-Line Encyclopedia of Integer Sequences: >600 hits…
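To make the point concrete, here is a small Python sketch (not part of the original slides; the rule names are ours). Each rule reproduces the observed prefix 1, 2, 4, 7 and then continues it differently:

```python
# Three rules that agree on the prefix 1, 2, 4, 7 yet generalize differently.

def lazy_caterer(n):
    # a_{k+1} = a_k + k, starting from a_1 = 1
    seq = [1]
    for k in range(1, n):
        seq.append(seq[-1] + k)
    return seq

def plus_fibonacci(n):
    # a_{k+2} = a_{k+1} + a_k + 1
    seq = [1, 2]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2] + 1)
    return seq

def tribonacci_like(n):
    # each term is the sum of the three preceding terms
    seq = [1, 2, 4]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2] + seq[-3])
    return seq

for rule in (lazy_caterer, plus_fibonacci, tribonacci_like):
    print(rule.__name__, rule(7))
# lazy_caterer [1, 2, 4, 7, 11, 16, 22]
# plus_fibonacci [1, 2, 4, 7, 12, 20, 33]
# tribonacci_like [1, 2, 4, 7, 13, 24, 44]
```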

                                                                      7
Generalization, II

• Question: which continuation is correct (“generalizes”)?
• Answer: there’s no way to tell (“induction problem”)

• Question of statistical learning theory: how to come up
  with a law that is (probably) correct (“demarcation problem”)
  (more accurately: a law that is probably as correct on the test data as it is on the training data)




                                                                                                        8
2-class classification

 Learn f : X → {±1} based on m observations

     (x_1, y_1), \ldots, (x_m, y_m)

 generated from some P(x, y).

 Goal: minimize expected error (“risk”)

     R[f] = \int \frac{1}{2} \, |f(x) - y| \; dP(x, y)

                                                            V. Vapnik
 Problem: P is unknown.
 Induction principle: minimize training error (“empirical risk”)

     R_{emp}[f] = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \, |f(x_i) - y_i|

 over some class of functions. Q: is this “consistent”?
                                          9
The law of large numbers
  For all f and ε > 0,

      \lim_{m \to \infty} P\{ \, |R_{emp}[f] - R[f]| > \epsilon \, \} = 0.

  Does this imply “consistency” of empirical risk minimization
  (optimality in the limit)?

              No – need a uniform law of large numbers:

  For all ε > 0,

      \lim_{m \to \infty} P\{ \sup_{f} (R[f] - R_{emp}[f]) > \epsilon \} = 0.
                                        10
Consistency and uniform convergence
-> LaTeX




                                                                               12
  Bernhard Schölkopf   Empirical Inference Department   Tübingen, 03 October 2011
Support Vector Machines

[Figure: two classes of points in input space are mapped into a feature space F,
 where the kernel evaluates the dot product: k(x, x′) = ⟨Φ(x), Φ(x′)⟩.]

• sparse expansion of solution in terms of SVs (Boser, Guyon, Vapnik 1992):
  representer theorem (Kimeldorf & Wahba 1971, Schölkopf et al. 2000)

• unique solution found by convex QP

Bernhard Schölkopf, 03 October 2011                                 13
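As a rough illustration of these ingredients (a kernel, a sparse expansion over support vectors, and a convex QP solved under the hood), here is a minimal scikit-learn sketch on toy data; the data set and parameter values are our own choices, not anything from the slides:

```python
# Toy example (assumes numpy and scikit-learn are installed): an SVM with a Gaussian
# (RBF) kernel. The dual QP is solved inside SVC; the learned decision function is a
# sparse expansion over the support vectors, f(x) = sum_i a_i k(x, x_i) + b.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 2, rng.randn(50, 2) + 2])   # two Gaussian blobs
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

print("support vectors per class:", clf.n_support_)
print("dual coefficient array shape:", clf.dual_coef_.shape)   # nonzero a_i only for SVs
print("offset b:", clf.intercept_)
print("predictions:", clf.predict([[-2.0, -2.0], [2.0, 2.0]]))
```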
Applications in Computational Geometry / Graphics




Steinke, Walder, Blanz et al.,
Eurographics’05, ’06, ‘08, ICML ’05,‘08,
NIPS ’07
                                           15
Max-Planck-Institut für biologische Kybernetik
Bernhard Schölkopf, Tübingen, 3 October 2011
[Figure: FIFA World Cup – Germany vs. England, June 27, 2010]
                                               16
Kernel Quiz




                                                                               17
  Bernhard Schölkopf   Empirical Inference Department   Tübingen, 03 October 2011
Kernel Methods


            Bernhard Schölkopf

Max Planck Institute for Intelligent Systems
Statistical Learning Theory

1. started by Vapnik and Chervonenkis in the Sixties
2. model: we observe data generated by an unknown stochastic
   regularity
3. learning = extraction of the regularity from the data
4. the analysis of the learning problem leads to notions of capacity
   of the function classes that a learning machine can implement.
5. support vector machines use a particular type of function class:
   classifiers with large “margins” in a feature space induced by a
   kernel.

                                                                 [47, 48]
Example: Regression Estimation

[Figure: data points (x, y) in the plane and a regression function fitted through them.]
• Data: input-output pairs (xi, yi) ∈ R × R
• Regularity: (x1, y1), . . . (xm, ym) drawn from P(x, y)
• Learning: choose a function f : R → R such that the error,
  averaged over P, is minimized.
• Problem: P is unknown, so the average cannot be computed
  — need an “induction principle”
Pattern Recognition

Learn f : X → {±1} from examples
(x1, y1), . . . , (xm, ym) ∈ X×{±1},   generated i.i.d. from P(x, y),
such that the expected misclassification error on a test set, also
drawn from P(x, y),
                 R[f] = \int \frac{1}{2} \, |f(x) - y| \; dP(x, y),

is minimal (Risk Minimization (RM)).
Problem: P is unknown. −→ need an induction principle.
Empirical risk minimization (ERM): replace the average over
P(x, y) by an average over the training sample, i.e. minimize the
training error

                 R_{emp}[f] = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \, |f(x_i) - y_i|
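A minimal sketch of the empirical risk R_emp[f] as just defined, for a fixed toy classifier on synthetic data (the data-generating choices are ours, not from the slides):

```python
# Empirical risk R_emp[f] = (1/m) * sum_i 1/2 * |f(x_i) - y_i|, i.e. the error fraction.
import numpy as np

def empirical_risk(f, X, y):
    preds = np.array([f(x) for x in X])
    return np.mean(0.5 * np.abs(preds - y))

def sign_classifier(x):
    return 1.0 if x > 0 else -1.0

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=100)
flip = rng.uniform(size=100) < 0.1                     # 10% label noise
y = np.sign(X) + flip * (-2) * np.sign(X)              # flips the sign where flip is True

print("R_emp of the sign classifier:", empirical_risk(sign_classifier, X, y))
```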
Convergence of Means to Expectations

Law of large numbers:
                        Remp[f ] → R[f ]
as m → ∞.

Does this imply that empirical risk minimization will give us the
optimal result in the limit of infinite sample size (“consistency”
of empirical risk minimization)?

No.
Need a uniform version of the law of large numbers. Uniform over
all functions that the learning machine can implement.

Consistency and Uniform Convergence


[Figure: the risk R[f] and the empirical risk R_emp[f] plotted over the function
 class, with the minimizer f^opt of R and the minimizer f^m of R_emp marked.]
The Importance of the Set of Functions

What about allowing all functions from X to {±1}?
Training set (x1, y1), . . . , (xm, ym) ∈ X × {±1}
Test patterns x̄1, . . . , x̄m ∈ X,
such that {x̄1, . . . , x̄m} ∩ {x1, . . . , xm} = {}.

For any f there exists f* s.t.:   1. f*(xi) = f(xi) for all i
                                  2. f*(x̄j) ≠ f(x̄j) for all j.
Based on the training set alone, there is no means of choosing
which one is better. On the test set, however, they give opposite
results. There is ’no free lunch’ [24, 56].
−→ a restriction must be placed on the functions that we allow


Restricting the Class of Functions

Two views:

1. Statistical Learning (VC) Theory: take into account the ca-
pacity of the class of functions that the learning machine can
implement

2. The Bayesian Way: place Prior distributions P(f ) over the
class of functions




Detailed Analysis


• loss ξi := ½ |f(xi) − yi| ∈ {0, 1}
• the ξi are independent Bernoulli trials
• empirical mean (1/m) Σ_{i=1}^m ξi   (by def: equals R_emp[f])
• expected value E[ξ]   (equals R[f])
Chernoff ’s Bound
                              
    P\left\{ \left| \frac{1}{m} \sum_{i=1}^{m} \xi_i - E[\xi] \right| \geq \epsilon \right\} \leq 2 \exp(-2m\epsilon^2)

• here, P refers to the probability of getting a sample ξ1, . . . , ξm with the
  property | (1/m) Σ_{i=1}^m ξi − E[ξ] | ≥ ε   (it is a product measure)

Useful corollary: Given a 2m-sample of Bernoulli trials, we have

    P\left\{ \left| \frac{1}{m} \sum_{i=1}^{m} \xi_i - \frac{1}{m} \sum_{i=m+1}^{2m} \xi_i \right| \geq \epsilon \right\} \leq 4 \exp\left( -\frac{m\epsilon^2}{2} \right).
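The bound can be checked numerically. The following sketch (our own toy experiment; all parameter values are assumptions) simulates many m-samples of Bernoulli trials and compares the observed deviation frequency with 2 exp(−2mε²):

```python
# Simulate Bernoulli m-samples with mean p and estimate P{ |mean - p| >= eps },
# then compare with the Chernoff/Hoeffding bound 2*exp(-2*m*eps^2).
import numpy as np

m, p, eps, trials = 200, 0.3, 0.1, 20_000
rng = np.random.RandomState(0)

samples = rng.binomial(1, p, size=(trials, m))        # independent m-samples
deviations = np.abs(samples.mean(axis=1) - p)
empirical_prob = np.mean(deviations >= eps)
bound = 2 * np.exp(-2 * m * eps ** 2)

print("empirical deviation probability:", empirical_prob)
print("bound 2*exp(-2*m*eps^2):        ", round(bound, 4))
# The empirical frequency should lie below the bound (on the order of 0.002 vs 0.037 here).
```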
Chernoff ’s Bound, II

Translate this back into machine learning terminology: the probability of
obtaining an m-sample where the training error and test error differ by more
than ε > 0 is bounded by

    P\left\{ \left| R_{emp}[f] - R[f] \right| \geq \epsilon \right\} \leq 2 \exp(-2m\epsilon^2).


• refers to one fixed f
• not allowed to look at the data before choosing f , hence not
  suitable as a bound on the test error of a learning algorithm
  using empirical risk minimization


Uniform Convergence (Vapnik & Chervonenkis)

Necessary and sufficient conditions for nontrivial consistency of
empirical risk minimization (ERM):
One-sided convergence, uniformly over all functions that can be
implemented by the learning machine.
    \lim_{m \to \infty} P\left\{ \sup_{f \in F} (R[f] - R_{emp}[f]) > \epsilon \right\} = 0
    \quad \text{for all } \epsilon > 0.

• note that this takes into account the whole set of functions that
  can be implemented by the learning machine
• this is hard to check for a learning machine

Are there properties of learning machines (≡ sets of functions)
which ensure uniform convergence of risk?
How to Prove a VC Bound

Take a closer look at P{ sup_{f∈F} (R[f] − R_emp[f]) > ε }.
Plan:
• if the function class F contains only one function, then Chernoff’s
  bound suffices:
      P\left\{ \sup_{f \in F} (R[f] - R_{emp}[f]) > \epsilon \right\} \leq 2 \exp(-2m\epsilon^2).
• if there are finitely many functions, we use the ’union bound’
• even if there are infinitely many, then on any finite sample
  there are effectively only finitely many (use symmetrization
  and capacity concepts)


The Case of Two Functions

Suppose F = {f1, f2}. Rewrite

    P\left\{ \sup_{f \in F} (R[f] - R_{emp}[f]) > \epsilon \right\} = P(C_\epsilon^1 \cup C_\epsilon^2),

where

    C_\epsilon^i := \{ (x_1, y_1), \ldots, (x_m, y_m) \mid (R[f_i] - R_{emp}[f_i]) > \epsilon \}

denotes the event that the risks of f_i differ by more than ε.
The RHS equals

    P(C_\epsilon^1 \cup C_\epsilon^2) = P(C_\epsilon^1) + P(C_\epsilon^2) - P(C_\epsilon^1 \cap C_\epsilon^2)
                                      \leq P(C_\epsilon^1) + P(C_\epsilon^2).

Hence by Chernoff’s bound

    P\left\{ \sup_{f \in F} (R[f] - R_{emp}[f]) > \epsilon \right\} \leq P(C_\epsilon^1) + P(C_\epsilon^2)
                                                                    \leq 2 \cdot 2 \exp(-2m\epsilon^2).
The Union Bound

Similarly, if F = {f1, . . . , fn}, we have

    P\left\{ \sup_{f \in F} (R[f] - R_{emp}[f]) > \epsilon \right\} = P(C_\epsilon^1 \cup \cdots \cup C_\epsilon^n),

and

    P(C_\epsilon^1 \cup \cdots \cup C_\epsilon^n) \leq \sum_{i=1}^{n} P(C_\epsilon^i).

Use Chernoff for each summand, to get an extra factor n in the bound.

Note: this becomes an equality if and only if all the events C_\epsilon^i involved
are disjoint.
Infinite Function Classes

• Note: empirical risk only refers to m points. On these points,
  the functions of F can take at most 2^m values
• for Remp, the function class thus “looks” finite
• how about R?
• need to use a trick




Symmetrization

Lemma 1 (Vapnik & Chervonenkis (e.g., [46, 12]))
For mε² ≥ 2 we have

    P\left\{ \sup_{f \in F} (R[f] - R_{emp}[f]) > \epsilon \right\} \leq 2\, P\left\{ \sup_{f \in F} (R_{emp}[f] - R'_{emp}[f]) > \epsilon/2 \right\}

Here, the first P refers to the distribution of iid samples of size m, while the
second one refers to iid samples of size 2m. In the latter case, R_emp measures
the loss on the first half of the sample, and R'_emp on the second half.
Shattering Coefficient

• Hence, we only need to consider the maximum size of F on 2m
  points. Call it N(F, 2m).
• N(F, 2m) = max. number of different outputs (y1, . . . , y2m)
  that the function class can generate on 2m points — in other
  words, the max. number of different ways the function class can
  separate 2m points into two classes.
• N(F, 2m) ≤ 2^{2m}
• if N(F, 2m) = 2^{2m}, then the function class is said to shatter 2m points.
Putting Everything Together

We now use (1) symmetrization, (2) the shattering coefficient, and
(3) the union bound, to get


    P\left\{ \sup_{f \in F} (R[f] - R_{emp}[f]) > \epsilon \right\}
    \;\leq\; 2\, P\left\{ \sup_{f \in F} (R_{emp}[f] - R'_{emp}[f]) > \epsilon/2 \right\}
    \;=\; 2\, P\left\{ (R_{emp}[f_1] - R'_{emp}[f_1]) > \epsilon/2 \;\vee \cdots \vee\; (R_{emp}[f_{N(F,2m)}] - R'_{emp}[f_{N(F,2m)}]) > \epsilon/2 \right\}
    \;\leq\; \sum_{n=1}^{N(F,2m)} 2\, P\left\{ (R_{emp}[f_n] - R'_{emp}[f_n]) > \epsilon/2 \right\}.
ctd.
Use Chernoff’s bound for each term:∗
       
    P\left\{ \left| \frac{1}{m} \sum_{i=1}^{m} \xi_i - \frac{1}{m} \sum_{i=m+1}^{2m} \xi_i \right| \geq \epsilon \right\} \leq 2 \exp\left( -\frac{m\epsilon^2}{2} \right).

This yields

    P\left\{ \sup_{f \in F} (R[f] - R_{emp}[f]) > \epsilon \right\} \leq 4\, N(F, 2m) \exp\left( -\frac{m\epsilon^2}{8} \right).

• provided that N(F, 2m) does not grow exponentially in m, this is nontrivial
• such bounds are called VC type inequalities
• two types of randomness: (1) the P refers to the drawing of the training
  examples, and (2) R[f] is an expectation over the drawing of test examples.

∗ Note that the f_i depend on the 2m-sample. A rigorous treatment would need to use a second
randomization over permutations of the 2m-sample, see [36].
Confidence Intervals

Rewrite the bound: specify the probability with which we want R to be close
to R_emp, and solve for ε:
With a probability of at least 1 − δ,

    R[f] \leq R_{emp}[f] + \sqrt{ \frac{8}{m} \left( \ln N(F, 2m) + \ln\frac{4}{\delta} \right) }.
This bound holds independent of f ; in particular, it holds for the
function f m minimizing the empirical risk.




Discussion

• tighter bounds are available (better constants etc.)
• cannot minimize the bound over f
• other capacity concepts can be used




VC Entropy
On an example (x, y), f causes a loss

    \xi(x, y, f(x)) = \frac{1}{2} |f(x) - y| \in \{0, 1\}.

For a larger sample (x_1, y_1), . . . , (x_m, y_m), the different functions f ∈ F
lead to a set of loss vectors

    \xi_f = (\xi(x_1, y_1, f(x_1)), \ldots, \xi(x_m, y_m, f(x_m))),

whose cardinality we denote by

    N(F, (x_1, y_1), \ldots, (x_m, y_m)).

The VC entropy is defined as

    H_F(m) = E\left[ \ln N(F, (x_1, y_1), \ldots, (x_m, y_m)) \right],

where the expectation is taken over the random generation of the m-sample
(x_1, y_1), . . . , (x_m, y_m) from P.

H_F(m)/m → 0  ⟺  uniform convergence of risks (hence consistency)
Further PR Capacity Concepts

• exchange ’E’ and ’ln’: annealed entropy.
  H_F^{ann}(m)/m → 0  ⟺  exponentially fast uniform convergence

• take ’max’ instead of ’E’: growth function.
  Note that G_F(m) = ln N(F, m).
  G_F(m)/m → 0  ⟺  exponential convergence for all underlying distributions P.
  G_F(m) = m · ln(2) for all m  ⟺  for any m, all loss vectors can be generated,
  i.e., the m points can be chosen such that by using functions of the learning
  machine, they can be separated in all 2^m possible ways (shattered).
Structure of the Growth Function

Either GF (m) = m · ln(2) for all m ∈ N
Or there exists some maximal m for which the above is possible.
Call this number the VC-dimension, and denote it by h. For
m > h,
    G_F(m) \leq h \left( \ln\frac{m}{h} + 1 \right).

Nothing “in between” linear growth and logarithmic growth is
possible.




VC-Dimension: Example

Half-spaces in R^2:
    f(x, y) = sgn(a + bx + cy),   with parameters a, b, c ∈ R

• Clearly, we can shatter three non-collinear points.
• But we can never shatter four points.
• Hence the VC dimension is h = 3 (in this case, equal to the
  number of parameters)
[Figure: the 2³ = 8 possible labelings of three non-collinear points in the plane,
 each realized by a half-space.]
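A brute-force sketch (ours, not from the slides) of the shattering claim: randomly sampling many parameter vectors (a, b, c) realizes every one of the 2³ labelings of three non-collinear points:

```python
# Half-spaces sgn(a + b*x + c*y) shatter three non-collinear points in R^2.
import itertools
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # non-collinear
rng = np.random.RandomState(0)
params = rng.uniform(-5, 5, size=(20_000, 3))              # rows are (a, b, c)

# labelings[k] = signs produced by parameter row k on the three points
labelings = np.sign(params[:, 0:1] + params[:, 1:] @ points.T)
realized = {tuple(float(v) for v in row) for row in labelings if 0 not in row}

targets = list(itertools.product([-1.0, 1.0], repeat=3))
print("realized labelings:", len(realized), "of", len(targets))
print("all 8 labelings realized:", all(t in realized for t in targets))
# Four points, by contrast, can never be shattered: the VC dimension is 3.
```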
A Typical Bound for Pattern Recognition

For any f ∈ F and m > h, with a probability of at least 1 − δ,

    R[f] \leq R_{emp}[f] + \sqrt{ \frac{h \left( \log\frac{2m}{h} + 1 \right) - \log(\delta/4)}{m} }

holds.

• does this mean that we can learn anything?
• The study of the consistency of ERM has thus led to concepts and results
  which let us formulate another induction principle (structural risk
  minimization).
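To get a feel for the capacity term in this bound, the following sketch evaluates it for some illustrative values of h, m and δ (our own choices; natural logarithms are assumed):

```python
# Capacity term of the VC bound: sqrt( (h*(log(2m/h) + 1) - log(delta/4)) / m ).
import numpy as np

def vc_confidence(h, m, delta):
    return np.sqrt((h * (np.log(2 * m / h) + 1) - np.log(delta / 4)) / m)

for m in (1_000, 10_000, 100_000):
    print(m, vc_confidence(h=100, m=m, delta=0.05))
# The term shrinks roughly like sqrt(h * log(m) / m): more data or a smaller
# VC dimension h tightens the guaranteed gap between R[f] and R_emp[f].
```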
SRM


[Figure: structural risk minimization — over a nested structure S_{n−1} ⊂ S_n ⊂ S_{n+1}
 of function classes, the training error decreases with capacity h while the capacity
 term increases; their sum is the bound on the test error, whose minimum R(f*) picks
 the class to use.]
Finding a Good Function Class


• recall: separating hyperplanes in R^2 have a VC dimension of 3.
• more generally: separating hyperplanes in R^N have a VC dimension of N + 1.
• hence: separating hyperplanes in high-dimensional feature
  spaces have extremely large VC dimension, and may not gener-
  alize well
• however, margin hyperplanes can still have a small VC dimen-
  sion




Kernels and Feature Spaces

Preprocess the data with
                        Φ:X → H
                          x → Φ(x),
where H is a dot product space, and learn the mapping from Φ(x)
to y [6].

• usually, dim(X) ≪ dim(H)
• “Curse of Dimensionality”?
• crucial issue: capacity, not dimensionality


Example: All Degree 2 Monomials


    \Phi : R^2 \to R^3
    (x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)

[Figure: a two-class data set shown in the input coordinates (x_1, x_2) and in the
 feature coordinates (z_1, z_2, z_3) obtained from the monomial map Φ.]
General Product Feature Space




How about patterns x ∈ R^N and product features of order d?
Here, dim(H) grows like N^d.
E.g. N = 16 × 16 and d = 5 −→ dimension ≈ 10^{10}




The Kernel Trick, N = d = 2

    \langle \Phi(x), \Phi(x') \rangle = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \, (x_1'^2, \sqrt{2}\, x_1' x_2', x_2'^2)^\top
                                      = \langle x, x' \rangle^2
                                      =: k(x, x')

−→ the dot product in H can be computed in R^2
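A quick numerical check of this identity (our own sketch): computing ⟨Φ(x), Φ(x′)⟩ explicitly in R³ gives the same number as ⟨x, x′⟩² computed in R²:

```python
# The explicit degree-2 monomial feature map versus the squared dot product.
import numpy as np

def phi(x):
    # Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.RandomState(0)
x, xp = rng.randn(2), rng.randn(2)

lhs = phi(x) @ phi(xp)          # <Phi(x), Phi(x')> computed in R^3
rhs = (x @ xp) ** 2             # k(x, x') = <x, x'>^2 computed in R^2
print(lhs, rhs, np.isclose(lhs, rhs))
```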
The Kernel Trick, II

More generally, for x, x′ ∈ R^N and d ∈ N:

    \langle x, x' \rangle^d = \left( \sum_{j=1}^{N} x_j \cdot x_j' \right)^d
                            = \sum_{j_1, \ldots, j_d = 1}^{N} x_{j_1} \cdots x_{j_d} \cdot x'_{j_1} \cdots x'_{j_d}
                            = \langle \Phi(x), \Phi(x') \rangle,

where Φ maps into the space spanned by all ordered products of d input directions.
Mercer’s Theorem

If k is a continuous kernel of a positive definite integral operator on L_2(X)
(where X is some compact space),

    \int_X k(x, x') f(x) f(x') \, dx \, dx' \geq 0,

it can be expanded as

    k(x, x') = \sum_{i=1}^{\infty} \lambda_i \psi_i(x) \psi_i(x')

using eigenfunctions ψ_i and eigenvalues λ_i ≥ 0 [30].
The Mercer Feature Map

In that case

    \Phi(x) := \begin{pmatrix} \sqrt{\lambda_1}\,\psi_1(x) \\ \sqrt{\lambda_2}\,\psi_2(x) \\ \vdots \end{pmatrix}

satisfies ⟨Φ(x), Φ(x′)⟩ = k(x, x′).

Proof:

    \langle \Phi(x), \Phi(x') \rangle
      = \left\langle \begin{pmatrix} \sqrt{\lambda_1}\,\psi_1(x) \\ \sqrt{\lambda_2}\,\psi_2(x) \\ \vdots \end{pmatrix},
                     \begin{pmatrix} \sqrt{\lambda_1}\,\psi_1(x') \\ \sqrt{\lambda_2}\,\psi_2(x') \\ \vdots \end{pmatrix} \right\rangle
      = \sum_{i=1}^{\infty} \lambda_i \psi_i(x) \psi_i(x') = k(x, x')
Positive Definite Kernels

It can be shown that the admissible class of kernels coincides with
the one of positive definite (pd) kernels: kernels which are sym-
metric (i.e., k(x, x′) = k(x′, x)), and for
• any set of training points x1, . . . , xm ∈ X and
• any a1, . . . , am ∈ R
satisfy

    \sum_{i,j} a_i a_j K_{ij} \geq 0, \quad \text{where } K_{ij} := k(x_i, x_j).

K is called the Gram matrix or kernel matrix.
If, for pairwise distinct points, \sum_{i,j} a_i a_j K_{ij} = 0 \implies a = 0, the
kernel is called strictly positive definite.
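In practice, positive definiteness of a Gram matrix can be checked numerically. A small sketch (ours; the Gaussian kernel and the data are arbitrary choices):

```python
# For a pd kernel, all eigenvalues of the Gram matrix K are >= 0 (up to numerics),
# and a^T K a >= 0 for every coefficient vector a.
import numpy as np

def gaussian_gram(X, sigma=1.0):
    # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.RandomState(0)
X = rng.randn(20, 3)                      # 20 points in R^3
K = gaussian_gram(X)

eigvals = np.linalg.eigvalsh(K)           # K is symmetric, so eigvalsh applies
print("smallest eigenvalue:", eigvals.min())

a = rng.randn(20)
print("a^T K a =", a @ K @ a)             # nonnegative
```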
The Kernel Trick — Summary

• any algorithm that only depends on dot products can benefit
  from the kernel trick
• this way, we can apply linear methods to vectorial as well as
  non-vectorial data
• think of the kernel as a nonlinear similarity measure
• examples of common kernels:
     Polynomial:  k(x, x') = (\langle x, x' \rangle + c)^d
     Gaussian:    k(x, x') = \exp\left( -\|x - x'\|^2 / (2\,\sigma^2) \right)
• Kernels are also known as covariance functions [54, 52, 55, 29]
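As a sketch of the summary above (ours, with arbitrary kernel parameters), here is a quantity that depends on the data only through dot products — the distance in feature space — computed with the kernel trick for both kernels:

```python
# ||Phi(x) - Phi(x')||^2 = k(x, x) - 2 k(x, x') + k(x', x'), no explicit Phi needed.
import numpy as np

def poly_kernel(x, xp, c=1.0, d=3):
    return (x @ xp + c) ** d

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def feature_space_distance(k, x, xp):
    return np.sqrt(k(x, x) - 2 * k(x, xp) + k(xp, xp))

rng = np.random.RandomState(0)
x, xp = rng.randn(4), rng.randn(4)
print("polynomial-kernel distance:", feature_space_distance(poly_kernel, x, xp))
print("Gaussian-kernel distance:  ", feature_space_distance(gaussian_kernel, x, xp))
```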
Properties of PD Kernels, 1

Assumption: Φ maps X into a dot product space H; x, x′ ∈ X

Kernels from Feature Maps.
k(x, x′) := ⟨Φ(x), Φ(x′)⟩ is a pd kernel on X × X.

Kernels from Feature Maps, II
K(A, B) := \sum_{x \in A,\, x' \in B} k(x, x'),
where A, B are finite subsets of X, is also a pd kernel
(Hint: use the feature map \tilde{\Phi}(A) := \sum_{x \in A} \Phi(x))
Properties of PD Kernels, 2 [36, 39]

Assumption: k, k1, k2, . . . are pd; x, x′ ∈ X
k(x, x) ≥ 0 for all x (Positivity on the Diagonal)
k(x, x′)² ≤ k(x, x) k(x′, x′)   (Cauchy-Schwarz Inequality)
(Hint: compute the determinant of the Gram matrix)

k(x, x) = 0 for all x =⇒ k(x, x′) = 0 for all x, x′        (Vanishing Diagonals)

The following kernels are pd:
 • αk, provided α ≥ 0
 • k1 + k2
 • k(x, x′) := limn→∞ kn(x, x′), provided it exists
 • k1 · k2
 • tensor products, direct sums, convolutions [22]
The Feature Space for PD Kernels                                    [4, 1, 35]

• define a feature map
                        Φ : X → R^X
                            x → k(., x).
E.g., for the Gaussian kernel:

[Figure: two inputs x and x′ are mapped by Φ to the functions Φ(x) = k(., x) and
 Φ(x′) = k(., x′), i.e., Gaussian bumps centered at x and x′.]

Next steps:
• turn Φ(X) into a linear space
• endow it with a dot product satisfying
   ⟨Φ(x), Φ(x′)⟩ = k(x, x′), i.e., ⟨k(., x), k(., x′)⟩ = k(x, x′)
• complete the space to get a reproducing kernel Hilbert space
Turn it Into a Linear Space

Form linear combinations

    f(.) = \sum_{i=1}^{m} \alpha_i k(., x_i),
    g(.) = \sum_{j=1}^{m'} \beta_j k(., x_j')

(m, m′ ∈ N, α_i, β_j ∈ R, x_i, x_j′ ∈ X).
Endow it With a Dot Product

      ⟨f, g⟩ := Σ_{i=1}^m Σ_{j=1}^{m′} α_i β_j k(x_i, x′_j)
              = Σ_{i=1}^m α_i g(x_i) = Σ_{j=1}^{m′} β_j f(x′_j)
• This is well-defined, symmetric, and bilinear (more later).
• So far, it also works for non-pd kernels




The Reproducing Kernel Property

Two special cases:
 • Assume
                                      f (.) = k(., x).
   In this case, we have
                                  ⟨k(·, x), g⟩ = g(x).
 • If moreover
                                      g(.) = k(., x′),
   we have
                            ⟨k(·, x), k(·, x′)⟩ = k(x, x′).

k is called a reproducing kernel
(up to here, have not used positive definiteness)
Endow it With a Dot Product, II

• It can be shown that ⟨·, ·⟩ is a p.d. kernel on the set of functions
  {f(·) = Σ_{i=1}^m α_i k(·, x_i) | α_i ∈ R, x_i ∈ X}:

     Σ_{ij} γ_i γ_j ⟨f_i, f_j⟩ = ⟨Σ_i γ_i f_i, Σ_j γ_j f_j⟩ =: ⟨f, f⟩
        = ⟨Σ_i α_i k(·, x_i), Σ_i α_i k(·, x_i)⟩ = Σ_{ij} α_i α_j k(x_i, x_j) ≥ 0

• furthermore, it is strictly positive definite:
     f(x)² = ⟨f, k(·, x)⟩² ≤ ⟨f, f⟩ ⟨k(·, x), k(·, x)⟩,
  hence ⟨f, f⟩ = 0 implies f = 0.
• Complete the space in the corresponding norm to get a Hilbert
  space Hk .
The Empirical Kernel Map

Recall the feature map
                         Φ : X → R^X
                             x ↦ k(·, x).
• each point is represented by its similarity to all other points
• how about representing it by its similarity to a sample of points?


Consider
     Φ_m : X → R^m
           x ↦ k(·, x)|_{(x_1,...,x_m)} = (k(x_1, x), . . . , k(x_m, x))^⊤

ctd.

• Φ_m(x_1), . . . , Φ_m(x_m) contain all necessary information about
  Φ(x_1), . . . , Φ(x_m)
• the Gram matrix G_ij := ⟨Φ_m(x_i), Φ_m(x_j)⟩ satisfies G = K²,
  where K_ij = k(x_i, x_j)
• modify Φ_m to
     Φ^w_m : X → R^m
             x ↦ K^{−1/2} (k(x_1, x), . . . , k(x_m, x))^⊤
• this "whitened" map ("kernel PCA map") satisfies (checked numerically below)
     ⟨Φ^w_m(x_i), Φ^w_m(x_j)⟩ = k(x_i, x_j)
  for all i, j = 1, . . . , m.
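A minimal NumPy sketch of the whitened empirical kernel map (added for illustration, not from the slides): it forms K^{−1/2} via an eigendecomposition and checks that the resulting Gram matrix reproduces K on the sample. The Gaussian bandwidth and the tiny ridge term are arbitrary choices for numerical stability.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
K = gaussian_gram(X)

# K^(-1/2) from the eigendecomposition K = U diag(lam) U^T
lam, U = np.linalg.eigh(K + 1e-10 * np.eye(len(K)))   # tiny ridge for stability
K_inv_sqrt = U @ np.diag(lam ** -0.5) @ U.T

Phi_w = (K_inv_sqrt @ K).T    # row i is Phi_w(x_i) = K^(-1/2)(k(x_1,x_i),...,k(x_m,x_i))
print(np.allclose(Phi_w @ Phi_w.T, K, atol=1e-6))     # <Phi_w(x_i), Phi_w(x_j)> = k(x_i, x_j)
```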
An Example of a Kernel Algorithm

Idea: classify points x := Φ(x) in feature space according to which
of the two class means is closer.
      c_+ := (1/m_+) Σ_{i: y_i=+1} Φ(x_i),      c_− := (1/m_−) Σ_{i: y_i=−1} Φ(x_i)

[Figure: the two class means c_1 and c_2 in feature space, their midpoint c, the vector w connecting them, and a test point x with the vector x − c.]
Compute the sign of the dot product between w := c_+ − c_− and x − c, where c := (c_+ + c_−)/2 is the midpoint of the class means.
An Example of a Kernel Algorithm, ctd. [36]

                                                                                                               

f(x) = sgn( (1/m_+) Σ_{i: y_i=+1} ⟨Φ(x), Φ(x_i)⟩ − (1/m_−) Σ_{i: y_i=−1} ⟨Φ(x), Φ(x_i)⟩ + b )

     = sgn( (1/m_+) Σ_{i: y_i=+1} k(x, x_i) − (1/m_−) Σ_{i: y_i=−1} k(x, x_i) + b )

where
     b = (1/2) ( (1/m_−²) Σ_{(i,j): y_i=y_j=−1} k(x_i, x_j) − (1/m_+²) Σ_{(i,j): y_i=y_j=+1} k(x_i, x_j) ).



• provides a geometric interpretation of Parzen windows
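The classifier above is straightforward to implement. The following sketch is an added illustration (not the original demo), using a Gaussian kernel with an arbitrary bandwidth; it computes f(x), including the offset b, on a toy two-class data set.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def mean_classifier(X_train, y_train, X_test, sigma=1.0):
    # y_train in {+1, -1}; compare kernelized similarity to the two class means
    pos, neg = X_train[y_train == 1], X_train[y_train == -1]
    K_pos = gaussian_kernel(X_test, pos, sigma)   # k(x, x_i) for positive x_i
    K_neg = gaussian_kernel(X_test, neg, sigma)
    b = 0.5 * (gaussian_kernel(neg, neg, sigma).mean()
               - gaussian_kernel(pos, pos, sigma).mean())
    return np.sign(K_pos.mean(axis=1) - K_neg.mean(axis=1) + b)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])
print((mean_classifier(X, y, X) == y).mean())     # training accuracy on the toy data
```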
An Example of a Kernel Algorithm, ctd.

• Demo
• Exercise: derive the Parzen windows classifier by computing the
  distance criterion directly
• SVMs (ppt)




An example of a kernel algorithm, revisited


[Figure: positive sample points (+) with RKHS mean µ(X), negative sample points (o) with RKHS mean µ(Y), and the vector w connecting the two means.]
X compact subset of a separable metric space, m, n ∈ N.
Positive class X := {x1, . . . , xm} ⊂ X
Negative class Y := {y1, . . . , yn} ⊂ X
RKHS means µ(X) = (1/m) Σ_{i=1}^m k(x_i, ·),   µ(Y) = (1/n) Σ_{i=1}^n k(y_i, ·).
Get a problem if µ(X) = µ(Y )!
When do the means coincide?

k(x, x′) = ⟨x, x′⟩:            the means coincide

k(x, x′) = (⟨x, x′⟩ + 1)^d:    all empirical moments up to order d coincide

k strictly pd:                 X = Y.




The mean “remembers” each point that contributed to it.




Proposition 2 Assume X, Y are defined as above, k is strictly pd, and the x_i are pairwise distinct, as are the y_j.
If for some α_i, β_j ∈ R \ {0}, we have
      Σ_{i=1}^m α_i k(x_i, ·) = Σ_{j=1}^n β_j k(y_j, ·),        (1)
then X = Y.




Proof (by contradiction)

W.l.o.g., assume that x_1 ∉ Y. Subtract Σ_{j=1}^n β_j k(y_j, ·) from (1),
and make it a sum over pairwise distinct points, to get
      0 = Σ_i γ_i k(z_i, ·),
where z_1 = x_1, γ_1 = α_1 ≠ 0, and
z_2, · · · ∈ (X ∪ Y) \ {x_1}, γ_2, · · · ∈ R.
Take the RKHS dot product with Σ_j γ_j k(z_j, ·) to get
      0 = Σ_{ij} γ_i γ_j k(z_i, z_j),
with γ ≠ 0, hence k cannot be strictly pd.


The mean map

      µ : X = (x_1, . . . , x_m) ↦ (1/m) Σ_{i=1}^m k(x_i, ·)
satisfies
      ⟨µ(X), f⟩ = (1/m) Σ_{i=1}^m ⟨k(x_i, ·), f⟩ = (1/m) Σ_{i=1}^m f(x_i)
and
      ‖µ(X) − µ(Y)‖ = sup_{‖f‖≤1} |⟨µ(X) − µ(Y), f⟩| = sup_{‖f‖≤1} | (1/m) Σ_{i=1}^m f(x_i) − (1/n) Σ_{i=1}^n f(y_i) |.


Note: a large distance means we can find a function that distinguishes the two samples.
Witness function

For f = (µ(X) − µ(Y)) / ‖µ(X) − µ(Y)‖, we have f(x) ∝ ⟨µ(X) − µ(Y), k(x, ·)⟩:
[Figure: "Witness f for Gauss and Laplace data": the witness function f plotted together with the Gaussian and Laplace probability densities (vertical axis: probability density and f; horizontal axis: X from −6 to 6).]

This function is in the RKHS of a Gaussian kernel, but not in the
RKHS of the linear kernel.
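The empirical witness function is easy to evaluate. Here is a small NumPy sketch (added for illustration, not from the slides) that computes the unnormalized witness f(x) ∝ (1/m) Σ_i k(x, x_i) − (1/n) Σ_j k(x, y_j) for a Gaussian kernel on two one-dimensional samples; the bandwidth and sample sizes are arbitrary.

```python
import numpy as np

def gauss_k(a, b, sigma=1.0):
    # Gaussian kernel between two 1-d arrays, all pairs
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=500)                          # Gaussian sample
Y = rng.laplace(scale=1 / np.sqrt(2), size=500)   # Laplace sample with matching variance

grid = np.linspace(-6, 6, 200)
witness = gauss_k(grid, X).mean(axis=1) - gauss_k(grid, Y).mean(axis=1)
print(grid[np.argmax(np.abs(witness))])           # where the two samples differ most
```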
The mean map for measures

p, q Borel probability measures,
E_{x,x′∼p}[k(x, x′)], E_{x,x′∼q}[k(x, x′)] < ∞ (‖k(x, ·)‖ ≤ M < ∞ for all x is sufficient)

Define
      µ : p ↦ E_{x∼p}[k(x, ·)].
Note
      ⟨µ(p), f⟩ = E_{x∼p}[f(x)]
and
      ‖µ(p) − µ(q)‖ = sup_{‖f‖≤1} | E_{x∼p}[f(x)] − E_{x∼q}[f(x)] |.
Recall that in the finite sample case, for strictly p.d. kernels, µ
was injective — how about now?
[43, 17]
Theorem 3 [15, 13]
      p = q ⇐⇒ sup_{f∈C(X)} | E_{x∼p}[f(x)] − E_{x∼q}[f(x)] | = 0,
where C(X) is the space of continuous bounded functions on X.
Combine this with
      ‖µ(p) − µ(q)‖ = sup_{‖f‖≤1} | E_{x∼p}[f(x)] − E_{x∼q}[f(x)] |.
Replace C(X) by the unit ball in an RKHS that is dense in C(X)
— universal kernel [45], e.g., Gaussian.
Theorem 4 [19] If k is universal, then
      p = q ⇐⇒ ‖µ(p) − µ(q)‖ = 0.
• µ is invertible on its image
  M = {µ(p) | p is a probability distribution}
  (the “marginal polytope”, [53])
• generalization of the moment generating function of a RV x with distribution p:
      M_p(·) = E_{x∼p}[ e^{⟨x, ·⟩} ].
This provides us with a convenient metric on probability distribu-
tions, which can be used to check whether two distributions are
different — provided that µ is invertible.



Fourier Criterion

Assume we have densities, the kernel is shift invariant (k(x, y) =
k(x − y)), and all Fourier transforms below exist.
Note that µ is invertible iff
      ∫ k(x − y) p(y) dy = ∫ k(x − y) q(y) dy =⇒ p = q,
i.e.,
      k̂ · (p̂ − q̂) = 0 =⇒ p = q
(Sriperumbudur et al., 2008)

E.g., µ is invertible if k̂ has full support. Restricting the class of distributions, weaker conditions suffice (e.g., if k̂ has non-empty interior, µ is invertible for all distributions with compact support).
Fourier Optics

Application: p source of incoherent light, I indicator of a finite aperture. In Fraunhofer diffraction, the intensity image is ∝ p ∗ |Î|².
Set k = |Î|²; then this equals µ(p).
This k̂ does not have full support, thus the imaging process is not invertible for the class of all light sources (Abbe), but it is if we restrict the class (e.g., to compact support).




Application 1: Two-sample problem [19]

X, Y i.i.d. m-samples from p, q, respectively.


      ‖µ(p) − µ(q)‖² = E_{x,x′∼p}[k(x, x′)] − 2 E_{x∼p, y∼q}[k(x, y)] + E_{y,y′∼q}[k(y, y′)]
                     = E_{x,x′∼p, y,y′∼q}[ h((x, y), (x′, y′)) ]
with
      h((x, y), (x′, y′)) := k(x, x′) − k(x, y′) − k(y, x′) + k(y, y′).
Define
      D(p, q)² := E_{x,x′∼p, y,y′∼q}[ h((x, y), (x′, y′)) ]
      D̂(X, Y)² := (1 / (m(m−1))) Σ_{i≠j} h((x_i, y_i), (x_j, y_j)).

D̂(X, Y)² is an unbiased estimator of D(p, q)².
It’s easy to compute, and works on structured data.
Theorem 5 Assume k is bounded.
D̂(X, Y)² converges to D(p, q)² in probability with rate O(m^{−1/2}).

This could be used as a basis for a test, but uniform convergence bounds are often loose.
                                                                  √ ˆ
Theorem 6 We assume E[h²] < ∞. When p ≠ q, then √m (D̂(X, Y)² − D(p, q)²) converges in distribution to a zero-mean Gaussian with variance
      σ_u² = 4 ( E_z[ (E_{z′} h(z, z′))² ] − (E_{z,z′} h(z, z′))² ).
When p = q, then m (D̂(X, Y)² − D(p, q)²) = m D̂(X, Y)² converges in distribution to
      Σ_{l=1}^∞ λ_l (q_l² − 2),                                                 (2)

where q_l ∼ N(0, 2) i.i.d., the λ_i are the solutions to the eigenvalue equation
      ∫_X k̃(x, x′) ψ_i(x) dp(x) = λ_i ψ_i(x′),
and k̃(x_i, x_j) := k(x_i, x_j) − E_x k(x_i, x) − E_x k(x, x_j) + E_{x,x′} k(x, x′) is the centred RKHS kernel.

Application 2: Dependence Measures

Assume that (x, y) are drawn from pxy , with marginals px, py .

Want to know whether pxy factorizes.
[2, 16]: kernel generalized variance

[20, 21]: kernel constrained covariance, HSIC


Main idea [25, 34]:
x and y independent ⇐⇒ ∀ bounded continuous functions f, g,
we have Cov(f (x), g(y)) = 0.


k kernel on X × Y.

      µ(pxy) := E_{(x,y)∼pxy}[ k((x, y), ·) ]
      µ(px × py) := E_{x∼px, y∼py}[ k((x, y), ·) ].

Use ∆ := ‖µ(pxy) − µ(px × py)‖ as a measure of dependence.

For k((x, y), (x′, y′)) = k_x(x, x′) k_y(y, y′):
∆² equals the Hilbert-Schmidt norm of the covariance operator between the two RKHSs (HSIC), with empirical estimate m^{−2} tr(H Kx H Ky), where H = I − (1/m) 1 1^⊤ [20, 44].
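The empirical HSIC estimate is a one-liner on Gram matrices. The sketch below is an added illustration (not from the slides), with Gaussian kernels and an arbitrary bandwidth; it contrasts a nonlinearly dependent pair with an independent one.

```python
import numpy as np

def gauss_gram(A, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :] - 2 * A @ A.T
    return np.exp(-d2 / (2 * sigma**2))

def hsic(x, y, sigma=1.0):
    # empirical HSIC: m^(-2) tr(H Kx H Ky), with H = I - (1/m) 1 1^T
    m = len(x)
    Kx, Ky = gauss_gram(x, sigma), gauss_gram(y, sigma)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(H @ Kx @ H @ Ky) / m**2

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
y_dep = x**2 + 0.1 * rng.normal(size=(300, 1))   # nonlinearly dependent on x
y_ind = rng.normal(size=(300, 1))                # independent of x
print(hsic(x, y_dep), hsic(x, y_ind))            # noticeably larger for the dependent pair
```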

Witness function of the equivalent optimisation problem:
[Figure: "Dependence witness and sample": a two-dimensional sample in the (X, Y) plane overlaid with a colour map of the dependence witness function.]
Application: learning causal structures (Sun et al., ICML 2007; Fukumizu et al., NIPS 2007)
Application 3: Covariate Shift Correction and Local
Learning

training set X = {(x_1, y_1), . . . , (x_m, y_m)} drawn from p,
test set X′ = {(x′_1, y′_1), . . . , (x′_n, y′_n)} from p′ ≠ p.

Assume p_{y|x} = p′_{y|x}.

[40]: reweight training set




Minimize
      ‖ Σ_{i=1}^m β_i k(x_i, ·) − µ(X′) ‖² + λ ‖β‖²   subject to   β_i ≥ 0,  Σ_i β_i = 1.
Equivalent QP:
      minimize_β   (1/2) β^⊤ (K + λ1) β − β^⊤ l
      subject to   β_i ≥ 0  and  Σ_i β_i = 1,
where K_ij := k(x_i, x_j) and l_i = ⟨k(x_i, ·), µ(X′)⟩.
Experiments show that in underspecified situations (e.g., large ker-
nel widths), this helps [23].
X ′ = x′ leads to a local sample weighting scheme.
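A sketch of this reweighting (kernel mean matching) using SciPy's SLSQP solver for the constrained QP; this is an added illustration, not the original implementation, and the Gaussian bandwidth and λ are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize

def gauss_gram(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def kmm_weights(X_train, X_test, sigma=1.0, lam=1e-3):
    m = len(X_train)
    K = gauss_gram(X_train, X_train, sigma)               # K_ij = k(x_i, x_j)
    l = gauss_gram(X_train, X_test, sigma).mean(axis=1)   # l_i = <k(x_i,.), mu(X')>
    objective = lambda b: 0.5 * b @ (K + lam * np.eye(m)) @ b - b @ l
    constraints = [{"type": "eq", "fun": lambda b: b.sum() - 1.0}]
    bounds = [(0.0, None)] * m                            # beta_i >= 0
    res = minimize(objective, np.full(m, 1.0 / m), bounds=bounds,
                   constraints=constraints, method="SLSQP")
    return res.x

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(100, 1))
X_test = rng.normal(1.0, 0.7, size=(100, 1))              # shifted test distribution
beta = kmm_weights(X_train, X_test)
print(beta.sum(), X_train[np.argsort(-beta)[:5]].ravel()) # largest weights near the test mass
```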
The Representer Theorem

Theorem 7 Given: a p.d. kernel k on X × X, a training set
(x1, y1), . . . , (xm, ym) ∈ X × R, a strictly monotonic increasing
real-valued function Ω on [0, ∞[, and an arbitrary cost function
c : (X × R²)^m → R ∪ {∞}.
Any f ∈ H_k minimizing the regularized risk functional
      c((x_1, y_1, f(x_1)), . . . , (x_m, y_m, f(x_m))) + Ω(‖f‖)               (3)
admits a representation of the form
      f(·) = Σ_{i=1}^m α_i k(x_i, ·).



Remarks

• significance: many learning algorithms have solutions that can
  be expressed as expansions in terms of the training examples
• original form, with mean squared loss
      c((x_1, y_1, f(x_1)), . . . , (x_m, y_m, f(x_m))) = (1/m) Σ_{i=1}^m (y_i − f(x_i))²,
  and Ω(‖f‖) = λ‖f‖² (λ > 0): [27]  (a small kernel ridge regression sketch follows after this list)
• generalization to non-quadratic cost functions: [10]
• present form: [36]
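For the original squared-loss form, the representer theorem reduces the problem to solving for the expansion coefficients α. The sketch below is an added illustration, assuming the standard closed form α = (K + λmI)^{−1} y for the objective (1/m) Σ_i (y_i − f(x_i))² + λ‖f‖²; the Gaussian bandwidth and λ are arbitrary.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def kernel_ridge_fit(X, y, sigma=1.0, lam=0.01):
    # representer theorem: f(.) = sum_i alpha_i k(x_i, .); squared loss gives a linear system
    m = len(X)
    K = gauss_gram(X, X, sigma)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def kernel_ridge_predict(alpha, X_train, X_new, sigma=1.0):
    return gauss_gram(X_new, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=60)
alpha = kernel_ridge_fit(X, y)
X_grid = np.linspace(-3, 3, 5).reshape(-1, 1)
print(kernel_ridge_predict(alpha, X, X_grid))   # approximates sin on the grid
```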



Proof

Decompose f ∈ H into a part in the span of the k(xi, .) and an
orthogonal one:
      f = Σ_i α_i k(x_i, ·) + f_⊥,
where for all j
      ⟨f_⊥, k(x_j, ·)⟩ = 0.
Application of f to an arbitrary training point x_j yields
      f(x_j) = ⟨f, k(x_j, ·)⟩
             = ⟨ Σ_i α_i k(x_i, ·) + f_⊥, k(x_j, ·) ⟩
             = Σ_i α_i ⟨k(x_i, ·), k(x_j, ·)⟩,
independent of f_⊥.
Proof: second part of (3)

Since f_⊥ is orthogonal to Σ_i α_i k(x_i, ·), and Ω is strictly monotonic, we get
      Ω(‖f‖) = Ω( ‖ Σ_i α_i k(x_i, ·) + f_⊥ ‖ )
             = Ω( √( ‖ Σ_i α_i k(x_i, ·) ‖² + ‖f_⊥‖² ) )
             ≥ Ω( ‖ Σ_i α_i k(x_i, ·) ‖ ),                                      (4)
with equality occurring if and only if f_⊥ = 0.
Hence, any minimizer must have f_⊥ = 0. Consequently, any solution takes the form
      f = Σ_i α_i k(x_i, ·).
Application: Support Vector Classification

Here, yi ∈ {±1}. Use
      c((x_i, y_i, f(x_i))_i) = (1/λ) Σ_i max(0, 1 − y_i f(x_i)),
and the regularizer Ω(‖f‖) = ‖f‖².
λ → 0 leads to the hard margin SVM.
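One simple way to minimize this regularized hinge loss within the span given by the representer theorem is (sub)gradient descent on the coefficients α. The sketch below is an added illustration (not the convex QP solver used for SVMs in practice); the kernel bandwidth, λ, learning rate, and step count are arbitrary choices.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def kernel_svm_subgradient(X, y, sigma=1.0, lam=0.1, lr=1e-3, steps=2000):
    # minimize (1/lam) * sum_i max(0, 1 - y_i f(x_i)) + ||f||^2, f = sum_j alpha_j k(x_j, .)
    m = len(X)
    K = gauss_gram(X, X, sigma)
    alpha = np.zeros(m)
    for _ in range(steps):
        margins = y * (K @ alpha)
        active = margins < 1                          # points violating the margin
        grad = -(1.0 / lam) * (K[:, active] @ y[active]) + 2.0 * K @ alpha
        alpha -= lr * grad
    return alpha

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.7, (50, 2)), rng.normal(1, 0.7, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])
alpha = kernel_svm_subgradient(X, y)
print((np.sign(gauss_gram(X, X) @ alpha) == y).mean())   # training accuracy on toy data
```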




Further Applications

Bayesian MAP Estimates. Identify (3) with the negative log posterior (cf. Kimeldorf & Wahba, 1970, Poggio & Girosi, 1990), i.e.,
• exp(−c((x_i, y_i, f(x_i))_i)) — likelihood of the data
• exp(−Ω(‖f‖)) — prior over the set of functions; e.g., Ω(‖f‖) = λ‖f‖² — Gaussian process prior [55] with covariance function k
• minimizer of (3) = MAP estimate
Kernel PCA (see below) can be shown to correspond to the case of
      c((x_i, y_i, f(x_i))_{i=1,...,m}) = 0   if  (1/m) Σ_i ( f(x_i) − (1/m) Σ_j f(x_j) )² = 1,
                                          ∞   otherwise,
with the regularizer Ω an arbitrary strictly monotonically increasing function.
Conclusion

• the kernel corresponds to
  – a similarity measure for the data, or
  – a (linear) representation of the data, or
  – a hypothesis space for learning,
• kernels allow the formulation of a multitude of geometrical algo-
  rithms (Parzen windows, 2-sample tests, SVMs, kernel PCA,...)




Kernel PCA                                                                                                               [37]


[Figure: linear PCA in R² with kernel k(x, y) = (x · y) (top) versus kernel PCA with k(x, y) = (x · y)^d (bottom); the map Φ takes the data from input space into the feature space H associated with k, where linear PCA is performed.]
Kernel PCA, II

      x_1, . . . , x_m ∈ X,   Φ : X → H,   C = (1/m) Σ_{j=1}^m Φ(x_j) Φ(x_j)^⊤
Eigenvalue problem
      λV = CV = (1/m) Σ_{j=1}^m ⟨Φ(x_j), V⟩ Φ(x_j).
For λ ≠ 0, V ∈ span{Φ(x_1), . . . , Φ(x_m)}, thus
      V = Σ_{i=1}^m α_i Φ(x_i),
and the eigenvalue problem can be written as
      λ ⟨Φ(x_n), V⟩ = ⟨Φ(x_n), CV⟩   for all n = 1, . . . , m
Kernel PCA in Dual Variables

In terms of the m × m Gram matrix
      K_ij := ⟨Φ(x_i), Φ(x_j)⟩ = k(x_i, x_j),
this leads to
      mλKα = K²α,
where α = (α_1, . . . , α_m)^⊤.
Solve
      mλα = Kα
−→ (λ_n, α^n)

      ⟨V^n, V^n⟩ = 1 ⇐⇒ λ_n ⟨α^n, α^n⟩ = 1,
thus divide α^n by √λ_n
Feature extraction

Compute projections onto the eigenvectors
      V^n = Σ_{i=1}^m α_i^n Φ(x_i)
in H:
for a test point x with image Φ(x) in H we get the features
      ⟨V^n, Φ(x)⟩ = Σ_{i=1}^m α_i^n ⟨Φ(x_i), Φ(x)⟩
                  = Σ_{i=1}^m α_i^n k(x_i, x)
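A compact NumPy sketch of these steps (an added illustration, not the original code): build the Gram matrix, solve the eigenvalue problem, normalize the α^n so that ⟨V^n, V^n⟩ = 1, and extract features Σ_i α_i^n k(x_i, x). Centering in feature space is omitted here for brevity, and the bandwidth is arbitrary.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def kernel_pca(X, n_components=2, sigma=1.0):
    K = gauss_gram(X, X, sigma)
    eigvals, eigvecs = np.linalg.eigh(K)                  # K alpha = (m lambda) alpha
    order = np.argsort(eigvals)[::-1][:n_components]      # largest eigenvalues first
    return eigvecs[:, order] / np.sqrt(eigvals[order])    # normalize so <V^n, V^n> = 1

def kpca_features(alphas, X_train, X_new, sigma=1.0):
    # feature n of a point x: sum_i alpha_i^n k(x_i, x)
    return gauss_gram(X_new, X_train, sigma) @ alphas

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 100)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(100, 2))  # noisy circle
alphas = kernel_pca(X, n_components=2, sigma=0.5)
print(kpca_features(alphas, X, X[:3], sigma=0.5))
```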
The Kernel PCA Map

Recall
      Φ^w_m : X → R^m
              x ↦ K^{−1/2} (k(x_1, x), . . . , k(x_m, x))^⊤
If K = U D U^⊤ is K's diagonalization, then K^{−1/2} = U D^{−1/2} U^⊤. Thus we have
      Φ^w_m(x) = U D^{−1/2} U^⊤ (k(x_1, x), . . . , k(x_m, x))^⊤.
We can drop the leading U (since it leaves the dot product invariant) to get a map
      Φ^w_{KPCA}(x) = D^{−1/2} U^⊤ (k(x_1, x), . . . , k(x_m, x))^⊤.
The rows of U^⊤ are the eigenvectors α^n of K, and the entries of the diagonal matrix D^{−1/2} equal λ_i^{−1/2}.
Toy Example with Gaussian Kernel


k(x, x′) = exp(−‖x − x′‖²)




Super-Resolution                                          (Kim, Franz, & Schölkopf, 2004)

[Figure:
 a. original image of resolution 528 × 396;
 b. low resolution image (264 × 198) stretched to the original scale;
 c. bicubic interpolation;
 d. supervised example-based learning based on nearest neighbor classifier;
 f. unsupervised KPCA reconstruction;
 g. enlarged portions of a–d and f (from left to right).]




Comparison between different super-resolution methods.
Support Vector Classifiers

[Figure: data from two classes in input space, mapped by Φ into a feature space where a linear separation is possible. [6]]

Separating Hyperplane


[Figure: a separating hyperplane {x | ⟨w, x⟩ + b = 0} with normal vector w; points from one class satisfy ⟨w, x⟩ + b > 0, points from the other ⟨w, x⟩ + b < 0.]


Optimal Separating Hyperplane                                             [50]




[Figure: the optimal separating hyperplane {x | ⟨w, x⟩ + b = 0} with normal vector w, maximizing the margin to the closest training points of both classes.]


Eliminating the Scaling Freedom                                         [47]

Note: if c ≠ 0, then
      {x | ⟨w, x⟩ + b = 0} = {x | ⟨cw, x⟩ + cb = 0}.
Hence (cw, cb) describes the same hyperplane as (w, b).
Definition: The hyperplane is in canonical form w.r.t. X* = {x_1, . . . , x_r} if min_{x_i∈X*} |⟨w, x_i⟩ + b| = 1.




Canonical Optimal Hyperplane


[Figure: a canonical hyperplane {x | ⟨w, x⟩ + b = 0} with the margin hyperplanes {x | ⟨w, x⟩ + b = +1} and {x | ⟨w, x⟩ + b = −1} passing through the closest points x_1 (y_i = +1) and x_2 (y_i = −1).]
Note:
      ⟨w, x_1⟩ + b = +1
      ⟨w, x_2⟩ + b = −1
  =⇒  ⟨w, x_1 − x_2⟩ = 2
  =⇒  ⟨ w/‖w‖, x_1 − x_2 ⟩ = 2/‖w‖



Canonical Hyperplanes                                                   [47]

Note: if c ≠ 0, then
      {x | ⟨w, x⟩ + b = 0} = {x | ⟨cw, x⟩ + cb = 0}.
Hence (cw, cb) describes the same hyperplane as (w, b).
Definition: The hyperplane is in canonical form w.r.t. X* = {x_1, . . . , x_r} if min_{x_i∈X*} |⟨w, x_i⟩ + b| = 1.

Note that for canonical hyperplanes, the distance of the closest point to the hyperplane ("margin") is 1/‖w‖:
      min_{x_i∈X*} | ⟨ w/‖w‖, x_i ⟩ + b/‖w‖ | = 1/‖w‖.

Theorem 8 (Vapnik [46]) Consider hyperplanes ⟨w, x⟩ = 0 where w is normalized such that they are in canonical form w.r.t. a set of points X* = {x_1, . . . , x_r}, i.e.,
      min_{i=1,...,r} |⟨w, x_i⟩| = 1.
The set of decision functions f_w(x) = sgn⟨x, w⟩ defined on X* and satisfying the constraint ‖w‖ ≤ Λ has a VC dimension satisfying
      h ≤ R²Λ².
Here, R is the radius of the smallest sphere around the origin containing X*.

[Figure: points inside a sphere of radius R around the origin, shown with two different margins γ_1 and γ_2.]
Proof Strategy (Gurvits, 1997)

Assume that x_1, . . . , x_r are shattered by canonical hyperplanes with ‖w‖ ≤ Λ, i.e., for all y_1, . . . , y_r ∈ {±1},
      y_i ⟨w, x_i⟩ ≥ 1   for all i = 1, . . . , r.                              (5)
Two steps:
• prove that the more points we want to shatter (5), the larger ‖ Σ_{i=1}^r y_i x_i ‖ must be
• upper bound the size of ‖ Σ_{i=1}^r y_i x_i ‖ in terms of R
Combining the two tells us how many points we can at most shatter.


Part I

Summing (5) over i = 1, . . . , r yields
      ⟨ w, Σ_{i=1}^r y_i x_i ⟩ ≥ r.
By the Cauchy-Schwarz inequality, on the other hand, we have
      ⟨ w, Σ_{i=1}^r y_i x_i ⟩ ≤ ‖w‖ ‖ Σ_{i=1}^r y_i x_i ‖ ≤ Λ ‖ Σ_{i=1}^r y_i x_i ‖.
Combine both:
      r/Λ ≤ ‖ Σ_{i=1}^r y_i x_i ‖.                                              (6)
Part II

Consider independent random labels y_i ∈ {±1}, uniformly distributed (Rademacher variables).
      E[ ‖ Σ_{i=1}^r y_i x_i ‖² ] = E[ ⟨ Σ_{i=1}^r y_i x_i, Σ_{j=1}^r y_j x_j ⟩ ]
            = Σ_{i=1}^r E[ ⟨ y_i x_i, Σ_{j≠i} y_j x_j + y_i x_i ⟩ ]
            = Σ_{i=1}^r ( Σ_{j≠i} E[ ⟨y_i x_i, y_j x_j⟩ ] + E[ ⟨y_i x_i, y_i x_i⟩ ] )
            = Σ_{i=1}^r E[ ‖y_i x_i‖² ] = Σ_{i=1}^r ‖x_i‖²
(the cross terms vanish since E[y_i y_j] = 0 for i ≠ j)
Part II, ctd.

Since ‖x_i‖ ≤ R, we get
      E[ ‖ Σ_{i=1}^r y_i x_i ‖² ] ≤ rR².

• This holds for the expectation over the random choices of the labels, hence there must be at least one set of labels for which it also holds true. Use this set.
Hence
      ‖ Σ_{i=1}^r y_i x_i ‖² ≤ rR².
Part I and II Combined

Part I:   (r/Λ)² ≤ ‖ Σ_{i=1}^r y_i x_i ‖²
Part II:  ‖ Σ_{i=1}^r y_i x_i ‖² ≤ rR²

Hence
      r²/Λ² ≤ rR²,
i.e.,
      r ≤ R²Λ²,
completing the proof.



Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning
Introduction to Machine Learning

Más contenido relacionado

La actualidad más candente

Mining Uncertain Data (Sebastiaan van Schaaik)
Mining Uncertain Data (Sebastiaan van Schaaik)Mining Uncertain Data (Sebastiaan van Schaaik)
Mining Uncertain Data (Sebastiaan van Schaaik)
timfu
 
Machine Learning: Some theoretical and practical problems
Machine Learning: Some theoretical and practical problemsMachine Learning: Some theoretical and practical problems
Machine Learning: Some theoretical and practical problems
butest
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
butest
 

La actualidad más candente (14)

MLIP - Chapter 5 - Detection, Segmentation, Captioning
MLIP - Chapter 5 - Detection, Segmentation, CaptioningMLIP - Chapter 5 - Detection, Segmentation, Captioning
MLIP - Chapter 5 - Detection, Segmentation, Captioning
 
Tutorial SUM 2012: some of the things you wanted to know about uncertainty (b...
Tutorial SUM 2012: some of the things you wanted to know about uncertainty (b...Tutorial SUM 2012: some of the things you wanted to know about uncertainty (b...
Tutorial SUM 2012: some of the things you wanted to know about uncertainty (b...
 
MLIP - Chapter 2 - Preliminaries to deep learning
MLIP - Chapter 2 - Preliminaries to deep learningMLIP - Chapter 2 - Preliminaries to deep learning
MLIP - Chapter 2 - Preliminaries to deep learning
 
Hussain Learning Relevant Eye Movement Feature Spaces Across Users
Hussain Learning Relevant Eye Movement Feature Spaces Across UsersHussain Learning Relevant Eye Movement Feature Spaces Across Users
Hussain Learning Relevant Eye Movement Feature Spaces Across Users
 
Hoip10 presentación seguimiento de objetos_vicomtech
Hoip10 presentación seguimiento de objetos_vicomtechHoip10 presentación seguimiento de objetos_vicomtech
Hoip10 presentación seguimiento de objetos_vicomtech
 
Extracting Proximity for Brain Graph Voxel Classification
Extracting Proximity for Brain Graph Voxel ClassificationExtracting Proximity for Brain Graph Voxel Classification
Extracting Proximity for Brain Graph Voxel Classification
 
Mining Uncertain Data (Sebastiaan van Schaaik)
Mining Uncertain Data (Sebastiaan van Schaaik)Mining Uncertain Data (Sebastiaan van Schaaik)
Mining Uncertain Data (Sebastiaan van Schaaik)
 
Free lunch for few shot learning distribution calibration
Free lunch for few shot learning distribution calibrationFree lunch for few shot learning distribution calibration
Free lunch for few shot learning distribution calibration
 
Spectral methods for linear systems with random inputs
Spectral methods for linear systems with random inputsSpectral methods for linear systems with random inputs
Spectral methods for linear systems with random inputs
 
VBPR 1st seminar
VBPR 1st seminarVBPR 1st seminar
VBPR 1st seminar
 
Machine Learning: Some theoretical and practical problems
Machine Learning: Some theoretical and practical problemsMachine Learning: Some theoretical and practical problems
Machine Learning: Some theoretical and practical problems
 
Chapter 1 - Introduction
Chapter 1 - IntroductionChapter 1 - Introduction
Chapter 1 - Introduction
 
Multitask learning for GGM
Multitask learning for GGMMultitask learning for GGM
Multitask learning for GGM
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 

Similar a Introduction to Machine Learning

On the value of stochastic analysis for software engineering
On the value of stochastic analysis for software engineeringOn the value of stochastic analysis for software engineering
On the value of stochastic analysis for software engineering
CS, NcState
 
Johan Suykens: "Models from Data: a Unifying Picture"
Johan Suykens: "Models from Data: a Unifying Picture" Johan Suykens: "Models from Data: a Unifying Picture"
Johan Suykens: "Models from Data: a Unifying Picture"
ieee_cis_cyprus
 
05 history of cv a machine learning (theory) perspective on computer vision
05  history of cv a machine learning (theory) perspective on computer vision05  history of cv a machine learning (theory) perspective on computer vision
05 history of cv a machine learning (theory) perspective on computer vision
zukun
 
15_wk4_unsupervised-learning_manifold-EM-cs365-2014.pdf
15_wk4_unsupervised-learning_manifold-EM-cs365-2014.pdf15_wk4_unsupervised-learning_manifold-EM-cs365-2014.pdf
15_wk4_unsupervised-learning_manifold-EM-cs365-2014.pdf
McSwathi
 
The Automated-Reasoning Revolution: from Theory to Practice and Back
The Automated-Reasoning Revolution: from Theory to Practice and BackThe Automated-Reasoning Revolution: from Theory to Practice and Back
The Automated-Reasoning Revolution: from Theory to Practice and Back
Moshe Vardi
 
Fcv rep todorovic
Fcv rep todorovicFcv rep todorovic
Fcv rep todorovic
zukun
 
Introduction
IntroductionIntroduction
Introduction
butest
 
Chapter 3 projection
Chapter 3 projectionChapter 3 projection
Chapter 3 projection
NBER
 
Structured regression for efficient object detection
Structured regression for efficient object detectionStructured regression for efficient object detection
Structured regression for efficient object detection
zukun
 
. An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic .... An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic ...
butest
 
Generalizing Scientific Machine Learning and Differentiable Simulation Beyond...
Generalizing Scientific Machine Learning and Differentiable Simulation Beyond...Generalizing Scientific Machine Learning and Differentiable Simulation Beyond...
Generalizing Scientific Machine Learning and Differentiable Simulation Beyond...
Chris Rackauckas
 

Similar a Introduction to Machine Learning (20)

16 17 bag_words
16 17 bag_words16 17 bag_words
16 17 bag_words
 
On the value of stochastic analysis for software engineering
On the value of stochastic analysis for software engineeringOn the value of stochastic analysis for software engineering
On the value of stochastic analysis for software engineering
 
Johan Suykens: "Models from Data: a Unifying Picture"
Johan Suykens: "Models from Data: a Unifying Picture" Johan Suykens: "Models from Data: a Unifying Picture"
Johan Suykens: "Models from Data: a Unifying Picture"
 
05 history of cv a machine learning (theory) perspective on computer vision
05  history of cv a machine learning (theory) perspective on computer vision05  history of cv a machine learning (theory) perspective on computer vision
05 history of cv a machine learning (theory) perspective on computer vision
 
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof..."Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
 
15_wk4_unsupervised-learning_manifold-EM-cs365-2014.pdf
15_wk4_unsupervised-learning_manifold-EM-cs365-2014.pdf15_wk4_unsupervised-learning_manifold-EM-cs365-2014.pdf
15_wk4_unsupervised-learning_manifold-EM-cs365-2014.pdf
 
The Automated-Reasoning Revolution: from Theory to Practice and Back
The Automated-Reasoning Revolution: from Theory to Practice and BackThe Automated-Reasoning Revolution: from Theory to Practice and Back
The Automated-Reasoning Revolution: from Theory to Practice and Back
 
Fcv rep todorovic
Fcv rep todorovicFcv rep todorovic
Fcv rep todorovic
 
Introduction
IntroductionIntroduction
Introduction
 
Chapter 3 projection
Chapter 3 projectionChapter 3 projection
Chapter 3 projection
 
Structured regression for efficient object detection
Structured regression for efficient object detectionStructured regression for efficient object detection
Structured regression for efficient object detection
 
. An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic .... An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic ...
 
Paper reading best of both world
Paper reading best of both worldPaper reading best of both world
Paper reading best of both world
 
Methods of Manifold Learning for Dimension Reduction of Large Data Sets
Methods of Manifold Learning for Dimension Reduction of Large Data SetsMethods of Manifold Learning for Dimension Reduction of Large Data Sets
Methods of Manifold Learning for Dimension Reduction of Large Data Sets
 
My 2hr+ survey talk at the Vector Institute, on our deep learning theorems.
My 2hr+ survey talk at the Vector Institute, on our deep learning theorems.My 2hr+ survey talk at the Vector Institute, on our deep learning theorems.
My 2hr+ survey talk at the Vector Institute, on our deep learning theorems.
 
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
 
Generalizing Scientific Machine Learning and Differentiable Simulation Beyond...
Generalizing Scientific Machine Learning and Differentiable Simulation Beyond...Generalizing Scientific Machine Learning and Differentiable Simulation Beyond...
Generalizing Scientific Machine Learning and Differentiable Simulation Beyond...
 
Machine Learning ebook.pdf
Machine Learning ebook.pdfMachine Learning ebook.pdf
Machine Learning ebook.pdf
 
1_5_AI_edx_ml_51intro_240204_104838machine learning lecture 1
1_5_AI_edx_ml_51intro_240204_104838machine learning lecture 11_5_AI_edx_ml_51intro_240204_104838machine learning lecture 1
1_5_AI_edx_ml_51intro_240204_104838machine learning lecture 1
 
Principal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty DetectionPrincipal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty Detection
 

Último

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Introduction to Machine Learning

  • 1. Introduction to Machine Learning Bernhard Schölkopf Empirical Inference Department Max Planck Institute for Intelligent Systems Tübingen, Germany http://www.tuebingen.mpg.de/bs 1
  • 2. Empirical Inference • Drawing conclusions from empirical data (observations, measurements) • Example 1: scientific inference y = Σi ai k(x,xi) + b x y y=a*x x x x x x x x x x Leibniz, Weyl, Chaitin 2
  • 3. Empirical Inference • Drawing conclusions from empirical data (observations, measurements) • Example 1: scientific inference “If your experiment needs statistics [inference], you ought to have done a better experiment.” (Rutherford) 3
  • 4. Empirical Inference, II • Example 2: perception “The brain is nothing but a statistical decision organ” (H. Barlow) 4
  • 5. Hard Inference Problems Sonnenburg, Rätsch, Schäfer, Schölkopf, 2006, Journal of Machine Learning Research Task: classify human DNA sequence locations into {acceptor splice site, decoy} using 15 Million sequences of length 141, and a Multiple-Kernel Support Vector Machines. PRC = Precision-Recall-Curve, fraction of correct positive predictions among all positively predicted cases • High dimensionality – consider many factors simultaneously to find the regularity • Complex regularities – nonlinear, nonstationary, etc. • Little prior knowledge – e.g., no mechanistic models for the data • Need large data sets – processing requires computers and automatic inference methods 5
  • 6. Hard Inference Problems, II • We can solve scientific inference problems that humans can’t solve • Even if it’s just because of data set size / dimensionality, this is a quantum leap 6
  • 7. Generalization (thanks to O. Bousquet) • observe 1, 2, 4, 7, … • What's next? (+1, +2, +3) • 1,2,4,7,11,16,…: a_{n+1} = a_n + n ("lazy caterer's sequence") • 1,2,4,7,12,20,…: a_{n+2} = a_{n+1} + a_n + 1 • 1,2,4,7,13,24,…: "Tribonacci" sequence • 1,2,4,7,14,28: set of divisors of 28 • 1,2,4,7,1,1,5,…: decimal expansions of π = 3.14159… and e = 2.718… interleaved • The On-Line Encyclopedia of Integer Sequences: >600 hits… 7
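A small side sketch (not part of the original slides) that makes the point concrete: the rules listed above are easy to implement, and all of them reproduce the observed prefix 1, 2, 4, 7 before diverging. Function names are mine.

    # Generate several continuations of 1, 2, 4, 7 to illustrate that the
    # observed prefix alone does not determine the underlying rule.
    def lazy_caterer(n):                  # a_{k+1} = a_k + k
        seq = [1]
        for k in range(1, n):
            seq.append(seq[-1] + k)
        return seq

    def recurrence(n):                    # a_{k+2} = a_{k+1} + a_k + 1
        seq = [1, 2]
        while len(seq) < n:
            seq.append(seq[-1] + seq[-2] + 1)
        return seq

    def tribonacci_like(n):               # each term is the sum of the three before it
        seq = [1, 2, 4]
        while len(seq) < n:
            seq.append(seq[-1] + seq[-2] + seq[-3])
        return seq

    for f in (lazy_caterer, recurrence, tribonacci_like):
        print(f.__name__, f(7))           # all start 1, 2, 4, 7, then diverge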
  • 8. Generalization, II • Question: which continuation is correct (“generalizes”)? • Answer: there’s no way to tell (“induction problem”) • Question of statistical learning theory: how to come up with a law that is (probably) correct (“demarcation problem”) (more accurately: a law that is probably as correct on the test data as it is on the training data) 8
  • 9. 2-class classification Learn based on m observations generated from some Goal: minimize expected error (“risk”) V. Vapnik Problem: P is unknown. Induction principle: minimize training error (“empirical risk”) over some class of functions. Q: is this “consistent”? 9
  • 10. The law of large numbers For all and Does this imply “consistency” of empirical risk minimization (optimality in the limit)? No – need a uniform law of large numbers: For all 10
  • 11. Consistency and uniform convergence
  • 12. -> LaTeX 12 Bernhard Schölkopf Empirical Inference Department Tübingen , 03 October 2011
  • 13. Support Vector Machines [figure: classes 1 and 2 mapped by Φ into a feature space, where k(x,x′) = ⟨Φ(x),Φ(x′)⟩] • sparse expansion of solution in terms of SVs (Boser, Guyon, Vapnik 1992): representer theorem (Kimeldorf & Wahba 1971, Schölkopf et al. 2000) • unique solution found by convex QP Bernhard Schölkopf, 03 October 2011 13
  • 14. Support Vector Machines [figure: classes 1 and 2 mapped by Φ into a feature space, where k(x,x′) = ⟨Φ(x),Φ(x′)⟩] • sparse expansion of solution in terms of SVs (Boser, Guyon, Vapnik 1992): representer theorem (Kimeldorf & Wahba 1971, Schölkopf et al. 2000) • unique solution found by convex QP Bernhard Schölkopf, 03 October 2011 14
  • 15. Applications in Computational Geometry / Graphics Steinke, Walder, Blanz et al., Eurographics’05, ’06, ‘08, ICML ’05,‘08, NIPS ’07 15
  • 16. Max-Planck-Institut für biologische Kybernetik Bernhard Schölkopf, Tübingen, FIFA World Cup – Germany vs. England, June 27, 2010 3. Oktober 2011 16
  • 17. Kernel Quiz 17 Bernhard Schölkopf Empirical Inference Department Tübingen , 03 October 2011
  • 18. Kernel Methods Bernhard Sch¨lkopf o Max Planck Institute for Intelligent Systems B. Sch¨lkopf, MLSS France 2011 o
  • 19. Statistical Learning Theory 1. started by Vapnik and Chervonenkis in the Sixties 2. model: we observe data generated by an unknown stochastic regularity 3. learning = extraction of the regularity from the data 4. the analysis of the learning problem leads to notions of capacity of the function classes that a learning machine can implement. 5. support vector machines use a particular type of function class: classifiers with large “margins” in a feature space induced by a kernel. [47, 48] B. Sch¨lkopf, MLSS France 2011 o
  • 20. Example: Regression Estimation y x • Data: input-output pairs (xi, yi) ∈ R × R • Regularity: (x1, y1), . . . (xm, ym) drawn from P(x, y) • Learning: choose a function f : R → R such that the error, averaged over P, is minimized. • Problem: P is unknown, so the average cannot be computed — need an “induction principle”
  • 21. Pattern Recognition Learn f : X → {±1} from examples (x_1, y_1), . . . , (x_m, y_m) ∈ X × {±1}, generated i.i.d. from P(x, y), such that the expected misclassification error on a test set, also drawn from P(x, y), R[f] = ∫ (1/2)|f(x) − y| dP(x, y), is minimal (Risk Minimization (RM)). Problem: P is unknown. −→ need an induction principle. Empirical risk minimization (ERM): replace the average over P(x, y) by an average over the training sample, i.e. minimize the training error R_emp[f] = (1/m) Σ_{i=1}^m (1/2)|f(x_i) − y_i|. B. Schölkopf, MLSS France 2011
  • 22. Convergence of Means to Expectations Law of large numbers: Remp[f ] → R[f ] as m → ∞. Does this imply that empirical risk minimization will give us the optimal result in the limit of infinite sample size (“consistency” of empirical risk minimization)? No. Need a uniform version of the law of large numbers. Uniform over all functions that the learning machine can implement. B. Sch¨lkopf, MLSS France 2011 o
  • 23. Consistency and Uniform Convergence R Risk Remp Remp [f] R[f] f f opt fm Function class B. Sch¨lkopf, MLSS France 2011 o
  • 24. The Importance of the Set of Functions What about allowing all functions from X to {±1}? Training set (x_1, y_1), . . . , (x_m, y_m) ∈ X × {±1}, test patterns x̄_1, . . . , x̄_m ∈ X such that {x̄_1, . . . , x̄_m} ∩ {x_1, . . . , x_m} = ∅. For any f there exists f* s.t.: 1. f*(x_i) = f(x_i) for all i, 2. f*(x̄_j) = −f(x̄_j) for all j. Based on the training set alone, there is no means of choosing which one is better. On the test set, however, they give opposite results. There is 'no free lunch' [24, 56]. −→ a restriction must be placed on the functions that we allow B. Schölkopf, MLSS France 2011
  • 25. Restricting the Class of Functions Two views: 1. Statistical Learning (VC) Theory: take into account the ca- pacity of the class of functions that the learning machine can implement 2. The Bayesian Way: place Prior distributions P(f ) over the class of functions B. Sch¨lkopf, MLSS France 2011 o
  • 26. Detailed Analysis • loss ξ_i := (1/2)|f(x_i) − y_i| ∈ {0, 1} • the ξ_i are independent Bernoulli trials • empirical mean (1/m) Σ_{i=1}^m ξ_i (by def: equals R_emp[f]) • expected value E[ξ] (equals R[f]) B. Schölkopf, MLSS France 2011
  • 27. Chernoff's Bound P{ |(1/m) Σ_{i=1}^m ξ_i − E[ξ]| ≥ ε } ≤ 2 exp(−2mε²) • here, P refers to the probability of getting a sample ξ_1, . . . , ξ_m with the property |(1/m) Σ_{i=1}^m ξ_i − E[ξ]| ≥ ε (it is a product measure). Useful corollary: Given a 2m-sample of Bernoulli trials, we have P{ |(1/m) Σ_{i=1}^m ξ_i − (1/m) Σ_{i=m+1}^{2m} ξ_i| ≥ ε } ≤ 4 exp(−mε²/2). B. Schölkopf, MLSS France 2011
  • 28. Chernoff's Bound, II Translate this back into machine learning terminology: the probability of obtaining an m-sample where the training error and test error differ by more than ε > 0 is bounded by P{ |R_emp[f] − R[f]| ≥ ε } ≤ 2 exp(−2mε²). • refers to one fixed f • not allowed to look at the data before choosing f, hence not suitable as a bound on the test error of a learning algorithm using empirical risk minimization B. Schölkopf, MLSS France 2011
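To make the bound tangible, here is a minimal Monte Carlo sketch (not from the slides): it compares the observed frequency of large deviations of the empirical mean of Bernoulli losses with the bound 2 exp(−2mε²). The values of m, ε and the Bernoulli parameter are illustrative choices.

    import numpy as np

    # Empirical check of the Chernoff/Hoeffding bound for Bernoulli losses:
    # P{ |mean - E[xi]| >= eps } should be dominated by 2 exp(-2 m eps^2).
    rng = np.random.default_rng(0)
    m, eps, p, trials = 200, 0.1, 0.3, 20000

    xi = rng.binomial(1, p, size=(trials, m))      # 'trials' independent m-samples
    deviation = np.abs(xi.mean(axis=1) - p)        # |empirical mean - E[xi]|
    freq = np.mean(deviation >= eps)               # observed deviation frequency
    bound = 2 * np.exp(-2 * m * eps**2)
    print(f"observed: {freq:.4f}   bound: {bound:.4f}")   # observed <= bound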
  • 29. Uniform Convergence (Vapnik & Chervonenkis) Necessary and sufficient conditions for nontrivial consistency of empirical risk minimization (ERM): One-sided convergence, uniformly over all functions that can be implemented by the learning machine. lim P { sup (R[f ] − Remp[f ]) > ǫ} = 0 m→∞ f ∈F for all ǫ > 0. • note that this takes into account the whole set of functions that can be implemented by the learning machine • this is hard to check for a learning machine Are there properties of learning machines (≡ sets of functions) which ensure uniform convergence of risk? B. Sch¨lkopf, MLSS France 2011 o
  • 30. How to Prove a VC Bound Take a closer look at P{supf ∈F (R[f ] − Remp[f ]) > ǫ}. Plan: • if the function class F contains only one function, then Cher- noff’s bound suffices: P{ sup (R[f ] − Remp[f ]) > ǫ} ≤ 2 exp(−2mǫ2). f ∈F • if there are finitely many functions, we use the ’union bound’ • even if there are infinitely many, then on any finite sample there are effectively only finitely many (use symmetrization and capacity concepts) B. Sch¨lkopf, MLSS France 2011 o
  • 31. The Case of Two Functions Suppose F = {f_1, f_2}. Rewrite P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } = P(C_ε^1 ∪ C_ε^2), where C_ε^i := {(x_1, y_1), . . . , (x_m, y_m) | (R[f_i] − R_emp[f_i]) > ε} denotes the event that the risks of f_i differ by more than ε. The RHS equals P(C_ε^1 ∪ C_ε^2) = P(C_ε^1) + P(C_ε^2) − P(C_ε^1 ∩ C_ε^2) ≤ P(C_ε^1) + P(C_ε^2). Hence by Chernoff's bound P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } ≤ P(C_ε^1) + P(C_ε^2) ≤ 2 · 2 exp(−2mε²).
  • 32. The Union Bound Similarly, if F = {f_1, . . . , f_n}, we have P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } = P(C_ε^1 ∪ · · · ∪ C_ε^n), and P(C_ε^1 ∪ · · · ∪ C_ε^n) ≤ Σ_{i=1}^n P(C_ε^i). Use Chernoff for each summand, to get an extra factor n in the bound. Note: this becomes an equality if and only if all the events C_ε^i involved are disjoint. B. Schölkopf, MLSS France 2011
  • 33. Infinite Function Classes • Note: empirical risk only refers to m points. On these points, the functions of F can take at most 2m values • for Remp, the function class thus “looks” finite • how about R? • need to use a trick B. Sch¨lkopf, MLSS France 2011 o
  • 34. Symmetrization Lemma 1 (Vapnik & Chervonenkis (e.g., [46, 12])) For mε² ≥ 2 we have P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } ≤ 2 P{ sup_{f∈F} (R_emp[f] − R′_emp[f]) > ε/2 }. Here, the first P refers to the distribution of iid samples of size m, while the second one refers to iid samples of size 2m. In the latter case, R_emp measures the loss on the first half of the sample, and R′_emp on the second half. B. Schölkopf, MLSS France 2011
  • 35. Shattering Coefficient • Hence, we only need to consider the maximum size of F on 2m points. Call it N(F, 2m). • N(F, 2m) = max. number of different outputs (y1, . . . , y2m) that the function class can generate on 2m points — in other words, the max. number of different ways the function class can separate 2m points into two classes. • N(F, 2m) ≤ 22m • if N(F, 2m) = 22m, then the function class is said to shatter 2m points. B. Sch¨lkopf, MLSS France 2011 o
  • 36. Putting Everything Together We now use (1) symmetrization, (2) the shattering coefficient, and (3) the union bound, to get P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } ≤ 2 P{ sup_{f∈F} (R_emp[f] − R′_emp[f]) > ε/2 } = 2 P{ (R_emp[f_1] − R′_emp[f_1]) > ε/2 ∨ . . . ∨ (R_emp[f_{N(F,2m)}] − R′_emp[f_{N(F,2m)}]) > ε/2 } ≤ 2 Σ_{n=1}^{N(F,2m)} P{ (R_emp[f_n] − R′_emp[f_n]) > ε/2 }. B. Schölkopf, MLSS France 2011
  • 37. ctd. Use Chernoff's bound for each term:* P{ (1/m) Σ_{i=1}^m ξ_i − (1/m) Σ_{i=m+1}^{2m} ξ_i ≥ ε } ≤ 2 exp(−mε²/2). This yields P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } ≤ 4 N(F, 2m) exp(−mε²/8). • provided that N(F, 2m) does not grow exponentially in m, this is nontrivial • such bounds are called VC type inequalities • two types of randomness: (1) the P refers to the drawing of the training examples, and (2) R[f] is an expectation over the drawing of test examples. * Note that the f_i depend on the 2m-sample. A rigorous treatment would need to use a second randomization over permutations of the 2m-sample, see [36].
  • 38. Confidence Intervals Rewrite the bound: specify the probability with which we want R to be close to R_emp, and solve for ε: With a probability of at least 1 − δ, R[f] ≤ R_emp[f] + √( (8/m) (ln N(F, 2m) + ln(4/δ)) ). This bound holds independent of f; in particular, it holds for the function f^m minimizing the empirical risk. B. Schölkopf, MLSS France 2011
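As a rough illustration (my own sketch, not from the slides), one can plug the growth-function bound ln N(F, 2m) ≤ h(ln(2m/h) + 1) from the later slides into this confidence term and watch it shrink with the sample size; the values of h and δ below are illustrative.

    import numpy as np

    # Evaluate the confidence term sqrt((8/m)(ln N(F,2m) + ln(4/delta)))
    # with ln N(F,2m) bounded by h (ln(2m/h) + 1).
    def confidence_term(m, h, delta):
        log_N = h * (np.log(2 * m / h) + 1)
        return np.sqrt(8.0 / m * (log_N + np.log(4.0 / delta)))

    for m in (100, 1000, 10000, 100000):
        print(m, round(confidence_term(m, h=10, delta=0.05), 3))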
  • 39. Discussion • tighter bounds are available (better constants etc.) • cannot minimize the bound over f • other capacity concepts can be used B. Sch¨lkopf, MLSS France 2011 o
  • 40. VC Entropy On an example (x, y), f causes a loss ξ(x, y, f(x)) = (1/2)|f(x) − y| ∈ {0, 1}. For a larger sample (x_1, y_1), . . . , (x_m, y_m), the different functions f ∈ F lead to a set of loss vectors ξ_f = (ξ(x_1, y_1, f(x_1)), . . . , ξ(x_m, y_m, f(x_m))), whose cardinality we denote by N(F, (x_1, y_1), . . . , (x_m, y_m)). The VC entropy is defined as H_F(m) = E[ln N(F, (x_1, y_1), . . . , (x_m, y_m))], where the expectation is taken over the random generation of the m-sample (x_1, y_1), . . . , (x_m, y_m) from P. H_F(m)/m → 0 ⇐⇒ uniform convergence of risks (hence consistency)
  • 41. Further PR Capacity Concepts • exchange ’E’ and ’ln’: annealed entropy. ann HF (m)/m → 0 ⇐⇒ exponentially fast uniform convergence • take ’max’ instead of ’E’: growth function. Note that GF (m) = ln N(F, m). GF (m)/m → 0 ⇐⇒ exponential convergence for all underlying distributions P. GF (m) = m · ln(2) for all m ⇐⇒ for any m, all loss vectors can be generated, i.e., the m points can be chosen such that by using functions of the learning machine, they can be separated in all 2m possible ways (shattered ). B. Sch¨lkopf, MLSS France 2011 o
  • 42. Structure of the Growth Function Either G_F(m) = m · ln(2) for all m ∈ N, or there exists some maximal m for which the above is possible. Call this number the VC-dimension, and denote it by h. For m > h, G_F(m) ≤ h (ln(m/h) + 1). Nothing "in between" linear growth and logarithmic growth is possible. B. Schölkopf, MLSS France 2011
  • 43. VC-Dimension: Example Half-spaces in R²: f(x, y) = sgn(a + bx + cy), with parameters a, b, c ∈ R • Clearly, we can shatter three non-collinear points. • But we can never shatter four points. • Hence the VC dimension is h = 3 (in this case, equal to the number of parameters) [figure: half-spaces realizing all labelings of three non-collinear points] B. Schölkopf, MLSS France 2011
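The shattering claim on this slide can be checked by brute force. The sketch below (my own, with illustrative point sets) tests, for every labeling, whether a half-space f(x) = sgn(a + b x_1 + c x_2) can realize it, by solving the linear feasibility problem y_i (a + b x_1i + c x_2i) ≥ 1 with scipy's linprog.

    import itertools
    import numpy as np
    from scipy.optimize import linprog

    def separable(points, labels):
        # feasibility of y_i (a + b*x1_i + c*x2_i) >= 1 in the variables (a, b, c)
        A_ub = -labels[:, None] * np.hstack([np.ones((len(points), 1)), points])
        b_ub = -np.ones(len(points))
        res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * 3, method="highs")
        return res.status == 0

    def shattered(points):
        return all(separable(points, np.array(lab))
                   for lab in itertools.product([-1.0, 1.0], repeat=len(points)))

    three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(shattered(three))   # True:  3 non-collinear points can be shattered
    print(shattered(four))    # False: e.g. the XOR labeling is not separable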
  • 44. A Typical Bound for Pattern Recognition For any f ∈ F and m > h, with a probability of at least 1 − δ, R[f] ≤ R_emp[f] + √( (h (log(2m/h) + 1) − log(δ/4)) / m ) holds. • does this mean that we can learn anything? • The study of the consistency of ERM has thus led to concepts and results which let us formulate another induction principle (structural risk minimization) B. Schölkopf, MLSS France 2011
  • 45. SRM error R(f* ) bound on test error capacity term training error h structure Sn−1 Sn Sn+1 B. Sch¨lkopf, MLSS France 2011 o
  • 46. Finding a Good Function Class • recall: separating hyperplanes in R2 have a VC dimension of 3. • more generally: separating hyperplanes in RN have a VC di- mension of N + 1. • hence: separating hyperplanes in high-dimensional feature spaces have extremely large VC dimension, and may not gener- alize well • however, margin hyperplanes can still have a small VC dimen- sion B. Sch¨lkopf, MLSS France 2011 o
  • 47. Kernels and Feature Spaces Preprocess the data with Φ:X → H x → Φ(x), where H is a dot product space, and learn the mapping from Φ(x) to y [6]. • usually, dim(X) ≪ dim(H) • “Curse of Dimensionality”? • crucial issue: capacity, not dimensionality B. Sch¨lkopf, MLSS France 2011 o
  • 48. Example: All Degree 2 Monomials Φ : R² → R³, (x_1, x_2) ↦ (z_1, z_2, z_3) := (x_1², √2 x_1 x_2, x_2²) [figure: data in the input space (x_1, x_2) and its image in the feature space (z_1, z_2, z_3)] B. Schölkopf, MLSS France 2011
  • 49. General Product Feature Space How about patterns x ∈ RN and product features of order d? Here, dim(H) grows like N d. E.g. N = 16 × 16, and d = 5 −→ dimension 1010 B. Sch¨lkopf, MLSS France 2011 o
  • 50. The Kernel Trick, N = d = 2 ⟨Φ(x), Φ(x′)⟩ = (x_1², √2 x_1x_2, x_2²)(x_1′², √2 x_1′x_2′, x_2′²)^⊤ = ⟨x, x′⟩² =: k(x, x′) −→ the dot product in H can be computed in R² B. Schölkopf, MLSS France 2011
  • 51. The Kernel Trick, II More generally: x, x′ ∈ R^N, d ∈ N: ⟨x, x′⟩^d = ( Σ_{j=1}^N x_j · x_j′ )^d = Σ_{j_1,...,j_d=1}^N x_{j_1} · · · x_{j_d} · x′_{j_1} · · · x′_{j_d} = ⟨Φ(x), Φ(x′)⟩, where Φ maps into the space spanned by all ordered products of d input directions B. Schölkopf, MLSS France 2011
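A short numerical check (mine, not from the slides) of the identity for the degree-2 monomial map on R²:

    import numpy as np

    # Verify <Phi(x), Phi(x')> = <x, x'>^2 for Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
    def phi(x):
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    rng = np.random.default_rng(0)
    x, xp = rng.normal(size=2), rng.normal(size=2)
    print(np.isclose(phi(x) @ phi(xp), (x @ xp) ** 2))   # True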
  • 52. Mercer's Theorem If k is a continuous kernel of a positive definite integral operator on L²(X) (where X is some compact space), ∫_X k(x, x′) f(x) f(x′) dx dx′ ≥ 0, it can be expanded as k(x, x′) = Σ_{i=1}^∞ λ_i ψ_i(x) ψ_i(x′) using eigenfunctions ψ_i and eigenvalues λ_i ≥ 0 [30]. B. Schölkopf, MLSS France 2011
  • 53. The Mercer Feature Map In that case Φ(x) := (√λ_1 ψ_1(x), √λ_2 ψ_2(x), . . .) satisfies ⟨Φ(x), Φ(x′)⟩ = k(x, x′). Proof: ⟨Φ(x), Φ(x′)⟩ = ⟨(√λ_1 ψ_1(x), √λ_2 ψ_2(x), . . .), (√λ_1 ψ_1(x′), √λ_2 ψ_2(x′), . . .)⟩ = Σ_{i=1}^∞ λ_i ψ_i(x) ψ_i(x′) = k(x, x′) B. Schölkopf, MLSS France 2011
  • 54. Positive Definite Kernels It can be shown that the admissible class of kernels coincides with the one of positive definite (pd) kernels: kernels which are symmetric (i.e., k(x, x′) = k(x′, x)), and for • any set of training points x_1, . . . , x_m ∈ X and • any a_1, . . . , a_m ∈ R satisfy Σ_{i,j} a_i a_j K_ij ≥ 0, where K_ij := k(x_i, x_j). K is called the Gram matrix or kernel matrix. If for pairwise distinct points, Σ_{i,j} a_i a_j K_ij = 0 =⇒ a = 0, call it strictly positive definite. B. Schölkopf, MLSS France 2011
  • 55. The Kernel Trick — Summary • any algorithm that only depends on dot products can benefit from the kernel trick • this way, we can apply linear methods to vectorial as well as non-vectorial data • think of the kernel as a nonlinear similarity measure • examples of common kernels: Polynomial k(x, x′) = (⟨x, x′⟩ + c)^d, Gaussian k(x, x′) = exp(−‖x − x′‖²/(2σ²)) • Kernels are also known as covariance functions [54, 52, 55, 29] B. Schölkopf, MLSS France 2011
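For concreteness, a minimal sketch (assuming NumPy; not part of the slides) of the two kernels listed above, together with a numerical check that a Gaussian Gram matrix is positive semidefinite:

    import numpy as np

    def polynomial_kernel(X, Y, c=1.0, d=3):
        return (X @ Y.T + c) ** d

    def gaussian_kernel(X, Y, sigma=1.0):
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma**2))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))
    K = gaussian_kernel(X, X)
    print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True: no (significantly) negative eigenvalues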
  • 56. Properties of PD Kernels, 1 Assumption: Φ maps X into a dot product space H; x, x′ ∈ X Kernels from Feature Maps. k(x, x′) := Φ(x), Φ(x′) is a pd kernel on X × X. Kernels from Feature Maps, II K(A, B) := x∈A,x′∈B k(x, x′), where A, B are finite subsets of X, is also a pd kernel ˜ (Hint: use the feature map Φ(A) := x∈A Φ(x)) B. Sch¨lkopf, MLSS France 2011 o
  • 57. Properties of PD Kernels, 2 [36, 39] Assumption: k, k1, k2, . . . are pd; x, x′ ∈ X k(x, x) ≥ 0 for all x (Positivity on the Diagonal) k(x, x′)2 ≤ k(x, x)k(x′, x′) (Cauchy-Schwarz Inequality) (Hint: compute the determinant of the Gram matrix) k(x, x) = 0 for all x =⇒ k(x, x′) = 0 for all x, x′ (Vanishing Diagonals) The following kernels are pd: • αk, provided α ≥ 0 • k1 + k2 • k(x, x′) := limn→∞ kn(x, x′), provided it exists • k1 · k2 • tensor products, direct sums, convolutions [22] B. Sch¨lkopf, MLSS France 2011 o
  • 58. The Feature Space for PD Kernels [4, 1, 35] • define a feature map Φ : X → RX x → k(., x). E.g., for the Gaussian kernel: Φ . . x x' Φ(x) Φ(x') Next steps: • turn Φ(X) into a linear space • endow it with a dot product satisfying Φ(x), Φ(x′) = k(x, x′), i.e., k(., x), k(., x′ ) = k(x, x′) • complete the space to get a reproducing kernel Hilbert space B. Sch¨lkopf, MLSS France 2011 o
  • 59. Turn it Into a Linear Space Form linear combinations f(·) = Σ_{i=1}^m α_i k(·, x_i), g(·) = Σ_{j=1}^{m′} β_j k(·, x_j′) (m, m′ ∈ N, α_i, β_j ∈ R, x_i, x_j′ ∈ X). B. Schölkopf, MLSS France 2011
  • 60. Endow it With a Dot Product ⟨f, g⟩ := Σ_{i=1}^m Σ_{j=1}^{m′} α_i β_j k(x_i, x_j′) = Σ_{i=1}^m α_i g(x_i) = Σ_{j=1}^{m′} β_j f(x_j′) • This is well-defined, symmetric, and bilinear (more later). • So far, it also works for non-pd kernels B. Schölkopf, MLSS France 2011
  • 61. The Reproducing Kernel Property Two special cases: • Assume f (.) = k(., x). In this case, we have k(., x), g = g(x). • If moreover g(.) = k(., x′), we have k(., x), k(., x′ ) = k(x, x′). k is called a reproducing kernel (up to here, have not used positive definiteness) B. Sch¨lkopf, MLSS France 2011 o
  • 62. Endow it With a Dot Product, II • It can be shown that ⟨·, ·⟩ is a p.d. kernel on the set of functions {f(·) = Σ_{i=1}^m α_i k(·, x_i) | α_i ∈ R, x_i ∈ X}: Σ_{ij} γ_i γ_j ⟨f_i, f_j⟩ = ⟨Σ_i γ_i f_i, Σ_j γ_j f_j⟩ =: ⟨f, f⟩ = ⟨Σ_i α_i k(·, x_i), Σ_i α_i k(·, x_i)⟩ = Σ_{ij} α_i α_j k(x_i, x_j) ≥ 0 • furthermore, it is strictly positive definite: f(x)² = ⟨f, k(·, x)⟩² ≤ ⟨f, f⟩ ⟨k(·, x), k(·, x)⟩, hence ⟨f, f⟩ = 0 implies f = 0. • Complete the space in the corresponding norm to get a Hilbert space H_k.
  • 63. The Empirical Kernel Map Recall the feature map Φ : X → R^X, x ↦ k(·, x). • each point is represented by its similarity to all other points • how about representing it by its similarity to a sample of points? Consider Φ_m : X → R^m, x ↦ k(·, x)|_{(x_1,...,x_m)} = (k(x_1, x), . . . , k(x_m, x))^⊤ B. Schölkopf, MLSS France 2011
  • 64. ctd. • Φ_m(x_1), . . . , Φ_m(x_m) contain all necessary information about Φ(x_1), . . . , Φ(x_m) • the Gram matrix G_ij := ⟨Φ_m(x_i), Φ_m(x_j)⟩ satisfies G = K² where K_ij = k(x_i, x_j) • modify Φ_m to Φ_m^w : X → R^m, x ↦ K^{−1/2} (k(x_1, x), . . . , k(x_m, x))^⊤ • this "whitened" map ("kernel PCA map") satisfies ⟨Φ_m^w(x_i), Φ_m^w(x_j)⟩ = k(x_i, x_j) for all i, j = 1, . . . , m. B. Schölkopf, MLSS France 2011
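A small sketch (mine) of the whitened map: Φ_m^w(x) is computed via an eigendecomposition of K, and the claimed property ⟨Φ_m^w(x_i), Φ_m^w(x_j)⟩ = K_ij is checked on the sample. The Gaussian kernel is an illustrative choice.

    import numpy as np

    def gaussian_kernel(X, Y, sigma=1.0):
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma**2))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 2))
    K = gaussian_kernel(X, X)

    d, U = np.linalg.eigh(K)                                   # K = U diag(d) U^T
    K_inv_sqrt = U @ np.diag(1.0 / np.sqrt(np.clip(d, 1e-12, None))) @ U.T

    Phi_w = (K_inv_sqrt @ K).T    # row i is Phi_w(x_i) = K^{-1/2} (k(x_1,x_i),...,k(x_m,x_i))
    print(np.allclose(Phi_w @ Phi_w.T, K, atol=1e-6))          # True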
  • 65. An Example of a Kernel Algorithm Idea: classify points x := Φ(x) in feature space according to which of the two class means is closer. c_+ := (1/m_+) Σ_{y_i=1} Φ(x_i), c_− := (1/m_−) Σ_{y_i=−1} Φ(x_i) [figure: the two class means, their midpoint c, and the vector w] Compute the sign of the dot product between w := c_+ − c_− and x − c. B. Schölkopf, MLSS France 2011
  • 66. An Example of a Kernel Algorithm, ctd. [36] f(x) = sgn( (1/m_+) Σ_{i:y_i=+1} ⟨Φ(x), Φ(x_i)⟩ − (1/m_−) Σ_{i:y_i=−1} ⟨Φ(x), Φ(x_i)⟩ + b ) = sgn( (1/m_+) Σ_{i:y_i=+1} k(x, x_i) − (1/m_−) Σ_{i:y_i=−1} k(x, x_i) + b ), where b = (1/2) ( (1/m_−²) Σ_{(i,j):y_i=y_j=−1} k(x_i, x_j) − (1/m_+²) Σ_{(i,j):y_i=y_j=+1} k(x_i, x_j) ). • provides a geometric interpretation of Parzen windows B. Schölkopf, MLSS France 2011
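A direct implementation of this classifier (my sketch, with a Gaussian kernel and synthetic data as illustrative choices):

    import numpy as np

    def gaussian_kernel(X, Y, sigma=1.0):
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma**2))

    def mean_classifier(Xtr, ytr, Xte, sigma=1.0):
        pos, neg = Xtr[ytr == 1], Xtr[ytr == -1]
        Kp = gaussian_kernel(Xte, pos, sigma)
        Kn = gaussian_kernel(Xte, neg, sigma)
        b = 0.5 * (gaussian_kernel(neg, neg, sigma).mean()
                   - gaussian_kernel(pos, pos, sigma).mean())
        return np.sign(Kp.mean(axis=1) - Kn.mean(axis=1) + b)

    rng = np.random.default_rng(0)
    Xtr = np.vstack([rng.normal(+1.0, 1.0, (50, 2)), rng.normal(-1.0, 1.0, (50, 2))])
    ytr = np.hstack([np.ones(50), -np.ones(50)])
    print(mean_classifier(Xtr, ytr, np.array([[1.5, 1.5], [-1.5, -1.5]])))   # ~[ 1. -1.]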
  • 67. An Example of a Kernel Algorithm, ctd. • Demo • Exercise: derive the Parzen windows classifier by computing the distance criterion directly • SVMs (ppt) B. Sch¨lkopf, MLSS France 2011 o
  • 68. An example of a kernel algorithm, revisited o + µ(Y ) . + w o µ(X ) + o + X compact subset of a separable metric space, m, n ∈ N. Positive class X := {x1, . . . , xm} ⊂ X Negative class Y := {y1, . . . , yn} ⊂ X 1 m 1 n RKHS means µ(X) = m i=1 k(xi, ·), µ(Y ) = n i=1 k(yi, ·). Get a problem if µ(X) = µ(Y )! B. Sch¨lkopf, MLSS France 2011 o
  • 69. When do the means coincide? k(x, x′) = x, x′ : the means coincide k(x, x′) = ( x, x′ + 1)d: all empirical moments up to order d coincide k strictly pd: X =Y. The mean “remembers” each point that contributed to it. B. Sch¨lkopf, MLSS France 2011 o
  • 70. Proposition 2 Assume X, Y are defined as above, k is strictly pd, and for all i ≠ j, x_i ≠ x_j and y_i ≠ y_j. If for some α_i, β_j ∈ R − {0}, we have Σ_{i=1}^m α_i k(x_i, ·) = Σ_{j=1}^n β_j k(y_j, ·), (1) then X = Y. B. Schölkopf, MLSS France 2011
  • 71. Proof (by contradiction) W.l.o.g., assume that x_1 ∉ Y. Subtract Σ_{j=1}^n β_j k(y_j, ·) from (1), and make it a sum over pairwise distinct points, to get 0 = Σ_i γ_i k(z_i, ·), where z_1 = x_1, γ_1 = α_1 ≠ 0, and z_2, · · · ∈ X ∪ Y − {x_1}, γ_2, · · · ∈ R. Take the RKHS dot product with Σ_j γ_j k(z_j, ·) to get 0 = Σ_{ij} γ_i γ_j k(z_i, z_j), with γ ≠ 0, hence k cannot be strictly pd. B. Schölkopf, MLSS France 2011
  • 72. The mean map µ : X = (x_1, . . . , x_m) ↦ (1/m) Σ_{i=1}^m k(x_i, ·) satisfies ⟨µ(X), f⟩ = (1/m) Σ_{i=1}^m ⟨k(x_i, ·), f⟩ = (1/m) Σ_{i=1}^m f(x_i) and ‖µ(X) − µ(Y)‖ = sup_{‖f‖≤1} |⟨µ(X) − µ(Y), f⟩| = sup_{‖f‖≤1} | (1/m) Σ_{i=1}^m f(x_i) − (1/n) Σ_{i=1}^n f(y_i) |. Note: Large distance = can find a function distinguishing the samples B. Schölkopf, MLSS France 2011
  • 73. Witness function f = (µ(X) − µ(Y)) / ‖µ(X) − µ(Y)‖, thus f(x) ∝ ⟨µ(X) − µ(Y), k(x, ·)⟩: [figure: "Witness f for Gauss and Laplace data" — the two probability densities and the witness f plotted over X] This function is in the RKHS of a Gaussian kernel, but not in the RKHS of the linear kernel. B. Schölkopf, MLSS France 2011
  • 74. The mean map for measures p, q Borel probability measures, E_{x,x′∼p}[k(x, x′)], E_{x,x′∼q}[k(x, x′)] < ∞ (‖k(x, ·)‖ ≤ M < ∞ is sufficient). Define µ : p ↦ E_{x∼p}[k(x, ·)]. Note ⟨µ(p), f⟩ = E_{x∼p}[f(x)] and ‖µ(p) − µ(q)‖ = sup_{‖f‖≤1} | E_{x∼p}[f(x)] − E_{x∼q}[f(x)] |. Recall that in the finite sample case, for strictly p.d. kernels, µ was injective — how about now? [43, 17] B. Schölkopf, MLSS France 2011
  • 75. Theorem 3 [15, 13] p = q ⇐⇒ sup Ex∼p(f (x)) − Ex∼q (f (x)) = 0, f ∈C(X) where C(X) is the space of continuous bounded functions on X. Combine this with µ(p) − µ(q) = sup Ex∼p[f (x)] − Ex∼q [f (x)] . f ≤1 Replace C(X) by the unit ball in an RKHS that is dense in C(X) — universal kernel [45], e.g., Gaussian. Theorem 4 [19] If k is universal, then p = q ⇐⇒ µ(p) − µ(q) = 0. B. Sch¨lkopf, MLSS France 2011 o
  • 76. • µ is invertible on its image M = {µ(p) | p is a probability distribution} (the “marginal polytope”, [53]) • generalization of the moment generating function of a RV x with distribution p: Mp(.) = Ex∼p e x, · . This provides us with a convenient metric on probability distribu- tions, which can be used to check whether two distributions are different — provided that µ is invertible. B. Sch¨lkopf, MLSS France 2011 o
  • 77. Fourier Criterion Assume we have densities, the kernel is shift invariant (k(x, y) = k(x − y)), and all Fourier transforms below exist. Note that µ is invertible iff k(x − y)p(y) dy = k(x − y)q(y) dy =⇒ p = q, i.e., ˆp ˆ k(ˆ − q ) = 0 =⇒ p = q (Sriperumbudur et al., 2008) ˆ E.g., µ is invertible if k has full support. Restricting the class of ˆ distributions, weaker conditions suffice (e.g., if k has non-empty in- terior, µ is invertible for all distributions with compact support). B. Sch¨lkopf, MLSS France 2011 o
  • 78. Fourier Optics Application: p source of incoherent light, I indicator of a finite ˆ aperture. In Fraunhofer diffraction, the intensity image is ∝ p∗ I 2. ˆ Set k = I 2, then this equals µ(p). ˆ This k does not have full support, thus the imaging process is not invertible for the class of all light sources (Abbe), but it is if we restrict the class (e.g., to compact support). B. Sch¨lkopf, MLSS France 2011 o
  • 79. Application 1: Two-sample problem [19] X, Y i.i.d. m-samples from p, q, respectively. ‖µ(p) − µ(q)‖² = E_{x,x′∼p}[k(x, x′)] − 2 E_{x∼p,y∼q}[k(x, y)] + E_{y,y′∼q}[k(y, y′)] = E_{x,x′∼p, y,y′∼q}[h((x, y), (x′, y′))] with h((x, y), (x′, y′)) := k(x, x′) − k(x, y′) − k(y, x′) + k(y, y′). Define D(p, q)² := E_{x,x′∼p, y,y′∼q}[h((x, y), (x′, y′))] and D̂(X, Y)² := (1/(m(m−1))) Σ_{i≠j} h((x_i, y_i), (x_j, y_j)). D̂(X, Y)² is an unbiased estimator of D(p, q)². It's easy to compute, and works on structured data. B. Schölkopf, MLSS France 2011
  • 80. Theorem 5 Assume k is bounded. D̂(X, Y)² converges to D(p, q)² in probability with rate O(m^{−1/2}). This could be used as a basis for a test, but uniform convergence bounds are often loose. Theorem 6 We assume E[h²] < ∞. When p ≠ q, then √m (D̂(X, Y)² − D(p, q)²) converges in distribution to a zero mean Gaussian with variance σ_u² = 4 ( E_z[(E_{z′} h(z, z′))²] − (E_{z,z′} h(z, z′))² ). When p = q, then m (D̂(X, Y)² − D(p, q)²) = m D̂(X, Y)² converges in distribution to Σ_{l=1}^∞ λ_l (q_l² − 2), (2) where q_l ∼ N(0, 2) i.i.d., the λ_i are the solutions to the eigenvalue equation ∫_X k̃(x, x′) ψ_i(x) dp(x) = λ_i ψ_i(x′), and k̃(x_i, x_j) := k(x_i, x_j) − E_x k(x_i, x) − E_x k(x, x_j) + E_{x,x′} k(x, x′) is the centred RKHS kernel. B. Schölkopf, MLSS France 2011
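The unbiased estimator D̂(X, Y)² is straightforward to code; below is a sketch of mine (Gaussian kernel and synthetic data are illustrative choices), showing a value near zero when the two samples come from the same distribution and a clearly positive value otherwise.

    import numpy as np

    def gaussian_kernel(X, Y, sigma=1.0):
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma**2))

    def mmd2_unbiased(X, Y, sigma=1.0):
        # 1/(m(m-1)) sum_{i != j} [k(x_i,x_j) + k(y_i,y_j) - k(x_i,y_j) - k(y_i,x_j)]
        m = len(X)
        Kxy = gaussian_kernel(X, Y, sigma)
        H = gaussian_kernel(X, X, sigma) + gaussian_kernel(Y, Y, sigma) - Kxy - Kxy.T
        np.fill_diagonal(H, 0.0)
        return H.sum() / (m * (m - 1))

    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, size=(200, 1))
    print(mmd2_unbiased(X, rng.normal(0.0, 1.0, size=(200, 1))))   # close to 0
    print(mmd2_unbiased(X, rng.normal(1.0, 1.0, size=(200, 1))))   # clearly > 0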
  • 81. Application 2: Dependence Measures Assume that (x, y) are drawn from pxy , with marginals px, py . Want to know whether pxy factorizes. [2, 16]: kernel generalized variance [20, 21]: kernel constrained covariance, HSIC Main idea [25, 34]: x and y independent ⇐⇒ ∀ bounded continuous functions f, g, we have Cov(f (x), g(y)) = 0. B. Sch¨lkopf, MLSS France 2011 o
  • 82. k kernel on X × Y. µ(p_xy) := E_{(x,y)∼p_xy}[k((x, y), ·)], µ(p_x × p_y) := E_{x∼p_x, y∼p_y}[k((x, y), ·)]. Use ∆ := ‖µ(p_xy) − µ(p_x × p_y)‖ as a measure of dependence. For k((x, y), (x′, y′)) = k_x(x, x′) k_y(y, y′): ∆² equals the Hilbert-Schmidt norm of the covariance operator between the two RKHSs (HSIC), with empirical estimate m^{−2} tr(H K_x H K_y), where H = I − (1/m) 1 1^⊤ [20, 44]. B. Schölkopf, MLSS France 2011
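A corresponding sketch (mine) of the empirical HSIC estimate m^{−2} tr(H K_x H K_y), with Gaussian kernels on both variables as an illustrative choice:

    import numpy as np

    def gaussian_kernel(X, Y, sigma=1.0):
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma**2))

    def hsic(X, Y, sigma=1.0):
        m = len(X)
        H = np.eye(m) - np.ones((m, m)) / m        # centering matrix
        Kx = gaussian_kernel(X, X, sigma)
        Ky = gaussian_kernel(Y, Y, sigma)
        return np.trace(H @ Kx @ H @ Ky) / m**2

    rng = np.random.default_rng(0)
    x = rng.normal(size=(300, 1))
    print(hsic(x, x + 0.1 * rng.normal(size=(300, 1))))   # dependent: relatively large
    print(hsic(x, rng.normal(size=(300, 1))))             # independent: near zero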
  • 83. Witness function of the equivalent optimisation problem: [figure: "Dependence witness and sample" — the witness function plotted over the (X, Y) plane] Application: learning causal structures (Sun et al., ICML 2007; Fukumizu et al., NIPS 2007) B. Schölkopf, MLSS France 2011
  • 84. Application 3: Covariate Shift Correction and Local Learning training set X = {(x_1, y_1), . . . , (x_m, y_m)} drawn from p, test set X′ = {(x′_1, y′_1), . . . , (x′_n, y′_n)} from p′ ≠ p. Assume p_{y|x} = p′_{y|x}. [40]: reweight training set B. Schölkopf, MLSS France 2011
  • 85. Minimize ‖ Σ_{i=1}^m β_i k(x_i, ·) − µ(X′) ‖² + λ‖β‖² subject to β_i ≥ 0, Σ_i β_i = 1. Equivalent QP: minimize (1/2) β^⊤ (K + λ1) β − β^⊤ l subject to β_i ≥ 0 and Σ_i β_i = 1, where K_ij := k(x_i, x_j), l_i = ⟨k(x_i, ·), µ(X′)⟩. Experiments show that in underspecified situations (e.g., large kernel widths), this helps [23]. X′ = {x′} leads to a local sample weighting scheme. B. Schölkopf, MLSS France 2011
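A sketch of this reweighting scheme (mine): rather than a dedicated QP solver, it uses scipy's generic SLSQP optimizer, takes λ1 to be λ times the identity, and uses a Gaussian kernel; all of these are assumptions made for illustration.

    import numpy as np
    from scipy.optimize import minimize

    def gaussian_kernel(X, Y, sigma=1.0):
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma**2))

    def kmm_weights(Xtr, Xte, sigma=1.0, lam=1e-3):
        m = len(Xtr)
        K = gaussian_kernel(Xtr, Xtr, sigma)
        l = gaussian_kernel(Xtr, Xte, sigma).mean(axis=1)   # l_i = <k(x_i,.), mu(X')>
        obj = lambda b: 0.5 * b @ (K + lam * np.eye(m)) @ b - b @ l
        cons = ({"type": "eq", "fun": lambda b: b.sum() - 1.0},)
        res = minimize(obj, np.full(m, 1.0 / m), bounds=[(0.0, None)] * m,
                       constraints=cons, method="SLSQP")
        return res.x                                        # training weights beta_i

    rng = np.random.default_rng(0)
    Xtr = rng.normal(0.0, 1.0, size=(100, 1))
    Xte = rng.normal(1.0, 0.5, size=(100, 1))               # covariate-shifted test set
    beta = kmm_weights(Xtr, Xte)
    print(beta.sum(), Xtr[np.argmax(beta), 0])              # weights sum to 1; largest weight near x = 1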
  • 86. The Representer Theorem Theorem 7 Given: a p.d. kernel k on X × X, a training set (x_1, y_1), . . . , (x_m, y_m) ∈ X × R, a strictly monotonically increasing real-valued function Ω on [0, ∞[, and an arbitrary cost function c : (X × R²)^m → R ∪ {∞}. Any f ∈ H_k minimizing the regularized risk functional c((x_1, y_1, f(x_1)), . . . , (x_m, y_m, f(x_m))) + Ω(‖f‖) (3) admits a representation of the form f(·) = Σ_{i=1}^m α_i k(x_i, ·). B. Schölkopf, MLSS France 2011
  • 87. Remarks • significance: many learning algorithms have solutions that can be expressed as expansions in terms of the training examples • original form, with mean squared loss c((x_1, y_1, f(x_1)), . . . , (x_m, y_m, f(x_m))) = (1/m) Σ_{i=1}^m (y_i − f(x_i))², and Ω(‖f‖) = λ‖f‖² (λ > 0): [27] • generalization to non-quadratic cost functions: [10] • present form: [36] B. Schölkopf, MLSS France 2011
  • 88. Proof Decompose f ∈ H into a part in the span of the k(x_i, ·) and an orthogonal one: f = Σ_i α_i k(x_i, ·) + f_⊥, where for all j, ⟨f_⊥, k(x_j, ·)⟩ = 0. Application of f to an arbitrary training point x_j yields f(x_j) = ⟨f, k(x_j, ·)⟩ = ⟨Σ_i α_i k(x_i, ·) + f_⊥, k(x_j, ·)⟩ = Σ_i α_i ⟨k(x_i, ·), k(x_j, ·)⟩, independent of f_⊥. B. Schölkopf, MLSS France 2011
  • 89. Proof: second part of (3) Since f_⊥ is orthogonal to Σ_i α_i k(x_i, ·), and Ω is strictly monotonic, we get Ω(‖f‖) = Ω(‖ Σ_i α_i k(x_i, ·) + f_⊥ ‖) = Ω( √( ‖Σ_i α_i k(x_i, ·)‖² + ‖f_⊥‖² ) ) ≥ Ω(‖Σ_i α_i k(x_i, ·)‖), (4) with equality occurring if and only if f_⊥ = 0. Hence, any minimizer must have f_⊥ = 0. Consequently, any solution takes the form f = Σ_i α_i k(x_i, ·). B. Schölkopf, MLSS France 2011
  • 90. Application: Support Vector Classification Here, y_i ∈ {±1}. Use c((x_i, y_i, f(x_i))_i) = (1/λ) Σ_i max(0, 1 − y_i f(x_i)), and the regularizer Ω(‖f‖) = ‖f‖². λ → 0 leads to the hard margin SVM B. Schölkopf, MLSS France 2011
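As an illustration of the representer theorem at work, here is a sketch (mine, not the standard dual QP): it minimizes the regularized hinge loss directly over the expansion coefficients α in f(·) = Σ_i α_i k(x_i, ·), using ‖f‖² = α⊤Kα and a generic optimizer. The Gaussian kernel and the use of numerical gradients on the non-smooth hinge are simplifying assumptions.

    import numpy as np
    from scipy.optimize import minimize

    def gaussian_kernel(X, Y, sigma=1.0):
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma**2))

    def train_kernel_svm(X, y, lam=0.1, sigma=1.0):
        K = gaussian_kernel(X, X, sigma)
        def objective(alpha):
            f = K @ alpha                           # f(x_i) for all training points
            return np.maximum(0.0, 1.0 - y * f).sum() / lam + alpha @ K @ alpha
        return minimize(objective, np.zeros(len(X)), method="L-BFGS-B").x

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(+1.0, 1.0, (40, 2)), rng.normal(-1.0, 1.0, (40, 2))])
    y = np.hstack([np.ones(40), -np.ones(40)])
    alpha = train_kernel_svm(X, y)
    pred = np.sign(gaussian_kernel(X, X) @ alpha)
    print((pred == y).mean())                       # training accuracy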
  • 91. Further Applications Bayesian MAP Estimates. Identify (3) with the negative log posterior (cf. Kimeldorf & Wahba, 1970, Poggio & Girosi, 1990), i.e. • exp(−c((x_i, y_i, f(x_i))_i)) — likelihood of the data • exp(−Ω(‖f‖)) — prior over the set of functions; e.g., Ω(‖f‖) = λ‖f‖² — Gaussian process prior [55] with covariance function k • minimizer of (3) = MAP estimate. Kernel PCA (see below) can be shown to correspond to the case of c((x_i, y_i, f(x_i))_{i=1,...,m}) = 0 if (1/m) Σ_i ( f(x_i) − (1/m) Σ_j f(x_j) )² = 1, and ∞ otherwise, with g an arbitrary strictly monotonically increasing function.
  • 92. Conclusion • the kernel corresponds to – a similarity measure for the data, or – a (linear) representation of the data, or – a hypothesis space for learning, • kernels allow the formulation of a multitude of geometrical algo- rithms (Parzen windows, 2-sample tests, SVMs, kernel PCA,...) B. Sch¨lkopf, MLSS France 2011 o
  • 93. Kernel PCA [37] [figure: linear PCA with k(x,y) = ⟨x,y⟩ in R², versus kernel PCA with k(x,y) = ⟨x,y⟩^d, where Φ maps the data into a feature space H] B. Schölkopf, MLSS France 2011
  • 94. Kernel PCA, II x_1, . . . , x_m ∈ X, Φ : X → H, C = (1/m) Σ_{j=1}^m Φ(x_j) Φ(x_j)^⊤. Eigenvalue problem λV = CV = (1/m) Σ_{j=1}^m ⟨Φ(x_j), V⟩ Φ(x_j). For λ ≠ 0, V ∈ span{Φ(x_1), . . . , Φ(x_m)}, thus V = Σ_{i=1}^m α_i Φ(x_i), and the eigenvalue problem can be written as λ⟨Φ(x_n), V⟩ = ⟨Φ(x_n), CV⟩ for all n = 1, . . . , m B. Schölkopf, MLSS France 2011
  • 95. Kernel PCA in Dual Variables In terms of the m × m Gram matrix K_ij := ⟨Φ(x_i), Φ(x_j)⟩ = k(x_i, x_j), this leads to mλKα = K²α where α = (α_1, . . . , α_m)^⊤. Solve mλα = Kα −→ (λ_n, α^n). ⟨V^n, V^n⟩ = 1 ⇐⇒ λ_n ⟨α^n, α^n⟩ = 1, thus divide α^n by √λ_n B. Schölkopf, MLSS France 2011
  • 96. Feature extraction Compute projections on the eigenvectors V^n = Σ_{i=1}^m α_i^n Φ(x_i) in H: for a test point x with image Φ(x) in H we get the features ⟨V^n, Φ(x)⟩ = Σ_{i=1}^m α_i^n ⟨Φ(x_i), Φ(x)⟩ = Σ_{i=1}^m α_i^n k(x_i, x) B. Schölkopf, MLSS France 2011
  • 97. The Kernel PCA Map Recall Φ_m^w : X → R^m, x ↦ K^{−1/2} (k(x_1, x), . . . , k(x_m, x))^⊤. If K = U D U^⊤ is K's diagonalization, then K^{−1/2} = U D^{−1/2} U^⊤. Thus we have Φ_m^w(x) = U D^{−1/2} U^⊤ (k(x_1, x), . . . , k(x_m, x))^⊤. We can drop the leading U (since it leaves the dot product invariant) to get a map Φ_KPCA^w(x) = D^{−1/2} U^⊤ (k(x_1, x), . . . , k(x_m, x))^⊤. The rows of U^⊤ are the eigenvectors α^n of K, and the entries of the diagonal matrix D^{−1/2} equal λ_i^{−1/2}. B. Schölkopf, MLSS France 2011
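Putting the kernel PCA slides together, a compact sketch (mine): it centres the Gram matrix in feature space (a standard step not spelled out on these slides), solves the eigenproblem, rescales the α^n by 1/√λ_n, and projects the sample; the Gaussian kernel and the toy data are illustrative.

    import numpy as np

    def gaussian_kernel(X, Y, sigma=1.0):
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma**2))

    def kernel_pca(X, n_components=2, sigma=1.0):
        m = len(X)
        H = np.eye(m) - np.ones((m, m)) / m
        Kc = H @ gaussian_kernel(X, X, sigma) @ H           # centred Gram matrix
        eigvals, eigvecs = np.linalg.eigh(Kc)               # ascending order
        eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
        alphas = eigvecs[:, :n_components] / np.sqrt(eigvals[:n_components])
        return Kc @ alphas                                  # row i = features of x_i

    rng = np.random.default_rng(0)
    t = rng.uniform(0, 2 * np.pi, 100)
    X = np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.normal(size=(100, 2))
    print(kernel_pca(X).shape)                              # (100, 2)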
  • 98. Toy Example with Gaussian Kernel k(x, x′) = exp − x − x′ 2 B. Sch¨lkopf, MLSS France 2011 o
  • 99. Super-Resolution (Kim, Franz, Schölkopf, 2004) [figure: a. original image of resolution 528 × 396; b. low resolution image (264 × 198) stretched to the original scale; c. bicubic interpolation; d. supervised example-based learning based on nearest neighbor classifier; f. unsupervised KPCA reconstruction; g. enlarged portions of a–d and f (from left to right)] Comparison between different super-resolution methods. B. Schölkopf, MLSS France 2011
  • 100. Support Vector Classifiers input space feature space G N N N N Φ G G G G G [6] B. Sch¨lkopf, MLSS France 2011 o
  • 101. Separating Hyperplane [figure: the two classes on either side of the hyperplane {x | ⟨w, x⟩ + b = 0}, with ⟨w, x⟩ + b > 0 on one side and ⟨w, x⟩ + b < 0 on the other, and normal vector w] B. Schölkopf, MLSS France 2011
  • 102. Optimal Separating Hyperplane [50] N G N G N . w N G G G {x | w, x + b = 0} B. Sch¨lkopf, MLSS France 2011 o
  • 103. Eliminating the Scaling Freedom [47] Note: if c ≠ 0, then {x | ⟨w, x⟩ + b = 0} = {x | ⟨cw, x⟩ + cb = 0}. Hence (cw, cb) describes the same hyperplane as (w, b). Definition: The hyperplane is in canonical form w.r.t. X* = {x_1, . . . , x_r} if min_{x_i∈X*} |⟨w, x_i⟩ + b| = 1. B. Schölkopf, MLSS France 2011
  • 104. Canonical Optimal Hyperplane [figure: the hyperplanes {x | ⟨w, x⟩ + b = +1}, {x | ⟨w, x⟩ + b = −1} and {x | ⟨w, x⟩ + b = 0}, with a point x_1 of class y_i = +1 and a point x_2 of class y_i = −1 on the margins] Note: ⟨w, x_1⟩ + b = +1 and ⟨w, x_2⟩ + b = −1 =⇒ ⟨w, (x_1 − x_2)⟩ = 2 =⇒ ⟨w/‖w‖, (x_1 − x_2)⟩ = 2/‖w‖ B. Schölkopf, MLSS France 2011
  • 105. Canonical Hyperplanes [47] Note: if c ≠ 0, then {x | ⟨w, x⟩ + b = 0} = {x | ⟨cw, x⟩ + cb = 0}. Hence (cw, cb) describes the same hyperplane as (w, b). Definition: The hyperplane is in canonical form w.r.t. X* = {x_1, . . . , x_r} if min_{x_i∈X*} |⟨w, x_i⟩ + b| = 1. Note that for canonical hyperplanes, the distance of the closest point to the hyperplane ("margin") is 1/‖w‖: min_{x_i∈X*} |⟨w/‖w‖, x_i⟩ + b/‖w‖| = 1/‖w‖. B. Schölkopf, MLSS France 2011
  • 106. Theorem 8 (Vapnik [46]) Consider hyperplanes ⟨w, x⟩ = 0 where w is normalized such that they are in canonical form w.r.t. a set of points X* = {x_1, . . . , x_r}, i.e., min_{i=1,...,r} |⟨w, x_i⟩| = 1. The set of decision functions f_w(x) = sgn⟨x, w⟩ defined on X* and satisfying the constraint ‖w‖ ≤ Λ has a VC dimension satisfying h ≤ R²Λ². Here, R is the radius of the smallest sphere around the origin containing X*. B. Schölkopf, MLSS France 2011
  • 107. x x x R x x γ1 γ2 B. Sch¨lkopf, MLSS France 2011 o
  • 108. Proof Strategy (Gurvits, 1997) Assume that x_1, . . . , x_r are shattered by canonical hyperplanes with ‖w‖ ≤ Λ, i.e., for all y_1, . . . , y_r ∈ {±1}, y_i ⟨w, x_i⟩ ≥ 1 for all i = 1, . . . , r. (5) Two steps: • prove that the more points we want to shatter (5), the larger ‖Σ_{i=1}^r y_i x_i‖ must be • upper bound the size of ‖Σ_{i=1}^r y_i x_i‖ in terms of R Combining the two tells us how many points we can at most shatter. B. Schölkopf, MLSS France 2011
  • 109. Part I Summing (5) over i = 1, . . . , r yields ⟨w, Σ_{i=1}^r y_i x_i⟩ ≥ r. By the Cauchy-Schwarz inequality, on the other hand, we have ⟨w, Σ_{i=1}^r y_i x_i⟩ ≤ ‖w‖ ‖Σ_{i=1}^r y_i x_i‖ ≤ Λ ‖Σ_{i=1}^r y_i x_i‖. Combine both: r/Λ ≤ ‖Σ_{i=1}^r y_i x_i‖. (6) B. Schölkopf, MLSS France 2011
  • 110. Part II Consider independent random labels y_i ∈ {±1}, uniformly distributed (Rademacher variables). E[ ‖Σ_{i=1}^r y_i x_i‖² ] = E[ ⟨Σ_{i=1}^r y_i x_i, Σ_{j=1}^r y_j x_j⟩ ] = Σ_{i=1}^r ( Σ_{j≠i} E[⟨y_i x_i, y_j x_j⟩] + E[⟨y_i x_i, y_i x_i⟩] ) = Σ_{i=1}^r E[ ‖y_i x_i‖² ] = Σ_{i=1}^r ‖x_i‖² B. Schölkopf, MLSS France 2011
  • 111. Part II, ctd. Since ‖x_i‖ ≤ R, we get E[ ‖Σ_{i=1}^r y_i x_i‖² ] ≤ rR². • This holds for the expectation over the random choices of the labels, hence there must be at least one set of labels for which it also holds true. Use this set. Hence ‖Σ_{i=1}^r y_i x_i‖² ≤ rR². B. Schölkopf, MLSS France 2011
  • 112. Part I and II Combined Part I: r/Λ ≤ ‖Σ_{i=1}^r y_i x_i‖. Part II: ‖Σ_{i=1}^r y_i x_i‖² ≤ rR². Hence r²/Λ² ≤ rR², i.e., r ≤ R²Λ², completing the proof. B. Schölkopf, MLSS France 2011
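The key step in Part II can also be checked numerically; here is a small Monte Carlo sketch (mine) of the identity E‖Σ_i y_i x_i‖² = Σ_i ‖x_i‖² for Rademacher labels:

    import numpy as np

    rng = np.random.default_rng(0)
    r, dim, trials = 20, 5, 200000

    X = rng.normal(size=(r, dim))
    Y = rng.choice([-1.0, 1.0], size=(trials, r))   # random Rademacher label vectors
    sums = Y @ X                                    # each row is sum_i y_i x_i
    lhs = (sums**2).sum(axis=1).mean()              # Monte Carlo estimate of E||sum||^2
    rhs = (X**2).sum()                              # sum_i ||x_i||^2
    print(round(lhs, 2), round(rhs, 2))             # approximately equal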