Sebastian Nowozin and Christoph Lampert   –   Structured Models in Computer Vision   –   Part 5. Structured SVMs




              Part 5: Structured Support Vector Machines

                       Sebastian Nowozin and Christoph H. Lampert


                                 Colorado Springs, 25th June 2011




 Problem (Loss-Minimizing Parameter Learning)
 Let   d(x, y) be the (unknown) true data distribution.
 Let   D = {(x^1, y^1), . . . , (x^N, y^N)} be i.i.d. samples from d(x, y).
 Let   φ : X × Y → R^D be a feature function.
 Let   ∆ : Y × Y → R be a loss function.
        Find a weight vector w∗ that leads to minimal expected loss

                             E_{(x,y)∼d(x,y)} { ∆(y, f(x)) }

        for f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.

 Pro:
        We directly optimize for the quantity of interest: expected loss.
        No expensive-to-compute partition function Z will show up.
 Con:
        We need to know the loss function already at training time.
        We can’t use probabilistic reasoning to find w∗ .


Reminder: learning by regularized risk minimization




 For compatibility function g(x, y; w) := ⟨w, φ(x, y)⟩ find w∗ that minimizes

                            E_{(x,y)∼d(x,y)} ∆( y, argmax_y g(x, y; w) ).

 Two major problems:
        d(x, y) is unknown
        argmax_y g(x, y; w) maps into a discrete space
        → ∆( y, argmax_y g(x, y; w) ) is discontinuous, piecewise constant






 Task:

                        min_w   E_{(x,y)∼d(x,y)} ∆( y, argmax_y g(x, y; w) ).

 Problem 1:
        d(x, y) is unknown

 Solution:
        Replace E_{(x,y)∼d(x,y)} [ · ] with empirical estimate (1/N) Σ_{(x^n,y^n)∈D} [ · ]
        To avoid overfitting: add a regularizer, e.g. λ ∥w∥².


 New task:

                  min_w    λ ∥w∥²  +  (1/N) Σ_{n=1}^{N} ∆( y^n, argmax_y g(x^n, y; w) ).





 Task:

                  min_w    λ ∥w∥²  +  (1/N) Σ_{n=1}^{N} ∆( y^n, argmax_y g(x^n, y; w) ).

 Problem:
        ∆( y, argmax_y g(x, y; w) ) discontinuous w.r.t. w.

 Solution:
        Replace ∆(y, y′) with a well-behaved surrogate ℓ(x, y, w)
        Typically: ℓ is an upper bound to ∆, continuous and convex w.r.t. w.

 New task:

                               min_w    λ ∥w∥²  +  (1/N) Σ_{n=1}^{N} ℓ(x^n, y^n, w)


Regularized Risk Minimization

                      min_w    λ ∥w∥²  +  (1/N) Σ_{n=1}^{N} ℓ(x^n, y^n, w)

                                  Regularization + Loss on training data


 Hinge loss: maximum margin training

         ℓ(x^n, y^n, w) := max_{y∈Y} [ ∆(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]




         ℓ is a maximum over linear functions → continuous, convex.
         ℓ bounds ∆ from above.
        Proof: Let ȳ = argmax_y g(x^n, y, w). Since g(x^n, ȳ, w) ≥ g(x^n, y^n, w),

                   ∆(y^n, ȳ) ≤ ∆(y^n, ȳ) + g(x^n, ȳ, w) − g(x^n, y^n, w)
                              ≤ max_{y∈Y} [ ∆(y^n, y) + g(x^n, y, w) − g(x^n, y^n, w) ]  =  ℓ(x^n, y^n, w).
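
 When Y is small enough to enumerate, the hinge loss above can be computed directly. A minimal
 sketch (Python/numpy; the joint feature map `phi` and task loss `delta` are user-supplied
 placeholders, not definitions from the slides):

    import numpy as np

    def structured_hinge_loss(w, x_n, y_n, Y, phi, delta):
        """l(x^n, y^n, w) = max_y [ delta(y_n, y) + <w, phi(x_n, y)> - <w, phi(x_n, y_n)> ]."""
        score_true = np.dot(w, phi(x_n, y_n))
        values = [delta(y_n, y) + np.dot(w, phi(x_n, y)) - score_true for y in Y]
        return max(values)   # >= 0, since y = y_n contributes delta(y_n, y_n) = 0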
 Alternative:
 Logistic loss: probabilistic training

              ℓ(x^n, y^n, w) := log Σ_{y∈Y} exp( ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ )



 Structured Output Support Vector Machine

   min_w   (1/2) ∥w∥²  +  (C/N) Σ_{n=1}^{N} max_{y∈Y} [ ∆(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]




 Conditional Random Field

   min_w   ∥w∥² / (2σ²)  +  Σ_{n=1}^{N} log Σ_{y∈Y} exp( ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ )




 CRFs and SSVMs have more in common than usually assumed.
    both do regularized risk minimization
    log Σ_y exp(·) can be interpreted as a soft-max
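
 The soft-max remark can be checked numerically: log Σ_y exp(β·s_y) / β approaches max_y s_y as
 the scaling β grows. A tiny illustration (made-up scores):

    import numpy as np

    scores = np.array([1.0, 2.5, 0.3])            # per-label scores <w, phi(x, y)>, made-up numbers
    for beta in (1.0, 10.0, 100.0):               # inverse-temperature scaling
        soft_max = np.log(np.sum(np.exp(beta * scores))) / beta
        print(beta, soft_max, scores.max())       # soft-max approaches the hard max as beta grows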


Solving the Training Optimization Problem Numerically



 Structured Output Support Vector Machine:

   min_w   (1/2) ∥w∥²  +  (C/N) Σ_{n=1}^{N} max_{y∈Y} [ ∆(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]




  Unconstrained optimization, convex, non-differentiable objective.








 Structured Output SVM (equivalent formulation):
                            min_{w,ξ}   (1/2) ∥w∥²  +  (C/N) Σ_{n=1}^{N} ξ^n

 subject to, for n = 1, . . . , N ,

                 max_{y∈Y} [ ∆(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]   ≤  ξ^n




  N non-linear constraints, convex, differentiable objective.








 Structured Output SVM (also equivalent formulation):
                     min_{w,ξ}   (1/2) ∥w∥²  +  (C/N) Σ_{n=1}^{N} ξ^n

 subject to, for n = 1, . . . , N ,

          ∆(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ≤ ξ^n ,                            for all y ∈ Y


   N |Y| linear constraints, convex, differentiable objective.






Example: Multiclass SVM

        Y = {1, 2, . . . , K},      ∆(y, y′) = 1 for y ≠ y′, 0 otherwise.

        φ(x, y) =  ( ⟦y = 1⟧ φ(x),  ⟦y = 2⟧ φ(x),  . . . ,  ⟦y = K⟧ φ(x) )

  Solve:            min_{w,ξ}   (1/2) ∥w∥²  +  (C/N) Σ_{n=1}^{N} ξ^n

 subject to, for n = 1, . . . , N ,

                     ⟨w, φ(x^n, y^n)⟩ − ⟨w, φ(x^n, y)⟩ ≥ 1 − ξ^n                           for all y ∈ Y \ {y^n}.

 Classification:             f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.

                                Crammer-Singer Multiclass SVM
 [K. Crammer, Y. Singer: ”On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines”, JMLR, 2001]
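
 For this multiclass special case, the joint feature map and the loss-augmented argmax used in
 the constraints fit in a few lines. A sketch (Python/numpy; labels 0,…,K−1 instead of 1,…,K;
 taking phi(x) = x is an assumption for illustration):

    import numpy as np

    def phi_multiclass(x, y, K):
        """Stacked feature map ( [y=1] phi(x), ..., [y=K] phi(x) ) with phi(x) = x."""
        D = x.shape[0]
        out = np.zeros(K * D)
        out[y * D:(y + 1) * D] = x                # copy x into the block belonging to class y
        return out

    def loss_augmented_argmax(w, x, y_true, K):
        """argmax_y  Delta(y_true, y) + <w, phi(x, y)>  with 0/1 loss Delta."""
        scores = np.array([np.dot(w, phi_multiclass(x, y, K)) for y in range(K)])
        scores += (np.arange(K) != y_true)        # add 1 to every wrong label (0/1 loss)
        return int(np.argmax(scores))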


Example: Hierarchical SVM

   Hierarchical Multiclass Loss:
         ∆(y, y′) := (1/2) · (distance in tree)
         ∆(cat, cat) = 0,    ∆(cat, dog) = 1,    ∆(cat, bus) = 2,    etc.

  Solve:            min_{w,ξ}   (1/2) ∥w∥²  +  (C/N) Σ_{n=1}^{N} ξ^n

 subject to, for n = 1, . . . , N ,

                    ⟨w, φ(x^n, y^n)⟩ − ⟨w, φ(x^n, y)⟩ ≥ ∆(y^n, y) − ξ^n                            for all y ∈ Y.

 [L. Cai, T. Hofmann: ”Hierarchical Document Categorization with Support Vector Machines”, ACM CIKM, 2004]
 [A. Binder, K.-R. Müller, M. Kawanabe: ”On taxonomies for multi-class image categorization”, IJCV, 2011]



Solving the Training Optimization Problem Numerically

 We can solve SSVM training like CRF training:

   min_w   (1/2) ∥w∥²  +  (C/N) Σ_{n=1}^{N} max_{y∈Y} [ ∆(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]




         continuous
         unconstrained
         convex
         non-differentiable
         → we can’t use gradient descent directly.
         → we’ll have to use subgradients



 Definition
 Let f : R^D → R be a convex, not necessarily differentiable, function.
 A vector v ∈ R^D is called a subgradient of f at w0 , if

                            f(w) ≥ f(w0) + ⟨v, w − w0⟩                       for all w.

                 [Figure: a convex function f(w) and the linear lower bound f(w0) + ⟨v, w − w0⟩ touching it at w0]
    For differentiable f , the gradient v = ∇f(w0) is the only subgradient.



 Subgradient descent works basically like gradient descent:

 Subgradient Descent Minimization – minimize F(w)

         require: tolerance ε > 0, stepsizes ηt
         w_cur ← 0
         repeat
                 v ← some subgradient v ∈ sub_w F(w_cur)
                 w_cur ← w_cur − ηt v
         until F changed less than ε
         return w_cur


 Converges to the global minimum, but rather inefficient if F is non-differentiable.

 [Shor, ”Minimization methods for non-differentiable functions”, Springer, 1985.]
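
 As a toy instance of this procedure, the sketch below minimizes the non-differentiable convex
 function F(w) = |w1| + (w2 − 1)²; the function, starting point and stepsizes are made up for
 illustration:

    import numpy as np

    def F(w):                                     # non-differentiable convex objective (toy example)
        return abs(w[0]) + (w[1] - 1.0) ** 2

    def subgradient_F(w):
        # sign(w[0]) is a valid subgradient of |w[0]| (any value in [-1, 1] works at w[0] = 0)
        return np.array([np.sign(w[0]), 2.0 * (w[1] - 1.0)])

    w, eps = np.array([3.0, 0.0]), 1e-6
    for t in range(1, 1001):                      # stepsizes eta_t = 1/t
        f_old = F(w)
        w = w - (1.0 / t) * subgradient_F(w)
        if abs(F(w) - f_old) < eps:               # stopping rule as in the slide
            break
    print(w)                                      # slowly approaches the minimizer (0, 1)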




 Computing a subgradient:

                              min_w   (1/2) ∥w∥²  +  (C/N) Σ_{n=1}^{N} ℓ^n(w)

 with   ℓ^n(w) = max_{y∈Y} ℓ^n_y(w),   and

                     ℓ^n_y(w) := ∆(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩





 [Figure: each ℓ^n_y(w) is a linear function of w; ℓ^n(w) is their upper envelope, piecewise linear and convex]

 For each y ∈ Y,   ℓ^n_y(w) is a linear function.
 ℓ^n(w) = max_y ℓ^n_y(w):   maximum over all y ∈ Y.
 Subgradient of ℓ^n at w0:   find a maximal (active) ŷ, use v = ∇ℓ^n_ŷ(w0).



 Subgradient Descent S-SVM Training
 input training pairs {(x^1, y^1), . . . , (x^N, y^N)} ⊂ X × Y,
 input feature map φ(x, y), loss function ∆(y, y′), regularizer C,
 input number of iterations T , stepsizes ηt for t = 1, . . . , T
   1:   w ← 0
   2:   for t = 1, . . . , T do
   3:     for n = 1, . . . , N do
   4:        ŷ ← argmax_{y∈Y}  ∆(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩
   5:        v^n ← φ(x^n, ŷ) − φ(x^n, y^n)
   6:     end for
   7:     w ← w − ηt ( w + (C/N) Σ_n v^n )
   8:   end for
 output prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.


 Observation: each update of w needs 1 argmax-prediction per example.
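
 A compact sketch of this training loop (Python/numpy; `phi` and the loss-augmented
 `argmax_oracle` are assumed to be supplied by the user, e.g. the multiclass versions sketched
 earlier; the update uses the subgradient step w ← w − ηt(w + (C/N) Σ_n v^n) as above):

    import numpy as np

    def ssvm_subgradient_train(data, phi, argmax_oracle, D, C=1.0, T=100, eta=lambda t: 1.0 / t):
        """data: list of (x_n, y_n) pairs; argmax_oracle(w, x_n, y_n) returns
        argmax_y Delta(y_n, y) + <w, phi(x_n, y)> (the loss-augmented prediction)."""
        w, N = np.zeros(D), len(data)
        for t in range(1, T + 1):
            v_sum = np.zeros(D)
            for x_n, y_n in data:
                y_hat = argmax_oracle(w, x_n, y_n)        # one argmax-prediction per example
                v_sum += phi(x_n, y_hat) - phi(x_n, y_n)
            w -= eta(t) * (w + (C / N) * v_sum)           # subgradient step on the S-SVM objective
        return w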

 We can use the same tricks as for CRFs, e.g. stochastic updates:

 Stochastic Subgradient Descent S-SVM Training
 input training pairs {(x^1, y^1), . . . , (x^N, y^N)} ⊂ X × Y,
 input feature map φ(x, y), loss function ∆(y, y′), regularizer C,
 input number of iterations T , stepsizes ηt for t = 1, . . . , T
   1:   w ← 0
   2:   for t = 1, . . . , T do
   3:     (x^n, y^n) ← randomly chosen training example pair
   4:     ŷ ← argmax_{y∈Y}  ∆(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩
   5:     w ← w − ηt ( w + (C/N) [ φ(x^n, ŷ) − φ(x^n, y^n) ] )
   6:   end for
 output prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.


 Observation: each update of w needs only 1 argmax-prediction
 (but we’ll need many iterations until convergence)


Solving the Training Optimization Problem Numerically


 We can solve an S-SVM like a linear SVM:

 One of the equivalent formulations was:

                            min_{w∈R^D, ξ∈R^N_+}   ∥w∥²  +  (C/N) Σ_{n=1}^{N} ξ^n

 subject to, for n = 1, . . . , N ,

           ⟨w, φ(x^n, y^n)⟩ − ⟨w, φ(x^n, y)⟩ ≥ ∆(y^n, y) − ξ^n ,                          for all y ∈ Y.


 Introduce feature vectors δφ(x^n, y^n, y) := φ(x^n, y^n) − φ(x^n, y).




 Solve
                            min_{w∈R^D, ξ∈R^N_+}   ∥w∥²  +  (C/N) Σ_{n=1}^{N} ξ^n

 subject to, for n = 1, . . . , N , for all y ∈ Y,

                                 ⟨w, δφ(x^n, y^n, y)⟩ ≥ ∆(y^n, y) − ξ^n .

 This has the same structure as an ordinary SVM!
        quadratic objective
        linear constraints




 Question: Can’t we use an ordinary SVM/QP solver?




 Answer: Almost! We could, if there weren’t N |Y| constraints.
        E.g. 100 binary 16 × 16 images: 10^79 constraints

 Solution: working set training
     It’s enough if we enforce the active constraints.
     The others will be fulfilled automatically.
     We don’t know which ones are active for the optimal solution.
     But it’s likely to be only a small number ← can of course be formalized.
 Keep a set of potentially active constraints and update it iteratively:




 Working Set Training

        Start with working set S = ∅                  (no constraints)
        Repeat until convergence:
                Solve the S-SVM training problem with constraints from S
                Check if the solution violates any of the full constraint set
                       if no: we found the optimal solution, terminate.
                       if yes: add the most violated constraints to S, iterate.





 Good practical performance and theoretical guarantees:
     polynomial-time convergence ε-close to the global optimum
 [Tsochantaridis et al. ”Large Margin Methods for Structured and Interdependent Output Variables”, JMLR, 2005.]


 Working Set S-SVM Training
 input training pairs {(x^1, y^1), . . . , (x^N, y^N)} ⊂ X × Y,
 input feature map φ(x, y), loss function ∆(y, y′), regularizer C
   1:   S ← ∅
   2:   repeat
   3:     (w, ξ) ← solution of the QP with only the constraints from S
   4:     for n = 1, . . . , N do
   5:        ŷ ← argmax_{y∈Y}  ∆(y^n, y) + ⟨w, φ(x^n, y)⟩
   6:        if ŷ ≠ y^n then
   7:           S ← S ∪ {(x^n, ŷ)}
   8:        end if
   9:     end for
  10:   until S doesn’t change anymore.
 output prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.

 Observation: each update of w needs 1 argmax-prediction per example.
  (but we solve globally for next w, not by local steps)
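
 In code the working-set loop looks roughly as follows; `solve_restricted_qp` stands for any
 off-the-shelf QP solver applied to the constraints currently in S (its name and interface are
 assumptions, not from the slides), and `argmax_oracle` is the loss-augmented prediction as before:

    def working_set_ssvm_train(data, argmax_oracle, solve_restricted_qp, max_rounds=50):
        """Working-set (cutting-plane) training for the margin-rescaled S-SVM."""
        S = []                                                 # working set of (n, y_hat) constraints
        w = None
        for _ in range(max_rounds):
            w, xi = solve_restricted_qp(data, S)               # QP with only the constraints in S
            added = False
            for n, (x_n, y_n) in enumerate(data):
                y_hat = argmax_oracle(w, x_n, y_n)             # most violated output for example n
                if y_hat != y_n and (n, y_hat) not in S:
                    S.append((n, y_hat))
                    added = True
            if not added:                                      # S stopped changing: done
                break
        return w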


 One-Slack Formulation of S-SVM:
 (equivalent to the ordinary S-SVM formulation by  ξ = (1/N) Σ_n ξ^n )

                     min_{w∈R^D, ξ∈R_+}   (1/2) ∥w∥²  +  C ξ

 subject to, for all (ŷ^1, . . . , ŷ^N) ∈ Y × · · · × Y,

                Σ_{n=1}^{N} [ ∆(y^n, ŷ^n) + ⟨w, φ(x^n, ŷ^n)⟩ − ⟨w, φ(x^n, y^n)⟩ ]   ≤  N ξ,




   |Y|^N linear constraints, convex, differentiable objective.

 We blew up the constraint set even further:
        100 binary 16 × 16 images: ≈10^7700 constraints (instead of 10^79).



 Working Set One-Slack S-SVM Training
 input training pairs {(x^1, y^1), . . . , (x^N, y^N)} ⊂ X × Y,
 input feature map φ(x, y), loss function ∆(y, y′), regularizer C
   1:   S ← ∅
   2:   repeat
   3:     (w, ξ) ← solution of the QP with only the constraints from S
   4:     for n = 1, . . . , N do
   5:        ŷ^n ← argmax_{y∈Y}  ∆(y^n, y) + ⟨w, φ(x^n, y)⟩
   6:     end for
   7:     S ← S ∪ { ( (x^1, . . . , x^N), (ŷ^1, . . . , ŷ^N) ) }
   8:   until S doesn’t change anymore.
 output prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.


 Often faster convergence:
 We add one strong constraint per iteration instead of N weak ones.

 We can solve an S-SVM like a non-linear SVM: compute the Lagrangian dual
        min becomes max,
        original (primal) variables w, ξ disappear,
        new (dual) variables α_{ny} : one per constraint of the original problem.

 Dual S-SVM problem

   max_{α ∈ R^{N|Y|}_+}    Σ_{n=1,...,N}  Σ_{y∈Y}  α_{ny} ∆(y^n, y)
                            −  (1/2)  Σ_{n,n̄=1,...,N}  Σ_{y,ȳ∈Y}  α_{ny} α_{n̄ȳ}  ⟨ δφ(x^n, y^n, y), δφ(x^n̄, y^n̄, ȳ) ⟩

 subject to, for n = 1, . . . , N ,

                    Σ_{y∈Y} α_{ny} ≤ C/N.


  N linear constraints, convex, differentiable objective, N |Y| variables.



 We can kernelize:
        Define a joint kernel function k : (X × Y) × (X × Y) → R

                                 k( (x, y) , (x̄, ȳ) ) = ⟨ φ(x, y), φ(x̄, ȳ) ⟩.

        k measures the similarity between two (input, output)-pairs.

        We can express the optimization in terms of k:

              ⟨ δφ(x^n, y^n, y) , δφ(x^n̄, y^n̄, ȳ) ⟩
                    = ⟨ φ(x^n, y^n) − φ(x^n, y) , φ(x^n̄, y^n̄) − φ(x^n̄, ȳ) ⟩
                    = ⟨φ(x^n, y^n), φ(x^n̄, y^n̄)⟩ − ⟨φ(x^n, y^n), φ(x^n̄, ȳ)⟩
                          − ⟨φ(x^n, y), φ(x^n̄, y^n̄)⟩ + ⟨φ(x^n, y), φ(x^n̄, ȳ)⟩
                    = k( (x^n, y^n), (x^n̄, y^n̄) ) − k( (x^n, y^n), (x^n̄, ȳ) )
                          − k( (x^n, y), (x^n̄, y^n̄) ) + k( (x^n, y), (x^n̄, ȳ) )
                    =: K_{n n̄ y ȳ}


 Kernelized S-SVM problem:

       max_{α ∈ R^{N|Y|}_+}    Σ_{n=1,...,N}  Σ_{y∈Y}  α_{ny} ∆(y^n, y)  −  (1/2)  Σ_{n,n̄=1,...,N}  Σ_{y,ȳ∈Y}  α_{ny} α_{n̄ȳ}  K_{n n̄ y ȳ}

 subject to, for n = 1, . . . , N ,

                                          Σ_{y∈Y} α_{ny} ≤ C/N.



        too many variables: train with a working set of α_{ny} .

 Kernelized prediction function:

                           f(x) = argmax_{y∈Y}  Σ_{n, ȳ}  α_{nȳ}  k( (x^n, ȳ), (x, y) )
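
 A sketch of evaluating this prediction rule for a small, enumerable Y (the dictionary `alpha`
 keyed by (n, ȳ), the list `support` of training inputs and the joint kernel `k_joint` are
 illustrative names following the slide's simplified expression):

    def kernelized_predict(x, Y, alpha, support, k_joint):
        """f(x) = argmax_y  sum_{(n, y_bar)} alpha[(n, y_bar)] * k_joint((x_n, y_bar), (x, y))."""
        best_y, best_score = None, float("-inf")
        for y in Y:                                            # only feasible for small Y
            score = sum(a * k_joint((support[n], y_bar), (x, y))
                        for (n, y_bar), a in alpha.items())
            if score > best_score:
                best_y, best_score = y, score
        return best_y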




 What do ”joint kernel functions” look like?

                              k( (x, y) , (x̄, ȳ) ) = ⟨ φ(x, y), φ(x̄, ȳ) ⟩.

 As in graphical models: easier if φ decomposes w.r.t. factors:
        φ(x, y) = ( φ_F(x, y_F) )_{F∈F}

 Then the kernel k decomposes into a sum over the factors:

             k( (x, y) , (x̄, ȳ) ) = ⟨ ( φ_F(x, y_F) )_{F∈F} , ( φ_F(x̄, ȳ_F) )_{F∈F} ⟩

                                   =  Σ_{F∈F}  ⟨ φ_F(x, y_F), φ_F(x̄, ȳ_F) ⟩

                                   =  Σ_{F∈F}  k_F( (x, y_F), (x̄, ȳ_F) )

 We can define kernels for each factor (e.g. nonlinear).
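
 A sketch of such a factorized joint kernel (the per-factor kernels k_F and the `extract`
 functions that pick out y_F are hypothetical helpers):

    def joint_kernel(x1, y1, x2, y2, factors):
        """k((x1,y1),(x2,y2)) = sum_F k_F((x1, y1_F), (x2, y2_F)).

        factors: list of (extract, k_F) pairs, where extract(y) returns y_F."""
        return sum(k_F((x1, extract(y1)), (x2, extract(y2))) for extract, k_F in factors)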


 Example: figure-ground segmentation with grid structure

   (x, y) ≙ ( image , foreground/background labeling )          [figure]


 Typical kernels: arbitrary in x, linear (or at least simple) w.r.t. y:
     Unary factors:

                               k_p( (x_p, y_p), (x′_p, y′_p) ) = k(x_p, x′_p) ⟦y_p = y′_p⟧

        with k(x_p, x′_p) a local image kernel, e.g. χ² or histogram intersection

        Pairwise factors:

                              k_{pq}( (y_p, y_q), (y′_p, y′_q) ) = ⟦y_p = y′_p⟧ ⟦y_q = y′_q⟧

 More powerful than all-linear, and argmax-prediction is still possible.




 Example: object localization

       (x, y) ≙ ( image , bounding box given by left/top/right/bottom )          [figure]


 Only one factor that includes all of x and y:

                              k( (x, y) , (x′, y′) ) = k_image( x|_y , x′|_{y′} )

 with k_image an image kernel and x|_y the image region within box y.

 argmax-prediction is as difficult as object localization with a k_image-SVM.


Summary – S-SVM Learning

 Given:
        training set {(x^1, y^1), . . . , (x^N, y^N)} ⊂ X × Y
        loss function ∆ : Y × Y → R.
 Task: learn a parameter vector w for f(x) := argmax_y ⟨w, φ(x, y)⟩ that minimizes
 expected loss on future data.

 S-SVM solution derived from the maximum margin framework:
        enforce the correct output to be better than all others by a margin:

                 ⟨w, φ(x^n, y^n)⟩ ≥ ∆(y^n, y) + ⟨w, φ(x^n, y)⟩                         for all y ∈ Y.

        convex optimization problem, but non-differentiable
        many equivalent formulations → different training algorithms
        training needs repeated argmax prediction, no probabilistic inference


Extra I: Beyond Fully Supervised Learning

 So far, training was fully supervised, all variables were observed.
 In real life, some variables are unobserved even during training.




 Examples (illustrated on the slide):
        missing labels in training data
        latent variables, e.g. part location
        latent variables, e.g. part occlusion
        latent variables, e.g. viewpoint
 Three types of variables:
     x ∈ X always observed,
     y ∈ Y observed only in training,
     z ∈ Z never observed (latent).
 Decision function:      f(x) = argmax_{y∈Y} max_{z∈Z} ⟨w, φ(x, y, z)⟩
 Maximum Margin Training with Maximization over Latent Variables

        Solve:            min_{w,ξ}   (1/2) ∥w∥²  +  (C/N) Σ_{n=1}^{N} ξ^n

 subject to, for n = 1, . . . , N , for all y ∈ Y,

                        ∆(y^n, y) + max_{z∈Z} ⟨w, φ(x^n, y, z)⟩ − max_{z∈Z} ⟨w, φ(x^n, y^n, z)⟩   ≤   ξ^n

 Problem: not a convex problem → can have local minima

 [C. Yu, T. Joachims, ”Learning Structural SVMs with Latent Variables”, ICML, 2009]
 similar idea: [Felzenszwalb, McAllester, Ramanan: ”A Discriminatively Trained, Multiscale, Deformable Part Model”, CVPR 2008]
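
 In practice this is typically attacked by alternating between imputing the latent variables for
 the ground-truth outputs and retraining a fully observed S-SVM (as in the CCCP-style procedure
 of Yu & Joachims). A rough sketch with assumed helper functions:

    def latent_ssvm_train(data, infer_latent, train_ssvm, rounds=10):
        """data: list of (x_n, y_n). infer_latent(w, x, y) returns argmax_z <w, phi(x, y, z)>
        (for w = None it is assumed to return some initial guess); train_ssvm(completed)
        solves a fully observed S-SVM on (x, y, z) triples."""
        w = None
        for _ in range(rounds):
            completed = [(x_n, y_n, infer_latent(w, x_n, y_n)) for x_n, y_n in data]   # impute z
            w = train_ssvm(completed)                                                  # retrain S-SVM
        return w          # in general converges only to a local optimum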



 Structured Learning is full of Open Research Questions
        How to train faster?
                CRFs need many runs of probabilistic inference,
                SSVMs need many runs of argmax-predictions.

        How to reduce the necessary amount of training data?
                semi-supervised learning? transfer learning?

        How can we better understand different loss functions?
                when to use probabilistic training, when maximum margin?
                CRFs are “consistent”, SSVMs are not. Is this relevant?

        Can we understand structured learning with approximate inference?
                often computing L(w) or argmax_y ⟨w, φ(x, y)⟩ exactly is infeasible.
                can we guarantee good results even with approximate inference?

        More and new applications!





                      Lunch-Break
                   Continuing at 13:30

            Slides available at
   http://www.nowozin.net/sebastian/cvpr2011tutorial/





05 history of cv a machine learning (theory) perspective on computer visionzukun
 
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...Beniamino Murgante
 
Matrix Computations in Machine Learning
Matrix Computations in Machine LearningMatrix Computations in Machine Learning
Matrix Computations in Machine Learningbutest
 
CVPR2010: Advanced ITinCVPR in a Nutshell: part 5: Shape, Matching and Diverg...
CVPR2010: Advanced ITinCVPR in a Nutshell: part 5: Shape, Matching and Diverg...CVPR2010: Advanced ITinCVPR in a Nutshell: part 5: Shape, Matching and Diverg...
CVPR2010: Advanced ITinCVPR in a Nutshell: part 5: Shape, Matching and Diverg...zukun
 
Basics of probability in statistical simulation and stochastic programming
Basics of probability in statistical simulation and stochastic programmingBasics of probability in statistical simulation and stochastic programming
Basics of probability in statistical simulation and stochastic programmingSSA KPI
 
Linear Machine Learning Models with L2 Regularization and Kernel Tricks
Linear Machine Learning Models with L2 Regularization and Kernel TricksLinear Machine Learning Models with L2 Regularization and Kernel Tricks
Linear Machine Learning Models with L2 Regularization and Kernel TricksFengtao Wu
 
01 graphical models
01 graphical models01 graphical models
01 graphical modelszukun
 
Decomposition Methods in SLP
Decomposition Methods in SLPDecomposition Methods in SLP
Decomposition Methods in SLPSSA KPI
 
Kernels and Support Vector Machines
Kernels and Support Vector  MachinesKernels and Support Vector  Machines
Kernels and Support Vector MachinesEdgar Marca
 
Neural Processes Family
Neural Processes FamilyNeural Processes Family
Neural Processes FamilyKota Matsui
 
CMA-ES with local meta-models
CMA-ES with local meta-modelsCMA-ES with local meta-models
CMA-ES with local meta-modelszyedb
 
The world of loss function
The world of loss functionThe world of loss function
The world of loss function홍배 김
 
Anomaly detection using deep one class classifier
Anomaly detection using deep one class classifierAnomaly detection using deep one class classifier
Anomaly detection using deep one class classifier홍배 김
 
Monte-Carlo method for Two-Stage SLP
Monte-Carlo method for Two-Stage SLPMonte-Carlo method for Two-Stage SLP
Monte-Carlo method for Two-Stage SLPSSA KPI
 
The multilayer perceptron
The multilayer perceptronThe multilayer perceptron
The multilayer perceptronESCOM
 

Similar a 04 structured support vector machine (20)

Machine learning of structured outputs
Machine learning of structured outputsMachine learning of structured outputs
Machine learning of structured outputs
 
05 history of cv a machine learning (theory) perspective on computer vision
05  history of cv a machine learning (theory) perspective on computer vision05  history of cv a machine learning (theory) perspective on computer vision
05 history of cv a machine learning (theory) perspective on computer vision
 
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
 
Matrix Computations in Machine Learning
Matrix Computations in Machine LearningMatrix Computations in Machine Learning
Matrix Computations in Machine Learning
 
Curve fitting
Curve fittingCurve fitting
Curve fitting
 
CVPR2010: Advanced ITinCVPR in a Nutshell: part 5: Shape, Matching and Diverg...
CVPR2010: Advanced ITinCVPR in a Nutshell: part 5: Shape, Matching and Diverg...CVPR2010: Advanced ITinCVPR in a Nutshell: part 5: Shape, Matching and Diverg...
CVPR2010: Advanced ITinCVPR in a Nutshell: part 5: Shape, Matching and Diverg...
 
Basics of probability in statistical simulation and stochastic programming
Basics of probability in statistical simulation and stochastic programmingBasics of probability in statistical simulation and stochastic programming
Basics of probability in statistical simulation and stochastic programming
 
Linear Machine Learning Models with L2 Regularization and Kernel Tricks
Linear Machine Learning Models with L2 Regularization and Kernel TricksLinear Machine Learning Models with L2 Regularization and Kernel Tricks
Linear Machine Learning Models with L2 Regularization and Kernel Tricks
 
The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)
The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)
The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)
 
01 graphical models
01 graphical models01 graphical models
01 graphical models
 
Decomposition Methods in SLP
Decomposition Methods in SLPDecomposition Methods in SLP
Decomposition Methods in SLP
 
Kernels and Support Vector Machines
Kernels and Support Vector  MachinesKernels and Support Vector  Machines
Kernels and Support Vector Machines
 
1 - Linear Regression
1 - Linear Regression1 - Linear Regression
1 - Linear Regression
 
Neural Processes Family
Neural Processes FamilyNeural Processes Family
Neural Processes Family
 
5.n nmodels i
5.n nmodels i5.n nmodels i
5.n nmodels i
 
CMA-ES with local meta-models
CMA-ES with local meta-modelsCMA-ES with local meta-models
CMA-ES with local meta-models
 
The world of loss function
The world of loss functionThe world of loss function
The world of loss function
 
Anomaly detection using deep one class classifier
Anomaly detection using deep one class classifierAnomaly detection using deep one class classifier
Anomaly detection using deep one class classifier
 
Monte-Carlo method for Two-Stage SLP
Monte-Carlo method for Two-Stage SLPMonte-Carlo method for Two-Stage SLP
Monte-Carlo method for Two-Stage SLP
 
The multilayer perceptron
The multilayer perceptronThe multilayer perceptron
The multilayer perceptron
 

Más de zukun

My lyn tutorial 2009
My lyn tutorial 2009My lyn tutorial 2009
My lyn tutorial 2009zukun
 
ETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVzukun
 
ETHZ CV2012: Information
ETHZ CV2012: InformationETHZ CV2012: Information
ETHZ CV2012: Informationzukun
 
Siwei lyu: natural image statistics
Siwei lyu: natural image statisticsSiwei lyu: natural image statistics
Siwei lyu: natural image statisticszukun
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibrationzukun
 
Brunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionBrunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionzukun
 
Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluationzukun
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-softwarezukun
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptorszukun
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectorszukun
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-introzukun
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video searchzukun
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video searchzukun
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video searchzukun
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learningzukun
 
Advances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionAdvances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionzukun
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick startzukun
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysiszukun
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structureszukun
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities zukun
 

Más de zukun (20)

My lyn tutorial 2009
My lyn tutorial 2009My lyn tutorial 2009
My lyn tutorial 2009
 
ETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCV
 
ETHZ CV2012: Information
ETHZ CV2012: InformationETHZ CV2012: Information
ETHZ CV2012: Information
 
Siwei lyu: natural image statistics
Siwei lyu: natural image statisticsSiwei lyu: natural image statistics
Siwei lyu: natural image statistics
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibration
 
Brunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionBrunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer vision
 
Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluation
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-software
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptors
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectors
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-intro
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video search
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video search
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video search
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learning
 
Advances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionAdvances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer vision
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick start
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysis
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structures
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities
 

Último

Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 

Último (20)

Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 

04 structured support vector machine

  • 6. Task:
        $\min_w\; \lambda \|w\|^2 + \frac{1}{N}\sum_{n=1}^{N} \Delta\big(y^n, \operatorname{argmax}_y g(x^n,y;w)\big)$
    Problem: $\Delta\big(y^n, \operatorname{argmax}_y g(x^n,y;w)\big)$ is discontinuous with respect to $w$.
    Solution: replace $\Delta(y,y')$ with a well-behaved surrogate $\ell(x,y,w)$.
    Typically: an upper bound to $\Delta$, continuous and convex with respect to $w$.
    New task:
        $\min_w\; \lambda \|w\|^2 + \frac{1}{N}\sum_{n=1}^{N} \ell(x^n,y^n,w)$
  • 7.–10. Regularized Risk Minimization
        $\min_w\; \lambda \|w\|^2 + \frac{1}{N}\sum_{n=1}^{N} \ell(x^n,y^n,w)$  =  Regularization + Loss on training data
    Hinge loss: maximum-margin training
        $\ell(x^n,y^n,w) := \max_{y\in\mathcal{Y}} \big[\Delta(y^n,y) + \langle w,\phi(x^n,y)\rangle - \langle w,\phi(x^n,y^n)\rangle\big]$
    $\ell$ is a maximum over linear functions → continuous, convex.
    $\ell$ bounds $\Delta$ from above. Proof: let $\bar y = \operatorname{argmax}_y g(x^n,y,w)$. Then
        $\Delta(y^n,\bar y) \le \Delta(y^n,\bar y) + g(x^n,\bar y,w) - g(x^n,y^n,w) \le \max_{y\in\mathcal{Y}}\big[\Delta(y^n,y) + g(x^n,y,w) - g(x^n,y^n,w)\big]$,
    where the first inequality holds because $g(x^n,\bar y,w) \ge g(x^n,y^n,w)$ by definition of the argmax.
    Alternative, the logistic loss (probabilistic training):
        $\ell(x^n,y^n,w) := \log \sum_{y\in\mathcal{Y}} \exp\big(\langle w,\phi(x^n,y)\rangle - \langle w,\phi(x^n,y^n)\rangle\big)$
  • 11. Structured Output Support Vector Machine
        $\min_w\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N} \max_{y\in\mathcal{Y}}\big[\Delta(y^n,y) + \langle w,\phi(x^n,y)\rangle - \langle w,\phi(x^n,y^n)\rangle\big]$
    Conditional Random Field
        $\min_w\; \frac{\|w\|^2}{2\sigma^2} + \sum_{n=1}^{N} \log\sum_{y\in\mathcal{Y}} \exp\big(\langle w,\phi(x^n,y)\rangle - \langle w,\phi(x^n,y^n)\rangle\big)$
    CRFs and SSVMs have more in common than usually assumed:
    both do regularized risk minimization; $\log\sum_y \exp(\cdot)$ can be interpreted as a soft-max.
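A minimal numpy sketch (not from the slides) contrasting the two surrogates for a single training pair: it assumes $\mathcal{Y}$ is small enough to enumerate and that `scores[y]` already holds $\langle w,\phi(x^n,y)\rangle$; the numbers are purely illustrative.

```python
import numpy as np

def structured_hinge(scores, delta, y_true):
    # max_y [ Delta(y^n, y) + <w, phi(x, y)> - <w, phi(x, y^n)> ]
    return np.max(delta + scores - scores[y_true])

def structured_logistic(scores, y_true):
    # log sum_y exp( <w, phi(x, y)> - <w, phi(x, y^n)> ), the soft-max version
    z = scores - scores[y_true]
    m = z.max()                      # standard log-sum-exp stabilization
    return m + np.log(np.exp(z - m).sum())

scores = np.array([1.2, 0.3, -0.5])   # hypothetical <w, phi(x, y)> for y = 0, 1, 2
delta  = np.array([0.0, 1.0, 1.0])    # Delta(y^n, y) with y^n = 0
print(structured_hinge(scores, delta, 0), structured_logistic(scores, 0))
```

Both quantities are convex in the scores; the hinge term keeps only the most violating output, while the soft-max averages over all of them.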
  • 12. Solving the Training Optimization Problem Numerically
    Structured Output Support Vector Machine:
        $\min_w\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N} \max_{y\in\mathcal{Y}}\big[\Delta(y^n,y) + \langle w,\phi(x^n,y)\rangle - \langle w,\phi(x^n,y^n)\rangle\big]$
    Unconstrained optimization, convex, non-differentiable objective.
  • 13. Structured Output SVM (equivalent formulation):
        $\min_{w,\xi}\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N}\xi^n$
    subject to, for $n = 1,\dots,N$:
        $\max_{y\in\mathcal{Y}}\big[\Delta(y^n,y) + \langle w,\phi(x^n,y)\rangle - \langle w,\phi(x^n,y^n)\rangle\big] \le \xi^n$
    $N$ non-linear constraints, convex, differentiable objective.
  • 14. Structured Output SVM (also equivalent formulation):
        $\min_{w,\xi}\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N}\xi^n$
    subject to, for $n = 1,\dots,N$ and all $y\in\mathcal{Y}$:
        $\Delta(y^n,y) + \langle w,\phi(x^n,y)\rangle - \langle w,\phi(x^n,y^n)\rangle \le \xi^n$
    $N\,|\mathcal{Y}|$ linear constraints, convex, differentiable objective.
  • 15. Example: Multiclass SVM
    $\mathcal{Y} = \{1,2,\dots,K\}$, $\quad \Delta(y,y') = \begin{cases} 1 & \text{for } y \ne y' \\ 0 & \text{otherwise.}\end{cases}$
    $\phi(x,y) = \big(\,[\![y=1]\!]\,\phi(x),\; [\![y=2]\!]\,\phi(x),\; \dots,\; [\![y=K]\!]\,\phi(x)\,\big)$
    Solve:
        $\min_{w,\xi}\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N}\xi^n$
    subject to, for $n = 1,\dots,N$:
        $\langle w,\phi(x^n,y^n)\rangle - \langle w,\phi(x^n,y)\rangle \ge 1 - \xi^n$  for all $y\in\mathcal{Y}\setminus\{y^n\}$.
    Classification: $f(x) = \operatorname{argmax}_{y\in\mathcal{Y}} \langle w,\phi(x,y)\rangle$.
    Crammer–Singer Multiclass SVM
    [K. Crammer, Y. Singer: "On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines", JMLR, 2001]
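A minimal sketch of this joint feature map and the resulting prediction rule, assuming dense features and a small $K$; the weight vector and input below are random placeholders, not learned values.

```python
import numpy as np

def joint_feature(x, y, K):
    """phi(x, y): stack phi(x) into the block belonging to class y, zero elsewhere."""
    D = x.shape[0]
    out = np.zeros(K * D)
    out[y * D:(y + 1) * D] = x
    return out

def predict(w, x, K):
    # f(x) = argmax_y <w, phi(x, y)>
    scores = [np.dot(w, joint_feature(x, y, K)) for y in range(K)]
    return int(np.argmax(scores))

K, D = 3, 4
rng = np.random.default_rng(0)
w = rng.normal(size=K * D)            # hypothetical learned weight vector
x = rng.normal(size=D)
print(predict(w, x, K))
```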
  • 16. Example: Hierarchical SVM
    Hierarchical multiclass loss: $\Delta(y,y') := \frac{1}{2}\,(\text{distance in tree})$
    $\Delta(\text{cat},\text{cat}) = 0$, $\Delta(\text{cat},\text{dog}) = 1$, $\Delta(\text{cat},\text{bus}) = 2$, etc.
    Solve:
        $\min_{w,\xi}\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N}\xi^n$
    subject to, for $n = 1,\dots,N$:
        $\langle w,\phi(x^n,y^n)\rangle - \langle w,\phi(x^n,y)\rangle \ge \Delta(y^n,y) - \xi^n$  for all $y\in\mathcal{Y}$.
    [L. Cai, T. Hofmann: "Hierarchical Document Categorization with Support Vector Machines", ACM CIKM, 2004]
    [A. Binder, K.-R. Müller, M. Kawanabe: "On taxonomies for multi-class image categorization", IJCV, 2011]
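A small sketch of the tree-distance loss, using a made-up toy taxonomy (the `parent` table is a hypothetical example, not from the slides); it reproduces the cat/dog/bus values above.

```python
# Delta(y, y') = 0.5 * (distance in tree), computed from parent pointers.
parent = {"cat": "animal", "dog": "animal", "bus": "vehicle",
          "animal": "root", "vehicle": "root", "root": None}

def ancestors(y):
    path = []
    while y is not None:
        path.append(y)
        y = parent[y]
    return path

def tree_loss(y, y_prime):
    a, b = ancestors(y), ancestors(y_prime)
    common = set(a) & set(b)
    # edges from y up to the lowest common ancestor, plus edges down to y'
    dist = min(a.index(c) + b.index(c) for c in common)
    return 0.5 * dist

print(tree_loss("cat", "cat"), tree_loss("cat", "dog"), tree_loss("cat", "bus"))
# -> 0.0 1.0 2.0
```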
  • 17. Solving the Training Optimization Problem Numerically
    We can solve S-SVM training like CRF training:
        $\min_w\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N} \max_{y\in\mathcal{Y}}\big[\Delta(y^n,y) + \langle w,\phi(x^n,y)\rangle - \langle w,\phi(x^n,y^n)\rangle\big]$
    continuous, unconstrained, convex, non-differentiable
    → we can't use gradient descent directly
    → we'll have to use subgradients
  • 18.–21. Definition (Subgradient)
    Let $f:\mathbb{R}^D \to \mathbb{R}$ be a convex, not necessarily differentiable, function.
    A vector $v\in\mathbb{R}^D$ is called a subgradient of $f$ at $w_0$ if
        $f(w) \ge f(w_0) + \langle v, w - w_0\rangle$  for all $w$.
    [Figures: the affine lower bound $f(w_0)+\langle v,w-w_0\rangle$ touching the graph of $f$ at $w_0$, for several admissible choices of $v$.]
    For differentiable $f$, the gradient $v = \nabla f(w_0)$ is the only subgradient.
  • 22. Subgradient descent works basically like gradient descent:
    Subgradient Descent Minimization – minimize $F(w)$
      require: tolerance $\epsilon > 0$, stepsizes $\eta_t$
      $w_{\mathrm{cur}} \leftarrow 0$
      repeat
        $v \in \partial_w F(w_{\mathrm{cur}})$  (pick any subgradient)
        $w_{\mathrm{cur}} \leftarrow w_{\mathrm{cur}} - \eta_t v$
      until $F$ changed less than $\epsilon$
      return $w_{\mathrm{cur}}$
    Converges to the global minimum, but rather inefficient if $F$ is non-differentiable.
    [Shor, "Minimization methods for non-differentiable functions", Springer, 1985.]
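A minimal sketch of the scheme above on an illustrative objective (chosen here for demonstration, not taken from the slides): $F(w) = \tfrac{1}{2}\|w\|^2 + |w_0 - 1|$, which is convex but non-differentiable at $w_0 = 1$.

```python
import numpy as np

def F(w):
    return 0.5 * np.dot(w, w) + abs(w[0] - 1.0)

def subgradient(w):
    v = w.copy()                        # gradient of the smooth quadratic part
    v[0] += np.sign(w[0] - 1.0)         # any value in [-1, 1] is valid at the kink
    return v

w = np.zeros(2)
for t in range(1, 201):
    w = w - (1.0 / t) * subgradient(w)  # diminishing stepsizes eta_t = 1/t
print(w, F(w))
```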
  • 23.–30. Computing a subgradient:
        $\min_w\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N} \ell^n(w)$
    with $\ell^n(w) = \max_{y} \ell^n_y(w)$ and
        $\ell^n_y(w) := \Delta(y^n,y) + \langle w,\phi(x^n,y)\rangle - \langle w,\phi(x^n,y^n)\rangle$.
    For each $y\in\mathcal{Y}$, $\ell^n_y(w)$ is a linear function of $w$.
    $\ell^n(w) = \max_y \ell^n_y(w)$ is the maximum over all $y\in\mathcal{Y}$, i.e. the upper envelope of these linear functions: piecewise linear and convex.
    [Figures: the linear pieces $\ell^n_y$ and their upper envelope $\ell^n(w)$.]
    Subgradient of $\ell^n$ at $w_0$: find a maximal (active) $\hat y$, use $v = \nabla \ell^n_{\hat y}(w_0) = \phi(x^n,\hat y) - \phi(x^n,y^n)$.
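A minimal sketch of this computation for the multiclass case with 0/1 loss (enumerable $\mathcal{Y}$): find the active $\hat y$ by the loss-augmented argmax, then return the corresponding feature difference. The one-hot-block feature map and random inputs are assumptions for illustration.

```python
import numpy as np

def joint_feature(x, y, K):
    out = np.zeros(K * x.shape[0])
    out[y * x.shape[0]:(y + 1) * x.shape[0]] = x
    return out

def subgradient_ln(w, x, y_true, K):
    # l^n_y(w) = Delta(y^n, y) + <w, phi(x, y)> - <w, phi(x, y^n)>
    vals = [(0.0 if y == y_true else 1.0)
            + w @ joint_feature(x, y, K) - w @ joint_feature(x, y_true, K)
            for y in range(K)]
    y_hat = int(np.argmax(vals))                # active (maximal) label
    return joint_feature(x, y_hat, K) - joint_feature(x, y_true, K)

rng = np.random.default_rng(1)
K, D = 3, 4
print(subgradient_ln(rng.normal(size=K * D), rng.normal(size=D), 0, K))
```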
  • 31. Subgradient Descent S-SVM Training
    input: training pairs $\{(x^1,y^1),\dots,(x^N,y^N)\} \subset \mathcal{X}\times\mathcal{Y}$,
    input: feature map $\phi(x,y)$, loss function $\Delta(y,y')$, regularizer $C$,
    input: number of iterations $T$, stepsizes $\eta_t$ for $t = 1,\dots,T$
     1: $w \leftarrow 0$
     2: for $t = 1,\dots,T$ do
     3:   for $n = 1,\dots,N$ do
     4:     $\hat y \leftarrow \operatorname{argmax}_{y\in\mathcal{Y}}\; \Delta(y^n,y) + \langle w,\phi(x^n,y)\rangle - \langle w,\phi(x^n,y^n)\rangle$
     5:     $v^n \leftarrow \phi(x^n,y^n) - \phi(x^n,\hat y)$
     6:   end for
     7:   $w \leftarrow w - \eta_t\,\big(w - \tfrac{C}{N}\sum_n v^n\big)$
     8: end for
    output: prediction function $f(x) = \operatorname{argmax}_{y\in\mathcal{Y}} \langle w,\phi(x,y)\rangle$.
    Observation: each update of $w$ needs 1 argmax-prediction per example.
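A minimal runnable sketch of the algorithm above, instantiated for the multiclass case (0/1 loss, $\mathcal{Y} = \{0,\dots,K-1\}$); the synthetic data and the one-hot-block feature map are assumptions made here for illustration only.

```python
import numpy as np

def joint_feature(x, y, K):
    D = x.shape[0]
    out = np.zeros(K * D)
    out[y * D:(y + 1) * D] = x
    return out

def loss(y_true, y):
    return 0.0 if y == y_true else 1.0

def train_ssvm_subgradient(X, Y, K, C=1.0, T=100):
    N, D = X.shape
    w = np.zeros(K * D)
    for t in range(1, T + 1):
        eta = 1.0 / t
        V = np.zeros_like(w)
        for x, y_true in zip(X, Y):
            # loss-augmented argmax (line 4 of the pseudocode)
            vals = [loss(y_true, y) + w @ joint_feature(x, y, K) for y in range(K)]
            y_hat = int(np.argmax(vals))
            # v^n = phi(x^n, y^n) - phi(x^n, yhat)   (line 5)
            V += joint_feature(x, y_true, K) - joint_feature(x, y_hat, K)
        w = w - eta * (w - (C / N) * V)               # line 7
    return w

def predict(w, x, K):
    return int(np.argmax([w @ joint_feature(x, y, K) for y in range(K)]))

# tiny synthetic 3-class problem
rng = np.random.default_rng(0)
K, D, N = 3, 5, 60
centers = rng.normal(scale=3.0, size=(K, D))
Y = rng.integers(0, K, size=N)
X = centers[Y] + rng.normal(size=(N, D))
w = train_ssvm_subgradient(X, Y, K, C=10.0, T=200)
acc = np.mean([predict(w, x, K) == y for x, y in zip(X, Y)])
print("training accuracy:", acc)
```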
  • 32. We can use the same tricks as for CRFs, e.g. stochastic updates:
    Stochastic Subgradient Descent S-SVM Training
    input: training pairs $\{(x^1,y^1),\dots,(x^N,y^N)\} \subset \mathcal{X}\times\mathcal{Y}$,
    input: feature map $\phi(x,y)$, loss function $\Delta(y,y')$, regularizer $C$,
    input: number of iterations $T$, stepsizes $\eta_t$ for $t = 1,\dots,T$
     1: $w \leftarrow 0$
     2: for $t = 1,\dots,T$ do
     3:   $(x^n, y^n) \leftarrow$ randomly chosen training example pair
     4:   $\hat y \leftarrow \operatorname{argmax}_{y\in\mathcal{Y}}\; \Delta(y^n,y) + \langle w,\phi(x^n,y)\rangle - \langle w,\phi(x^n,y^n)\rangle$
     5:   $w \leftarrow w - \eta_t\,\big(w - \tfrac{C}{N}\,[\phi(x^n,y^n) - \phi(x^n,\hat y)]\big)$
     6: end for
    output: prediction function $f(x) = \operatorname{argmax}_{y\in\mathcal{Y}} \langle w,\phi(x,y)\rangle$.
    Observation: each update of $w$ needs only 1 argmax-prediction
    (but we'll need many iterations until convergence).
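A minimal sketch of the stochastic variant: each step draws a single example. It takes `joint_feature` and `loss` as arguments, with the assumption that they are the same helpers as in the batch sketch above.

```python
import numpy as np

def train_ssvm_sgd(X, Y, K, joint_feature, loss, C=1.0, T=2000, seed=0):
    N, D = X.shape
    w = np.zeros(K * D)
    rng = np.random.default_rng(seed)
    for t in range(1, T + 1):
        n = rng.integers(N)                            # line 3: random example
        x, y_true = X[n], Y[n]
        vals = [loss(y_true, y) + w @ joint_feature(x, y, K) for y in range(K)]
        y_hat = int(np.argmax(vals))                   # line 4
        v = joint_feature(x, y_true, K) - joint_feature(x, y_hat, K)
        w = w - (1.0 / t) * (w - (C / N) * v)          # line 5
    return w
```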
  • 33.–36. Solving the Training Optimization Problem Numerically
    We can solve an S-SVM like a linear SVM. One of the equivalent formulations was:
        $\min_{w\in\mathbb{R}^D,\,\xi\in\mathbb{R}^N_+}\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N}\xi^n$
    subject to, for $n = 1,\dots,N$ and all $y\in\mathcal{Y}$:
        $\langle w,\phi(x^n,y^n)\rangle - \langle w,\phi(x^n,y)\rangle \ge \Delta(y^n,y) - \xi^n$.
    Introduce difference feature vectors $\delta\phi(x^n,y^n,y) := \phi(x^n,y^n) - \phi(x^n,y)$, so the constraints become
        $\langle w, \delta\phi(x^n,y^n,y)\rangle \ge \Delta(y^n,y) - \xi^n$.
    This has the same structure as an ordinary SVM: quadratic objective, linear constraints.
    Question: Can't we use an ordinary SVM/QP solver?
    Answer: Almost! We could, if there weren't $N\,|\mathcal{Y}|$ constraints.
    E.g. 100 binary $16\times 16$ images: $10^{79}$ constraints.
  • 37.–39. Solution: working set training
    It's enough if we enforce the active constraints; the others will be fulfilled automatically.
    We don't know which ones are active for the optimal solution,
    but it's likely to be only a small number (this can of course be formalized).
    Keep a set of potentially active constraints and update it iteratively:
    Working Set Training
      Start with working set $S = \emptyset$ (no constraints).
      Repeat until convergence:
        Solve the S-SVM training problem with constraints from $S$ only.
        Check whether the solution violates any constraint of the full constraint set.
          if no: we found the optimal solution, terminate.
          if yes: add the most violated constraints to $S$, iterate.
    Good practical performance and theoretical guarantees:
    polynomial-time convergence to a solution $\epsilon$-close to the global optimum.
    [Tsochantaridis et al., "Large Margin Methods for Structured and Interdependent Output Variables", JMLR, 2005.]
  • 40. Working Set S-SVM Training
    input: training pairs $\{(x^1,y^1),\dots,(x^N,y^N)\} \subset \mathcal{X}\times\mathcal{Y}$,
    input: feature map $\phi(x,y)$, loss function $\Delta(y,y')$, regularizer $C$
     1: $S \leftarrow \emptyset$
     2: repeat
     3:   $(w,\xi) \leftarrow$ solution to the QP with only the constraints from $S$
     4:   for $n = 1,\dots,N$ do
     5:     $\hat y \leftarrow \operatorname{argmax}_{y\in\mathcal{Y}}\; \Delta(y^n,y) + \langle w,\phi(x^n,y)\rangle$
     6:     if $\hat y \ne y^n$ then
     7:       $S \leftarrow S \cup \{(x^n,\hat y)\}$
     8:     end if
     9:   end for
    10: until $S$ doesn't change anymore.
    output: prediction function $f(x) = \operatorname{argmax}_{y\in\mathcal{Y}} \langle w,\phi(x,y)\rangle$.
    Observation: each update of $w$ needs 1 argmax-prediction per example
    (but we solve globally for the next $w$, not by local steps).
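A minimal sketch of this loop for the multiclass case. A faithful implementation would solve the restricted QP exactly; to stay dependency-free, this sketch approximately solves the restricted problem by subgradient descent on its unconstrained hinge form. The feature map, 0/1 loss, and synthetic data are assumptions for illustration.

```python
import numpy as np

def joint_feature(x, y, K):
    out = np.zeros(K * x.shape[0])
    out[y * x.shape[0]:(y + 1) * x.shape[0]] = x
    return out

def loss(y_true, y):
    return 0.0 if y == y_true else 1.0

def solve_restricted(X, Y, S, K, C, T=200):
    # approx. minimize 1/2||w||^2 + C/N * sum_n max(0, max_{yhat in S_n} l^n_yhat(w))
    N, D = X.shape
    w = np.zeros(K * D)
    for t in range(1, T + 1):
        V = np.zeros_like(w)
        for n in range(N):
            if not S[n]:
                continue
            x, y_true = X[n], Y[n]
            diffs = [joint_feature(x, y, K) - joint_feature(x, y_true, K) for y in S[n]]
            vals = [loss(y_true, y) + w @ d for y, d in zip(S[n], diffs)]
            j = int(np.argmax(vals))
            if vals[j] > 0:
                V += diffs[j]
        w = w - (1.0 / t) * (w + (C / N) * V)
    return w

def working_set_training(X, Y, K, C=10.0, rounds=20):
    N = X.shape[0]
    S = [[] for _ in range(N)]             # working set: candidate outputs per example
    w = np.zeros(K * X.shape[1])
    for _ in range(rounds):
        w = solve_restricted(X, Y, S, K, C)
        changed = False
        for n in range(N):
            x, y_true = X[n], Y[n]
            # most violated constraint: loss-augmented argmax
            vals = [loss(y_true, y) + w @ joint_feature(x, y, K) for y in range(K)]
            y_hat = int(np.argmax(vals))
            if y_hat != y_true and y_hat not in S[n]:
                S[n].append(y_hat)
                changed = True
        if not changed:
            break                           # no new violated constraints: done
    return w

rng = np.random.default_rng(0)
K, D, N = 3, 5, 30
centers = rng.normal(scale=3.0, size=(K, D))
Yd = rng.integers(0, K, size=N)
Xd = centers[Yd] + rng.normal(size=(N, D))
w = working_set_training(Xd, Yd, K)
acc = np.mean([int(np.argmax([w @ joint_feature(x, y, K) for y in range(K)])) == y
               for x, y in zip(Xd, Yd)])
print("training accuracy:", acc)
```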
  • 41.–42. One-Slack Formulation of S-SVM
    (equivalent to the ordinary S-SVM formulation by $\xi = \frac{1}{N}\sum_n \xi^n$)
        $\min_{w\in\mathbb{R}^D,\,\xi\in\mathbb{R}_+}\; \frac{1}{2}\|w\|^2 + C\,\xi$
    subject to, for all $(\hat y^1,\dots,\hat y^N) \in \mathcal{Y}\times\dots\times\mathcal{Y}$:
        $\sum_{n=1}^{N}\big[\Delta(y^n,\hat y^n) + \langle w,\phi(x^n,\hat y^n)\rangle - \langle w,\phi(x^n,y^n)\rangle\big] \le N\,\xi$.
    $|\mathcal{Y}|^N$ linear constraints, convex, differentiable objective.
    We blew up the constraint set even further:
    100 binary $16\times 16$ images: $10^{177}$ constraints (instead of $10^{79}$).
  • 43. Working Set One-Slack S-SVM Training
    input: training pairs $\{(x^1,y^1),\dots,(x^N,y^N)\} \subset \mathcal{X}\times\mathcal{Y}$,
    input: feature map $\phi(x,y)$, loss function $\Delta(y,y')$, regularizer $C$
     1: $S \leftarrow \emptyset$
     2: repeat
     3:   $(w,\xi) \leftarrow$ solution to the QP with only the constraints from $S$
     4:   for $n = 1,\dots,N$ do
     5:     $\hat y^n \leftarrow \operatorname{argmax}_{y\in\mathcal{Y}}\; \Delta(y^n,y) + \langle w,\phi(x^n,y)\rangle$
     6:   end for
     7:   $S \leftarrow S \cup \big\{\,\big((x^1,\dots,x^N),(\hat y^1,\dots,\hat y^N)\big)\,\big\}$
     8: until $S$ doesn't change anymore.
    output: prediction function $f(x) = \operatorname{argmax}_{y\in\mathcal{Y}} \langle w,\phi(x,y)\rangle$.
    Often faster convergence: we add one strong constraint per iteration instead of $N$ weak ones.
  • 44. We can solve an S-SVM like a non-linear SVM: compute the Lagrangian dual.
    min becomes max; the original (primal) variables $w, \xi$ disappear;
    new (dual) variables $\alpha_{ny}$: one per constraint of the original problem.
    Dual S-SVM problem
        $\max_{\alpha\in\mathbb{R}^{N|\mathcal{Y}|}_+}\; \sum_{\substack{n=1,\dots,N\\ y\in\mathcal{Y}}} \alpha_{ny}\,\Delta(y^n,y)\;-\;\frac{1}{2}\sum_{\substack{n,\bar n=1,\dots,N\\ y,\bar y\in\mathcal{Y}}} \alpha_{ny}\,\alpha_{\bar n\bar y}\,\big\langle \delta\phi(x^n,y^n,y),\, \delta\phi(x^{\bar n},y^{\bar n},\bar y)\big\rangle$
    subject to, for $n = 1,\dots,N$:
        $\sum_{y\in\mathcal{Y}} \alpha_{ny} \le \frac{C}{N}$.
    $N$ linear constraints, convex, differentiable objective, $N\,|\mathcal{Y}|$ variables.
  • 45. We can kernelize:
    Define a joint kernel function $k : (\mathcal{X}\times\mathcal{Y})\times(\mathcal{X}\times\mathcal{Y}) \to \mathbb{R}$,
        $k\big((x,y),(\bar x,\bar y)\big) = \big\langle \phi(x,y), \phi(\bar x,\bar y)\big\rangle$.
    $k$ measures the similarity between two (input, output)-pairs.
    We can express the optimization in terms of $k$:
        $\big\langle \delta\phi(x^n,y^n,y),\, \delta\phi(x^{\bar n},y^{\bar n},\bar y)\big\rangle$
        $= \big\langle \phi(x^n,y^n)-\phi(x^n,y),\; \phi(x^{\bar n},y^{\bar n})-\phi(x^{\bar n},\bar y)\big\rangle$
        $= k\big((x^n,y^n),(x^{\bar n},y^{\bar n})\big) - k\big((x^n,y^n),(x^{\bar n},\bar y)\big) - k\big((x^n,y),(x^{\bar n},y^{\bar n})\big) + k\big((x^n,y),(x^{\bar n},\bar y)\big)$
        $=: K_{n\bar n y\bar y}$
  • 46. Kernelized S-SVM problem:
        $\max_{\alpha\in\mathbb{R}^{N|\mathcal{Y}|}_+}\; \sum_{\substack{n=1,\dots,N\\ y\in\mathcal{Y}}} \alpha_{ny}\,\Delta(y^n,y)\;-\;\frac{1}{2}\sum_{\substack{n,\bar n=1,\dots,N\\ y,\bar y\in\mathcal{Y}}} \alpha_{ny}\,\alpha_{\bar n\bar y}\, K_{n\bar n y\bar y}$
    subject to, for $n = 1,\dots,N$:
        $\sum_{y\in\mathcal{Y}} \alpha_{ny} \le \frac{C}{N}$.
    Too many variables: train with a working set of $\alpha_{ny}$.
    Kernelized prediction function (expanding $w = \sum_{n,\bar y}\alpha_{n\bar y}\,\delta\phi(x^n,y^n,\bar y)$):
        $f(x) = \operatorname{argmax}_{y\in\mathcal{Y}} \sum_{n,\bar y} \alpha_{n\bar y}\,\big[k\big((x^n,y^n),(x,y)\big) - k\big((x^n,\bar y),(x,y)\big)\big]$
  • 47. What do "joint kernel functions" look like?
        $k\big((x,y),(\bar x,\bar y)\big) = \big\langle \phi(x,y), \phi(\bar x,\bar y)\big\rangle$
    As in graphical models: easier if $\phi$ decomposes with respect to factors:
        $\phi(x,y) = \big(\phi_F(x,y_F)\big)_{F\in\mathcal{F}}$
    Then the kernel $k$ decomposes into a sum over factors:
        $k\big((x,y),(\bar x,\bar y)\big) = \big\langle \big(\phi_F(x,y_F)\big)_{F\in\mathcal{F}},\, \big(\phi_F(\bar x,\bar y_F)\big)_{F\in\mathcal{F}} \big\rangle = \sum_{F\in\mathcal{F}} \big\langle \phi_F(x,y_F), \phi_F(\bar x,\bar y_F)\big\rangle = \sum_{F\in\mathcal{F}} k_F\big((x,y_F),(\bar x,\bar y_F)\big)$
    We can define kernels for each factor (e.g. nonlinear).
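A minimal sketch of such a factorized joint kernel, assuming a chain-structured labelling with RBF unary kernels and indicator pairwise kernels (both the structure and the RBF choice are illustrative assumptions, not prescribed by the slides):

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def joint_kernel(x, y, x2, y2):
    # k((x,y),(x',y')) = sum_p k(x_p, x'_p)[[y_p = y'_p]] + sum_pq [[y_p = y'_p]][[y_q = y'_q]]
    P = len(y)
    k = 0.0
    for p in range(P):                                  # unary factors
        k += rbf(x[p], x2[p]) * (y[p] == y2[p])
    for p in range(P - 1):                              # pairwise factors on a chain
        k += (y[p] == y2[p]) * (y[p + 1] == y2[p + 1])
    return k

rng = np.random.default_rng(0)
x  = rng.normal(size=(4, 3))    # 4 nodes, 3-dim local features each
x2 = rng.normal(size=(4, 3))
y  = np.array([0, 1, 1, 0])
y2 = np.array([0, 1, 0, 0])
print(joint_kernel(x, y, x2, y2))
```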
  • 48. Example: figure-ground segmentation with grid structure
    $(x,y)\ \hat=\ $(image, binary segmentation mask)
    Typical kernels: arbitrary in $x$, linear (or at least simple) with respect to $y$:
    Unary factors:
        $k_p\big((x_p,y_p),(x'_p,y'_p)\big) = k(x_p,x'_p)\,[\![y_p = y'_p]\!]$
    with $k(x_p,x'_p)$ a local image kernel, e.g. $\chi^2$ or histogram intersection.
    Pairwise factors:
        $k_{pq}\big((y_p,y_q),(y'_p,y'_q)\big) = [\![y_p = y'_p]\!]\,[\![y_q = y'_q]\!]$
    More powerful than all-linear, and argmax-prediction is still possible.
  • 49. Example: object localization
    $(x,y)\ \hat=\ $(image, bounding box given by left, top, right, bottom)
    Only one factor that includes all of $x$ and $y$:
        $k\big((x,y),(x',y')\big) = k_{\mathrm{image}}\big(x|_y,\, x'|_{y'}\big)$
    with $k_{\mathrm{image}}$ an image kernel and $x|_y$ the image region within box $y$.
    argmax-prediction is as difficult as object localization with a $k_{\mathrm{image}}$-SVM.
  • 50.–51. Summary – S-SVM Learning
    Given: training set $\{(x^1,y^1),\dots,(x^N,y^N)\} \subset \mathcal{X}\times\mathcal{Y}$, loss function $\Delta:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}$.
    Task: learn a parameter vector $w$ for $f(x) := \operatorname{argmax}_y \langle w,\phi(x,y)\rangle$ that minimizes the expected loss on future data.
    The S-SVM solution is derived from the maximum-margin framework:
    enforce the correct output to be better than all others by a margin:
        $\langle w,\phi(x^n,y^n)\rangle \ge \Delta(y^n,y) + \langle w,\phi(x^n,y)\rangle$  for all $y\in\mathcal{Y}$.
    convex optimization problem, but non-differentiable
    many equivalent formulations → different training algorithms
    training needs repeated argmax prediction, no probabilistic inference
  • 52. Extra I: Beyond Fully Supervised Learning
    So far, training was fully supervised: all variables were observed.
    In real life, some variables are unobserved even during training:
    missing labels in the training data
    latent variables, e.g. part locations
    latent variables, e.g. part occlusion
    latent variables, e.g. viewpoint
  • 53.–54. Three types of variables:
    $x\in\mathcal{X}$ always observed,
    $y\in\mathcal{Y}$ observed only during training,
    $z\in\mathcal{Z}$ never observed (latent).
    Decision function: $f(x) = \operatorname{argmax}_{y\in\mathcal{Y}} \max_{z\in\mathcal{Z}} \langle w,\phi(x,y,z)\rangle$
    Maximum Margin Training with Maximization over Latent Variables
    Solve:
        $\min_{w,\xi}\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N}\xi^n$
    subject to, for $n = 1,\dots,N$ and all $y\in\mathcal{Y}$:
        $\Delta(y^n,y) + \max_{z\in\mathcal{Z}}\langle w,\phi(x^n,y,z)\rangle - \max_{z\in\mathcal{Z}}\langle w,\phi(x^n,y^n,z)\rangle \le \xi^n$
    Problem: not a convex problem → can have local minima.
    [C. Yu, T. Joachims, "Learning Structural SVMs with Latent Variables", ICML, 2009]
    similar idea: [Felzenszwalb, McAllester, Ramanan: "A Discriminatively Trained, Multiscale, Deformable Part Model", CVPR 2008]
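A minimal toy sketch of the latent decision rule $f(x) = \operatorname{argmax}_y \max_z \langle w,\phi(x,y,z)\rangle$ and of latent imputation for the ground truth (the usual building block of alternating schemes for such non-convex objectives); the enumerable $\mathcal{Y}$, $\mathcal{Z}$, the cyclic-shift feature map, and the weight vector below are made-up assumptions for illustration.

```python
import numpy as np

def score(w, phi, x, y, z):
    return w @ phi(x, y, z)

def predict_latent(w, phi, x, Y, Z):
    # f(x) = argmax_y max_z <w, phi(x, y, z)>
    best = max(((y, z) for y in Y for z in Z),
               key=lambda yz: score(w, phi, x, yz[0], yz[1]))
    return best  # returns (y, z); discard z if only the label is needed

def impute_latent(w, phi, x, y_true, Z):
    # z^n = argmax_z <w, phi(x^n, y^n, z)>, completing the ground-truth label
    return max(Z, key=lambda z: score(w, phi, x, y_true, z))

# toy example: x in R^2, y in {0, 1}, latent shift z in {-1, 0, 1}
Y, Z = [0, 1], [-1, 0, 1]
def phi(x, y, z):
    v = np.zeros(4)
    v[2 * y:2 * y + 2] = np.roll(x, z)   # latent z picks a cyclic shift of x
    return v

w = np.array([1.0, 0.0, 0.0, 1.0])
print(predict_latent(w, phi, np.array([2.0, -1.0]), Y, Z))
print(impute_latent(w, phi, np.array([2.0, -1.0]), 1, Z))
```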
  • 55. Structured Learning is Full of Open Research Questions
    How to train faster? CRFs need many runs of probabilistic inference, SSVMs need many runs of argmax-prediction.
    How to reduce the necessary amount of training data? Semi-supervised learning? Transfer learning?
    How can we better understand different loss functions? When to use probabilistic training, when maximum margin? CRFs are "consistent", SSVMs are not. Is this relevant?
    Can we understand structured learning with approximate inference? Often computing $\mathcal{L}(w)$ or $\operatorname{argmax}_y \langle w,\phi(x,y)\rangle$ exactly is infeasible. Can we guarantee good results even with approximate inference?
    More and new applications!
  • 56. Lunch Break
    Continuing at 13:30.
    Slides available at http://www.nowozin.net/sebastian/cvpr2011tutorial/