Visual Recognition: Learning Features

Raquel Urtasun

TTI Chicago

Feb 21, 2012
Learning Representations




    Sparse coding
    Deep architectures
    Topic models




Image Classification using BoW

Dictionary Learning: Dense/Sparse SIFT → K-means → Dictionary

Image Classification: Dense/Sparse SIFT → VQ Coding → Spatial Pyramid Pooling → Nonlinear SVM

[Source: K. Yu]
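
To make the diagram concrete, here is a minimal sketch of the two-stage pipeline, assuming scikit-learn is available; random vectors stand in for dense SIFT descriptors, and a plain histogram replaces the full spatial pyramid pooling:

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Dictionary learning: k-means on descriptors pooled from many images
descs = rng.standard_normal((5000, 128))            # stand-in for dense SIFT
kmeans = MiniBatchKMeans(n_clusters=256, n_init=3).fit(descs)

def bow_histogram(image_descs):
    """VQ-code each descriptor to its nearest center, pool into a histogram."""
    words = kmeans.predict(image_descs)
    hist = np.bincount(words, minlength=256).astype(float)
    return hist / hist.sum()

# Image classification: one histogram per image, then a nonlinear SVM
X = np.stack([bow_histogram(rng.standard_normal((300, 128)))
              for _ in range(40)])
y = rng.integers(0, 2, size=40)                     # toy binary labels
clf = SVC(kernel='rbf').fit(X, y)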
BoW+SPM: the architecture

     Nonlinear SVM is not scalable
     VQ coding may be too coarse
     Average pooling is not optimal
     Why not learn the whole thing?




Local Gradients → Pooling (e.g., SIFT, HOG) → VQ Coding → Average Pooling (obtain histogram) → Nonlinear SVM


[Source: K. Yu]
Sparse Architecture

    Nonlinear SVM is not scalable → Scalable linear classifier
    VQ coding may be too coarse → Better coding methods
    Average pooling is not optimal → Better pooling methods
    Why not learn the whole thing? → Deep learning




Local Gradients → Pooling (e.g., SIFT, HOG) → Better Coding → Better Pooling → Scalable Linear Classifier


[Source: A. Ng]
Feature learning problem
    Given a 14 × 14 image patch x, we can represent it using 196 real numbers.
    Problem: Can we learn a better representation for it?




    Given a set of images, learn a better way to represent images than raw pixels.




[Source: A. Ng]
Learning an image representation



    Sparse coding [Olshausen & Field, 1996]
    Input: images x^(1), x^(2), ..., x^(m) (each in ℝ^{n×n})
    Learn: a dictionary of bases φ_1, ..., φ_k (also in ℝ^{n×n}), so that each input x can
    be approximately decomposed as

        x ≈ Σ_{j=1}^{k} a_j φ_j

    such that the a_j's are mostly zero, i.e., sparse

[Source: A. Ng]




Sparse Coding Illustration

         !!!!"#$%&#'!()#*+,!                 -+#&.+/!0#,+,!1!!"#"$#"!%&23!!45/*+,6!




     D+,$!+E#)A'+!

                     " 0.8 *            + 0.3 *                + 0.5 *

             !       " 0.8 *     !'%    + 0.3 *        !&(     + 0.5 *    !%'"
                 !789!89!:9!89!!"#9!89!:9!89!!"$9!89!:9!89!!"%9!:;!!
                                                                     FC)A#G$!H!+#,I'J!
                  <!7#=9!:9!#>?;!!!!1@+#$%&+!&+A&+,+.$#BC.2!!          I.$+&A&+$#0'+!


[Source: A. Ng]
More examples

    The method hypothesizes that edge-like patches are the most basic elements of a
    scene, and represents an image in terms of the edges that appear in it.
    Used to obtain a more compact, higher-level representation of the scene than
    raw pixels.




[Source: A. Ng]
Sparse Coding details


    Input: images x^(1), x^(2), ..., x^(m) (each in ℝ^{n×n})
    Obtain the dictionary elements and weights by

        min_{a,φ} Σ_{i=1}^{m} [ ||x^(i) − Σ_{j=1}^{k} a_j^(i) φ_j||²₂ + λ Σ_{j=1}^{k} |a_j^(i)| ]

    where the λ term encourages sparsity.

    Alternating minimization with respect to the φ_j's and the a's: the step over
    the a's is harder, while the step over the φ_j's has a closed-form solution.
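
A minimal sketch of this alternating scheme, assuming ISTA for the harder step over a and the closed-form least-squares update for φ; the data here are random stand-ins for image patches:

import numpy as np

def ista(x, Phi, lam, n_iter=100):
    """Solve min_a ||x - Phi a||^2 + lam * ||a||_1 by iterative soft-thresholding."""
    a = np.zeros(Phi.shape[1])
    L = np.linalg.norm(Phi, 2) ** 2          # squared spectral norm of Phi
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ a - x)         # half the gradient of the quadratic
        a = a - grad / L                     # gradient step (step size 1/(2L))
        a = np.sign(a) * np.maximum(np.abs(a) - lam / (2 * L), 0.0)  # soft threshold
    return a

def dictionary_step(X, A):
    """Closed-form update: min_Phi ||X - Phi A||_F^2, then renormalize columns."""
    Phi = X @ A.T @ np.linalg.pinv(A @ A.T)
    Phi /= np.linalg.norm(Phi, axis=0, keepdims=True) + 1e-12
    return Phi

rng = np.random.default_rng(0)
n, m, k, lam = 64, 500, 128, 0.1
X = rng.standard_normal((n, m))              # columns are (stand-in) patches
Phi = rng.standard_normal((n, k))
Phi /= np.linalg.norm(Phi, axis=0, keepdims=True)
for _ in range(5):                           # alternate the a-step and the phi-step
    A = np.column_stack([ista(X[:, i], Phi, lam) for i in range(m)])
    Phi = dictionary_step(X, A)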




Fast algorithms
    Solving for a is expensive.
    Simple algorithm that works well
        Repeatedly guess sign (+, - or 0) of each of the ai ’s.
        Solve for ai ’s in closed form. Refine guess for signs.
    Other algorithms such as projected gradient descent, stochastic subgradient
    descent, etc.




[Source: A. Ng]
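
A sketch of just the inner closed-form step this scheme relies on (the active-set bookkeeping of the full sign-search algorithm is omitted; solve_given_signs is a hypothetical helper): once the signs s of the active coefficients are fixed, the ℓ1 term becomes linear and the subproblem is an unconstrained quadratic:

import numpy as np

def solve_given_signs(x, Phi_S, s, lam):
    """Minimize ||x - Phi_S a||^2 + lam * s^T a for a fixed sign vector s."""
    G = Phi_S.T @ Phi_S                       # Gram matrix of the active bases
    return np.linalg.solve(G, Phi_S.T @ x - 0.5 * lam * s)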
Recap of sparse coding for feature learning
Training:
     Input: images x^(1), x^(2), ..., x^(m) (each in ℝ^{n×n})
     Learn: a dictionary of bases φ_1, ..., φ_k (also in ℝ^{n×n}):

         min_{a,φ} Σ_{i=1}^{m} ||x^(i) − Σ_{j=1}^{k} a_j^(i) φ_j||² + λ Σ_{j=1}^{k} |a_j^(i)|


Test time:
     Input: novel x^(1), x^(2), ..., x^(m) and the learned φ_1, ..., φ_k.
     Solve for the representation a_1, ..., a_k of each example:

         min_a ||x − Σ_{j=1}^{k} a_j φ_j||² + λ Σ_{j=1}^{k} |a_j|
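
For reference, a sketch of the same train/test split using scikit-learn, assuming it is available; the data are random stand-ins for 14 × 14 patches:

import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

rng = np.random.default_rng(0)
X_train = rng.standard_normal((500, 196))    # 500 patches of 14 x 14 pixels
X_test = rng.standard_normal((20, 196))      # novel patches

# Training: alternating minimization over codes and dictionary
dico = DictionaryLearning(n_components=64, alpha=1.0, max_iter=20)
dico.fit(X_train)

# Test time: solve the lasso for each novel x with the dictionary fixed
A = sparse_encode(X_test, dico.components_, algorithm='lasso_lars', alpha=1.0)
print(A.shape)                               # (20, 64), mostly zeros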



Sparse coding recap




    Much better than pixel representation, but still not competitive with SIFT,
    etc.
    Three ways to make it competitive:
            Combine this with SIFT.
            Advanced versions of sparse coding, e.g., LCC.
            Deep learning.

[Source: A. Ng]




Sparse Classification




[Source: A. Ng]
K-means vs sparse coding




[Source: A. Ng]
Why does sparse coding help classification?

     The coding is a nonlinear feature mapping
     Represent data in a higher dimensional space
     Sparsity makes prominent patterns more distinctive




[Source: K. Yu]
A topic model view of sparse coding


     Each basis is a direction or a topic.
     Sparsity: each datum is a linear combination of only a few bases.
     Applicable to image denoising, inpainting, and super-resolution.


(Figure: each basis is a direction/topic; both figures adapted from the CVPR10
tutorial by F. Bach, J. Mairal, J. Ponce and G. Sapiro.)



[Source: K. Yu]


A geometric view of sparse coding


     Each basis acts like a pseudo data point, an anchor point.
     Sparsity: each datum is a sparse combination of neighboring anchors.
     The coding scheme exploits the manifold structure of the data.


(Figure: data points on the data manifold, with bases as anchors.)




[Source: K. Yu]


Influence of Sparsity
    When SC achieves the best classification accuracy, the learned bases are like
    digits – each basis has a clear local class association.




          Error: 4.54%            Error: 3.75%               Error: 2.64%




The image classification setting for analysis
     Learning an image classifier amounts to learning nonlinear functions on
     patches:

         f(x) = w^T x = Σ_{i=1}^{m} a_i (w^T φ^(i)) = Σ_{i=1}^{m} a_i f(φ^(i))

     where x = Σ_{i=1}^{m} a_i φ^(i); on the left f acts on images, on the right
     on patches.




Dense local features → Sparse Coding → Linear Pooling → Linear SVM


[Source: K. Yu]
Nonlinear learning via local coding
     We assume x_i ≈ Σ_{j=1}^{k} a_{i,j} φ_j and thus

         f(x_i) = Σ_{j=1}^{k} a_{i,j} f(φ_j)



(Figure: locally linear approximation; data points and bases.)


[Source: K. Yu]
How to learn the non-linear function

  1    Learning the dictionary φ1 , · · · , φk from unlabeled data
  2    Use the dictionary to encode data xi → ai,1 , · · · , ai,k




  3    Estimate the parameters




Nonlinear local learning via learning a global linear function
Local Coordinate Coding (LCC):

     If f(x) is (α, β)-Lipschitz smooth, the function approximation error is
     bounded by a coding error term plus a locality term.
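
The bound itself is not legible in the extracted slide; from the Yu et al. 09 paper (the p = 1 case of their smoothness condition), it reads:

    |f(x) − Σ_j a_j f(φ_j)|  ≤  α ||x − Σ_j a_j φ_j||  +  β Σ_j |a_j| ||φ_j − Σ_{j'} a_{j'} φ_{j'}||²
    (function approximation error)      (coding error)              (locality term)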

A good coding scheme should [Yu et al. 09]
     have a small coding error,
     and also be sufficiently local
[Source: K. Yu]
Results

     The larger the dictionary, the higher the accuracy, but also the higher the
     computational cost.




                                    (Yu et al. 09)




                                   (Yang et al. 09)

     The same observation holds for Caltech256, PASCAL, and ImageNet.

[Source: K. Yu]
Locality-constrained linear coding




     A fast implementation of LCC [Wang et al. 10]
     Dictionary learning uses k-means; the code for x is obtained in two steps:
   Step 1 ensure locality: find the K nearest bases [φ_j]_{j∈J(x)}
   Step 2 ensure a low coding error:

             min_a ||x − Σ_{j∈J(x)} a_{i,j} φ_j||²,   s.t.   Σ_{j∈J(x)} a_{i,j} = 1


[Source: K. Yu]
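
A minimal sketch of this two-step coding for one descriptor, using the analytical solution from Wang et al. 10 (solve the local covariance system, then normalize to meet the sum-to-one constraint); the regularizer eps is an assumption for numerical stability:

import numpy as np

def llc_code(x, D, K=5, eps=1e-4):
    """LLC code for descriptor x given dictionary D (rows are bases)."""
    d2 = np.sum((D - x) ** 2, axis=1)        # squared distances to all bases
    J = np.argsort(d2)[:K]                   # Step 1: the K nearest bases
    B = D[J] - x                             # shift the local bases to x
    C = B @ B.T                              # local covariance
    C += eps * np.trace(C) * np.eye(K)       # regularize for stability
    w = np.linalg.solve(C, np.ones(K))       # Step 2: solve C w = 1
    w /= w.sum()                             # enforce the sum-to-one constraint
    code = np.zeros(D.shape[0])
    code[J] = w
    return code

rng = np.random.default_rng(0)
D = rng.standard_normal((128, 64))           # 128 bases, 64-dim descriptors
x = rng.standard_normal(64)
a = llc_code(x, D)                           # sparse: only K = 5 nonzeros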




Results




[Source: K. Yu]

Interpretation of BoW + linear classifier

Piecewise locally constant (zero-order) approximation.

(Figure: data points and cluster centers.)



[Source: K. Yu]
Support vector coding [Zhou et al, 10]

Local tangents: piecewise locally linear (first-order) approximation.

(Figure: data points and cluster centers.)



[Source: K. Yu]
Details...


     Let [a_{i,1}, · · · , a_{i,k}] be the VQ coding of x_i




(Figure: super-vector codes of the data, with global linear weights to be
learned.)


     No.1 position in PASCAL VOC 2009

[Source: K. Yu]
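
The slide's formula is not reproduced; a hedged sketch of the super-vector code for one descriptor, based on a reading of Zhou et al. 10 rather than on the slide: the hard VQ assignment contributes a weighted zero-order entry s·a_k plus a first-order term a_k (x − μ_k) per cluster:

import numpy as np

def super_vector(x, centers, s=1.0):
    """Super-vector code of x against k-means centers (assumed form)."""
    k = np.argmin(np.sum((centers - x) ** 2, axis=1))   # VQ: nearest center
    parts = []
    for j, mu in enumerate(centers):
        a = 1.0 if j == k else 0.0                      # hard assignment
        parts.append(np.concatenate(([s * a], a * (x - mu))))
    return np.concatenate(parts)                        # length K * (d + 1)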


Summary of coding algorithms

     All lead to higher-dimensional, sparse, and localized coding
     All explore geometric structure of data
     New coding methods are suitable for linear classifiers.
     Their implementations are quite straightforward.




[Source: K. Yu]
PASCAL VOC 2009
      No.1 for 18 of 20 categories
      PASCAL: 20 categories, 10,000 images
       They use only HOG features on gray images

(Table: per-class accuracy, ours vs. the best of the other teams, and the
difference; not reproduced.)
[Source: K. Yu]
ImageNet Large-scale Visual Recognition Challenge 2010




                      ImageNet: 1000 categories, 1.2 million images for training


[Source: K. Yu]
ImageNet Large-scale Visual Recognition Challenge 2010

     150 teams registered worldwide, resulting in 37 submissions from 11 teams.
     SC achieved 52% accuracy on the 1000-class classification task.


             Teams                                              Top 5 Hit Rate
             Our team: NEC-UIUC                                     72.8%
             Xerox European Lab, France                             66.4%
             Univ. of Tokyo                                         55.4%
             Univ. of California, Irvine                            53.4%
             MIT                                                    45.6%
             NTU, Singapore                                         41.7%
             LIG, France                                            39.3%
             IBM T. J. Watson Research Center                       30.0%
             National Institute of Informatics, Tokyo               25.8%
             SRI (Stanford Research Institute)                      24.9%


[Source: K. Yu]
Learning Codebooks for Image Classification




Replacing Vector Quantization by Learned Dictionaries
     unsupervised: [Yang et al., 2009]
     supervised: [Boureau et al., 2010, Yang et al., 2010]
[Source: Mairal]



Discriminative Training




[Source: Mairal]
Discriminative Dictionaries




Figure: Top: reconstructive, Bottom: discriminative, Left: Bicycle, Right:
Background

[Source: Mairal]
Application: Edge detection




[Source: Mairal]


Application: Edge detection




[Source: Mairal]


Application: Authentic from fake




[Source: Mairal]
Predictive Sparse Decomposition (PSD)


     Feed-forward predictor function for feature extraction
                                                                                                 
         min_{a,φ,K} Σ_{i=1}^{m} [ ||x^(i) − Σ_{j=1}^{k} a_j^(i) φ_j||² + λ Σ_{j=1}^{k} |a_j^(i)| + β ||a^(i) − C(x^(i), K)||²₂ ]

      with, e.g., C(X, K) = g · tanh(x ∗ k)

Learning is done by
 1) Fix K and a, minimize to get optimal φ
 2) Update a and K using optimal φ
 3) Scale elements of φ to be unit norm.
[Source: Y. LeCun]
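
A minimal sketch of the PSD objective for a single patch, assuming the predictor takes the form C(x; W, g) = g · tanh(W x) with an element-wise gain g, standing in for the slide's C(X, K):

import numpy as np

def psd_loss(x, a, Phi, W, g, lam=0.1, beta=1.0):
    """PSD energy for one patch x, code a, dictionary Phi, encoder (W, g)."""
    recon = np.sum((x - Phi @ a) ** 2)        # reconstruction term
    sparsity = lam * np.sum(np.abs(a))        # L1 sparsity term
    pred = g * np.tanh(W @ x)                 # feed-forward prediction of a
    predict = beta * np.sum((a - pred) ** 2)  # prediction (encoder) term
    return recon + sparsity + predict

rng = np.random.default_rng(0)
n, k = 144, 256                               # 12 x 12 patches, 256 atoms
x = rng.standard_normal(n)
a = rng.standard_normal(k)
Phi = rng.standard_normal((n, k))
W = rng.standard_normal((k, n))
g = rng.standard_normal(k)
print(psd_loss(x, a, Phi, W, g))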


Predictive Sparse Decomposition (PSD)

    12 × 12 natural image patches with 256 dictionary elements




                            Figure: (Left) Encoder, (Right) Decoder


[Source: Y. LeCun]
Training Deep Networks with PSD




    Train layer-wise [Hinton, 06]
            C(X, K^1)
            C(f(K^1), K^2)
            ···
    Each layer is trained on the output f(x) produced by the previous layer,
    where f is a series of non-linearities and pooling operations.

[Source: Y. LeCun]




Object Recognition - Caltech 101




[Source: Y. LeCun]
Problems with sparse coding



    Sparse coding produces filters that are shifted versions of each other.
    It ignores the fact that the features are going to be used in a convolutional fashion.
    Inference is run on all overlapping patches independently.

Problems with sparse coding
 1) the representations are redundant, as the training and inference are done at
    the patch level
 2) inference for the whole image is computationally expensive




Solutions via Convolutional Nets
Problems
 1) the representations are redundant, as the training and inference are done at
    the patch level
 2) inference for the whole image is computationally expensive
Solutions
 1) Apply sparse coding to the entire image at once, with the dictionary as a
    convolutional filter bank:

        L(x, z, D) = ½ ||x − Σ_{k=1}^{K} D_k ∗ z_k||²₂ + |z|₁

    with D_k an s × s 2D filter, x a w × h image, and z_k a 2D
    (w + s − 1) × (h + s − 1) feature map.
 2) Use feed-forward, non-linear encoders to produce a fast approximation to the
    sparse code:

        L(x, z, D, W) = ½ ||x − Σ_{k=1}^{K} D_k ∗ z_k||²₂ + Σ_{k=1}^{K} ||z_k − f(W^k ∗ x)||²₂ + |z|₁
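
A minimal sketch of the convolutional energy in item 1, assuming scipy is available; convolve2d with mode='valid' maps each (w + s − 1) × (h + s − 1) feature map back to image size:

import numpy as np
from scipy.signal import convolve2d

def conv_sc_energy(x, z, D, lam=1.0):
    """Convolutional sparse coding energy for image x, maps z, filters D."""
    recon = np.zeros_like(x)
    for Dk, zk in zip(D, z):
        recon += convolve2d(zk, Dk, mode='valid')   # back to w x h
    return 0.5 * np.sum((x - recon) ** 2) + lam * sum(np.abs(zk).sum() for zk in z)

rng = np.random.default_rng(0)
w, h, s, K = 32, 32, 9, 4
x = rng.standard_normal((w, h))
D = [rng.standard_normal((s, s)) for _ in range(K)]
z = [rng.standard_normal((w + s - 1, h + s - 1)) for _ in range(K)]
print(conv_sc_energy(x, z, D))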
Convolutional Nets




Figure: Image from Kavukcuoglu et al. 10. (left) Dictionary from sparse coding.
(right) Dictionary learned from conv. nets.




Stacking
   One or more additional stages can be stacked on top of the first one.
   Each stage then takes the output of its preceding stage as input and
   processes it using the same series of operations with different architectural
   parameters like size and connections.




             Figure: Image from Kavukcuoglu et al. 10. Second layer.

Convolutional Nets




[Source: Y. LeCun]
Convolutional Nets




[Source: H. Lee]
INRIA pedestrians




             Figure: Image from Kavukcuoglu et al. 10. Second layer.


Many more deep architectures




    Restricted Boltzmann machines (RBMs)
   Deep belief nets
   Deep Boltzmann machines
   Stacked denoising auto-encoders
   ···




Topic Models




Topic Models

     Topic models have become a powerful unsupervised learning tool for text
     summarization, document classification, information retrieval, link
     prediction, and collaborative filtering.
     Topic modeling introduces latent topic variables in text and images that
     reveal the underlying structure.
     A topic is a group of related words.




[Source: J. Zen]
Probabilistic Topic Modeling



     Treat data as observations that arise from a generative model composed of
     latent variables.
            The latent variables reflect the semantic structure of the document.
     Infer the latent structure using posterior inference (MAP):
            What are the topics that summarize the document network?
     Predict new data using the estimated topic model:
             How do the new data fit into the estimated topic structures?

[Source: J. Zen]




Latent Dirichlet Allocation (LDA) [Blei et al. 03]




     Simple intuition: a document exhibits multiple topics

[Source: J. Zen]
Generative models




     Each document is a mixture of corpus-wide topics.
     Each word is drawn from one of those topics.

[Source: J. Zen]
The posterior distribution (inference)




     Observations include documents and their words. Our goal is to infer the
     underlying topic structures marked by ? from observations.

[Source: J. Zen]
What does it mean with visual words?




[Source: B. Schiele]
How to read Graphical Models




[Source: B. Schiele]
LDA




[Source: J. Zen]
LDA generative process




[Source: J. Zen]
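
The generative-process figure is not reproduced; as a stand-in, a minimal sketch of the standard LDA generative process with symmetric Dirichlet priors (all sizes and hyperparameters are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
V, T, D, N = 1000, 10, 5, 50       # vocabulary, topics, documents, words/doc
alpha, eta = 0.1, 0.01             # symmetric Dirichlet hyperparameters

beta = rng.dirichlet(eta * np.ones(V), size=T)       # topic-word distributions
docs = []
for _ in range(D):
    theta = rng.dirichlet(alpha * np.ones(T))        # per-document topic mixture
    z = rng.choice(T, size=N, p=theta)               # topic assignment per word
    w = np.array([rng.choice(V, p=beta[t]) for t in z])  # draw each word
    docs.append(w)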
LDA as matrix factorization




[Source: B. Schiele]
Dirichlet Distribution




LDA inference




[Source: J. Zen]
LDA: intractable exact inference




[Source: J. Zen]


LDA: approximate inference




[Source: J. Zen]
Computer Vision Applications




   Natural Scene Categorization [Fei-Fei and Perona, 05]
   Object Recognition [Sivic et al. 05]
    Object Recognition via distributions over patches [Fritz et al.]
    Activity Recognition [Niebles et al.]
    Text and image [Saenko et al.]
   ···




Natural Scene Categorization




     A different model is learned for each class.

[Source: J. Zen]
Natural Scene Categorization




    In reality it is a bit more complicated, as the model also reasons about the
    class [Fei-Fei et al., 05].

Natural Scene Categorization




Classification Results




More Results




Object Recognition [Sivic et al. 05]




     Matrix Factorization view

[Source: B. Schiele]
Most likely words given topic




[Source: B. Schiele]


Object Recognition [Sivic et al. 05]




     Matrix Factorization view

[Source: B. Schiele]
Object Recognition [Sivic et al. 05]




[Source: B. Schiele]


Object Recognition [Fritz et al.]




[Source: B. Schiele]
Learning of Generative Decompositions [Fritz et al.]




[Source: B. Schiele]
Topics [Fritz et al.]




[Source: B. Schiele]
Supervised Classification [Fritz et al.]




[Source: B. Schiele]
UIUC Cars




[Source: B. Schiele]
Extensions




    Introduce supervision, e.g., discriminative LDA, supervised LDA
    Introduce spatial information
    Multi-view approaches, e.g., correlated LDA, Pachinko allocation
    Use infinite mixtures, e.g., HDP and PY processes, to model heavy tails
    And many more.




