CVPR 2012 Tutorial
Deep Learning Methods for Vision (draft)

Honglak Lee
Computer Science and Engineering Division
University of Michigan, Ann Arbor

1
Feature representations

[Figure: raw pixel values ("pixel 1", "pixel 2") fed directly into a learning algorithm; in pixel space, motorbikes and "non"-motorbikes are hard to separate.]

2
Feature representations

[Figure: a feature representation ("handle", "wheel") inserted between the input and the learning algorithm; in feature space, motorbikes and "non"-motorbikes become separable.]

3
How is computer perception done?

State-of-the-art: "hand-crafting" a pipeline of input data → feature representation → learning algorithm.

Object detection: image → low-level vision features (SIFT, HOG, etc.) → object detection / classification.

Audio classification: audio → low-level audio features (spectrogram, MFCC, etc.) → speaker identification.

4
Computer vision features

[Figure: visualizations of hand-crafted descriptors: SIFT, Spin image, HoG, RIFT, Textons, GLOH.]

5
Computer vision features

[Figure: the same descriptors: SIFT, Spin image, HoG, RIFT, Textons, GLOH.]

Hand-crafted features:
1. Need expert knowledge
2. Require time-consuming hand-tuning
3. Are (arguably) one of the limiting factors of computer vision systems

6
Learning Feature Representations
• Key idea:
   – Learn statistical structure or correlation of the data from
     unlabeled data
   – The learned representations can be used as features in
     supervised and semi-supervised settings
   – Known as: unsupervised feature learning, feature learning,
     deep learning, representation learning, etc.
• Topics covered in this talk:
   –   Restricted Boltzmann Machines
   –   Deep Belief Networks
   –   Denoising Autoencoders
   –   Applications: Vision, Audio, and Multimodal learning

                                                                   7
Learning Feature Hierarchy

• Deep Learning
   – Deep architectures can be representationally efficient.
   – Natural progression from low-level to high-level structures.
   – Can share the lower-level representations across multiple tasks.

[Figure: feature hierarchy: input → 1st layer ("edges") → 2nd layer ("object parts") → 3rd layer ("objects").]

8
Outline
•   Restricted Boltzmann machines
•   Deep Belief Networks
•   Denoising Autoencoders
•   Applications to Vision
•   Applications to Audio and Multimodal Data




                                                9
Learning Feature Hierarchy
      [Related work: Hinton, Bengio, LeCun, Ng, and others.]

[Figure: input image patch (pixels) → first layer: RBMs (edges) → higher layer: DBNs (combinations of edges).]

11
Restricted Boltzmann Machines
      with binary-valued input data
• Representation
   – Undirected bipartite graphical model
   – $\mathbf{v} \in \{0,1\}^D$: observed (visible) binary variables
   – $\mathbf{h} \in \{0,1\}^K$: hidden binary variables

[Figure: bipartite graph with every hidden unit $h_j$ (layer H) connected to every visible unit $v_i$ (layer V); no within-layer connections.]

12
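For reference, the standard energy function and joint distribution of a binary RBM, in the notation used on the next slide ($b_j$: hidden biases, $c_i$: visible biases):

```latex
E(\mathbf{v}, \mathbf{h}) = -\sum_{i,j} v_i W_{ij} h_j - \sum_j b_j h_j - \sum_i c_i v_i,
\qquad
P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z}\, e^{-E(\mathbf{v}, \mathbf{h})}
```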
Conditional Probabilities
     (RBM with binary-valued input data)
• Given $\mathbf{v}$, all the $h_j$ are conditionally independent:

  $P(h_j = 1 \mid \mathbf{v}) = \dfrac{\exp\big(\sum_i W_{ij} v_i + b_j\big)}{\exp\big(\sum_i W_{ij} v_i + b_j\big) + 1} = \mathrm{sigmoid}\Big(\sum_i W_{ij} v_i + b_j\Big) = \mathrm{sigmoid}\big(\mathbf{w}_j^T \mathbf{v} + b_j\big)$

   – $P(\mathbf{h} \mid \mathbf{v})$ can be used as "features"

• Given $\mathbf{h}$, all the $v_i$ are conditionally independent:

  $P(v_i = 1 \mid \mathbf{h}) = \mathrm{sigmoid}\Big(\sum_j W_{ij} h_j + c_i\Big)$

[Figure: hidden units $h_1, h_2, h_3$ connected through weight vectors $\mathbf{w}_1, \mathbf{w}_2, \mathbf{w}_3$ to visible units $v_1, v_2$.]

13
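To make these conditionals concrete, a minimal NumPy sketch (ours, not the tutorial's); `W` is the D x K weight matrix, and `b`, `c` are the hidden and visible biases:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# v, h: row vectors or minibatches of rows
def p_h_given_v(v, W, b):
    # P(h_j = 1 | v) = sigmoid(w_j^T v + b_j), computed for all j at once
    return sigmoid(v @ W + b)

def p_v_given_h(h, W, c):
    # P(v_i = 1 | h) = sigmoid(sum_j W_ij h_j + c_i), for all i at once
    return sigmoid(h @ W.T + c)
```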
Restricted Boltzmann Machines
      with real-valued input data
• Representation
   – Undirected bipartite graphical model
   – V: observed (visible) real-valued variables
   – H: hidden binary variables

[Figure: same bipartite graph, now with real-valued visible units.]

14
Conditional Probabilities
      (RBM with real-valued input data)
• Given $\mathbf{v}$, all the $h_j$ are conditionally independent:

  $P(h_j = 1 \mid \mathbf{v}) = \dfrac{\exp\big(\frac{1}{\sigma}\sum_i W_{ij} v_i + b_j\big)}{\exp\big(\frac{1}{\sigma}\sum_i W_{ij} v_i + b_j\big) + 1} = \mathrm{sigmoid}\Big(\frac{1}{\sigma}\sum_i W_{ij} v_i + b_j\Big) = \mathrm{sigmoid}\big(\tfrac{1}{\sigma}\mathbf{w}_j^T \mathbf{v} + b_j\big)$

   – $P(\mathbf{h} \mid \mathbf{v})$ can be used as "features"

• Given $\mathbf{h}$, all the $v_i$ are conditionally independent:

  $P(v_i \mid \mathbf{h}) = \mathcal{N}\big(\sigma \textstyle\sum_j W_{ij} h_j + c_i,\; \sigma^2\big)$, or equivalently $P(\mathbf{v} \mid \mathbf{h}) = \mathcal{N}\big(\sigma W \mathbf{h} + \mathbf{c},\; \sigma^2 \mathbf{I}\big)$

[Figure: same bipartite graph as before.]

15
Inference
• Conditional distribution: P(v|h) or P(h|v)
   – Easy to compute (see previous slides).
   – Due to conditional independence, we can sample all hidden units given all visible units in parallel (and vice versa).
• Joint distribution: P(v,h)
   – Requires Gibbs sampling (approximate; takes many iterations to converge):

     Initialize with $\mathbf{v}^0$
     Sample $\mathbf{h}^0$ from $P(\mathbf{h} \mid \mathbf{v}^0)$
     Repeat until convergence (t = 1, …) {
         Sample $\mathbf{v}^t$ from $P(\mathbf{v} \mid \mathbf{h}^{t-1})$
         Sample $\mathbf{h}^t$ from $P(\mathbf{h} \mid \mathbf{v}^t)$
     }

16
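A minimal NumPy sketch of this block Gibbs sampler (ours; the fixed step count stands in for "repeat until convergence"):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    # 0/1 samples with the given elementwise probabilities
    return (rng.random(p.shape) < p).astype(float)

def gibbs_chain(v0, W, b, c, n_steps=1000):
    v = v0
    h = sample_bernoulli(sigmoid(v @ W + b))        # h^0 ~ P(h | v^0)
    for _ in range(n_steps):                        # "repeat until convergence"
        v = sample_bernoulli(sigmoid(h @ W.T + c))  # v^t ~ P(v | h^{t-1})
        h = sample_bernoulli(sigmoid(v @ W + b))    # h^t ~ P(h | v^t)
    return v, h
```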
Training RBMs
• Model: $P_\theta(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \exp\big(-E_\theta(\mathbf{v}, \mathbf{h})\big)$
• How can we find parameters $\theta$ that maximize $P_\theta(\mathbf{v})$? The log-likelihood gradient is

  $\dfrac{\partial \log P_\theta(\mathbf{v})}{\partial \theta} = -\,\mathbb{E}_{P(\mathbf{h} \mid \mathbf{v})}\Big[\dfrac{\partial E}{\partial \theta}\Big] + \mathbb{E}_{P(\mathbf{v}, \mathbf{h})}\Big[\dfrac{\partial E}{\partial \theta}\Big]$

  (first expectation: the data distribution, i.e., the posterior of h given v; second: the model distribution)

• We need to compute P(h|v) and P(v,h), and the derivative of E w.r.t. the parameters {W, b, c}
   – P(h|v): tractable
   – P(v,h): intractable
      • Can be approximated with Gibbs sampling, but this requires many iterations

17
Contrastive Divergence
• An approximation of the log-likelihood gradient for RBMs:
   1. Replace the average over all possible inputs by samples.
   2. Run the MCMC chain (Gibbs sampling) for only k steps, starting from the observed example:

      Initialize with $\mathbf{v}^0 = \mathbf{v}$
      Sample $\mathbf{h}^0$ from $P(\mathbf{h} \mid \mathbf{v}^0)$
      For t = 1, …, k {
          Sample $\mathbf{v}^t$ from $P(\mathbf{v} \mid \mathbf{h}^{t-1})$
          Sample $\mathbf{h}^t$ from $P(\mathbf{h} \mid \mathbf{v}^t)$
      }

18
A picture of the maximum likelihood learning algorithm for an RBM

[Figure: an alternating Gibbs chain between visible units i and hidden units j, at t = 0, 1, 2, …, ∞; the statistic $\langle v_i h_j \rangle^0$ is measured at t = 0 and $\langle v_i h_j \rangle^\infty$ at the equilibrium distribution.]

Slide Credit: Geoff Hinton   19
A quick way to learn an RBM

[Figure: a one-step chain: the data at t = 0 gives $\langle v_i h_j \rangle^0$; the reconstruction at t = 1 gives $\langle v_i h_j \rangle^1$.]

Start with a training vector on the visible units.
Update all the hidden units in parallel.
Update all the visible units in parallel to get a "reconstruction".
Update the hidden units again.

Slide Credit: Geoff Hinton   20
Update Rule: Putting it together
• Training via stochastic gradient.
• Note that $\frac{\partial E}{\partial W_{ij}} = -v_i h_j$. Therefore, the CD update is

  $W_{ij} \leftarrow W_{ij} + \eta\,\big(\langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1\big)$

   – Similar update rules can be derived for the biases b and c
   – Implemented in ~10 lines of MATLAB code

21
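A hedged NumPy analogue of those "~10 lines of MATLAB" (ours; probabilities are used in place of some samples, a common CD-1 variant):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

# One CD-1 update for a binary RBM on a minibatch v0 (n x D).
def cd1_update(v0, W, b, c, lr=0.01):
    h0_prob = sigmoid(v0 @ W + b)          # positive phase: P(h | v0)
    h0 = sample_bernoulli(h0_prob)
    v1_prob = sigmoid(h0 @ W.T + c)        # "reconstruction"
    h1_prob = sigmoid(v1_prob @ W + b)     # negative phase
    n = v0.shape[0]
    # <v_i h_j>^0 - <v_i h_j>^1, averaged over the minibatch
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / n
    b += lr * (h0_prob - h1_prob).mean(axis=0)   # hidden biases
    c += lr * (v0 - v1_prob).mean(axis=0)        # visible biases
    return W, b, c
```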
Other ways of training RBMs
• Persistent CD [Tieleman, ICML 2008; Tieleman & Hinton, ICML 2009]
   – Keep a background MCMC chain to obtain the
     negative phase samples.
   – Related to Stochastic Approximation
        • Robbins and Monro, Ann. Math. Stats., 1951
       • L. Younes, Probability Theory 1989
• Score Matching [Swersky et al., ICML 2011; Hyvarinen, JMLR 2005]
   – Use score function to eliminate Z
   – Match model’s & empirical score function


                                                                      22
Estimating Log-Likelihood of RBMs
• How do we measure the likelihood of the learned
  models?
• RBM: requires estimating the partition function Z
  • Reconstruction error provides a cheap proxy
  • log Z is analytically tractable for fewer than ~25 binary visible or hidden units
  • Can be approximated with Annealed Importance Sampling (AIS)
     • Salakhutdinov & Murray, ICML 2008



• Open question: efficient ways to monitor progress


                                                                23
Variants of RBMs




                   25
Sparse RBM / DBN
• Main idea                                [Lee et al., NIPS 2008]
   – Constrain the hidden-layer nodes to have "sparse" average values (activations) [cf. sparse coding]
• Optimization
   – Trade off the "likelihood" against a "sparsity penalty" that pulls each unit's average activation toward a target sparsity; other penalty functions (e.g., KL divergence) can also be used. One form of the objective is sketched below.

26
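A hedged sketch of the objective (the squared penalty of Lee et al., NIPS 2008; as the slide notes, other penalties such as KL divergence also work), with target sparsity $p$ and $M$ training examples:

```latex
\max_{W,\,\mathbf{b},\,\mathbf{c}} \;
\sum_{m=1}^{M} \log P\big(\mathbf{v}^{(m)}\big)
\;-\; \lambda \sum_{j} \Big( p \;-\; \frac{1}{M} \sum_{m=1}^{M}
      \mathbb{E}\big[ h_j \mid \mathbf{v}^{(m)} \big] \Big)^{2}
```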
Modeling handwritten digits
• Sparse dictionary learning via sparse RBMs

[Figure: input nodes (data) connected through weights W1 to first-layer bases resembling "pen-strokes"; training examples are shown below.]

Learned sparse representations can be used as features.

[Lee et al., NIPS 2008; Ranzato et al., NIPS 2007]
27
3-way factorized RBM
         [Ranzato et al., AISTATS 2010; Ranzato and Hinton, CVPR 2010]

• Models the covariance structure of images using hidden variables
   – 3-way factorized RBM / mean-covariance RBM

[Figure: a Gaussian MRF with direct pairwise connections between pixels $x_p$ and $x_q$, versus a 3-way (covariance) RBM in which pixel pairs are connected through factors F to covariance hidden units $h^c_k$.]

[Slide Credit: Marc'Aurelio Ranzato] 28
Generating natural image patches

[Figure: natural images compared with patches generated by three models:
 – mcRBM (Ranzato and Hinton, CVPR 2010)
 – GRBM (from Osindero and Hinton, NIPS 2008)
 – S-RBM + DBN (from Osindero and Hinton, NIPS 2008)]

Slide Credit: Marc'Aurelio Ranzato 29
Outline
•   Restricted Boltzmann machines
•   Deep Belief Networks
•   Denoising Autoencoders
•   Applications to Vision
•   Applications to Audio and Multimodal Data




                                                30
Deep Belief Networks (DBNs)
                                           Hinton et al., 2006

• Probabilistic generative model
• Deep architecture – multiple layers
• Unsupervised pre-training provides a good
  initialization of the network
  – maximizing the lower-bound of the log-likelihood
    of the data
• Supervised fine-tuning
  – Generative: Up-down algorithm
  – Discriminative: backpropagation

                                                            32
DBN structure
                                                                            Hinton et al., 2006

[Figure: visible layer v with hidden layers h^1, h^2, h^3; the top two layers (h^2, h^3) form an RBM, while the layers below form directed belief nets.]

$P(\mathbf{v}, \mathbf{h}^1, \mathbf{h}^2, \ldots, \mathbf{h}^\ell) = P(\mathbf{v} \mid \mathbf{h}^1)\, P(\mathbf{h}^1 \mid \mathbf{h}^2) \cdots P(\mathbf{h}^{\ell-2} \mid \mathbf{h}^{\ell-1})\, P(\mathbf{h}^{\ell-1}, \mathbf{h}^\ell)$

33
DBN structure                             Hinton et al., 2006

[Figure: (approximate) inference runs bottom-up through $Q(\mathbf{h}^1 \mid \mathbf{v})$, $Q(\mathbf{h}^2 \mid \mathbf{h}^1)$, and $Q(\mathbf{h}^3 \mid \mathbf{h}^2) = P(\mathbf{h}^3 \mid \mathbf{h}^2)$ with weights $W^1, W^2, W^3$; the generative process runs top-down through $P(\mathbf{h}^2, \mathbf{h}^3)$, $P(\mathbf{h}^1 \mid \mathbf{h}^2)$, and $P(\mathbf{v} \mid \mathbf{h}^1)$.]

$P(\mathbf{v}, \mathbf{h}^1, \mathbf{h}^2, \ldots, \mathbf{h}^\ell) = P(\mathbf{v} \mid \mathbf{h}^1)\, P(\mathbf{h}^1 \mid \mathbf{h}^2) \cdots P(\mathbf{h}^{\ell-2} \mid \mathbf{h}^{\ell-1})\, P(\mathbf{h}^{\ell-1}, \mathbf{h}^\ell)$

$Q(\mathbf{h}^i \mid \mathbf{h}^{i-1}) = \prod_j \mathrm{sigm}\big(b^i_j + W^i_{\cdot j} \cdot \mathbf{h}^{i-1}\big) \qquad P(\mathbf{h}^{i-1} \mid \mathbf{h}^i) = \prod_j \mathrm{sigm}\big(b^{i-1}_j + W^i_{j \cdot} \cdot \mathbf{h}^i\big)$

34
DBN Greedy training
                                  Hinton et al., 2006
• First step:
   – Construct an RBM with
     an input layer v and a
     hidden layer h
   – Train the RBM




                                                   35
DBN Greedy training
                                    Hinton et al., 2006
• Second step:
   – Stack another hidden layer on top of the RBM to form a new RBM
   – Fix $W^1$, sample $\mathbf{h}^1$ from $Q(\mathbf{h}^1 \mid \mathbf{v})$ as input. Train $W^2$ as an RBM.

[Figure: a new RBM with weights $W^2$ stacked above the fixed first layer $W^1$, whose hidden activations come from $Q(\mathbf{h}^1 \mid \mathbf{v})$.]

36
DBN Greedy training
                                              Hinton et al., 2006
• Third step:
   – Continue to stack layers on top of the network; train each new layer as in the previous step, with samples drawn from $Q(\mathbf{h}^2 \mid \mathbf{h}^1)$
• And so on… (a layer-by-layer sketch follows below)

[Figure: a third layer $\mathbf{h}^3$ with weights $W^3$ stacked over $W^2$ and $W^1$, with $Q(\mathbf{h}^2 \mid \mathbf{h}^1)$ and $Q(\mathbf{h}^1 \mid \mathbf{v})$.]

37
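A minimal sketch of the greedy procedure (ours; the helper `train_rbm` is assumed, e.g., a loop of `cd1_update` calls from the earlier sketch):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Greedy layer-wise DBN training: train an RBM, then use its inferred
# hidden probabilities Q(h | v) as the "data" for the next RBM, and so on.
def train_dbn(data, layer_sizes, train_rbm):
    layers, inputs = [], data
    for size in layer_sizes:
        W, b, c = train_rbm(inputs, n_hidden=size)   # assumed helper
        layers.append((W, b, c))
        inputs = sigmoid(inputs @ W + b)   # Q(h | v): features for next layer
    return layers
```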
Why does greedy training work?
                                  Hinton et al., 2006
• An RBM specifies P(v,h) via P(v|h) and P(h|v)
   – This implicitly defines P(v) and P(h)
• Key idea of stacking
   – Keep P(v|h) from the 1st RBM
   – Replace P(h) by the distribution generated by the 2nd-level RBM

38
Why does greedy training work?
                                                         Hinton et al., 2006
• Greedy training:
   – A variational lower bound justifies greedy layer-wise training of RBMs

[Figure: the first-layer weights $W^1$ with $Q(\mathbf{h}^1 \mid \mathbf{v})$; the prior over $\mathbf{h}^1$ is trained by the second-layer RBM.]

39
Why does greedy training work?
                                                        Hinton et al., 2006
• Greedy training:
   – A variational lower bound justifies greedy layer-wise training of RBMs
   – Note: an RBM and a 2-layer DBN are equivalent when $W^2 = (W^1)^T$. Therefore, the lower bound is tight at initialization, and the log-likelihood improves under greedy training.

[Figure: weights $W^2$ over $W^1$ with $Q(\mathbf{h}^1 \mid \mathbf{v})$; the prior over $\mathbf{h}^1$ is trained by the second-layer RBM.]

40
DBN and supervised fine-tuning
• Discriminative fine-tuning
   – Initialize a neural network with the DBN weights, then run backpropagation
   – Maximizes log P(Y | X)   (X: data, Y: label)

• Generative fine-tuning
   – Up-down algorithm
   – Maximizes log P(Y, X)   (joint likelihood of data and labels)

41
A model for digit recognition

The top two layers form an associative memory whose energy landscape models the low-dimensional manifolds of the digits. The energy valleys have names.

The model learns to generate combinations of labels and images.

To perform recognition we start with a neutral state of the label units and do an up-pass from the image, followed by a few iterations of the top-level associative memory.

[Figure: 28 x 28 pixel image → 500 neurons → 500 neurons; 2000 top-level neurons connect to both the upper 500-neuron layer and 10 label neurons.]

http://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.ppt   Slide Credit: Geoff Hinton   43
Fine-tuning with a contrastive version
         of the “wake-sleep” algorithm
     After learning many layers of features, we can fine-tune the
      features to improve generation.
   1. Do a stochastic bottom-up pass
        – Adjust the top-down weights to be good at reconstructing
          the feature activities in the layer below.
   2. Do a few iterations of sampling in the top level RBM
       -- Adjust the weights in the top-level RBM.
   3. Do a stochastic top-down pass
        – Adjust the bottom-up weights to be good at reconstructing
          the feature activities in the layer above.


http://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.ppt   Slide Credit: Geoff Hinton   44
Generating samples from a DBN
• Want to sample from
  $P(\mathbf{v}, \mathbf{h}^1, \mathbf{h}^2, \ldots, \mathbf{h}^\ell) = P(\mathbf{v} \mid \mathbf{h}^1)\, P(\mathbf{h}^1 \mid \mathbf{h}^2) \cdots P(\mathbf{h}^{\ell-2} \mid \mathbf{h}^{\ell-1})\, P(\mathbf{h}^{\ell-1}, \mathbf{h}^\ell)$
   – Sample $\mathbf{h}^{\ell-1}$ using Gibbs sampling in the top-level RBM
   – Then sample each lower layer $\mathbf{h}^{i-1}$ from $P(\mathbf{h}^{i-1} \mid \mathbf{h}^i)$

45
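A hedged sketch of this procedure, reusing the helpers from the RBM sketches (the initialization and the fixed number of Gibbs steps are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

# Ancestral sampling from a DBN given per-layer parameters (W, b, c):
# Gibbs in the top-level RBM, then one top-down pass through the
# directed layers P(h^{i-1} | h^i).
def sample_from_dbn(layers, n_gibbs=200):
    W, b, c = layers[-1]                                  # top-level RBM
    h = sample_bernoulli(np.full((1, b.shape[0]), 0.5))   # arbitrary init
    for _ in range(n_gibbs):                              # Gibbs at the top
        v = sample_bernoulli(sigmoid(h @ W.T + c))
        h = sample_bernoulli(sigmoid(v @ W + b))
    x = v                                                 # this is h^{l-1}
    for W, b, c in reversed(layers[:-1]):                 # top-down pass
        x = sample_bernoulli(sigmoid(x @ W.T + c))        # P(h^{i-1} | h^i)
    return x
```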
Generating samples from a DBN

[Figure: samples generated from the model.]

Hinton et al., "A Fast Learning Algorithm for Deep Belief Nets", 2006
46
Results for supervised fine-tuning on MNIST
   • Very carefully trained backprop net with one or two hidden layers (Platt; Hinton): 1.6%
   • SVM (Decoste & Schoelkopf, 2002): 1.4%
   • Generative model of joint density of images and labels (+ generative fine-tuning): 1.25%
   • Generative model of unlabelled digits followed by gentle backpropagation (Hinton & Salakhutdinov, Science 2006): 1.15%

http://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.ppt   Slide Credit: Geoff Hinton 47
• More details on up-down algorithm:
  – Hinton, G. E., Osindero, S. and Teh, Y. (2006) “A fast learning algorithm
    for deep belief nets”, Neural Computation, 18, pp 1527-1554.
    http://www.cs.toronto.edu/~hinton/absps/ncfast.pdf


• Handwritten digit demo:
  – http://www.cs.toronto.edu/~hinton/digits.html




                                                                                48
Outline
•   Restricted Boltzmann machines
•   Deep Belief Networks
•   Denoising Autoencoders
•   Applications to Vision
•   Applications to Audio and Multimodal Data




                                                49
Denoising Autoencoder
• Denoising Autoencoder
   – Perturbs the input x to a corrupted version $\tilde{\mathbf{x}}$:
      • E.g., randomly sets some of the coordinates of the input to zero.
   – Recovers x from the encoding h of the perturbed data.
   – Minimizes the loss between the reconstruction $\hat{\mathbf{x}}$ and x.

[Vincent et al., ICML 2008]   51
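A minimal sketch of one denoising-autoencoder forward pass with masking noise (ours; tied weights and squared error are assumptions, and cross-entropy is also common for binary inputs):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def dae_step(x, W, b, c, corruption=0.3):
    mask = rng.random(x.shape) >= corruption
    x_tilde = x * mask                      # randomly zero some coordinates
    h = sigmoid(x_tilde @ W + b)            # encoder on the corrupted input
    x_hat = sigmoid(h @ W.T + c)            # decoder (tied weights)
    loss = np.mean((x - x_hat) ** 2)        # compare against the CLEAN input
    return h, x_hat, loss
```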
Denoising Autoencoder                       [Vincent et al., ICML 2008]
• Learns a vector field pointing towards higher-probability regions
• Minimizes a variational lower bound on a generative model
• Corresponds to regularized score matching on an RBM

[Figure: corrupted inputs are mapped back toward the data manifold.]

Slide Credit: Yoshua Bengio   52
Stacked Denoising Autoencoders
• Greedy layer-wise learning
   – Start with the lowest level and stack upwards
   – Train each autoencoder layer on the intermediate codes (features) from the layer below
   – The top layer can have a different output (e.g., a softmax non-linearity) to provide an output for classification

Slide Credit: Yoshua Bengio   53
Stacked Denoising Autoencoders
• No partition function; the training criterion can be measured directly
• Encoder & decoder: any parametrization
• Performs as well as or better than stacking RBMs for unsupervised pre-training

[Figure: results on Infinite MNIST.]

Slide Credit: Yoshua Bengio   54
Denoising Autoencoders: Benchmarks
                              Larochelle et al., 2009




                        Slide Credit: Yoshua Bengio   55
Denoising Autoencoders: Results
• Test errors on the benchmarks         Larochelle et al., 2009




                                  Slide Credit: Yoshua Bengio   56
Why Greedy Layer-Wise Training Works
                                             (Bengio 2009, Erhan et al. 2009)
• Regularization Hypothesis
   – Pre-training "constrains" the parameters to a region relevant to the unsupervised dataset
   – Better generalization (representations that better describe unlabeled data are more discriminative for labeled data)

• Optimization Hypothesis
   – Unsupervised training initializes the lower-level parameters near better local minima than random initialization does

Slide Credit: Yoshua Bengio   58
Outline
•   Restricted Boltzmann machines
•   Deep Belief Networks
•   Denoising Autoencoders
•   Applications to Vision
•   Applications to Audio and Multimodal Data




                                                65
Convolutional Neural Networks
                                      (LeCun et al., 1989)

[Figure: a convolutional network built from local receptive fields, weight sharing, and pooling.]

67
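To make the three ingredients concrete, a minimal NumPy sketch (ours, not LeCun's implementation) of a valid 2D convolution (local receptive fields with shared weights) followed by non-overlapping max pooling:

```python
import numpy as np

def conv2d_valid(image, filt):
    # Slide one shared filter over all local receptive fields
    H, W = image.shape
    fh, fw = filt.shape
    out = np.zeros((H - fh + 1, W - fw + 1))
    for r in range(out.shape[0]):
        for s in range(out.shape[1]):
            out[r, s] = (image[r:r+fh, s:s+fw] * filt).sum()
    return out

def max_pool(fmap, k=2):
    # Non-overlapping k x k max pooling
    H, W = fmap.shape
    H, W = H - H % k, W - W % k            # crop to a multiple of k
    return fmap[:H, :W].reshape(H // k, k, W // k, k).max(axis=(1, 3))
```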
Deep Convolutional Architectures
State-of-the-art on MNIST digits, Caltech-101 objects, etc.




                                                          Slide Credit: Yann LeCun 70
Learning object representations
• Learning objects and parts in images




• Large image patches contain interesting higher-
  level structures.
  – E.g., object parts and full objects



                                                    71
Illustration: Learning an "eye" detector

[Figure: an example image is "filtered" with filter1 … filter4; the filtering outputs are then "shrunk" (max over 2x2 regions) to build an "eye detector".]

Advantages of shrinking:
1. The filter size is kept small
2. Invariance

72
Convolutional RBM (CRBM) [Lee et al., ICML 2009]
     [Related work: Norouzi et al., CVPR 2009; Desjardins and Bengio, 2008]

[Figure: for each "filter" k, the input data V (visible layer) is convolved with shared "filter" weights $W^k$ to produce a detection layer H of binary hidden nodes, which feeds a max-pooling layer P of binary "max-pooling" nodes.]

Constraint: within each pooling block, at most one hidden node is 1 (active).

Key properties:
 Convolutional structure
 Probabilistic max-pooling ("mutual exclusion")

73
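A hedged NumPy sketch of the probabilistic max-pooling computation for a single pooling block, using the softmax form implied by the mutual-exclusion constraint (the numerical-stability shift is ours):

```python
import numpy as np

def prob_max_pool_block(I):
    """I: bottom-up inputs to the detection units in one block (1-D array).

    P(h_i = 1) = exp(I_i) / (1 + sum_k exp(I_k))   (mutually exclusive)
    P(pool = 1) = sum_k exp(I_k) / (1 + sum_k exp(I_k))
    """
    m = I.max()
    e = np.exp(I - m)                 # shift by the max for stability
    denom = np.exp(-m) + e.sum()      # the "+1" term, with the same shift
    p_detect = e / denom              # detection-unit probabilities
    p_pool_on = e.sum() / denom       # pooling unit is the OR of the block
    return p_detect, p_pool_on
```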
Convolutional deep belief networks: illustration

[Figure: an example input image is processed through three layers; for each layer, the filter visualizations $W^1, W^2, W^3$ are shown alongside the layer-1, layer-2, and layer-3 activations (coefficients).]

74
Unsupervised learning from natural images

[Figure: first-layer bases: localized, oriented edges. Second-layer bases: contours, corners, arcs, surface boundaries.]

75
Unsupervised learning of object parts

[Figure: learned higher-layer bases for faces, cars, elephants, and chairs.]

76
Image classification with Spatial Pyramids
                         [Lazebnik et al., CVPR 2006; Yang et al., CVPR 2009]

 • Descriptor layer: detect and locate features, and extract the corresponding descriptors (e.g., SIFT)

 • Code layer: encode the descriptors (a coding sketch follows below)
    – Vector quantization (VQ): each code has only one non-zero element
    – Soft-VQ: a small group of elements can be non-zero

 • SPM layer: pool codes across subregions and average/normalize them into a histogram

Slide Credit: Kai Yu 78
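A minimal sketch contrasting the two coding options (ours; the Gaussian soft-assignment is one common form of soft-VQ):

```python
import numpy as np

# D: descriptors (N x d), C: codebook (K x d)
def vq_codes(D, C):
    dists = ((D[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # N x K
    codes = np.zeros_like(dists)
    codes[np.arange(len(D)), dists.argmin(1)] = 1.0          # one non-zero
    return codes

def soft_vq_codes(D, C, beta=1.0):
    dists = ((D[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    w = np.exp(-beta * dists)                 # weight nearby codewords
    return w / w.sum(1, keepdims=True)        # several non-zeros per row
```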
Improving the coding step
• Classifiers using these features need nonlinear kernels
   – Lazebnik et al., CVPR 2006; Grauman and Darrell, JMLR 2007
   – High computational complexity
• Idea: modify the coding step to produce feature representations that linear classifiers can use effectively
   – Sparse coding [Olshausen & Field, Nature 1996; Lee et al., NIPS 2007; Yang et al., CVPR 2009; Boureau et al., CVPR 2010]
   – Local coordinate coding [Yu et al., NIPS 2009; Wang et al., CVPR 2010]
   – RBMs [Sohn, Jung, Lee, Hero III, ICCV 2011]
   – Other feature learning algorithms

Slide Credit: Kai Yu 79
Object Recognition Results
• Classification accuracy on Caltech 101/256
                         < Caltech 101 >
                  # of training images              15     30
                Zhang et al., CVPR 2006            59.1   66.2
                   Griffin et al., 2008            59.0   67.6
            ScSPM [Yang et al., CVPR 2009]         67.0   73.2
             LLC [Wang et al., CVPR 2010]          65.4   73.4
       Macrofeatures [Boureau et al., CVPR 2010]     -    75.7
               Boureau et al., ICCV 2011             -    77.1
          Sparse RBM [Sohn et al., ICCV 2011]      68.6   74.9
         Sparse CRBM [Sohn et al., ICCV 2011]      71.3   77.8

      Competitive with other state-of-the-art methods using a single
      type of feature

                                                                  80
Object Recognition Results
• Classification accuracy on Caltech 101/256
                         < Caltech 256 >
                # of training images           30       60
                   Griffin et al. [2]         34.10      -
           van Gemert et al., PAMI 2010       27.17      -
          ScSPM [Yang et al., CVPR 2009]      34.02    40.14
           LLC [Wang et al., CVPR 2010]       41.19    47.68
       Sparse CRBM [Sohn et al., ICCV 2011]   42.05    47.94




      Competitive with other state-of-the-art methods using a single
      type of feature

                                                                  81
Local Convolutional RBM
• Models convolutional structures in local regions jointly
   – Statistically more effective for learning from non-stationary (roughly aligned) images

[Huang, Lee, Learned-Miller, CVPR 2012] 88
Face Verification
                                       [Huang, Lee, Learned-Miller, CVPR 2012]

[Figure: pipeline.
 Feature learning: a convolutional deep belief network (1 to 2 layers of CRBM / local CRBM) trained on whitened pixel intensities / LBP; the training sources are faces and the Kyoto natural images (self-taught learning).
 Face verification: on LFW pairs, the representation from the top-most pooling layer feeds the verification algorithm (metric learning + SVM), which predicts matched vs. mismatched.]

89
Face Verification
                                    [Huang, Lee, Learned-Miller, CVPR 2012]

Method                                              Accuracy ± SE

V1-like with MKL (Pinto et al., CVPR 2009)          0.7935 ± 0.0055

Rectified linear units (Nair and Hinton, ICML 2010)  0.8073 ± 0.0134

CSML (Nguyen & Bai, ACCV 2010)                      0.8418 ± 0.0048

Learning-based descriptor (Cao et al., CVPR 2010) 0.8445 ± 0.0046

OSS, TSS, full (Wolf et al., ACCV 2009)             0.8683 ± 0.0034

OSS only (Wolf et al., ACCV 2009)                   0.8207 ± 0.0041

Combined (LBP + deep learning features)             0.8777 ± 0.0062




                                                                          90
Tiled Convolutional models
IDEA: have one subset of filters applied to these locations,

91
Tiled Convolutional models
IDEA: have one subset of filters applied to these locations,
another subset to these locations




                                                        92
Tiled Convolutional models
IDEA: have one subset of filters applied to these locations, another subset to these locations, etc.

Train all parameters jointly.

 No block artifacts
 Reduced redundancy

Gregor & LeCun, arXiv 2010
Ranzato, Mnih, Hinton, NIPS 2010
93
Tiled Convolutional models
Treat these units as data to train a similar model on top.

[Figure: SECOND STAGE: a field of binary RBMs; each hidden unit has a receptive field of 30x30 pixels in input space.]

94
Facial Expression Recognition
Toronto Face Dataset (J. Susskind et al., 2010)
~100K unlabeled faces from different sources
~4K labeled images
Resolution: 48x48 pixels
7 facial expressions

[Figure: example faces labeled neutral, sadness, surprise.]

Ranzato et al., CVPR 2011
95
Facial Expression Recognition
Drawing samples from the model (5th layer with 128 hidden units)

[Figure: a stack with a gMRF 1st layer, intermediate RBM layers, and a field of RBMs at the 5th layer producing $\mathbf{h}^5$.]

96
Facial Expression Recognition
Drawing samples from the model (5th layer with 128 hidden units)

[Figure: generated face samples.]

Ranzato et al., CVPR 2011
97
Facial Expression Recognition
– 7 types of synthetic occlusion
– Use the generative model to fill in the missing pixels (conditional on the known pixels)

[Figure: a field of stacked gMRF + RBM models applied across the image.]

Ranzato et al., CVPR 2011
98
Facial Expression Recognition

[Figure: original images; the same images with Type 1 occlusion (eyes); and the restored images.]

Ranzato et al., CVPR 2011
99
Facial Expression Recognition

[Figure: original images; the same images with Type 2 occlusion (mouth); and the restored images.]

Ranzato et al., CVPR 2011
100
Facial Expression Recognition

[Figure: original images; the same images with Type 3 occlusion (right half); and the restored images.]

Ranzato et al., CVPR 2011
101
Facial Expression Recognition

[Figure: original images; the same images with Type 5 occlusion (top half); and the restored images.]

Ranzato et al., CVPR 2011
103
Facial Expression Recognition

[Figure: original images; the same images with Type 7 occlusion (70% of pixels at random); and the restored images.]

Ranzato et al., CVPR 2011
105
Facial Expression Recognition

[Figure: accuracy with occluded images used for both training and test, compared against Dailey et al., J. Cog. Neurosci. 2003 and Wright et al., PAMI 2008.]

Ranzato et al., CVPR 2011
106
Outline
•   Restricted Boltzmann machines
•   Deep Belief Networks
•   Denoising Autoencoders
•   Applications to Vision
•   Applications to Audio and Multimodal Data




                                                107
Motivation: Multi-modal learning
• A single learning algorithm that combines multiple input domains:
   – Images
   – Audio & speech
   – Video
   – Text
   – Robotic sensors
   – Time-series data
   – Others

[Figure: image, audio, and text inputs feeding one model.]

109
Motivation: Multi-modal learning
• Benefits: more robust performance in
   – Multimedia processing
   – Biomedical data mining (fMRI, PET scan, X-ray, EEG, ultrasound)
   – Robot perception (visible-light images, audio, thermal infrared, camera arrays, 3D range scans)

110
Sparse dictionary learning on audio

[Figure: a spectrogram x is approximated as a sparse combination of learned dictionary elements, e.g., x ≈ 0.9 × basis 36 + 0.7 × basis 42 + 0.2 × basis 63.]

[Lee, Pham, Largman, Ng, NIPS 2009]   111
Convolutional DBN for audio

[Figure: a spectrogram (frequency × time) is convolved with filters to produce detection nodes, topped by a max-pooling node.]

[Lee, Pham, Largman, Ng, NIPS 2009]   113
Convolutional DBN for audio

[Figure: the same model applied convolutionally across the full spectrogram (frequency × time).]

[Lee, Pham, Largman, Ng, NIPS 2009]   114
Convolutional DBN for audio




                [Lee, Pham, Largman, Ng, NIPS 2009]   115
Convolutional DBN for audio

[Figure: stacking: a first CDBN layer (detection nodes + max pooling) followed by a second CDBN layer (detection nodes + max pooling).]

[Lee, Pham, Largman, Ng, NIPS 2009]   116
CDBNs for speech
Trained on unlabeled TIMIT corpus




                    Learned first-layer bases
                                      [Lee, Pham, Largman, Ng, NIPS 2009]   117
Comparison of bases to phonemes

[Figure: example phonemes ("oy", "el", "s") compared with the first-layer bases that respond to them.]

118
Experimental Results
• Speaker identification
  TIMIT Speaker identification        Accuracy
  Prior art (Reynolds, 1995)          99.7%
  Convolutional DBN                   100.0%

• Phone classification
  TIMIT Phone classification                               Accuracy
  Clarkson et al. (1999)                                   77.6%
  Petrov et al. (2007)                                     78.6%
  Sha & Saul (2006)                                        78.9%
  Yu et al. (2009)                                         79.2%
  Convolutional DBN                                        80.3%
  Transformation-invariant RBM (Sohn et al., ICML 2012)    81.5%


                                                    [Lee, Pham, Largman, Ng, NIPS 2009]   120
Phone recognition using mcRBM
• Mean-covariance RBM + DBN

[Figure: mcRBM feature extraction feeding a DBN.]

[Dahl, Ranzato, Mohamed, Hinton, NIPS 2010]
122
Speech Recognition on TIMIT
Method                                                      PER
Stochastic Segmental Models                                 36.0%
Conditional Random Field                                    34.8%
Large-Margin GMM                                            33.0%
CD-HMM                                                      27.3%
Augmented conditional Random Fields                         26.6%
Recurrent Neural Nets                                       26.1%
Bayesian Triphone HMM                                       25.6%
Monophone HTMs                                              24.8%
Heterogeneous Classifiers                                   24.4%
Deep Belief Networks(DBNs)                                  23.0%
Triphone HMMs discriminatively trained w/ BMMI              22.7%
Deep Belief Networks with mcRBM feature extraction          20.5%


                                                     (Dahl et al., NIPS 2010)
                                                                                123
Multimodal Feature Learning
• Lip reading via multimodal feature learning
  (audio / visual data)




                                 Slide credit: Jiquan Ngiam
                                                              126
Multimodal Feature Learning
• Lip reading via multimodal feature learning
  (audio / visual data)




 Q. Is concatenating the best option?
                                        Slide credit: Jiquan Ngiam
                                                                     127
Multimodal Feature Learning
• Concatenating the modalities and learning features (via a single layer) doesn't work
   – Mostly unimodal features are learned

128
Multimodal Feature Learning
• Bimodal autoencoder
   – Idea: predict the unseen modality from the observed modality (see the sketch below)

Ngiam et al., ICML 2011
129
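A hedged sketch of the training trick behind this idea: sometimes zero out one modality at the input while still requiring the network to reconstruct both, so the shared code must predict the unseen modality from the observed one (the equal 1/3 proportions are our assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_modalities(audio, video):
    # Returns the (possibly masked) inputs; the reconstruction targets
    # remain the original, uncorrupted audio AND video.
    r = rng.random()
    if r < 1/3:
        return np.zeros_like(audio), video     # video-only input
    if r < 2/3:
        return audio, np.zeros_like(video)     # audio-only input
    return audio, video                        # both modalities
```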
Multimodal Feature Learning
• Visualization of learned filters

[Figure: audio (spectrogram) and video features learned over 100 ms windows.]

• Results: AVLetters lip-reading dataset

 Method                                                 Accuracy
 Prior art (Zhao et al., 2009)                            58.9%
 Multimodal deep autoencoder (Ngiam et al., 2011)         65.8%

130
Summary
• Learning Feature Representations
  – Restricted Boltzmann Machines
  – Deep Belief Networks
  – Stacked Denoising Autoencoders
• Deep learning algorithms and unsupervised
  feature learning algorithms show promising
  results in many applications
  – vision, audio, multimodal data, and others.



                                                  131
Thank you!




             132

 
NIAR_VRC_2010
NIAR_VRC_2010NIAR_VRC_2010
NIAR_VRC_2010fftoledo
 
COSC 426 Lect 2. - AR Technology
COSC 426 Lect 2. - AR Technology COSC 426 Lect 2. - AR Technology
COSC 426 Lect 2. - AR Technology Mark Billinghurst
 
Digital image processing / Procesiranje digitalnih slik / 2. 9. 2009
Digital image processing / Procesiranje digitalnih slik / 2. 9. 2009Digital image processing / Procesiranje digitalnih slik / 2. 9. 2009
Digital image processing / Procesiranje digitalnih slik / 2. 9. 2009National and University Library
 
Il test pres_wanims
Il test pres_wanimsIl test pres_wanims
Il test pres_wanimsrrintala
 
Object modeling in robotic perception
Object modeling in robotic perceptionObject modeling in robotic perception
Object modeling in robotic perceptionMoniqueO Opris
 
Univ of va intentional introduction 2013 01-31
Univ of va intentional introduction 2013 01-31Univ of va intentional introduction 2013 01-31
Univ of va intentional introduction 2013 01-31Magnus Christerson
 
Easydd program3
Easydd program3Easydd program3
Easydd program3Taha Sochi
 
[CB16] Be a Binary Rockstar: An Introduction to Program Analysis with Binary ...
[CB16] Be a Binary Rockstar: An Introduction to Program Analysis with Binary ...[CB16] Be a Binary Rockstar: An Introduction to Program Analysis with Binary ...
[CB16] Be a Binary Rockstar: An Introduction to Program Analysis with Binary ...CODE BLUE
 
project_final_seminar
project_final_seminarproject_final_seminar
project_final_seminarMUKUL BICHKAR
 

Similar a P04 restricted boltzmann machines cvpr2012 deep learning methods for vision (20)

2008 brokerage 04 smart vision system [compatibility mode]
2008 brokerage 04 smart vision system [compatibility mode]2008 brokerage 04 smart vision system [compatibility mode]
2008 brokerage 04 smart vision system [compatibility mode]
 
2008 brokerage 04 smart vision system [compatibility mode]
2008 brokerage 04 smart vision system [compatibility mode]2008 brokerage 04 smart vision system [compatibility mode]
2008 brokerage 04 smart vision system [compatibility mode]
 
426 Lecture 9: Research Directions in AR
426 Lecture 9: Research Directions in AR426 Lecture 9: Research Directions in AR
426 Lecture 9: Research Directions in AR
 
Ncm2010 ruo ando
Ncm2010 ruo andoNcm2010 ruo ando
Ncm2010 ruo ando
 
P02 sparse coding cvpr2012 deep learning methods for vision
P02 sparse coding cvpr2012 deep learning methods for visionP02 sparse coding cvpr2012 deep learning methods for vision
P02 sparse coding cvpr2012 deep learning methods for vision
 
CG simple openGL point & line-course 2
CG simple openGL point & line-course 2CG simple openGL point & line-course 2
CG simple openGL point & line-course 2
 
Contribution of Non-Scrambled Chroma Information in Privacy-Protected Face Im...
Contribution of Non-Scrambled Chroma Information in Privacy-Protected Face Im...Contribution of Non-Scrambled Chroma Information in Privacy-Protected Face Im...
Contribution of Non-Scrambled Chroma Information in Privacy-Protected Face Im...
 
Towards Conscious-like Behavior in Computer Game Characters
Towards Conscious-like  Behavior in Computer Game CharactersTowards Conscious-like  Behavior in Computer Game Characters
Towards Conscious-like Behavior in Computer Game Characters
 
Pertemuan 1. introduction to image processing
Pertemuan 1. introduction to image processingPertemuan 1. introduction to image processing
Pertemuan 1. introduction to image processing
 
Pertemuan 1. introduction to image processing
Pertemuan 1. introduction to image processingPertemuan 1. introduction to image processing
Pertemuan 1. introduction to image processing
 
Gesture Recognition
Gesture RecognitionGesture Recognition
Gesture Recognition
 
NIAR_VRC_2010
NIAR_VRC_2010NIAR_VRC_2010
NIAR_VRC_2010
 
COSC 426 Lect 2. - AR Technology
COSC 426 Lect 2. - AR Technology COSC 426 Lect 2. - AR Technology
COSC 426 Lect 2. - AR Technology
 
Digital image processing / Procesiranje digitalnih slik / 2. 9. 2009
Digital image processing / Procesiranje digitalnih slik / 2. 9. 2009Digital image processing / Procesiranje digitalnih slik / 2. 9. 2009
Digital image processing / Procesiranje digitalnih slik / 2. 9. 2009
 
Il test pres_wanims
Il test pres_wanimsIl test pres_wanims
Il test pres_wanims
 
Object modeling in robotic perception
Object modeling in robotic perceptionObject modeling in robotic perception
Object modeling in robotic perception
 
Univ of va intentional introduction 2013 01-31
Univ of va intentional introduction 2013 01-31Univ of va intentional introduction 2013 01-31
Univ of va intentional introduction 2013 01-31
 
Easydd program3
Easydd program3Easydd program3
Easydd program3
 
[CB16] Be a Binary Rockstar: An Introduction to Program Analysis with Binary ...
[CB16] Be a Binary Rockstar: An Introduction to Program Analysis with Binary ...[CB16] Be a Binary Rockstar: An Introduction to Program Analysis with Binary ...
[CB16] Be a Binary Rockstar: An Introduction to Program Analysis with Binary ...
 
project_final_seminar
project_final_seminarproject_final_seminar
project_final_seminar
 

Más de zukun

My lyn tutorial 2009
My lyn tutorial 2009My lyn tutorial 2009
My lyn tutorial 2009zukun
 
ETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVzukun
 
ETHZ CV2012: Information
ETHZ CV2012: InformationETHZ CV2012: Information
ETHZ CV2012: Informationzukun
 
Siwei lyu: natural image statistics
Siwei lyu: natural image statisticsSiwei lyu: natural image statistics
Siwei lyu: natural image statisticszukun
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibrationzukun
 
Brunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionBrunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionzukun
 
Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluationzukun
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-softwarezukun
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptorszukun
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectorszukun
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-introzukun
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video searchzukun
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video searchzukun
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video searchzukun
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learningzukun
 
Advances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionAdvances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionzukun
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick startzukun
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysiszukun
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structureszukun
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities zukun
 

Más de zukun (20)

My lyn tutorial 2009
My lyn tutorial 2009My lyn tutorial 2009
My lyn tutorial 2009
 
ETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCV
 
ETHZ CV2012: Information
ETHZ CV2012: InformationETHZ CV2012: Information
ETHZ CV2012: Information
 
Siwei lyu: natural image statistics
Siwei lyu: natural image statisticsSiwei lyu: natural image statistics
Siwei lyu: natural image statistics
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibration
 
Brunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionBrunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer vision
 
Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluation
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-software
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptors
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectors
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-intro
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video search
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video search
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video search
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learning
 
Advances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionAdvances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer vision
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick start
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysis
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structures
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities
 

Último

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptxPoojaSen20
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 

Último (20)

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 

P04 restricted boltzmann machines cvpr2012 deep learning methods for vision

  • 8. Learning Feature Hierarchy • Deep architectures can be representationally efficient; they mirror a natural progression from low-level to high-level structure (1st layer: edges; 2nd layer: object parts; 3rd layer: objects); and the lower-level representations can be shared across multiple tasks.
  • 9. Outline • Restricted Boltzmann machines • Deep Belief Networks • Denoising Autoencoders • Applications to Vision • Applications to Audio and Multimodal Data
  • 10. Learning Feature Hierarchy [Related work: Hinton, Bengio, LeCun, Ng, and others.] • Higher layers (DBNs) model combinations of edges; the first layer (RBMs) learns edges; the input is an image patch (pixels).
  • 11. Restricted Boltzmann Machines with binary-valued input data • Representation: an undirected bipartite graphical model with observed (visible) binary variables $\mathbf{v} \in \{0,1\}^D$ and hidden binary variables $\mathbf{h} \in \{0,1\}^K$.
  • 12. Conditional Probabilities (RBM with binary-valued input data) • Given $\mathbf{v}$, all the $h_j$ are conditionally independent: $P(h_j = 1 \mid \mathbf{v}) = \frac{\exp(\sum_i W_{ij} v_i + b_j)}{\exp(\sum_i W_{ij} v_i + b_j) + 1} = \mathrm{sigmoid}\big(\sum_i W_{ij} v_i + b_j\big) = \mathrm{sigmoid}(\mathbf{w}_j^{\top}\mathbf{v} + b_j)$. $P(\mathbf{h}\mid\mathbf{v})$ can be used as "features". • Given $\mathbf{h}$, all the $v_i$ are conditionally independent: $P(v_i = 1 \mid \mathbf{h}) = \mathrm{sigmoid}\big(\sum_j W_{ij} h_j + c_i\big)$.
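As a concrete illustration of these conditionals, here is a minimal NumPy sketch (not from the slides; all names and shapes are illustrative): W is a D x K weight matrix, b and c are the hidden and visible biases.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    # P(h_j = 1 | v) = sigmoid(sum_i W_ij v_i + b_j), for all j in parallel
    return sigmoid(v @ W + b)

def p_v_given_h(h, W, c):
    # P(v_i = 1 | h) = sigmoid(sum_j W_ij h_j + c_i), for all i in parallel
    return sigmoid(W @ h + c)

# The hidden posteriors can be used directly as features:
rng = np.random.default_rng(0)
D, K = 784, 500                          # illustrative sizes (e.g., 28x28 inputs)
W = 0.01 * rng.standard_normal((D, K))
b, c = np.zeros(K), np.zeros(D)
v = (rng.random(D) < 0.5).astype(float)  # a random binary "image"
features = p_h_given_v(v, W, b)          # shape (K,)
```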
  • 13. Restricted Boltzmann Machines with real-valued input data • Representation: an undirected bipartite graphical model with observed (visible) real-valued variables $\mathbf{v}$ and hidden binary variables $\mathbf{h}$.
  • 14. Conditional Probabilities (RBM with real-valued input data) • Given $\mathbf{v}$, all the $h_j$ are conditionally independent: $P(h_j = 1 \mid \mathbf{v}) = \mathrm{sigmoid}\big(\tfrac{1}{\sigma}\sum_i W_{ij} v_i + b_j\big) = \mathrm{sigmoid}\big(\tfrac{1}{\sigma}\mathbf{w}_j^{\top}\mathbf{v} + b_j\big)$; $P(\mathbf{h}\mid\mathbf{v})$ can be used as "features". • Given $\mathbf{h}$, all the $v_i$ are conditionally independent: $P(v_i \mid \mathbf{h}) = \mathcal{N}\big(\sigma\sum_j W_{ij} h_j + c_i,\ \sigma^2\big)$, i.e., $P(\mathbf{v}\mid\mathbf{h}) = \mathcal{N}(\sigma W\mathbf{h} + \mathbf{c},\ \sigma^2\mathbf{I})$.
  • 15. Inference • Conditional distributions $P(\mathbf{v}\mid\mathbf{h})$ and $P(\mathbf{h}\mid\mathbf{v})$ are easy to compute (see previous slides); due to conditional independence, all hidden units can be sampled in parallel given the visible units (and vice versa). • The joint distribution $P(\mathbf{v},\mathbf{h})$ requires Gibbs sampling (approximate; many iterations to converge): initialize $\mathbf{v}^0$; sample $\mathbf{h}^0 \sim P(\mathbf{h}\mid\mathbf{v}^0)$; then repeat until convergence ($t = 1,\dots$): sample $\mathbf{v}^t \sim P(\mathbf{v}\mid\mathbf{h}^{t-1})$, then $\mathbf{h}^t \sim P(\mathbf{h}\mid\mathbf{v}^t)$.
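Continuing the sketch above, block Gibbs sampling for the joint simply alternates the two conditionals; this is an illustrative sketch, not code from the tutorial:

```python
def sample_h(v, W, b, rng):
    p = p_h_given_v(v, W, b)             # from the sketch above
    return (rng.random(p.shape) < p).astype(float)

def sample_v(h, W, c, rng):
    p = p_v_given_h(h, W, c)
    return (rng.random(p.shape) < p).astype(float)

def gibbs_chain(v0, W, b, c, n_steps, rng):
    # Alternate the block updates; conditional independence lets each
    # layer be sampled entirely in parallel.
    h = sample_h(v0, W, b, rng)
    v = v0
    for _ in range(n_steps):
        v = sample_v(h, W, c, rng)
        h = sample_h(v, W, b, rng)
    return v, h
```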
  • 16. Training RBMs • Model: $P_\theta(\mathbf{v}) = \frac{1}{Z(\theta)}\sum_{\mathbf{h}}\exp\big(-E(\mathbf{v},\mathbf{h})\big)$. How can we find parameters $\theta$ that maximize $P_\theta(\mathbf{v})$? The log-likelihood gradient is the difference between an expectation under the data distribution (using the posterior of $\mathbf{h}$ given $\mathbf{v}$) and one under the model distribution. • We need $P(\mathbf{h}\mid\mathbf{v})$ and $P(\mathbf{v},\mathbf{h})$, and the derivatives of $E$ with respect to the parameters $\{W, b, c\}$: $P(\mathbf{h}\mid\mathbf{v})$ is tractable; $P(\mathbf{v},\mathbf{h})$ is intractable, and can be approximated with Gibbs sampling, but that requires many iterations.
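The formulas lost to the slide graphics take the standard binary-RBM form, reconstructed here to be consistent with the conditionals above (not a verbatim copy of the slide):

$$E(\mathbf{v},\mathbf{h}) = -\sum_{i,j} v_i W_{ij} h_j - \sum_j b_j h_j - \sum_i c_i v_i, \qquad P_\theta(\mathbf{v}) = \frac{1}{Z(\theta)}\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})},$$

$$\frac{\partial \log P_\theta(\mathbf{v})}{\partial W_{ij}} = \underbrace{\mathbb{E}_{P(\mathbf{h}\mid\mathbf{v})}\big[v_i h_j\big]}_{\text{data distribution}} \;-\; \underbrace{\mathbb{E}_{P(\mathbf{v},\mathbf{h})}\big[v_i h_j\big]}_{\text{model distribution}}.$$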
  • 17. Contrastive Divergence • An approximation of the log-likelihood gradient for RBMs: 1. Replace the average over all possible inputs by samples. 2. Run the MCMC chain (Gibbs sampling) for only k steps, starting from the observed example: initialize $\mathbf{v}^0 = \mathbf{v}$; sample $\mathbf{h}^0 \sim P(\mathbf{h}\mid\mathbf{v}^0)$; for $t = 1,\dots,k$: sample $\mathbf{v}^t \sim P(\mathbf{v}\mid\mathbf{h}^{t-1})$, then $\mathbf{h}^t \sim P(\mathbf{h}\mid\mathbf{v}^t)$.
  • 18. A picture of the maximum-likelihood learning algorithm for an RBM: run the Gibbs chain for $t = 0, 1, 2, \dots, \infty$ until the equilibrium distribution is reached; the gradient is $\langle v_i h_j\rangle^0 - \langle v_i h_j\rangle^{\infty}$. Slide credit: Geoff Hinton.
  • 19. A quick way to learn an RBM: start with a training vector on the visible units; update all the hidden units in parallel; update all the visible units in parallel to get a "reconstruction"; then update the hidden units again. The learning signal is $\langle v_i h_j\rangle^0 - \langle v_i h_j\rangle^1$ (data vs. reconstruction). Slide credit: Geoff Hinton.
  • 20. Update Rule: putting it together • Training via stochastic gradient. • Note that $-\frac{\partial E}{\partial W_{ij}} = v_i h_j$, so $\Delta W_{ij} \propto \langle v_i h_j\rangle_{\text{data}} - \langle v_i h_j\rangle_{\text{recon}}$. • Similar update rules can be derived for the biases b and c. • Can be implemented in about 10 lines of Matlab code (see the sketch below).
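The slide notes the update fits in roughly ten lines of Matlab; an equivalent NumPy sketch of one CD-1 mini-batch step (reusing sigmoid from above; variable names are illustrative assumptions) is:

```python
def cd1_update(V, W, b, c, lr, rng):
    # One CD-1 stochastic-gradient step on a mini-batch V of shape (N, D).
    ph0 = sigmoid(V @ W + b)                   # positive phase: P(h|v_data)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    v1 = sigmoid(h0 @ W.T + c)                 # "reconstruction" (mean-field)
    ph1 = sigmoid(v1 @ W + b)                  # negative phase
    N = V.shape[0]
    W += lr * (V.T @ ph0 - v1.T @ ph1) / N     # <v h>_data - <v h>_recon
    b += lr * (ph0 - ph1).mean(axis=0)
    c += lr * (V - v1).mean(axis=0)
    return W, b, c
```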
  • 21. Other ways of training RBMs • Persistent CD [Tieleman, ICML 2008; Tieleman & Hinton, ICML 2009]: keep a background MCMC chain to obtain the negative-phase samples; related to stochastic approximation (Robbins and Monro, Ann. Math. Stats., 1951; L. Younes, Probability Theory, 1989). • Score Matching [Swersky et al., ICML 2011; Hyvärinen, JMLR 2005]: use the score function to eliminate Z, and match the model's score function to the empirical one.
  • 22. Estimating Log-Likelihood of RBMs • How do we measure the likelihood of the learned models? For an RBM this requires estimating the partition function. • Reconstruction error provides a cheap proxy. • log Z is tractable analytically for fewer than ~25 binary inputs or hidden units, and can be approximated with Annealed Importance Sampling (AIS) [Salakhutdinov & Murray, ICML 2008]. • Open question: efficient ways to monitor progress.
  • 24. Sparse RBM / DBN • Main idea [Lee et al., NIPS 2008]: constrain the hidden-layer nodes to have a "sparse" average activation (cf. sparse coding). • Optimization trades off the log-likelihood against a sparsity penalty on the average activation relative to a target sparsity; other penalty functions (e.g., KL divergence) can also be used. One common form of the objective is sketched below.
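A sketch of the penalized objective, assuming the squared-deviation penalty of Lee et al. (2008); $p$ is the target sparsity, $m$ the number of examples, and $\lambda$ the trade-off weight (the slide notes KL divergence is another valid penalty):

$$\min_\theta\; -\sum_{n=1}^{m} \log P_\theta\big(\mathbf{v}^{(n)}\big) \;+\; \lambda \sum_{j=1}^{K} \Big( p - \frac{1}{m}\sum_{n=1}^{m} \mathbb{E}\big[h_j \mid \mathbf{v}^{(n)}\big] \Big)^2.$$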
  • 25. Modeling handwritten digits • Sparse dictionary learning via sparse RBMs: the first-layer bases W1 learned from training examples resemble "pen strokes", and the learned sparse representations can be used as features. [Lee et al., NIPS 2008; Ranzato et al., NIPS 2007]
  • 26. 3-way factorized RBM [Ranzato et al., AISTATS 2010; Ranzato and Hinton, CVPR 2010] • Models the covariance structure of images using hidden variables: the 3-way factorized RBM / mean-covariance RBM combines a Gaussian MRF with a 3-way (covariance) RBM. [Slide credit: Marc'Aurelio Ranzato]
  • 27. Generating natural image patches: samples from the mcRBM (Ranzato and Hinton, CVPR 2010) compared against natural images, a GRBM, and an S-RBM + DBN (both from Osindero and Hinton, NIPS 2008). Slide credit: Marc'Aurelio Ranzato.
  • 28. Outline • Restricted Boltzmann machines • Deep Belief Networks • Denoising Autoencoders • Applications to Vision • Applications to Audio and Multimodal Data
  • 29. Deep Belief Networks (DBNs) [Hinton et al., 2006] • A probabilistic generative model with a deep architecture (multiple layers). • Unsupervised pre-training provides a good initialization of the network by maximizing a lower bound on the log-likelihood of the data. • Supervised fine-tuning: generative (up-down algorithm) or discriminative (backpropagation).
  • 30. DBN structure [Hinton et al., 2006] • The top two hidden layers form an RBM; the layers below are directed belief nets down to the visible layer $\mathbf{v}$: $P(\mathbf{v},\mathbf{h}^1,\dots,\mathbf{h}^\ell) = P(\mathbf{v}\mid\mathbf{h}^1)\,P(\mathbf{h}^1\mid\mathbf{h}^2)\cdots P(\mathbf{h}^{\ell-2}\mid\mathbf{h}^{\ell-1})\,P(\mathbf{h}^{\ell-1},\mathbf{h}^\ell)$.
  • 31. DBN structure [Hinton et al., 2006] • Approximate inference runs bottom-up with $Q(\mathbf{h}^1\mid\mathbf{v})$, $Q(\mathbf{h}^2\mid\mathbf{h}^1)$, and $Q(\mathbf{h}^3\mid\mathbf{h}^2) \approx P(\mathbf{h}^3\mid\mathbf{h}^2)$; the generative process runs top-down through $P(\mathbf{h}^2,\mathbf{h}^3)$, $P(\mathbf{h}^1\mid\mathbf{h}^2)$, and $P(\mathbf{v}\mid\mathbf{h}^1)$, with $Q(h^i_j = 1 \mid \mathbf{h}^{i-1}) = \mathrm{sigm}\big(b^i_j + \sum_k W^i_{kj} h^{i-1}_k\big)$ and $P(h^{i-1}_k = 1 \mid \mathbf{h}^i) = \mathrm{sigm}\big(b^{i-1}_k + \sum_j W^i_{kj} h^i_j\big)$.
  • 32. DBN Greedy training [Hinton et al., 2006] • First step: construct an RBM with an input layer v and a hidden layer h, and train the RBM.
  • 33. DBN Greedy training [Hinton et al., 2006] • Second step: stack another hidden layer on top of the RBM to form a new RBM; fix $W^1$, sample $\mathbf{h}^1 \sim Q(\mathbf{h}^1\mid\mathbf{v})$ as input, and train $W^2$ as an RBM.
  • 34. DBN Greedy training [Hinton et al., 2006] • Third step: continue to stack layers on top of the network; train each new layer as in the previous step, with samples drawn from $Q(\mathbf{h}^2\mid\mathbf{h}^1)$, and so on (a code sketch follows below).
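A minimal sketch of the greedy procedure, reusing the cd1_update and sigmoid helpers above (layer sizes, epoch counts, and learning rate are illustrative assumptions, not values from the tutorial):

```python
def train_dbn(X, layer_sizes, n_epochs, lr, rng):
    # Greedy layerwise training: train an RBM with CD-1, freeze it, then
    # feed its posterior features Q(h|v) to the next RBM as "data".
    params = []
    D = X.shape[1]
    for K in layer_sizes:
        W = 0.01 * rng.standard_normal((D, K))
        b, c = np.zeros(K), np.zeros(D)
        for _ in range(n_epochs):
            W, b, c = cd1_update(X, W, b, c, lr, rng)  # full batch, for brevity
        params.append((W, b, c))
        X = sigmoid(X @ W + b)    # representation used by the next layer
        D = K
    return params
```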
  • 35. Why does greedy training work? [Hinton et al., 2006] • An RBM specifies P(v,h) through P(v|h) and P(h|v), and thereby implicitly defines P(v) and P(h). • Key idea of stacking: keep P(v|h) from the 1st RBM, and replace P(h) by the distribution generated by the 2nd-level RBM.
  • 36. Why does greedy training work? [Hinton et al., 2006] • Greedy training: a variational lower bound justifies greedy layerwise training of RBMs; the term involving $Q(\mathbf{h}\mid\mathbf{v})$ is trained by the second-layer RBM.
  • 37. Why does greedy training work? [Hinton et al., 2006] • Greedy training: a variational lower bound justifies greedy layerwise training of RBMs. • Note: an RBM and a 2-layer DBN are equivalent when $W^2 = (W^1)^{\top}$; the lower bound is then tight, and the log-likelihood improves with greedy training.
  • 38. DBN and supervised fine-tuning • Discriminative fine-tuning: initialize a neural net with the DBN weights and run backpropagation; maximizes log P(Y | X) (X: data, Y: label). • Generative fine-tuning: the up-down algorithm; maximizes log P(Y, X) (the joint likelihood of data and labels).
  • 39. A model for digit recognition • The top two layers (2000 top-level neurons, connected to 500 neurons and 10 label neurons) form an associative memory whose energy landscape models the low-dimensional manifolds of the digits; the energy valleys have names. • The model learns to generate combinations of labels and images (500 neurons above a 28 x 28 pixel image). • To perform recognition, we start with a neutral state of the label units and do an up-pass from the image, followed by a few iterations of the top-level associative memory. http://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.ppt Slide credit: Geoff Hinton.
  • 40. Fine-tuning with a contrastive version of the "wake-sleep" algorithm • After learning many layers of features, we can fine-tune the features to improve generation: 1. Do a stochastic bottom-up pass, adjusting the top-down weights to be good at reconstructing the feature activities in the layer below. 2. Do a few iterations of sampling in the top-level RBM, adjusting the weights in the top-level RBM. 3. Do a stochastic top-down pass, adjusting the bottom-up weights to be good at reconstructing the feature activities in the layer above. http://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.ppt Slide credit: Geoff Hinton.
  • 41. Generating a sample from a DBN • We want to sample from $P(\mathbf{v},\mathbf{h}^1,\dots,\mathbf{h}^\ell) = P(\mathbf{v}\mid\mathbf{h}^1)\cdots P(\mathbf{h}^{\ell-2}\mid\mathbf{h}^{\ell-1})\,P(\mathbf{h}^{\ell-1},\mathbf{h}^\ell)$: sample $\mathbf{h}^{\ell-1}$ using Gibbs sampling in the top RBM, then sample each lower layer $\mathbf{h}^{i-1}$ from $P(\mathbf{h}^{i-1}\mid\mathbf{h}^i)$ (sketched below).
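A sketch of this ancestral sampling procedure, reusing the sample_h / sample_v helpers above and assuming params holds the per-layer (W, b, c) triples from the greedy training sketch:

```python
def sample_from_dbn(params, n_gibbs, rng):
    # Gibbs sampling in the top RBM, then one ancestral top-down pass.
    W_top, b_top, c_top = params[-1]
    v_top = (rng.random(W_top.shape[0]) < 0.5).astype(float)
    for _ in range(n_gibbs):                  # samples (h^{l-1}, h^l)
        h_top = sample_h(v_top, W_top, b_top, rng)
        v_top = sample_v(h_top, W_top, c_top, rng)
    x = v_top
    for W, b, c in reversed(params[:-1]):     # h^{i-1} ~ P(h^{i-1} | h^i)
        x = sample_v(x, W, c, rng)
    return x
```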
  • 42. Generating samples from a DBN [Hinton et al., "A Fast Learning Algorithm for Deep Belief Nets", 2006].
  • 43. Results for supervised fine-tuning on MNIST (test error): very carefully trained backprop net with one or two hidden layers (Platt; Hinton): 1.6%; SVM (Decoste & Schoelkopf, 2002): 1.4%; generative model of the joint density of images and labels (+ generative fine-tuning): 1.25%; generative model of unlabelled digits followed by gentle backpropagation (Hinton & Salakhutdinov, Science 2006): 1.15%. http://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.ppt Slide credit: Geoff Hinton.
  • 44. More details on the up-down algorithm: Hinton, G. E., Osindero, S. and Teh, Y. (2006), "A fast learning algorithm for deep belief nets", Neural Computation, 18, pp. 1527-1554. http://www.cs.toronto.edu/~hinton/absps/ncfast.pdf • Handwritten digit demo: http://www.cs.toronto.edu/~hinton/digits.html
  • 45. Outline • Restricted Boltzmann machines • Deep Belief Networks • Denoising Autoencoders • Applications to Vision • Applications to Audio and Multimodal Data
  • 46. Denoising Autoencoder [Vincent et al., ICML 2008] • Perturb the input x to a corrupted version (e.g., randomly set some coordinates of the input to zero), recover x from the encoding h of the perturbed data, and minimize the loss between the reconstruction $\hat{\mathbf{x}}$ and x (a training sketch follows below).
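A minimal NumPy sketch of one training step with masking noise; the tied weights (decoder = encoder transposed) and cross-entropy loss are common choices in this setting but are assumptions here, not details from the slide:

```python
def dae_step(X, W, b, c, lr, corrupt_p, rng):
    # One gradient step of a tied-weight denoising autoencoder with
    # masking noise and cross-entropy reconstruction loss.
    mask = (rng.random(X.shape) >= corrupt_p).astype(float)
    Xc = X * mask                      # randomly zero some input coordinates
    H = sigmoid(Xc @ W + b)            # encode the corrupted input
    Xr = sigmoid(H @ W.T + c)          # decode: reconstruct the clean x
    err = Xr - X                       # dLoss/d(pre-activation), cross-entropy
    dH = (err @ W) * H * (1.0 - H)     # backprop through the encoder
    N = X.shape[0]
    W -= lr * (Xc.T @ dH + err.T @ H) / N   # tied weights: both paths contribute
    b -= lr * dH.mean(axis=0)
    c -= lr * err.mean(axis=0)
    return W, b, c
```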
  • 47. Denoising Autoencoder [Vincent et al., ICML 2008] • Learns a vector field pointing from corrupted inputs towards higher-probability regions. • Minimizes a variational lower bound on a generative model. • Corresponds to regularized score matching on an RBM. Slide credit: Yoshua Bengio.
  • 48. Stacked Denoising Autoencoders • Greedy layerwise learning: start with the lowest level and stack upwards, training each autoencoder layer on the intermediate code (features) from the layer below; the top layer can have a different output (e.g., softmax nonlinearity) to provide an output for classification. Slide credit: Yoshua Bengio.
  • 49. Stacked Denoising Autoencoders • No partition function, so the training criterion can be measured directly. • The encoder and decoder can use any parametrization. • Performs as well as or better than stacking RBMs for unsupervised pre-training (Infinite MNIST). Slide credit: Yoshua Bengio.
  • 50. Denoising Autoencoders: benchmarks (Larochelle et al., 2009). Slide credit: Yoshua Bengio.
  • 51. Denoising Autoencoders: results • Test errors on the benchmarks (Larochelle et al., 2009). Slide credit: Yoshua Bengio.
  • 52. Why greedy layerwise training works (Bengio 2009; Erhan et al., 2009) • Regularization hypothesis: pre-training "constrains" the parameters to a region relevant to the unsupervised dataset, giving better generalization (representations that better describe unlabeled data are more discriminative for labeled data). • Optimization hypothesis: unsupervised training initializes the lower-level parameters near localities of better minima than random initialization can. Slide credit: Yoshua Bengio.
  • 53. Outline • Restricted Boltzmann machines • Deep Belief Networks • Denoising Autoencoders • Applications to Vision • Applications to Audio and Multimodal Data
  • 54. Convolutional Neural Networks (LeCun et al., 1989): local receptive fields, weight sharing, and pooling.
  • 55. Deep Convolutional Architectures: state-of-the-art on MNIST digits, Caltech-101 objects, etc. Slide credit: Yann LeCun.
  • 56. Learning object representations • Learning objects and parts in images: large image patches contain interesting higher-level structures, e.g., object parts and full objects.
  • 57. Illustration: learning an "eye" detector • An example image is "filtered" by several filters, then "shrunk" (max over 2x2 regions) to produce the "eye detector" output. Advantages of shrinking: 1. the filter size is kept small; 2. invariance.
  • 58. Convolutional RBM (CRBM) [Lee et al., ICML 2009] [Related work: Norouzi et al., CVPR 2009; Desjardins and Bengio, 2008] • For each "filter" k, a detection layer H of binary hidden nodes feeds a max-pooling layer P of binary "max-pooling" nodes, with shared "filter" weights $W^k$ over the visible input layer V. Constraint: within each pooling block, at most one hidden node is active. • Key properties: convolutional structure and probabilistic max-pooling ("mutual exclusion"); a sketch of the pooling computation follows below.
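A simplified sketch of probabilistic max-pooling for a single feature map, assuming the multinomial form of Lee et al. (ICML 2009): within each block, the detection units plus an "off" state form a softmax. Here I holds the bottom-up filter responses (names and block size are illustrative):

```python
def prob_max_pool(I, block=2):
    # Probabilistic max-pooling for one detection-layer feature map.
    # Within each block x block region at most one detection unit is
    # active; the pooling unit is on iff some detection unit is on.
    Hh, Ww = I.shape
    P_detect = np.zeros_like(I)
    P_pool = np.zeros((Hh // block, Ww // block))
    for y in range(0, Hh, block):
        for x in range(0, Ww, block):
            blk = I[y:y + block, x:x + block]
            m = blk.max()
            e = np.exp(blk - m)                 # numerically stable softmax
            denom = e.sum() + np.exp(-m)        # "+ exp(-m)" is the off state
            P_detect[y:y + block, x:x + block] = e / denom
            P_pool[y // block, x // block] = e.sum() / denom
    return P_detect, P_pool
```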
  • 59. Convolutional deep belief networks, illustrated: the input image is filtered by W1 into layer-1 activations (coefficients), then by W2 into layer-2 activations, and by W3 into layer-3 activations; filter visualizations are shown for an example image.
  • 60. Unsupervised learning from natural images: first-layer bases are localized, oriented edges; second-layer bases capture contours, corners, arcs, and surface boundaries.
  • 61. Unsupervised learning of object parts: faces, cars, elephants, chairs.
  • 62. Image classification with spatial pyramids [Lazebnik et al., CVPR 2006; Yang et al., CVPR 2009] • Descriptor layer: detect and locate features, extract corresponding descriptors (e.g., SIFT). • Code layer: code the descriptors, either by vector quantization (VQ), where each code has only one non-zero element, or by soft-VQ, where a small group of elements can be non-zero. • SPM layer: pool codes across subregions and average/normalize into a histogram (a pipeline sketch follows below). Slide credit: Kai Yu.
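A minimal sketch of the descriptor-to-code-to-pyramid-pooling pipeline with hard VQ; the array shapes, grid levels, and normalization are illustrative assumptions, not details from the slide:

```python
def vq_spm_features(desc, pos, codebook, img_w, img_h, levels=(1, 2, 4)):
    # desc: (N, d) local descriptors; pos: (N, 2) their (x, y) locations;
    # codebook: (K, d) codewords learned beforehand (e.g., by k-means).
    d2 = ((desc[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = d2.argmin(axis=1)         # hard VQ: one non-zero element per code
    K = codebook.shape[0]
    feats = []
    for g in levels:                  # pool over a g x g spatial grid
        hist = np.zeros((g, g, K))
        cx = np.minimum((pos[:, 0] * g / img_w).astype(int), g - 1)
        cy = np.minimum((pos[:, 1] * g / img_h).astype(int), g - 1)
        for n, k in enumerate(codes):
            hist[cy[n], cx[n], k] += 1.0
        feats.append(hist.ravel() / max(1, len(codes)))  # normalized histogram
    return np.concatenate(feats)      # concatenated pyramid feature vector
```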
  • 63. Improving the coding step • Classifiers using these features need nonlinear kernels (Lazebnik et al., CVPR 2006; Grauman and Darrell, JMLR 2007), which carry high computational complexity. • Idea: modify the coding step to produce feature representations that linear classifiers can use effectively: sparse coding [Olshausen & Field, Nature 1996; Lee et al., NIPS 2007; Yang et al., CVPR 2009; Boureau et al., CVPR 2010], local coordinate coding [Yu et al., NIPS 2009; Wang et al., CVPR 2010], RBMs [Sohn, Jung, Lee, Hero III, ICCV 2011], and other feature learning algorithms. Slide credit: Kai Yu.
  • 64. Object Recognition Results • Classification accuracy on Caltech 101 (15 / 30 training images): Zhang et al., CVPR 2006: 59.1 / 66.2; Griffin et al., 2008: 59.0 / 67.6; ScSPM [Yang et al., CVPR 2009]: 67.0 / 73.2; LLC [Wang et al., CVPR 2010]: 65.4 / 73.4; Macrofeatures [Boureau et al., CVPR 2010]: - / 75.7; Boureau et al., ICCV 2011: - / 77.1; Sparse RBM [Sohn et al., ICCV 2011]: 68.6 / 74.9; Sparse CRBM [Sohn et al., ICCV 2011]: 71.3 / 77.8. Competitive performance with other state-of-the-art methods using a single type of features.
  • 65. Object Recognition Results • Classification accuracy on Caltech 256 (30 / 60 training images): Griffin et al., 2008: 34.10 / -; van Gemert et al., PAMI 2010: 27.17 / -; ScSPM [Yang et al., CVPR 2009]: 34.02 / 40.14; LLC [Wang et al., CVPR 2010]: 41.19 / 47.68; Sparse CRBM [Sohn et al., ICCV 2011]: 42.05 / 47.94. Competitive performance with other state-of-the-art methods using a single type of features.
  • 66. Local Convolutional RBM [Huang, Lee, Learned-Miller, CVPR 2012] • Models convolutional structures in local regions jointly, which is statistically more effective for learning from non-stationary (roughly aligned) images.
  • 67. Face Verification [Huang, Lee, Learned-Miller, CVPR 2012] • Feature learning: a convolutional deep belief network (1 to 2 layers of CRBM / local CRBM) trained on Kyoto source faces (self-taught learning), with whitened pixel intensity / LBP as the input representation and the top-most pooling layer as the output representation. • Face verification: on LFW pairs, metric learning + SVM predicts matched vs. mismatched.
  • 68. Face Verification [Huang, Lee, Learned-Miller, CVPR 2012] • Accuracy ± SE: V1-like with MKL (Pinto et al., CVPR 2009): 0.7935 ± 0.0055; linear rectified units (Nair and Hinton, ICML 2010): 0.8073 ± 0.0134; CSML (Nguyen & Bai, ACCV 2010): 0.8418 ± 0.0048; learning-based descriptor (Cao et al., CVPR 2010): 0.8445 ± 0.0046; OSS only (Wolf et al., ACCV 2009): 0.8207 ± 0.0041; OSS, TSS, full (Wolf et al., ACCV 2009): 0.8683 ± 0.0034; combined (LBP + deep learning features): 0.8777 ± 0.0062.
  • 69. Tiled Convolutional models • Idea: have one subset of filters applied to one set of locations,
  • 70. Tiled Convolutional models • ...another subset applied to another set of locations,
  • 71. Tiled Convolutional models • ...and so on, training all parameters jointly. No block artifacts; reduced redundancy. [Gregor & LeCun, arXiv 2010; Ranzato, Mnih, Hinton, NIPS 2010]
  • 72. Tiled Convolutional models • Second stage: treat these units as data and train a similar model on top, a field of binary RBMs; each hidden unit has a receptive field of 30x30 pixels in input space.
  • 73. Facial Expression Recognition • Toronto Face Dataset (J. Susskind et al., 2010): ~100K unlabeled faces from different sources, ~4K labeled images, resolution 48x48 pixels, 7 facial expressions (e.g., neutral, sadness, surprise). [Ranzato et al., CVPR 2011]
  • 74. Facial Expression Recognition • Drawing samples from the model (5th layer with 128 hidden units): a stack of RBMs (layers 2 through 5) on top of a gMRF first layer.
  • 75. Facial Expression Recognition • Drawing samples from the model (5th layer with 128 hidden units). [Ranzato et al., CVPR 2011]
  • 76. Facial Expression Recognition • 7 synthetic occlusions; use the generative model to fill in the missing pixels (conditional on the known pixels). [Ranzato et al., CVPR 2011]
  • 77. Facial Expression Recognition • Originals, type 1 occlusion (eyes), and restored images. [Ranzato et al., CVPR 2011]
  • 78. Facial Expression Recognition • Originals, type 2 occlusion (mouth), and restored images. [Ranzato et al., CVPR 2011]
  • 79. Facial Expression Recognition • Originals, type 3 occlusion (right half), and restored images. [Ranzato et al., CVPR 2011]
  • 80. Facial Expression Recognition • Originals, type 5 occlusion (top half), and restored images. [Ranzato et al., CVPR 2011]
  • 81. Facial Expression Recognition • Originals, type 7 occlusion (70% of pixels at random), and restored images. [Ranzato et al., CVPR 2011]
  • 82. Facial Expression Recognition • Occluded images used for both training and test; comparisons with Dailey et al., J. Cog. Neurosci. 2003 and Wright et al., PAMI 2008. [Ranzato et al., CVPR 2011]
  • 83. Outline • Restricted Boltzmann machines • Deep Belief Networks • Denoising Autoencoders • Applications to Vision • Applications to Audio and Multimodal Data
  • 84. Motivation: Multi-modal learning • Single learning algorithms that combine multiple input domains: images, audio & speech, video, text, robotic sensors, time-series data, and others.
  • 85. Motivation: Multi-modal learning • Benefits: more robust performance in multimedia processing, biomedical data mining (fMRI, PET scan, X-ray, EEG, ultrasound), and robot perception (visible-light camera arrays, audio, thermal infrared, 3D range scans).
  • 86. Sparse dictionary learning on audio: a spectrogram x is approximated as a sparse linear combination of learned dictionary elements (e.g., x ≈ 0.9 × basis 36 + 0.7 × basis 42 + 0.2 × basis 63). [Lee, Largman, Pham, Ng, NIPS 2009]
  • 87. Convolutional DBN for audio: the spectrogram (time x frequency) is the visible layer, with detection nodes and a max-pooling node above it. [Lee, Largman, Pham, Ng, NIPS 2009]
  • 88. Convolutional DBN for audio, applied convolutionally across the time axis of the spectrogram. [Lee, Largman, Pham, Ng, NIPS 2009]
  • 89. Convolutional DBN for audio. [Lee, Largman, Pham, Ng, NIPS 2009]
  • 90. Convolutional DBN for audio: stacking a second CDBN layer (detection nodes and max pooling) on top of the first. [Lee, Largman, Pham, Ng, NIPS 2009]
  • 91. CDBNs for speech: learned first-layer bases, trained on the unlabeled TIMIT corpus. [Lee, Largman, Pham, Ng, NIPS 2009]
  • 92. Comparison of learned first-layer bases to phonemes ("oy", "el", "s").
  • 93. Experimental Results • Speaker identification (TIMIT): prior art (Reynolds, 1995): 99.7%; convolutional DBN: 100.0%. • Phone classification (TIMIT): Clarkson et al. (1999): 77.6%; Petrov et al. (2007): 78.6%; Sha & Saul (2006): 78.9%; Yu et al. (2009): 79.2%; convolutional DBN: 80.3%; transformation-invariant RBM (Sohn et al., ICML 2012): 81.5%. [Lee, Pham, Largman, Ng, NIPS 2009]
  • 94. Phone recognition using the mcRBM • Mean-covariance RBM + DBN. [Dahl, Ranzato, Mohamed, Hinton, NIPS 2010]
  • 95. Speech Recognition on TIMIT (phone error rate): Stochastic Segmental Models: 36.0%; Conditional Random Field: 34.8%; Large-Margin GMM: 33.0%; CD-HMM: 27.3%; Augmented Conditional Random Fields: 26.6%; Recurrent Neural Nets: 26.1%; Bayesian Triphone HMM: 25.6%; Monophone HTMs: 24.8%; Heterogeneous Classifiers: 24.4%; Deep Belief Networks (DBNs): 23.0%; Triphone HMMs discriminatively trained with BMMI: 22.7%; Deep Belief Networks with mcRBM feature extraction (Dahl et al., NIPS 2010): 20.5%.
  • 96. Multimodal Feature Learning • Lip reading via multimodal feature learning (audio / visual data). Slide credit: Jiquan Ngiam.
  • 97. Multimodal Feature Learning • Lip reading via multimodal feature learning (audio / visual data). Q: Is concatenating the modalities the best option? Slide credit: Jiquan Ngiam.
  • 98. Multimodal Feature Learning • Concatenating the inputs and learning features (via a single layer) doesn't work: mostly unimodal features are learned.
  • 99. Multimodal Feature Learning • Bimodal autoencoder: predict the unseen modality from the observed modality. [Ngiam et al., ICML 2011]
  • 100. Multimodal Feature Learning • Visualization of learned filters: audio (spectrogram) and video features learned over 100 ms windows. • Results on the AVLetters lip-reading dataset: prior art (Zhao et al., 2009): 58.9%; multimodal deep autoencoder (Ngiam et al., 2011): 65.8%.
  • 101. Summary • Learning feature representations: Restricted Boltzmann Machines, Deep Belief Networks, and Stacked Denoising Autoencoders. • Deep learning and unsupervised feature learning algorithms show promising results in many applications: vision, audio, multimodal data, and others.
  • 102. Thank you!