A Tutorial on Energy-Based Learning

Yann LeCun, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, Fu-Jie Huang
The Courant Institute of Mathematical Sciences
New York University

http://yann.lecun.com
http://www.cs.nyu.edu/~yann

Yann LeCun
Two Problems in Machine Learning

       1. The “Deep Learning Problem”
        “Deep” architectures are necessary to solve the invariance problem in
        vision (and perception in general)
        How do we train deep architectures with lots of non-linear stages

       2. The “Partition Function Problem”
        Give high probability (or low energy) to good answers
        Give low probability (or high energy) to bad answers
        There are too many bad answers!

       This tutorial discusses problem #2
        The partition function problem arises with probabilistic approaches
        Non-probabilistic approaches may allow us to get around it.

       Energy-Based Learning provides a framework in which to describe
       probabilistic and non-probabilistic approaches to learning


Yann LeCun
Energy-Based Model for Decision-Making


                                   Model: Measures the compatibility 
                                  between an observed variable X and 
                                  a variable to be predicted Y through 
                                  an energy function E(Y,X). 




                                   Inference: Search for the Y that 
                                  minimizes the energy within a set
                                   If the set has low cardinality, we can 
                                  use exhaustive search. 
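
A minimal Python sketch of this inference-by-minimization step; the energy function and the answer set below are made-up stand-ins (a toy 3-class problem), not anything from the tutorial:

    # Minimal sketch of energy-based inference over a small discrete answer set.
    # `energy` and `prototypes` are hypothetical stand-ins for E(Y, X) and the set of Y values.
    def infer(x, labels, energy):
        """Return the answer Y that minimizes E(Y, X) by exhaustive search."""
        return min(labels, key=lambda y: energy(y, x))

    # Toy example: the energy is a squared distance to a class prototype.
    prototypes = {"cat": 0.0, "dog": 1.0, "car": 5.0}
    energy = lambda y, x: (x - prototypes[y]) ** 2
    print(infer(0.8, prototypes.keys(), energy))   # -> "dog"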




Yann LeCun
Complex Tasks: Inference is non-trivial


                                              When the cardinality or dimension of Y is large,
                                              exhaustive search is impractical.
                                              We need to use a “smart” inference procedure:
                                              min-sum, Viterbi, .....




Yann LeCun
What Questions Can a Model Answer?

       1. Classification & Decision Making: 
        “which value of Y is most compatible with X?”
        Applications: Robot navigation,.....
        Training: give the lowest energy to the correct answer

       2. Ranking:
        “Is Y1 or Y2 more compatible with X?”
        Applications: Data-mining....
        Training: produce energies that rank the answers correctly

       3. Detection:
        “Is this value of Y compatible with X”?
        Application: face detection....
        Training: energies that increase as the image looks less like a face.

       4. Conditional Density Estimation:
        “What is the conditional distribution P(Y|X)?”
        Application: feeding a decision-making system
        Training: differences of energies must be just so.
Yann LeCun
Decision-Making versus Probabilistic Modeling

       Energies are uncalibrated
        The energies of two separately-trained systems cannot be combined
        The energies are uncalibrated (measured in arbitrary units)

       How do we calibrate energies?
        We turn them into probabilities (positive numbers that sum to 1).
        Simplest way: Gibbs distribution
        Other ways can be reduced to Gibbs by a suitable redefinition of the
        energy.




              P(Y|X) = exp(-beta E(Y,X)) / Z(X),  with partition function Z(X) = integral over y of exp(-beta E(y,X)) and inverse temperature beta
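
A small Python sketch of this calibration step for a finite answer set (this is just the standard stabilized softmax over negative energies; the numbers are made up):

    import numpy as np

    def gibbs_probabilities(energies, beta=1.0):
        """Turn a vector of energies E(Y, X) into probabilities P(Y|X) via a Gibbs distribution.
        The normalizer (partition function) is computed with the usual max-shift for stability."""
        e = -beta * np.asarray(energies, dtype=float)
        e -= e.max()                      # subtract the max for numerical stability
        p = np.exp(e)
        return p / p.sum()                # divide by the partition function

    print(gibbs_probabilities([0.2, 1.5, 3.0], beta=2.0))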

Yann LeCun
Architecture and Loss Function

       Family of energy functions

       Training set

       Loss functional / Loss function
        Measures the quality of an energy function

       Training
       Form of the loss functional 
        invariant under permutations and repetitions of the samples




               (The loss functional combines a per-sample loss, which depends on the desired
               answer Yi and on the energy surface E(W, Y, Xi) for a given Xi as Y varies,
               plus a regularizer.)
Yann LeCun
Designing a Loss Functional




       Correct answer has the lowest energy -> LOW LOSS

       Lowest energy is not for the correct answer -> HIGH LOSS
Yann LeCun
Designing a Loss Functional




             Push down on the energy of the correct answer

             Pull up on the energies of the incorrect answers, particularly if they 
             are smaller than the correct one
Yann LeCun
Architecture + Inference Algo + Loss Function = Model

              1. Design an architecture: a particular form for E(W,Y,X).
              2. Pick an inference algorithm for Y: MAP or conditional distribution, belief prop,
                 min cut, variational methods, gradient descent, MCMC, HMC.....
              3. Pick a loss function: in such a way that minimizing it with respect to W over a
                 training set will make the inference algorithm find the correct Y for a given X.
              4. Pick an optimization method.


        PROBLEM: What loss functions will make the machine approach 
       the desired behavior?  



Yann LeCun
Several Energy Surfaces can give the same answers




             Both surfaces compute Y = X^2

             MIN_Y E(Y,X) = X^2

             Minimum-energy inference gives us the same answer


Yann LeCun
Simple Architectures




             Regression      Binary Classification      Multi-class Classification




Yann LeCun
Simple Architecture: Implicit Regression




        The Implicit Regression architecture 
         allows multiple answers to have low
         energy.
         Encodes a constraint between X and Y
         rather than an explicit functional
         relationship
         This is useful for many applications
         Example: sentence completion: “The
         cat ate the {mouse,bird,homework,...}”
         [Bengio et al. 2003]
         But, inference may be difficult.




Yann LeCun
Examples of Loss Functions: Energy Loss


       Energy Loss
        Simply pushes down on the energy of the correct answer
        WORKS!! (for one architecture)

        COLLAPSES!!! (for another)
Yann LeCun
Examples of Loss Functions: Perceptron Loss




       Perceptron Loss [LeCun et al. 1998], [Collins 2002]
        Pushes down on the energy of the correct answer
        Pulls up on the energy of the machine's answer
        Always positive. Zero when answer is correct
        No “margin”: technically does not prevent the energy surface from
        being almost flat.
        Works pretty well in practice, particularly if the energy
        parameterization does not allow flat surfaces.




Yann LeCun
Perceptron Loss for Binary Classification




       Energy: 

       Inference: 

       Loss: 


       Learning Rule: 

       If Gw(X) is linear in W: 
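
As a concrete (hypothetical) reading of the equations above: with Y in {-1,+1} and energy E(W,Y,X) = -Y Gw(X), minimum-energy inference is Y* = sign(Gw(X)), and the perceptron loss only moves W when the inferred answer is wrong. A minimal Python sketch for the linear case Gw(X) = W.X:

    import numpy as np

    def perceptron_train(X, Y, lr=0.1, epochs=10):
        """Train a linear binary classifier Gw(x) = w . x with the perceptron loss.
        Labels are assumed to be +1 / -1; the update only fires when the inferred
        answer sign(Gw(x)) differs from the desired one."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x, y in zip(X, Y):
                y_hat = np.sign(w @ x) or 1.0      # inference: sign(Gw(x)), break ties to +1
                w += lr * (y - y_hat) * x          # push down E(correct), pull up E(machine's answer)
        return w

    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    Y = np.array([1, 1, -1, -1])
    print(perceptron_train(X, Y))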




Yann LeCun
Examples of Loss Functions: Generalized Margin Losses


       First, we need to define the Most Offending Incorrect Answer


       Most Offending Incorrect Answer: discrete case




       Most Offending Incorrect Answer: continuous case




Yann LeCun
Examples of Loss Functions: Generalized Margin Losses




                                            Generalized Margin Loss: Qm increases with the
                                            energy of the correct answer, and decreases with the
                                            energy of the most offending incorrect answer,
                                            whenever it is less than the energy of the correct
                                            answer plus a margin m.




Yann LeCun
Examples of Generalized Margin Losses




             Hinge Loss 
             [Altun et al. 2003], [Taskar et al. 2003]
             With the linearly-parameterized binary
             classifier architecture, we get linear SVMs




             Log Loss 
             “soft hinge” loss
             With the linearly-parameterized binary
             classifier architecture, we get linear
             Logistic Regression
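
A short Python sketch of both losses, written directly in terms of the energy of the correct answer and the energy of the most offending incorrect answer (the numbers below are arbitrary):

    import numpy as np

    def hinge_loss(e_correct, e_offending, margin=1.0):
        """Generalized hinge: zero once the most offending incorrect answer is at
        least `margin` above the correct one, linear otherwise."""
        return max(0.0, margin + e_correct - e_offending)

    def log_loss(e_correct, e_offending):
        """'Soft hinge': smooth version of the same contrast, with no hard margin."""
        return np.log1p(np.exp(e_correct - e_offending))

    print(hinge_loss(1.0, 1.5), log_loss(1.0, 1.5))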


Yann LeCun
Examples of Margin Losses: Square-Square Loss




        Square-Square Loss
             [LeCun-Huang 2005]
             Appropriate for positive energy
             functions




                                   Learning Y = X^2




                                                        NO COLLAPSE!!
Yann LeCun
Other Margin-Like Losses


        LVQ2 Loss [Kohonen, Oja], [Driancourt-Bottou 1991]




        Minimum Classification Error Loss [Juang, Chou, Lee 1997]




        Square-Exponential Loss [Osadchy, Miller, LeCun 2004]




Yann LeCun
Negative Log-Likelihood Loss

        Conditional probability of the samples (assuming independence)




        Gibbs distribution:




        We get the NLL loss by dividing by P and Beta:



        Reduces to the perceptron loss when Beta -> infinity

Yann LeCun
Negative Log-Likelihood Loss

        Pushes down on the energy of the correct answer

        Pulls up on the energies of all answers in proportion to their probability
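
A small Python sketch of the NLL loss over a finite answer set, under the standard form L = E(correct) + (1/beta) log sum_y exp(-beta E(y)); the gradient with respect to the energies makes the "push down / pull up in proportion to probability" behavior explicit (scipy is used only for a stable logsumexp, and the energies are made-up numbers):

    import numpy as np
    from scipy.special import logsumexp

    def nll_loss(energies, correct_idx, beta=1.0):
        """NLL loss for a finite answer set: E(correct) + (1/beta) log sum_y exp(-beta E(y))."""
        energies = np.asarray(energies, dtype=float)
        return energies[correct_idx] + logsumexp(-beta * energies) / beta

    def nll_grad_wrt_energies(energies, correct_idx, beta=1.0):
        """Gradient w.r.t. each energy: +1 on the correct answer, -P(y|x) on every answer,
        i.e. push down the correct answer, pull up all answers in proportion to their probability."""
        energies = np.asarray(energies, dtype=float)
        p = np.exp(-beta * energies - logsumexp(-beta * energies))
        g = -p
        g[correct_idx] += 1.0
        return g

    print(nll_loss([0.5, 2.0, 3.0], 0), nll_grad_wrt_energies([0.5, 2.0, 3.0], 0))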




Yann LeCun
Negative Log-Likelihood Loss: Binary Classification


        Binary Classifier Architecture:




        Linear Binary Classifier Architecture:




        Learning Rule: logistic regression




Yann LeCun
What Makes a “Good” Loss Function

        Good loss functions make the 
        machine produce the correct 
        answer
         Avoid collapses and flat
         energy surfaces




Yann LeCun
What Makes a “Good” Loss Function

        Good and bad loss functions




Yann LeCun
Advantages/Disadvantages of various losses

        Loss functions differ in how they pick the point(s) whose energy is 
        pulled up, and how much they pull them up

        Losses with a log partition function in the contrastive term pull up all 
        the bad answers simultaneously. 
         This may be good if the gradient of the contrastive term can be
         computed efficiently
         This may be bad if it cannot, in which case we might as well use a
         loss with a single point in the contrastive term

        Variational methods pull up many points, but not as many as with the 
        full log partition function.

        Efficiency of a loss/architecture: how many energies are pulled up for 
        a given amount of computation? 
         The theory for this is to be developed


Yann LeCun
Latent Variable Models

       The energy includes “hidden” variables Z whose value is never given to us




Yann LeCun
What can the latent variables represent?

       Variables that would make the task easier if they were known:
        Face recognition: the gender of the person, the orientation of the
        face.
        Object recognition: the pose parameters of the object (location,
        orientation, scale), the lighting conditions.
        Parts of Speech Tagging: the segmentation of the sentence into
        syntactic units, the parse tree.
        Speech Recognition: the segmentation of the sentence into
        phonemes or phones.
        Handwriting Recognition: the segmentation of the line into
        characters.

       In general, we will search for the value of the latent variable that 
       allows us to get an answer (Y) of smallest energy.
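
A minimal Python sketch of that joint search; the template-matching energy, the label set, and the shift latent are made-up stand-ins used only to make the minimization over (Y, Z) concrete:

    from itertools import product

    def infer_with_latent(x, labels, latents, energy):
        """Jointly search for the (Y, Z) pair of lowest energy; the Z achieving the
        minimum is the latent value that 'explains' the chosen answer."""
        return min(product(labels, latents), key=lambda yz: energy(yz[1], yz[0], x))

    # Toy example: the answer is a digit class, the latent is a horizontal shift of a template.
    templates = {"4": [0, 1, 1], "9": [1, 1, 1]}
    def energy(z, y, x):
        t = templates[y][z:] + templates[y][:z]          # shift the template by z
        return sum((a - b) ** 2 for a, b in zip(t, x))

    print(infer_with_latent([1, 1, 0], templates.keys(), range(3), energy))   # -> ('4', 1)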




Yann LeCun
Probabilistic Latent Variable Models


        Marginalizing over latent variables instead of minimizing.




        Equivalent to traditional energy-based inference with a redefined energy function:

        E(Y, X) = -(1/beta) log [ integral over Z of exp(-beta E(Z, Y, X)) ]

        Reduces to traditional minimization when Beta -> infinity
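
A small Python sketch of this "free energy" over a finite set of latent values (a stable log-sum-exp; the energies below are arbitrary numbers), showing that large beta recovers the minimum:

    import numpy as np
    from scipy.special import logsumexp

    def free_energy(energies_over_z, beta=1.0):
        """Redefined energy obtained by marginalizing over the latent variable:
        F(Y, X) = -(1/beta) log sum_z exp(-beta E(Z, Y, X)).
        As beta -> infinity this tends to min_z E(Z, Y, X)."""
        e = np.asarray(energies_over_z, dtype=float)
        return -logsumexp(-beta * e) / beta

    e_z = [1.0, 0.2, 3.0]
    print(free_energy(e_z, beta=1.0), free_energy(e_z, beta=100.0), min(e_z))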


Yann LeCun
Face Detection and Pose Estimation with a Convolutional EBM

          Training: 52,850 32x32 grey-level images of faces, 52,850 selected non-faces.

          Each training image was used 5 times with random variation in scale, in-plane
          rotation, brightness and contrast.

          2nd phase: half of the initial negative set was replaced by false positives of the
          initial version of the detector.

          Energy: E(W, Z, X) = ||Gw(X) - F(Z)||, where Gw(X) is a convolutional network and
          F(Z) is an analytical mapping onto the face manifold; W are the parameters, X the
          image, Z the pose.

          Small E*(W, X): face.   Large E*(W, X): no face.

          [Osadchy, Miller, LeCun, NIPS 2004]
Yann LeCun
Face Manifold

        The face manifold, parameterized by pose, is a low-dimensional set F(Z) in the output
        space. The image X is mapped into that space by G(X), and the energy
        min_Z ||G(X) - F(Z)|| is the distance of G(X) to the manifold.
Probabilistic Approach: Density model of joint P(face,pose)

Probability that image X is a face with pose Z: a Gibbs density proportional to exp(-E(W, Z, X)).

Given a training set of faces annotated with pose, find the W that maximizes the likelihood of
the data under the model.

Equivalently, minimize the negative log-likelihood: a sum of the energies of the training pairs
plus a log partition term (an integral over all images and poses), which is COMPLICATED.
Energy-Based Contrastive Loss Function

             Attract the network output Gw(X) to the location of the desired pose F(Z) on the manifold.

             Repel the network output Gw(X) away from the face/pose manifold.
Convolutional Network Architecture

    [LeCun et al. 1988, 1989, 1998, 2005]




     Hierarchy of local filters (convolution kernels),
     sigmoid pointwise non-linearities, and spatial subsampling
     All the filter coefficients are learned with gradient descent (back-prop)
Yann LeCun
Alternated Convolutions and Pooling/Subsampling
      (“Simple cells”: multiple convolutions; “complex cells”: pooling/subsampling)

       Local features are extracted everywhere.

       The pooling/subsampling layer builds robustness to variations in feature locations.

       Long history in neuroscience and computer vision:
         Hubel/Wiesel 1962,
         Fukushima 1971-82,
         LeCun 1988-06
         Poggio, Riesenhuber, Serre 02-06
         Ullman 2002-06
         Triggs, Lowe, ....


Yann LeCun
Building a Detector/Recognizer: Replicated Conv. Nets

              (Figure: a 32x32 window network replicated over a 40x40 input produces a 3x3 output map.)

              Traditional Detectors/Classifiers must be applied to every location on a large
              input image, at multiple scales.
              Convolutional nets can be replicated over large images very cheaply.
              The network is applied at multiple scales spaced by sqrt(2).
              Non-maximum suppression with exclusion window.
Yann LeCun
Building a Detector/Recognizer: 
Replicated Convolutional Nets

  Computational cost for replicated convolutional net:
    96x96   -> 4.6 million multiply-accumulate operations
    120x120 -> 8.3 million multiply-accumulate operations
    240x240 -> 47.5 million multiply-accumulate operations
    480x480 -> 232 million multiply-accumulate operations
  Computational cost for a non-convolutional detector of the same size, applied every 12 pixels:
    96x96   -> 4.6 million multiply-accumulate operations
    120x120 -> 42.0 million multiply-accumulate operations
    240x240 -> 788.0 million multiply-accumulate operations
    480x480 -> 5,083 million multiply-accumulate operations
  (The non-convolutional detector uses a 96x96 window shifted by 12 pixels, i.e. an 84x84 overlap
  between neighboring windows.)
Face Detection: Results

              Data Set ->                   TILTED            PROFILE           MIT+CMU
              False positives per image ->  4.42     26.9     0.47     3.36     0.5      1.28
              Our Detector                  90%      97%      67%      83%      83%      88%
              Jones & Viola (tilted)        90%      95%      x                 x
              Jones & Viola (profile)       x                 70%      83%      x
              Rowley et al                  89%      96%                        x
              Schneiderman & Kanade                           86%      93%      x



Yann LeCun
Face Detection and Pose Estimation: Results




Yann LeCun
Face Detection with a Convolutional Net




Yann LeCun
Performance on standard dataset

       Detection                                        Pose estimation




             Pose estimation is performed on faces located automatically by the system.

             When the faces are localized by hand we get: 89% of yaw and 100% of in-plane
             rotations within 15 degrees.

Yann LeCun
Synergy Between Detection and Pose Estimation

       Pose Estimation Improves       Detection improves 
       Detection                      pose estimation




Yann LeCun
EBM for Face Recognition

                                    X and Y are images. Y is a discrete variable with many
                                    possible values: all the people in our gallery.

                                    Example of architecture: E(W, X, Y) = ||Gw(X) - Gw(Y)||.
                                    A function Gw maps input images into a low-dimensional
                                    space in which the Euclidean distance measures dissemblance.

                                    Inference: find the Y in the gallery that minimizes the
                                    energy (the Y that is most similar to X). Minimization
                                    through exhaustive search.

Yann LeCun
Learning an Invariant Dissimilarity Metric with EBM

 [Chopra, Hadsell, LeCun CVPR 2005]

  Training a parameterized, invariant dissimilarity metric may be a solution to the
 many-category problem.

  Find a mapping Gw(X) such that the Euclidean distance ||Gw(X1) - Gw(X2)|| reflects the
 “semantic” distance between X1 and X2.

  Once trained, a trainable dissimilarity metric can be used to classify new categories using
 a very small number of training samples (used as prototypes).

  This is an example where probabilistic models are too constraining, because we would have
 to limit ourselves to models that can be normalized over the space of input pairs. With EBMs,
 we can put what we want in the box (e.g. a convolutional net).

  Siamese Architecture: E(W, X1, X2) = ||Gw(X1) - Gw(X2)||

 Application: face verification/recognition
Loss Function

  Siamese models: distance between the outputs of two identical copies of a model.
  Energy function: E(W, X1, X2) = ||Gw(X1) - Gw(X2)||
  If X1 and X2 are from the same category (genuine pair), train the two copies of the model
 to produce similar outputs (low energy).
  If X1 and X2 are from different categories (impostor pair), train the two copies of the
 model to produce different outputs (high energy).
  Loss function: increasing function of genuine pair energy, decreasing function of
 impostor pair energy.
Loss Function

 Our loss function for a single training pair (X1, X2):

   L(W, X1, X2) = (1 - Y) L_G(E_W(X1, X2)) + Y L_I(E_W(X1, X2))
                = (1 - Y) (2/R) E_W(X1, X2)^2 + Y 2R exp(-2.77 E_W(X1, X2) / R)

 where E_W(X1, X2) = ||G_W(X1) - G_W(X2)||_L1,

 R is the largest possible value of E_W(X1, X2),

 Y = 0 for a genuine pair, and Y = 1 for an impostor pair.
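
A small Python sketch of this pair loss exactly as written above; the embeddings and the bound R below are made-up values, and in practice Gw(X1), Gw(X2) would come from the trained network:

    import numpy as np

    def pair_energy(g1, g2):
        """E_W(X1, X2): L1 distance between the two embeddings Gw(X1) and Gw(X2)."""
        return np.abs(np.asarray(g1) - np.asarray(g2)).sum()

    def contrastive_loss(g1, g2, y, R=10.0):
        """Genuine pairs (y=0) pay (2/R) * E^2; impostor pairs (y=1) pay 2R * exp(-2.77 E / R).
        R is the largest possible value of the energy."""
        e = pair_energy(g1, g2)
        return (1 - y) * (2.0 / R) * e ** 2 + y * 2.0 * R * np.exp(-2.77 * e / R)

    print(contrastive_loss([0.1, 0.2], [0.1, 0.3], y=0),
          contrastive_loss([0.1, 0.2], [2.0, 3.0], y=1))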
Face Verification datasets: AT&T, FERET, and AR/Purdue

     The AT&T/ORL dataset
       Total subjects: 40. Images per subject: 10. Total images: 400.
       Images had a moderate degree of variation in pose, lighting, expression and head position.
       Images from 35 subjects were used for training. Images from the 5 remaining subjects for testing.
       Training set was taken from: 3500 genuine and 119000 impostor pairs.
       Test set was taken from: 500 genuine and 2000 impostor pairs.
       http://www.uk.research.att.com/facedatabase.html
Face Verification datasets: AT&T, FERET, and AR/Purdue

     The FERET dataset (a part of the dataset was used, only for training)
       Total subjects: 96. Images per subject: 6. Total images: 1122.
       Images had a high degree of variation in pose, lighting, expression and head position.
       The images were used for training only.
       http://www.itl.nist.gov/iad/humanid/feret/
Face Verification datasets: AT&T, FERET, and AR/Purdue

     The AR/Purdue dataset
       Total subjects: 136. Images per subject: 26. Total images: 3536.
       Each subject has 2 sets of 13 images taken 14 days apart.
       Images had a very high degree of variation in pose, lighting, expression and position.
       Within each set of 13, there are 4 images with expression variation, 3 with lighting
       variation, 3 with dark sunglasses and lighting variation, and 3 with face-obscuring
       scarves and lighting variation.
       Images from 96 subjects were used for training. The remaining 40 subjects were used for testing.
       Training set drawn from: 64896 genuine and 6165120 impostor pairs.
       Test set drawn from: 27040 genuine and 1054560 impostor pairs.
       http://rv11.ecn.purdue.edu/aleix/aleix_face_DB.html
Face Verification dataset: AR/Purdue
Preprocessing

                 The 3 datasets each required a small amount of preprocessing.
                   FERET: Cropping, subsampling, and centering (see below)
                   AR/PURDUE: Cropping and subsampling
                   AT&T: Subsampling only

                 (Pipeline: crop -> subsample -> center.)
Centering with a Gaussian­blurred face template


    Coarse centering was done on the FERET database images
    1. Construct a template by blurring a well­centered face.
    2. Convolve the template with an uncentered image.
    3. Choose the 'peak' of the convolution as the center of the image. 




Architecture for the Mapping Function Gw(X)

   Convolutional net:
     Input image: 2@56x46
     Layer 1: 15@50x40    (7x7 convolution, 15 kernels)
     Layer 2: 15@25x20    (2x2 subsampling)
     Layer 3: 45@20x15    (6x6 convolution, 198 kernels)
     Layer 4: 45@5x5      (4x3 subsampling)
     Layer 5: 250         (5x5 convolution, 11250 kernels)
     Layer 6: fully connected, 50 units: low-dimensional invariant representation
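
A rough PyTorch sketch of a network with these layer sizes; it assumes full connectivity between feature maps and plain average pooling (so the kernel counts differ from the partially-connected original), and uses tanh as the squashing non-linearity:

    import torch
    import torch.nn as nn

    class GW(nn.Module):
        """Sketch of the mapping Gw(X) with the layer sizes listed above (simplified)."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(2, 15, kernel_size=7), nn.Tanh(),     # 2@56x46 -> 15@50x40
                nn.AvgPool2d(kernel_size=2),                    # -> 15@25x20
                nn.Conv2d(15, 45, kernel_size=6), nn.Tanh(),    # -> 45@20x15
                nn.AvgPool2d(kernel_size=(4, 3)),               # -> 45@5x5
                nn.Conv2d(45, 250, kernel_size=5), nn.Tanh(),   # -> 250@1x1
            )
            self.embed = nn.Linear(250, 50)                     # low-dimensional representation

        def forward(self, x):
            return self.embed(self.features(x).flatten(1))

    print(GW()(torch.zeros(1, 2, 56, 46)).shape)   # torch.Size([1, 50])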
Internal state for genuine and impostor pairs
Gaussian Face Model in the output space

        (Figure: a Gaussian model constructed from 5 images of the above subject, shown with
        genuine X2 and impostor X2 examples.)
Dataset for Verification

  Tested on AT&T and AR/Purdue.
  AT&T dataset:      number of subjects: 5,  images/subject: 10, images/model: 5,
                     total test size: 5000, number of genuine: 500, number of impostors: 4500
  Purdue/AR dataset: number of subjects: 40, images/subject: 26, images/model: 13,
                     total test size: 5000, number of genuine: 500, number of impostors: 4500

Verification Results

                The AT&T dataset               The AR/Purdue dataset
                False Accept   False Reject    False Accept   False Reject
                10.00%         0.00%           10.00%         11.00%
                7.50%          1.00%           7.50%          14.60%
                5.00%          1.00%           5.00%          19.00%
Classification Examples

 Example: correctly classified genuine pairs
          energy: 0.3159        energy: 0.0043        energy: 0.0046

 Example: correctly classified impostor pairs
          energy: 20.1259       energy: 32.7897       energy: 5.7186

 Example: misclassified pairs
          energy: 10.3209       energy: 2.8243
Internal State

     A similar idea for Learning a Manifold with Invariance Properties

     Loss function, with D_W = ||Gw(x1) - Gw(x2)||:

       L_similar    = (1/2) D_W^2
       L_dissimilar = (1/2) (max(0, m - D_W))^2      (m is the margin)

       Pay quadratically for making outputs of neighbors far apart.
       Pay quadratically for making outputs of non-neighbors smaller than a margin m.
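
A small Python sketch of this pair loss as written above; the embeddings and the margin are made-up values, and `neighbors` flags whether the pair is connected in the neighborhood graph:

    import numpy as np

    def manifold_pair_loss(g1, g2, neighbors, margin=1.0):
        """Neighbors pay (1/2) * D^2, non-neighbors pay (1/2) * max(0, margin - D)^2,
        with D = ||Gw(x1) - Gw(x2)||."""
        d = np.linalg.norm(np.asarray(g1) - np.asarray(g2))
        if neighbors:
            return 0.5 * d ** 2
        return 0.5 * max(0.0, margin - d) ** 2

    print(manifold_pair_loss([0.0, 0.1], [0.0, 0.3], True),
          manifold_pair_loss([0.0, 0.1], [0.0, 0.3], False, margin=1.0))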
Yann LeCun
A Manifold with Invariance to Shifts



                                            Training set: 3000 “4” and 3000 “9” from MNIST.
                                            Each digit is shifted horizontally by -6, -3, 3,
                                            and 6 pixels.
                                            Neighborhood graph: 5 nearest neighbors in
                                            Euclidean distance, and shifted versions of self
                                            and nearest neighbors.
                                            Output dimension: 2
                                            Test set (shown): 1000 “4” and 1000 “9”



Yann LeCun
Automatic Discovery of the Viewpoint Manifold with Invariance to Illumination




Yann LeCun
Efficient Inference: Energy-Based Factor Graphs


       Graphical models have brought us efficient inference algorithms, such 
       as belief propagation and its numerous variations.

       Traditionally, graphical models are viewed as probabilistic models

       At first glance, it seems difficult to dissociate graphical models from the
       probabilistic view

       Energy-Based Factor Graphs are an extension of graphical models to
       non-probabilistic settings.

       An EBFG is an energy function that can be written as a sum of “factor” 
       functions that take different subsets of variables as inputs.




Yann LeCun
Efficient Inference: Energy-Based Factor Graphs

        The energy is a sum of “factor” functions
       Example:
        Z1, Z2, Y1 are binary; Y2 is ternary.
        A naive exhaustive inference would require 2x2x2x3 = 24 energy evaluations
        (= 96 factor evaluations).
        BUT: Ea only has 2 possible input configurations, Eb and Ec have 4, and Ed 6.
        Hence, we can precompute the 16 factor values, and put them on the arcs in a graph.
        A path in the graph is a configuration of the variables.
        The cost of the path is the energy of that configuration.
Yann LeCun
Energy-Based Belief Prop


             The previous picture shows a chain graph of factors with 2 inputs.

             The extension of this procedure to trees, with factors that can have more than
             2 inputs, is the “min-sum” algorithm (a non-probabilistic form of belief propagation).

             Basically, it is the sum-product algorithm with a different semi-ring algebra
             (min instead of sum, sum instead of product), and no normalization step.
             [Kschischang, Frey, Loeliger, 2001] [MacKay's book]
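
A minimal Python sketch of min-sum on a chain of pairwise factors; the factor tables below are made-up numbers, and in practice they would be the precomputed factor values placed on the arcs of the graph above:

    import numpy as np

    def min_sum_chain(factors):
        """factors[t] is a K x K array of energies between variable t and t+1.
        Returns the minimum total energy and one minimizing configuration (Viterbi-style)."""
        K = factors[0].shape[0]
        cost = np.zeros(K)                       # best energy of any path ending in each value
        back = []
        for f in factors:
            totals = cost[:, None] + f           # combine incoming cost with the factor
            back.append(totals.argmin(axis=0))   # best predecessor for each value of the next variable
            cost = totals.min(axis=0)
        path = [int(cost.argmin())]
        for b in reversed(back):
            path.append(int(b[path[-1]]))
        return float(cost.min()), path[::-1]

    factors = [np.array([[1.0, 3.0], [0.5, 2.0]]), np.array([[2.0, 0.1], [1.0, 4.0]])]
    print(min_sum_chain(factors))   # -> (0.6, [1, 0, 1])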




Yann LeCun
Feed-Forward, Causal, and Bi-directional Models

        EBFGs are all “undirected”, but the architecture determines the complexity of the
        inference in certain directions.

        (In the diagrams, one factor is a simple distance and the other a complicated function.)

         Feed-Forward:    predicting Y from X is easy; predicting X from Y is hard.
         “Causal”:        predicting Y from X is hard; predicting X from Y is easy.
         Bi-directional:  X->Y and Y->X are both hard if the two factors don't agree;
                          they are both easy if the factors agree.
Yann LeCun
Two types of “deep” architectures

        Factors are deep / graph is deep
Yann LeCun
Shallow Factors / Deep Graph

        Linearly Parameterized Factors

        with the NLL Loss :
         Lafferty's Conditional
         Random Field

        with Hinge Loss: 
         Taskar's Max Margin
         Markov Nets

        with Perceptron Loss
         Collins's sequence
         labeling model

        With Log Loss:
         Altun/Hofmann
         sequence labeling
         model


Yann LeCun
Deep Factors / Deep Graph: ASR with TDNN/DTW

        Trainable Automatic Speech Recognition system with convolutional 
        nets (TDNN) and dynamic time warping (DTW)

        Training the feature 
        extractor as part of the 
        whole process.

        with the LVQ2 Loss :
         Driancourt and
         Bottou's speech
         recognizer (1991)

        with NLL: 
         Bengio's speech
         recognizer (1992)
         Haffner's speech
         recognizer (1993)

Yann LeCun
Deep Factors / Deep Graph: ASR with TDNN/HMM
        Discriminative Automatic Speech Recognition system with HMM and 
        various acoustic models
         Training the acoustic model (feature extractor) and a (normalized)
         HMM in an integrated fashion.

        With Minimum Empirical Error loss
         Ljolje and Rabiner (1990)

        with NLL: 
         Bengio (1992)
         Haffner (1993)
         Bourlard (1994)

        With MCE
         Juang et al. (1997)

        Late normalization scheme (un-normalized HMM)
         Bottou pointed out the label bias problem (1991)
         Denker and Burges proposed a solution (1995)
Yann LeCun
Really Deep Factors / 
      Really Deep Graph


        Handwriting Recognition with 
        Graph Transformer Networks

        Un-normalized hierarchical HMMs
         Trained with Perceptron loss
         [LeCun, Bottou, Bengio,
         Haffner 1998]
         Trained with NLL loss
         [Bengio, LeCun 1994],
         [LeCun, Bottou, Bengio,
         Haffner 1998]

        Answer = sequence of symbols

        Latent variable = segmentation
Yann LeCun

Más contenido relacionado

Similar a Lecun 20060816-ciar-01-energy based learning

Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4Fabian Pedregosa
 
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...Naoki Hayashi
 
Artificial intelligence
Artificial intelligenceArtificial intelligence
Artificial intelligencekeerthikaA8
 
Artificial intelligence
Artificial intelligenceArtificial intelligence
Artificial intelligencekeerthikaA8
 
Artificial intelligence.pptx
Artificial intelligence.pptxArtificial intelligence.pptx
Artificial intelligence.pptxkeerthikaA8
 
Introduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and AlgorithmsIntroduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and AlgorithmsNBER
 
REDES NEURONALES APRENDIZAJE Supervised Vs No Supervised
REDES NEURONALES APRENDIZAJE Supervised Vs No SupervisedREDES NEURONALES APRENDIZAJE Supervised Vs No Supervised
REDES NEURONALES APRENDIZAJE Supervised Vs No SupervisedESCOM
 
REDES NEURONALES APRENDIZAJE Supervised Vs Unsupervised
REDES NEURONALES APRENDIZAJE Supervised Vs UnsupervisedREDES NEURONALES APRENDIZAJE Supervised Vs Unsupervised
REDES NEURONALES APRENDIZAJE Supervised Vs UnsupervisedESCOM
 
Time Independent Perturbation Theory, 1st order correction, 2nd order correction
Time Independent Perturbation Theory, 1st order correction, 2nd order correctionTime Independent Perturbation Theory, 1st order correction, 2nd order correction
Time Independent Perturbation Theory, 1st order correction, 2nd order correctionJames Salveo Olarve
 
(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural NetworksMasahiro Suzuki
 
Bayesian inversion of deterministic dynamic causal models
Bayesian inversion of deterministic dynamic causal modelsBayesian inversion of deterministic dynamic causal models
Bayesian inversion of deterministic dynamic causal modelskhbrodersen
 
Basic Learning Algorithms of ANN
Basic Learning Algorithms of ANNBasic Learning Algorithms of ANN
Basic Learning Algorithms of ANNwaseem khan
 
PRML Chapter 3
PRML Chapter 3PRML Chapter 3
PRML Chapter 3Sunwoo Kim
 
Machine learning and Neural Networks
Machine learning and Neural NetworksMachine learning and Neural Networks
Machine learning and Neural Networksbutest
 
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...Taiji Suzuki
 
Illustrative Introductory Neural Networks
Illustrative Introductory Neural NetworksIllustrative Introductory Neural Networks
Illustrative Introductory Neural NetworksYasutoTamura1
 

Similar a Lecun 20060816-ciar-01-energy based learning (20)

Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4
 
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...
 
alexVAE_New.pdf
alexVAE_New.pdfalexVAE_New.pdf
alexVAE_New.pdf
 
tutorial.ppt
tutorial.ppttutorial.ppt
tutorial.ppt
 
Artificial intelligence
Artificial intelligenceArtificial intelligence
Artificial intelligence
 
Artificial intelligence
Artificial intelligenceArtificial intelligence
Artificial intelligence
 
Artificial intelligence.pptx
Artificial intelligence.pptxArtificial intelligence.pptx
Artificial intelligence.pptx
 
nber_slides.pdf
nber_slides.pdfnber_slides.pdf
nber_slides.pdf
 
Introduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and AlgorithmsIntroduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and Algorithms
 
REDES NEURONALES APRENDIZAJE Supervised Vs No Supervised
REDES NEURONALES APRENDIZAJE Supervised Vs No SupervisedREDES NEURONALES APRENDIZAJE Supervised Vs No Supervised
REDES NEURONALES APRENDIZAJE Supervised Vs No Supervised
 
REDES NEURONALES APRENDIZAJE Supervised Vs Unsupervised
REDES NEURONALES APRENDIZAJE Supervised Vs UnsupervisedREDES NEURONALES APRENDIZAJE Supervised Vs Unsupervised
REDES NEURONALES APRENDIZAJE Supervised Vs Unsupervised
 
Time Independent Perturbation Theory, 1st order correction, 2nd order correction
Time Independent Perturbation Theory, 1st order correction, 2nd order correctionTime Independent Perturbation Theory, 1st order correction, 2nd order correction
Time Independent Perturbation Theory, 1st order correction, 2nd order correction
 
(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks
 
Bayesian inversion of deterministic dynamic causal models
Bayesian inversion of deterministic dynamic causal modelsBayesian inversion of deterministic dynamic causal models
Bayesian inversion of deterministic dynamic causal models
 
Machine Learning 1
Machine Learning 1Machine Learning 1
Machine Learning 1
 
Basic Learning Algorithms of ANN
Basic Learning Algorithms of ANNBasic Learning Algorithms of ANN
Basic Learning Algorithms of ANN
 
PRML Chapter 3
PRML Chapter 3PRML Chapter 3
PRML Chapter 3
 
Machine learning and Neural Networks
Machine learning and Neural NetworksMachine learning and Neural Networks
Machine learning and Neural Networks
 
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
[ICLR2021 (spotlight)] Benefit of deep learning with non-convex noisy gradien...
 
Illustrative Introductory Neural Networks
Illustrative Introductory Neural NetworksIllustrative Introductory Neural Networks
Illustrative Introductory Neural Networks
 

Más de zukun

My lyn tutorial 2009
My lyn tutorial 2009My lyn tutorial 2009
My lyn tutorial 2009zukun
 
ETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVzukun
 
ETHZ CV2012: Information
ETHZ CV2012: InformationETHZ CV2012: Information
ETHZ CV2012: Informationzukun
 
Siwei lyu: natural image statistics
Siwei lyu: natural image statisticsSiwei lyu: natural image statistics
Siwei lyu: natural image statisticszukun
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibrationzukun
 
Brunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionBrunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionzukun
 
Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluationzukun
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-softwarezukun
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptorszukun
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectorszukun
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-introzukun
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video searchzukun
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video searchzukun
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video searchzukun
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learningzukun
 
Advances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionAdvances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionzukun
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick startzukun
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysiszukun
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structureszukun
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities zukun
 

Más de zukun (20)

My lyn tutorial 2009
My lyn tutorial 2009My lyn tutorial 2009
My lyn tutorial 2009
 
ETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCV
 
ETHZ CV2012: Information
ETHZ CV2012: InformationETHZ CV2012: Information
ETHZ CV2012: Information
 
Siwei lyu: natural image statistics
Siwei lyu: natural image statisticsSiwei lyu: natural image statistics
Siwei lyu: natural image statistics
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibration
 
Brunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionBrunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer vision
 
Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluation
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-software
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptors
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectors
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-intro
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video search
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video search
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video search
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learning
 
Advances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionAdvances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer vision
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick start
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysis
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structures
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities
 

Último

20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Lecun 20060816-ciar-01-energy based learning

  • 6. Decision-Making versus Probabilistic Modeling
    Energies are uncalibrated (measured in arbitrary units), so the energies of two separately-trained systems cannot be combined.
    How do we calibrate energies? We turn them into probabilities (positive numbers that sum to 1).
    Simplest way: the Gibbs distribution P(Y|X) = exp(-beta E(Y,X)) / Z(X), where Z(X) = integral over y of exp(-beta E(y,X)) is the partition function and beta is the inverse temperature.
    Other ways can be reduced to Gibbs by a suitable redefinition of the energy.
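    A minimal numpy sketch (not from the slides) of turning a vector of energies over a small discrete set of answers into Gibbs probabilities; the energy values and beta below are purely illustrative:

      import numpy as np

      def gibbs_probabilities(energies, beta=1.0):
          """Convert energies E(Y, X) over candidate answers Y into P(Y|X)."""
          # Subtract the minimum energy for numerical stability; this cancels
          # in the normalization and does not change the distribution.
          e = beta * (np.asarray(energies) - np.min(energies))
          unnormalized = np.exp(-e)
          return unnormalized / unnormalized.sum()   # the sum plays the role of the partition function

      # example: three candidate answers with energies 0.2, 1.5, 3.0
      print(gibbs_probabilities([0.2, 1.5, 3.0], beta=2.0))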
  • 7. Architecture and Loss Function
    Family of energy functions: E = { E(W, Y, X) : W in the parameter space }.
    Training set: S = { (X^i, Y^i) : i = 1..P }.
    Loss functional / loss function: measures the quality of an energy function on the training set; its form is invariant under permutations and repetitions of the samples.
    Training amounts to minimizing L(W, S) = (1/P) sum_{i=1..P} L(Y^i, E(W, ., X^i)) + R(W), where L(Y^i, E(W, ., X^i)) is the per-sample loss, E(W, ., X^i) is the energy surface for a given X^i as Y varies, Y^i is the desired answer, and R(W) is a regularizer.
  • 8. Designing a Loss Functional
    Correct answer has the lowest energy -> LOW LOSS.
    Lowest energy is not for the correct answer -> HIGH LOSS.
  • 9. Designing a Loss Functional
    Push down on the energy of the correct answer.
    Pull up on the energies of the incorrect answers, particularly if they are smaller than the energy of the correct one.
  • 10. Architecture + Inference Algo + Loss Function = Model
    1. Design an architecture: a particular form for E(W, Y, X).
    2. Pick an inference algorithm for Y: MAP or conditional distribution, belief prop, min cut, variational methods, gradient descent, MCMC, HMC, ...
    3. Pick a loss function: in such a way that minimizing it with respect to W over a training set will make the inference algorithm find the correct Y for a given X.
    4. Pick an optimization method.
    PROBLEM: What loss functions will make the machine approach the desired behavior?
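    A hedged end-to-end sketch of the four choices above on a toy problem; the quadratic energy, the exhaustive search over a small candidate set, the hinge-style loss, and the finite-difference gradient descent are illustrative placeholders, not choices prescribed by the tutorial:

      import numpy as np

      def energy(W, Y, X):                       # 1. architecture: a particular form for E(W, Y, X)
          return 0.5 * (Y - W @ X) ** 2          #    (placeholder: quadratic energy, scalar Y)

      def infer(W, X, candidates):               # 2. inference algorithm: exhaustive search over Y
          return min(candidates, key=lambda Y: energy(W, Y, X))

      def per_sample_loss(W, Xi, Yi, candidates, margin=1.0):   # 3. loss function (placeholder: hinge-style)
          wrong = [Y for Y in candidates if Y != Yi]
          most_offending = min(wrong, key=lambda Y: energy(W, Y, Xi))
          return max(0.0, margin + energy(W, Yi, Xi) - energy(W, most_offending, Xi))

      def train(W, data, candidates, lr=0.01, steps=100):        # 4. optimization: gradient descent
          for _ in range(steps):
              for Xi, Yi in data:
                  grad = np.zeros_like(W)
                  for j in range(len(W)):                        # finite-difference gradient, for clarity not speed
                      dW = np.zeros_like(W); dW[j] = 1e-5
                      grad[j] = (per_sample_loss(W + dW, Xi, Yi, candidates)
                                 - per_sample_loss(W - dW, Xi, Yi, candidates)) / 2e-5
                  W = W - lr * grad
          return W

    Exhaustive search over candidates is only feasible when the set of possible answers is small; with large or continuous Y, step 2 has to be replaced by one of the smarter inference procedures listed above.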
  • 11. Several Energy Surfaces can give the same answers
    Both surfaces compute Y = X^2: the minimizing Y satisfies argmin_Y E(Y, X) = X^2.
    Minimum-energy inference gives us the same answer for either surface.
  • 12. Simple Architectures: regression, binary classification, multi-class classification.
  • 13. Simple Architecture: Implicit Regression
    The implicit regression architecture allows multiple answers to have low energy.
    It encodes a constraint between X and Y rather than an explicit functional relationship.
    This is useful for many applications. Example: sentence completion: "The cat ate the {mouse, bird, homework, ...}" [Bengio et al. 2003].
    But inference may be difficult.
  • 14. Examples of Loss Functions: Energy Loss
    Energy loss: simply pushes down on the energy of the correct answer.
    It works for some architectures, but collapses (produces a flat energy surface) for others.
  • 15. Examples of Loss Functions: Perceptron Loss
    Perceptron loss [LeCun et al. 1998], [Collins 2002]: pushes down on the energy of the correct answer and pulls up on the energy of the machine's answer.
    Always positive; zero when the answer is correct.
    No "margin": technically it does not prevent the energy surface from being almost flat.
    Works pretty well in practice, particularly if the energy parameterization does not allow flat surfaces.
  • 16. Perceptron Loss for Binary Classification
    Energy, inference, loss, and learning rule for the binary case.
    If G_W(X) is linear in W, the learning rule is the classical perceptron rule (see the sketch below).
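    A minimal numpy sketch, assuming the standard binary-classifier energy E(W, Y, X) = -Y * G_W(X) with Y in {-1, +1} and a linear G_W(X) = W.X (these forms are the usual ones for this architecture, not copied from the slide's graphics):

      import numpy as np

      def G(W, X):                       # linear discriminant G_W(X) = W.X
          return W @ X

      def energy(W, Y, X):               # E(W, Y, X) = -Y * G_W(X),  Y in {-1, +1}
          return -Y * G(W, X)

      def infer(W, X):                   # inference: Y* = argmin_Y E(W, Y, X) = sign(G_W(X))
          return 1.0 if G(W, X) >= 0 else -1.0

      def perceptron_loss(W, X, Y):      # E(W, Y, X) - min_y E(W, y, X): >= 0, zero when the answer is correct
          return energy(W, Y, X) - energy(W, infer(W, X), X)

      def perceptron_update(W, X, Y, lr=0.1):
          # gradient step on the loss; for linear G this is the classical perceptron rule
          return W + lr * (Y - infer(W, X)) * X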
  • 17. Examples of Loss Functions: Generalized Margin Losses
    First, we need to define the Most Offending Incorrect Answer.
    Discrete case: the incorrect answer with the lowest energy, Ybar^i = argmin over Y != Y^i of E(W, Y, X^i).
    Continuous case: the lowest-energy answer at least a distance epsilon away from the correct one, Ybar^i = argmin over ||Y - Y^i|| > epsilon of E(W, Y, X^i).
  • 18. Examples of Loss Functions: Generalized Margin Losses
    Generalized margin loss: a function Q_m of the energy of the correct answer and the energy of the most offending incorrect answer.
    Q_m increases with the energy of the correct answer.
    Q_m decreases with the energy of the most offending incorrect answer, whenever it is less than the energy of the correct answer plus a margin m.
  • 19. Examples of Generalized Margin Losses
    Hinge loss [Altun et al. 2003], [Taskar et al. 2003]: with the linearly-parameterized binary classifier architecture, we get linear SVMs.
    Log loss ("soft hinge" loss): with the linearly-parameterized binary classifier architecture, we get linear logistic regression.
  • 20. Examples of Margin Losses: Square-Square Loss
    Square-square loss [LeCun-Huang 2005]: appropriate for positive energy functions.
    Learning Y = X^2 with this loss: no collapse.
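    A compact sketch of the margin-type per-sample losses from the last few slides, written in terms of the energy of the correct answer and of the most offending incorrect answer; these follow the standard definitions (the margin m and the exact functional forms are assumptions where the slides' formulas are only shown graphically):

      import numpy as np

      def hinge_loss(e_correct, e_incorrect, m=1.0):
          # penalizes the correct answer unless its energy is below the most
          # offending incorrect answer by at least a margin m
          return max(0.0, m + e_correct - e_incorrect)

      def log_loss(e_correct, e_incorrect):
          # "soft hinge": smooth version of the hinge loss
          return np.log1p(np.exp(e_correct - e_incorrect))

      def square_square_loss(e_correct, e_incorrect, m=1.0):
          # for positive energies: push E(correct) toward 0, push E(incorrect) above m
          return e_correct ** 2 + max(0.0, m - e_incorrect) ** 2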
  • 21. Other Margin-Like Losses
    LVQ2 loss [Kohonen, Oja], [Driancourt-Bottou 1991].
    Minimum Classification Error loss [Juang, Chou, Lee 1997].
    Square-exponential loss [Osadchy, Miller, LeCun 2004].
  • 22. Negative Log-Likelihood Loss
    Start from the conditional probability of the samples (assuming independence), with each conditional given by the Gibbs distribution.
    We get the NLL loss by taking the negative log and dividing by P and beta.
    It reduces to the perceptron loss when beta -> infinity.
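    The standard form of this derivation, consistent with the Gibbs distribution introduced earlier (a sketch; the integral becomes a sum for discrete Y):

      P(Y^i | X^i, W) = \frac{e^{-\beta E(W, Y^i, X^i)}}{\int_y e^{-\beta E(W, y, X^i)}\, dy}

      \mathcal{L}_{nll}(W) = \frac{1}{P} \sum_{i=1}^{P} \Big( E(W, Y^i, X^i)
                              + \frac{1}{\beta} \log \int_y e^{-\beta E(W, y, X^i)}\, dy \Big)

      \text{As } \beta \to \infty,\quad \frac{1}{\beta} \log \int_y e^{-\beta E(W, y, X^i)}\, dy \;\to\; -\min_y E(W, y, X^i),

    so the per-sample loss tends to E(W, Y^i, X^i) - min_y E(W, y, X^i), the perceptron loss.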
  • 23. Negative Log-Likelihood Loss
    Pushes down on the energy of the correct answer.
    Pulls up on the energies of all answers, in proportion to their probability under the model.
  • 24. Negative Log-Likelihood Loss: Binary Classification
    With the binary classifier architecture, and in particular the linear binary classifier architecture, the NLL learning rule is exactly logistic regression.
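    A minimal sketch, assuming the two-energy binary architecture E(W, Y, X) = -Y * (W.X) with Y in {-1, +1} and beta = 1; under these assumptions the NLL loss and its gradient step are exactly those of logistic regression:

      import numpy as np

      def sigmoid(t):
          return 1.0 / (1.0 + np.exp(-t))

      def nll_loss(W, X, Y):
          # NLL over the two possible answers reduces to the logistic loss
          return np.log1p(np.exp(-2.0 * Y * (W @ X)))

      def nll_update(W, X, Y, lr=0.1):
          # gradient step on the NLL; this is the logistic-regression learning rule
          return W + lr * 2.0 * Y * sigmoid(-2.0 * Y * (W @ X)) * X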
  • 25. What Makes a "Good" Loss Function
    Good loss functions make the machine produce the correct answer, and avoid collapses and flat energy surfaces.
  • 26. What Makes a "Good" Loss Function
    Good and bad loss functions.
  • 27. Advantages/Disadvantages of Various Losses
    Loss functions differ in how they pick the point(s) whose energy is pulled up, and how much they pull them up.
    Losses with a log partition function in the contrastive term pull up all the bad answers simultaneously.
    This may be good if the gradient of the contrastive term can be computed efficiently; it may be bad if it cannot, in which case we might as well use a loss with a single point in the contrastive term.
    Variational methods pull up many points, but not as many as with the full log partition function.
    Efficiency of a loss/architecture: how many energies are pulled up for a given amount of computation? The theory for this is yet to be developed.
  • 28. Latent Variable Models
    The energy includes "hidden" variables Z whose value is never given to us.
    Inference then minimizes over both Y and Z, and the energy of Y alone is E(Y, X) = min_Z E(Z, Y, X).
  • 29. What can the latent variables represent?
    Variables that would make the task easier if they were known:
    Face recognition: the gender of the person, the orientation of the face.
    Object recognition: the pose parameters of the object (location, orientation, scale), the lighting conditions.
    Parts-of-speech tagging: the segmentation of the sentence into syntactic units, the parse tree.
    Speech recognition: the segmentation of the sentence into phonemes or phones.
    Handwriting recognition: the segmentation of the line into characters.
    In general, we will search for the value of the latent variable that allows us to get an answer (Y) of smallest energy.
  • 30. Probabilistic Latent Variable Models
    Marginalize over the latent variables instead of minimizing over them.
    This is equivalent to traditional energy-based inference with a redefined energy function: E(Y, X) = -(1/beta) log integral over z of exp(-beta E(z, Y, X)).
    It reduces to traditional minimization over Z when beta -> infinity.
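    A small numerical sketch of the redefined energy for a discrete latent variable Z; the energy values below are illustrative:

      import numpy as np
      from scipy.special import logsumexp

      def free_energy(energies_over_z, beta=1.0):
          """E(Y, X) = -(1/beta) log sum_z exp(-beta E(Z, Y, X)):
          marginalizes the latent variable instead of minimizing over it."""
          return -logsumexp(-beta * np.asarray(energies_over_z)) / beta

      e = [1.0, 2.5, 0.3]                 # E(Z, Y, X) for three values of Z
      print(free_energy(e, beta=1.0))     # soft minimum (marginalization)
      print(free_energy(e, beta=100.0))   # approaches min(e) = 0.3 as beta -> infinity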
  • 31. Face Detection and Pose Estimation with a Convolutional EBM
    Energy: E(W, Z, X) = ||G_W(X) - F(Z)||^2, where G_W(X) is a convolutional network applied to the image X, F(Z) is an analytical mapping of the pose Z onto the face manifold, and W are the parameters.
    Small E*(W, X): face; large E*(W, X): no face.
    Training: 52,850 32x32 grey-level images of faces and 52,850 selected non-faces. Each training image was used 5 times with random variation in scale, in-plane rotation, brightness and contrast.
    2nd phase: half of the initial negative set was replaced by false positives of the initial version of the detector.
    [Osadchy, Miller, LeCun, NIPS 2004]
  • 32. Face Manifold
    G_W maps the image X into a low-dimensional space; F(Z) parameterizes the face manifold by pose.
    The energy is the distance between G_W(X) and the closest point on the manifold: min_Z ||G_W(X) - F(Z)||.
  • 34. Energy-Based Contrastive Loss Function
    Attract the network output G_W(X) to the location of the desired pose F(Z) on the manifold.
    Repel the network output G_W(X) away from the face/pose manifold.
  • 35. Convolutional Network Architecture [LeCun et al. 1988, 1989, 1998, 2005]
    Hierarchy of local filters (convolution kernels), sigmoid pointwise non-linearities, and spatial subsampling.
    All the filter coefficients are learned with gradient descent (back-prop).
  • 36. Alternated Convolutions and Pooling/Subsampling
    "Simple cells" (convolutions): local features are extracted everywhere.
    "Complex cells" (pooling/subsampling): the pooling/subsampling layer builds robustness to variations in feature locations.
    Long history in neuroscience and computer vision: Hubel/Wiesel 1962; Fukushima 1971-82; LeCun 1988-2006; Poggio, Riesenhuber, Serre 2002-06; Ullman 2002-06; Triggs, Lowe, ...
  • 37. Building a Detector/Recognizer: Replicated Conv. Nets
    Traditional detectors/classifiers must be applied to every location on a large input image, at multiple scales.
    Convolutional nets can be replicated over large images very cheaply (e.g. a 32x32 window over a 40x40 input yields a 3x3 output map).
    The network is applied at multiple scales spaced by sqrt(2).
    Non-maximum suppression with an exclusion window.
  • 38. Building a Detector/Recognizer: Replicated Convolutional Nets
    Computational cost for the replicated convolutional net:
      96x96 -> 4.6 million multiply-accumulate operations
      120x120 -> 8.3 million multiply-accumulate operations
      240x240 -> 47.5 million multiply-accumulate operations
      480x480 -> 232 million multiply-accumulate operations
    Computational cost for a non-convolutional detector of the same size, applied every 12 pixels (96x96 window, 12-pixel shift, 84x84 overlap):
      96x96 -> 4.6 million multiply-accumulate operations
      120x120 -> 42.0 million multiply-accumulate operations
      240x240 -> 788.0 million multiply-accumulate operations
      480x480 -> 5,083 million multiply-accumulate operations
  • 39. Face Detection: Results
    Data sets: TILTED, PROFILE, MIT+CMU; false positives per image: 4.42 / 26.9 (TILTED), 0.47 / 3.36 (PROFILE), 0.5 / 1.28 (MIT+CMU).
    Our detector: 90% / 97% (TILTED), 67% / 83% (PROFILE), 83% / 88% (MIT+CMU).
    Jones & Viola (tilted): 90% / 95% (TILTED only).
    Jones & Viola (profile): 70% / 83% (PROFILE only).
    Rowley et al.: 89% / 96%.
    Schneiderman & Kanade: 86% / 93%.
  • 42. Performance on Standard Datasets
    Detection and pose estimation results (plots).
    Pose estimation is performed on faces located automatically by the system; when the faces are localized by hand, we get 89% of yaw and 100% of in-plane rotations within 15 degrees.
  • 43. Synergy Between Detection and Pose Estimation
    Pose estimation improves detection; detection improves pose estimation.
  • 44. EBM for Face Recognition
    X and Y are images; Y is a discrete variable with many possible values (all the people in our gallery).
    Example of architecture: E(W, X, Y) = ||G_W(X) - G_W(Y)||, where the function G_W maps input images into a low-dimensional space in which the Euclidean distance measures dissemblance.
    Inference: find the Y in the gallery that minimizes the energy (the Y that is most similar to X), by exhaustive search.
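    Inference here is just a nearest-neighbor search in the embedding space; a minimal sketch, where G stands for the trained mapping G_W (assumed given) and `gallery` is a hypothetical dict of reference images:

      import numpy as np

      def recognize(G, X, gallery):
          """Return the gallery identity whose embedding is closest to G(X).
          `gallery` maps identity -> reference image."""
          embeddings = {name: G(img) for name, img in gallery.items()}
          gx = G(X)
          return min(embeddings, key=lambda name: np.linalg.norm(gx - embeddings[name]))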
  • 45. Learning an Invariant Dissimilarity Metric with EBM [Chopra, Hadsell, LeCun CVPR 2005]
    Training a parameterized, invariant dissimilarity metric may be a solution to the many-category problem.
    Find a mapping G_W(X) such that the Euclidean distance ||G_W(X1) - G_W(X2)|| reflects the "semantic" distance between X1 and X2.
    Once trained, a trainable dissimilarity metric can be used to classify new categories using a very small number of training samples (used as prototypes).
    This is an example where probabilistic models are too constraining, because we would have to limit ourselves to models that can be normalized over the space of input pairs. With EBMs, we can put what we want in the box (e.g. a convolutional net).
    Siamese architecture: E(W, X1, X2) = ||G_W(X1) - G_W(X2)||.
    Application: face verification/recognition.
  • 46. Loss Function
    Siamese models: distance between the outputs of two identical copies of a model.
    Energy function: E(W, X1, X2) = ||G_W(X1) - G_W(X2)||.
    If X1 and X2 are from the same category (genuine pair), train the two copies of the model to produce similar outputs (low energy).
    If X1 and X2 are from different categories (impostor pair), train the two copies of the model to produce different outputs (high energy).
    Loss function: an increasing function of the genuine-pair energy and a decreasing function of the impostor-pair energy.
  • 47. Loss Function
    Our loss function for a single training pair (X1, X2):
      L(W, X1, X2) = (1 - Y) L_G(E_W(X1, X2)) + Y L_I(E_W(X1, X2))
                   = (1 - Y) (2/R) E_W(X1, X2)^2 + Y 2R exp(-(2.77/R) E_W(X1, X2))
    where E_W(X1, X2) = ||G_W(X1) - G_W(X2)||_L1, R is the largest possible value of E_W(X1, X2), Y = 0 for a genuine pair, and Y = 1 for an impostor pair.
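    The loss above, written as a small numpy function; treating R (the largest possible value of E_W) as a fixed constant passed in by the caller is an implementation choice made here for illustration:

      import numpy as np

      def pair_energy(G, W, X1, X2):
          # E_W(X1, X2) = || G_W(X1) - G_W(X2) ||_L1
          return np.abs(G(W, X1) - G(W, X2)).sum()

      def contrastive_loss(E, Y, R):
          """Y = 0 for a genuine pair, Y = 1 for an impostor pair;
          R is the largest possible value of the energy E."""
          genuine  = (2.0 / R) * E ** 2                 # pulls genuine pairs together
          impostor = 2.0 * R * np.exp(-2.77 * E / R)    # pushes impostor pairs apart
          return (1 - Y) * genuine + Y * impostor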
  • 48. Face Verification Datasets: AT&T, FERET, and AR/Purdue
    The AT&T/ORL dataset: 40 subjects, 10 images per subject, 400 images in total.
    Images had a moderate degree of variation in pose, lighting, expression and head position.
    Images from 35 subjects were used for training; images from the 5 remaining subjects for testing.
    Training set was taken from 3,500 genuine and 119,000 impostor pairs; test set from 500 genuine and 2,000 impostor pairs.
    http://www.uk.research.att.com/facedatabase.html
  • 49. Face Verification Datasets: AT&T, FERET, and AR/Purdue
    The FERET dataset: 96 subjects, 6 images per subject, 1,122 images in total; this part of the dataset was used only for training.
    Images had a high degree of variation in pose, lighting, expression and head position.
    http://www.itl.nist.gov/iad/humanid/feret/
  • 50. Face Verification Datasets: AT&T, FERET, and AR/Purdue
    The AR/Purdue dataset: 136 subjects, 26 images per subject, 3,536 images in total.
    Each subject has 2 sets of 13 images taken 14 days apart.
    Images had a very high degree of variation in pose, lighting, expression and position. Within each set of 13, there are 4 images with expression variation, 3 with lighting variation, 3 with dark sunglasses and lighting variation, and 3 with face-obscuring scarves and lighting variation.
    Images from 96 subjects were used for training; the remaining 40 subjects were used for testing.
    Training set drawn from 64,896 genuine and 6,165,120 impostor pairs; test set drawn from 27,040 genuine and 1,054,560 impostor pairs.
    http://rv11.ecn.purdue.edu/aleix/aleix_face_DB.html
  • 52. Preprocessing
    The 3 datasets each required a small amount of preprocessing.
    FERET: cropping, subsampling, and centering (see next slide).
    AR/Purdue: cropping and subsampling.
    AT&T: subsampling only.
  • 53. Centering with a Gaussian-Blurred Face Template
    Coarse centering was done on the FERET database images:
    1. Construct a template by blurring a well-centered face.
    2. Convolve the template with an uncentered image.
    3. Choose the 'peak' of the convolution as the center of the image.
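    A rough sketch of these three steps, assuming grayscale numpy arrays; the blur width and the use of cross-correlation (rather than full convolution) are implementation assumptions not specified on the slide:

      import numpy as np
      from scipy.ndimage import gaussian_filter
      from scipy.signal import correlate2d

      def center_of_face(image, centered_face, blur_sigma=3.0):
          """Return the (row, col) in `image` that best matches a blurred face template."""
          template = gaussian_filter(centered_face, sigma=blur_sigma)   # step 1: blurred template
          response = correlate2d(image, template, mode='same')          # step 2: slide template over image
          return np.unravel_index(np.argmax(response), response.shape)  # step 3: peak = face center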
  • 54. Architecture for the Mapping Function G_W(X)
    Convolutional net, from the input image to a low-dimensional invariant representation:
      Input image: 2@56x46
      Layer 1: 7x7 convolution (15 kernels) -> 15@50x40
      Layer 2: 2x2 subsampling -> 15@25x20
      Layer 3: 6x6 convolution (198 kernels) -> 45@20x15
      Layer 4: 4x3 subsampling -> 45@5x5
      Layer 5: 5x5 convolution (11250 kernels) -> 250
      Layer 6: fully connected -> 50 units (low-dimensional invariant representation)
  • 56. Gaussian Face Model in the Output Space
    A Gaussian model constructed from 5 genuine images of the above subject; genuine and impostor test images are compared against it.
  • 57. Verification Results, tested on AT&T and AR/Purdue
    Test-set composition:
      AT&T dataset: 5 subjects, 10 images/subject, 5 images/model; total test size 5,000 (500 genuine, 4,500 impostor pairs).
      Purdue/AR dataset: 40 subjects, 26 images/subject, 13 images/model; total test size 5,000 (500 genuine, 4,500 impostor pairs).
    Results (false accept -> false reject):
      AT&T:      10.00% -> 0.00%;   7.50% -> 1.00%;   5.00% -> 1.00%
      AR/Purdue: 10.00% -> 11.00%;  7.50% -> 14.60%;  5.00% -> 19.00%
  • 58. Classification Examples
    Correctly classified genuine pairs: energies 0.3159, 0.0043, 0.0046.
    Correctly classified impostor pairs: energies 32.7897, 5.7186, 20.1259.
    Mis-classified pairs: energies 10.3209, 2.8243.
  • 60. A Similar Idea for Learning a Manifold with Invariance Properties
    Loss function, with D_W = ||G_W(x1) - G_W(x2)||:
      L_similar = (1/2) D_W^2
      L_dissimilar = (1/2) (max(0, m - D_W))^2   (margin m)
    Pay quadratically for making the outputs of neighbors far apart.
    Pay quadratically for making the distance between outputs of non-neighbors smaller than a margin m.
  • 61. A Manifold with Invariance to Shifts
    Training set: 3000 "4" and 3000 "9" from MNIST. Each digit is shifted horizontally by -6, -3, 3, and 6 pixels.
    Neighborhood graph: 5 nearest neighbors in Euclidean distance, plus shifted versions of self and nearest neighbors.
    Output dimension: 2.
    Test set (shown): 1000 "4" and 1000 "9".
  • 62. Automatic Discovery of the Viewpoint Manifold with Invariance to Illumination
  • 63. Efficient Inference: Energy-Based Factor Graphs
    Graphical models have brought us efficient inference algorithms, such as belief propagation and its numerous variations.
    Traditionally, graphical models are viewed as probabilistic models; at first glance, it seems difficult to dissociate graphical models from the probabilistic view.
    Energy-based factor graphs are an extension of graphical models to non-probabilistic settings.
    An EBFG is an energy function that can be written as a sum of "factor" functions that take different subsets of variables as inputs.
  • 64. Efficient Inference: Energy-Based Factor Graphs
    The energy is a sum of "factor" functions. Example: E = Ea(X, Z1) + Eb(X, Z1, Z2) + Ec(Z2, Y1) + Ed(Y1, Y2), where Z1, Z2, and Y1 are binary and Y2 is ternary.
    A naive exhaustive inference would require evaluating the energy for 2x2x2x3 = 24 configurations (= 96 factor evaluations).
    BUT: Ea only has 2 possible input configurations, Eb and Ec have 4 each, and Ed has 6. Hence, we can precompute the 16 factor values and put them on the arcs of a graph.
    A path in the graph is a configuration of the variables; the cost of the path is the energy of that configuration.
  • 65. Energy-Based Belief Prop
    The previous picture shows a chain graph of factors with 2 inputs.
    The extension of this procedure to trees, with factors that can have more than 2 inputs, is the "min-sum" algorithm (a non-probabilistic form of belief propagation).
    Basically, it is the sum-product algorithm with a different semi-ring algebra (min instead of sum, sum instead of product) and no normalization step.
    [Kschischang, Frey, Loeliger, 2001], [MacKay's book]
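    A small sketch of min-sum on a chain of pairwise factors, using precomputed factor tables as described on the previous slide; the variable cardinalities in the example are illustrative:

      import numpy as np

      def min_sum_chain(factors):
          """factors[k] is a 2-D array of energies E_k(v_k, v_{k+1});
          returns the minimum total energy and the minimizing configuration."""
          # forward pass: best cost of reaching each value of the next variable
          cost = np.zeros(factors[0].shape[0])
          back = []
          for F in factors:
              total = cost[:, None] + F            # cost of each (v_k, v_{k+1}) transition
              back.append(np.argmin(total, axis=0))
              cost = np.min(total, axis=0)
          # backward pass: trace the best path
          path = [int(np.argmin(cost))]
          for b in reversed(back):
              path.append(int(b[path[-1]]))
          return float(np.min(cost)), path[::-1]

      # example: a 3-factor chain over variables of cardinality 2, 2, 3, 2
      chain = [np.random.rand(2, 2), np.random.rand(2, 3), np.random.rand(3, 2)]
      print(min_sum_chain(chain))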
  • 66. Feed-Forward, Causal, and Bi-directional Models
    EBFGs are all "undirected", but the architecture determines the complexity of inference in certain directions.
    Feed-forward: predicting Y from X is easy; predicting X from Y is hard.
    "Causal": predicting Y from X is hard; predicting X from Y is easy.
    Bi-directional: X->Y and Y->X are both hard if the two factors don't agree; they are both easy if the factors agree.
  • 67. Two types of "deep" architectures: the factors are deep, or the graph is deep.
  • 68. Shallow Factors / Deep Graph
    Linearly parameterized factors:
      with the NLL loss: Lafferty's Conditional Random Fields
      with the hinge loss: Taskar's Max-Margin Markov Nets
      with the perceptron loss: Collins's sequence labeling model
      with the log loss: Altun/Hofmann's sequence labeling model
  • 69. Deep Factors / Deep Graph: ASR with TDNN/DTW
    Trainable Automatic Speech Recognition system with convolutional nets (TDNN) and dynamic time warping (DTW); the feature extractor is trained as part of the whole process.
      with the LVQ2 loss: Driancourt and Bottou's speech recognizer (1991)
      with NLL: Bengio's speech recognizer (1992), Haffner's speech recognizer (1993)
  • 70. Deep Factors / Deep Graph: ASR with TDNN/HMM
    Discriminative Automatic Speech Recognition system with HMM and various acoustic models; the acoustic model (feature extractor) and a (normalized) HMM are trained in an integrated fashion.
      with the Minimum Empirical Error loss: Ljolje and Rabiner (1990)
      with NLL: Bengio (1992), Haffner (1993), Bourlard (1994)
      with MCE: Juang et al. (1997)
    Late normalization scheme (un-normalized HMM): Bottou pointed out the label bias problem (1991); Denker and Burges proposed a solution (1995).
  • 71. Really Deep Factors / Really Deep Graph
    Handwriting recognition with Graph Transformer Networks: un-normalized hierarchical HMMs.
    Trained with the perceptron loss [LeCun, Bottou, Bengio, Haffner 1998].
    Trained with the NLL loss [Bengio, LeCun 1994], [LeCun, Bottou, Bengio, Haffner 1998].
    Answer = sequence of symbols; latent variable = segmentation.