Machine Learning
 Maximum Likelihood Estimation
and Bayesian Parameter Estimation
      (Parametric Learning)

               Phong VO
       vdphong@fit.hcmus.edu.vn

           September 11, 2010




Introduction



• From the previous lecture, designing a classifier assumes knowledge of p(x|ωi)
  and P (ωi) for each class; e.g., for Gaussian densities we need to know
  µi, Σi for i = 1, . . . , c.

• Unfortunately, this information is not available directly.

• Given training samples with the true class label for each sample, we have
  a learning problem.

• If the form of the densities is known, i.e. the number of parameters and
  general knowledge about the problem, the learning problem reduces to a
  parameter estimation problem.

Example 1. Assume that p(x|ωi) is a normal density with mean µi and
covariance matrix Σi, although we do not know the exact values of these
quantities. This knowledge simplifies the problem from one of estimating
an unknown function p(x|ωi) to one of estimating the parameters µi and
Σi.




Approaches to Parameter Estimation



• In maximum likelihood estimation, we assume the parameters are fixed,
  but unknown. The MLE approach seeks the "best" parameter estimate,
  where "best" means the set of parameters that maximizes the probability
  of obtaining the training set.

• Bayesian estimation models the parameters to be estimated as random
  variables with some (assumed) known a priori distribution. The training
  set provides the "observations", which allow conversion of the a priori
  information into an a posteriori density. The Bayesian approach uses the
  training set to update the training-set-conditioned density function of the
  unknown parameters.


Maximum Likelihood Estimation



• MLE nearly always has good convergence properties as the number of
  training samples increases.

• It is often simpler than alternative methods, such as Bayesian techniques.




Formulation



• Assume D = ∪_{j=1}^{c} Dj, with the samples in Dj having been drawn
  independently according to the probability law p(x|ωj).

• Assume that p(x|ωj) has a known parametric form, and is determined
  uniquely by the value of a parameter vector θ j, e.g. p(x|ωj) ∼ N (µj, Σj)
  where θ j = {µj, Σj}.

• The dependence of p(x|ωj ) on θ j is expressed as p(x|ωj , θ j ).

• Our problem: use the training samples to obtain good estimates for the
  unknown parameter vectors θ 1, . . . , θ c.

• Assume further that samples in Di give no information about θ j if i ≠ j.
  In other words, the parameters for different classes are functionally independent.


• The problem of classification is turned into problems of parameter
  estimation: use a set D of training samples drawn independently from the
  probability density p(x|θ) to estimate the unknown parameter vector θ.


• Suppose that D = {x1, . . . , xn}. Since the samples were drawn
  independently, we have

                    p(D|θ) = ∏_{k=1}^{n} p(xk|θ)


    p(D|θ) is called the likelihood of θ with respect to the set of samples.
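Illustration (not from the original slides): for an i.i.d. sample the likelihood is a product of per-sample densities, which underflows in floating point for even moderate n; summing log-densities is the numerically stable equivalent, which motivates the log-likelihood introduced next. A minimal Python sketch, assuming a univariate Gaussian p(x|θ) with θ = (µ, σ) and a hypothetical data array x:

    import numpy as np
    from scipy.stats import norm

    x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)  # hypothetical sample D

    def likelihood(x, mu, sigma):
        # p(D|theta) = prod_k p(x_k|theta); underflows to 0.0 for large n
        return np.prod(norm.pdf(x, loc=mu, scale=sigma))

    def log_likelihood(x, mu, sigma):
        # l(theta) = sum_k ln p(x_k|theta); numerically stable
        return np.sum(norm.logpdf(x, loc=mu, scale=sigma))

    print(likelihood(x, 2.0, 1.5))      # ~0.0 because the product underflows
    print(log_likelihood(x, 2.0, 1.5))  # a finite log-likelihood value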

• The maximum likelihood estimate of θ is the value θ̂ that maximizes
  p(D|θ).




• Taking the logarithm on both sides (for analytical convenience), we define
  l(θ) as the log-likelihood function


                                      l(θ) = ln p(D|θ)


• Since the logarithm is monotonically increasing, the θ̂ that maximizes
  the log-likelihood also maximizes the likelihood,

                    θ̂ = arg max_θ l(θ) = arg max_θ ∑_{k=1}^{n} ln p(xk|θ)



• θ̂ can be found by taking derivatives of the log-likelihood function

                    ∇_θ l = ∑_{k=1}^{n} ∇_θ ln p(xk|θ)

    where

                    ∇_θ ≡ [∂/∂θ1, . . . , ∂/∂θp]^t,

    and then solve the equation

                    ∇_θ l = 0.
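Illustration (not from the original slides): when ∇_θ l = 0 has no convenient closed form, the log-likelihood can be maximized numerically by minimizing the negative log-likelihood. A minimal sketch using scipy.optimize with hypothetical Gaussian data; the parameterization via log σ is just to keep σ positive:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    x = np.random.default_rng(1).normal(loc=2.0, scale=1.5, size=500)  # hypothetical data

    def neg_log_likelihood(theta):
        mu, log_sigma = theta                  # optimize log(sigma) so sigma stays positive
        return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

    res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), method="BFGS")
    mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
    print(mu_hat, sigma_hat)  # close to the sample mean and the (biased) sample std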



• A solution could be a global maximum, a local maximum or minimum.
  We have to check each of them individually.

• NOTE: A related class of estimators - maximum a posteriori or MAP
  estimators - find the value of θ that maximizes l(θ)p(θ). Thus a ML
  estimator is a MAP estimator for the uniform or "flat" prior.
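Illustration (not from the original slides): the MAP/ML relation can be made concrete by maximizing log-likelihood plus log-prior (the posterior up to a constant). With a very broad ("flat") Gaussian prior on µ, the MAP estimate essentially coincides with the ML estimate; the prior widths below are hypothetical:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    x = np.random.default_rng(2).normal(loc=2.0, scale=1.5, size=500)  # hypothetical data

    def neg_log_posterior(theta, prior_sigma):
        mu, log_sigma = theta
        nll = -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))  # -log-likelihood
        neg_log_prior = -norm.logpdf(mu, loc=0.0, scale=prior_sigma)    # Gaussian prior on mu only
        return nll + neg_log_prior

    for prior_sigma in (0.1, 1e6):  # tight prior vs. nearly flat prior
        res = minimize(neg_log_posterior, x0=np.zeros(2), args=(prior_sigma,), method="BFGS")
        print(prior_sigma, res.x[0], np.exp(res.x[1]))
    # with prior_sigma = 1e6 the MAP estimate of mu is essentially the ML estimate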




MLE: The Gaussian Case for Unknown µ



• In this case, only the mean is unknown. Under this condition, we consider
  a sample point xk and find

               ln p(xk|µ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(xk − µ)^t Σ^{−1}(xk − µ)
    and


                    ∇_θ ln p(xk|µ) = Σ^{−1}(xk − µ).

• The maximum likelihood estimate for µ must satisfy

                    ∑_{k=1}^{n} Σ^{−1}(xk − µ̂) = 0,


• Solving the above equation, we obtain

                    µ̂ = (1/n) ∑_{k=1}^{n} xk


• Interpretation: The maximum likelihood estimate for the unknown
  population mean is just the arithmetic average of the training samples
  - the sample mean. Think of the n samples as a cloud of points; the
  sample mean is the centroid of the cloud.
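Illustration (not from the original slides): a quick numerical check that the centroid of the sample cloud is the ML estimate of the mean, using hypothetical two-dimensional Gaussian data:

    import numpy as np

    rng = np.random.default_rng(3)
    true_mu = np.array([1.0, -2.0])
    X = rng.multivariate_normal(mean=true_mu, cov=np.eye(2), size=2000)  # hypothetical samples

    mu_hat = X.mean(axis=0)  # centroid of the cloud = ML estimate of the mean
    print(mu_hat)            # close to [1.0, -2.0]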




MLE: The Gaussian Case for Unknown µ and Σ



• Consider the univariate normal case, θ = {θ1, θ2} = {µ, σ^2}. The
  log-likelihood of a single point is

                    ln p(xk|θ) = −(1/2) ln(2πθ2) − (1/(2θ2))(xk − θ1)^2

    and its derivative is

                    ∇_θ ln p(xk|θ) = [ (1/θ2)(xk − θ1),  −1/(2θ2) + (xk − θ1)^2/(2θ2^2) ]^t.


• Setting ∇_θ l = 0 (summing the per-sample derivatives over all n points), we obtain

                    ∑_{k=1}^{n} (1/θ̂2)(xk − θ̂1) = 0

    and

                    −∑_{k=1}^{n} 1/θ̂2 + ∑_{k=1}^{n} (xk − θ̂1)^2 / θ̂2^2 = 0.

• Substituting µ̂ = θ̂1 and σ̂^2 = θ̂2, we obtain

                    µ̂ = (1/n) ∑_{k=1}^{n} xk

    and

                    σ̂^2 = (1/n) ∑_{k=1}^{n} (xk − µ̂)^2.
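Illustration (not from the original slides): the closed forms above are easy to verify numerically. The sketch below uses hypothetical data and notes that σ̂^2 is the biased 1/n estimator (np.var with ddof=0), not the unbiased 1/(n − 1) version:

    import numpy as np

    x = np.random.default_rng(4).normal(loc=2.0, scale=1.5, size=500)  # hypothetical data

    mu_hat = x.mean()                        # (1/n) * sum_k x_k
    sigma2_hat = np.mean((x - mu_hat) ** 2)  # (1/n) * sum_k (x_k - mu_hat)^2

    print(mu_hat, sigma2_hat)
    print(np.isclose(sigma2_hat, np.var(x, ddof=0)))  # True: the ML variance uses 1/n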

Exercise 1. Estimate µ and Σ for the case of multivariate Gaussian.




Bayesian Parameter Estimation



• Bayes’ formula allows us to compute the posterior probabilities P (ωi|x)
  from the prior probabilities P (ωi) and the class-conditional densities
  p(x|ωi).

• How can we obtain these quantities?
    – Prior probabilities: from knowledge of the functional forms for the unknown
      densities and the ranges of the values of the unknown parameters.
    – Class-conditional densities: from the training samples.




• Given training samples D, Bayes' formula then becomes

                    P(ωi|x, D) = p(x|ωi, D) P(ωi|D) / ∑_{j=1}^{c} p(x|ωj, D) P(ωj|D)

• Assume that the a priori probabilities are known, P(ωi) = P(ωi|D), and
  that the samples in Di have no influence on p(x|ωj, D) if i ≠ j; then

                    P(ωi|x, D) = p(x|ωi, Di) P(ωi) / ∑_{j=1}^{c} p(x|ωj, Dj) P(ωj)


• We have c separate problems of the following form: use a set D of samples
  drawn independently according to the fixed but unknown probability
  distribution p(x) to determine p(x|D). Our supervised learning problem
  is turned into an unsupervised density estimation problem.

The Parameter Distribution



• Although the desired probability density p(x) is unknown, we assume
  that it has a known parametric form.

• The unknown factor is the value of a parameter vector θ. As long as θ
  is known, the function p(x|θ) is known.

• Information that we have about θ prior to observing the samples is
  assumed to be contained in a known prior density p(θ).

• Observation of the samples converts this to a posterior density p(θ|D),
  which is expected to be sharply peaked about the true value of θ.

• Our basic goal is to compute p(x|D), which is as close as we can come to
  obtaining the unknown p(x). By integrating the joint density p(x, θ|D),
  we have



                    p(x|D) = ∫ p(x, θ|D) dθ                      (1)

                           = ∫ p(x|θ, D) p(θ|D) dθ               (2)

                           = ∫ p(x|θ) p(θ|D) dθ                  (3)

    where the last equality holds because the distribution of x is independent
    of the training set D once θ is given.
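Illustration (not from the original slides): equation (3) is an average of p(x|θ) over the posterior p(θ|D). When the integral has no closed form, one standard option is a Monte Carlo approximation: draw θ samples from the posterior and average p(x|θ). The posterior below is a hypothetical stand-in Gaussian over a scalar mean:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(5)
    mu_n, sigma_n = 1.8, 0.1                 # hypothetical posterior p(mu|D) = N(mu_n, sigma_n^2)
    theta_samples = rng.normal(loc=mu_n, scale=sigma_n, size=10000)

    def predictive_density(x, sigma=1.5):
        # p(x|D) ~= (1/M) * sum_m p(x|theta_m), with p(x|theta) = N(theta, sigma^2)
        return np.mean(norm.pdf(x, loc=theta_samples, scale=sigma))

    print(predictive_density(2.0))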




BPE: Gaussian Case


• Calculate p(θ|D) and p(x|D) for the case where p(x|µ) ∼ N (µ, Σ)

• Consider the univariate case where µ is unknown

                            p(x|µ) ∼ N (µ, σ^2)

• We assume that the prior density p(µ) has a known distribution

                    p(µ) ∼ N (µ0, σ0^2)

    Interpretation: µ0 represents our best a priori guess for µ, and σ0^2
    measures our uncertainty about this guess.

• Once µ is "guessed", it determines the density for x. Letting D =
  {x1, . . . , xn}, Bayes' formula gives us

                    p(µ|D) = p(D|µ)p(µ) / ∫ p(D|µ)p(µ) dµ ∝ p(µ) ∏_{k=1}^{n} p(xk|µ)




    where it is easy to see how the training samples affect the estimation of
    the true µ.




• Since p(xk|µ) ∼ N (µ, σ^2) and p(µ) ∼ N (µ0, σ0^2), we have


    p(µ|D) ∝ ∏_{k=1}^{n} [ (1/(√(2π) σ)) exp(−(1/2)((xk − µ)/σ)^2) ] · (1/(√(2π) σ0)) exp(−(1/2)((µ − µ0)/σ0)^2)    (4)

           ∝ exp[ −(1/2) ( ∑_{k=1}^{n} ((µ − xk)/σ)^2 + ((µ − µ0)/σ0)^2 ) ]                                          (5)

           ∝ exp[ −(1/2) ( (n/σ^2 + 1/σ0^2) µ^2 − 2 ( (1/σ^2) ∑_{k=1}^{n} xk + µ0/σ0^2 ) µ ) ]                       (6)



• p(µ|D) is again a normal density; it is said to be a reproducing density,
  and p(µ) is a conjugate prior.
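Illustration (not from the original slides): the reproducing-density property can also be checked numerically by evaluating likelihood × prior on a grid of µ values and normalizing; the resulting posterior is again (approximately) a Gaussian, with the moments given in closed form on the next slide. All numbers below are hypothetical:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(6)
    sigma, mu0, sigma0 = 1.5, 0.0, 2.0             # known sigma, prior N(mu0, sigma0^2)
    x = rng.normal(loc=2.0, scale=sigma, size=30)  # hypothetical training samples

    grid = np.linspace(-2.0, 5.0, 2001)
    log_post = norm.logpdf(grid, loc=mu0, scale=sigma0)                          # log prior
    log_post += np.sum(norm.logpdf(x[:, None], loc=grid, scale=sigma), axis=0)   # + log likelihood
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                             # discrete normalization over the grid

    mu_n = np.sum(grid * post)                     # posterior mean on the grid
    var_n = np.sum((grid - mu_n) ** 2 * post)      # posterior variance on the grid
    print(mu_n, var_n)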

• If we write p(µ|D) ∼ N (µn, σn^2), then

                    p(µ|D) = (1/(√(2π) σn)) exp[ −(1/2)((µ − µn)/σn)^2 ]

• Equating coefficients shows that

                    µn = ( nσ0^2 / (nσ0^2 + σ^2) ) x̄n + ( σ^2 / (nσ0^2 + σ^2) ) µ0

    where x̄n = (1/n) ∑_{k=1}^{n} xk, and

                    σn^2 = σ0^2 σ^2 / (nσ0^2 + σ^2)

Interpretation: these equations show how the prior information is
    combined with the empirical information in the samples to obtain the a
    posteriori density p(µ|D).
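Illustration (not from the original slides): a short sketch of these closed-form updates with hypothetical data and prior. As n grows, µn is pulled from the prior guess µ0 toward the sample mean and σn^2 shrinks roughly like σ^2/n, which is the behavior discussed on the next slides:

    import numpy as np

    rng = np.random.default_rng(7)
    sigma = 1.5                        # known data standard deviation
    mu0, sigma0 = 0.0, 2.0             # prior N(mu0, sigma0^2) on the unknown mean
    x = rng.normal(loc=2.0, scale=sigma, size=1000)  # hypothetical samples

    def posterior_params(x, mu0, sigma0, sigma):
        n, xbar = len(x), x.mean()
        mu_n = (n * sigma0**2 / (n * sigma0**2 + sigma**2)) * xbar \
             + (sigma**2 / (n * sigma0**2 + sigma**2)) * mu0
        var_n = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
        return mu_n, var_n

    for n in (1, 10, 100, 1000):
        print(n, posterior_params(x[:n], mu0, sigma0, sigma))
    # mu_n moves toward the sample mean; var_n behaves like sigma^2/n for large n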




Interpretation


• µn represents our best guess for µ after observing n samples

• σn^2 measures our uncertainty about this guess; since

                    σn^2 ≈ σ^2 / n    for large n,

    each additional observation decreases our uncertainty about the true
    value of µ.

• As n increases, p(µ|D) approaches a Dirac delta function.

• This behavior is known as Bayesian learning.

• µn is a positive (convex) combination of x̄n and µ0, so µn always lies
  between x̄n and µ0:

                    lim_{n→∞} µn = x̄n    if σ0 ≠ 0
                    µn = µ0               if σ0 = 0
                    µn ≈ x̄n              if σ0 ≫ σ


• The dogmatism:

                    (prior knowledge) / (empirical data) ∼ σ^2 / σ0^2

• If the dogmatism is not infinite, then after enough samples are taken the
  exact values assumed for µ0 and σ0^2 will be unimportant, and µn will
  converge to the sample mean.




Compute the class-conditional density


• Having obtained the a posteriori density for the mean, p(µ|D), we now
  compute the "class-conditional" density p(x|D)



      p(x|D) = ∫ p(x|µ) p(µ|D) dµ                                                                      (7)

             = ∫ (1/(√(2π) σ)) exp[−(1/2)((x − µ)/σ)^2] (1/(√(2π) σn)) exp[−(1/2)((µ − µn)/σn)^2] dµ   (8)

             = (1/(2πσσn)) exp[ −(1/2) (x − µn)^2 / (σ^2 + σn^2) ] f(σ, σn),                           (9)


    where

                    f(σ, σn) = ∫ exp[ −(1/2) ((σ^2 + σn^2)/(σ^2 σn^2)) (µ − (σn^2 x + σ^2 µn)/(σ^2 + σn^2))^2 ] dµ



• Hence p(x|D) is normally distributed with mean µn and variance σ^2 + σn^2:

                    p(x|D) ∼ N (µn, σ^2 + σn^2).



• The density p(x|D) is the desired class-conditional density p(x|ωj , Dj ).
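Illustration (not from the original slides): continuing the hypothetical Bayesian-update sketch above, the predictive density is simply a Gaussian whose variance adds the posterior uncertainty σn^2 to the data noise σ^2:

    from scipy.stats import norm

    sigma = 1.5                   # known data standard deviation (hypothetical)
    mu_n, var_n = 1.95, 0.002     # posterior parameters, e.g. from posterior_params above

    def predictive_pdf(x):
        # p(x|D) = N(x; mu_n, sigma^2 + sigma_n^2)
        return norm.pdf(x, loc=mu_n, scale=(sigma**2 + var_n) ** 0.5)

    print(predictive_pdf(2.0))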

Exercise 2. Use Bayesian estimation to calculate the a posteriori density
p(θ|D) and the desired probability density p(x|D) for the multivariate case
where p(x|µ) ∼ N (µ, Σ)


BPE: General Theory


   The basic assumptions for the applicability of Bayesian estimation are
summarized as follows:

1. The form of the density p(x|θ) is assumed to be known, but the value
   of the parameter vector θ is not known exactly.

2. Our initial knowledge about θ is assumed to be contained in a known a
   priori density p(θ).

3. The rest of our knowledge about θ is contained in a set D of n samples
   x1, . . . , xn drawn independently according to the unknown probability
   density p(x).

   The basic problem is to compute the posterior density p(θ|D), since from
it we can compute

                    p(x|D) = ∫ p(x|θ) p(θ|D) dθ.

     By Bayes' formula we have

                    p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ

     and by the independence assumption

                    p(D|θ) = ∏_{k=1}^{n} p(xk|θ).


Frequentist Perspective



• Probability refers to limiting relative frequencies.   Probabilities are
  objective properties of the real world.

• Parameters are fixed, unknown constants. Because they are not
  fluctuating, no useful probability statements can be made about
  parameters.

• Statistical procedures should be designed to have well-defined long-run
  frequency properties.




Bayesian Perspective



• Probability describes degrees of belief, not limiting frequency.

• We can make probability statements about parameters, even though they
  are fixed constants.

• We make inferences about a parameter θ by producing a probability
  distribution for θ.




Frequentists vs. Bayesians



• Bayesian inference is a controversial approach because it inherently
  embraces a subjective notion of probability.

•



