31. Outline
• Motivation
• Probability Definitions and Rules
• Probability Distributions
• MLE for Gaussian Parameter Estimation
• MLE and Least Squares
32. Material
• Pattern Recognition and Machine Learning - Christopher M. Bishop
• All of Statistics – Larry Wasserman
• Wolfram MathWorld
• Wikipedia
33. Motivation
• Uncertainty arises through:
• Noisy measurements
• Finite size of data sets
• Ambiguity: The word bank can mean (1) a financial institution, (2) the side of a river,
or (3) tilting an airplane. Which meaning was intended, based on the words that
appear nearby?
• Limited Model Complexity
• Probability theory provides a consistent framework for the quantification
and manipulation of uncertainty
• Allows us to make optimal predictions given all the information available to
us, even though that information may be incomplete or ambiguous
34. Sample Space
• The sample space Ω is the set of possible outcomes of an experiment.
Points ω in Ω are called sample outcomes, realizations, or elements.
Subsets of Ω are called Events.
• Example. If we toss a coin twice then Ω = {HH,HT, TH, TT}. The event
that the first toss is heads is A = {HH,HT}
• We say that events A1 and A2 are disjoint (mutually exclusive) if A1 ∩ A2 = ∅
• Example: first flip being heads and first flip being tails
35. Probability
• We will assign a real number P(A) to every event A, called the
probability of A.
• To qualify as a probability, P must satisfy three axioms:
• Axiom 1: P(A) ≥ 0 for every A
• Axiom 2: P(Ω) = 1
• Axiom 3: If A1, A2, . . . are disjoint then P(∪ᵢ Aᵢ) = Σᵢ P(Aᵢ)
36. Joint and Conditional Probabilities
• Joint Probability
• P(X,Y)
• Probability of X and Y
• Conditional Probability
• P(X|Y)
• Probability of X given Y
37. Independent and Conditional Probabilities
• Assuming that P(B) > 0, the conditional probability of A given B:
• P(A|B)=P(AB)/P(B)
• P(AB) = P(A|B)P(B) = P(B|A)P(A)
• Product Rule
• Two events A and B are independent if
• P(AB) = P(A)P(B)
• Joint = Product of Marginals
• Two events A and B are conditionally independent given C if they are
independent after conditioning on C
• P(AB|C) = P(B|AC)P(A|C) = P(B|C)P(A|C)
38. Example
• 60% of ML students pass the final and 45% of ML students pass both the
final and the midterm *
• What percent of students who passed the final also passed the
midterm?
* These are made up values.
39. Example
• 60% of ML students pass the final and 45% of ML students pass both the
final and the midterm *
• What percent of students who passed the final also passed the
midterm?
• Reworded: What percent of students passed the midterm given they
passed the final?
• P(M|F) = P(M,F) / P(F)
• = .45 / .60
• = .75
* These are made up values.
40. Marginalization and Law of Total Probability
• Marginalization (Sum Rule): P(X) = Σ_Y P(X, Y)
• Law of Total Probability: P(X) = Σ_Y P(X|Y) P(Y)
43. Example
• Suppose you have tested positive for a disease; what is the
probability that you actually have the disease?
• It depends on the accuracy and sensitivity of the test, and on the
background (prior) probability of the disease.
• P(T=1|D=1) = .95 (true positive)
• P(T=1|D=0) = .10 (false positive)
• P(D=1) = .01 (prior)
• P(D=1|T=1) = ?
44. Example
• P(T=1|D=1) = .95 (true positive)
• P(T=1|D=0) = .10 (false positive)
• P(D=1) = .01 (prior)
Law of Total Probability
• P(T=1) = Σ_D P(T=1|D)P(D)
= P(T=1|D=1)P(D=1) + P(T=1|D=0)P(D=0)
= .95 × .01 + .10 × .99
= .1085
Bayes’ Rule
• P(D=1|T=1) = P(T=1|D=1)P(D=1) / P(T=1)
= .95 × .01 / .1085
≈ .087
The probability that you have the disease given you tested positive is 8.7%
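The arithmetic above can be checked with a short script (the test accuracies and prior are the made-up values from the slide):

```python
# Numeric check of the disease-test example using the law of total
# probability and Bayes' rule.
p_t_given_d1 = 0.95   # P(T=1 | D=1), true positive rate
p_t_given_d0 = 0.10   # P(T=1 | D=0), false positive rate
p_d1 = 0.01           # P(D=1), prior probability of disease

# Law of total probability: P(T=1) = P(T=1|D=1)P(D=1) + P(T=1|D=0)P(D=0)
p_t = p_t_given_d1 * p_d1 + p_t_given_d0 * (1 - p_d1)

# Bayes' rule: P(D=1 | T=1) = P(T=1|D=1) P(D=1) / P(T=1)
p_d_given_t = p_t_given_d1 * p_d1 / p_t

print(round(p_t, 4))         # 0.1085
print(round(p_d_given_t, 3)) # 0.088
```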
45. Random Variable
• How do we link sample spaces and events to data?
• A random variable is a mapping that assigns a real number X(ω) to
each outcome ω
• Example: Flip a coin ten times. Let X(ω) be the number of heads in the
sequence ω. If ω = HHTHHTHHTT, then X(ω) = 6.
46. Discrete vs Continuous Random Variables
• Discrete: can only take a countable number of values
• Example: number of heads
• Distribution defined by probability mass function (pmf)
• Marginalization: p(x) = Σ_y p(x, y)
• Continuous: can take uncountably many values (real numbers)
• Example: time taken to accomplish a task
• Distribution defined by probability density function (pdf)
• Marginalization: p(x) = ∫ p(x, y) dy
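For the discrete case, marginalization is just summing a joint probability table along one axis. A minimal sketch with a made-up joint pmf:

```python
import numpy as np

# A small joint pmf P(X, Y) over X ∈ {0,1} (rows) and Y ∈ {0,1,2} (columns).
# The values are made up for illustration; they sum to 1.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

# Marginalization (sum rule): sum out the variable we don't care about.
p_x = joint.sum(axis=1)  # P(X) = Σ_y P(X, y)
p_y = joint.sum(axis=0)  # P(Y) = Σ_x P(x, Y)

print(p_x)  # [0.4 0.6]
print(p_y)  # [0.35 0.35 0.3 ]
```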
47. Probability Distribution Statistics
• Mean: E[X] = μ = first moment = ∫ x p(x) dx (univariate continuous random variable) or Σ_x x p(x) (univariate discrete random variable)
• Variance: Var(X) = E[(X − μ)²] = E[X²] − μ²
• Nth moment = E[X^N]
48. Bernoulli Distribution
• Input: x ∈ {0, 1}
• Parameter: μ = P(x = 1)
• pmf: Bern(x|μ) = μ^x (1 − μ)^(1−x)
• Example: Probability of flipping heads (x=1)
• Mean = E[x] = μ
• Variance = μ(1 − μ)
Discrete Distribution
49. Binomial Distribution
• Input: m = number of successes
• Parameters: N = number of trials
μ = probability of success
• pmf: Bin(m|N, μ) = (N choose m) μ^m (1 − μ)^(N−m)
• Example: Probability of flipping heads m times out of N independent
flips with success probability μ
• Mean = E[m] = Nμ
• Variance = Nμ(1 − μ)
Discrete Distribution
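A quick sketch that builds the binomial pmf from the formula and verifies the mean and variance stated above (N and μ are arbitrary example values):

```python
from math import comb

def binomial_pmf(m, N, mu):
    """P(m successes in N independent trials with success probability mu)."""
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

N, mu = 10, 0.5
pmf = [binomial_pmf(m, N, mu) for m in range(N + 1)]

mean = sum(m * p for m, p in enumerate(pmf))
var = sum((m - mean)**2 * p for m, p in enumerate(pmf))

print(round(mean, 6))  # 5.0  (= N*mu)
print(round(var, 6))   # 2.5  (= N*mu*(1-mu))
```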
50. Multinomial Distribution
• The multinomial distribution is a generalization of the binomial
distribution to k categories instead of just binary (success/fail)
• For n independent trials each of which leads to a success for exactly
one of k categories, the multinomial distribution gives the probability
of any particular combination of numbers of successes for the various
categories
• Example: Rolling a die N times
Discrete Distribution
51. Multinomial Distribution
• Input: m1 … mK (counts, Σ_k mk = N)
• Parameters: N = number of trials
μ = μ1 … μK probability of success for each category, Σ_k μk = 1
• pmf: Mult(m1, …, mK | N, μ) = N! / (m1! ⋯ mK!) ∏_k μk^mk
• Mean of mk: Nµk
• Variance of mk: Nµk(1-µk)
Discrete Distribution
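The die-rolling example can be simulated to check the per-category mean Nμk (the seed and counts below are arbitrary choices):

```python
import numpy as np

# Simulate rolling a fair six-sided die N times, repeated many times,
# and check that the empirical per-category counts match the
# multinomial mean N * mu_k.
rng = np.random.default_rng(0)
N = 600
mu = np.full(6, 1 / 6)

counts = rng.multinomial(N, mu, size=10000)  # 10000 repeated experiments

print(counts.sum(axis=1)[:3])  # every experiment has N = 600 total rolls
print(counts.mean(axis=0))     # each entry ≈ N * mu_k = 100
```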
52. Gaussian Distribution
• Aka the normal distribution
• Widely used model for the distribution of continuous variables
• In the case of a single variable x, the Gaussian distribution can be
written in the form
N(x | μ, σ²) = 1/√(2πσ²) exp(−(x − μ)² / (2σ²))
• where μ is the mean and σ² is the variance
Continuous Distribution
54. Multivariate Gaussian Distribution
• For a D-dimensional vector x, the multivariate Gaussian distribution
takes the form
N(x | μ, Σ) = 1/((2π)^(D/2) |Σ|^(1/2)) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
• where μ is a D-dimensional mean vector
• Σ is a D × D covariance matrix
• |Σ| denotes the determinant of Σ
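A small sketch of the density formula; as a sanity check, in one dimension it should reduce to the univariate Gaussian (the evaluation point and variance are arbitrary):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma) for a D-dim vector x."""
    D = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# In 1-D the formula reduces to the univariate Gaussian.
x, mu, sigma2 = 1.0, 0.0, 4.0
uni = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
multi = mvn_pdf(np.array([x]), np.array([mu]), np.array([[sigma2]]))
print(np.isclose(uni, multi))  # True
```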
56. CS771: Intro to ML
Functions and their optima
• Many ML problems require us to optimize a function 𝑓 of some variable(s) 𝑥, e.g., the objective function of the ML problem we are solving (squared loss for regression)
• For simplicity, assume 𝑓 is a scalar-valued function of a scalar 𝑥 (𝑓: ℝ → ℝ); assume it is unconstrained for now, i.e., 𝑥 is just a real-valued number/vector
• Any function has one or more optima (maxima, minima), and maybe saddle points (will see what these are later)
[Figure: a curve 𝑓(𝑥) with its global maximum, global minimum, and several local maxima/minima marked]
• Usually interested in global optima, but often want to find local optima, too
• For deep learning models, often the local optima are what we can find (and they usually suffice) – more later
57. Derivatives
• Magnitude of the derivative at a point is the rate of change of the function at that point:
d𝑓(𝑥)/d𝑥 = lim_{∆𝑥→0} ∆𝑓(𝑥)/∆𝑥
• Sign is also important: a positive derivative means 𝑓 is increasing at 𝑥 if we increase the value of 𝑥 by a very small amount; a negative derivative means it is decreasing
• The derivative becomes zero at stationary points (optima or saddle points): the function becomes “flat” (∆𝑓(𝑥) ≈ 0 if we change 𝑥 by a very small amount)
• Will sometimes use 𝑓′(𝑥) to denote the derivative
• Understanding how 𝑓 changes its value as we change 𝑥 is helpful for understanding optimization (minimization/maximization) algorithms
[Figure: a curve 𝑓(𝑥) with a small step ∆𝑥 and the corresponding change ∆𝑓(𝑥)]
58. Rules of Derivatives
Some basic rules of taking derivatives:
• Sum Rule: (𝑓(𝑥) + 𝑔(𝑥))′ = 𝑓′(𝑥) + 𝑔′(𝑥)
• Scaling Rule: (𝑎 ⋅ 𝑓(𝑥))′ = 𝑎 ⋅ 𝑓′(𝑥) if 𝑎 is not a function of 𝑥
• Product Rule: (𝑓(𝑥) ⋅ 𝑔(𝑥))′ = 𝑓′(𝑥) ⋅ 𝑔(𝑥) + 𝑔′(𝑥) ⋅ 𝑓(𝑥)
• Quotient Rule: (𝑓(𝑥)/𝑔(𝑥))′ = (𝑓′(𝑥) ⋅ 𝑔(𝑥) − 𝑔′(𝑥) ⋅ 𝑓(𝑥)) / 𝑔(𝑥)²
• Chain Rule: (𝑓(𝑔(𝑥)))′ ≝ (𝑓 ∘ 𝑔)′(𝑥) = 𝑓′(𝑔(𝑥)) ⋅ 𝑔′(𝑥)
• We already used some of these (sum, scaling and chain) when calculating the derivative for the linear regression model
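The product and chain rules can be verified numerically with a central finite difference; the functions and evaluation point below are arbitrary choices for illustration:

```python
import math

# Central finite-difference approximation of a derivative.
def num_deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

f = math.sin          # f(x) = sin x,  f'(x) = cos x
g = lambda x: x ** 3  # g(x) = x^3,    g'(x) = 3x^2
x = 0.7

# Product rule: (f*g)' = f'g + g'f
lhs = num_deriv(lambda t: f(t) * g(t), x)
rhs = math.cos(x) * g(x) + 3 * x ** 2 * f(x)
print(abs(lhs - rhs) < 1e-6)  # True

# Chain rule: (f∘g)'(x) = f'(g(x)) * g'(x)
lhs = num_deriv(lambda t: f(g(t)), x)
rhs = math.cos(g(x)) * 3 * x ** 2
print(abs(lhs - rhs) < 1e-6)  # True
```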
59. Derivatives
• How the derivative itself changes tells us about the function’s optima; the second derivative 𝑓″(𝑥) can provide this information
• 𝑓′(𝑥) = 0 at 𝑥, 𝑓′ > 0 just before 𝑥, 𝑓′ < 0 just after 𝑥: 𝑥 is a maximum
• 𝑓′(𝑥) = 0 at 𝑥, 𝑓′ < 0 just before 𝑥, 𝑓′ > 0 just after 𝑥: 𝑥 is a minimum
• 𝑓′(𝑥) = 0 at 𝑥 and also just before and just after 𝑥: 𝑥 may be a saddle point
• Equivalently, in terms of the second derivative:
• 𝑓′(𝑥) = 0 and 𝑓″(𝑥) < 0: 𝑥 is a maximum
• 𝑓′(𝑥) = 0 and 𝑓″(𝑥) > 0: 𝑥 is a minimum
• 𝑓′(𝑥) = 0 and 𝑓″(𝑥) = 0: 𝑥 may be a saddle point; may need higher derivatives
60. Saddle Points
• Points where the derivative is zero but which are neither minima nor maxima
• A saddle is a point of inflection where the derivative is also zero
• Saddle points are very common for loss functions of deep learning models, and need to be handled carefully during optimization
• Second or higher derivatives may help identify if a stationary point is a saddle
[Figure: a curve with a saddle point marked]
61. Multivariate Functions
• Most functions that we see in ML are multivariate functions
• Example: the loss function 𝐿(𝒘): ℝ^𝐷 → ℝ in linear regression was a multivariate function of the 𝐷-dimensional vector 𝒘
• Here is an illustration of a function of 2 variables (4 maxima and 5 minima), shown as a two-dimensional contour plot of the function (i.e., what it looks like from above)
Plot courtesy: http://benchmarkfcns.xyz/benchmarkfcns/griewankfcn.html
62. Derivatives of Multivariate Functions
• Can define the derivative for a multivariate function as well, via the gradient
• The gradient of a function 𝑓(𝒙): ℝ^𝐷 → ℝ is a 𝐷 × 1 vector of partial derivatives:
∇𝑓(𝒙) = [∂𝑓/∂𝑥₁, ∂𝑓/∂𝑥₂, …, ∂𝑓/∂𝑥_𝐷]ᵀ
• Each element in this gradient vector tells us how much 𝑓 will change if we move a little along the corresponding direction (akin to the one-dim case)
• Optima and saddle points are defined similarly to the one-dim case: the required properties that we saw for the one-dim case must be satisfied along all the directions
• The second derivative in this case is known as the Hessian
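The partial-derivative definition translates directly into a finite-difference gradient; the test function below is an arbitrary example:

```python
import numpy as np

# Finite-difference gradient of a multivariate function f: R^D -> R,
# one central-difference partial derivative per coordinate.
def num_grad(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# f(x) = x1^2 + 3*x1*x2 has gradient [2*x1 + 3*x2, 3*x1].
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
x = np.array([1.0, 2.0])
print(num_grad(f, x))  # ≈ [8. 3.]
```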
63. The Hessian
• For a multivariate scalar-valued function 𝑓(𝒙): ℝ^𝐷 → ℝ, the Hessian is the 𝐷 × 𝐷 matrix ∇²𝑓(𝒙) of second partial derivatives, with (i, j) entry ∂²𝑓/∂𝑥ᵢ∂𝑥ⱼ
• It gives information about the curvature of the function at the point 𝒙
• Note: if the function itself is vector-valued, e.g., 𝑓(𝒙): ℝ^𝐷 → ℝ^𝐾, then we will have 𝐾 such 𝐷 × 𝐷 Hessian matrices, one for each output dimension of 𝑓
• The Hessian matrix can be used to assess the optima/saddle points:
• ∇𝑓(𝒙) = 0 and ∇²𝑓(𝒙) is a positive semi-definite (PSD) matrix: 𝒙 is a minimum
• ∇𝑓(𝒙) = 0 and ∇²𝑓(𝒙) is a negative semi-definite (NSD) matrix: 𝒙 is a maximum
• A square, symmetric 𝐷 × 𝐷 matrix 𝑀 is PSD if 𝒙ᵀ𝑀𝒙 ≥ 0 ∀ 𝒙 ∈ ℝ^𝐷 (equivalently, all eigenvalues are non-negative); it is NSD if 𝒙ᵀ𝑀𝒙 ≤ 0 ∀ 𝒙 ∈ ℝ^𝐷
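The eigenvalue test above can be sketched as a small classifier for stationary points (the two Hessians below come from the standard examples f(x, y) = x² + y² and f(x, y) = x² − y²):

```python
import numpy as np

# Classify a stationary point using Hessian eigenvalues: all non-negative
# -> PSD (minimum), all non-positive -> NSD (maximum), mixed signs -> saddle.
def classify_hessian(H, tol=1e-10):
    eig = np.linalg.eigvalsh(H)  # eigenvalues of a symmetric matrix
    if (eig >= -tol).all():
        return "minimum (PSD)"
    if (eig <= tol).all():
        return "maximum (NSD)"
    return "saddle (indefinite)"

# f(x, y) = x^2 + y^2 has Hessian 2*I at its stationary point (0, 0).
print(classify_hessian(np.array([[2.0, 0.0], [0.0, 2.0]])))   # minimum (PSD)
# f(x, y) = x^2 - y^2 has an indefinite Hessian at (0, 0): a saddle.
print(classify_hessian(np.array([[2.0, 0.0], [0.0, -2.0]])))  # saddle (indefinite)
```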
64. Convex and Non-Convex Functions
• A function being optimized can be either convex or non-convex
• Convex functions are bowl-shaped; they have a unique optimum (minimum)
• The negative of a convex function is called a concave function, which also has a unique optimum (maximum)
• Non-convex functions have multiple minima, and are usually harder to optimize compared to convex functions
• Loss functions of most deep learning models are non-convex
[Figures: a couple of examples each of convex and non-convex functions]
65. Convex Sets
• A set 𝑆 of points is a convex set if, for any two points 𝑥, 𝑦 ∈ 𝑆 and 0 ≤ 𝛼 ≤ 1,
𝑧 = 𝛼𝑥 + (1 − 𝛼)𝑦 ∈ 𝑆
• This means that all points on the line segment between 𝑥 and 𝑦 lie within 𝑆
• 𝑧 is also called a “convex combination” of the two points; can also define a convex combination of 𝑁 points 𝑥₁, 𝑥₂, …, 𝑥_𝑁 as 𝑧 = Σᵢ₌₁^𝑁 𝛼ᵢ𝑥ᵢ (with 𝛼ᵢ ≥ 0 and Σᵢ 𝛼ᵢ = 1)
66. Convex Functions
• Informally, 𝑓(𝑥) is convex if all of its chords lie above the function everywhere
• Formally (assuming a differentiable function), some tests for convexity:
• First-order convexity (the graph of 𝑓 must be above all the tangents): 𝑓(𝑦) ≥ 𝑓(𝑥) + ∇𝑓(𝑥)ᵀ(𝑦 − 𝑥) for all 𝑥, 𝑦
• Second derivative, a.k.a. the Hessian (if it exists), must be positive semi-definite
• Exercise: show that the ridge regression objective is convex
67. Optimization Using First-Order Optimality
• Very simple; we already used this approach for linear and ridge regression
• First-order optimality: the gradient 𝒈 must be equal to zero at the optimum
𝒈 = ∇𝒘 𝐿(𝒘) = 0
• Called “first order” since only the gradient is used, and the gradient provides the first-order information about the function being optimized
• Sometimes, setting 𝒈 = 𝟎 and solving for 𝒘 gives a closed-form solution
• This approach works only for very simple problems where the objective is convex and there are no constraints on the values 𝒘 can take
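For least squares, setting the gradient of 𝐿(𝒘) = ‖𝒚 − 𝑿𝒘‖² to zero gives the normal equations 𝑿ᵀ𝑿𝒘 = 𝑿ᵀ𝒚, which a few lines of NumPy can solve (the data below is synthetic and noise-free, so the true weights should be recovered exactly):

```python
import numpy as np

# First-order optimality for least squares: solve the normal equations
# X^T X w = X^T y obtained by setting the gradient of ||y - Xw||^2 to zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true  # noise-free synthetic data

w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_hat, w_true))  # True
```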
68. Optimization via Gradient Descent
Gradient Descent (iterative, since it requires several steps/iterations to find the optimal solution):
• Initialize 𝒘 as 𝒘⁽⁰⁾
• For iteration 𝑡 = 0, 1, 2, … (or until convergence):
• Calculate the gradient 𝒈⁽ᵗ⁾ using the current iterate 𝒘⁽ᵗ⁾
• Set the learning rate 𝜂ₜ
• Move in the opposite direction of the gradient: 𝒘⁽ᵗ⁺¹⁾ = 𝒘⁽ᵗ⁾ − 𝜂ₜ𝒈⁽ᵗ⁾
• Fact: the gradient gives the direction of steepest change in the function’s value (will see the justification shortly)
• For convex functions, GD will converge to the global minimum; good initialization is needed for non-convex functions
• The learning rate is very important and should be set carefully (fixed or chosen adaptively); will discuss some strategies later
• Sometimes it may be tricky to assess convergence; will see some methods later
• Can we use this approach to solve maximization problems? For maximization problems we can use gradient ascent, 𝒘⁽ᵗ⁺¹⁾ = 𝒘⁽ᵗ⁾ + 𝜂ₜ𝒈⁽ᵗ⁾, which moves in the direction of the gradient
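The update rule can be sketched in a few lines on a simple convex function (the function, learning rate, and iteration count are arbitrary illustrative choices):

```python
# Minimal gradient descent on the convex function f(w) = (w - 3)^2,
# whose gradient is f'(w) = 2*(w - 3) and whose global minimum is w* = 3.
def grad(w):
    return 2 * (w - 3)

w = 0.0       # initialization w^(0)
eta = 0.1     # fixed learning rate
for t in range(100):
    w = w - eta * grad(w)  # w^(t+1) = w^(t) - eta * g^(t)

print(round(w, 6))  # 3.0
```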
69. Gradient Descent: An Illustration
[Figure: GD iterates 𝒘⁽⁰⁾, 𝒘⁽¹⁾, 𝒘⁽²⁾, 𝒘⁽³⁾ moving toward the optimum 𝒘* on a curve 𝐿(𝒘), for two different initializations; one run gets stuck at a local minimum]
• Where the gradient is negative (𝛿𝐿/𝛿𝑤 < 0), we move in the positive direction; where the gradient is positive, we move in the negative direction
• The learning rate is very important
• Good initialization is very important: with a bad initialization, GD can get stuck at a local minimum
70. GD: An Example
• Let’s apply GD to least squares linear regression:
𝒘 = arg min_𝒘 𝐿(𝒘) = arg min_𝒘 Σₙ₌₁^𝑁 (𝑦ₙ − 𝒘ᵀ𝒙ₙ)²
• The gradient: 𝒈 = −Σₙ₌₁^𝑁 2(𝑦ₙ − 𝒘ᵀ𝒙ₙ)𝒙ₙ
• Each GD update will be of the form
𝒘⁽ᵗ⁺¹⁾ = 𝒘⁽ᵗ⁾ + 𝜂ₜ Σₙ₌₁^𝑁 2(𝑦ₙ − 𝒘⁽ᵗ⁾ᵀ𝒙ₙ)𝒙ₙ
• Here 𝑦ₙ − 𝒘⁽ᵗ⁾ᵀ𝒙ₙ is the prediction error of the current model 𝒘⁽ᵗ⁾ on the 𝑛th training example: training examples on which the current model’s error is large contribute more to the update
• Exercise: assume 𝑁 = 1, and show that the GD update improves prediction on the training input (𝒙ₙ, 𝑦ₙ), i.e., 𝑦ₙ is closer to 𝒘⁽ᵗ⁺¹⁾ᵀ𝒙ₙ than to 𝒘⁽ᵗ⁾ᵀ𝒙ₙ
• This is sort of a proof that GD updates are “corrective” in nature
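The update rule for least squares can be sketched directly and compared against the closed-form normal-equations solution (synthetic noise-free data; the learning rate and iteration count are arbitrary choices that happen to converge here):

```python
import numpy as np

# Gradient descent for least squares linear regression, following the
# update w^(t+1) = w^(t) + eta * sum_n 2 (y_n - w^T x_n) x_n.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true  # noise-free synthetic data

w = np.zeros(2)   # initialization w^(0)
eta = 0.002       # fixed learning rate
for t in range(2000):
    g = -2 * X.T @ (y - X @ w)  # g = -sum_n 2 (y_n - w^T x_n) x_n
    w = w - eta * g

w_closed = np.linalg.solve(X.T @ X, X.T @ y)  # closed-form solution
print(np.allclose(w, w_closed, atol=1e-6))  # True
```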