31. Outline
• Motivation
• Probability Definitions and Rules
• Probability Distributions
• MLE for Gaussian Parameter Estimation
• MLE and Least Squares
32. Material
• Pattern Recognition and Machine Learning - Christopher M. Bishop
• All of Statistics – Larry Wasserman
• Wolfram MathWorld
• Wikipedia
33. Motivation
• Uncertainty arises through:
• Noisy measurements
• Finite size of data sets
• Ambiguity: The word bank can mean (1) a financial institution, (2) the side of a river,
or (3) tilting an airplane. Which meaning was intended, based on the words that
appear nearby?
• Limited Model Complexity
• Probability theory provides a consistent framework for the quantification
and manipulation of uncertainty
• Allows us to make optimal predictions given all the information available to
us, even though that information may be incomplete or ambiguous
34. Sample Space
• The sample space Ω is the set of possible outcomes of an experiment.
Points ω in Ω are called sample outcomes, realizations, or elements.
Subsets of Ω are called Events.
• Example. If we toss a coin twice then Ω = {HH,HT, TH, TT}. The event
that the first toss is heads is A = {HH,HT}
• We say that events A1 and A2 are disjoint (mutually exclusive) if A1 ∩ A2 = ∅
• Example: first flip being heads and first flip being tails
35. Probability
• We will assign a real number P(A) to every event A, called the
probability of A.
• To qualify as a probability, P must satisfy three axioms:
• Axiom 1: P(A) ≥ 0 for every A
• Axiom 2: P(Ω) = 1
• Axiom 3: If A1, A2, . . . are disjoint then P(∪ᵢ Aᵢ) = Σᵢ P(Aᵢ)
36. Joint and Conditional Probabilities
• Joint Probability
• P(X,Y)
• Probability of X and Y
• Conditional Probability
• P(X|Y)
• Probability of X given Y
37. Independent and Conditional Probabilities
• Assuming that P(B) > 0, the conditional probability of A given B:
• P(A|B)=P(AB)/P(B)
• P(AB) = P(A|B)P(B) = P(B|A)P(A)
• Product Rule
• Two events A and B are independent if
• P(AB) = P(A)P(B)
• Joint = Product of Marginals
• Two events A and B are conditionally independent given C if they are
independent after conditioning on C
• P(AB|C) = P(B|AC)P(A|C) = P(B|C)P(A|C)
38. Example
• 60% of ML students pass the final and 45% of ML students pass both the
final and the midterm *
• What percent of students who passed the final also passed the
midterm?
* These are made up values.
39. Example
• 60% of ML students pass the final and 45% of ML students pass both the
final and the midterm *
• What percent of students who passed the final also passed the
midterm?
• Reworded: What percent of students passed the midterm given they
passed the final?
• P(M|F) = P(M,F) / P(F)
• = .45 / .60
• = .75
* These are made up values.
40. Marginalization and Law of Total Probability
• Marginalization (Sum Rule): P(X) = Σ_Y P(X, Y)
• Law of Total Probability: P(X) = Σ_Y P(X|Y) P(Y)
43. Example
• Suppose you have tested positive for a disease; what is the
probability that you actually have the disease?
• It depends on the accuracy and sensitivity of the test, and on the
background (prior) probability of the disease.
• P(T=1|D=1) = .95 (true positive)
• P(T=1|D=0) = .10 (false positive)
• P(D=1) = .01 (prior)
• P(D=1|T=1) = ?
44. Example
• P(T=1|D=1) = .95 (true positive)
• P(T=1|D=0) = .10 (false positive)
• P(D=1) = .01 (prior)
Law of Total Probability
• P(T=1) = Σ_D P(T=1|D)P(D)
= P(T=1|D=1)P(D=1) + P(T=1|D=0)P(D=0)
= .95 × .01 + .10 × .99
= .1085
Bayes’ Rule
• P(D=1|T=1) = P(T=1|D=1)P(D=1) / P(T=1)
= .95 × .01 / .1085
≈ .087
The probability that you have the disease given you tested positive is 8.7%
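The arithmetic above can be checked with a short script (the test accuracies and prior are the made-up values from the slide):

```python
# Numeric check of the disease-test example using the law of total
# probability and Bayes' rule.
p_t_given_d1 = 0.95   # P(T=1 | D=1), true positive rate
p_t_given_d0 = 0.10   # P(T=1 | D=0), false positive rate
p_d1 = 0.01           # P(D=1), prior probability of disease

# Law of total probability: P(T=1) = P(T=1|D=1)P(D=1) + P(T=1|D=0)P(D=0)
p_t = p_t_given_d1 * p_d1 + p_t_given_d0 * (1 - p_d1)

# Bayes' rule: P(D=1 | T=1) = P(T=1|D=1) P(D=1) / P(T=1)
p_d_given_t = p_t_given_d1 * p_d1 / p_t

print(round(p_t, 4))         # 0.1085
print(round(p_d_given_t, 3)) # 0.088
```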
45. Random Variable
• How do we link sample spaces and events to data?
• A random variable is a mapping that assigns a real number X(ω) to
each outcome ω
• Example: Flip a coin ten times. Let X(ω) be the number of heads in the
sequence ω. If ω = HHTHHTHHTT, then X(ω) = 6.
46. Discrete vs Continuous Random Variables
• Discrete: can only take a countable number of values
• Example: number of heads
• Distribution defined by probability mass function (pmf)
• Marginalization: p(x) = Σ_y p(x, y)
• Continuous: can take uncountably many values (real numbers)
• Example: time taken to accomplish a task
• Distribution defined by probability density function (pdf)
• Marginalization: p(x) = ∫ p(x, y) dy
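For the discrete case, marginalization is just summing a joint probability table along one axis. A minimal sketch with a made-up joint pmf:

```python
import numpy as np

# A small joint pmf P(X, Y) over X ∈ {0,1} (rows) and Y ∈ {0,1,2} (columns).
# The values are made up for illustration; they sum to 1.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

# Marginalization (sum rule): sum out the variable we don't care about.
p_x = joint.sum(axis=1)  # P(X) = Σ_y P(X, y)
p_y = joint.sum(axis=0)  # P(Y) = Σ_x P(x, Y)

print(p_x)  # [0.4 0.6]
print(p_y)  # [0.35 0.35 0.3 ]
```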
47. Probability Distribution Statistics
• Mean: E[X] = μ = first moment = ∫ x p(x) dx (univariate continuous random variable) or Σ_x x p(x) (univariate discrete random variable)
• Variance: Var(X) = E[(X − μ)²] = E[X²] − μ²
• Nth moment = E[X^N]
48. Bernoulli Distribution
• Input: x ∈ {0, 1}
• Parameter: μ = P(x = 1)
• pmf: Bern(x|μ) = μ^x (1 − μ)^(1−x)
• Example: Probability of flipping heads (x=1)
• Mean = E[x] = μ
• Variance = μ(1 − μ)
Discrete Distribution
49. Binomial Distribution
• Input: m = number of successes
• Parameters: N = number of trials
μ = probability of success
• pmf: Bin(m|N, μ) = (N choose m) μ^m (1 − μ)^(N−m)
• Example: Probability of flipping heads m times out of N independent
flips with success probability μ
• Mean = E[m] = Nμ
• Variance = Nμ(1 − μ)
Discrete Distribution
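A quick sketch that builds the binomial pmf from the formula and verifies the mean and variance stated above (N and μ are arbitrary example values):

```python
from math import comb

def binomial_pmf(m, N, mu):
    """P(m successes in N independent trials with success probability mu)."""
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

N, mu = 10, 0.5
pmf = [binomial_pmf(m, N, mu) for m in range(N + 1)]

mean = sum(m * p for m, p in enumerate(pmf))
var = sum((m - mean)**2 * p for m, p in enumerate(pmf))

print(round(mean, 6))  # 5.0  (= N*mu)
print(round(var, 6))   # 2.5  (= N*mu*(1-mu))
```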
50. Multinomial Distribution
• The multinomial distribution is a generalization of the binomial
distribution to k categories instead of just binary (success/fail)
• For n independent trials each of which leads to a success for exactly
one of k categories, the multinomial distribution gives the probability
of any particular combination of numbers of successes for the various
categories
• Example: Rolling a die N times
Discrete Distribution
51. Multinomial Distribution
• Input: m1 … mK (counts, Σ_k mk = N)
• Parameters: N = number of trials
μ = μ1 … μK probability of success for each category, Σ_k μk = 1
• pmf: Mult(m1, …, mK | N, μ) = N! / (m1! ⋯ mK!) ∏_k μk^mk
• Mean of mk: Nµk
• Variance of mk: Nµk(1-µk)
Discrete Distribution
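The die-rolling example can be simulated to check the per-category mean Nμk (the seed and counts below are arbitrary choices):

```python
import numpy as np

# Simulate rolling a fair six-sided die N times, repeated many times,
# and check that the empirical per-category counts match the
# multinomial mean N * mu_k.
rng = np.random.default_rng(0)
N = 600
mu = np.full(6, 1 / 6)

counts = rng.multinomial(N, mu, size=10000)  # 10000 repeated experiments

print(counts.sum(axis=1)[:3])  # every experiment has N = 600 total rolls
print(counts.mean(axis=0))     # each entry ≈ N * mu_k = 100
```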
52. Gaussian Distribution
• Aka the normal distribution
• Widely used model for the distribution of continuous variables
• In the case of a single variable x, the Gaussian distribution can be
written in the form
N(x | μ, σ²) = 1/√(2πσ²) exp(−(x − μ)² / (2σ²))
• where μ is the mean and σ² is the variance
Continuous Distribution
54. Multivariate Gaussian Distribution
• For a D-dimensional vector x, the multivariate Gaussian distribution
takes the form
N(x | μ, Σ) = 1/((2π)^(D/2) |Σ|^(1/2)) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
• where μ is a D-dimensional mean vector
• Σ is a D × D covariance matrix
• |Σ| denotes the determinant of Σ
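A small sketch of the density formula; as a sanity check, in one dimension it should reduce to the univariate Gaussian (the evaluation point and variance are arbitrary):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma) for a D-dim vector x."""
    D = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# In 1-D the formula reduces to the univariate Gaussian.
x, mu, sigma2 = 1.0, 0.0, 4.0
uni = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
multi = mvn_pdf(np.array([x]), np.array([mu]), np.array([[sigma2]]))
print(np.isclose(uni, multi))  # True
```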
56. CS771: Intro to ML
Functions and their optima
• Many ML problems require us to optimize a function 𝑓 of some variable(s) 𝑥, e.g., the objective function of the ML problem we are solving (squared loss for regression)
• For simplicity, assume 𝑓 is a scalar-valued function of a scalar 𝑥 (𝑓: ℝ → ℝ); assume it is unconstrained for now, i.e., 𝑥 is just a real-valued number/vector
• Any function has one or more optima (maxima, minima), and maybe saddle points (will see what these are later)
[Figure: a curve 𝑓(𝑥) with its global maximum, global minimum, and several local maxima/minima marked]
• Usually interested in global optima, but often want to find local optima, too
• For deep learning models, often the local optima are what we can find (and they usually suffice) – more later
57. Derivatives
• Magnitude of the derivative at a point is the rate of change of the function at that point:
d𝑓(𝑥)/d𝑥 = lim_{∆𝑥→0} ∆𝑓(𝑥)/∆𝑥
• Sign is also important: a positive derivative means 𝑓 is increasing at 𝑥 if we increase the value of 𝑥 by a very small amount; a negative derivative means it is decreasing
• The derivative becomes zero at stationary points (optima or saddle points): the function becomes “flat” (∆𝑓(𝑥) ≈ 0 if we change 𝑥 by a very small amount)
• Will sometimes use 𝑓′(𝑥) to denote the derivative
• Understanding how 𝑓 changes its value as we change 𝑥 is helpful for understanding optimization (minimization/maximization) algorithms
[Figure: a curve 𝑓(𝑥) with a small step ∆𝑥 and the corresponding change ∆𝑓(𝑥)]
58. Rules of Derivatives
Some basic rules of taking derivatives:
• Sum Rule: (𝑓(𝑥) + 𝑔(𝑥))′ = 𝑓′(𝑥) + 𝑔′(𝑥)
• Scaling Rule: (𝑎 ⋅ 𝑓(𝑥))′ = 𝑎 ⋅ 𝑓′(𝑥) if 𝑎 is not a function of 𝑥
• Product Rule: (𝑓(𝑥) ⋅ 𝑔(𝑥))′ = 𝑓′(𝑥) ⋅ 𝑔(𝑥) + 𝑔′(𝑥) ⋅ 𝑓(𝑥)
• Quotient Rule: (𝑓(𝑥)/𝑔(𝑥))′ = (𝑓′(𝑥) ⋅ 𝑔(𝑥) − 𝑔′(𝑥) ⋅ 𝑓(𝑥)) / 𝑔(𝑥)²
• Chain Rule: (𝑓(𝑔(𝑥)))′ ≝ (𝑓 ∘ 𝑔)′(𝑥) = 𝑓′(𝑔(𝑥)) ⋅ 𝑔′(𝑥)
• We already used some of these (sum, scaling and chain) when calculating the derivative for the linear regression model
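The product and chain rules can be verified numerically with a central finite difference; the functions and evaluation point below are arbitrary choices for illustration:

```python
import math

# Central finite-difference approximation of a derivative.
def num_deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

f = math.sin          # f(x) = sin x,  f'(x) = cos x
g = lambda x: x ** 3  # g(x) = x^3,    g'(x) = 3x^2
x = 0.7

# Product rule: (f*g)' = f'g + g'f
lhs = num_deriv(lambda t: f(t) * g(t), x)
rhs = math.cos(x) * g(x) + 3 * x ** 2 * f(x)
print(abs(lhs - rhs) < 1e-6)  # True

# Chain rule: (f∘g)'(x) = f'(g(x)) * g'(x)
lhs = num_deriv(lambda t: f(g(t)), x)
rhs = math.cos(g(x)) * 3 * x ** 2
print(abs(lhs - rhs) < 1e-6)  # True
```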
59. Derivatives
• How the derivative itself changes tells us about the function’s optima; the second derivative 𝑓″(𝑥) can provide this information
• 𝑓′(𝑥) = 0 at 𝑥, 𝑓′ > 0 just before 𝑥, 𝑓′ < 0 just after 𝑥: 𝑥 is a maximum
• 𝑓′(𝑥) = 0 at 𝑥, 𝑓′ < 0 just before 𝑥, 𝑓′ > 0 just after 𝑥: 𝑥 is a minimum
• 𝑓′(𝑥) = 0 at 𝑥 and also just before and just after 𝑥: 𝑥 may be a saddle point
• Equivalently, in terms of the second derivative:
• 𝑓′(𝑥) = 0 and 𝑓″(𝑥) < 0: 𝑥 is a maximum
• 𝑓′(𝑥) = 0 and 𝑓″(𝑥) > 0: 𝑥 is a minimum
• 𝑓′(𝑥) = 0 and 𝑓″(𝑥) = 0: 𝑥 may be a saddle point; may need higher derivatives
60. Saddle Points
• Points where the derivative is zero but which are neither minima nor maxima
• A saddle is a point of inflection where the derivative is also zero
• Saddle points are very common for loss functions of deep learning models, and need to be handled carefully during optimization
• Second or higher derivatives may help identify if a stationary point is a saddle
[Figure: a curve with a saddle point marked]
61. Multivariate Functions
• Most functions that we see in ML are multivariate functions
• Example: the loss function 𝐿(𝒘): ℝ^𝐷 → ℝ in linear regression was a multivariate function of the 𝐷-dimensional vector 𝒘
• Here is an illustration of a function of 2 variables (4 maxima and 5 minima), shown as a two-dimensional contour plot of the function (i.e., what it looks like from above)
Plot courtesy: http://benchmarkfcns.xyz/benchmarkfcns/griewankfcn.html
62. Derivatives of Multivariate Functions
• Can define the derivative for a multivariate function as well, via the gradient
• The gradient of a function 𝑓(𝒙): ℝ^𝐷 → ℝ is a 𝐷 × 1 vector of partial derivatives:
∇𝑓(𝒙) = [∂𝑓/∂𝑥₁, ∂𝑓/∂𝑥₂, …, ∂𝑓/∂𝑥_𝐷]ᵀ
• Each element in this gradient vector tells us how much 𝑓 will change if we move a little along the corresponding direction (akin to the one-dim case)
• Optima and saddle points are defined similarly to the one-dim case: the required properties that we saw for the one-dim case must be satisfied along all the directions
• The second derivative in this case is known as the Hessian
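The partial-derivative definition translates directly into a finite-difference gradient; the test function below is an arbitrary example:

```python
import numpy as np

# Finite-difference gradient of a multivariate function f: R^D -> R,
# one central-difference partial derivative per coordinate.
def num_grad(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# f(x) = x1^2 + 3*x1*x2 has gradient [2*x1 + 3*x2, 3*x1].
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
x = np.array([1.0, 2.0])
print(num_grad(f, x))  # ≈ [8. 3.]
```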
63. The Hessian
• For a multivariate scalar-valued function 𝑓(𝒙): ℝ^𝐷 → ℝ, the Hessian is the 𝐷 × 𝐷 matrix ∇²𝑓(𝒙) of second partial derivatives, with (i, j) entry ∂²𝑓/∂𝑥ᵢ∂𝑥ⱼ
• It gives information about the curvature of the function at the point 𝒙
• Note: if the function itself is vector-valued, e.g., 𝑓(𝒙): ℝ^𝐷 → ℝ^𝐾, then we will have 𝐾 such 𝐷 × 𝐷 Hessian matrices, one for each output dimension of 𝑓
• The Hessian matrix can be used to assess the optima/saddle points:
• ∇𝑓(𝒙) = 0 and ∇²𝑓(𝒙) is a positive semi-definite (PSD) matrix: 𝒙 is a minimum
• ∇𝑓(𝒙) = 0 and ∇²𝑓(𝒙) is a negative semi-definite (NSD) matrix: 𝒙 is a maximum
• A square, symmetric 𝐷 × 𝐷 matrix 𝑀 is PSD if 𝒙ᵀ𝑀𝒙 ≥ 0 ∀ 𝒙 ∈ ℝ^𝐷 (equivalently, all eigenvalues are non-negative); it is NSD if 𝒙ᵀ𝑀𝒙 ≤ 0 ∀ 𝒙 ∈ ℝ^𝐷
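The eigenvalue test above can be sketched as a small classifier for stationary points (the two Hessians below come from the standard examples f(x, y) = x² + y² and f(x, y) = x² − y²):

```python
import numpy as np

# Classify a stationary point using Hessian eigenvalues: all non-negative
# -> PSD (minimum), all non-positive -> NSD (maximum), mixed signs -> saddle.
def classify_hessian(H, tol=1e-10):
    eig = np.linalg.eigvalsh(H)  # eigenvalues of a symmetric matrix
    if (eig >= -tol).all():
        return "minimum (PSD)"
    if (eig <= tol).all():
        return "maximum (NSD)"
    return "saddle (indefinite)"

# f(x, y) = x^2 + y^2 has Hessian 2*I at its stationary point (0, 0).
print(classify_hessian(np.array([[2.0, 0.0], [0.0, 2.0]])))   # minimum (PSD)
# f(x, y) = x^2 - y^2 has an indefinite Hessian at (0, 0): a saddle.
print(classify_hessian(np.array([[2.0, 0.0], [0.0, -2.0]])))  # saddle (indefinite)
```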
64. Convex and Non-Convex Functions
• A function being optimized can be either convex or non-convex
• Convex functions are bowl-shaped; they have a unique optimum (minimum)
• The negative of a convex function is called a concave function, which also has a unique optimum (maximum)
• Non-convex functions have multiple minima, and are usually harder to optimize compared to convex functions
• Loss functions of most deep learning models are non-convex
[Figures: a couple of examples each of convex and non-convex functions]
65. Convex Sets
• A set 𝑆 of points is a convex set if, for any two points 𝑥, 𝑦 ∈ 𝑆 and 0 ≤ 𝛼 ≤ 1,
𝑧 = 𝛼𝑥 + (1 − 𝛼)𝑦 ∈ 𝑆
• This means that all points on the line segment between 𝑥 and 𝑦 lie within 𝑆
• 𝑧 is also called a “convex combination” of the two points; can also define a convex combination of 𝑁 points 𝑥₁, 𝑥₂, …, 𝑥_𝑁 as 𝑧 = Σᵢ₌₁^𝑁 𝛼ᵢ𝑥ᵢ (with 𝛼ᵢ ≥ 0 and Σᵢ 𝛼ᵢ = 1)
66. Convex Functions
• Informally, 𝑓(𝑥) is convex if all of its chords lie above the function everywhere
• Formally (assuming a differentiable function), some tests for convexity:
• First-order convexity (the graph of 𝑓 must be above all the tangents): 𝑓(𝑦) ≥ 𝑓(𝑥) + ∇𝑓(𝑥)ᵀ(𝑦 − 𝑥) for all 𝑥, 𝑦
• Second derivative, a.k.a. the Hessian (if it exists), must be positive semi-definite
• Exercise: show that the ridge regression objective is convex
67. Optimization Using First-Order Optimality
• Very simple; we already used this approach for linear and ridge regression
• First-order optimality: the gradient 𝒈 must be equal to zero at the optimum
𝒈 = ∇𝒘 𝐿(𝒘) = 0
• Called “first order” since only the gradient is used, and the gradient provides the first-order information about the function being optimized
• Sometimes, setting 𝒈 = 𝟎 and solving for 𝒘 gives a closed-form solution
• This approach works only for very simple problems where the objective is convex and there are no constraints on the values 𝒘 can take
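For least squares, setting the gradient of 𝐿(𝒘) = ‖𝒚 − 𝑿𝒘‖² to zero gives the normal equations 𝑿ᵀ𝑿𝒘 = 𝑿ᵀ𝒚, which a few lines of NumPy can solve (the data below is synthetic and noise-free, so the true weights should be recovered exactly):

```python
import numpy as np

# First-order optimality for least squares: solve the normal equations
# X^T X w = X^T y obtained by setting the gradient of ||y - Xw||^2 to zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true  # noise-free synthetic data

w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_hat, w_true))  # True
```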
68. Optimization via Gradient Descent
Gradient Descent (iterative, since it requires several steps/iterations to find the optimal solution):
• Initialize 𝒘 as 𝒘⁽⁰⁾
• For iteration 𝑡 = 0, 1, 2, … (or until convergence):
• Calculate the gradient 𝒈⁽ᵗ⁾ using the current iterate 𝒘⁽ᵗ⁾
• Set the learning rate 𝜂ₜ
• Move in the opposite direction of the gradient: 𝒘⁽ᵗ⁺¹⁾ = 𝒘⁽ᵗ⁾ − 𝜂ₜ𝒈⁽ᵗ⁾
• Fact: the gradient gives the direction of steepest change in the function’s value (will see the justification shortly)
• For convex functions, GD will converge to the global minimum; good initialization is needed for non-convex functions
• The learning rate is very important and should be set carefully (fixed or chosen adaptively); will discuss some strategies later
• Sometimes it may be tricky to assess convergence; will see some methods later
• Can we use this approach to solve maximization problems? For maximization problems we can use gradient ascent, 𝒘⁽ᵗ⁺¹⁾ = 𝒘⁽ᵗ⁾ + 𝜂ₜ𝒈⁽ᵗ⁾, which moves in the direction of the gradient
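The update rule can be sketched in a few lines on a simple convex function (the function, learning rate, and iteration count are arbitrary illustrative choices):

```python
# Minimal gradient descent on the convex function f(w) = (w - 3)^2,
# whose gradient is f'(w) = 2*(w - 3) and whose global minimum is w* = 3.
def grad(w):
    return 2 * (w - 3)

w = 0.0       # initialization w^(0)
eta = 0.1     # fixed learning rate
for t in range(100):
    w = w - eta * grad(w)  # w^(t+1) = w^(t) - eta * g^(t)

print(round(w, 6))  # 3.0
```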
69. Gradient Descent: An Illustration
[Figure: GD iterates 𝒘⁽⁰⁾, 𝒘⁽¹⁾, 𝒘⁽²⁾, 𝒘⁽³⁾ moving toward the optimum 𝒘* on a curve 𝐿(𝒘), for two different initializations; one run gets stuck at a local minimum]
• Where the gradient is negative (𝛿𝐿/𝛿𝑤 < 0), we move in the positive direction; where the gradient is positive, we move in the negative direction
• The learning rate is very important
• Good initialization is very important: with a bad initialization, GD can get stuck at a local minimum
70. GD: An Example
• Let’s apply GD to least squares linear regression:
𝒘 = arg min_𝒘 𝐿(𝒘) = arg min_𝒘 Σₙ₌₁^𝑁 (𝑦ₙ − 𝒘ᵀ𝒙ₙ)²
• The gradient: 𝒈 = −Σₙ₌₁^𝑁 2(𝑦ₙ − 𝒘ᵀ𝒙ₙ)𝒙ₙ
• Each GD update will be of the form
𝒘⁽ᵗ⁺¹⁾ = 𝒘⁽ᵗ⁾ + 𝜂ₜ Σₙ₌₁^𝑁 2(𝑦ₙ − 𝒘⁽ᵗ⁾ᵀ𝒙ₙ)𝒙ₙ
• Here 𝑦ₙ − 𝒘⁽ᵗ⁾ᵀ𝒙ₙ is the prediction error of the current model 𝒘⁽ᵗ⁾ on the 𝑛th training example: training examples on which the current model’s error is large contribute more to the update
• Exercise: assume 𝑁 = 1, and show that the GD update improves prediction on the training input (𝒙ₙ, 𝑦ₙ), i.e., 𝑦ₙ is closer to 𝒘⁽ᵗ⁺¹⁾ᵀ𝒙ₙ than to 𝒘⁽ᵗ⁾ᵀ𝒙ₙ
• This is sort of a proof that GD updates are “corrective” in nature
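The update rule for least squares can be sketched directly and compared against the closed-form normal-equations solution (synthetic noise-free data; the learning rate and iteration count are arbitrary choices that happen to converge here):

```python
import numpy as np

# Gradient descent for least squares linear regression, following the
# update w^(t+1) = w^(t) + eta * sum_n 2 (y_n - w^T x_n) x_n.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true  # noise-free synthetic data

w = np.zeros(2)   # initialization w^(0)
eta = 0.002       # fixed learning rate
for t in range(2000):
    g = -2 * X.T @ (y - X @ w)  # g = -sum_n 2 (y_n - w^T x_n) x_n
    w = w - eta * g

w_closed = np.linalg.solve(X.T @ X, X.T @ y)  # closed-form solution
print(np.allclose(w, w_closed, atol=1e-6))  # True
```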