Probability Theory for
Machine Learning
Chris Cremer
September 2015
Outline
• Motivation
• Probability Definitions and Rules
• Probability Distributions
• MLE for Gaussian Parameter Estimation
• MLE and Least Squares
Material
• Pattern Recognition and Machine Learning - Christopher M. Bishop
• All of Statistics – Larry Wasserman
• Wolfram MathWorld
• Wikipedia
Motivation
• Uncertainty arises through:
• Noisy measurements
• Finite size of data sets
• Ambiguity: The word bank can mean (1) a financial institution, (2) the side of a river,
or (3) tilting an airplane. Which meaning was intended, based on the words that
appear nearby?
• Limited Model Complexity
• Probability theory provides a consistent framework for the quantification
and manipulation of uncertainty
• Allows us to make optimal predictions given all the information available to
us, even though that information may be incomplete or ambiguous
Sample Space
• The sample space Ω is the set of possible outcomes of an experiment.
Points ω in Ω are called sample outcomes, realizations, or elements.
Subsets of Ω are called Events.
• Example. If we toss a coin twice then Ω = {HH,HT, TH, TT}. The event
that the first toss is heads is A = {HH,HT}
• We say that events A1 and A2 are disjoint (mutually exclusive) if A1 ∩ A2 = ∅
• More generally, events A1, A2, … are disjoint if Ai ∩ Aj = ∅ for all i ≠ j
• Example: first flip being heads and first flip being tails
Probability
• We will assign a real number P(A) to every event A, called the
probability of A.
• To qualify as a probability, P must satisfy three axioms:
• Axiom 1: P(A) ≥ 0 for every A
• Axiom 2: P(Ω) = 1
• Axiom 3: If A1, A2, . . . are disjoint then P(⋃i Ai) = Σi P(Ai)
Joint and Conditional Probabilities
• Joint Probability
• P(X,Y)
• Probability of X and Y
• Conditional Probability
• P(X|Y)
• Probability of X given Y
Conditional Probability and Independence
• Assuming that P(B) > 0, the conditional probability of A given B:
• P(A|B)=P(AB)/P(B)
• P(AB) = P(A|B)P(B) = P(B|A)P(A)
• Product Rule
• Two events A and B are independent if
• P(AB) = P(A)P(B)
• Joint = Product of Marginals
• Two events A and B are conditionally independent given C if they are
independent after conditioning on C
• P(AB|C) = P(B|AC)P(A|C) = P(B|C)P(A|C)
Example
• 60% of ML students pass the final and 45% of ML students pass both the
final and the midterm *
• What percent of students who passed the final also passed the
midterm?
* These are made up values.
Example
• 60% of ML students pass the final and 45% of ML students pass both the
final and the midterm *
• What percent of students who passed the final also passed the
midterm?
• Reworded: What percent of students passed the midterm given they
passed the final?
• P(M|F) = P(M,F) / P(F)
• = .45 / .60
• = .75
* These are made up values.
Marginalization and Law of Total Probability
• Marginalization (Sum Rule): P(X) = ΣY P(X, Y)
• Law of Total Probability: P(X) = ΣY P(X | Y) P(Y)
Bayes’ Rule
P(A|B) = P(AB) /P(B) (Conditional Probability)
P(A|B) = P(B|A)P(A) /P(B) (Product Rule)
P(A|B) = P(B|A)P(A) / ΣA′ P(B|A′)P(A′) (Law of Total Probability)
Example
• Suppose you have tested positive for a disease; what is the
probability that you actually have the disease?
• It depends on the accuracy and sensitivity of the test, and on the
background (prior) probability of the disease.
• P(T=1|D=1) = .95 (true positive)
• P(T=1|D=0) = .10 (false positive)
• P(D=1) = .01 (prior)
• P(D=1|T=1) = ?
Example
• P(T=1|D=1) = .95 (true positive)
• P(T=1|D=0) = .10 (false positive)
• P(D=1) = .01 (prior)
Bayes’ Rule
• P(D|T) = P(T|D)P(D) / P(T)
= .95 * .01 / .1085
= .087
Law of Total Probability
• P(T) = Σ P(T|D)P(D)
= P(T|D=1)P(D=1) + P(T|D=0)P(D=0)
= .95*.01 + .1*.99
= .1085
The probability that you have the disease given you tested positive is 8.7%
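This calculation is easy to sanity-check in code. A minimal Python sketch (variable names are mine, not from the slides):

```python
# Bayes' rule for the disease-test example above.
p_t_given_d = 0.95      # P(T=1 | D=1), true positive rate
p_t_given_not_d = 0.10  # P(T=1 | D=0), false positive rate
p_d = 0.01              # P(D=1), prior probability of disease

# Law of total probability: P(T=1) = P(T=1|D=1)P(D=1) + P(T=1|D=0)P(D=0)
p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)

# Bayes' rule: P(D=1 | T=1) = P(T=1 | D=1) P(D=1) / P(T=1)
p_d_given_t = p_t_given_d * p_d / p_t
print(p_t, p_d_given_t)  # 0.1085, ~0.0876
```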
Random Variable
• How do we link sample spaces and events to data?
• A random variable is a mapping that assigns a real number X(ω) to
each outcome ω
• Example: Flip a coin ten times. Let X(ω) be the number of heads in the
sequence ω. If ω = HHTHHTHHTT, then X(ω) = 6.
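For the coin example, this mapping is just counting; a one-line Python sketch (the function name is mine):

```python
# X(omega): the number of heads in a sequence of coin flips.
def X(omega: str) -> int:
    return omega.count("H")

print(X("HHTHHTHHTT"))  # 6
```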
Discrete vs Continuous Random Variables
• Discrete: can only take a countable number of values
• Example: number of heads
• Distribution defined by probability mass function (pmf)
• Marginalization: P(x) = Σy P(x, y)
• Continuous: can take uncountably many values (real numbers)
• Example: time taken to accomplish a task
• Distribution defined by probability density function (pdf)
• Marginalization: p(x) = ∫ p(x, y) dy
Probability Distribution Statistics
• Mean: E[x] = μ = first moment = ∫ x p(x) dx for a univariate continuous random variable, or Σx x P(x) for a univariate discrete random variable
• Variance: Var(X) = E[(X − μ)²] = E[X²] − E[X]²
• Nth moment = ∫ x^N p(x) dx (continuous) or Σx x^N P(x) (discrete)
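To make the definitions concrete, a small Python sketch computing the first two moments of a discrete distribution from its pmf (the pmf values are made up for illustration):

```python
# Moments of a univariate discrete random variable from its pmf.
xs = [0, 1, 2, 3]
pmf = [0.1, 0.4, 0.3, 0.2]  # illustrative values; must sum to 1

mean = sum(x * p for x, p in zip(xs, pmf))             # E[x] = sum_x x P(x)
second_moment = sum(x**2 * p for x, p in zip(xs, pmf))
var = second_moment - mean**2                          # Var(X) = E[X^2] - E[X]^2
print(mean, var)  # 1.6, 0.84
```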
Bernoulli Distribution
• Input: x ∈ {0, 1}
• Parameter: μ = P(x = 1)
• pmf: P(x | μ) = μ^x (1 − μ)^(1−x)
• Example: Probability of flipping heads (x = 1)
• Mean = E[x] = μ
• Variance = μ(1 − μ)
Discrete Distribution
Binomial Distribution
• Input: m = number of successes
• Parameters: N = number of trials, μ = probability of success
• pmf: P(m | N, μ) = (N choose m) μ^m (1 − μ)^(N−m)
• Example: Probability of flipping heads m times out of N independent
flips with success probability μ
• Mean = E[m] = Nμ
• Variance = Var(m) = Nμ(1 − μ)
Discrete Distribution
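A minimal sketch of the binomial pmf and its moments, using only the Python standard library (parameter values chosen for illustration):

```python
from math import comb

def binomial_pmf(m: int, N: int, mu: float) -> float:
    # P(m | N, mu) = (N choose m) * mu^m * (1 - mu)^(N - m)
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

N, mu = 10, 0.5
print(binomial_pmf(6, N, mu))     # P(6 heads in 10 fair flips) ~ 0.205
print(N * mu, N * mu * (1 - mu))  # mean N*mu = 5.0, variance N*mu*(1-mu) = 2.5
```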
Multinomial Distribution
• The multinomial distribution is a generalization of the binomial
distribution to k categories instead of just binary (success/fail)
• For n independent trials each of which leads to a success for exactly
one of k categories, the multinomial distribution gives the probability
of any particular combination of numbers of successes for the various
categories
• Example: Rolling a die N times
Discrete Distribution
Multinomial Distribution
• Input: m1 … mK (counts, with Σk mk = N)
• Parameters: N = number of trials; μ = (μ1 … μK), the probability of success for each category, with Σk μk = 1
• pmf: P(m1, …, mK | N, μ) = N! / (m1! ⋯ mK!) ∏k μk^mk
• Mean of mk: Nμk
• Variance of mk: Nμk(1 − μk)
Discrete Distribution
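A sketch of this pmf applied to the fair-die example (the helper name is mine; assumes Python 3.8+ for math.prod):

```python
from math import factorial, prod

def multinomial_pmf(counts, mus):
    # P(m1..mK | N, mu) = N! / (m1! * ... * mK!) * prod_k mu_k^{m_k}
    N = sum(counts)
    coeff = factorial(N)
    for m in counts:
        coeff //= factorial(m)
    return coeff * prod(mu**m for mu, m in zip(mus, counts))

# A fair die rolled N = 6 times, each face appearing exactly once:
print(multinomial_pmf([1, 1, 1, 1, 1, 1], [1/6] * 6))  # ~ 0.0154
```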
Gaussian Distribution
• Aka the normal distribution
• Widely used model for the distribution of continuous variables
• In the case of a single variable x, the Gaussian distribution can be
written in the form N(x | μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))
• where μ is the mean and σ² is the variance
Continuous Distribution
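A direct transcription of this density into Python (a sketch, not a library implementation):

```python
from math import exp, pi, sqrt

def gaussian_pdf(x: float, mu: float, sigma2: float) -> float:
    # N(x | mu, sigma^2) = 1/sqrt(2 pi sigma^2) * exp(-(x - mu)^2 / (2 sigma^2))
    return exp(-(x - mu)**2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

print(gaussian_pdf(0.0, 0.0, 1.0))  # ~0.3989, the peak of the standard normal
```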
Gaussian Distribution
• Gaussians with different means and variances
Multivariate Gaussian Distribution
• For a D-dimensional vector x, the multivariate Gaussian distribution
takes the form N(x | μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp(−½ (x − μ)^⊤ Σ^(−1) (x − μ))
• where μ is a D-dimensional mean vector
• Σ is a D × D covariance matrix
• |Σ| denotes the determinant of Σ
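And a NumPy sketch of the multivariate density, transcribing the formula directly (it inverts Σ for clarity; a real implementation would use a Cholesky solve):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    # N(x | mu, Sigma) = (2 pi)^{-D/2} |Sigma|^{-1/2} exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

x = np.array([0.0, 0.0])
mu = np.array([0.0, 0.0])
Sigma = np.eye(2)
print(mvn_pdf(x, mu, Sigma))  # ~0.1592 = 1 / (2 pi)
```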
Optimization for ML (CS771: Intro to ML)
Functions and their optima
 Many ML problems require us to optimize a function 𝑓 of some
variable(s) 𝑥
 For simplicity, assume 𝑓 is a scalar-valued function of a scalar 𝑥 (𝑓: ℝ → ℝ)
 Any function has one or more optima (maxima, minima), and maybe saddle points (will see what these are later)
[Figure: 𝑓(𝑥) plotted against 𝑥, showing the global maximum and minimum, several local maxima and minima]
• 𝑓 is the objective function of the ML problem we are solving (e.g., squared loss for regression); assume 𝑥 is unconstrained for now, i.e., just a real-valued number/vector
• Usually interested in global optima, but often want to find local optima too; for deep learning models, the local optima are often what we can find (and they usually suffice) – more later
Derivatives
 Magnitude of the derivative at a point is the rate of change of the function at that point
 The derivative becomes zero at stationary points (optima or saddle points): the function becomes “flat” (∆𝑓(𝑥) ≈ 0 if we change 𝑥 by a very small amount)
 Definition: 𝑑𝑓(𝑥)/𝑑𝑥 = lim∆𝑥→0 ∆𝑓(𝑥)/∆𝑥
[Figure: 𝑓(𝑥) with a small step ∆𝑥 and the corresponding change ∆𝑓(𝑥)]
• Sign is also important: a positive derivative means 𝑓 is increasing at 𝑥 if we increase the value of 𝑥 by a very small amount; a negative derivative means it is decreasing
• Understanding how 𝑓 changes its value as we change 𝑥 is helpful for understanding optimization (minimization/maximization) algorithms
• Will sometimes use 𝑓′(𝑥) to denote the derivative
Rules of Derivatives
Some basic rules of taking derivatives
 Sum Rule: (𝑓(𝑥) + 𝑔(𝑥))′ = 𝑓′(𝑥) + 𝑔′(𝑥)
 Scaling Rule: (𝑎 ⋅ 𝑓(𝑥))′ = 𝑎 ⋅ 𝑓′(𝑥) if 𝑎 is not a function of 𝑥
 Product Rule: (𝑓(𝑥) ⋅ 𝑔(𝑥))′ = 𝑓′(𝑥) ⋅ 𝑔(𝑥) + 𝑔′(𝑥) ⋅ 𝑓(𝑥)
 Quotient Rule: (𝑓(𝑥)/𝑔(𝑥))′ = (𝑓′(𝑥) ⋅ 𝑔(𝑥) − 𝑔′(𝑥) ⋅ 𝑓(𝑥)) / 𝑔(𝑥)²
 Chain Rule: (𝑓(𝑔(𝑥)))′ ≝ (𝑓 ∘ 𝑔)′(𝑥) = 𝑓′(𝑔(𝑥)) ⋅ 𝑔′(𝑥)
• We already used some of these (sum, scaling and chain) when calculating the derivative for the linear regression model
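These rules can be checked symbolically; a short sketch using SymPy (assuming it is installed) to verify the product and chain rules on concrete functions:

```python
import sympy as sp

x = sp.symbols('x')
f, g = sp.sin(x), x**2 + 1

# Product rule: (f g)' = f' g + g' f
lhs = sp.diff(f * g, x)
rhs = sp.diff(f, x) * g + sp.diff(g, x) * f
print(sp.simplify(lhs - rhs))  # 0

# Chain rule: (f o g)'(x) = f'(g(x)) * g'(x)
lhs = sp.diff(f.subs(x, g), x)
rhs = sp.cos(g) * sp.diff(g, x)
print(sp.simplify(lhs - rhs))  # 0
```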
Derivatives
 How the derivative itself changes tells us about the function’s optima
 The second derivative 𝑓′′(𝑥) can provide this information
• 𝑓′(𝑥) = 0 at 𝑥, 𝑓′ > 0 just before 𝑥, 𝑓′ < 0 just after 𝑥: 𝑥 is a maximum
• 𝑓′(𝑥) = 0 at 𝑥, 𝑓′ < 0 just before 𝑥, 𝑓′ > 0 just after 𝑥: 𝑥 is a minimum
• 𝑓′(𝑥) = 0 at 𝑥, 𝑓′ = 0 just before and just after 𝑥: 𝑥 may be a saddle
• Equivalently: 𝑓′(𝑥) = 0 and 𝑓′′(𝑥) < 0 means 𝑥 is a maximum; 𝑓′(𝑥) = 0 and 𝑓′′(𝑥) > 0 means 𝑥 is a minimum; 𝑓′(𝑥) = 0 and 𝑓′′(𝑥) = 0 means 𝑥 may be a saddle, and higher derivatives may be needed
Saddle Points
 Points where the derivative is zero but that are neither minima nor maxima
 Saddle points are very common for loss functions of deep learning models, and need to be handled carefully during optimization
 Second or higher derivatives may help identify whether a stationary point is a saddle
• A saddle is a point of inflection where the derivative is also zero
[Figure: a curve with a saddle point]
Multivariate Functions
 Most functions that we see in ML are multivariate functions
 Example: the loss function 𝐿(𝒘): ℝ^𝐷 → ℝ in linear regression was a multivariate function of the 𝐷-dimensional vector 𝒘
 Here is an illustration of a function of 2 variables (4 maxima and 5 minima)
[Figure: surface plot and two-dim contour plot of the function (i.e., what it looks like from above)]
Plot courtesy: http://benchmarkfcns.xyz/benchmarkfcns/griewankfcn.html
Derivatives of Multivariate Functions
 Can define the derivative for multivariate functions as well, via the gradient
 The gradient of a function 𝑓(𝒙): ℝ^𝐷 → ℝ is the 𝐷 × 1 vector of partial derivatives ∇𝑓(𝒙) = [𝜕𝑓/𝜕𝑥1, 𝜕𝑓/𝜕𝑥2, …, 𝜕𝑓/𝜕𝑥𝐷]^⊤
 Optima and saddle points are defined similarly to the one-dim case
 The required properties that we saw for the one-dim case must be satisfied along all directions
 The second derivative in this case is known as the Hessian
• Each element of the gradient vector tells us how much 𝑓 will change if we move a little along the corresponding coordinate (akin to the one-dim case)
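A small sketch of a numerical gradient via central finite differences, a handy way to check hand-derived gradients (the test function is illustrative):

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    # Central differences: grad_i ~ (f(x + eps e_i) - f(x - eps e_i)) / (2 eps)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

f = lambda x: (x ** 2).sum()     # f(x) = ||x||^2, so grad f(x) = 2x
x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(f, x))  # ~ [2, -4, 6]
```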
The Hessian
 For a multivariate scalar-valued function 𝑓(𝒙): ℝ^𝐷 → ℝ, the Hessian is the 𝐷 × 𝐷 matrix of second partial derivatives, [∇²𝑓(𝒙)]ᵢⱼ = 𝜕²𝑓/𝜕𝑥𝑖𝜕𝑥𝑗; it gives information about the curvature of the function at the point 𝒙
 The Hessian matrix can be used to assess optima/saddle points:
 If ∇𝑓(𝒙) = 0 and ∇²𝑓(𝒙) is a positive semi-definite (PSD) matrix, then 𝒙 is a minimum
 If ∇𝑓(𝒙) = 0 and ∇²𝑓(𝒙) is a negative semi-definite (NSD) matrix, then 𝒙 is a maximum
• A square, symmetric 𝐷 × 𝐷 matrix 𝑀 is PSD if 𝒙^⊤𝑀𝒙 ≥ 0 for all 𝒙 ∈ ℝ^𝐷 (equivalently, if all of its eigenvalues are non-negative), and NSD if 𝒙^⊤𝑀𝒙 ≤ 0 for all 𝒙 ∈ ℝ^𝐷
• Note: If the function itself is vector-valued, e.g., 𝑓(𝒙): ℝ^𝐷 → ℝ^𝐾, then we have 𝐾 such 𝐷 × 𝐷 Hessian matrices, one for each output dimension of 𝑓
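A sketch of the eigenvalue test mentioned above for a symmetric matrix, using NumPy (the function name is mine):

```python
import numpy as np

def is_psd(M, tol=1e-10):
    # A symmetric matrix is PSD iff all of its eigenvalues are non-negative.
    eigvals = np.linalg.eigvalsh(M)  # eigvalsh: eigenvalues of a symmetric matrix
    return bool(np.all(eigvals >= -tol))

H = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # eigenvalues 1 and 3 -> PSD
print(is_psd(H), is_psd(-H))         # True False
```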
Convex and Non-Convex Functions
 A function being optimized can be either convex or non-convex
 Convex functions are bowl-shaped; they have a unique optimum (a minimum). The negative of a convex function is called a concave function, which also has a unique optimum (a maximum)
 Non-convex functions have multiple minima and are usually harder to optimize than convex functions; the loss functions of most deep learning models are non-convex
[Figure: examples of convex and non-convex functions]
Convex Sets
 A set 𝑆 of points is a convex set if, for any two points 𝑥, 𝑦 ∈ 𝑆 and any 0 ≤ 𝛼 ≤ 1, the point 𝑧 = 𝛼𝑥 + (1 − 𝛼)𝑦 is also in 𝑆
 This means that all points on the line segment between 𝑥 and 𝑦 lie within 𝑆
• 𝑧 is also called a “convex combination” of the two points; can also define a convex combination of 𝑁 points 𝑥1, 𝑥2, …, 𝑥𝑁 as 𝑧 = Σᵢ₌₁ᴺ 𝛼𝑖𝑥𝑖 (with 𝛼𝑖 ≥ 0 and Σ𝑖 𝛼𝑖 = 1)
Convex Functions
 Informally, 𝑓(𝑥) is convex if all of its chords lie above the function everywhere
 Formally (assuming 𝑓 is differentiable), some tests for convexity:
 First-order: the graph of 𝑓 must lie above all of its tangents, i.e., 𝑓(𝑦) ≥ 𝑓(𝑥) + ∇𝑓(𝑥)^⊤(𝑦 − 𝑥) for all 𝑥, 𝑦
 Second-order: the second derivative, a.k.a. the Hessian (if it exists), must be positive semi-definite
 Exercise: Show that the ridge regression objective is convex (a numeric sanity check follows below)
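As a nudge toward the exercise, a numeric sanity check: the ridge objective Σn (yn − w⊤xn)² + λ‖w‖² has Hessian 2(X⊤X + λI), which the second-order test says should be PSD for λ ≥ 0 (the derivation here is mine, but standard):

```python
import numpy as np

# The ridge objective L(w) = sum_n (y_n - w^T x_n)^2 + lam * ||w||^2
# has Hessian 2 * (X^T X + lam * I); PSD for any lam >= 0, so L is convex.
rng = np.random.default_rng(0)
X, lam = rng.normal(size=(20, 5)), 0.1
H = 2 * (X.T @ X + lam * np.eye(5))
print(np.all(np.linalg.eigvalsh(H) >= 0))  # True: all eigenvalues non-negative
```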
Optimization Using First-Order Optimality
 Very simple; we already used this approach for linear and ridge regression
 First-order optimality: the gradient 𝒈 must be equal to zero at the optimum, 𝒈 = ∇𝒘 𝐿(𝒘) = 0
 Sometimes, setting 𝒈 = 𝟎 and solving for 𝒘 gives a closed-form solution
• Called “first order” since only the gradient is used, and the gradient provides first-order information about the function being optimized
• This approach works only for very simple problems where the objective is convex and there are no constraints on the values 𝒘 can take
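For least squares, this closed-form solution is the normal equations; a minimal NumPy sketch on synthetic data (all names are mine):

```python
import numpy as np

# Setting g = 0 for L(w) = sum_n (y_n - w^T x_n)^2 gives the normal equations:
#   X^T X w = X^T y   =>   w = (X^T X)^{-1} X^T y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to w_true
```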
Optimization via Gradient Descent
 Initialize 𝒘 as 𝒘(0)
 For iteration 𝑡 = 0, 1, 2, … (or until convergence):
 Calculate the gradient 𝒈(𝑡) using the current iterate 𝒘(𝑡)
 Set the learning rate 𝜂𝑡
 Move in the opposite direction of the gradient: 𝒘(𝑡+1) = 𝒘(𝑡) − 𝜂𝑡𝒈(𝑡) (see the code sketch after this list)
• Iterative, since it requires several steps/iterations to find the optimal solution
• Fact: the gradient gives the direction of steepest change in the function’s value (will see the justification shortly)
• For convex functions, GD will converge to the global minimum; good initialization is needed for non-convex functions
• The learning rate is very important and should be set carefully (fixed or chosen adaptively); will discuss some strategies later
• Sometimes it may be tricky to assess convergence; will see some methods later
• Can this approach solve maximization problems? Yes: for maximization we can use gradient ascent, 𝒘(𝑡+1) = 𝒘(𝑡) + 𝜂𝑡𝒈(𝑡), which moves in the direction of the gradient
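A minimal sketch of this update rule on a simple convex function (the fixed learning rate and iteration count are chosen for illustration):

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.1, iters=100):
    # w^{(t+1)} = w^{(t)} - eta_t * g^{(t)}, with a fixed learning rate here
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w - lr * grad(w)
    return w

# Minimize f(w) = ||w - c||^2, whose gradient is 2 (w - c); minimum at w = c.
c = np.array([3.0, -1.0])
print(gradient_descent(lambda w: 2 * (w - c), w0=[0.0, 0.0]))  # ~ [3, -1]
```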
Gradient Descent: An Illustration
[Figure: 𝐿(𝒘) plotted against 𝒘, with two GD runs 𝒘(0) → 𝒘(1) → 𝒘(2) → 𝒘(3): one converges to 𝒘∗, the other gets stuck at a local minimum]
• Where the gradient is negative (𝛿𝐿/𝛿𝑤 < 0), we move in the positive direction; where it is positive, we move in the negative direction
• The learning rate and a good initialization are very important
GD: An Example
 Let’s apply GD to least squares linear regression: 𝒘_LS = arg min𝒘 𝐿(𝒘) = arg min𝒘 Σₙ₌₁ᴺ (𝑦𝑛 − 𝒘^⊤𝒙𝑛)²
 The gradient: 𝒈 = −Σₙ₌₁ᴺ 2(𝑦𝑛 − 𝒘^⊤𝒙𝑛)𝒙𝑛
 Each GD update will be of the form 𝒘(𝑡+1) = 𝒘(𝑡) + 𝜂𝑡 Σₙ₌₁ᴺ 2(𝑦𝑛 − 𝒘(𝑡)^⊤𝒙𝑛)𝒙𝑛
• (𝑦𝑛 − 𝒘(𝑡)^⊤𝒙𝑛) is the prediction error of the current model 𝒘(𝑡) on the 𝑛th training example, so training examples on which the current model’s error is large contribute more to the update
 Exercise: Assume 𝑁 = 1, and show that the GD update improves the prediction on the training input (𝒙𝑛, 𝑦𝑛), i.e., 𝑦𝑛 is closer to 𝒘(𝑡+1)^⊤𝒙𝑛 than to 𝒘(𝑡)^⊤𝒙𝑛
 This is sort of a proof that GD updates are “corrective” in nature
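Putting the pieces together, a sketch of this exact update for least squares on synthetic data (the learning rate is fixed and small; as noted above, it should really be chosen with care):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
w_true = np.array([2.0, -3.0])
y = X @ w_true

w = np.zeros(2)
lr = 0.005
for t in range(1000):
    # g = -sum_n 2 * (y_n - w^T x_n) * x_n, vectorized over all n
    g = -2 * X.T @ (y - X @ w)
    w = w - lr * g  # examples with large error dominate the update

print(w)  # ~ [2, -3]
```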