2. Table of contents
What is Machine Learning
Probability & Random variable
Probability Distributions & Maximum Likelihood Estimation
3. Machine learning
Machine learning is the study of computer algorithms that allow
computer programs to automatically improve through experience
(Tom Mitchell)
▶ T: a task with clearly defined input and output
▶ P: a performance measure assessing how good an algorithm is
on the task
▶ E: a set of experiences (i.e., data) provided to the algorithm
4. Example
(Single) Face detection
▶ T: input = a 224x224 RGB image; output = (x1, y1, x2, y2), the
top-left & bottom-right corners of the face in the input
▶ P: IoU (intersection over union) between predicted and ground-truth boxes
▶ E: a set of (possibly millions of) (image, (x1, y1, x2, y2)) pairs
Exercises: Specify T, P, E for
▶ Predicting tomorrow’s weather given geographic information,
satellite images, and a trailing window of past weather.
▶ Answering questions expressed in free-form text.
▶ Identifying all people depicted in an image and drawing
outlines around each.
▶ Recommending products that users are likely to enjoy while
browsing.
5. Types of Machine Learning
▶ Supervised learning: learn the input-output relationship
(E = {(xi, yi)} where the xi are inputs and the yi are desired
targets)
▶ Unsupervised / Self-supervised learning: learn data features,
clusters, or the distribution (E = {xi}, inputs only, no targets)
▶ Reinforcement learning: learn a good action policy for an agent
in an environment (E = {(s, a) → (s′, r)} where s, s′ are
states, a is an action, r is a reward)
6. Key phases in Machine Learning
▶ Data preparation: storing, retrieving, transforming data
▶ Data modelling: model libraries, machine learning algorithms
▶ Training: model optimization, fine-tuning, validation
▶ Inference: deploying, logging, testing, mobile, web, API
7. Prerequisites for Machine Learning
Math
▶ Linear Algebra
▶ Calculus
▶ Probability and Statistics
▶ Optimization
Programming
▶ Data structures and algorithms
▶ Python/C++
▶ Libraries: numpy, pandas, scikit-learn, pytorch
▶ Frameworks: jupyter, django, fastapi, Android, iOS
8. Probability
Definitions:
▶ Sample space: Ω is the set of all possible outcomes (results)
of a random experiment.
▶ Event space: the set F ⊆ 2^Ω is a σ-algebra of subsets of Ω.
Each element of F is an event (a subset of Ω).
▶ A σ-algebra must satisfy: (i) F ≠ ∅; (ii) A ∈ F ⇒ Ω \ A ∈ F;
(iii) Ai ∈ F ∀i ⇒ ⋃_{i=1}^{∞} Ai ∈ F
▶ Probability measure: a function P : F → R+ satisfying the
following properties:
▶ P(Ω) = 1, P(∅) = 0
▶ Ai ∈ F, Ai ∩ Aj = ∅ ∀i ≠ j ⇒ P(⋃_{i=1}^{∞} Ai) = ∑_{i=1}^{∞} P(Ai)
As a result, the probability of a random event is specified by a
probability triple (Ω, F, P).
9. Probability
Example
Consider a random experiment: a closed box contains 100
marbles, of which 40 are red and 60 are blue. Take out one marble
at random.
▶ Sample space: Ω is the set of 100 marbles in the box.
▶ Event space: F = {∅, Ω, red marble, blue marble}, i.e., F
contains 4 subsets of Ω. Notice that F is a σ-algebra of Ω.
▶ Probability measure: if the chances of drawing every marble are
all equal, then
▶ P(∅) = 0, P(Ω) = 1, P(red) = 0.4, P(blue) = 0.6
▶ Event ∅: no marble is taken (happens with probability 0).
▶ Event Ω: a red or blue marble is taken (happens with
probability 1).
▶ Event red marble: the marble taken is red (probability 0.4).
▶ Event blue marble: the marble taken is blue (probability 0.6).
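To make the triple concrete, here is a minimal Python sketch of (Ω, F, P) for this experiment (all names, e.g. Omega and prob, are illustrative):

```python
# A minimal sketch of the probability triple (Omega, F, P) for the
# marble experiment above. Names are illustrative, not canonical.
from fractions import Fraction

# Sample space: 100 marbles, labelled by colour.
Omega = frozenset(f"red_{i}" for i in range(40)) | frozenset(f"blue_{i}" for i in range(60))

red = frozenset(m for m in Omega if m.startswith("red"))
blue = Omega - red

# Event space: the four events on the slide (a sigma-algebra of Omega).
F = {frozenset(), Omega, red, blue}

def prob(event):
    """Uniform probability measure: each marble is equally likely."""
    assert event in F, "P is only defined on events in F"
    return Fraction(len(event), len(Omega))

print(prob(red), prob(blue), prob(Omega), prob(frozenset()))
# 2/5 3/5 1 0
```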
10. Probability
Bayes’ theorem
Consider two events A, B with P(A) ≠ 0. Then
P(B|A) = P(A ∩ B) / P(A) = P(A|B)P(B) / P(A)
where
▶ P(B|A): the probability of event B occurring given that A is
true (the posterior).
▶ P(A|B): the likelihood of A given a fixed B.
▶ P(B): the marginal or prior probability of B.
Independence
Two events A and B are independent iff P(A ∩ B) = P(A)P(B)
11. Probability
Example: COVID-19
▶ The test on an infected person is positive with probability 90%
(true positive rate).
▶ The test on a healthy person is negative with probability 99%
(true negative rate).
▶ 3% of the population have COVID-19.
Question: what is the probability that a random person who tests
positive actually has the disease?
▶ Event A: positive test result.
▶ Event B: has the disease.
P(A|B) × P(B) = 0.9 × 0.03 = 0.027
P(A) = P(A|B) × P(B) + P(A|¬B) × P(¬B)
= 0.9 × 0.03 + 0.01 × 0.97 = 0.0367
⇒ P(B|A) = 0.027 / 0.0367 ≈ 73.57%
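A quick numerical check of this computation (variable names are illustrative):

```python
# Bayes' theorem applied to the COVID-19 example above.
p_b = 0.03          # P(B): prior probability of disease
tpr = 0.90          # P(A|B): true positive rate
fpr = 1 - 0.99      # P(A|not B): 1 - true negative rate

p_a = tpr * p_b + fpr * (1 - p_b)      # total probability of a positive test
p_b_given_a = tpr * p_b / p_a          # Bayes' theorem
print(f"P(A) = {p_a:.4f}, P(B|A) = {p_b_given_a:.2%}")
# P(A) = 0.0367, P(B|A) = 73.57%
```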
12. Random variable
A random variable X is a measurable function on the sample
space:
X : Ω → R
Example:
▶ Randomly draw 10 marbles (with replacement). The number
of blue marbles among the 10 drawn is a random variable.
▶ Pick 1 person at random out of 100 people; that person's
height is a random variable.
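The first example is easy to simulate; a short numpy sketch (using P(blue) = 0.6 from the marble example; seed and sample size are arbitrary):

```python
# Simulate X = number of blue marbles among 10 draws with replacement.
import numpy as np

rng = np.random.default_rng(0)
draws = rng.random((100_000, 10)) < 0.6   # True where a blue marble is drawn
X = draws.sum(axis=1)                     # one realisation of X per row
print(X[:5], X.mean())                    # empirical mean is near 10 * 0.6 = 6
```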
13. Types of random variables
▶ Discrete
X ∈ {1, 2, . . . , C}
with parameters θc = P(X = c), c = 1, 2, . . . , C
▶ Continuous
X ∈ R
▶ Cumulative distribution function (CDF): F(x) = P(X ≤ x)
▶ Probability density function (PDF): p(x) = F′(x)
▶ Bayes' formula for PDFs:
p(x, y) = p(y|x)p(x) = p(x|y)p(y)
14. Properties of a random distribution
▶ Expectation
E[X] = ∑_c c P(X = c) (discrete), E[X] = ∫_R x p(x) dx (continuous)
E[f(X)] = ∫_R f(x) p(x) dx
▶ Variance
V[X] = E[(X − E[X])²]
15. Properties of expectation
E[aX + bY + c] = aE[X] + bE[Y] + c
V[aX] = a²V[X]
V[X] = E[X²] − (E[X])²
V[X] = V[E[X|Y]] + E[V[X|Y]]
If X, Y are independent:
E[X · Y] = E[X] · E[Y]
V[X + Y] = V[X] + V[Y]
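These identities can be sanity-checked by Monte Carlo; a sketch (the distributions and constants below are arbitrary choices):

```python
# Monte Carlo check of linearity of expectation and of the rule
# "independence => variances add". Values are approximate.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(1.0, 2.0, size=1_000_000)    # independent of Y
Y = rng.exponential(3.0, size=1_000_000)

a, b, c = 2.0, -1.0, 5.0
print(np.mean(a*X + b*Y + c), a*X.mean() + b*Y.mean() + c)  # linearity of E
print(np.var(X + Y), np.var(X) + np.var(Y))                 # variances add
```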
16. Properties of expectation
E_Y[E_X[X|Y]] = ∫_R ( ∫_R x p(x|y) dx ) p(y) dy = ∫_R x p(x) dx = E[X]

because

∫_R ( ∫_R x p(x|y) dx ) p(y) dy = ∫_R ∫_R x p(x|y) p(y) dx dy
= ∫_R ∫_R x p(x, y) dx dy
= ∫_R ∫_R x p(y|x) p(x) dx dy
= ∫_R x p(x) ( ∫_R p(y|x) dy ) dx
= ∫_R x p(x) dx, since ∫_R p(y|x) dy = 1
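A Monte Carlo check of this tower property, using a model where E[X|Y] is known in closed form (the choice X = Y + Z with independent Y, Z is an arbitrary example):

```python
# Check E_Y[E_X[X|Y]] = E[X] for X = Y + Z with Y, Z independent,
# so that E[X|Y] = Y + E[Z] exactly.
import numpy as np

rng = np.random.default_rng(0)
Y = rng.exponential(2.0, size=1_000_000)   # E[Y] = 2
Z = rng.normal(1.0, 3.0, size=1_000_000)   # E[Z] = 1
X = Y + Z

cond_exp = Y + 1.0                  # E[X|Y] = Y + E[Z]
print(cond_exp.mean(), X.mean())    # both approximate E[X] = 3
```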
17. Bernoulli distribution
X ∈ {0, 1} with probability P(X = 1) = θ, written as X ∼ Ber(θ).
We also have P(X = 0) = 1 − θ.
▶ A biased coin: θ = probability of head
▶ Binary classification: y|x ∼ Ber(θ(x)), i.e., P(y = 1|x) = θ(x)
→ the probability of class 1 is a function of the input
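Sampling from a Bernoulli distribution with numpy, a small sketch (θ = 0.7 is arbitrary; note Ber(θ) = Bin(1, θ)):

```python
# Draw Ber(theta) samples; the sample mean approaches theta.
import numpy as np

theta = 0.7
rng = np.random.default_rng(0)
x = rng.binomial(n=1, p=theta, size=10)   # Ber(theta) = Bin(1, theta)
print(x, x.mean())
```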
18. Parameter estimation
Toss a coin N times (sampling); heads (value 1) comes up s times.
What is the parameter θ of the coin (Bernoulli distribution)?
An intuitive guess: θ = s/N. Why does this number make sense?
Let xi ∈ {0, 1} be the value of the i-th toss.
Since the tosses are independent, the probability of the data
D = {x1, x2, . . . , xN} under the model X ∼ Ber(θ) is
L(θ) = P(D) = P(x1, x2, . . . , xN) = ∏_{i=1}^{N} P(xi) = ∏_{i=1}^{N} θ^{xi} (1 − θ)^{1−xi}
19. Maximum Likelihood Estimation (MLE)
L(θ) is the likelihood of θ with respect to the dataset D.
MLE: find the θ for which L(θ) is maximized.
ℓ(θ) = log L(θ) = ∑_{i=1}^{N} [ xi log θ + (1 − xi) log(1 − θ) ]
ℓ′(θ) = ∑_{i=1}^{N} [ xi/θ − (1 − xi)/(1 − θ) ] = 0
⇒ (1/θ) ∑_{i=1}^{N} xi = (1/(1 − θ)) ∑_{i=1}^{N} (1 − xi),
where ∑_{i=1}^{N} xi = s and ∑_{i=1}^{N} (1 − xi) = N − s
⇒ s(1 − θ) = (N − s)θ
⇒ θMLE = s/N
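A numerical check of this result: simulate N tosses and compare the closed-form θMLE = s/N against a brute-force grid maximization of ℓ(θ) (the parameters below are arbitrary):

```python
# Closed-form MLE s/N vs. grid maximisation of the log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
theta_true, N = 0.3, 10_000
x = rng.binomial(1, theta_true, size=N)   # N Bernoulli tosses
s = x.sum()

thetas = np.linspace(1e-6, 1 - 1e-6, 10_000)
loglik = s * np.log(thetas) + (N - s) * np.log(1 - thetas)
print(s / N, thetas[np.argmax(loglik)])   # both close to theta_true
```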
20. How good is the MLE?
▶ Unbiased: E[θMLE] = θ
▶ Variance goes to 0: V[θMLE] = θ(1 − θ)/N
▶ Consistent: P{|θMLE − θ| ≥ ϵ} → 0 as N → ∞, for any ϵ > 0
▶ Asymptotic normality: √N (θMLE − θ) →d N(0, θ(1 − θ))
(An empirical check of the first two properties is sketched below.)
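Over many repeated experiments, the estimates s/N should average to θ with variance close to θ(1 − θ)/N; a sketch (parameters arbitrary):

```python
# Empirical check of unbiasedness and the variance formula.
import numpy as np

rng = np.random.default_rng(0)
theta, N, trials = 0.3, 200, 50_000
estimates = rng.binomial(N, theta, size=trials) / N   # s/N per trial
print(estimates.mean(), theta)                        # unbiased
print(estimates.var(), theta * (1 - theta) / N)       # variance matches
```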
21. Binomial distribution
The probability of getting exactly s heads in N independent
Bernoulli trials of tossing a coin follows a Binomial distribution:
X ∼ Bin(N, θ) ⇒ P(X = s) = C_N^s θ^s (1 − θ)^{N−s}
If, after repeating the experiment n times, we get the data
D = {s1, s2, . . . , sn}, then what is a sensible value of θ? (Hint:
use MLE)
L(θ) = P(D) = P(s1, s2, . . . , sn) = ∏_{i=1}^{n} P(si) = ∏_{i=1}^{n} C_N^{si} θ^{si} (1 − θ)^{N−si}
22. Binomial distribution (cont)
ℓ(θ) = log L(θ) = const + ∑_{i=1}^{n} [ si log θ + (N − si) log(1 − θ) ]
ℓ′(θ) = ∑_{i=1}^{n} [ si/θ − (N − si)/(1 − θ) ] = 0
⇒ (1/θ) ∑_{i=1}^{n} si = (1/(1 − θ)) ∑_{i=1}^{n} (N − si)
⇒ θMLE = (1/(nN)) ∑_{i=1}^{n} si
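A quick simulation to confirm this estimator (parameters arbitrary):

```python
# Simulate n binomial experiments and apply theta = sum(s_i) / (n*N).
import numpy as np

rng = np.random.default_rng(0)
theta_true, N, n = 0.45, 50, 2_000
s = rng.binomial(N, theta_true, size=n)   # data D = {s_1, ..., s_n}
theta_mle = s.sum() / (n * N)
print(theta_mle, theta_true)              # close for large n
```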
23. Gaussian distribution
The distribution X ∼ N(µ, σ²) is a Gaussian distribution with
the density function
p(X = x) = 1/√(2πσ²) · e^{−(x−µ)²/(2σ²)}
▶ Regression: p(y|x) = N(y|µ(x), σ²), equivalently y = µ(x) + ϵ
with ϵ ∼ N(0, σ²)
Exercise: Given the data D = {x1, x2, . . . , xn}, what are reasonable
values of the parameters µ and σ²? (A numerical aid is sketched below.)
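As a numerical aid for the exercise (it hints at, but does not derive, the answer), one can grid-search the Gaussian log-likelihood; the data and grid below are arbitrary:

```python
# Maximise the Gaussian log-likelihood over a grid of (mu, sigma2).
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.5, size=500)

mus = np.linspace(0, 4, 100)
sig2s = np.linspace(0.5, 5, 100)
M, S2 = np.meshgrid(mus, sig2s)

# log L(mu, sigma2) = sum_i log N(x_i | mu, sigma2)
loglik = -0.5 * len(data) * np.log(2 * np.pi * S2) \
         - ((data[:, None, None] - M) ** 2).sum(axis=0) / (2 * S2)
i, j = np.unravel_index(loglik.argmax(), loglik.shape)
print(M[i, j], S2[i, j])   # compare with data.mean() and data.var()
```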