Presentation material for the lab reading club on The Elements of Statistical Learning by Hastie et al. (slides in English).
The covered sections are:
- Properties of logistic regression viewed through the analogy with least-squares fitting
- Comparison of logistic regression and linear discriminant analysis
- Rosenblatt's perceptron algorithm
- Derivation of the optimal separating hyperplane, which forms the basis of the SVM
ESL 4.4.3-4.5: Logistic Regression (contd.) and Separating Hyperplane
1. ESL 4.4.3-4.5
Logistic Regression (contd.)
& Separating Hyperplane
June 8, 2015
Talk by Shinichi TAMURA
Mathematical Informatics Lab @ NAIST
2. Today's topics
☐ Logistic regression (contd.)
☐ On the analogy with Least Squares Fitting
☐ Logistic regression vs. LDA
☐ Separating Hyperplane
☐ Rosenblatt's Perceptron
☐ Optimal Hyperplane
6-7. On the analogy with Least Squares Fitting
[Review] Fitting the LR Model
Parameters are fitted by ML estimation, using the Newton-Raphson algorithm:

    \beta^{\text{new}} \leftarrow \arg\min_\beta\, (z - X\beta)^T W (z - X\beta)
    \beta^{\text{new}} = (X^T W X)^{-1} X^T W z.

It looks like least squares fitting:

    \beta \leftarrow \arg\min_\beta\, (y - X\beta)^T (y - X\beta)
    \beta = (X^T X)^{-1} X^T y
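As an aside (not part of the slides), the Newton-Raphson / IRLS update above can be sketched in a few lines of NumPy; the variable names and the convergence tolerance are illustrative assumptions:

    import numpy as np

    def irls_logistic(X, y, n_iter=25, tol=1e-8):
        """Fit logistic regression by iteratively reweighted least squares.

        X : (N, p+1) design matrix with a leading column of ones
        y : (N,) vector of 0/1 responses
        """
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-X @ beta))    # fitted probabilities
            W = p * (1.0 - p)                      # diagonal of the weight matrix
            z = X @ beta + (y - p) / W             # adjusted response
            # beta_new = (X^T W X)^{-1} X^T W z, solved as a weighted least squares
            beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
            if np.max(np.abs(beta_new - beta)) < tol:
                return beta_new
            beta = beta_new
        return beta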
8-10. On the analogy with Least Squares Fitting
Self-consistency
β depends on W and z, while W and z depend on β:

    \beta^{\text{new}} \leftarrow \arg\min_\beta\, (z - X\beta)^T W (z - X\beta)
    \beta^{\text{new}} = (X^T W X)^{-1} X^T W z,

where

    z_i = x_i^T \hat\beta + \frac{y_i - \hat p_i}{\hat p_i (1 - \hat p_i)}, \qquad w_i = \hat p_i (1 - \hat p_i).

→ a "self-consistent" equation, which needs an iterative method to solve.
11-12. On the analogy with Least Squares Fitting
Meaning of Weighted RSS (1)
The RSS is used to check the goodness of fit in least squares fitting:

    \sum_{i=1}^N (y_i - \hat p_i)^2

How about the weighted RSS in logistic regression?

    \sum_{i=1}^N \frac{(y_i - \hat p_i)^2}{\hat p_i (1 - \hat p_i)}
13-15. On the analogy with Least Squares Fitting
Meaning of Weighted RSS (2)
The weighted RSS can be interpreted as... Pearson's chi-squared statistic:

    \chi^2 = \sum_{i=1}^N \left[ \frac{(y_i - \hat p_i)^2}{\hat p_i} + \frac{(y_i - \hat p_i)^2}{1 - \hat p_i} \right]
           = \sum_{i=1}^N \frac{(1 - \hat p_i + \hat p_i)(y_i - \hat p_i)^2}{\hat p_i (1 - \hat p_i)}
           = \sum_{i=1}^N \frac{(y_i - \hat p_i)^2}{\hat p_i (1 - \hat p_i)}.
16-22. On the analogy with Least Squares Fitting
Meaning of Weighted RSS (3)
or... as a quadratic approximation of the deviance:

    D = -2 \left\{ \sum_{i=1}^N \left[ y_i \log\hat p_i + (1 - y_i)\log(1 - \hat p_i) \right] - \sum_{i=1}^N \left[ y_i \log y_i + (1 - y_i)\log(1 - y_i) \right] \right\}
      = 2 \sum_{i=1}^N \left[ y_i \log\frac{y_i}{\hat p_i} + (1 - y_i)\log\frac{1 - y_i}{1 - \hat p_i} \right]
      \approx 2 \sum_{i=1}^N \left[ (y_i - \hat p_i) + \frac{(y_i - \hat p_i)^2}{2\hat p_i} + \{(1 - y_i) - (1 - \hat p_i)\} + \frac{\{(1 - y_i) - (1 - \hat p_i)\}^2}{2(1 - \hat p_i)} \right]
      = \sum_{i=1}^N \left[ \frac{(y_i - \hat p_i)^2}{\hat p_i} + \frac{(y_i - \hat p_i)^2}{1 - \hat p_i} \right]
      = \sum_{i=1}^N \frac{(y_i - \hat p_i)^2}{\hat p_i (1 - \hat p_i)}.

The first sum is the maximized log-likelihood of the model; the second is the log-likelihood of the full (saturated) model, which achieves a perfect fit.
The quadratic approximation uses

    a \log\frac{a}{b} = (a - b) + \frac{(a - b)^2}{2b} - \frac{(a - b)^3}{6b^2} + \cdots
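As an illustrative aside (not from the slides), the identity between the weighted RSS and Pearson's chi-squared, and its role as an approximation to the deviance, can be checked numerically; the probabilities and responses below are synthetic:

    import numpy as np

    rng = np.random.default_rng(0)
    p_hat = rng.uniform(0.6, 0.9, size=200)   # fitted probabilities (illustrative)
    y = rng.binomial(1, p_hat)                # binary responses drawn from them

    # Weighted RSS and Pearson's chi-squared agree exactly (the identity above)
    weighted_rss = np.sum((y - p_hat) ** 2 / (p_hat * (1 - p_hat)))
    pearson = np.sum((y - p_hat) ** 2 / p_hat + (y - p_hat) ** 2 / (1 - p_hat))

    # Deviance; for 0/1 responses the saturated-model term vanishes
    deviance = -2 * np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

    # The first two match exactly; the deviance is what the weighted RSS approximates
    print(weighted_rss, pearson, deviance)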
23-24. On the analogy with Least Squares Fitting
Asymptotic distribution of β̂
The distribution of β̂ converges to N(β, (X^T W X)^{-1}).
(See hand-out for the details.)

    y_i \overset{\text{i.i.d.}}{\sim} \text{Bern}(\Pr(x_i; \beta)).
    \therefore \mathbb{E}[y] = p, \quad \text{var}[y] = W.
    \therefore \mathbb{E}[\hat\beta] = \mathbb{E}\left[ (X^T W X)^{-1} X^T W z \right]
        = (X^T W X)^{-1} X^T W \, \mathbb{E}\left[ X\beta + W^{-1}(y - p) \right]
        = (X^T W X)^{-1} X^T W X \beta
        = \beta,
    \text{var}[\hat\beta] = (X^T W X)^{-1} X^T W \, \text{var}\left[ X\beta + W^{-1}(y - p) \right] W X (X^T W X)^{-1}
        = (X^T W X)^{-1} X^T W (W^{-1} W W^{-1}) W X (X^T W X)^{-1}
        = (X^T W X)^{-1}.
25-27. On the analogy with Least Squares Fitting
Test of models for LR
Once a model is obtained, the Wald test or Rao's score test can be used to decide which term to drop/add. They need no recalculation of the IRLS.
• Rao's score test: tests by the gradient of the log-likelihood.
• Wald test: tests by the difference of the parameter.
Figure from: "Statistics 111: Introduction to Theoretical Statistics" lecture note by Kevin Andrew Rader, Harvard College GSAS
http://isites.harvard.edu/icb/icb.do?keyword=k101665&pageid=icb.page651024
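A minimal sketch (not from the slides) of how per-coefficient Wald z-statistics could be computed from the asymptotic covariance (X^T W X)^{-1}; the function name and inputs are assumptions for illustration:

    import numpy as np

    def wald_z_statistics(X, beta_hat):
        """Wald z-statistic for each coefficient of a fitted logistic regression."""
        p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))
        W = p_hat * (1.0 - p_hat)
        cov = np.linalg.inv(X.T @ (W[:, None] * X))  # asymptotic covariance (X^T W X)^{-1}
        se = np.sqrt(np.diag(cov))                   # standard errors
        return beta_hat / se                         # test of beta_j = 0 for each j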
28-30. On the analogy with Least Squares Fitting
L1-regularized LR (1)
Just like the lasso, an L1-regularizer is effective for LR.
Here the objective function becomes:

    \max_{\beta_0, \beta} \left\{ \sum_{i=1}^N \log \Pr(y_i \mid x_i; \beta_0, \beta) - \lambda \|\beta\|_1 \right\}
    = \max_{\beta_0, \beta} \left\{ \sum_{i=1}^N \left[ y_i(\beta_0 + x_i^T\beta) - \log\left( 1 + e^{\beta_0 + x_i^T\beta} \right) \right] - \lambda \sum_{j=1}^p |\beta_j| \right\}.

The resulting algorithm can be called an "iteratively reweighted lasso" algorithm.
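As a practical aside, an L1-penalized logistic regression of this form can be fitted with scikit-learn, whose parameter C roughly plays the role of 1/λ; the data below is synthetic and only for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    beta_true = np.array([2.0, -1.5, 0, 0, 0, 0, 0, 0, 0, 0])  # sparse truth
    p = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
    y = rng.binomial(1, p)

    # A small C (strong penalty) drives many coefficients exactly to zero
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    model.fit(X, y)
    print(model.coef_)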
31-34. On the analogy with Least Squares Fitting
L1-regularized LR (2)
By setting the gradient to 0, we get the same score equation as the lasso algorithm:

    \frac{\partial}{\partial \beta_j} \left\{ \sum_{i=1}^N \left[ y_i(\beta_0 + x_i^T\beta) - \log\left( 1 + e^{\beta_0 + x_i^T\beta} \right) \right] - \lambda \sum_{j=1}^p |\beta_j| \right\} = 0
    \therefore \sum_{i=1}^N x_{ij} \left( y_i - \frac{e^{\beta_0 + x_i^T\beta}}{1 + e^{\beta_0 + x_i^T\beta}} \right) - \lambda \cdot \mathrm{sign}(\beta_j) = 0
    \therefore x_j^T (y - p) = \lambda \cdot \mathrm{sign}(\beta_j) \quad (\text{where } \beta_j \neq 0)

Compare with the score equation of the lasso:

    x_j^T (y - X\beta) = \lambda \cdot \mathrm{sign}(\beta_j)
35-36. On the analogy with Least Squares Fitting
L1-regularized LR (3)
Since the objective function is concave, the solution can be obtained using convex optimization techniques.

However, the coefficient profiles are not piecewise linear, so it is difficult to compute the whole regularization path.
A predictor-corrector method for convex optimization, or a coordinate descent algorithm, will work in some situations.
37. On the analogy with Least Squares Fitting
Summary
LR is analogous to least squares fitting:

    \beta^{\text{new}} = (X^T W X)^{-1} X^T W z \quad \leftrightarrow \quad \beta = (X^T X)^{-1} X^T y

and...
• LR requires an iterative algorithm because of the self-consistency
• The weighted RSS can be seen as a chi-squared statistic or the deviance
• The distribution of β̂ converges to N(β, (X^T W X)^{-1})
• Rao's score test or the Wald test is useful for model selection
• L1-regularized LR is analogous to the lasso, except for the non-linearity
38-39. Today's topics
☐ Logistic regression (contd.)
☑ On the analogy with Least Squares Fitting
☐ Logistic regression vs. LDA
☐ Separating Hyperplane
☐ Rosenblatt's Perceptron
☐ Optimal Hyperplane
40. Logistic regression vs. LDA
What is the difference?
LDA and logistic regression are very similar methods.
Let us study the characteristics of these methods through the differences in their formal aspects.
51-52. Logistic regression vs. LDA
Form of Pr(X)
LDA (involves the parameters):

    \Pr(X) = \sum_{k=1}^K \pi_k \, \phi(X; \mu_k, \Sigma).

Logistic regression:

    Arbitrary \Pr(X)
53. Logistic regression vs. LDA
Effects of the difference (1)
How do these formal differences affect the character of the algorithms?
54-56. Logistic regression vs. LDA
Effects of the difference (2)
The assumptions of Gaussianity and homoscedasticity can be a strong constraint, which leads to lower variance.
In addition, LDA has the advantage that it can make use of unlabelled observations; i.e. semi-supervised learning is possible.

On the other hand, LDA can be affected by outliers.
58-60. Logistic regression vs. LDA
Effects of the difference (3)
With linearly separable data,
• The coefficients of LDA are well-defined, but training errors may occur.
• The coefficients of LR can go to infinity, but the true separating hyperplane can be found.
(Do not worry too much about training error; what matters is generalization error.)
61-63. Logistic regression vs. LDA
Effects of the difference (4)
The assumptions for LDA rarely hold in practice.
Nevertheless, it is known empirically that these models give quite similar results, even when LDA is used inappropriately, say with qualitative variables.

After all, however, if the Gaussian assumption appears to hold, use LDA; otherwise, use logistic regression.
64-66. Today's topics
☑ Logistic regression (contd.)
☑ On the analogy with Least Squares Fitting
☑ Logistic regression vs. LDA
☐ Separating Hyperplane
☐ Rosenblatt's Perceptron
☐ Optimal Hyperplane
68. Separating Hyperplane: Overview
Another way of classification
Both LDA and LR do classification through probabilities obtained from regression models.

Classification can be done in a more explicit way: by modelling the decision boundary directly.
69. Separating Hyperplane: Overview
Properties of vector algebra
Let L be the affine set defined by

    \beta_0 + \beta^T x = 0,

then the signed distance from x to L is

    d_\pm(x, L) = \frac{1}{\|\beta\|} (\beta^T x + \beta_0),

and
    \beta_0 + \beta^T x > 0 \iff x \text{ is above } L
    \beta_0 + \beta^T x = 0 \iff x \text{ is on } L
    \beta_0 + \beta^T x < 0 \iff x \text{ is below } L
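A tiny sketch of the signed-distance formula (the points and coefficients below are arbitrary examples, not from the slides):

    import numpy as np

    def signed_distance(x, beta, beta0):
        """Signed distance from point x to the hyperplane beta^T x + beta0 = 0."""
        return (beta @ x + beta0) / np.linalg.norm(beta)

    # Example: the line x1 + x2 - 1 = 0 in the plane
    beta, beta0 = np.array([1.0, 1.0]), -1.0
    print(signed_distance(np.array([1.0, 1.0]), beta, beta0))  # positive: above L
    print(signed_distance(np.array([0.0, 0.0]), beta, beta0))  # negative: below L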
70-71. Today's topics
☑ Logistic regression (contd.)
☑ On the analogy with Least Squares Fitting
☑ Logistic regression vs. LDA
☐ Separating Hyperplane
☐ Rosenblatt's Perceptron
☐ Optimal Hyperplane
72-76. Rosenblatt's Perceptron
Learning Criteria
The basic criterion of Rosenblatt's perceptron learning algorithm is to reduce (M being the set of misclassified data)

    D(\beta, \beta_0) = \sum_{i \in \mathcal{M}} |x_i^T\beta + \beta_0|
                      \propto \sum_{i \in \mathcal{M}} |d_\pm(x_i, L)|
                      = -\sum_{i \in \mathcal{M}} y_i (x_i^T\beta + \beta_0).

If y_i = 1 is misclassified as -1, the quantity (x_i^T\beta + \beta_0) is negative; if y_i = -1 is misclassified as +1, it is positive. In both cases y_i(x_i^T\beta + \beta_0) < 0, so every misclassified point contributes positively to D.
77-80. Rosenblatt's Perceptron
Learning Algorithm (1)
Instead of reducing D by batch learning, a "stochastic" gradient descent algorithm is adopted.
The coefficients are updated for each misclassified observation, like online learning.
Observations classified correctly do not affect the parameters, so the algorithm is robust to outliers.

Thus, the coefficients are updated based not on D but on the single-observation term

    D_i(\beta, \beta_0) = -y_i (x_i^T\beta + \beta_0).
83-85. Rosenblatt's Perceptron
Learning Algorithm (2)
The algorithm proceeds as follows:
1. Take one observation x_i and classify it.
2. If the classification was wrong, update the coefficients:

    \frac{\partial D_i(\beta, \beta_0)}{\partial \beta} = -y_i x_i, \qquad
    \frac{\partial D_i(\beta, \beta_0)}{\partial \beta_0} = -y_i.

    \therefore \begin{pmatrix} \beta \\ \beta_0 \end{pmatrix} \leftarrow \begin{pmatrix} \beta \\ \beta_0 \end{pmatrix} + \rho \begin{pmatrix} y_i x_i \\ y_i \end{pmatrix}

Here ρ is the learning rate, which can be set to 1 without loss of generality.
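A minimal sketch of the update rule described above, assuming labels in {-1, +1}; the function and variable names are my own, not from the slides:

    import numpy as np

    def perceptron(X, y, rho=1.0, max_epochs=100):
        """Rosenblatt's perceptron: stochastic updates on misclassified points.

        X : (N, p) data matrix, y : (N,) labels in {-1, +1}.
        Converges only if the data are linearly separable.
        """
        beta = np.zeros(X.shape[1])
        beta0 = 0.0
        for _ in range(max_epochs):
            n_mistakes = 0
            for xi, yi in zip(X, y):
                if yi * (xi @ beta + beta0) <= 0:   # misclassified (or on the boundary)
                    beta = beta + rho * yi * xi     # beta  <- beta  + rho * y_i * x_i
                    beta0 = beta0 + rho * yi        # beta0 <- beta0 + rho * y_i
                    n_mistakes += 1
            if n_mistakes == 0:                     # perfect separation reached
                break
        return beta, beta0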
86-87. Rosenblatt's Perceptron
Learning Algorithm (3)
Updating the parameters may cause misclassifications of other, previously correctly classified observations.
Therefore, although each update reduces the corresponding D_i, it can increase the total D.
90-91. Rosenblatt's Perceptron
Convergence Theorem
If the data are linearly separable, perceptron learning terminates in a finite number of steps.
Otherwise, learning never terminates.

However, in practice it is difficult to tell whether
• the data are not linearly separable and the algorithm will never converge, or
• the data are linearly separable but convergence is simply slow.

In addition, the solution is not unique: it depends on the initial values and the order of the data.
92-93. Today's topics
☑ Logistic regression (contd.)
☑ On the analogy with Least Squares Fitting
☑ Logistic regression vs. LDA
☐ Separating Hyperplane
☑ Rosenblatt's Perceptron
☐ Optimal Hyperplane
94-95. Optimal Hyperplane
Derivation of KKT cond. (1)
This section could be hard for some of the audience.

To make the story a bit clearer, let us first study optimization problems in general. The theme is:
duality and the KKT conditions for an optimization problem.
96. Optimal Hyperplane
Derivation of KKT cond. (2)
Suppose we have an optimization problem:

    \text{minimize } f(x) \quad \text{subject to } g_i(x) \le 0,

and let the feasible region be

    C = \{ x \mid g_i(x) \le 0 \}.
97-98. Optimal Hyperplane
Derivation of KKT cond. (3)
In optimization, relaxation is a technique often used to make a problem easier.

Lagrangian relaxation, shown below, is one such technique:

    \text{minimize } L(x, y) = f(x) + \sum_i y_i g_i(x) \quad \text{subject to } y_i \ge 0.
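A tiny worked example of Lagrangian relaxation, not from the slides, which may help fix ideas: minimize f(x) = x^2 subject to x ≥ 1, i.e. g(x) = 1 - x ≤ 0.

    L(x, y) = x^2 + y(1 - x), \quad y \ge 0
    \inf_x L(x, y) = y - \frac{y^2}{4} \quad (\text{attained at } x = y/2)
    \max_{y \ge 0} \left( y - \frac{y^2}{4} \right) = 1 \quad (\text{attained at } y = 2)

The primal optimum is x = 1 with f(x) = 1, matching the dual value (strong duality), and y g(x) = 2(1 - 1) = 0, so complementary slackness holds.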
99-101. Optimal Hyperplane
Derivation of KKT cond. (4)
Concerning L(x, y), the following inequality holds:

    \min_{x \in C} f(x) = \min_x \sup_{y \ge 0} L(x, y) \ge \max_{y \ge 0} \inf_x L(x, y),

and it requires y_i or g_i(x) to be zero for each i (this condition is called "complementary slackness").

According to the inequality, maximizing \inf_x L(x, y) gives us a lower bound for the original problem.
102-105. Optimal Hyperplane
Derivation of KKT cond. (5)
Therefore, we have the following maximization problem:

    \text{maximize } L(x, y) \quad \text{subject to } \frac{\partial}{\partial x} L(x, y) = 0, \quad y \ge 0,

where the stationarity constraint is the condition to achieve \inf_x L(x, y).

This is called the "Wolfe dual problem", and strong duality theory says the solutions of the primal and dual problems are equivalent.
106-111. Optimal Hyperplane
Derivation of KKT cond. (6)
Thus, the optimal solution must satisfy all of the conditions so far. Together they are called the "KKT conditions":

    g_i(x) \le 0                                   (primal constraint)
    \frac{\partial}{\partial x} L(x, y) = 0        (stationarity condition)
    y_i \ge 0                                      (dual constraint)
    y_i g_i(x) = 0                                 (complementary slackness)
112. Optimal Hyperplane
KKT for Opt. Hyperplane (1)
We have learned about the KKT conditions.

Now, let us get back to the original problem: finding the optimal hyperplane.
113-114. Optimal Hyperplane
KKT for Opt. Hyperplane (2)
The original fitting criterion of the optimal hyperplane is a generalization of the perceptron:

    \max_{\beta, \beta_0} M \quad \text{subject to } \|\beta\| = 1, \quad y_i(x_i^T\beta + \beta_0) \ge M \ (i = 1, \dots, N).

The criterion of maximizing the margin is theoretically supported without any distributional assumption.
115-117. Optimal Hyperplane
KKT for Opt. Hyperplane (3)
This is a kind of mini-max problem, which is difficult to solve, so we convert it into an easier one:

    \min_{\beta, \beta_0} \frac{1}{2}\|\beta\|^2 \quad \text{subject to } y_i(x_i^T\beta + \beta_0) \ge 1 \ (i = 1, \dots, N).

(See hand-out for the detailed transformation.)

This is a quadratic programming problem.
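As an aside, this small quadratic program can be solved directly with a general-purpose solver; below is a sketch using SciPy's SLSQP on synthetic separable data (names and data are illustrative, and this is not the Lagrangian/KKT route the slides take next):

    import numpy as np
    from scipy.optimize import minimize

    def optimal_hyperplane(X, y):
        """Solve min 1/2 ||beta||^2 s.t. y_i (x_i^T beta + beta0) >= 1
        with a generic solver (fine for small, separable toy data)."""
        n, p = X.shape

        def objective(w):                  # w = (beta, beta0)
            return 0.5 * np.dot(w[:p], w[:p])

        constraints = [{"type": "ineq",
                        "fun": lambda w, xi=xi, yi=yi: yi * (xi @ w[:p] + w[p]) - 1.0}
                       for xi, yi in zip(X, y)]
        res = minimize(objective, np.zeros(p + 1), method="SLSQP", constraints=constraints)
        return res.x[:p], res.x[p]

    # Toy separable data: class +1 around (2, 2), class -1 around (-2, -2)
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
    y = np.hstack([np.ones(20), -np.ones(20)])
    beta, beta0 = optimal_hyperplane(X, y)
    print(beta, beta0)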
118. Optimal Hyperplane
KKT for Opt. Hyperplane (4)
To make use of the KKT conditions, let us turn the objective function into a Lagrangian:

    L_P = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^N \alpha_i \left[ y_i(x_i^T\beta + \beta_0) - 1 \right].
121-125. Optimal Hyperplane
Support points (1)
The KKT conditions tell us

    \alpha_i > 0 \iff y_i(x_i^T\beta + \beta_0) = 1 \iff x_i \text{ is on the edge of the slab},
    \alpha_i = 0 \iff y_i(x_i^T\beta + \beta_0) > 1 \iff x_i \text{ is off the edge of the slab}.

Those points on the edge of the slab are called "support points" (or "support vectors").
126. Optimal Hyperplane
Support points (2)
β can be written as a linear combination of the support points:

    \beta = \sum_{i=1}^N \alpha_i y_i x_i = \sum_{i \in S} \alpha_i y_i x_i,

where S is the set of indices of the support points.
127-128. Optimal Hyperplane
Support points (3)
β_0 can be obtained after β is obtained. For i ∈ S,

    y_i(x_i^T\beta + \beta_0) = 1
    \therefore \beta_0 = 1/y_i - x_i^T\beta = y_i - \sum_{j \in S} \alpha_j y_j x_i^T x_j \quad (\text{since } y_i = \pm 1,\ 1/y_i = y_i)
    \therefore \beta_0 = \frac{1}{|S|} \sum_{i \in S} \left( y_i - \sum_{j \in S} \alpha_j y_j x_i^T x_j \right),

where the average over S is taken to avoid numerical error.
129-130. Optimal Hyperplane
Support points (4)
All coefficients are defined only through the support points:

    \beta = \sum_{i \in S} \alpha_i y_i x_i, \qquad
    \beta_0 = \frac{1}{|S|} \sum_{i \in S} \left( y_i - \sum_{j \in S} \alpha_j y_j x_i^T x_j \right),

thus the solution is robust to outliers.

However, do not forget that which points become support points is determined using all of the data.
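As a practical aside, these relations can be checked with scikit-learn's SVC, whose dual_coef_ stores α_i y_i for the support points; a very large C approximates the hard-margin problem (the data below is synthetic, not from the slides):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
    y = np.hstack([np.ones(20), -np.ones(20)])

    svc = SVC(kernel="linear", C=1e6).fit(X, y)   # near hard-margin fit

    alpha_y = svc.dual_coef_.ravel()              # alpha_i * y_i for support points only
    beta = alpha_y @ svc.support_vectors_         # beta = sum_{i in S} alpha_i y_i x_i
    print(beta, svc.coef_.ravel())                # the two agree
    print(svc.intercept_)                         # beta_0
    print(len(svc.support_))                      # number of support points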
131-133. Today's topics
☑ Logistic regression (contd.)
☑ On the analogy with Least Squares Fitting
☑ Logistic regression vs. LDA
☑ Separating Hyperplane
☑ Rosenblatt's Perceptron
☑ Optimal Hyperplane
134. Summary
With linearly separable data:
  LDA: training error may occur
  Logistic Regression: true separator found, but coefficients may be infinite
  Perceptron: true separator found, but not unique
  Optimal Hyperplane: best separator found
With non-linearly separable data:
  LDA: works well
  Logistic Regression: works well
  Perceptron: algorithm never stops
  Optimal Hyperplane: not feasible
With outliers:
  LDA: not robust
  Logistic Regression: robust
  Perceptron: robust
  Optimal Hyperplane: robust