ESL 4.4.3-4.5
Logistic Regression (contd.)
& Separating Hyperplane
June 8, 2015
Talk by Shinichi TAMURA
Mathematical Informatics Lab @ NAIST
Today's topics
[ ]  Logistic regression (contd.)
[ ]  On the analogy with Least Squares Fitting
[ ]  Logistic regression vs. LDA
[ ]  Separating Hyperplane
[ ]  Rosenblatt's Perceptron
[ ]  Optimal Hyperplane
On the analogy with Least Squares Fitting
[Review] Fitting LR Model
Parameters are fitted by ML estimation, using the Newton-Raphson algorithm:
\[
\beta^{\text{new}} \leftarrow \arg\min_{\beta}\,(z - X\beta)^T W (z - X\beta), \qquad
\beta^{\text{new}} = (X^T W X)^{-1} X^T W z.
\]
It looks like least squares fitting:
\[
\beta \leftarrow \arg\min_{\beta}\,(y - X\beta)^T (y - X\beta), \qquad
\beta = (X^T X)^{-1} X^T y.
\]
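The update above translates directly into code. Below is a minimal NumPy sketch of the IRLS (Newton-Raphson) loop for two-class logistic regression; it is not from the slides, and the variable names and toy data are mine.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit two-class logistic regression by IRLS (Newton-Raphson).

    X : (N, p+1) design matrix including an intercept column.
    y : (N,) vector of 0/1 responses.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # fitted probabilities
        p = np.clip(p, 1e-12, 1 - 1e-12)           # guard against exact 0/1
        W = p * (1.0 - p)                          # diagonal of W
        z = X @ beta + (y - p) / W                 # adjusted response
        # beta_new = (X^T W X)^{-1} X^T W z, i.e. a weighted least squares step
        beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# toy usage with data generated from a logistic model
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = (X @ np.array([-0.5, 2.0, -1.0]) + rng.logistic(size=100) > 0).astype(float)
print(irls_logistic(X, y))
```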
On the analogy with Least Squares Fitting
Self-consistency
β depends on W and z, while W and z depend on β.
\[
\beta^{\text{new}} \leftarrow \arg\min_{\beta}\,(z - X\beta)^T W (z - X\beta), \qquad
\beta^{\text{new}} = (X^T W X)^{-1} X^T W z,
\]
where
\[
z_i = x_i^T \hat\beta + \frac{y_i - \hat p_i}{\hat p_i(1 - \hat p_i)}, \qquad
w_i = \hat p_i(1 - \hat p_i).
\]
→ a "self-consistent" equation, which needs an iterative method to solve
On the analogy with Least Squares Fitting
Meaning of Weighted RSS (1)
RSS is used to check the goodness of fit in least squares fitting:
\[
\sum_{i=1}^{N} (y_i - \hat p_i)^2
\]
How about the weighted RSS in logistic regression?
\[
\sum_{i=1}^{N} \frac{(y_i - \hat p_i)^2}{\hat p_i(1 - \hat p_i)}
\]
On the analogy with Least Squares Fitting
Meaning of Weighted RSS (2)
The weighted RSS can be interpreted as...
Pearson's χ-squared statistic:
\[
\chi^2 = \sum_{i=1}^{N} \left[ \frac{(y_i - \hat p_i)^2}{\hat p_i} + \frac{(y_i - \hat p_i)^2}{1 - \hat p_i} \right]
       = \sum_{i=1}^{N} \frac{(1 - \hat p_i + \hat p_i)(y_i - \hat p_i)^2}{\hat p_i(1 - \hat p_i)}
       = \sum_{i=1}^{N} \frac{(y_i - \hat p_i)^2}{\hat p_i(1 - \hat p_i)}.
\]
On the analogy with Least Squares Fitting
Meaning of Weighted RSS (3)
or... as a quadratic approximation of the deviance:
\[
\begin{aligned}
D &= -2\left\{ \sum_{i=1}^{N} \bigl[y_i \log \hat p_i + (1 - y_i)\log(1 - \hat p_i)\bigr]
     - \sum_{i=1}^{N} \bigl[y_i \log y_i + (1 - y_i)\log(1 - y_i)\bigr] \right\} \\
  &= 2 \sum_{i=1}^{N} \left[ y_i \log\frac{y_i}{\hat p_i} + (1 - y_i)\log\frac{1 - y_i}{1 - \hat p_i} \right] \\
  &\approx 2 \sum_{i=1}^{N} \left[ (y_i - \hat p_i) + \frac{(y_i - \hat p_i)^2}{2\hat p_i}
     + \{(1 - y_i) - (1 - \hat p_i)\} + \frac{\{(1 - y_i) - (1 - \hat p_i)\}^2}{2(1 - \hat p_i)} \right] \\
  &= \sum_{i=1}^{N} \left[ \frac{(y_i - \hat p_i)^2}{\hat p_i} + \frac{(y_i - \hat p_i)^2}{1 - \hat p_i} \right]
   = \sum_{i=1}^{N} \frac{(y_i - \hat p_i)^2}{\hat p_i(1 - \hat p_i)}.
\end{aligned}
\]
The first sum is the maximized log-likelihood of the model; the second sum is the log-likelihood of the full (saturated) model, which achieves a perfect fit.
The approximation uses the Taylor expansion around a:
\[
x \log\frac{x}{a} = (x - a) + \frac{(x - a)^2}{2a} - \frac{(x - a)^3}{6a^2} + \cdots
\]
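As a quick numerical illustration of the identities above, the sketch below (mine, not from the slides; it assumes fitted probabilities p̂ are already available) compares the weighted RSS, Pearson's χ², and the deviance:

```python
import numpy as np
from scipy.special import xlogy

def weighted_rss(y, p):
    return np.sum((y - p) ** 2 / (p * (1 - p)))

def pearson_chi2(y, p):
    return np.sum((y - p) ** 2 / p + (y - p) ** 2 / (1 - p))

def deviance(y, p):
    # saturated log-likelihood is 0 for 0/1 responses; xlogy handles 0*log(0) = 0
    return -2 * np.sum(xlogy(y, p) + xlogy(1 - y, 1 - p))

rng = np.random.default_rng(1)
p_hat = rng.uniform(0.2, 0.8, size=200)            # stand-in for fitted probabilities
y = (rng.uniform(size=200) < p_hat).astype(float)  # responses drawn from the model

print(weighted_rss(y, p_hat))    # identical to Pearson's chi-squared (the algebra above)
print(pearson_chi2(y, p_hat))
print(deviance(y, p_hat))        # related to the two only through the quadratic approximation
```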
On the analogy with Least Squares Fitting
Asymp. distribution of β̂
The distribution of β̂ converges to N(β, (X^T W X)^{-1}).
(See hand-out for the details)
\[
\begin{aligned}
& y_i \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(\Pr(x_i;\beta)), \qquad
  \therefore\ \mathrm{E}[y] = p,\quad \mathrm{var}[y] = W. \\
\therefore\ \mathrm{E}[\hat\beta] &= \mathrm{E}\bigl[(X^T W X)^{-1} X^T W z\bigr]
  = (X^T W X)^{-1} X^T W\, \mathrm{E}\bigl[X\beta + W^{-1}(y - p)\bigr]
  = (X^T W X)^{-1} X^T W X \beta = \beta, \\
\mathrm{var}[\hat\beta] &= (X^T W X)^{-1} X^T W\, \mathrm{var}\bigl[X\beta + W^{-1}(y - p)\bigr]\, W X (X^T W X)^{-1}
  = (X^T W X)^{-1} X^T W (W^{-1} W W^{-1}) W X (X^T W X)^{-1}
  = (X^T W X)^{-1}.
\end{aligned}
\]
On the analogy with Least Squares Fitting
Test of models for LR
Once a model is obtained, the Wald test or Rao's score test can be used to decide which term to drop/add. Neither requires re-running IRLS.
The score test tests by the gradient of the log-likelihood; the Wald test tests by the difference of the parameter.
[Figure from: "Statistics 111: Introduction to Theoretical Statistics" lecture note by Kevin Andrew Rader, Harvard College GSAS, http://isites.harvard.edu/icb/icb.do?keyword=k101665&pageid=icb.page651024]
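For concreteness, a small sketch of the Wald test (mine; it assumes the design matrix and the IRLS estimate β̂ are available): the z-score of coefficient j is β̂_j divided by the square root of the j-th diagonal element of (X^T W X)^{-1}.

```python
import numpy as np
from scipy.stats import norm

def wald_z_scores(X, beta_hat):
    """Wald z-scores beta_j / se(beta_j), using the asymptotic covariance (X^T W X)^{-1}."""
    p = 1.0 / (1.0 + np.exp(-X @ beta_hat))
    W = p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (W[:, None] * X))    # asymptotic covariance of beta_hat
    se = np.sqrt(np.diag(cov))
    z = beta_hat / se
    p_values = 2 * norm.sf(np.abs(z))              # two-sided p-values
    return z, p_values
```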
  
On the analogy with Least Squares Fitting
L1-regularized LR (1)
Just like the lasso, an L1 regularizer is effective for LR.
Here the objective function will be:
\[
\max_{\beta_0,\beta}\ \sum_{i=1}^{N} \log \Pr(g_i \mid x_i; \beta_0, \beta) - \lambda \lVert \beta \rVert_1
= \max_{\beta_0,\beta}\ \left\{ \sum_{i=1}^{N} \bigl[ y_i(\beta_0 + \beta^T x_i) - \log\bigl(1 + e^{\beta_0 + \beta^T x_i}\bigr) \bigr]
  - \lambda \sum_{j=1}^{p} |\beta_j| \right\}.
\]
The resulting algorithm can be called an "iteratively reweighted lasso" algorithm.
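In practice the iteratively reweighted lasso is rarely coded by hand. A common shortcut, shown here as a sketch (it assumes scikit-learn; the toy data are mine), is LogisticRegression with an L1 penalty, where C plays the role of 1/λ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_beta = np.array([2.0, -1.5, 0, 0, 0, 0, 0, 0, 0, 0])           # sparse ground truth
y = (X @ true_beta + rng.logistic(size=200) > 0).astype(int)

clf = LogisticRegression(penalty="l1", C=0.5, solver="liblinear")   # C ~ 1/lambda
clf.fit(X, y)
print(clf.coef_)        # many coefficients are driven exactly to zero
print(clf.intercept_)
```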
On the analogy with Least Squares Fitting
L1-regularized LR (2)
Setting the gradient to zero, we get the same score equation as the lasso algorithm:
\[
\frac{\partial}{\partial \beta_j} \left\{ \sum_{i=1}^{N} \bigl[ y_i(\beta_0 + \beta^T x_i) - \log\bigl(1 + e^{\beta_0 + \beta^T x_i}\bigr) \bigr]
  - \lambda \sum_{j=1}^{p} |\beta_j| \right\} = 0
\]
\[
\therefore\ \sum_{i=1}^{N} x_{ij}\left( y_i - \frac{e^{\beta_0 + \beta^T x_i}}{1 + e^{\beta_0 + \beta^T x_i}} \right) - \lambda\cdot\mathrm{sign}(\beta_j) = 0
\qquad
\therefore\ x_j^T(y - p) = \lambda\cdot\mathrm{sign}(\beta_j) \quad (\text{where } \beta_j \neq 0)
\]
For comparison, the score equation of the lasso is
\[
x_j^T(y - X\beta) = \lambda\cdot\mathrm{sign}(\beta_j).
\]
On the analogy with Least Squares Fitting
L1-regularized LR (3)
Since the objective function is concave, the solution can be obtained using convex optimization techniques.
However, the coefficient profiles are not piecewise linear, and it is difficult to compute the path.
Predictor-corrector methods for convex optimization or coordinate descent algorithms will work in some situations.
On the analogy with Least Squares Fitting
Summary
LR is analogous to least squares fitting:
\[
\beta^{\text{new}} = (X^T W X)^{-1} X^T W z \ \longleftrightarrow\ \beta = (X^T X)^{-1} X^T y
\]
and...
•  LR requires an iterative algorithm because of the self-consistency
•  The weighted RSS can be seen as a χ-squared statistic or a deviance
•  The distribution of β̂ converges to N(β, (X^T W X)^{-1})
•  Rao's score test or the Wald test is useful for model selection
•  L1-regularized LR is analogous to the lasso except for the non-linearity
Today's topics
[ ]  Logistic regression (contd.)
[x]  On the analogy with Least Squares Fitting
[ ]  Logistic regression vs. LDA
[ ]  Separating Hyperplane
[ ]  Rosenblatt's Perceptron
[ ]  Optimal Hyperplane
Logistic regression vs. LDA
What is the difference?
LDA and logistic regression are very similar methods.
Let us study the characteristics of these methods through the differences in their formal aspects.
Logistic regression vs. LDA
Form of the log-odds
LDA:
\[
\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)}
  = \log\frac{\pi_k}{\pi_K} - \frac{1}{2}(\mu_k + \mu_K)^T \Sigma^{-1}(\mu_k - \mu_K)
    + x^T \Sigma^{-1}(\mu_k - \mu_K)
  = \alpha_{k0} + \alpha_k^T x.
\]
Logistic regression:
\[
\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x.
\]
→ Same form
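Since both models produce linear log-odds, their fitted boundaries can be compared directly. A small sketch (it assumes scikit-learn; the Gaussian toy data, which satisfy the LDA assumptions, are mine):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# two Gaussian classes with a shared (identity) covariance
X = np.vstack([rng.normal(loc=[0, 0], size=(200, 2)),
               rng.normal(loc=[2, 1], size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)
lr = LogisticRegression().fit(X, y)

# both give a linear decision boundary; the coefficients come out similar but not identical
print(lda.coef_, lda.intercept_)
print(lr.coef_, lr.intercept_)
```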
  
Logistic regression vs. LDA
Criteria of estimations
LDA:
\[
\max \sum_{i=1}^{N} \log \Pr(G = g_i, X = x_i)
= \max \sum_{i=1}^{N} \bigl[ \log \Pr(G = g_i \mid X = x_i) + \log \Pr(X = x_i) \bigr]
\]
Logistic regression:
\[
\max \sum_{i=1}^{N} \log \Pr(G = g_i \mid X = x_i)
\]
The log Pr(X = x_i) term is the marginal likelihood, which LDA uses and LR ignores.
Logistic regression vs. LDA
Form of the Pr(X)
LDA: a Gaussian mixture that involves the parameters,
\[
\Pr(X) = \sum_{k=1}^{K} \pi_k\, \phi(X; \mu_k, \Sigma).
\]
Logistic regression: Pr(X) is arbitrary.
Logistic regression vs. LDA
Effects of the difference (1)
How do these formal differences affect the character of the algorithms?
Logistic regression vs. LDA
Effects of the difference (2)
The assumption of Gaussian, homoscedastic classes can be a strong constraint, which leads to lower variance.
In addition, LDA has the advantage that it can make use of unlabelled observations; i.e. semi-supervised learning is available.
On the other hand, LDA can be affected by outliers.
Logistic regression vs. LDA
Effects of the difference (3)
With linearly separable data,
•  The coefficients of LDA are well defined, but training errors may occur.
•  The coefficients of LR can go to infinity, but a true separating hyperplane can be found.
Do not think too much about training error; what matters is generalization error.
Logistic regression vs. LDA
Effects of the difference (4)
The assumptions for LDA rarely hold in practice.
Nevertheless, it is known empirically that these models give quite similar results, even when LDA is used inappropriately, say with qualitative variables.
After all, if the Gaussian assumption looks plausible, use LDA; otherwise, use logistic regression.
Today's topics
[x]  Logistic regression (contd.)
[x]  On the analogy with Least Squares Fitting
[x]  Logistic regression vs. LDA
[ ]  Separating Hyperplane
[ ]  Rosenblatt's Perceptron
[ ]  Optimal Hyperplane
Separating Hyperplane: Overview
Another way of Classification
Both LDA and LR do classification through probabilities, using regression models.
Classification can be done in a more explicit way: by modelling the decision boundary directly.
Separating Hyperplane: Overview
Properties of vector algebra
Let L be the affine set defined by
\[
\beta_0 + \beta^T x = 0,
\]
and the signed distance from x to L is
\[
d_{\pm}(x, L) = \frac{1}{\lVert \beta \rVert}\,(\beta^T x + \beta_0).
\]
\[
\beta_0 + \beta^T x > 0 \ \Leftrightarrow\ x \text{ is above } L, \qquad
\beta_0 + \beta^T x = 0 \ \Leftrightarrow\ x \text{ is on } L, \qquad
\beta_0 + \beta^T x < 0 \ \Leftrightarrow\ x \text{ is below } L.
\]
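A one-line NumPy check of the signed-distance formula (a sketch; the hyperplane coefficients below are arbitrary):

```python
import numpy as np

beta0, beta = 1.0, np.array([3.0, 4.0])            # hyperplane 1 + 3*x1 + 4*x2 = 0

def signed_distance(x, beta, beta0):
    """Signed distance from point x to the hyperplane beta0 + beta^T x = 0."""
    return (x @ beta + beta0) / np.linalg.norm(beta)

print(signed_distance(np.array([1.0, 1.0]), beta, beta0))    #  1.6, above the plane
print(signed_distance(np.array([-1.0, -1.0]), beta, beta0))  # -1.2, below the plane
```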
Rosenblatt's Perceptron
Learning Criteria
The basic criterion of Rosenblatt's perceptron learning algorithm is to reduce (where M is the set of misclassified data)
\[
D(\beta, \beta_0) = \sum_{i \in M} \bigl| x_i^T \beta + \beta_0 \bigr|
  \ \propto\ \sum_{i \in M} \bigl| d_{\pm}(x_i, L) \bigr|
  = - \sum_{i \in M} y_i (x_i^T \beta + \beta_0).
\]
If y_i = 1 is misclassified as −1, then (x_i^T β + β_0) is negative; if y_i = −1 is misclassified as 1, it is positive. In both cases y_i(x_i^T β + β_0) < 0, so each misclassified point contributes a positive amount to D.
Rosenblatt's Perceptron
Learning Algorithm (1)
Instead of reducing D by batch learning, a "stochastic" gradient descent algorithm is adopted.
The coefficients are updated for each misclassified observation, like online learning.
Observations classified correctly do not affect the parameters, so the algorithm is robust to outliers.
Thus, the coefficients are updated based not on D but on a single
\[
D_i(\beta, \beta_0) = -y_i(x_i^T \beta + \beta_0).
\]
Rosenblatt's Perceptron
Learning Algorithm (2)
The algorithm proceeds as follows:
1.  Take one observation x_i and classify it.
2.  If the classification was wrong, update the coefficients:
\[
\frac{\partial D_i(\beta, \beta_0)}{\partial \beta} = -y_i x_i, \qquad
\frac{\partial D_i(\beta, \beta_0)}{\partial \beta_0} = -y_i,
\qquad \therefore\
\begin{pmatrix} \beta \\ \beta_0 \end{pmatrix}
\leftarrow
\begin{pmatrix} \beta \\ \beta_0 \end{pmatrix}
+ \rho \begin{pmatrix} y_i x_i \\ y_i \end{pmatrix}.
\]
Here ρ is the learning rate, which can be set to 1 without loss of generality.
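The update rule translates directly into a loop. Below is a minimal sketch of the perceptron (mine, not from the slides), with labels in {−1, +1} and ρ fixed to 1 by default:

```python
import numpy as np

def perceptron(X, y, max_epochs=1000, rho=1.0):
    """Rosenblatt's perceptron. X: (N, p) inputs, y: (N,) labels in {-1, +1}."""
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(max_epochs):
        n_mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ beta + beta0) <= 0:      # misclassified (or on the boundary)
                beta += rho * yi * xi              # beta   <- beta   + rho * y_i * x_i
                beta0 += rho * yi                  # beta_0 <- beta_0 + rho * y_i
                n_mistakes += 1
        if n_mistakes == 0:                        # a full pass with no mistakes: done
            break
    return beta, beta0
```

If the data are not linearly separable, the inner condition keeps firing and the loop only stops at max_epochs, which mirrors the convergence discussion that follows.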
  
Rosenblatt's Perceptron
Learning Algorithm (3)
Updating the parameters may lead to misclassification of other, previously correctly classified observations.
Therefore, although each update reduces the corresponding D_i, it can increase the total D.
Rosenblatt's Perceptron
Convergence Theorem
If the data are linearly separable, perceptron learning terminates in a finite number of steps.
Otherwise, learning never terminates.
However, in practice it is difficult to tell whether
•  the data are not linearly separable and the algorithm will never converge, or
•  the data are linearly separable but convergence is just time-consuming.
In addition, the solution is not unique; it depends on the initial values and the order of the data.
Today's topics
[x]  Logistic regression (contd.)
[x]  On the analogy with Least Squares Fitting
[x]  Logistic regression vs. LDA
[ ]  Separating Hyperplane
[x]  Rosenblatt's Perceptron
[ ]  Optimal Hyperplane
Optimal Hyperplane
Derivation of KKT cond. (1)
This section could be hard for some of the audience.
To make the story a bit clearer, let us study optimization problems in general. The theme is:
duality and the KKT conditions for optimization problems.
Optimal Hyperplane
Derivation of KKT cond. (2)
Suppose we have an optimization problem:
\[
\text{minimize } f(x) \quad \text{subject to } g_i(x) \le 0,
\]
and let the feasible region be
\[
C = \{\, x \mid g_i(x) \le 0 \,\}.
\]
Optimal Hyperplane
Derivation of KKT cond. (3)
In the field of optimization, relaxation is a technique often used to make a problem easier.
Lagrangian relaxation, shown below, is one example:
\[
\text{minimize } L(x, y) = f(x) + \sum_i y_i g_i(x) \quad \text{subject to } y_i \ge 0.
\]
Optimal Hyperplane
Derivation of KKT cond. (4)
Concerning L(x, y), the following inequality holds:
\[
\min_{x \in C} f(x) = \min_{x} \sup_{y \ge 0} L(x, y) \ \ge\ \max_{y \ge 0} \inf_{x} L(x, y),
\]
and the equality requires y_i or g_i(x) to be equal to zero for all i (this condition is called "complementary slackness").
According to the inequality, maximizing inf_x L(x, y) gives us a lower bound for the original problem.
Optimal Hyperplane
Derivation of KKT cond. (5)
Therefore, we have the following maximization problem:
\[
\text{maximize } L(x, y) \quad \text{subject to } \frac{\partial}{\partial x} L(x, y) = 0,\ \ y \ge 0,
\]
where the first constraint is the condition to achieve inf_x L(x, y).
This is called the "Wolfe dual problem", and strong duality theory says that the solutions of the primal and dual problems are equivalent.
Optimal Hyperplane
Derivation of KKT cond. (6)
Thus, the optimal solution must satisfy all the conditions so far. Together they are called the "KKT conditions":
\[
\begin{aligned}
& g_i(x) \le 0 && \text{(primal constraint)} \\
& \frac{\partial}{\partial x} L(x, y) = 0 && \text{(stationarity condition)} \\
& y_i \ge 0 && \text{(dual constraint)} \\
& y_i g_i(x) = 0 && \text{(complementary slackness)}
\end{aligned}
\]
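As a tiny concrete example (mine; it uses scipy and a toy problem): minimize f(x) = (x₁ − 2)² + (x₂ − 1)² subject to g(x) = x₁ + x₂ − 2 ≤ 0. The analytic solution is x = (1.5, 0.5) with multiplier y = 1, and the four KKT conditions can be checked numerically:

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2) ** 2 + (x[1] - 1) ** 2
g = lambda x: x[0] + x[1] - 2                      # constraint g(x) <= 0

# scipy's "ineq" convention is fun(x) >= 0, so pass -g
res = minimize(f, x0=np.zeros(2), constraints=[{"type": "ineq", "fun": lambda x: -g(x)}])
x = res.x                                          # approximately (1.5, 0.5)

y_mult = 1.0                                       # analytic multiplier for this toy problem
grad_L = np.array([2 * (x[0] - 2) + y_mult,        # gradient of L(x, y) = f(x) + y * g(x)
                   2 * (x[1] - 1) + y_mult])

print(g(x) <= 1e-6)                                # primal constraint
print(np.allclose(grad_L, 0.0, atol=1e-4))         # stationarity condition
print(y_mult >= 0)                                 # dual constraint
print(abs(y_mult * g(x)) < 1e-6)                   # complementary slackness
```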
  
Optimal Hyperplane
KKT for Opt. Hyperplane (1)
We have learned about the KKT conditions.
Now, let us get back to the original problem: finding the optimal hyperplane.
Optimal Hyperplane
KKT for Opt. Hyperplane (2)
The original fitting criterion of the optimal hyperplane is a generalization of the perceptron:
\[
\max_{\beta, \beta_0} M \quad \text{subject to } \lVert \beta \rVert = 1,\ \
y_i(x_i^T \beta + \beta_0) \ge M \ \ (i = 1, \dots, N).
\]
The criterion of maximizing the margin is theoretically supported without making assumptions on the distributions.
Optimal Hyperplane
KKT for Opt. Hyperplane (3)
This is a kind of mini-max problem, which is difficult to solve, so we convert it into an easier problem:
\[
\min_{\beta, \beta_0}\ \frac{1}{2}\lVert \beta \rVert^2 \quad \text{subject to }
y_i(x_i^T \beta + \beta_0) \ge 1 \ \ (i = 1, \dots, N).
\]
(See hand-out for the detailed transformation)
This is a quadratic programming problem.
Optimal Hyperplane
KKT for Opt. Hyperplane (4)
To make use of the KKT conditions, let us turn the objective function into a Lagrange function:
\[
L_P = \frac{1}{2}\lVert \beta \rVert^2 - \sum_{i=1}^{N} \alpha_i \bigl[ y_i(x_i^T \beta + \beta_0) - 1 \bigr].
\]
Optimal Hyperplane
KKT for Opt. Hyperplane (5)
Thus, the KKT conditions are:
\[
\begin{aligned}
& y_i(x_i^T \beta + \beta_0) \ge 1 \ \ (i = 1, \dots, N), \\
& \beta = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad 0 = \sum_{i=1}^{N} \alpha_i y_i, \\
& \alpha_i \ge 0 \ \ (i = 1, \dots, N), \\
& \alpha_i \bigl[ y_i(x_i^T \beta + \beta_0) - 1 \bigr] = 0 \ \ (i = 1, \dots, N).
\end{aligned}
\]
The solution is obtained by solving these.
Optimal Hyperplane
Support points (1)
The KKT conditions tell us
\[
\alpha_i > 0 \ \Leftrightarrow\ y_i(x_i^T \beta + \beta_0) = 1 \ \Leftrightarrow\ x_i \text{ is on the edge of the slab},
\]
\[
\alpha_i = 0 \ \Leftrightarrow\ y_i(x_i^T \beta + \beta_0) > 1 \ \Leftrightarrow\ x_i \text{ is off the edge of the slab}.
\]
Those points on the edge of the slab are called "support points" (or "support vectors").
Optimal Hyperplane
Support points (2)
β can be written as a linear combination of the support points:
\[
\beta = \sum_{i=1}^{N} \alpha_i y_i x_i = \sum_{i \in S} \alpha_i y_i x_i,
\]
where S is the set of indices of the support points.
Optimal Hyperplane
Support points (3)
β_0 can be obtained after β is obtained. For i ∈ S,
\[
y_i(x_i^T \beta + \beta_0) = 1
\qquad \therefore\ \beta_0 = 1/y_i - \beta^T x_i = y_i - \sum_{j \in S} \alpha_j y_j x_j^T x_i
\]
(note that 1/y_i = y_i since y_i ∈ {−1, +1}), and therefore
\[
\beta_0 = \frac{1}{|S|} \sum_{i \in S} \Bigl( y_i - \sum_{j \in S} \alpha_j y_j x_j^T x_i \Bigr).
\]
The average over S is taken to avoid numerical error.
Optimal Hyperplane
Support points (4)
All coefficients are defined only through the support points:
\[
\beta = \sum_{i \in S} \alpha_i y_i x_i, \qquad
\beta_0 = \frac{1}{|S|} \sum_{i \in S} \Bigl( y_i - \sum_{j \in S} \alpha_j y_j x_j^T x_i \Bigr);
\]
thus, the solution is robust to outliers.
However, do not forget that which points become support points is determined using all the data points.
αjyjj

Today's topics
[x]  Logistic regression (contd.)
[x]  On the analogy with Least Squares Fitting
[x]  Logistic regression vs. LDA
[x]  Separating Hyperplane
[x]  Rosenblatt's Perceptron
[x]  Optimal Hyperplane
Summary

| | LDA | Logistic Regression | Perceptron | Optimal Hyperplane |
| With linearly separable data | Training error may occur | True separator found, but coefficients may be infinite | True separator found, but not unique | Best separator found |
| With non-linearly separable data | Works well | Works well | Algorithm never stops | Not feasible |
| With outliers | Not robust | Robust | Robust | Robust |

Más contenido relacionado

Destacado

1.5.1 measures basic concepts
1.5.1 measures basic concepts1.5.1 measures basic concepts
1.5.1 measures basic concepts
A M
 
(마더세이프 라운드) Logistic regression
(마더세이프 라운드) Logistic regression(마더세이프 라운드) Logistic regression
(마더세이프 라운드) Logistic regression
mothersafe
 

Destacado (20)

Everyday English Advanced
Everyday English AdvancedEveryday English Advanced
Everyday English Advanced
 
PRML 2.4-2.5: The Exponential Family & Nonparametric Methods
PRML 2.4-2.5: The Exponential Family & Nonparametric MethodsPRML 2.4-2.5: The Exponential Family & Nonparametric Methods
PRML 2.4-2.5: The Exponential Family & Nonparametric Methods
 
Everyday English: Cooking Instructions
Everyday English: Cooking InstructionsEveryday English: Cooking Instructions
Everyday English: Cooking Instructions
 
Opinions & Debats SNCF: a case study
Opinions & Debats SNCF: a case studyOpinions & Debats SNCF: a case study
Opinions & Debats SNCF: a case study
 
ESL 17.3.2-17.4: Graphical Lasso and Boltzmann Machines
ESL 17.3.2-17.4: Graphical Lasso and Boltzmann MachinesESL 17.3.2-17.4: Graphical Lasso and Boltzmann Machines
ESL 17.3.2-17.4: Graphical Lasso and Boltzmann Machines
 
MLaPP 2章 「確率」(前編)
MLaPP 2章 「確率」(前編)MLaPP 2章 「確率」(前編)
MLaPP 2章 「確率」(前編)
 
NIPS 2016 輪読: Supervised Word Movers Distance
NIPS 2016 輪読: Supervised Word Movers DistanceNIPS 2016 輪読: Supervised Word Movers Distance
NIPS 2016 輪読: Supervised Word Movers Distance
 
PRML 9.1-9.2: K-means Clustering & Mixtures of Gaussians
PRML 9.1-9.2: K-means Clustering & Mixtures of GaussiansPRML 9.1-9.2: K-means Clustering & Mixtures of Gaussians
PRML 9.1-9.2: K-means Clustering & Mixtures of Gaussians
 
Support vector machines
Support vector machinesSupport vector machines
Support vector machines
 
PRML 13.2.2: The Forward-Backward Algorithm
PRML 13.2.2: The Forward-Backward AlgorithmPRML 13.2.2: The Forward-Backward Algorithm
PRML 13.2.2: The Forward-Backward Algorithm
 
Everyday English: Driving Vocabulary Review
Everyday English: Driving Vocabulary ReviewEveryday English: Driving Vocabulary Review
Everyday English: Driving Vocabulary Review
 
English: Everyday Expressions
English: Everyday Expressions English: Everyday Expressions
English: Everyday Expressions
 
Boosted Tree-based Multinomial Logit Model for Aggregated Market Data
Boosted Tree-based Multinomial Logit Model for Aggregated Market DataBoosted Tree-based Multinomial Logit Model for Aggregated Market Data
Boosted Tree-based Multinomial Logit Model for Aggregated Market Data
 
Everyday English: Rooms
Everyday English: Rooms Everyday English: Rooms
Everyday English: Rooms
 
Ordinal Logistic Regression
Ordinal Logistic RegressionOrdinal Logistic Regression
Ordinal Logistic Regression
 
Logistic Regression/Markov Chain presentation
Logistic Regression/Markov Chain presentationLogistic Regression/Markov Chain presentation
Logistic Regression/Markov Chain presentation
 
Transparency7
Transparency7Transparency7
Transparency7
 
1.5.1 measures basic concepts
1.5.1 measures basic concepts1.5.1 measures basic concepts
1.5.1 measures basic concepts
 
Everyday English: Cell phones, slang and abbreviations
Everyday English: Cell phones, slang and abbreviationsEveryday English: Cell phones, slang and abbreviations
Everyday English: Cell phones, slang and abbreviations
 
(마더세이프 라운드) Logistic regression
(마더세이프 라운드) Logistic regression(마더세이프 라운드) Logistic regression
(마더세이프 라운드) Logistic regression
 

Similar a ESL 4.4.3-4.5: Logistic Reression (contd.) and Separating Hyperplane

SMB_2012_HR_VAN_ST-last version
SMB_2012_HR_VAN_ST-last versionSMB_2012_HR_VAN_ST-last version
SMB_2012_HR_VAN_ST-last version
Lilyana Vankova
 
DissertationSlides169
DissertationSlides169DissertationSlides169
DissertationSlides169
Ryan White
 

Similar a ESL 4.4.3-4.5: Logistic Reression (contd.) and Separating Hyperplane (20)

5 cramer-rao lower bound
5 cramer-rao lower bound5 cramer-rao lower bound
5 cramer-rao lower bound
 
Introducing Zap Q-Learning
Introducing Zap Q-Learning   Introducing Zap Q-Learning
Introducing Zap Q-Learning
 
A Proof of the Generalized Riemann Hypothesis
A Proof of the Generalized Riemann HypothesisA Proof of the Generalized Riemann Hypothesis
A Proof of the Generalized Riemann Hypothesis
 
A Proof of the Generalized Riemann Hypothesis
A Proof of the Generalized Riemann HypothesisA Proof of the Generalized Riemann Hypothesis
A Proof of the Generalized Riemann Hypothesis
 
Nonconvex Compressed Sensing with the Sum-of-Squares Method
Nonconvex Compressed Sensing with the Sum-of-Squares MethodNonconvex Compressed Sensing with the Sum-of-Squares Method
Nonconvex Compressed Sensing with the Sum-of-Squares Method
 
LPS talk notes
LPS talk notesLPS talk notes
LPS talk notes
 
A New Approach on Proportional Fuzzy Likelihood Ratio orderings of Triangular...
A New Approach on Proportional Fuzzy Likelihood Ratio orderings of Triangular...A New Approach on Proportional Fuzzy Likelihood Ratio orderings of Triangular...
A New Approach on Proportional Fuzzy Likelihood Ratio orderings of Triangular...
 
D. Vulcanov - On Cosmologies with non-Minimally Coupled Scalar Field and the ...
D. Vulcanov - On Cosmologies with non-Minimally Coupled Scalar Field and the ...D. Vulcanov - On Cosmologies with non-Minimally Coupled Scalar Field and the ...
D. Vulcanov - On Cosmologies with non-Minimally Coupled Scalar Field and the ...
 
Nonparametric approach to multiple regression
Nonparametric approach to multiple regressionNonparametric approach to multiple regression
Nonparametric approach to multiple regression
 
Matrix calculus
Matrix calculusMatrix calculus
Matrix calculus
 
S 7
S 7S 7
S 7
 
Some Examples of Scaling Sets
Some Examples of Scaling SetsSome Examples of Scaling Sets
Some Examples of Scaling Sets
 
SMB_2012_HR_VAN_ST-last version
SMB_2012_HR_VAN_ST-last versionSMB_2012_HR_VAN_ST-last version
SMB_2012_HR_VAN_ST-last version
 
DissertationSlides169
DissertationSlides169DissertationSlides169
DissertationSlides169
 
M. Dimitrijević, Noncommutative models of gauge and gravity theories
M. Dimitrijević, Noncommutative models of gauge and gravity theoriesM. Dimitrijević, Noncommutative models of gauge and gravity theories
M. Dimitrijević, Noncommutative models of gauge and gravity theories
 
SOCG: Linear-Size Approximations to the Vietoris-Rips Filtration
SOCG: Linear-Size Approximations to the Vietoris-Rips FiltrationSOCG: Linear-Size Approximations to the Vietoris-Rips Filtration
SOCG: Linear-Size Approximations to the Vietoris-Rips Filtration
 
3 recursive bayesian estimation
3  recursive bayesian estimation3  recursive bayesian estimation
3 recursive bayesian estimation
 
1110 ch 11 day 10
1110 ch 11 day 101110 ch 11 day 10
1110 ch 11 day 10
 
2 random variables notes 2p3
2 random variables notes 2p32 random variables notes 2p3
2 random variables notes 2p3
 
Otter 2016-11-28-01-ss
Otter 2016-11-28-01-ssOtter 2016-11-28-01-ss
Otter 2016-11-28-01-ss
 

Último

Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
HyderabadDolls
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 

Último (20)

Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 

ESL 4.4.3-4.5: Logistic Reression (contd.) and Separating Hyperplane

  • 1. ESL 4.4.3-4.5 Logistic Regression (contd.) & Separating Hyperplane June 8, 2015 Talk by Shinichi TAMURA Mathematical Informatics Lab @ NAIST
  • 2. Today's topics ¨  Logistic regression (contd.)" ¨  On the analogy with Least Squares Fitting" ¨  Logistic regression vs. LDA" ¨  Separating Hyperplane" ¨  Rosenblatt's Perceptron" ¨  Optimal Hyperplane"
  • 3. Today's topics ¨  Logistic regression (contd.)" ¨  On the analogy with Least Squares Fitting" ¨  Logistic regression vs. LDA" ¨  Separating Hyperplane" ¨  Rosenblatt's Perceptron" ¨  Optimal Hyperplane"
  • 4. Today's topics ¨  Logistic regression (contd.)" ¨  On the analogy with Least Squares Fitting" ¨  Logistic regression vs. LDA" ¨  Separating Hyperplane" ¨  Rosenblatt's Perceptron" ¨  Optimal Hyperplane"
  • 5. Today's topics ¨  Logistic regression (contd.)" ¨  On the analogy with Least Squares Fitting" ¨  Logistic regression vs. LDA" ¨  Separating Hyperplane" ¨  Rosenblatt's Perceptron" ¨  Optimal Hyperplane"
  • 6. On the analogy with Least Squares Fitting [Review] Fitting LR Model Parameters are fitted by ML estimation, using Newton-Raphson algorithm:" " " " βnew ← rg min β (z − Xβ) W(z − Xβ) βnew =(X WX)−1 X Wz.
  • 7. On the analogy with Least Squares Fitting [Review] Fitting LR Model Parameters are fitted by ML estimation, using Newton-Raphson algorithm:" " " " It looks like least squares fitting:" βnew ← rg min β (z − Xβ) W(z − Xβ) βnew =(X WX)−1 X Wz. β ← rg min β (y − Xβ) (y − Xβ) β =(X X)−1 X y
  • 8. On the analogy with Least Squares Fitting Self-consistency β depends on W and z, while W and z depend on β." " " " " " βnew ← rg min β (z − Xβ) W(z − Xβ) βnew =(X WX)−1 X Wz.
  • 9. On the analogy with Least Squares Fitting Self-consistency β depends on W and z, while W and z depend on β." " " " " " βnew ← rg min β (z − Xβ) W(z − Xβ) βnew =(X WX)−1 X Wz. z = ˆβ + y − ˆp ˆp(1 − ˆp)  = ˆp(1 − ˆp).
  • 10. On the analogy with Least Squares Fitting Self-consistency β depends on W and z, while W and z depend on β." " " " " → “self-consistent” equation, needs iterative method to solve" " βnew ← rg min β (z − Xβ) W(z − Xβ) βnew =(X WX)−1 X Wz.
  • 11. On the analogy with Least Squares Fitting Meaning of Weighted RSS (1) RSS is used to check the goodness of fit in least squares fitting." " " N =1 (y − ˆp)2
  • 12. On the analogy with Least Squares Fitting Meaning of Weighted RSS (1) RSS is used to check the goodness of fit in least squares fitting." " " " How about weighted RSS in logistic regression?" N =1 (y − ˆp)2 N =1 (y − ˆp)2 ˆp(1 − ˆp)
  • 13. On the analogy with Least Squares Fitting Meaning of Weighted RSS (2) Weighted RSS is interpreted as..." Peason's χ-squared statistics" χ2 = N =1 (y − ˆp)2 ˆp + (y − ˆp)2 1 − ˆp = N =1 (1 − ˆp + ˆp)(y − ˆp)2 ˆp(1 − ˆp) = N =1 (y − ˆp)2 ˆp(1 − ˆp) .
  • 14. On the analogy with Least Squares Fitting Meaning of Weighted RSS (2) Weighted RSS is interpreted as..." Peason's χ-squared statistics" χ2 = N =1 (y − ˆp)2 ˆp + (y − ˆp)2 1 − ˆp = N =1 (1 − ˆp + ˆp)(y − ˆp)2 ˆp(1 − ˆp) = N =1 (y − ˆp)2 ˆp(1 − ˆp) .
  • 15. On the analogy with Least Squares Fitting Meaning of Weighted RSS (2) Weighted RSS is interpreted as..." Peason's χ-squared statistics" χ2 = N =1 (y − ˆp)2 ˆp + (y − ˆp)2 1 − ˆp = N =1 (1 − ˆp + ˆp)(y − ˆp)2 ˆp(1 − ˆp) = N =1 (y − ˆp)2 ˆp(1 − ˆp) .
  • 16. On the analogy with Least Squares Fitting Meaning of Weighted RSS (3) or... as quadratic approximation of Deviance" D = −2 N =1 [y log ˆp + (1 − y) log(1 − ˆp)] − N =1 [y log y + (1 − y) log(1 − y)] = 2 N =1 y log y ˆp + (1 − y) log 1 − y 1 − ˆp ≈ 2 N =1 (y − ˆp) + (y − ˆp)2 2 ˆp + {(1 − y) − (1 − ˆp)} + {(1 − y) − (1 − ˆp)}2 2(1 − ˆp) = N =1 (y − ˆp)2 ˆp + (y − ˆp)2 1 − ˆp = N =1 (y − ˆp)2 ˆp(1 − ˆp) .
  • 17. On the analogy with Least Squares Fitting Meaning of Weighted RSS (3) or... as quadratic approximation of Deviance" D = −2 N =1 [y log ˆp + (1 − y) log(1 − ˆp)] − N =1 [y log y + (1 − y) log(1 − y)] = 2 N =1 y log y ˆp + (1 − y) log 1 − y 1 − ˆp ≈ 2 N =1 (y − ˆp) + (y − ˆp)2 2 ˆp + {(1 − y) − (1 − ˆp)} + {(1 − y) − (1 − ˆp)}2 2(1 − ˆp) = N =1 (y − ˆp)2 ˆp + (y − ˆp)2 1 − ˆp = N =1 (y − ˆp)2 ˆp(1 − ˆp) . Maximum  likelihood   of  the  model  
  • 18. On the analogy with Least Squares Fitting Meaning of Weighted RSS (3) or... as quadratic approximation of Deviance" D = −2 N =1 [y log ˆp + (1 − y) log(1 − ˆp)] − N =1 [y log y + (1 − y) log(1 − y)] = 2 N =1 y log y ˆp + (1 − y) log 1 − y 1 − ˆp ≈ 2 N =1 (y − ˆp) + (y − ˆp)2 2 ˆp + {(1 − y) − (1 − ˆp)} + {(1 − y) − (1 − ˆp)}2 2(1 − ˆp) = N =1 (y − ˆp)2 ˆp + (y − ˆp)2 1 − ˆp = N =1 (y − ˆp)2 ˆp(1 − ˆp) . Maximum  likelihood   of  the  model   Likelihood  of  the  full  model   which  achieve  perfect  fitting  
  • 19. On the analogy with Least Squares Fitting Meaning of Weighted RSS (3) or... as quadratic approximation of Deviance" D = −2 N =1 [y log ˆp + (1 − y) log(1 − ˆp)] − N =1 [y log y + (1 − y) log(1 − y)] = 2 N =1 y log y ˆp + (1 − y) log 1 − y 1 − ˆp ≈ 2 N =1 (y − ˆp) + (y − ˆp)2 2 ˆp + {(1 − y) − (1 − ˆp)} + {(1 − y) − (1 − ˆp)}2 2(1 − ˆp) = N =1 (y − ˆp)2 ˆp + (y − ˆp)2 1 − ˆp = N =1 (y − ˆp)2 ˆp(1 − ˆp) . 00  
  • 20. On the analogy with Least Squares Fitting Meaning of Weighted RSS (3) or... as quadratic approximation of Deviance" D = −2 N =1 [y log ˆp + (1 − y) log(1 − ˆp)] − N =1 [y log y + (1 − y) log(1 − y)] = 2 N =1 y log y ˆp + (1 − y) log 1 − y 1 − ˆp ≈ 2 N =1 (y − ˆp) + (y − ˆp)2 2 ˆp + {(1 − y) − (1 − ˆp)} + {(1 − y) − (1 − ˆp)}2 2(1 − ˆp) = N =1 (y − ˆp)2 ˆp + (y − ˆp)2 1 − ˆp = N =1 (y − ˆp)2 ˆp(1 − ˆp) . 00    log   = ( − ) + ( − )2 2 − ( − )3 62 + · · ·
  • 21. On the analogy with Least Squares Fitting Meaning of Weighted RSS (3) or... as quadratic approximation of Deviance" D = −2 N =1 [y log ˆp + (1 − y) log(1 − ˆp)] − N =1 [y log y + (1 − y) log(1 − y)] = 2 N =1 y log y ˆp + (1 − y) log 1 − y 1 − ˆp ≈ 2 N =1 (y − ˆp) + (y − ˆp)2 2 ˆp + {(1 − y) − (1 − ˆp)} + {(1 − y) − (1 − ˆp)}2 2(1 − ˆp) = N =1 (y − ˆp)2 ˆp + (y − ˆp)2 1 − ˆp = N =1 (y − ˆp)2 ˆp(1 − ˆp) . 00  
  • 22. On the analogy with Least Squares Fitting Meaning of Weighted RSS (3) or... as quadratic approximation of Deviance" D = −2 N =1 [y log ˆp + (1 − y) log(1 − ˆp)] − N =1 [y log y + (1 − y) log(1 − y)] = 2 N =1 y log y ˆp + (1 − y) log 1 − y 1 − ˆp ≈ 2 N =1 (y − ˆp) + (y − ˆp)2 2 ˆp + {(1 − y) − (1 − ˆp)} + {(1 − y) − (1 − ˆp)}2 2(1 − ˆp) = N =1 (y − ˆp)2 ˆp + (y − ˆp)2 1 − ˆp = N =1 (y − ˆp)2 ˆp(1 − ˆp) .
  • 23. On the analogy with Least Squares Fitting Asymp. distribution of The distribution of converges to "N β, (X WX)−1ˆβ ˆβ
  • 24. On the analogy with Least Squares Fitting Asymp. distribution of The distribution of converges to " (See hand-out for the details)" N β, (X WX)−1 y i.i.d. ∼ Bern(Pr(; β)). ∴ E[y] = p, vr[y] = W. ∴ E ˆβ = E (X WX)−1 X Wz = (X WX)−1 X WE Xβ + W−1 (y − p) = (X WX)−1 X WXβ = β, vr ˆβ = (X WX)−1 X Wvr Xβ + W−1 (y − p) W X(X WX)− = (X WX)−1 X W(W−1 WW− )W X(X WX)− = (X WX)−1 . ˆβ ˆβ
• 25-27. On the analogy with Least Squares Fitting - Tests of models for LR
Once a model is fitted, the Wald test or Rao's score test can be used to decide which term to drop or add; they need no recalculation of IRLS.
- Rao's score test: a test based on the gradient of the log-likelihood.
- Wald test: a test based on the difference of the parameter estimate from zero.
(Figure from: "Statistics 111: Introduction to Theoretical Statistics", lecture note by Kevin Andrew Rader, Harvard College GSAS, http://isites.harvard.edu/icb/icb.do?keyword=k101665&pageid=icb.page651024)
• 28-30. On the analogy with Least Squares Fitting - L1-regularized LR (1)
Just as with the lasso, an L1 regularizer is effective for LR. Here the objective function becomes
$$
\max_{\beta_0,\beta}\left\{\sum_{i=1}^{N}\left[y_i(\beta_0+\beta^Tx_i)-\log\!\left(1+e^{\beta_0+\beta^Tx_i}\right)\right]-\lambda\sum_{j=1}^{p}|\beta_j|\right\}.
$$
The resulting algorithm can be called an "iterative reweighted lasso" algorithm.
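For a practical sketch (the tooling is an assumption, not part of the slides): scikit-learn's LogisticRegression fits this kind of L1-penalized objective, with the penalty expressed through C, roughly the inverse of λ. Synthetic data are used here for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical data with a sparse true coefficient vector
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
beta_true = np.array([2.0, -1.5, 0, 0, 1.0, 0, 0, 0, 0, 0])
p = 1 / (1 + np.exp(-(X @ beta_true)))
y = rng.binomial(1, p)

# L1-penalized logistic regression; C is (roughly) the inverse of lambda
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)
print(clf.coef_)   # many coefficients are driven exactly to zero
```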
• 31-34. On the analogy with Least Squares Fitting - L1-regularized LR (2)
Setting the gradient to zero, we get the same form of score equation as in the lasso:
$$
\frac{\partial}{\partial\beta_j}\left\{\sum_{i=1}^{N}\left[y_i(\beta_0+\beta^Tx_i)-\log\!\left(1+e^{\beta_0+\beta^Tx_i}\right)\right]-\lambda\sum_{j=1}^{p}|\beta_j|\right\}=0
$$
$$
\therefore\ \sum_{i=1}^{N}x_{ij}\!\left(y_i-\frac{e^{\beta_0+\beta^Tx_i}}{1+e^{\beta_0+\beta^Tx_i}}\right)-\lambda\cdot\mathrm{sign}(\beta_j)=0
\qquad\therefore\ x_j^T(y-p)=\lambda\cdot\mathrm{sign}(\beta_j)\quad(\text{for }\beta_j\neq 0).
$$
For comparison, the score equation of the lasso is $x_j^T(y-X\beta)=\lambda\cdot\mathrm{sign}(\beta_j)$.
• 35-36. On the analogy with Least Squares Fitting - L1-regularized LR (3)
Since the objective function is concave, the solution can be obtained using convex optimization techniques. However, the coefficient profiles are not piecewise linear in λ, so it is difficult to obtain the path. Predictor-corrector methods for convex optimization, or coordinate descent algorithms, work in some situations.
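To see the non-linearity of the profiles, one can simply refit over a grid of penalties; a sketch on synthetic data (scikit-learn's C ≈ 1/λ parameterization assumed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic data (assumed for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = rng.binomial(1, 1 / (1 + np.exp(-(2 * X[:, 0] - 1.5 * X[:, 1]))))

# coefficient profiles over a grid of penalties; unlike the lasso,
# the profiles are not piecewise linear in the penalty parameter
Cs = np.logspace(-2, 2, 25)          # C is roughly 1/lambda in scikit-learn
path = np.array([
    LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y).coef_.ravel()
    for C in Cs
])
print(path.shape)                    # (25, 8): one row of coefficients per C
```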
• 37. On the analogy with Least Squares Fitting - Summary
LR is analogous to least squares fitting,
$$\beta^{\text{new}}=(X^TWX)^{-1}X^TWz \quad\leftrightarrow\quad \beta=(X^TX)^{-1}X^Ty,$$
and...
- LR requires an iterative algorithm because of the self-consistency.
- The weighted RSS can be seen as a χ² statistic or as the deviance.
- The distribution of $\hat\beta$ converges to $N\!\left(\beta,\ (X^TWX)^{-1}\right)$.
- Rao's score test or the Wald test is useful for model selection.
- L1-regularized LR is analogous to the lasso, except for the non-linearity.
• 39. Today's topics
  ☐ Logistic regression (contd.)
    ☑ On the analogy with Least Squares Fitting
    ☐ Logistic regression vs. LDA
  ☐ Separating Hyperplane
    ☐ Rosenblatt's Perceptron
    ☐ Optimal Hyperplane
• 40. Logistic regression vs. LDA - What is the difference?
LDA and logistic regression are very similar methods. Let us study the characteristics of the two methods through the differences in their formal aspects.
• 41-44. Logistic regression vs. LDA - Form of the log-odds
LDA:
$$
\log\frac{\Pr(G=k\mid X=x)}{\Pr(G=K\mid X=x)}
=\log\frac{\pi_k}{\pi_K}-\frac{1}{2}(\mu_k+\mu_K)^T\Sigma^{-1}(\mu_k-\mu_K)+x^T\Sigma^{-1}(\mu_k-\mu_K)
=\alpha_{k0}+\alpha_k^Tx.
$$
Logistic regression:
$$
\log\frac{\Pr(G=k\mid X=x)}{\Pr(G=K\mid X=x)}=\beta_{k0}+\beta_k^Tx.
$$
Both log-odds have the same linear form.
• 45-48. Logistic regression vs. LDA - Criteria of estimation
LDA maximizes the full (joint) log-likelihood,
$$
\max\ \sum_{i=1}^{N}\log\Pr(G=g_i,X=x_i)
=\max\ \sum_{i=1}^{N}\left[\log\Pr(G=g_i\mid X=x_i)+\log\Pr(X=x_i)\right],
$$
where the second term is the marginal likelihood. Logistic regression maximizes only the conditional log-likelihood,
$$
\max\ \sum_{i=1}^{N}\log\Pr(G=g_i\mid X=x_i).
$$
• 49-52. Logistic regression vs. LDA - Form of Pr(X)
LDA: the marginal is a Gaussian mixture that involves the parameters,
$$
\Pr(X)=\sum_{k=1}^{K}\pi_k\,\phi(X;\mu_k,\Sigma).
$$
Logistic regression: Pr(X) is left arbitrary.
• 53. Logistic regression vs. LDA - Effects of the difference (1)
How do these formal differences affect the character of each algorithm?
• 54-56. Logistic regression vs. LDA - Effects of the difference (2)
The assumptions of Gaussianity and a common covariance are strong constraints, which lead to lower variance. In addition, LDA has the advantage that it can make use of unlabelled observations, i.e. semi-supervised learning is possible. On the other hand, LDA can be affected by outliers.
• 57-60. Logistic regression vs. LDA - Effects of the difference (3)
With linearly separable data:
- The coefficients of LDA are well defined, but training errors may occur.
- The coefficients of LR can become infinite, but a true separating hyperplane can be found.
Do not think too much about training error; what matters is the generalization error.
• 61-63. Logistic regression vs. LDA - Effects of the difference (4)
The assumptions behind LDA rarely hold in practice. Nevertheless, it is known empirically that the two models give quite similar results, even when LDA is used inappropriately, say with qualitative variables. After all, if the Gaussian assumption looks plausible, use LDA; otherwise, use logistic regression.
• 66. Today's topics
  ☑ Logistic regression (contd.)
    ☑ On the analogy with Least Squares Fitting
    ☑ Logistic regression vs. LDA
  ☐ Separating Hyperplane
    ☐ Rosenblatt's Perceptron
    ☐ Optimal Hyperplane
• 67-68. Separating Hyperplane: Overview - Another way of classification
Both LDA and LR perform classification through probabilities obtained from regression models. Classification can also be done in a more explicit way: by modelling the decision boundary directly.
• 69. Separating Hyperplane: Overview - Properties of vector algebra
Let L be the affine set defined by
$$\beta_0+\beta^Tx=0,$$
and the signed distance from a point x to L is
$$d_{\pm}(x,L)=\frac{1}{\lVert\beta\rVert}\left(\beta^Tx+\beta_0\right).$$
Then
$$\beta_0+\beta^Tx>0\iff x\text{ is above }L,\qquad
\beta_0+\beta^Tx=0\iff x\text{ is on }L,\qquad
\beta_0+\beta^Tx<0\iff x\text{ is below }L.$$
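A tiny sketch of the signed-distance formula (the hyperplane and the test points below are made up for illustration):

```python
import numpy as np

def signed_distance(x, beta, beta0):
    """Signed distance from point x to the hyperplane beta0 + beta'x = 0.
    Positive when x lies on the side that beta points to ("above" L)."""
    return (beta @ x + beta0) / np.linalg.norm(beta)

# hyperplane x1 + x2 - 1 = 0, i.e. beta = [1, 1], beta0 = -1
beta, beta0 = np.array([1.0, 1.0]), -1.0
print(signed_distance(np.array([1.0, 1.0]), beta, beta0))  #  1/sqrt(2) > 0: above L
print(signed_distance(np.array([0.5, 0.5]), beta, beta0))  #  0: on L
print(signed_distance(np.array([0.0, 0.0]), beta, beta0))  # -1/sqrt(2) < 0: below L
```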
• 71. Today's topics
  ☑ Logistic regression (contd.)
    ☑ On the analogy with Least Squares Fitting
    ☑ Logistic regression vs. LDA
  ▶ Separating Hyperplane
    ▶ Rosenblatt's Perceptron
    ▶ Optimal Hyperplane
• 72-76. Rosenblatt's Perceptron - Learning criterion
The basic criterion of Rosenblatt's perceptron learning algorithm is to reduce (M being the set of misclassified observations)
$$
D(\beta,\beta_0)=\sum_{i\in M}\left|x_i^T\beta+\beta_0\right|
\;\propto\;\sum_{i\in M}\left|d_{\pm}(x_i,L)\right|
\;=\;-\sum_{i\in M}y_i\!\left(x_i^T\beta+\beta_0\right).
$$
If $y_i=1$ is misclassified as $-1$, the term $x_i^T\beta+\beta_0$ is negative; if $y_i=-1$ is misclassified as $1$, it is positive. In both cases $-y_i(x_i^T\beta+\beta_0)>0$.
• 77-80. Rosenblatt's Perceptron - Learning algorithm (1)
Instead of reducing D by batch learning, a "stochastic" gradient descent algorithm is adopted: the coefficients are updated for each misclassified observation, as in online learning. Observations that are classified correctly do not affect the parameters, so the algorithm is robust to outliers. Thus, the coefficients are updated based not on the total D but on the single-observation criterion
$$D_i(\beta,\beta_0)=-y_i\!\left(x_i^T\beta+\beta_0\right).$$
• 81-85. Rosenblatt's Perceptron - Learning algorithm (2)
The algorithm proceeds as follows:
1. Take one observation $x_i$ and classify it.
2. If the classification was wrong, update the coefficients using
$$
\frac{\partial D_i(\beta,\beta_0)}{\partial\beta}=-y_ix_i,\qquad
\frac{\partial D_i(\beta,\beta_0)}{\partial\beta_0}=-y_i,
\qquad\therefore\
\begin{pmatrix}\beta\\ \beta_0\end{pmatrix}
\leftarrow
\begin{pmatrix}\beta\\ \beta_0\end{pmatrix}
+\rho\begin{pmatrix}y_ix_i\\ y_i\end{pmatrix},
$$
where ρ is the learning rate, which can be set to 1 without loss of generality.
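A minimal sketch of this update rule (the function name, the stopping rule, and the default ρ = 1 are my own choices; y is assumed to be coded as ±1):

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_epochs=1000):
    """Rosenblatt's perceptron: stochastic updates on misclassified points.
    Returns (beta, beta0); terminates early only if the data are linearly
    separable, otherwise it stops after max_epochs."""
    n, p = X.shape
    beta, beta0 = np.zeros(p), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (X[i] @ beta + beta0) <= 0:   # misclassified (or on the boundary)
                beta += rho * y[i] * X[i]
                beta0 += rho * y[i]
                mistakes += 1
        if mistakes == 0:                            # one full pass with no errors
            break
    return beta, beta0
```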
• 86-87. Rosenblatt's Perceptron - Learning algorithm (3)
Updating the parameters may cause other, previously correctly classified observations to become misclassified. Therefore, although each update reduces the individual $D_i$, it can increase the total D.
• 88-91. Rosenblatt's Perceptron - Convergence theorem
If the data are linearly separable, perceptron learning terminates in a finite number of steps; otherwise, learning never terminates. However, in practice it is difficult to tell whether the data are not linearly separable (so the algorithm will never converge) or the data are linearly separable but convergence is simply slow. In addition, the solution is not unique; it depends on the initial values and on the order of the data.
• 93. Today's topics
  ☑ Logistic regression (contd.)
    ☑ On the analogy with Least Squares Fitting
    ☑ Logistic regression vs. LDA
  ☐ Separating Hyperplane
    ☑ Rosenblatt's Perceptron
    ☐ Optimal Hyperplane
• 94-95. Optimal Hyperplane - Derivation of the KKT conditions (1)
This section could be hard for some of the audience. To make the story a bit clearer, let us first study optimization problems in general. The theme is: duality and the KKT conditions for optimization problems.
• 96. Optimal Hyperplane - Derivation of the KKT conditions (2)
Suppose we have an optimization problem
$$\text{minimize } f(x)\quad\text{subject to } g_i(x)\le 0,$$
and let the feasible region be
$$C=\{x\mid g_i(x)\le 0\ \ \forall i\}.$$
• 97-98. Optimal Hyperplane - Derivation of the KKT conditions (3)
In optimization, relaxation is a technique often used to make a problem easier. Lagrange relaxation, as done below, is one example:
$$\text{minimize } L(x,y)=f(x)+\sum_i y_ig_i(x)\quad\text{subject to } y\ge 0.$$
• 99-101. Optimal Hyperplane - Derivation of the KKT conditions (4)
Concerning L(x, y), the following inequality holds:
$$\min_{x\in C}f(x)=\min_{x}\ \sup_{y\ge 0}L(x,y)\;\ge\;\max_{y\ge 0}\ \inf_{x}L(x,y),$$
and equality requires $y_i$ or $g_i(x)$ to be zero for every i (this condition is called "complementary slackness"). According to the inequality, maximizing $\inf_x L(x,y)$ gives a lower bound for the original problem.
• 102-105. Optimal Hyperplane - Derivation of the KKT conditions (5)
Therefore, we have the following maximization problem:
$$\text{maximize } L(x,y)\quad\text{subject to }\frac{\partial}{\partial x}L(x,y)=0,\ \ y\ge 0,$$
where the stationarity constraint is the condition to achieve $\inf_x L(x,y)$. This is called the "Wolfe dual problem", and strong duality says that the solutions of the primal and the dual problems are equivalent.
• 106-111. Optimal Hyperplane - Derivation of the KKT conditions (6)
Thus, the optimal solution must satisfy the conditions derived so far. Together they are called the "KKT conditions":
$$
\begin{aligned}
g_i(x)&\le 0 &&\text{(primal constraint)}\\
\frac{\partial}{\partial x}L(x,y)&=0 &&\text{(stationarity condition)}\\
y_i&\ge 0 &&\text{(dual constraint)}\\
y_ig_i(x)&=0 &&\text{(complementary slackness)}
\end{aligned}
$$
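As a concrete check of these conditions, here is a small worked example of my own (not from the slides): minimize $x^2$ subject to $x\ge 1$.
$$
\begin{aligned}
&f(x)=x^2,\quad g(x)=1-x\le 0,\qquad L(x,y)=x^2+y(1-x),\\
&\text{KKT: }\ 1-x\le 0,\quad 2x-y=0,\quad y\ge 0,\quad y(1-x)=0.\\
&\text{If } y=0 \text{ then } x=0,\text{ which is infeasible; hence } 1-x=0,\ \ x^{*}=1,\ \ y^{*}=2x^{*}=2\ge 0.
\end{aligned}
$$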
• 112. Optimal Hyperplane - KKT for the optimal hyperplane (1)
We have now derived the KKT conditions. Let us get back to the original problem: finding the optimal hyperplane.
• 113-114. Optimal Hyperplane - KKT for the optimal hyperplane (2)
The original fitting criterion of the optimal hyperplane is a generalization of the perceptron:
$$
\max_{\beta,\beta_0}\ M\quad\text{subject to }\lVert\beta\rVert=1,\ \ y_i(x_i^T\beta+\beta_0)\ge M\ \ (i=1,\dots,N).
$$
The margin-maximization criterion has theoretical support that makes no assumption on the underlying distributions.
• 115-117. Optimal Hyperplane - KKT for the optimal hyperplane (3)
This is a kind of mini-max problem, which is difficult to solve, so we convert it into an easier one:
$$
\min_{\beta,\beta_0}\ \frac{1}{2}\lVert\beta\rVert^2\quad\text{subject to } y_i(x_i^T\beta+\beta_0)\ge 1\ \ (i=1,\dots,N).
$$
(See hand-out for the detailed transformation.) This is a quadratic programming problem.
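One way to solve this QP numerically is with a generic convex solver. The sketch below uses the cvxpy library on made-up separable data (the library choice and the data are assumptions, not part of the slides); the dual values of the margin constraints correspond, up to solver tolerance, to the multipliers α_i that appear in the following slides.

```python
import numpy as np
import cvxpy as cp

# hypothetical separable toy data, y coded +1/-1
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=+3.0, size=(20, 2)),
               rng.normal(loc=-3.0, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

# primal problem: minimize (1/2)||beta||^2 s.t. y_i (x_i'beta + beta0) >= 1
beta = cp.Variable(2)
beta0 = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(beta))
constraints = [cp.multiply(y, X @ beta + beta0) >= 1]
problem = cp.Problem(objective, constraints)
problem.solve()

alpha = constraints[0].dual_value        # Lagrange multipliers alpha_i
support = np.where(alpha > 1e-6)[0]      # indices of the support points
print(beta.value, beta0.value, support)
```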
• 118. Optimal Hyperplane - KKT for the optimal hyperplane (4)
To make use of the KKT conditions, turn the objective into a Lagrange function:
$$
L_P=\frac{1}{2}\lVert\beta\rVert^2-\sum_{i=1}^{N}\alpha_i\left[y_i(x_i^T\beta+\beta_0)-1\right].
$$
• 119-120. Optimal Hyperplane - KKT for the optimal hyperplane (5)
Thus, the KKT conditions are:
$$
\begin{aligned}
&y_i(x_i^T\beta+\beta_0)\ge 1 &&(i=1,\dots,N),\\
&\beta=\sum_{i=1}^{N}\alpha_iy_ix_i,\qquad 0=\sum_{i=1}^{N}\alpha_iy_i,\\
&\alpha_i\ge 0 &&(i=1,\dots,N),\\
&\alpha_i\left[y_i(x_i^T\beta+\beta_0)-1\right]=0 &&(i=1,\dots,N).
\end{aligned}
$$
The solution is obtained by solving these.
• 121-125. Optimal Hyperplane - Support points (1)
The KKT conditions tell us
$$
\alpha_i>0\iff y_i(x_i^T\beta+\beta_0)=1\iff x_i\text{ is on the edge of the slab},
$$
$$
\alpha_i=0\iff y_i(x_i^T\beta+\beta_0)>1\iff x_i\text{ is off the edge of the slab}.
$$
The points on the edge of the slab are called "support points" (or "support vectors").
• 126. Optimal Hyperplane - Support points (2)
β can be written as a linear combination of the support points,
$$
\beta=\sum_{i=1}^{N}\alpha_iy_ix_i=\sum_{i\in S}\alpha_iy_ix_i,
$$
where S is the set of indices of the support points.
• 127-128. Optimal Hyperplane - Support points (3)
β0 can be obtained after β. For $i\in S$,
$$
y_i(x_i^T\beta+\beta_0)=1
\quad\therefore\ \beta_0=1/y_i-\beta^Tx_i=y_i-\sum_{j\in S}\alpha_jy_jx_j^Tx_i
\quad\therefore\ \beta_0=\frac{1}{|S|}\sum_{i\in S}\left(y_i-\sum_{j\in S}\alpha_jy_jx_j^Tx_i\right),
$$
where $1/y_i=y_i$ since $y_i=\pm 1$, and the average over the support points is taken to avoid numerical error.
• 129-130. Optimal Hyperplane - Support points (4)
All coefficients are defined only through the support points,
$$
\beta=\sum_{i\in S}\alpha_iy_ix_i,\qquad
\beta_0=\frac{1}{|S|}\sum_{i\in S}\left(y_i-\sum_{j\in S}\alpha_jy_jx_j^Tx_i\right);
$$
thus the solution is robust to outliers. However, do not forget that which points become support points is determined using all of the data points.
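A small helper sketch of this reconstruction (my own function; X, y, and the multipliers alpha are assumed to come from some QP solver, e.g. the cvxpy sketch earlier):

```python
import numpy as np

def coef_from_support(X, y, alpha, tol=1e-8):
    """Rebuild (beta, beta0) from the multipliers alpha using only the
    support points, following the formulas above."""
    S = np.where(alpha > tol)[0]              # indices of the support points
    beta = (alpha[S] * y[S]) @ X[S]           # beta = sum_{i in S} alpha_i y_i x_i
    beta0 = np.mean(y[S] - X[S] @ beta)       # averaged over S to reduce numerical error
    return beta, beta0
```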
• 133. Today's topics
  ☑ Logistic regression (contd.)
    ☑ On the analogy with Least Squares Fitting
    ☑ Logistic regression vs. LDA
  ☑ Separating Hyperplane
    ☑ Rosenblatt's Perceptron
    ☑ Optimal Hyperplane
• 134. Summary
With linearly separable data:
  - LDA: training error may occur
  - Logistic Regression: true separator found, but coefficients may be infinite
  - Perceptron: true separator found, but not unique
  - Optimal Hyperplane: best separator found
With non-linearly separable data:
  - LDA: works well
  - Logistic Regression: works well
  - Perceptron: the algorithm never stops
  - Optimal Hyperplane: not feasible
With outliers:
  - LDA: not robust
  - Logistic Regression: robust
  - Perceptron: robust
  - Optimal Hyperplane: robust