9. Background on SVM
§ Margin: Distance from closest points to the hyper-plane
§ Idea: Among the set of hyper-planes, choose the one that
maximizes the margin
[Figure: separating hyper-plane with margin ρ and support vectors (SVs)]
10. Background on SVM
[Figure: hyper-plane wᵀx + b = 0 with margin ρ; the two classes lie in the regions wᵀx + b > ρ and wᵀx + b < −ρ]
• Hyper-plane represented by: wᵀx + b = 0
• We want to choose the w and b that will maximize the margin ρ.
• Using some algebra and some rescaling, we can show that for the support vectors: margin ρ = 1/‖w‖
11. Background on SVM (cont.)
§ Thus the goal is to solve the following optimization problem:
Argmax_{w,b} ρ = Argmax_{w,b} (1/‖w‖) = Argmin_{w,b} ‖w‖
Subject to yi (wᵀxi + b) ≥ 1, i = 1..n
(where yi = +1 or −1, depending on the class of xi)
12. Background on SVM (cont.)
§ To avoid square roots, we can apply the following transformation:

Argmin_{w,b} ‖w‖
Subject to yi (wᵀxi + b) ≥ 1, i = 1..n
⇓
Argmin_{w,b} (1/2) ‖w‖²
Subject to yi (wᵀxi + b) ≥ 1, i = 1..n

§ Thus the problem becomes minimizing a quadratic function subject to linear constraints (a well-studied problem).
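To make the quadratic program concrete, here is a minimal sketch (not part of the original deck) that solves the hard-margin primal on a tiny, hypothetical 2-D dataset with SciPy's generic SLSQP solver; a production system would use a dedicated SVM solver instead.

```python
# Minimal sketch (not the deck's solver): solve the hard-margin primal QP
# min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1  with a generic solver.
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data (hypothetical).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    w = params[:-1]
    return 0.5 * np.dot(w, w)          # (1/2) ||w||^2

def margin_constraints(params):
    w, b = params[:-1], params[-1]
    return y * (X @ w + b) - 1.0       # each entry must be >= 0

result = minimize(objective,
                  x0=np.zeros(X.shape[1] + 1),
                  constraints={"type": "ineq", "fun": margin_constraints},
                  method="SLSQP")
w, b = result.x[:-1], result.x[-1]
print("w =", w, "b =", b, "margin =", 1.0 / np.linalg.norm(w))
```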
13. Background on SVM (cont.)
§ What happens if the data is not linearly separable? (i.e., there is no hyper-plane that will split the data exactly)
14. Background on SVM (cont.)
Argmin_{w,b} (1/2) ‖w‖²
Subject to yi (wᵀxi + b) ≥ 1, i = 1..n
• Slack variables ξi are added to the constraints.
• ξi is the distance from xi to its class boundary.
15. Background on SVM (cont.)
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Argmin_{w,b} (1/2) ‖w‖²
Subject to yi (wᵀxi + b) ≥ 1, i = 1..n
⇓ (add slack)
Argmin_{w,b,ξ} ( (1/2) ‖w‖² + C Σ_{i=1..n} ξi )
Subject to yi (wᵀxi + b) ≥ 1 − ξi, ξi ≥ 0, i = 1..n
• Slack variables ξi are added to the constraints.
• ξi is the distance from xi to its class boundary.
• C is the regularization parameter, which controls the bias-variance trade-off (the significance of outliers).
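As an illustration (not from the deck), the slack values and the soft-margin objective can be computed directly for a candidate solution; the data, w, b, and C below are hypothetical.

```python
# Sketch: slack values xi_i = max(0, 1 - y_i (w^T x_i + b)) for a candidate (w, b),
# and the soft-margin objective (1/2)||w||^2 + C * sum(xi).  Values are hypothetical.
import numpy as np

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [0.2, -0.1]])  # toy points
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, C = np.array([1.0, 1.0]), 0.0, 10.0                          # candidate solution

slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # xi_i = 0 for well-classified points
objective = 0.5 * w @ w + C * slack.sum()
print("slack:", slack, "objective:", objective)
```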
16. Background on SVM (cont.)
Argmin_{w,b,ξ} ( (1/2) ‖w‖² + C Σ_{i=1..n} ξi )
Subject to yi (wᵀxi + b) ≥ 1 − ξi, ξi ≥ 0, i = 1..n
Question: How do we get rid of the constraints?
17. Background on SVM (cont.)
Argmin_{w,b,ξ} ( (1/2) ‖w‖² + C Σ_{i=1..n} ξi )
Subject to yi (wᵀxi + b) ≥ 1 − ξi, ξi ≥ 0, i = 1..n
Answer: Fenchel Duality and Representer Theorems!
Argmin_{w,b} ( (λ/2) ‖w‖² + Σ_{i=1..n} max(0, 1 − yi(wᵀxi + b)) )
(the summand is the hinge loss)

We’ve removed the constraints! SVM minimizes the “L2-regularized hinge.”
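A minimal sketch of what minimizing the L2-regularized hinge looks like in code, assuming a plain sub-gradient descent loop on a hypothetical toy dataset (illustrative only, not the deck's solver):

```python
# Sketch: L2-regularized hinge objective and sub-gradient steps (a stepping
# stone toward Pegasos).  lam, the step sizes, and the data are hypothetical.
import numpy as np

def hinge_objective(w, b, X, y, lam):
    margins = y * (X @ w + b)
    return 0.5 * lam * w @ w + np.maximum(0.0, 1.0 - margins).sum()

def subgradient_step(w, b, X, y, lam, lr):
    margins = y * (X @ w + b)
    active = margins < 1.0                     # points inside the margin or misclassified
    grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0)
    grad_b = -y[active].sum()
    return w - lr * grad_w, b - lr * grad_b

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.zeros(2), 0.0
for t in range(1, 101):
    w, b = subgradient_step(w, b, X, y, lam=0.1, lr=1.0 / (0.1 * t))
print(hinge_objective(w, b, X, y, lam=0.1))
```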
18. Background on SVM (cont.)
§ What about multi-class situations?
There are different ways to handle multi-class classification:
• One vs. all
• One vs. one
• Cost-sensitive Hinge (Crammer and Singer 2001)
19. Background on SVM (cont.)
§ Cost-sensitive formulation of hinge loss (Crammer and Singer 2001):

Argmin_{w,b} ( (λ/2) ‖w‖² + Σ_{i=1..n} max(0, 1 + fr(xi) − ft(xi)) )
(the summand is the multi-class hinge)

Where
fr(xi) = max_{r ∈ Y, r ≠ t} (wr xi + br) (score of the best wrong class)
ft(xi) = wt xi + bt (score of the true class t)

This loss function is called the “cost-sensitive hinge.”
And the prediction function is:
f(x) = Argmax_{i ∈ Y} (wi x + bi)

Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 262–292.
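A small illustrative sketch (simplified from the Crammer and Singer formulation, not their full algorithm) of the multi-class hinge for a single sample and of the argmax prediction rule; the per-class weights and biases below are hypothetical:

```python
# Sketch: multi-class hinge for one sample and argmax prediction (a simplified
# illustration of the Crammer & Singer loss).
import numpy as np

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # one weight row per class (hypothetical)
bias = np.array([0.0, 0.1, -0.1])

def scores(x):
    return W @ x + bias                       # f_i(x) = w_i^T x + b_i for every class i

def multiclass_hinge(x, t):
    s = scores(x)
    f_t = s[t]                                # score of the true class t
    f_r = np.max(np.delete(s, t))             # best score among the wrong classes
    return max(0.0, 1.0 + f_r - f_t)

def predict(x):
    return int(np.argmax(scores(x)))          # f(x) = argmax_i (w_i^T x + b_i)

x = np.array([0.5, 2.0])
print("loss:", multiclass_hinge(x, t=1), "prediction:", predict(x))
```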
20. SVM: Implementation
We now have the function we need to optimize. But how do we parallelize it for a map-reduce framework?
21. SVM: Implementation
We now have the function we need to optimize. But how do we parallelize it for a map-reduce framework?
Parallelized Stochastic Gradient Descent, by Martin Zinkevich, Markus Weimer, Alexander J. Smola, and Lihong Li. NIPS 2010.
23. Parallelized Stochastic Gradient Descent - Theory
§ Conditions:
• SVM loss function has bounded gradient
• The solver is stochastic
§ Result:
• You can break the original sample into randomly distributed
subsamples and solve on each subsample.
• The convex combination of the sub-solutions will be the same as the solution for the original sample
24. Optimization
§ Conditions:
• SVM loss function has bounded gradient
• The solver is stochastic
§ Loss: Cost-sensitive hinge
• Crammer, K & Singer. Y. (2001). On the algorithmic implementation of
multiclass kernel-based vector machines. JMLR, 2, 262-292.
§ Solver: Pegasos
• Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: primal estimated
sub-gradient solver for svm. ICML, 807-814.
§ Use the mapper to randomly distribute the samples, and use the reducer to iterate on each sub-sample (a local sketch of this idea follows below).
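A local, single-machine sketch of this recipe, assuming a binary hinge loss and synthetic data: shard the samples at random (the "mapper"), run a Pegasos-style solver on each shard (the "reducer"), and average the sub-solutions. It illustrates the idea only; it is not the actual Hadoop job.

```python
# Local sketch of the parallelized-SGD idea: shard the data at random ("mapper"),
# run a Pegasos-style binary solver on each shard ("reducer"), then average.
# Simplification (binary hinge, no Hadoop), not the production pipeline.
import numpy as np

def pegasos(X, y, lam=0.1, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    w, t = np.zeros(X.shape[1]), 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            t += 1
            eta = 1.0 / (lam * t)
            if y[i] * (w @ X[i]) < 1.0:
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:
                w = (1 - eta * lam) * w
    return w

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]))                 # synthetic labels

shards = np.array_split(rng.permutation(len(y)), 4)                   # "mapper": random shards
w_avg = np.mean([pegasos(X[idx], y[idx]) for idx in shards], axis=0)  # combine sub-solutions
print("training accuracy:", np.mean(np.sign(X @ w_avg) == y))
```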
26. SVM: Non-separable data
But what about non-separable data?
Idea: Transform the pattern space into a higher-dimensional space, called the feature space, in which the data is linearly separable.
29. SVM: Kernels
§ Two questions:
• What kind of function is a kernel?
• What kernel is appropriate for a specific problem?
§ The answers:
• Mercer’s Theorem: Every positive semi-definite symmetric function is a kernel
• Depends on the problem.
http://www.ism.ac.jp/~fukumizu/H20_kernel/Kernel_7_theory.pdf
30. SVM: Kernels
§ Examples of popular kernel functions:
• Gaussian kernel: K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))
• Laplacian kernel: K(xi, xj) = sin(θ‖xi − xj‖) / (θ‖xi − xj‖)
• Polynomial kernel: K(xi, xj) = (a xiᵀxj + b)^d
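For concreteness, the Gaussian and polynomial kernels from this slide can be written as plain functions; the parameter values below are hypothetical:

```python
# Sketch: the Gaussian and polynomial kernels from the slide as plain numpy
# functions (parameter values are hypothetical).
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    # K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
    diff = xi - xj
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

def polynomial_kernel(xi, xj, a=1.0, b=1.0, d=3):
    # K(xi, xj) = (a * xi^T xj + b)^d
    return (a * (xi @ xj) + b) ** d

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(gaussian_kernel(xi, xj), polynomial_kernel(xi, xj))
```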
31. SVM: Kernels
§ The kernel (dual) feature space is defined by the inner products between each pair xi and xj
§ The kernel matrix is N × N, where N is the number of samples
§ As your sample size goes up, the kernel matrix gets huge!
§ Worse, the dual problem does not map well onto MapReduce!
⇒ Dual space is not feasible at scale
33. SVM: Implementation
§ Question: How can we have a non-linear SVM without paying the price of duality?
§ Claim: For certain kernel functions we can find a feature map z such that we can instead solve

Argmin_{w,b} ( (λ/2) ‖w‖² + Σ_{i=1..n} max(0, 1 − yi(wᵀz(xi) + b)) )
(the same L2-regularized hinge, now applied to z(xi))
34. SVM: Implementation
• Random Features for Large-Scale Kernel Machines, by Ali Rahimi and Ben Recht (NIPS 2007): can approximate shift-invariant kernels.
• Random Feature Maps for Dot Product Kernels, by Purushottam Kar and Harish Karnick (AISTATS 2012): can approximate dot-product kernels.
35. Approximating shift-invariant kernel
Random Features for Large-Scale Kernel Machines:
Given a positive definite shift-invariant kernel K(x, y) = f(x − y), we can create a randomized feature map Z : Rᵈ → R^D such that Z(x)ᵀZ(y) ≈ K(x − y).
1. Compute the Fourier transform p of the kernel k: p(ω) = (1/2π) ∫ e^(−jωᵀδ) k(δ) dδ
2. Draw D iid samples ω1, ..., ωD ∈ Rᵈ from p.
3. Draw D iid samples b1, ..., bD ∈ R from the uniform distribution on [0, 2π].
4. Z : x → √(2/D) [cos(ω1ᵀx + b1), ..., cos(ωDᵀx + bD)]ᵀ
37. Approximating dot-product kernel
Random Feature Maps for Dot Product Kernels:
Given a positive definite dot-product kernel K(x, y) = f(⟨x, y⟩), we can create a randomized feature map Z : Rᵈ → R^D such that ⟨Z(x), Z(y)⟩ ≈ K(x, y).
1. Obtain the Maclaurin expansion f(x) = Σ_{n=0..∞} an xⁿ by setting an = f⁽ⁿ⁾(0) / n!
2. Fix a value p > 1. For i = 1 to D:
   • Choose a non-negative integer N with P[N = n] = 1/p^(n+1).
   • Choose N vectors ω1, ..., ωN ∈ {−1, 1}ᵈ, selecting each coordinate using fair coin tosses.
   • Let the feature map Zi : x → √(aN p^(N+1)) ∏_{j=1..N} ωjᵀx
3. Z : x → (1/√D) (Z1(x), ..., ZD(x))
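A sketch of this recipe applied to one convenient dot-product kernel, K(x, y) = exp(⟨x, y⟩), whose Maclaurin coefficients are an = 1/n!; fixing p = 2 so that P[N = n] = 1/2^(n+1) sums to one is an assumption made for the illustration:

```python
# Sketch: the random Maclaurin feature map above, applied to the dot-product
# kernel K(x, y) = exp(<x, y>), whose Maclaurin coefficients are a_n = 1/n!.
# p is fixed to 2 so that P[N = n] = 1 / 2^(n+1) is a proper distribution.
import numpy as np
from math import factorial

def random_maclaurin_features(X, D, seed=0):
    rng = np.random.default_rng(seed)
    n_samples, d = X.shape
    Z = np.zeros((n_samples, D))
    for i in range(D):
        N = rng.geometric(0.5) - 1                     # P[N = n] = 1 / 2^(n+1)
        a_N = 1.0 / factorial(N)                       # Maclaurin coefficient of exp
        omegas = rng.choice([-1.0, 1.0], size=(N, d))  # N Rademacher vectors
        Z[:, i] = np.sqrt(a_N * 2.0 ** (N + 1)) * np.prod(X @ omegas.T, axis=1)
    return Z / np.sqrt(D)

rng = np.random.default_rng(3)
X = 0.3 * rng.normal(size=(5, 4))
Z = random_maclaurin_features(X, D=20000)
exact = np.exp(X @ X.T)
print(np.max(np.abs(exact - Z @ Z.T)))                 # shrinks as D grows
```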
38. SVM: Implementation Summary
Using these approximations, we can now treat this as a linear SVM
problem.
(1) Job 1 – compute stats for each feature and class (mean, variance, class cardinality, etc.)
(2) Job 2 – transform the samples with the approximate kernel map and compute stats for the new feature space.
(3) Job 3 – randomly distribute the new samples and train the model in the reducer.
We can use map-reduce to solve non-linear multi-class SVM classification!
39. SVM: Implementation examples
§ SVM used by a large entertainment company for customer segmentation
• Web logs containing browsing information mined for customer attributes
like gender and age
• Raw Omniture logs stored in Hadoop
• Models built on ~10 billion rows and 1 million features
• Models used to improve the inventory value of the company’s web properties for publishers