9. Background on SVM
§ Margin: Distance from closest points to the hyper-plane
§ Idea: Among the set of hyper-planes, choose the one that
maximizes the margin
[Figure: separating hyper-plane with margin ρ and support vectors (SVs)]
10. Background on SVM
[Figure: hyper-plane wᵀx + b = 0 with margin ρ; the two classes lie in the regions wᵀx + b > ρ and wᵀx + b < −ρ]
• Hyper-plane represented by: wᵀx + b = 0
• We want to choose the w and b that will maximize the margin ρ.
• Using some algebra and some rescaling, we can show that for the support vectors: margin ρ = 1/‖w‖
11. Background on SVM (cont.)
§ Thus the goal is to solve the following optimization problem:
Argmax_{w,b} ρ = Argmax_{w,b} (1/‖w‖) = Argmin_{w,b} ‖w‖
Subject to yi (wᵀxi + b) ≥ 1, i = 1..n
(where yi = +1 or −1, depending on the class of xi)
12. Background on SVM (cont.)
§ To avoid square roots, we can apply the following transformation:

Argmin_{w,b} ‖w‖
Subject to yi (wᵀxi + b) ≥ 1, i = 1..n
⇓
Argmin_{w,b} (1/2) ‖w‖²
Subject to yi (wᵀxi + b) ≥ 1, i = 1..n

§ Thus the problem becomes minimizing a quadratic function subject to linear constraints (a well-studied problem).
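To make the quadratic program concrete, here is a minimal sketch (not part of the original deck) that solves the hard-margin primal on a tiny, hypothetical 2-D dataset with SciPy's generic SLSQP solver; a production system would use a dedicated SVM solver instead.

```python
# Minimal sketch (not the deck's solver): solve the hard-margin primal QP
# min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1  with a generic solver.
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data (hypothetical).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    w = params[:-1]
    return 0.5 * np.dot(w, w)          # (1/2) ||w||^2

def margin_constraints(params):
    w, b = params[:-1], params[-1]
    return y * (X @ w + b) - 1.0       # each entry must be >= 0

result = minimize(objective,
                  x0=np.zeros(X.shape[1] + 1),
                  constraints={"type": "ineq", "fun": margin_constraints},
                  method="SLSQP")
w, b = result.x[:-1], result.x[-1]
print("w =", w, "b =", b, "margin =", 1.0 / np.linalg.norm(w))
```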
13. Background on SVM (cont.)
§ What happens if the data is not linearly separable? (i.e., there is no hyper-plane that will split the data exactly)
14. Background on SVM (cont.)
Argmin_{w,b} (1/2) ‖w‖²
Subject to yi (wᵀxi + b) ≥ 1, i = 1..n
• Slack variables ξi are added to the constraints.
• ξi is the distance from xi to its class boundary.
15. Background on SVM (cont.)
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Argmin_{w,b} (1/2) ‖w‖²
Subject to yi (wᵀxi + b) ≥ 1, i = 1..n
⇓ (add slack)
Argmin_{w,b,ξ} ( (1/2) ‖w‖² + C Σ_{i=1..n} ξi )
Subject to yi (wᵀxi + b) ≥ 1 − ξi, ξi ≥ 0, i = 1..n
• Slack variables ξi are added to the constraints.
• ξi is the distance from xi to its class boundary.
• C is the regularization parameter, which controls the bias-variance trade-off (the significance of outliers).
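As an illustration (not from the deck), the slack values and the soft-margin objective can be computed directly for a candidate solution; the data, w, b, and C below are hypothetical.

```python
# Sketch: slack values xi_i = max(0, 1 - y_i (w^T x_i + b)) for a candidate (w, b),
# and the soft-margin objective (1/2)||w||^2 + C * sum(xi).  Values are hypothetical.
import numpy as np

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [0.2, -0.1]])  # toy points
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, C = np.array([1.0, 1.0]), 0.0, 10.0                          # candidate solution

slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # xi_i = 0 for well-classified points
objective = 0.5 * w @ w + C * slack.sum()
print("slack:", slack, "objective:", objective)
```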
16. Background on SVM (cont.)
Argmin_{w,b,ξ} ( (1/2) ‖w‖² + C Σ_{i=1..n} ξi )
Subject to yi (wᵀxi + b) ≥ 1 − ξi, ξi ≥ 0, i = 1..n
Question: How do we get rid of the constraints?
17. Background on SVM (cont.)
Argmin_{w,b,ξ} ( (1/2) ‖w‖² + C Σ_{i=1..n} ξi )
Subject to yi (wᵀxi + b) ≥ 1 − ξi, ξi ≥ 0, i = 1..n
Answer: Fenchel Duality and Representer Theorems!
Argmin_{w,b} ( (λ/2) ‖w‖² + Σ_{i=1..n} max(0, 1 − yi(wᵀxi + b)) )
(the summand is the hinge loss)

We’ve removed the constraints! SVM minimizes the “L2-regularized hinge.”
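A minimal sketch of what minimizing the L2-regularized hinge looks like in code, assuming a plain sub-gradient descent loop on a hypothetical toy dataset (illustrative only, not the deck's solver):

```python
# Sketch: L2-regularized hinge objective and sub-gradient steps (a stepping
# stone toward Pegasos).  lam, the step sizes, and the data are hypothetical.
import numpy as np

def hinge_objective(w, b, X, y, lam):
    margins = y * (X @ w + b)
    return 0.5 * lam * w @ w + np.maximum(0.0, 1.0 - margins).sum()

def subgradient_step(w, b, X, y, lam, lr):
    margins = y * (X @ w + b)
    active = margins < 1.0                     # points inside the margin or misclassified
    grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0)
    grad_b = -y[active].sum()
    return w - lr * grad_w, b - lr * grad_b

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.zeros(2), 0.0
for t in range(1, 101):
    w, b = subgradient_step(w, b, X, y, lam=0.1, lr=1.0 / (0.1 * t))
print(hinge_objective(w, b, X, y, lam=0.1))
```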
18. Background on SVM (cont.)
§ What about multi-class situations?
There are different ways to handle multi-class classification:
• One vs. all
• One vs. one
• Cost-sensitive Hinge (Crammer and Singer 2001)
19. Background on SVM (cont.)
§ Cost-sensitive formulation of hinge loss (Crammer and Singer 2001):

Argmin_{w,b} ( (λ/2) ‖w‖² + Σ_{i=1..n} max(0, 1 + fr(xi) − ft(xi)) )
(the summand is the multi-class hinge)

Where
fr(xi) = max_{r ∈ Y, r ≠ t} (wr xi + br) (score of the best wrong class)
ft(xi) = wt xi + bt (score of the true class t)

This loss function is called the “cost-sensitive hinge.”
And the prediction function is:
f(x) = Argmax_{i ∈ Y} (wi x + bi)

Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 262–292.
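A small illustrative sketch (simplified from the Crammer and Singer formulation, not their full algorithm) of the multi-class hinge for a single sample and of the argmax prediction rule; the per-class weights and biases below are hypothetical:

```python
# Sketch: multi-class hinge for one sample and argmax prediction (a simplified
# illustration of the Crammer & Singer loss).
import numpy as np

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # one weight row per class (hypothetical)
bias = np.array([0.0, 0.1, -0.1])

def scores(x):
    return W @ x + bias                       # f_i(x) = w_i^T x + b_i for every class i

def multiclass_hinge(x, t):
    s = scores(x)
    f_t = s[t]                                # score of the true class t
    f_r = np.max(np.delete(s, t))             # best score among the wrong classes
    return max(0.0, 1.0 + f_r - f_t)

def predict(x):
    return int(np.argmax(scores(x)))          # f(x) = argmax_i (w_i^T x + b_i)

x = np.array([0.5, 2.0])
print("loss:", multiclass_hinge(x, t=1), "prediction:", predict(x))
```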
20. SVM: Implementation
We now have the function we need to optimize. But how do we parallelize it for a map-reduce framework?
21. SVM: Implementation
We now have the function we need to optimize. But how do we parallelize it for a map-reduce framework?
Parallelized Stochastic Gradient Descent, by Martin Zinkevich, Markus Weimer, Alexander J. Smola, and Lihong Li. NIPS 2010.
23. Parallelized Stochastic Gradient Descent - Theory
§ Conditions:
• SVM loss function has bounded gradient
• The solver is stochastic
§ Result:
• You can break the original sample into randomly distributed
subsamples and solve on each subsample.
• The convex combination of the sub-solutions will be the same as the solution for the original sample
24. Optimization
§ Conditions:
• SVM loss function has bounded gradient
• The solver is stochastic
§ Loss: Cost-sensitive hinge
• Crammer, K & Singer. Y. (2001). On the algorithmic implementation of
multiclass kernel-based vector machines. JMLR, 2, 262-292.
§ Solver: Pegasos
• Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: primal estimated
sub-gradient solver for svm. ICML, 807-814.
§ Use the mapper to randomly distribute the samples, and use the reducer to iterate on each sub-sample (a local sketch of this idea follows below).
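A local, single-machine sketch of this recipe, assuming a binary hinge loss and synthetic data: shard the samples at random (the "mapper"), run a Pegasos-style solver on each shard (the "reducer"), and average the sub-solutions. It illustrates the idea only; it is not the actual Hadoop job.

```python
# Local sketch of the parallelized-SGD idea: shard the data at random ("mapper"),
# run a Pegasos-style binary solver on each shard ("reducer"), then average.
# Simplification (binary hinge, no Hadoop), not the production pipeline.
import numpy as np

def pegasos(X, y, lam=0.1, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    w, t = np.zeros(X.shape[1]), 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            t += 1
            eta = 1.0 / (lam * t)
            if y[i] * (w @ X[i]) < 1.0:
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:
                w = (1 - eta * lam) * w
    return w

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]))                 # synthetic labels

shards = np.array_split(rng.permutation(len(y)), 4)                   # "mapper": random shards
w_avg = np.mean([pegasos(X[idx], y[idx]) for idx in shards], axis=0)  # combine sub-solutions
print("training accuracy:", np.mean(np.sign(X @ w_avg) == y))
```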
26. SVM: Non-separable data
But what about non-separable data?
Idea: Transform the pattern space into a higher-dimensional space, called the feature space, in which the data is linearly separable.
29. SVM: Kernels
§ Two questions:
• What kind of function is a kernel?
• What kernel is appropriate for a specific problem?
§ The answers:
• Mercer’s Theorem: Every positive semi-definite symmetric function is a kernel
• Depends on the problem.
http://www.ism.ac.jp/~fukumizu/H20_kernel/Kernel_7_theory.pdf
30. SVM: Kernels
§ Examples of popular kernel functions:
• Gaussian kernel: K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))
• Laplacian kernel: K(xi, xj) = sin(θ‖xi − xj‖) / (θ‖xi − xj‖)
• Polynomial kernel: K(xi, xj) = (a xiᵀxj + b)^d
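For concreteness, the Gaussian and polynomial kernels from this slide can be written as plain functions; the parameter values below are hypothetical:

```python
# Sketch: the Gaussian and polynomial kernels from the slide as plain numpy
# functions (parameter values are hypothetical).
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    # K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
    diff = xi - xj
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

def polynomial_kernel(xi, xj, a=1.0, b=1.0, d=3):
    # K(xi, xj) = (a * xi^T xj + b)^d
    return (a * (xi @ xj) + b) ** d

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(gaussian_kernel(xi, xj), polynomial_kernel(xi, xj))
```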
31. SVM: Kernels
§ The kernel (dual) feature space is defined by the inner products between each pair xi and xj
§ The kernel matrix is N × N, where N is the number of samples
§ As your sample size goes up, the kernel matrix gets huge!
§ Worse, the dual problem does not map well onto MapReduce!
⇒ Dual space is not feasible at scale
33. SVM: Implementation
§ Question: How can we have a non-linear SVM without paying the price of duality?
§ Claim: For certain kernel functions we can find a feature map z such that we can instead solve

Argmin_{w,b} ( (λ/2) ‖w‖² + Σ_{i=1..n} max(0, 1 − yi(wᵀz(xi) + b)) )
(the same L2-regularized hinge, now applied to z(xi))
34. SVM: Implementation
• Random Features for Large-Scale Kernel Machines, by Ali Rahimi and Ben Recht (NIPS 2007): can approximate shift-invariant kernels.
• Random Feature Maps for Dot Product Kernels, by Purushottam Kar and Harish Karnick (AISTATS 2012): can approximate dot-product kernels.
35. Approximating shift-invariant kernel
Random Features for Large-Scale Kernel Machines:
Given a positive definite shift-invariant kernel K(x, y) = f(x − y), we can create a randomized feature map Z : Rᵈ → R^D such that Z(x)ᵀZ(y) ≈ K(x − y).
1. Compute the Fourier transform p of the kernel k: p(ω) = (1/2π) ∫ e^(−jωᵀδ) k(δ) dδ
2. Draw D iid samples ω1, ..., ωD ∈ Rᵈ from p.
3. Draw D iid samples b1, ..., bD ∈ R from the uniform distribution on [0, 2π].
4. Z : x → √(2/D) [cos(ω1ᵀx + b1), ..., cos(ωDᵀx + bD)]ᵀ
37. Approximating dot-product kernel
Random Feature Maps for Dot Product Kernels:
Given a positive definite dot-product kernel K(x, y) = f(⟨x, y⟩), we can create a randomized feature map Z : Rᵈ → R^D such that ⟨Z(x), Z(y)⟩ ≈ K(x, y).
1. Obtain the Maclaurin expansion f(x) = Σ_{n=0..∞} an xⁿ by setting an = f⁽ⁿ⁾(0) / n!
2. Fix a value p > 1. For i = 1 to D:
   • Choose a non-negative integer N with P[N = n] = 1/p^(n+1).
   • Choose N vectors ω1, ..., ωN ∈ {−1, 1}ᵈ, selecting each coordinate using fair coin tosses.
   • Let the feature map Zi : x → √(aN p^(N+1)) ∏_{j=1..N} ωjᵀx
3. Z : x → (1/√D) (Z1(x), ..., ZD(x))
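A sketch of this recipe applied to one convenient dot-product kernel, K(x, y) = exp(⟨x, y⟩), whose Maclaurin coefficients are an = 1/n!; fixing p = 2 so that P[N = n] = 1/2^(n+1) sums to one is an assumption made for the illustration:

```python
# Sketch: the random Maclaurin feature map above, applied to the dot-product
# kernel K(x, y) = exp(<x, y>), whose Maclaurin coefficients are a_n = 1/n!.
# p is fixed to 2 so that P[N = n] = 1 / 2^(n+1) is a proper distribution.
import numpy as np
from math import factorial

def random_maclaurin_features(X, D, seed=0):
    rng = np.random.default_rng(seed)
    n_samples, d = X.shape
    Z = np.zeros((n_samples, D))
    for i in range(D):
        N = rng.geometric(0.5) - 1                     # P[N = n] = 1 / 2^(n+1)
        a_N = 1.0 / factorial(N)                       # Maclaurin coefficient of exp
        omegas = rng.choice([-1.0, 1.0], size=(N, d))  # N Rademacher vectors
        Z[:, i] = np.sqrt(a_N * 2.0 ** (N + 1)) * np.prod(X @ omegas.T, axis=1)
    return Z / np.sqrt(D)

rng = np.random.default_rng(3)
X = 0.3 * rng.normal(size=(5, 4))
Z = random_maclaurin_features(X, D=20000)
exact = np.exp(X @ X.T)
print(np.max(np.abs(exact - Z @ Z.T)))                 # shrinks as D grows
```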
38. SVM: Implementation Summary
Using these approximations, we can now treat this as a linear SVM
problem.
(1) Job 1 – compute stats for each feature and class (mean, variance, class cardinality, etc.)
(2) Job 2 – transform the samples with the approximate kernel map and compute stats for the new feature space.
(3) Job 3 – randomly distribute the new samples and train the model in the reducer.
We can use map-reduce to solve non-linear multi-class SVM classification!
39. SVM: Implementation examples
§ SVM used by a large entertainment company for customer segmentation
• Web logs containing browsing information mined for customer attributes
like gender and age
• Raw Omniture logs stored in Hadoop
• Models built on ~10 billion rows and 1 million features
• Models used to improve the inventory value of the company’s web properties for publishers