Copyright Vasant Honavar, 2006.
Iowa State University Department of Computer Science
Artificial Intelligence Research Laboratory
MACHINE LEARNING
Vasant Honavar
Artificial Intelligence Research Laboratory
Department of Computer Science
Bioinformatics and Computational Biology Program
Center for Computational Intelligence, Learning, & Discovery
Iowa State University
honavar@cs.iastate.edu
www.cs.iastate.edu/~honavar/
www.cild.iastate.edu/
Maximum margin classifiers
• Discriminative classifiers that maximize the margin of
separation
• Support Vector Machines
– Feature spaces
– Kernel machines
– VC Theory and generalization bounds
– Maximum margin classifiers
• Comparisons with conditionally trained discriminative
classifiers
Perceptrons revisited
Perceptrons
• Can only compute threshold functions
• Can only represent linear decision surfaces
• Can only learn to classify linearly separable training
data
How can we deal with data that are not linearly separable?
• Map data into a typically higher dimensional feature
space where the classes become separable
Two problems must be solved:
• Computational problem of working with high
dimensional feature space
• Overfitting problem in high dimensional feature spaces
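The idea of mapping into a feature space where the classes become separable can be illustrated with a toy sketch (not from the original slides; the data values and the feature map φ(x) = (x, x²) are my own choices):

```python
import numpy as np

# 1-D data: class +1 sits in the middle, class -1 on both sides,
# so no single threshold on x separates them.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# Feature map phi(x) = (x, x^2): in this 2-D space the classes
# are separated by the horizontal line x^2 = 2.5.
phi = np.column_stack([x, x ** 2])

w, b = np.array([0.0, -1.0]), 2.5          # hyperplane <w, phi(x)> + b = 0
pred = np.sign(phi @ w + b)
print(pred == y)                           # all True: separable in feature space
```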
Maximum margin model
We cannot outperform Bayes optimal classifier when
• The generative model assumed is correct
• The data set is large enough to ensure reliable
estimation of parameters of the models
But in practice, discriminative models may be better than generative models because
• The correct generative model is seldom known
• The data set is often simply not large enough
Maximum margin classifiers are a class of discriminative classifiers designed to circumvent the overfitting problem
Extending Linear Classifiers – Feature spaces
Map data into a feature space where they are linearly
separable
$\varphi: X \to \Phi, \qquad \mathbf{x} \mapsto \varphi(\mathbf{x})$
Linear Separability in Feature spaces
The original input space can always be mapped to some
higher-dimensional feature space where the training data
become separable:
Φ: x → φ(x)
Learning in the Feature Spaces
High-dimensional feature spaces
$\mathbf{x} = (x_1, \ldots, x_n) \rightarrow \varphi(\mathbf{x}) = \left(\varphi_1(\mathbf{x}), \ldots, \varphi_d(\mathbf{x})\right)$
where typically d ≫ n, solve the problem of expressing complex functions.
But this introduces a
• computational problem (working with very large vectors)
• generalization problem (curse of dimensionality)
SVMs offer an elegant solution to both problems
The Perceptron Algorithm Revisited
• The perceptron works by adding misclassified positive examples to, or subtracting misclassified negative examples from, an arbitrary initial weight vector, which (without loss of generality) we take to be the zero vector.
• The final weight vector is therefore a linear combination of training points:
$\mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i$
where, since the sign of the coefficient of $\mathbf{x}_i$ is given by the label $y_i$, the $\alpha_i$ are positive values, proportional to the number of times misclassification of $\mathbf{x}_i$ has caused the weight vector to be updated. $\alpha_i$ is called the embedding strength of the pattern $\mathbf{x}_i$.
Dual Representation
The decision function can be rewritten as:
$h(\mathbf{x}) = \mathrm{sgn}\left(\langle \mathbf{w}, \mathbf{x}\rangle + b\right) = \mathrm{sgn}\left(\sum_{j=1}^{l} \alpha_j y_j \langle \mathbf{x}_j, \mathbf{x}\rangle + b\right)$
The update rule is: on training example $(\mathbf{x}_i, y_i)$,
if $y_i\left(\sum_{j=1}^{l} \alpha_j y_j \langle \mathbf{x}_j, \mathbf{x}_i\rangle + b\right) \le 0$ then $\alpha_i \leftarrow \alpha_i + \eta$
WLOG, we can take $\eta = 1$.
Implications of Dual Representation
When Linear Learning Machines are represented in the dual form
$h(\mathbf{x}_i) = \mathrm{sgn}\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) = \mathrm{sgn}\left(\sum_{j=1}^{l} \alpha_j y_j \langle \mathbf{x}_j, \mathbf{x}_i\rangle + b\right)$
the data appear only inside dot products (in the decision function and in the training algorithm).
The matrix $G = \left(\langle \mathbf{x}_i, \mathbf{x}_j\rangle\right)_{i,j=1}^{l}$, which is the matrix of pairwise dot products between training samples, is called the Gram matrix.
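As a small illustration (not part of the original slides; the sample values are arbitrary), the Gram matrix can be computed in a couple of lines:

```python
import numpy as np

# l training samples as rows of X (illustrative values)
X = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 1.0]])

# Gram matrix G[i, j] = <x_i, x_j>
G = X @ X.T
print(G)
# Symmetric and positive semi-definite by construction
print(np.allclose(G, G.T), np.all(np.linalg.eigvalsh(G) >= -1e-10))
```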
Implicit Mapping to Feature Space
Kernel Machines
• Solve the computational problem of working with
many dimensions
• Can make it possible to use infinite dimensions
efficiently
• Offer other advantages, both practical and conceptual
Kernel-Induced Feature Spaces
$f(\mathbf{x}_i) = \langle \mathbf{w}, \varphi(\mathbf{x}_i)\rangle + b$
$h(\mathbf{x}_i) = \mathrm{sgn}\left(f(\mathbf{x}_i)\right)$
where $\varphi: X \to \Phi$ is a non-linear map from input space to feature space.
In the dual representation, the data points only appear inside dot products:
$h(\mathbf{x}_i) = \mathrm{sgn}\left(\langle \mathbf{w}, \varphi(\mathbf{x}_i)\rangle + b\right) = \mathrm{sgn}\left(\sum_{j=1}^{l} \alpha_j y_j \langle \varphi(\mathbf{x}_j), \varphi(\mathbf{x}_i)\rangle + b\right)$
Kernels
• Kernel function returns the value of the dot product
between the images of the two arguments
• When using kernels, the dimensionality of the feature
space Φ is not necessarily important because of the
special properties of kernel functions
• Given a function K, it is possible to verify that it is a
kernel (we will return to this later)
$K(\mathbf{x}_1, \mathbf{x}_2) = \langle \varphi(\mathbf{x}_1), \varphi(\mathbf{x}_2)\rangle$
Kernel Machines
We can use the perceptron learning algorithm in the feature space by taking its dual representation and replacing dot products with kernels:
$\langle \mathbf{x}_1, \mathbf{x}_2\rangle \leftarrow K(\mathbf{x}_1, \mathbf{x}_2) = \langle \varphi(\mathbf{x}_1), \varphi(\mathbf{x}_2)\rangle$
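A minimal sketch of this kernelized dual perceptron (my illustration, not the original authors' code; the Gaussian kernel, its width, the toy data set, and the fixed number of epochs are assumptions):

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    # Gaussian kernel K(x, z) = exp(-||x - z||^2 / sigma)
    return np.exp(-np.sum((x - z) ** 2) / sigma)

def kernel_perceptron(X, y, kernel, epochs=20, b=0.0):
    l = len(y)
    alpha = np.zeros(l)                      # embedding strengths
    K = np.array([[kernel(X[i], X[j]) for j in range(l)] for i in range(l)])
    for _ in range(epochs):
        for i in range(l):
            # dual decision value for x_i; update on a mistake (eta = 1)
            if y[i] * (np.sum(alpha * y * K[:, i]) + b) <= 0:
                alpha[i] += 1.0
    return alpha

# XOR-like toy data (not linearly separable in the input space)
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])
alpha = kernel_perceptron(X, y, rbf)
K = np.array([[rbf(a, c) for c in X] for a in X])
print(np.sign(K @ (alpha * y)))              # matches y on the training set
```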
The Kernel Matrix
$K = \begin{pmatrix} K(1,1) & K(1,2) & K(1,3) & \cdots & K(1,l) \\ K(2,1) & K(2,2) & K(2,3) & \cdots & K(2,l) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ K(l,1) & K(l,2) & K(l,3) & \cdots & K(l,l) \end{pmatrix}$
Kernel matrix is the Gram matrix in the feature space Φ
(the matrix of pair-wise dot products between feature vectors
corresponding to the training samples)
Properties of Kernel Matrices
It is easy to show that the Gram matrix (and hence the kernel matrix) is
• a square matrix
• symmetric ($K^T = K$)
• positive semi-definite (all eigenvalues of K are non-negative; recall that the eigenvalues of a square matrix A are the values of λ that satisfy $|A - \lambda I| = 0$)
Conversely, any symmetric positive semi-definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some feature space Φ:
$K(\mathbf{x}_i, \mathbf{x}_j) = \langle \varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j)\rangle$
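These properties are easy to check numerically. The sketch below (an illustration added here, with an RBF kernel and random points as assumptions) builds a kernel matrix and confirms symmetry and non-negative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # 20 random points in R^3

# RBF kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / sigma)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

print(np.allclose(K, K.T))                   # symmetric
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)               # positive semi-definite (up to round-off)
```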
Mercer’s Theorem: Characterization of Kernel Functions
A function K : X×X → ℜ is said to be (finitely) positive semi-definite if
• K is a symmetric function: K(x, y) = K(y, x)
• Matrices formed by restricting K to any finite subset of the space X are positive semi-definite
Characterization of Kernel Functions
Every (finitely) positive semi-definite, symmetric function is a kernel: i.e., there exists a mapping ϕ such that it is possible to write
$K(\mathbf{x}_i, \mathbf{x}_j) = \langle \varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j)\rangle$
Properties of Kernel Functions - Mercer’s Theorem
Eigenvalues of the Gram matrix define an expansion of kernels:
$K(\mathbf{x}_i, \mathbf{x}_j) = \langle \varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j)\rangle = \sum_{l} \lambda_l\, \varphi_l(\mathbf{x}_i)\, \varphi_l(\mathbf{x}_j)$
That is, the eigenvalues act as features!
Recall that the eigenvalues of a square matrix A correspond to solutions of $|A - \lambda I| = 0$, where $I$ is the identity matrix.
Examples of Kernels
Simple examples of kernels:
Polynomial kernel: $K(\mathbf{x}, \mathbf{z}) = \langle \mathbf{x}, \mathbf{z}\rangle^d$
Gaussian kernel: $K(\mathbf{x}, \mathbf{z}) = e^{-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{\sigma}}$
Example: Polynomial Kernels
$\mathbf{x} = (x_1, x_2); \qquad \mathbf{z} = (z_1, z_2)$
$\langle \mathbf{x}, \mathbf{z}\rangle^2 = (x_1 z_1 + x_2 z_2)^2$
$= x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 z_1 x_2 z_2$
$= \langle (x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2),\ (z_1^2,\ z_2^2,\ \sqrt{2}\,z_1 z_2)\rangle$
$= \langle \varphi(\mathbf{x}), \varphi(\mathbf{z})\rangle$
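A quick numerical check of this identity (the particular x and z values below are illustrative, not from the slides):

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Kernel evaluation in the input space
k_val = (x @ z) ** 2

# Explicit feature map phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)
phi = lambda v: np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])
print(k_val, phi(x) @ phi(z), np.isclose(k_val, phi(x) @ phi(z)))
```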
Making Kernels – Closure Properties
The set of kernels is closed under some operations. If
K1 , K2 are kernels over X×X, then the following are kernels:
We can make complex kernels from simple ones: modularity!
• $K(\mathbf{x}, \mathbf{z}) = K_1(\mathbf{x}, \mathbf{z}) + K_2(\mathbf{x}, \mathbf{z})$
• $K(\mathbf{x}, \mathbf{z}) = a\, K_1(\mathbf{x}, \mathbf{z})$, with $a > 0$
• $K(\mathbf{x}, \mathbf{z}) = K_1(\mathbf{x}, \mathbf{z})\, K_2(\mathbf{x}, \mathbf{z})$
• $K(\mathbf{x}, \mathbf{z}) = f(\mathbf{x})\, f(\mathbf{z})$, where $f: X \to \mathbb{R}$
• $K(\mathbf{x}, \mathbf{z}) = K_3(\varphi(\mathbf{x}), \varphi(\mathbf{z}))$, where $\varphi: X \to \mathbb{R}^N$ and $K_3$ is a kernel over $\mathbb{R}^N \times \mathbb{R}^N$
• $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T B \mathbf{z}$, where $X \subseteq \mathbb{R}^n$ and $B$ is a symmetric positive definite matrix
Kernels
• We can define Kernels over arbitrary instance spaces
including
• Finite dimensional vector spaces,
• Boolean spaces
• ∑* where ∑ is a finite alphabet
• Documents, graphs, etc.
• Kernels need not always be expressed by a closed form
formula.
• Many useful kernels can be computed by complex
algorithms (e.g., diffusion kernels over graphs).
Kernels on Sets and Multi-Sets
Let $X \subseteq 2^V$ for some fixed domain $V$.
Let $A_1, A_2 \in X$.
Then $K(A_1, A_2) = 2^{|A_1 \cap A_2|}$ is a kernel.
Exercise: Define a Kernel on the space of multi-
sets whose elements are drawn from a finite
domain V
String Kernel (p-spectrum Kernel)
• The p-spectrum of a string is the histogram vector of the number of occurrences of all possible contiguous substrings of length p
• We can define a kernel function K(s, t) over ∑* × ∑* as the inner product of the p-spectra of s and t
Example (p = 3):
s = statistics
t = computation
Common substrings: tat, ati
K(s, t) = 2
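A minimal sketch of this p-spectrum kernel (the strings are those of the example above; the implementation details are mine, not the original authors'):

```python
from collections import Counter

def p_spectrum(s, p):
    # Histogram of all contiguous substrings of length p
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def spectrum_kernel(s, t, p):
    # Inner product of the two p-spectra
    hs, ht = p_spectrum(s, p), p_spectrum(t, p)
    return sum(hs[u] * ht[u] for u in hs)

print(spectrum_kernel("statistics", "computation", 3))  # 2 ("tat" and "ati")
```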
Kernel Machines
Kernel Machines are Linear Learning Machines that:
• use a dual representation
• operate in a kernel-induced feature space (that is, they learn a linear function in the feature space implicitly defined by the Gram matrix K of the data set)
$h(\mathbf{x}_i) = \mathrm{sgn}\left(\langle \mathbf{w}, \varphi(\mathbf{x}_i)\rangle + b\right) = \mathrm{sgn}\left(\sum_{j=1}^{l} \alpha_j y_j \langle \varphi(\mathbf{x}_j), \varphi(\mathbf{x}_i)\rangle + b\right)$
Kernels – the good, the bad, and the ugly
Bad kernel – A kernel whose Gram (kernel) matrix is
mostly diagonal -- all data points are orthogonal to
each other, and hence the machine is unable to detect
hidden structure in the data
$K = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$
Kernels – the good, the bad, and the ugly
Good kernel – Corresponds to a Gram (kernel) matrix in
which subsets of data points belonging to the same
class are similar to each other, and hence the machine
can detect hidden structure in the data
(Example Gram matrix with block structure: entries are large within Class 1 and within Class 2, and zero between the two classes.)
Kernels – the good, the bad, and the ugly
• If the data are mapped into a space with too many irrelevant features, the kernel matrix becomes (nearly) diagonal
• We therefore need some prior knowledge of the target to choose a good kernel
Learning in the Feature Spaces
High-dimensional feature spaces
$\mathbf{x} = (x_1, \ldots, x_n) \rightarrow \varphi(\mathbf{x}) = \left(\varphi_1(\mathbf{x}), \ldots, \varphi_d(\mathbf{x})\right)$
where typically d ≫ n, solve the problem of expressing complex functions.
But this introduces a
• computational problem (working with very large vectors)
 – solved by the kernel trick – implicit computation of dot products in kernel-induced feature spaces via dot products in the input space
• generalization problem (curse of dimensionality)
The Generalization Problem
• The curse of dimensionality
• It is easy to overfit in high dimensional spaces
• The learning problem is ill-posed
• There are infinitely many hyperplanes that
separate the training data
• Need a principled approach to choose an optimal
hyperplane
The Generalization Problem
“Capacity” of the machine – ability to learn any
training set without error – related to VC dimension
Excellent memory is not an asset when it comes to
learning from limited data
“A machine with too much capacity is like a botanist
with a photographic memory who, when presented
with a new tree, concludes that it is not a tree
because it has a different number of leaves from
anything she has seen before; a machine with too
little capacity is like the botanist’s lazy brother,
who declares that if it’s green, it’s a tree”
C. Burges
History of Key Developments leading to SVM
1958 Perceptron (Rosenblatt)
1963 Margin (Vapnik)
1964 Kernel Trick (Aizerman)
1965 Optimization formulation (Mangasarian)
1971 Kernels (Wahba)
1992-1994 SVM (Vapnik)
1996 – present Rapid growth, numerous applications
Extensions to other problems
A Little Learning Theory
Suppose:
• We are given l training examples $(\mathbf{x}_i, y_i)$
• Train and test points are drawn randomly (i.i.d.) from some unknown probability distribution $D(\mathbf{x}, y)$
• The machine learns the mapping $\mathbf{x}_i \to y_i$ and outputs a hypothesis $h(\mathbf{x}, \alpha, b)$. A particular choice of $(\alpha, b)$ generates a "trained machine"
• The expectation of the test error, or expected risk, is
$R(\alpha, b) = \int \frac{1}{2}\left|y - h(\mathbf{x}, \alpha, b)\right|\, dD(\mathbf{x}, y)$
A Bound on the Generalization Performance
The empirical risk is:
$R_{emp}(\alpha, b) = \frac{1}{2l}\sum_{i=1}^{l}\left|y_i - h(\mathbf{x}_i, \alpha, b)\right|$
Choose some δ such that 0 < δ < 1. With probability 1 − δ the following bound – the risk bound of h(x, α) under distribution D – holds (Vapnik, 1995):
$R(\alpha, b) \le R_{emp}(\alpha, b) + \sqrt{\frac{d\left(\log\frac{2l}{d} + 1\right) - \log\frac{\delta}{4}}{l}}$
where $d \ge 0$ is called the VC dimension and is a measure of the "capacity" of the machine.
A Bound on the Generalization Performance
The second term in the right-hand side is called VC
confidence.
Three key points about the actual risk bound:
• It is independent of D(x,y)
• It is usually not possible to compute the left hand
side.
• If we know d, we can compute the right hand side.
The risk bound gives us a way to compare learning
machines!
The VC Dimension
Definition: The VC dimension of a set of functions $H = \{h(\mathbf{x}, \alpha, b)\}$ is d if and only if there exists a set of d data points such that each of the $2^d$ possible labelings (dichotomies) of the d data points can be realized using some member of H, but no set of q > d points satisfies this property.
The VC Dimension
VC dimension of H is size of largest subset of X
shattered by H. VC dimension measures the capacity
of a set H of hypotheses (functions).
If for any number N it is possible to find N points $\mathbf{x}_1, \ldots, \mathbf{x}_N$ that can be separated in all the $2^N$ possible ways, we say that the VC dimension of the set is infinite.
The VC Dimension Example
• Suppose that the data live in the space $\mathbb{R}^2$, and the set $\{h(\mathbf{x}, \alpha)\}$ consists of oriented straight lines (linear discriminants).
• It is possible to find three points that can be shattered by this set of functions.
• It is not possible to find four.
• So the VC dimension of the set of linear discriminants in $\mathbb{R}^2$ is three.
The VC Dimension of Hyperplanes
Theorem 1: Consider some set of m points in $\mathbb{R}^n$. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent.
Corollary: The VC dimension of the set of oriented hyperplanes in $\mathbb{R}^n$ is n+1, since we can always choose n+1 points, and then choose one of the points as origin, such that the position vectors of the remaining points are linearly independent, but we can never choose n+2 such points.
VC Dimension
VC dimension can be infinite even when the number of parameters of the set $\{h(\mathbf{x}, \alpha)\}$ of hypothesis functions is low.
Example:
$h(x, \alpha) \equiv \mathrm{sgn}(\sin(\alpha x)), \quad x, \alpha \in \mathbb{R}$
For any integer l and any labels $y_1, \ldots, y_l$, $y_i \in \{-1, 1\}$, we can find l points $x_1, \ldots, x_l$ and a parameter α such that those points can be shattered by $h(x, \alpha)$.
Those points are:
$x_i = 10^{-i}, \quad i = 1, \ldots, l$
and the parameter α is:
$\alpha = \pi\left(1 + \sum_{i=1}^{l}\frac{(1 - y_i)\,10^i}{2}\right)$
Vapnik-Chervonenkis (VC) Dimension
• Let H be a hypothesis class over an instance space X.
• Both H and X may be infinite.
• We need a way to describe the behavior of H on a finite
set of points S ⊆ X.
• For any concept class H over X, and any S ⊆ X,
$\Pi_H(S) = \{h \cap S : h \in H\}$
• Equivalently, with a little abuse of notation, writing $S = \{X_1, X_2, \ldots, X_m\}$, we can write
$\Pi_H(S) = \{(h(X_1), \ldots, h(X_m)) : h \in H\}$
$\Pi_H(S)$ is the set of all dichotomies or behaviors on S that are induced or realized by H.
Vapnik-Chervonenkis (VC) Dimension
• If $\Pi_H(S) = \{0, 1\}^m$ where $|S| = m$, or equivalently $|\Pi_H(S)| = 2^m$, we say that S is shattered by H.
• A set S of instances is said to be shattered by a hypothesis class H if and only if for every dichotomy of S, there exists a hypothesis in H that is consistent with the dichotomy.
VC Dimension of a hypothesis class
Definition: The VC-dimension V(H), of a hypothesis
class H defined over an instance space X is the
cardinality d of the largest subset of X that is
shattered by H. If arbitrarily large finite subsets of X
can be shattered by H, V(H)=∞
How can we show that V(H) is at least d?
• Find a set of cardinality at least d that is shattered by
H.
How can we show that V(H) = d?
• Show that V(H) is at least d and no set of cardinality
(d+1) can be shattered by H.
VC Dimension of a Hypothesis Class - Examples
• Example: Let the instance space X be the 2-dimensional Euclidean space. Let the hypothesis space H be the set of axis-parallel rectangles in the plane.
• V(H)=4 (there exists a set of 4 points that can be
shattered by a set of axis parallel rectangles but no set of
5 points can be shattered by H).
Some Useful Properties of VC Dimension
• If $H_1 \subseteq H_2$, then $V(H_1) \le V(H_2)$
• If H is a finite concept class, then $V(H) \le \lg|H|$
• If $\bar{H} = \{X - h : h \in H\}$, then $V(\bar{H}) = V(H)$
• If $H = H_1 \cup H_2$, then $V(H) \le V(H_1) + V(H_2) + 1$
• If H is formed by a union or intersection of l concepts from $H_1$, then $V(H) = O\left(V(H_1)\, l \lg l\right)$
• If $V(H) = d$ and $\Pi_H(m) = \max\{|\Pi_H(S)| : |S| = m\}$, then $\Pi_H(m) \le \Phi_d(m)$, where $\Phi_d(m) = 2^m$ if $m < d$ and $\Phi_d(m) = O(m^d)$ if $m \ge d$
Proof: Left as an exercise
Risk Bound
What is VC dimension and empirical risk of the nearest
neighbor classifier?
Any number of points, labeled arbitrarily, can be shattered by a nearest-neighbor classifier; thus $d = \infty$ and the empirical risk is 0.
So the bound provides no useful information in this case.
Minimizing the Bound by Minimizing d
• VC confidence (second term) depends on d/l
• One should choose that learning machine with minimal d
• For large values of d/l the bound is not tight
With δ = 0.05:
$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{d\left(\log\frac{2l}{d} + 1\right) - \log\frac{\delta}{4}}{l}}$
Bounds on Error of Classification
The classification error
• depends on the VC dimension of the hypothesis class
• is independent of the dimensionality of the feature space
Vapnik proved that the error ε of a classification function h for separable data sets is
$\varepsilon = O\!\left(\frac{d}{l}\right)$
where d is the VC dimension of the hypothesis class and l is the number of training examples.
Structural Risk Minimization
Finding a learning machine with the minimum upper bound
on the actual risk
• leads us to a method of choosing an optimal machine for a
given task
• essential idea of the structural risk minimization (SRM).
Let $H_1 \subset H_2 \subset H_3 \subset \ldots$ be a sequence of nested subsets of hypotheses whose VC dimensions satisfy d1 < d2 < d3 < …
• SRM consists of finding that subset of functions which minimizes the upper bound on the actual risk
• SRM involves training a series of classifiers, and choosing a classifier from the series for which the sum of empirical risk and VC confidence is minimal.
Margin Based Bounds on Error of Classification
Margin-based bound:
$\varepsilon = O\!\left(\frac{1}{l}\left(\frac{L}{\gamma}\right)^2\right)$
where
$f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x}\rangle + b, \qquad \gamma = \min_i \frac{y_i f(\mathbf{x}_i)}{\|\mathbf{w}\|}, \qquad L = \max_i \|\mathbf{x}_i\|$
Important insight
The error of the classifier trained on a separable data set is inversely proportional to its margin, and is independent of the dimensionality of the input space!
Maximal Margin Classifier
The bounds on error of classification suggest the
possibility of improving generalization by maximizing
the margin
Minimize the risk of overfitting by choosing the
maximal margin hyperplane in feature space
SVMs control capacity by increasing the margin, not
by reducing the number of features
Margin
(Figure: linearly separable points labeled +1 and −1, with the separating hyperplane determined by w and b.)
Linear separation of the input space:
$f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x}\rangle + b$
$h(\mathbf{x}) = \mathrm{sign}\left(f(\mathbf{x})\right)$
Functional and Geometric Margin
• The functional margin of a linear discriminant (w, b) w.r.t. a labeled pattern $(\mathbf{x}_i, y_i) \in \mathbb{R}^d \times \{-1, 1\}$ is defined as
$\gamma_i \equiv y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right)$
• If the functional margin is negative, then the pattern is incorrectly classified; if it is positive, then the classifier predicts the correct label.
• The larger $|\gamma_i|$, the further away $\mathbf{x}_i$ is from the discriminant.
• This is made more precise by the notion of the geometric margin
$\tilde{\gamma}_i \equiv \frac{\gamma_i}{\|\mathbf{w}\|}$
which measures the Euclidean distance of a point from the decision boundary.
Geometric Margin
(Figure: the geometric margins $\gamma_i$ and $\gamma_j$ of two points, $X_i \in S^+$ and $X_j \in S^-$.)
Geometric Margin
(Figure: points $X_i \in S^+$ and $X_j \in S^-$ with geometric margins $\gamma_i$ and $\gamma_j$.)
Example: let $\mathbf{w} = (1, 1)$, $b = -1$, and consider the point $X_i = (1, 1)$ with label $Y_i = 1$.
Functional margin: $\gamma_i = Y_i\left(\langle \mathbf{w}, X_i\rangle + b\right) = 1\cdot(1 + 1 - 1) = 1$
Geometric margin: $\frac{\gamma_i}{\|\mathbf{w}\|} = \frac{1}{\sqrt{2}}$
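The same computation in a few lines of NumPy (values taken from the example above):

```python
import numpy as np

w, b = np.array([1.0, 1.0]), -1.0
x_i, y_i = np.array([1.0, 1.0]), 1

functional_margin = y_i * (w @ x_i + b)                 # 1.0
geometric_margin = functional_margin / np.linalg.norm(w)
print(functional_margin, geometric_margin)               # 1.0, 1/sqrt(2) ~ 0.707
```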
Margin of a training set
The functional margin of a training set:
$\gamma = \min_i \gamma_i$
The geometric margin of a training set:
$\tilde{\gamma} = \min_i \tilde{\gamma}_i$
Maximum Margin Separating Hyperplane
$\gamma \equiv \min_i \gamma_i$ is called the (functional) margin of (w, b) w.r.t. the data set $S = \{(\mathbf{x}_i, y_i)\}$.
The margin of a training set S is the maximum geometric margin over all hyperplanes. A hyperplane realizing this maximum is a maximal margin hyperplane.
VC Dimension of Δ-margin hyperplanes
• If data samples belong to a sphere of radius R, then the set of Δ-margin hyperplanes has VC dimension bounded by
$VC(H) \le \min\left(\frac{R^2}{\Delta^2},\ n\right) + 1$
• WLOG, we can make R = 1 by normalizing the data points of each class so that they lie within a sphere of radius R.
• For large-margin hyperplanes, the VC dimension is controlled independently of the dimensionality n of the feature space.
Maximal Margin Classifier
The bounds on error of classification suggest the
possibility of improving generalization by maximizing
the margin
Minimize the risk of overfitting by choosing the
maximal margin hyperplane in feature space
SVMs control capacity by increasing the margin, not by
reducing the number of features
Maximizing the Margin ⇔ Minimizing ||w||
The definition of the hyperplane (w, b) does not change if we rescale it to (σw, σb), for σ > 0.
The functional margin γ depends on this scaling, but the geometric margin does not.
If we fix (by rescaling) the functional margin to 1, the geometric margin will be equal to 1/||w||.
We can therefore maximize the margin by minimizing the norm ||w||.
Maximizing the Margin ⇔ Minimizing ||w||
Let $\mathbf{x}^+$ and $\mathbf{x}^-$ be the nearest data points on either side of the hyperplane. Then, fixing the functional margin to be 1:
$\langle \mathbf{w}, \mathbf{x}^+\rangle + b = +1$
$\langle \mathbf{w}, \mathbf{x}^-\rangle + b = -1$
$\Rightarrow \langle \mathbf{w}, \mathbf{x}^+ - \mathbf{x}^-\rangle = 2$
$\Rightarrow \left\langle \frac{\mathbf{w}}{\|\mathbf{w}\|},\ \mathbf{x}^+ - \mathbf{x}^-\right\rangle = \frac{2}{\|\mathbf{w}\|}$
Learning as optimization
Minimize
$\langle \mathbf{w}, \mathbf{w}\rangle$
subject to:
$y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) \ge 1$
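One way to see this optimization concretely is to hand it to a generic constrained solver. The sketch below is my illustration (the toy data set and the choice of scipy's SLSQP solver are assumptions); any QP solver would do:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable data set (illustrative values)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

def objective(p):              # p = [w1, w2, b]; minimize <w, w>
    w = p[:2]
    return w @ w

# One inequality constraint per training point: y_i(<w, x_i> + b) - 1 >= 0
constraints = [{'type': 'ineq',
                'fun': (lambda p, xi=xi, yi=yi: yi * (p[:2] @ xi + p[2]) - 1)}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method='SLSQP')
w, b = res.x[:2], res.x[2]
print(w, b, np.sign(X @ w + b))            # separates the training points
```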
Constrained Optimization
• Primal optimization problem
• Given functions f, g_i (i = 1…k) and h_j (j = 1…m), defined on a domain Ω ⊆ ℝⁿ:
minimize $f(\mathbf{w})$, $\mathbf{w} \in \Omega$  (objective function)
subject to $g_i(\mathbf{w}) \le 0,\ i = 1, \ldots, k$  (inequality constraints)
$h_j(\mathbf{w}) = 0,\ j = 1, \ldots, m$  (equality constraints)
Shorthand:
$g(\mathbf{w}) \le 0$ denotes $g_i(\mathbf{w}) \le 0,\ i = 1, \ldots, k$
$h(\mathbf{w}) = 0$ denotes $h_j(\mathbf{w}) = 0,\ j = 1, \ldots, m$
Feasible region: $F = \{\mathbf{w} \in \Omega : g(\mathbf{w}) \le 0,\ h(\mathbf{w}) = 0\}$
Optimization problems
• Linear program – objective function as well as equality
and inequality constraints are linear
• Quadratic program – objective function is quadratic,
and the equality and inequality constraints are linear
• Inequality constraints gi(w) ≤ 0 can be active i.e. gi(w) =
0 or inactive i.e. gi(w) < 0.
• Inequality constraints are often transformed into
equality constraints using slack variables
gi(w) ≤ 0 ↔ gi(w) +ξi = 0 with ξi ≥ 0
• We will be interested primarily in convex optimization
problems
Convex optimization problem
If the function f is convex, any local minimum $\mathbf{w}^*$ of an unconstrained optimization problem with objective function f is also a global minimum, since for any $\mathbf{u} \ne \mathbf{w}^*$, $f(\mathbf{w}^*) \le f(\mathbf{u})$.
A set $\Omega \subset \mathbb{R}^n$ is called convex if, for all $\mathbf{w}, \mathbf{u} \in \Omega$ and any $\theta \in (0, 1)$, the point $(\theta\mathbf{w} + (1 - \theta)\mathbf{u}) \in \Omega$.
A convex optimization problem is one in which the set Ω, the objective function, and all the constraints are convex.
Lagrangian Theory
Given an optimization problem with an objective function f(w) and equality constraints h_j(w) = 0, j = 1…m, we define the Lagrangian as
$L(\mathbf{w}, \boldsymbol{\beta}) = f(\mathbf{w}) + \sum_{j=1}^{m} \beta_j h_j(\mathbf{w})$
where the β_j are called the Lagrange multipliers. The necessary conditions for w* to be a minimum of f(w) subject to the constraints h_j(w) = 0, j = 1…m, are
$\frac{\partial L(\mathbf{w}^*, \boldsymbol{\beta}^*)}{\partial \mathbf{w}} = 0, \qquad \frac{\partial L(\mathbf{w}^*, \boldsymbol{\beta}^*)}{\partial \boldsymbol{\beta}} = 0$
The conditions are sufficient if $L(\mathbf{w}, \boldsymbol{\beta}^*)$ is a convex function of w.
Lagrangian Theory
Example
Minimize $f(x, y) = x + 2y$
Subject to: $x^2 + y^2 - 4 = 0$
$L(x, y, \lambda) = x + 2y + \lambda\left(x^2 + y^2 - 4\right)$
$\frac{\partial L(x, y, \lambda)}{\partial x} = 1 + 2\lambda x = 0$
$\frac{\partial L(x, y, \lambda)}{\partial y} = 2 + 2\lambda y = 0$
$\frac{\partial L(x, y, \lambda)}{\partial \lambda} = x^2 + y^2 - 4 = 0$
Solving the above, we have: $\lambda = \pm\frac{\sqrt{5}}{4}$, $(x, y) = \left(\mp\frac{2}{\sqrt{5}},\ \mp\frac{4}{\sqrt{5}}\right)$
f(x, y) is minimized when $(x, y) = \left(-\frac{2}{\sqrt{5}},\ -\frac{4}{\sqrt{5}}\right)$, giving $f = -2\sqrt{5}$.
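The stationarity conditions above can be checked mechanically with a computer algebra system. A sketch using SymPy (my addition; the symbol names are arbitrary):

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = x + 2 * y
g = x ** 2 + y ** 2 - 4                      # equality constraint g = 0
L = f + lam * g

# Solve dL/dx = 0, dL/dy = 0, g = 0
sols = sp.solve([sp.diff(L, x), sp.diff(L, y), g], [x, y, lam], dict=True)
for s in sols:
    print(s, '  f =', sp.simplify(f.subs(s)))
# The solution with f = -2*sqrt(5) is the constrained minimum.
```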
Lagrangian theory – example
Find the lengths u, v, w of the sides of the box that has the largest volume for a given surface area c:
minimize $-uvw$
subject to $uv + vw + wu = \frac{c}{2}$
$L = -uvw + \beta\left(uv + vw + wu - \frac{c}{2}\right)$
$\frac{\partial L}{\partial w} = -uv + \beta(u + v) = 0$
$\frac{\partial L}{\partial u} = -vw + \beta(v + w) = 0$
$\frac{\partial L}{\partial v} = -uw + \beta(u + w) = 0$
Subtracting the equations pairwise gives u = v = w, and the constraint then yields
$u = v = w = \sqrt{\frac{c}{6}}$
Lagrangian Optimization - Example
• The entropy of a probability distribution p = (p1, …, pn) over a finite set {1, 2, …, n} is defined as
$H(\mathbf{p}) = -\sum_{i=1}^{n} p_i \log_2 p_i$
• The maximum entropy distribution can be found by minimizing −H(p) subject to the constraints
$\sum_{i=1}^{n} p_i = 1, \qquad p_i \ge 0\ \ \forall i$
Generalized Lagrangian Theory
Given an optimization problem with domain Ω ⊆ ℜn:
minimize $f(\mathbf{w})$, $\mathbf{w} \in \Omega$  (objective function)
subject to $g_i(\mathbf{w}) \le 0,\ i = 1, \ldots, k$  (inequality constraints)
$h_j(\mathbf{w}) = 0,\ j = 1, \ldots, m$  (equality constraints)
where f is convex, and the g_i and h_j are affine, we can define the generalized Lagrangian function as:
$L(\mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(\mathbf{w}) + \sum_{i=1}^{k} \alpha_i g_i(\mathbf{w}) + \sum_{j=1}^{m} \beta_j h_j(\mathbf{w})$
An affine function is a linear function plus a translation:
F(x) is affine if F(x) = G(x) + b, where G(x) is a linear function of x and b is a constant.
Generalized Lagrangian Theory: KKT Conditions
Given an optimization problem with domain Ω ⊆ ℜn:
minimize $f(\mathbf{w})$, $\mathbf{w} \in \Omega$  (objective function)
subject to $g_i(\mathbf{w}) \le 0,\ i = 1, \ldots, k$  (inequality constraints)
$h_j(\mathbf{w}) = 0,\ j = 1, \ldots, m$  (equality constraints)
where f is convex, and the g_i and h_j are affine, the necessary and sufficient conditions for w* to be an optimum are the existence of α* and β* such that
$\frac{\partial L(\mathbf{w}^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)}{\partial \mathbf{w}} = 0, \qquad \frac{\partial L(\mathbf{w}^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)}{\partial \boldsymbol{\beta}} = 0$
$\alpha_i^*\, g_i(\mathbf{w}^*) = 0; \quad g_i(\mathbf{w}^*) \le 0; \quad \alpha_i^* \ge 0; \quad i = 1, \ldots, k$
Example:
Minimize $f(x, y) = x^2 - 2x + y^2 - 6y$
Subject to: $x + y \le 2$; $x - y \le 0$
$L(x, y, \lambda_1, \lambda_2) = x^2 - 2x + y^2 - 6y + \lambda_1(x + y - 2) + \lambda_2(x - y)$
$\frac{\partial L}{\partial x} = 2x - 2 + \lambda_1 + \lambda_2 = 0; \qquad \frac{\partial L}{\partial y} = 2y - 6 + \lambda_1 - \lambda_2 = 0$
$\lambda_1 \ge 0;\ \lambda_2 \ge 0;\ \lambda_1(x + y - 2) = 0;\ \lambda_2(x - y) = 0$
Case 1: $\lambda_1 = 0,\ \lambda_2 = 0 \Rightarrow 2x - 2 = 0,\ 2y - 6 = 0 \Rightarrow x = 1,\ y = 3$. This is not a feasible solution because it violates $x + y \le 2$.
Case 2: $\lambda_1 = 0,\ \lambda_2 \ne 0$: similarly, this case does not yield a feasible solution.
Case 3: $\lambda_1 \ne 0,\ \lambda_2 = 0 \Rightarrow 2x - 2 + \lambda_1 = 0;\ 2y - 6 + \lambda_1 = 0;\ x + y - 2 = 0$, which yields $x = 0,\ y = 2,\ \lambda_1 = 2$, which is a feasible solution.
Case 4: $\lambda_1 \ne 0,\ \lambda_2 \ne 0$ does not yield a feasible solution.
Hence, the solution is $x = 0,\ y = 2$.
Dual optimization problem
A primal problem can be transformed into a dual by setting to zero the derivatives of the Lagrangian with respect to the primal variables, and substituting the results back into the Lagrangian to remove the dependence on the primal variables, resulting in
$\theta(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \inf_{\mathbf{w} \in \Omega} L(\mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\beta})$
which contains only dual variables and needs to be maximized under simpler constraints.
The infimum of a set S of real numbers, denoted inf(S), is defined to be the largest real number that is smaller than or equal to every number in S. If no such number exists then inf(S) = −∞.
Maximum Margin Hyperplane
The problem of finding the maximal margin hyperplane is a constrained optimization problem.
Use Lagrangian theory (extended by Karush, Kuhn, and Tucker – KKT).
Lagrangian:
$L_p = \frac{1}{2}\langle \mathbf{w}, \mathbf{w}\rangle - \sum_i \alpha_i\left[y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) - 1\right], \qquad \alpha_i \ge 0$
From Primal to Dual
Minimize $L_p$ with respect to (w, b), requiring that the derivatives of $L_p$ with respect to all the $\alpha_i$ vanish, subject to the constraints $\alpha_i \ge 0$.
Differentiating $L_p$:
$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
$\frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$
Substituting these equality constraints back into $L_p$, we obtain a dual problem.
The Dual Problem
Maximize:
$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j\rangle$
(obtained by substituting $\mathbf{w} = \sum_j \alpha_j y_j \mathbf{x}_j$ and $\sum_i \alpha_i y_i = 0$ back into $L_p$)
Subject to: $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$
b has to be solved for using the primal constraints:
$b = -\frac{1}{2}\left(\max_{i:\, y_i = -1} \langle \mathbf{w}, \mathbf{x}_i\rangle + \min_{i:\, y_i = +1} \langle \mathbf{w}, \mathbf{x}_i\rangle\right)$
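A sketch of solving this dual numerically for a toy separable data set (my illustration; the data and the use of scipy's SLSQP solver are assumptions, and a dedicated QP package would normally be used):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
H = (y[:, None] * y[None, :]) * (X @ X.T)       # H_ij = y_i y_j <x_i, x_j>

neg_dual = lambda a: 0.5 * a @ H @ a - a.sum()  # -L_D (to minimize)
cons = [{'type': 'eq', 'fun': lambda a: a @ y}] # sum_i alpha_i y_i = 0
bounds = [(0, None)] * len(y)                   # alpha_i >= 0

res = minimize(neg_dual, np.zeros(len(y)), bounds=bounds,
               constraints=cons, method='SLSQP')
alpha = res.x
w = (alpha * y) @ X                             # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                               # support vectors
b = np.mean(y[sv] - X[sv] @ w)                  # from y_i(<w, x_i> + b) = 1
print(alpha.round(3), w.round(3), round(b, 3))
```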
Karush-Kuhn-Tucker Conditions
Solving for the SVM is equivalent to finding a solution to the KKT conditions:
$L_p = \frac{1}{2}\langle \mathbf{w}, \mathbf{w}\rangle - \sum_i \alpha_i\left[y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) - 1\right]$
$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
$\frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$
$y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) - 1 \ge 0$
$\alpha_i\left[y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) - 1\right] = 0$
$\alpha_i \ge 0$
Karush-Kuhn-Tucker Conditions for SVM
The KKT conditions state that optimal solutions $\alpha_i,\ (\mathbf{w}, b)$ must satisfy
$\alpha_i\left[y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right) - 1\right] = 0$
Only the training samples $\mathbf{x}_i$ for which the functional margin equals 1 can have nonzero $\alpha_i$. They are called Support Vectors.
The optimal hyperplane can be expressed in the dual representation in terms of this subset of training samples – the support vectors:
$f(\mathbf{x}, \boldsymbol{\alpha}, b) = \sum_{i=1}^{l} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x}\rangle + b = \sum_{i \in \mathrm{sv}} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x}\rangle + b$
So we have, for any support vector $\mathbf{x}_j$ (using the KKT conditions):
$y_j\left(\sum_{i \in SV}\alpha_i y_i \langle\mathbf{x}_i, \mathbf{x}_j\rangle + b\right) = 1 \;\Rightarrow\; \sum_{i \in SV}\alpha_i y_i \langle\mathbf{x}_i, \mathbf{x}_j\rangle = y_j - b$
$\langle\mathbf{w}, \mathbf{w}\rangle = \sum_{j \in SV}\alpha_j y_j \sum_{i \in SV}\alpha_i y_i \langle\mathbf{x}_i, \mathbf{x}_j\rangle = \sum_{j \in SV}\alpha_j y_j (y_j - b) = \sum_{j \in SV}\alpha_j - b\sum_{j \in SV}\alpha_j y_j = \sum_{j \in SV}\alpha_j$
(the last step uses $\sum_j \alpha_j y_j = 0$, which is one of the KKT conditions)
Hence the geometric margin is
$\gamma = \frac{1}{\|\mathbf{w}\|} = \left(\sum_{i \in SV}\alpha_i\right)^{-1/2}$
Support Vector Machines Yield Sparse Solutions
(Figure: maximal margin hyperplane with margin γ; the points lying on the margin are the support vectors.)
Example: XOR
Training set:
X = [−1, −1], y = −1
X = [−1, +1], y = +1
X = [+1, −1], y = +1
X = [+1, +1], y = −1
Kernel: $K(\mathbf{x}, \mathbf{x}_i) = \left(1 + \mathbf{x}^T\mathbf{x}_i\right)^2$
Feature map: $\varphi(\mathbf{x}) = \left[1,\ x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2\right]^T$
Lagrangian:
$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{N}\alpha_i\left[y_i\left(\mathbf{w}^T\varphi(\mathbf{x}_i) + b\right) - 1\right]$
Dual objective:
$Q(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$
Kernel matrix:
$K = \begin{pmatrix} 9 & 1 & 1 & 1 \\ 1 & 9 & 1 & 1 \\ 1 & 1 & 9 & 1 \\ 1 & 1 & 1 & 9 \end{pmatrix}$
Example: XOR
$Q(\boldsymbol{\alpha}) = \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \frac{1}{2}\left(9\alpha_1^2 - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4 + 9\alpha_2^2 + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 + 9\alpha_3^2 - 2\alpha_3\alpha_4 + 9\alpha_4^2\right)$
Setting the derivatives of Q w.r.t. the dual variables to 0:
$9\alpha_1 - \alpha_2 - \alpha_3 + \alpha_4 = 1$
$-\alpha_1 + 9\alpha_2 + \alpha_3 - \alpha_4 = 1$
$-\alpha_1 + \alpha_2 + 9\alpha_3 - \alpha_4 = 1$
$\alpha_1 - \alpha_2 - \alpha_3 + 9\alpha_4 = 1$
$\alpha_{o,i} = \frac{1}{8},\ i = 1, 2, 3, 4; \qquad Q(\boldsymbol{\alpha}_o) = \frac{1}{4}$
Not surprisingly, each of the training samples is a support vector.
$\mathbf{w}_o = \sum_{i=1}^{N}\alpha_{o,i}\, y_i\, \varphi(\mathbf{x}_i) = \left[0,\ 0,\ -\frac{1}{\sqrt{2}},\ 0,\ 0,\ 0\right]^T, \qquad \frac{1}{2}\langle\mathbf{w}_o, \mathbf{w}_o\rangle = \frac{1}{4}$
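The arithmetic above is easy to verify numerically (a sketch added here, not part of the original slides):

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, -1.0])

K = (1 + X @ X.T) ** 2                          # polynomial kernel matrix
H = (y[:, None] * y[None, :]) * K               # H_ij = y_i y_j K_ij

# Setting dQ/dalpha = 0 gives the linear system H alpha = 1
alpha = np.linalg.solve(H, np.ones(4))
print(alpha)                                     # [0.125 0.125 0.125 0.125]

# Decision function f(x) = sum_i alpha_i y_i K(x_i, x); reproduces -x1*x2
f = lambda x: np.sum(alpha * y * (1 + X @ x) ** 2)
print([round(f(x), 3) for x in X])               # [-1, 1, 1, -1] -> matches the labels
```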
Example – Two Spirals
Summary of Maximal margin classifier
• Good generalization when the training data is noise free
and separable in the kernel-induced feature space
• Excessively large Lagrange multipliers often signal outliers
– data points that are most difficult to classify
– SVM can be used for identifying outliers
• Focuses on boundary points. If the boundary points are mislabeled (noisy training data):
– the result is a terrible classifier if the training set is separable
– noisy training data often makes the training set non-separable
– using highly nonlinear kernels to make the training set separable increases the likelihood of overfitting – despite maximizing the margin
Non-Separable Case
• In the case of non-separable data in feature space, the
objective function grows arbitrarily large.
• Solution: Relax the constraint that all training data be
correctly classified but only when necessary to do so
• Cortes and Vapnik (1995) introduced slack variables $\xi_i,\ i = 1, \ldots, l$ in the constraints:
$\xi_i = \max\left(0,\ \gamma - y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right)\right)$
Non-Separable Case
(Figure: non-separable data; points on the wrong side of the margin γ have slack $\xi_i$, $\xi_j$.)
$\xi_i = \max\left(0,\ \gamma - y_i\left(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\right)\right)$
The generalization error can be shown to be bounded by
$\varepsilon \le \frac{1}{l}\left(\frac{R^2 + \sum_i \xi_i^2}{\gamma^2}\right)$
Non-Separable Case
For an error to occur, the corresponding $\xi_i$ must exceed $\gamma$ (which was chosen to be unity), so $\sum_i \xi_i$ is an upper bound on the number of training errors.
The objective function is therefore changed from
$\frac{\|\mathbf{w}\|^2}{2}$ to $\frac{\|\mathbf{w}\|^2}{2} + C\sum_i \xi_i^k$
where C and k are parameters (typically k = 1 or 2, and C is a large positive constant).
The Soft-Margin Classifier
Minimize:
$\frac{1}{2}\langle\mathbf{w}, \mathbf{w}\rangle + C\sum_i \xi_i$
Subject to:
$y_i\left(\langle\mathbf{w}, \mathbf{x}_i\rangle + b\right) \ge 1 - \xi_i$
$\xi_i \ge 0$
Lagrangian (objective function, inequality constraints, and non-negativity constraints on the slack variables):
$L_P = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i\xi_i - \sum_i\alpha_i\left[y_i\left(\langle\mathbf{w}, \mathbf{x}_i\rangle + b\right) - 1 + \xi_i\right] - \sum_i\mu_i\xi_i$
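In practice this formulation is rarely coded by hand; libraries such as scikit-learn implement the soft-margin SVM directly, with C exposed as a hyperparameter. A hedged sketch (the toy data and the particular C values are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy Gaussian blobs: not perfectly separable
X = np.vstack([rng.normal(loc=-1.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=+1.0, scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # Small C -> wide soft margin, many support vectors;
    # large C -> fewer margin violations tolerated.
    print(C, clf.n_support_.sum(), clf.score(X, y))
```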
The Soft-Margin Classifier: KKT Conditions
$\alpha_i \ge 0, \qquad \xi_i \ge 0, \qquad \mu_i \ge 0$
$\alpha_i\left[y_i\left(\langle\mathbf{w}, \mathbf{x}_i\rangle + b\right) - 1 + \xi_i\right] = 0$
$\mu_i \xi_i = 0$
$\alpha_i > 0$ only if
• $\mathbf{x}_i$ is a support vector, i.e., $\langle\mathbf{w}, \mathbf{x}_i\rangle + b = \pm 1$ and hence $\xi_i = 0$, or
• $\mathbf{x}_i$ is misclassified by $(\mathbf{w}, b)$, and hence $\xi_i > 0$.
From primal to dual
$L_P = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i\xi_i - \sum_i\alpha_i\left[y_i\left(\langle\mathbf{w}, \mathbf{x}_i\rangle + b\right) - 1 + \xi_i\right] - \sum_i\mu_i\xi_i$
Equating the derivatives of $L_P$ w.r.t. the primal variables to 0:
$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_i\alpha_i y_i\mathbf{x}_i$
$\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_i\alpha_i y_i = 0$
$\frac{\partial L_P}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i + \mu_i = C \;\Rightarrow\; 0 \le \alpha_i \le C$
Substituting back into $L_P$, we get the dual to be maximized:
$L_D = \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\langle\mathbf{x}_i, \mathbf{x}_j\rangle$
Soft Margin-Dual Lagrangian
Maximize
$L_D = \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\langle\mathbf{x}_i, \mathbf{x}_j\rangle$
Subject to: $0 \le \alpha_i \le C$ and $\sum_i\alpha_i y_i = 0$
KKT conditions:
$\alpha_i\left[y_i\left(\langle\mathbf{w}, \mathbf{x}_i\rangle + b\right) - 1 + \xi_i\right] = 0$
$(C - \alpha_i)\,\xi_i = 0$
Slack variables are non-zero (misclassified training sample) only when $\alpha_i = C$.
For true support vectors, $0 < \alpha_i < C$.
For all other (correctly classified) training samples, $\alpha_i = 0$.
Implementation Techniques
Maximizing a quadratic function, subject to linear equality and inequality constraints:
$W(\boldsymbol{\alpha}) = \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$
$0 \le \alpha_i \le C$
$\sum_i\alpha_i y_i = 0$
$\frac{\partial W(\boldsymbol{\alpha})}{\partial\alpha_i} = 1 - y_i\sum_j\alpha_j y_j K(\mathbf{x}_i, \mathbf{x}_j)$
On-line algorithm for the 1-norm soft margin (bias b=0)
Given training set D
α ← 0
repeat
 for i = 1 to l
  $\alpha_i \leftarrow \alpha_i + \eta\left(1 - y_i\sum_j\alpha_j y_j K(\mathbf{x}_j, \mathbf{x}_i)\right)$, where $\eta = \frac{1}{K(\mathbf{x}_i, \mathbf{x}_i)}$
  if $\alpha_i < 0$ then $\alpha_i \leftarrow 0$
  else if $\alpha_i > C$ then $\alpha_i \leftarrow C$
 end for
until stopping criterion satisfied
return α
In the general setting, b ≠ 0 and we need to solve for b.
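A direct Python transcription of this update loop (a sketch assuming a linear kernel, b = 0 as in the slide, and a fixed number of epochs as the stopping criterion; the data are made up):

```python
import numpy as np

def online_soft_margin(X, y, C=1.0, kernel=lambda a, b: a @ b, epochs=100):
    l = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(l)] for i in range(l)])
    alpha = np.zeros(l)
    for _ in range(epochs):                       # "until stopping criterion"
        for i in range(l):
            eta = 1.0 / K[i, i]
            alpha[i] += eta * (1 - y[i] * np.sum(alpha * y * K[:, i]))
            alpha[i] = np.clip(alpha[i], 0.0, C)  # project onto [0, C]
    return alpha

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = online_soft_margin(X, y, C=10.0)
w = (alpha * y) @ X                               # b = 0 in this setting
print(alpha.round(3), np.sign(X @ w))
```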
Implementation Techniques
Use QP packages (MINOS, LOQO, quadprog from the MATLAB optimization toolbox). They are not online and require that the data be held in memory in the form of the kernel matrix.
Stochastic Gradient Ascent: sequentially update one weight at a time, so it is online. Gives an excellent approximation in most cases:
$\hat{\alpha}_i \leftarrow \alpha_i + \frac{1}{K(\mathbf{x}_i, \mathbf{x}_i)}\left(1 - y_i\sum_j\alpha_j y_j K(\mathbf{x}_i, \mathbf{x}_j)\right)$
Chunking and Decomposition
Given training set D
α ← 0
Select an arbitrary working set $\hat{D} \subset D$
repeat
 solve the optimization problem on $\hat{D}$
 select a new working set from the data not satisfying the KKT conditions
until stopping criterion satisfied
return α
Sequential Minimal Optimization S.M.O.
At each step, SMO optimizes two weights $\alpha_i, \alpha_j$ simultaneously in order not to violate the linear constraint $\sum_i\alpha_i y_i = 0$.
Optimization of the two weights is performed analytically.
Realizes gradient descent without leaving the linear constraint (J. Platt).
Online versions exist (Li, Long; Gentile).
SVM Implementations
• Details of optimization and Quadratic Programming
• SVMlight – one of the first practical implementations of SVM (Joachims)
• Matlab toolboxes for SVM
• LIBSVM – one of the best SVM implementations, by Chih-Chung Chang and Chih-Jen Lin of National Taiwan University http://www.csie.ntu.edu.tw/~cjlin/libsvm/
• WLSVM – LIBSVM integrated with the WEKA machine learning toolbox, by Yasser El-Manzalawy of the ISU AI Lab http://www.cs.iastate.edu/~yasser/wlsvm/
Why does SVM Work?
• The feature space is often very high dimensional. Why
don’t we have the curse of dimensionality?
• A classifier in a high-dimensional space has many
parameters and is hard to estimate
• Vapnik argues that the fundamental problem is not the
number of parameters to be estimated. Rather, the problem
is the capacity of a classifier
• Typically, a classifier with many parameters is very flexible,
but there are also exceptions
– Let $x_i = 10^{-i}$, where i ranges from 1 to n. The classifier $\mathrm{sgn}(\sin(\alpha x))$ can classify all $x_i$ correctly for all possible combinations of class labels on the $x_i$
– This 1-parameter classifier is very flexible
Why does SVM work?
• Vapnik argues that the complexity of a classifier should not be characterized by the number of its parameters, but by its capacity
– This is formalized by the "VC-dimension" of a classifier
• The minimization of ||w||2 subject to the condition that the functional margin equals 1 has the effect of restricting the VC-dimension of the classifier in the feature space
• The SVM performs structural risk minimization: the
empirical risk (training error), plus a term related to the
generalization ability of the classifier, is minimized
• SVM loss function is analogous to ridge regression. The
term ½||w||2 “shrinks” the parameters towards zero to avoid
overfitting
Choosing the Kernel Function
• Probably the most tricky part of using SVM
• The kernel function should maximize the similarity among
instances within a class while accentuating the differences
between classes
• A variety of kernels have been proposed (diffusion kernel, Fisher kernel, string kernel, …) for different types of data
• In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try for data that live in a fixed-dimensional input space
• Low-order Markov kernels and their relatives are good ones to consider for structured data – strings, images, etc.
Other Aspects of SVM
• How to use SVM for multi-class classification?
– One can change the QP formulation to become multi-
class
– More often, multiple binary classifiers are combined
– One can train multiple one-versus-all classifiers, or
combine multiple pairwise classifiers “intelligently”
• How to interpret the SVM discriminant function value as
probability?
– By performing logistic regression on the SVM output of a
set of data that is not used for training
• Some SVM software (like libsvm) have these features built-
in
Recap: How to Use SVM
• Prepare the data set
• Select the kernel function to use
• Select the parameter of the kernel function and the
value of C
– You can use the values suggested by the SVM
software
– You can use a tuning set to tune C
• Execute the training algorithm and obtain the αi and b
• Unseen data can be classified using the αi, the support vectors, and b
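A compact version of this recipe using scikit-learn (an illustrative sketch; the synthetic data set and the parameter grids are arbitrary defaults, not recommendations from the slides):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# 1. Prepare the data set
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 2-3. Select the kernel and tune its parameter and C (cross-validation as tuning set)
grid = GridSearchCV(SVC(),
                    {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100],
                     'gamma': [0.001, 0.01, 0.1]},
                    cv=5)
grid.fit(X_tr, y_tr)            # 4. train: obtains the alpha_i and b internally

# 5. Classify unseen data with the support vectors, alpha_i, and b
print(grid.best_params_, grid.score(X_te, y_te))
```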
Strengths and Weaknesses of SVM
• Strengths
– Training is relatively easy
• No local optima
– It scales relatively well to high dimensional data
– Tradeoff between classifier complexity and error can be
controlled explicitly
– Non-traditional data like strings and trees can be used
as input to SVM, instead of feature vectors
• Weaknesses
– Need to choose a “good” kernel function.
Recent developments
• Better understanding of the relation between SVM and
regularized discriminative classifiers (logistic regression)
• Knowledge-based SVM – incorporating prior knowledge as
constraints in SVM optimization (Shavlik et al)
• A zoo of kernel functions
• Extensions of SVM to multi-class and structured label
classification tasks
• One class SVM (anomaly detection)
Representative Applications of SVM
• Handwritten letter classification
• Text classification
• Image classification – e.g., face recognition
• Bioinformatics
– Gene expression based tissue classification
– Protein function classification
– Protein structure classification
– Protein sub-cellular localization
• ….
2013-1 Machine Learning Lecture 03 - Andrew Moore - probabilistic and baye…
 
Guangxi
GuangxiGuangxi
Guangxi
 

Similar a 2013-1 Machine Learning Lecture 05 - Vasant Honavar - Support Vector Machi…

The world of loss function
The world of loss functionThe world of loss function
The world of loss function홍배 김
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machinesnextlib
 
lawper.ppt
lawper.pptlawper.ppt
lawper.pptHiuLXun4
 
Unit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptxUnit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptxavinashBajpayee1
 
Svm and kernel machines
Svm and kernel machinesSvm and kernel machines
Svm and kernel machinesNawal Sharma
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You NeedDaiki Tanaka
 
support vector machine algorithm in machine learning
support vector machine algorithm in machine learningsupport vector machine algorithm in machine learning
support vector machine algorithm in machine learningSamGuy7
 
Support vector machine
Support vector machineSupport vector machine
Support vector machineRishabh Gupta
 
Fast Object Recognition from 3D Depth Data with Extreme Learning Machine
Fast Object Recognition from 3D Depth Data with Extreme Learning MachineFast Object Recognition from 3D Depth Data with Extreme Learning Machine
Fast Object Recognition from 3D Depth Data with Extreme Learning MachineSoma Boubou
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysisbutest
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysisbutest
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysisbutest
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysisbutest
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysisbutest
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysisbutest
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysisbutest
 
Neural Networks: Support Vector machines
Neural Networks: Support Vector machinesNeural Networks: Support Vector machines
Neural Networks: Support Vector machinesMostafa G. M. Mostafa
 
Module II Partition and Generating Function (2).ppt
Module II Partition and Generating Function (2).pptModule II Partition and Generating Function (2).ppt
Module II Partition and Generating Function (2).pptssuser26e219
 

Similar a 2013-1 Machine Learning Lecture 05 - Vasant Honavar - Support Vector Machi… (20)

The world of loss function
The world of loss functionThe world of loss function
The world of loss function
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
lawper.ppt
lawper.pptlawper.ppt
lawper.ppt
 
Unit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptxUnit-1 Introduction and Mathematical Preliminaries.pptx
Unit-1 Introduction and Mathematical Preliminaries.pptx
 
Svm and kernel machines
Svm and kernel machinesSvm and kernel machines
Svm and kernel machines
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
 
Support Vector Machine.ppt
Support Vector Machine.pptSupport Vector Machine.ppt
Support Vector Machine.ppt
 
svm.ppt
svm.pptsvm.ppt
svm.ppt
 
support vector machine algorithm in machine learning
support vector machine algorithm in machine learningsupport vector machine algorithm in machine learning
support vector machine algorithm in machine learning
 
Support vector machine
Support vector machineSupport vector machine
Support vector machine
 
Fast Object Recognition from 3D Depth Data with Extreme Learning Machine
Fast Object Recognition from 3D Depth Data with Extreme Learning MachineFast Object Recognition from 3D Depth Data with Extreme Learning Machine
Fast Object Recognition from 3D Depth Data with Extreme Learning Machine
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysis
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysis
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysis
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysis
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysis
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysis
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysis
 
Neural Networks: Support Vector machines
Neural Networks: Support Vector machinesNeural Networks: Support Vector machines
Neural Networks: Support Vector machines
 
Module II Partition and Generating Function (2).ppt
Module II Partition and Generating Function (2).pptModule II Partition and Generating Function (2).ppt
Module II Partition and Generating Function (2).ppt
 

Más de Dongseo University

Lecture_NaturalPolicyGradientsTRPOPPO.pdf
Lecture_NaturalPolicyGradientsTRPOPPO.pdfLecture_NaturalPolicyGradientsTRPOPPO.pdf
Lecture_NaturalPolicyGradientsTRPOPPO.pdfDongseo University
 
Evolutionary Computation Lecture notes03
Evolutionary Computation Lecture notes03Evolutionary Computation Lecture notes03
Evolutionary Computation Lecture notes03Dongseo University
 
Evolutionary Computation Lecture notes02
Evolutionary Computation Lecture notes02Evolutionary Computation Lecture notes02
Evolutionary Computation Lecture notes02Dongseo University
 
Evolutionary Computation Lecture notes01
Evolutionary Computation Lecture notes01Evolutionary Computation Lecture notes01
Evolutionary Computation Lecture notes01Dongseo University
 
Average Linear Selection Algorithm
Average Linear Selection AlgorithmAverage Linear Selection Algorithm
Average Linear Selection AlgorithmDongseo University
 
Lower Bound of Comparison Sort
Lower Bound of Comparison SortLower Bound of Comparison Sort
Lower Bound of Comparison SortDongseo University
 
Running Time of Building Binary Heap using Array
Running Time of Building Binary Heap using ArrayRunning Time of Building Binary Heap using Array
Running Time of Building Binary Heap using ArrayDongseo University
 
Proof By Math Induction Example
Proof By Math Induction ExampleProof By Math Induction Example
Proof By Math Induction ExampleDongseo University
 
Estimating probability distributions
Estimating probability distributionsEstimating probability distributions
Estimating probability distributionsDongseo University
 
2018-2 Machine Learning (Wasserstein GAN and BEGAN)
2018-2 Machine Learning (Wasserstein GAN and BEGAN)2018-2 Machine Learning (Wasserstein GAN and BEGAN)
2018-2 Machine Learning (Wasserstein GAN and BEGAN)Dongseo University
 
2018-2 Machine Learning (Linear regression, Logistic regression)
2018-2 Machine Learning (Linear regression, Logistic regression)2018-2 Machine Learning (Linear regression, Logistic regression)
2018-2 Machine Learning (Linear regression, Logistic regression)Dongseo University
 
2017-2 ML W9 Reinforcement Learning #5
2017-2 ML W9 Reinforcement Learning #52017-2 ML W9 Reinforcement Learning #5
2017-2 ML W9 Reinforcement Learning #5Dongseo University
 

Más de Dongseo University (20)

Lecture_NaturalPolicyGradientsTRPOPPO.pdf
Lecture_NaturalPolicyGradientsTRPOPPO.pdfLecture_NaturalPolicyGradientsTRPOPPO.pdf
Lecture_NaturalPolicyGradientsTRPOPPO.pdf
 
Evolutionary Computation Lecture notes03
Evolutionary Computation Lecture notes03Evolutionary Computation Lecture notes03
Evolutionary Computation Lecture notes03
 
Evolutionary Computation Lecture notes02
Evolutionary Computation Lecture notes02Evolutionary Computation Lecture notes02
Evolutionary Computation Lecture notes02
 
Evolutionary Computation Lecture notes01
Evolutionary Computation Lecture notes01Evolutionary Computation Lecture notes01
Evolutionary Computation Lecture notes01
 
Markov Chain Monte Carlo
Markov Chain Monte CarloMarkov Chain Monte Carlo
Markov Chain Monte Carlo
 
Simplex Lecture Notes
Simplex Lecture NotesSimplex Lecture Notes
Simplex Lecture Notes
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Median of Medians
Median of MediansMedian of Medians
Median of Medians
 
Average Linear Selection Algorithm
Average Linear Selection AlgorithmAverage Linear Selection Algorithm
Average Linear Selection Algorithm
 
Lower Bound of Comparison Sort
Lower Bound of Comparison SortLower Bound of Comparison Sort
Lower Bound of Comparison Sort
 
Running Time of Building Binary Heap using Array
Running Time of Building Binary Heap using ArrayRunning Time of Building Binary Heap using Array
Running Time of Building Binary Heap using Array
 
Running Time of MergeSort
Running Time of MergeSortRunning Time of MergeSort
Running Time of MergeSort
 
Binary Trees
Binary TreesBinary Trees
Binary Trees
 
Proof By Math Induction Example
Proof By Math Induction ExampleProof By Math Induction Example
Proof By Math Induction Example
 
TRPO and PPO notes
TRPO and PPO notesTRPO and PPO notes
TRPO and PPO notes
 
Estimating probability distributions
Estimating probability distributionsEstimating probability distributions
Estimating probability distributions
 
2018-2 Machine Learning (Wasserstein GAN and BEGAN)
2018-2 Machine Learning (Wasserstein GAN and BEGAN)2018-2 Machine Learning (Wasserstein GAN and BEGAN)
2018-2 Machine Learning (Wasserstein GAN and BEGAN)
 
2018-2 Machine Learning (Linear regression, Logistic regression)
2018-2 Machine Learning (Linear regression, Logistic regression)2018-2 Machine Learning (Linear regression, Logistic regression)
2018-2 Machine Learning (Linear regression, Logistic regression)
 
2017-2 ML W11 GAN #1
2017-2 ML W11 GAN #12017-2 ML W11 GAN #1
2017-2 ML W11 GAN #1
 
2017-2 ML W9 Reinforcement Learning #5
2017-2 ML W9 Reinforcement Learning #52017-2 ML W9 Reinforcement Learning #5
2017-2 ML W9 Reinforcement Learning #5
 

Último

Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 

Último (20)

Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 

2013-1 Machine Learning Lecture 05 - Vasant Honavar - Support Vector Machi…

  • 1. 1 Copyright Vasant Honavar, 2006. Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory MACHINE LEARNING Vasant Honavar Artificial Intelligence Research Laboratory Department of Computer Science Bioinformatics and Computational Biology Program Center for Computational Intelligence, Learning, & Discovery Iowa State University honavar@cs.iastate.edu www.cs.iastate.edu/~honavar/ www.cild.iastate.edu/ Copyright Vasant Honavar, 2006. Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Maximum margin classifiers • Discriminative classifiers that maximize the margin of separation • Support Vector Machines – Feature spaces – Kernel machines – VC Theory and generalization bounds – Maximum margin classifiers • Comparisons with conditionally trained discriminative classifiers
  • 2. 2 Copyright Vasant Honavar, 2006. Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Perceptrons revisited Perceptrons • Can only compute threshold functions • Can only represent linear decision surfaces • Can only learn to classify linearly separable training data How can we deal with non linearly separable data? • Map data into a typically higher dimensional feature space where the classes become separable Two problems must be solved: • Computational problem of working with high dimensional feature space • Overfitting problem in high dimensional feature spaces Copyright Vasant Honavar, 2006. Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Maximum margin model We cannot outperform Bayes optimal classifier when • The generative model assumed is correct • The data set is large enough to ensure reliable estimation of parameters of the models But discriminative models may be better than generative models when • The correct generative model is seldom known • The data set is often simply not large enough Maximum margin classifiers are a kind of discriminative classifiers designed to circumvent the overfiting problem
  • 3. 3 Copyright Vasant Honavar, 2006. Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Extending Linear Classifiers – Feature spaces Map data into a feature space where they are linearly separable ( )ϕ→x x x ( )ϕ x X Φ Copyright Vasant Honavar, 2006. Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Linear Separability in Feature spaces The original input space can always be mapped to some higher-dimensional feature space where the training data become separable: Φ: x → φ(x)
  • 4. Learning in the Feature Spaces
High-dimensional feature spaces
$\mathbf{x} = (x_1, \ldots, x_n) \rightarrow \varphi(\mathbf{x}) = (\varphi_1(\mathbf{x}), \ldots, \varphi_d(\mathbf{x}))$
where typically $d \gg n$ solve the problem of expressing complex functions.
But this introduces a
• computational problem (working with very large vectors)
• generalization problem (curse of dimensionality)
SVMs offer an elegant solution to both problems.
The Perceptron Algorithm Revisited
• The perceptron works by adding misclassified positive or subtracting misclassified negative examples to an arbitrary weight vector, which (without loss of generality) we assumed to be the zero vector.
• The final weight vector is a linear combination of the training points
$\mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i$
where, since the sign of the coefficient of $\mathbf{x}_i$ is given by the label $y_i$, the $\alpha_i$ are positive values proportional to the number of times misclassification of $\mathbf{x}_i$ has caused the weight vector to be updated. $\alpha_i$ is called the embedding strength of the pattern $\mathbf{x}_i$.
  • 5. Dual Representation
The decision function can be rewritten as:
$h(\mathbf{x}) = \mathrm{sgn}\left(\langle\mathbf{w},\mathbf{x}\rangle + b\right) = \mathrm{sgn}\left(\sum_{j=1}^{l}\alpha_j y_j\langle\mathbf{x}_j,\mathbf{x}\rangle + b\right)$
The update rule is: on training example $(\mathbf{x}_i, y_i)$, if
$y_i\left(\sum_{j=1}^{l}\alpha_j y_j\langle\mathbf{x}_j,\mathbf{x}_i\rangle + b\right) \le 0$ then $\alpha_i \leftarrow \alpha_i + \eta$
WLOG, we can take $\eta = 1$.
Implications of Dual Representation
When Linear Learning Machines are represented in the dual form, the data appear only inside dot products (in the decision function and in the training algorithm).
The matrix $G = \left(\langle\mathbf{x}_i,\mathbf{x}_j\rangle\right)_{i,j=1}^{l}$, which is the matrix of pairwise dot products between training samples, is called the Gram matrix.
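The dual-form update above can be turned directly into code. The following Python sketch (a toy illustration, not the lecture's own implementation; the data, the stopping rule, and the bias update convention are assumptions) trains a perceptron entirely through the Gram matrix and then recovers $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ from the embedding strengths.

```python
import numpy as np

def dual_perceptron(X, y, epochs=100):
    """Dual-form perceptron: learn embedding strengths alpha_i instead of w."""
    l = X.shape[0]
    alpha = np.zeros(l)          # alpha_i counts mistakes made on example i
    b = 0.0
    G = X @ X.T                  # Gram matrix of pairwise dot products
    for _ in range(epochs):
        mistakes = 0
        for i in range(l):
            # decision value uses only dot products, via the Gram matrix
            f_i = np.sum(alpha * y * G[:, i]) + b
            if y[i] * f_i <= 0:  # mistake: strengthen example i
                alpha[i] += 1.0
                b += y[i]        # bias update (one common convention, assumed here)
                mistakes += 1
        if mistakes == 0:
            break
    w = (alpha * y) @ X          # recover w = sum_i alpha_i y_i x_i
    return alpha, b, w

# toy linearly separable data (assumed for illustration)
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
alpha, b, w = dual_perceptron(X, y)
print(alpha, b, w, np.sign(X @ w + b))
```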
  • 6. Implicit Mapping to Feature Space
Kernel Machines
• Solve the computational problem of working with many dimensions
• Can make it possible to use infinite dimensions efficiently
• Offer other advantages, both practical and conceptual
Kernel-Induced Feature Spaces
$\varphi : X \rightarrow \Phi$, $f(\mathbf{x}) = \langle\mathbf{w},\varphi(\mathbf{x})\rangle + b$, $h(\mathbf{x}) = \mathrm{sgn}(f(\mathbf{x}))$
where $\varphi$ is a non-linear map from input space to feature space.
In the dual representation, the data points only appear inside dot products:
$h(\mathbf{x}) = \mathrm{sgn}\left(\langle\mathbf{w},\varphi(\mathbf{x})\rangle + b\right) = \mathrm{sgn}\left(\sum_{j=1}^{l}\alpha_j y_j\langle\varphi(\mathbf{x}_j),\varphi(\mathbf{x})\rangle + b\right)$
  • 7. Kernels
• A kernel function returns the value of the dot product between the images of its two arguments:
$K(\mathbf{x}_1,\mathbf{x}_2) = \langle\varphi(\mathbf{x}_1),\varphi(\mathbf{x}_2)\rangle$
• When using kernels, the dimensionality of the feature space $\Phi$ is not necessarily important because of the special properties of kernel functions
• Given a function K, it is possible to verify that it is a kernel (we will return to this later)
Kernel Machines
We can use the perceptron learning algorithm in the feature space by taking its dual representation and replacing dot products with kernels:
$\langle\mathbf{x}_1,\mathbf{x}_2\rangle \leftarrow K(\mathbf{x}_1,\mathbf{x}_2) = \langle\varphi(\mathbf{x}_1),\varphi(\mathbf{x}_2)\rangle$
  • 8. The Kernel Matrix
$K = \begin{pmatrix} K(1,1) & K(1,2) & K(1,3) & \cdots & K(1,l) \\ K(2,1) & K(2,2) & K(2,3) & \cdots & K(2,l) \\ \vdots & \vdots & \vdots & & \vdots \\ K(l,1) & K(l,2) & K(l,3) & \cdots & K(l,l) \end{pmatrix}$
The kernel matrix is the Gram matrix in the feature space $\Phi$ (the matrix of pairwise dot products between the feature vectors corresponding to the training samples).
Properties of Kernel Matrices
It is easy to show that the Gram matrix (and hence the kernel matrix) is
• a square matrix
• symmetric ($K^T = K$)
• positive semi-definite (all eigenvalues of K are non-negative; recall that the eigenvalues of a square matrix A are given by the values of $\lambda$ that satisfy $|A - \lambda I| = 0$)
Any symmetric positive semi-definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some feature space $\Phi$:
$K(\mathbf{x}_i,\mathbf{x}_j) = \langle\varphi(\mathbf{x}_i),\varphi(\mathbf{x}_j)\rangle$
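These properties are easy to check numerically. A minimal sketch (the Gaussian-kernel setup, sample size, and tolerances are assumptions made for illustration): build a kernel matrix on random data and confirm symmetry and non-negative eigenvalues.

```python
import numpy as np

# Assumed setup: 20 random 5-D points and a Gaussian kernel with sigma = 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
sigma = 1.0

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))        # Gaussian (RBF) kernel matrix

print(np.allclose(K, K.T))                      # symmetric
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)  # eigenvalues non-negative (up to round-off)
```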
  • 9. Mercer's Theorem: Characterization of Kernel Functions
A function $K : X \times X \rightarrow \mathbb{R}$ is said to be (finitely) positive semi-definite if
• K is a symmetric function: $K(\mathbf{x},\mathbf{y}) = K(\mathbf{y},\mathbf{x})$
• matrices formed by restricting K to any finite subset of the space X are positive semi-definite
Characterization of Kernel Functions: Every (finitely) positive semi-definite, symmetric function is a kernel; i.e., there exists a mapping $\varphi$ such that it is possible to write
$K(\mathbf{x}_i,\mathbf{x}_j) = \langle\varphi(\mathbf{x}_i),\varphi(\mathbf{x}_j)\rangle$
Properties of Kernel Functions – Mercer's Theorem
Eigenvalues of the Gram matrix define an expansion of kernels:
$K(\mathbf{x}_i,\mathbf{x}_j) = \langle\varphi(\mathbf{x}_i),\varphi(\mathbf{x}_j)\rangle = \sum_{l}\lambda_l\,\varphi_l(\mathbf{x}_i)\,\varphi_l(\mathbf{x}_j)$
That is, the eigenvalues act as features! Recall that the eigenvalues of a square matrix A correspond to solutions of $|A - \lambda I| = 0$, where $I$ is the identity matrix.
  • 10. Examples of Kernels
Simple examples of kernels:
$K(\mathbf{x},\mathbf{z}) = \langle\mathbf{x},\mathbf{z}\rangle^{d}$
$K(\mathbf{x},\mathbf{z}) = \exp\!\left(-\dfrac{\|\mathbf{x}-\mathbf{z}\|^{2}}{2\sigma^{2}}\right)$ (Gaussian kernel)
Example: Polynomial Kernels
$\mathbf{x} = (x_1, x_2)$; $\mathbf{z} = (z_1, z_2)$;
$\langle\mathbf{x},\mathbf{z}\rangle^{2} = (x_1 z_1 + x_2 z_2)^{2} = x_1^{2}z_1^{2} + x_2^{2}z_2^{2} + 2x_1 z_1 x_2 z_2 = \left\langle (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2),\,(z_1^2, z_2^2, \sqrt{2}\,z_1 z_2)\right\rangle = \langle\varphi(\mathbf{x}),\varphi(\mathbf{z})\rangle$
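The polynomial identity above can be verified numerically. This sketch (the specific input vectors are arbitrary choices for illustration) confirms that the degree-2 kernel equals the dot product of the explicit feature maps $\varphi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$.

```python
import numpy as np

def phi(v):
    # explicit degree-2 feature map for 2-D inputs
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.5, -2.0])
z = np.array([0.5, 3.0])

lhs = (x @ z) ** 2          # kernel evaluated in the input space
rhs = phi(x) @ phi(z)       # dot product in the feature space
print(lhs, rhs, np.isclose(lhs, rhs))   # identical up to floating point
```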
  • 11. Making Kernels – Closure Properties
The set of kernels is closed under some operations. If $K_1$, $K_2$ are kernels over $X \times X$, then the following are kernels:
$K(\mathbf{x},\mathbf{z}) = K_1(\mathbf{x},\mathbf{z}) + K_2(\mathbf{x},\mathbf{z})$
$K(\mathbf{x},\mathbf{z}) = a\,K_1(\mathbf{x},\mathbf{z})$, $a > 0$
$K(\mathbf{x},\mathbf{z}) = K_1(\mathbf{x},\mathbf{z})\,K_2(\mathbf{x},\mathbf{z})$
$K(\mathbf{x},\mathbf{z}) = f(\mathbf{x})\,f(\mathbf{z})$, where $f : X \rightarrow \mathbb{R}$
$K(\mathbf{x},\mathbf{z}) = K_3(\varphi(\mathbf{x}),\varphi(\mathbf{z}))$, where $\varphi : X \rightarrow \mathbb{R}^{N}$ and $K_3$ is a kernel over $\mathbb{R}^{N} \times \mathbb{R}^{N}$
$K(\mathbf{x},\mathbf{z}) = \mathbf{x}^{T}B\,\mathbf{z}$, where $B$ is a symmetric positive semi-definite matrix ($X \subseteq \mathbb{R}^{n}$)
We can make complex kernels from simple ones: modularity!
Kernels
• We can define kernels over arbitrary instance spaces including
• finite-dimensional vector spaces,
• Boolean spaces,
• $\Sigma^{*}$ where $\Sigma$ is a finite alphabet,
• documents, graphs, etc.
• Kernels need not always be expressed by a closed-form formula.
• Many useful kernels can be computed by complex algorithms (e.g., diffusion kernels over graphs).
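The closure properties listed above can be exercised in code. A small sketch (the particular combination and the random sample are assumptions for illustration): compose a new kernel from a linear and a polynomial kernel by scaling, addition, and multiplication, and check that its Gram matrix on a sample is still positive semi-definite.

```python
import numpy as np

def k_lin(x, z):
    return x @ z

def k_poly(x, z, d=2):
    return (x @ z) ** d

def k_new(x, z):
    # a*K1 + K1*K2, a combination covered by the closure properties
    return 3.0 * k_lin(x, z) + k_lin(x, z) * k_poly(x, z)

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 4))
K = np.array([[k_new(a, b) for b in X] for a in X])

eigs = np.linalg.eigvalsh(K)
print(eigs.min() >= -1e-8 * abs(eigs.max()))   # PSD up to round-off
```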
  • 12. Kernels on Sets and Multi-Sets
Let $X \subseteq 2^{V}$ for some fixed domain $V$. Let $A_1, A_2 \in X$. Then $K(A_1, A_2) = 2^{|A_1 \cap A_2|}$ is a kernel.
Exercise: Define a kernel on the space of multi-sets whose elements are drawn from a finite domain V.
String Kernel (p-spectrum Kernel)
• The p-spectrum of a string is the histogram – the vector of numbers of occurrences of all possible contiguous substrings of length p
• We can define a kernel function K(s, t) over $\Sigma^{*} \times \Sigma^{*}$ as the inner product of the p-spectra of s and t.
Example: $s = \texttt{statistics}$, $t = \texttt{computation}$, $p = 3$.
Common substrings: tat, ati. $K(s,t) = 2$.
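The p-spectrum kernel is simple enough to implement in a few lines. This sketch (a straightforward implementation of the definition, not the lecture's code) reproduces the example above: the 3-spectra of "statistics" and "computation" share the substrings "tat" and "ati", each occurring once in both strings, so the kernel value is 2.

```python
from collections import Counter

def p_spectrum(s, p):
    """Histogram of all contiguous substrings of length p."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def p_spectrum_kernel(s, t, p):
    """Inner product of the p-spectra of s and t."""
    hs, ht = p_spectrum(s, p), p_spectrum(t, p)
    return sum(hs[u] * ht[u] for u in hs)

print(p_spectrum_kernel("statistics", "computation", 3))   # -> 2
```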
  • 13. Kernel Machines
Kernel machines are linear learning machines that:
• use a dual representation
• operate in a kernel-induced feature space (that is, a linear function in the feature space implicitly defined by the Gram matrix corresponding to the data set K)
$h(\mathbf{x}) = \mathrm{sgn}\left(\langle\mathbf{w},\varphi(\mathbf{x})\rangle + b\right) = \mathrm{sgn}\left(\sum_{j=1}^{l}\alpha_j y_j\langle\varphi(\mathbf{x}_j),\varphi(\mathbf{x})\rangle + b\right)$
Kernels – the good, the bad, and the ugly
Bad kernel – a kernel whose Gram (kernel) matrix is mostly diagonal: all data points are orthogonal to each other, and hence the machine is unable to detect hidden structure in the data, e.g.
$\begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$
  • 14. Kernels – the good, the bad, and the ugly
Good kernel – corresponds to a Gram (kernel) matrix in which subsets of data points belonging to the same class are similar to each other, and hence the machine can detect hidden structure in the data, e.g. (Class 1 = $\{x_1, x_2\}$, Class 2 = $\{x_3, x_4, x_5\}$):
$\begin{pmatrix} 3 & 2 & 0 & 0 & 0 \\ 2 & 3 & 0 & 0 & 0 \\ 0 & 0 & 4 & 3 & 3 \\ 0 & 0 & 3 & 4 & 2 \\ 0 & 0 & 3 & 2 & 4 \end{pmatrix}$
Kernels – the good, the bad, and the ugly
• If the data are mapped into a space with too many irrelevant features, the kernel matrix becomes diagonal
• We need some prior knowledge of the target to choose a good kernel
  • 15. Learning in the Feature Spaces
High-dimensional feature spaces
$\mathbf{x} = (x_1, \ldots, x_n) \rightarrow \varphi(\mathbf{x}) = (\varphi_1(\mathbf{x}), \ldots, \varphi_d(\mathbf{x}))$
where typically $d \gg n$ solve the problem of expressing complex functions.
But this introduces a
• computational problem (working with very large vectors) – solved by the kernel trick: implicit computation of dot products in kernel-induced feature spaces via dot products in the input space
• generalization problem (curse of dimensionality)
The Generalization Problem
• The curse of dimensionality
• It is easy to overfit in high-dimensional spaces
• The learning problem is ill posed
• There are infinitely many hyperplanes that separate the training data
• Need a principled approach to choose an optimal hyperplane
  • 16. 16 Copyright Vasant Honavar, 2006. Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory The Generalization Problem “Capacity” of the machine – ability to learn any training set without error – related to VC dimension Excellent memory is not an asset when it comes to learning from limited data “A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist’s lazy brother, who declares that if it’s green, it’s a tree” C. Burges Copyright Vasant Honavar, 2006. Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory History of Key Developments leading to SVM 1958 Perceptron (Rosenblatt) 1963 Margin (Vapnik) 1964 Kernel Trick (Aizerman) 1965 Optimization formulation (Mangasarian) 1971 Kernels (Wahba) 1992-1994 SVM (Vapnik) 1996 – present Rapid growth, numerous applications Extensions to other problems
  • 17. A Little Learning Theory
Suppose:
• We are given l training examples $(\mathbf{x}_i, y_i)$.
• Train and test points are drawn randomly (i.i.d.) from some unknown probability distribution $D(\mathbf{x}, y)$.
• The machine learns the mapping $\mathbf{x}_i \rightarrow y_i$ and outputs a hypothesis $h(\mathbf{x},\alpha,b)$. A particular choice of $(\alpha, b)$ generates a "trained machine".
• The expectation of the test error, or expected risk, is
$R(\alpha,b) = \int \frac{1}{2}\left|y - h(\mathbf{x},\alpha,b)\right|\,dD(\mathbf{x},y)$
A Bound on the Generalization Performance
The empirical risk is:
$R_{emp}(\alpha,b) = \frac{1}{2l}\sum_{i=1}^{l}\left|y_i - h(\mathbf{x}_i,\alpha,b)\right|$
Choose some $\delta$ such that $0 < \delta < 1$. With probability $1-\delta$ the following bound – the risk bound of $h(\mathbf{x},\alpha)$ under distribution D – holds (Vapnik, 1995):
$R(\alpha,b) \le R_{emp}(\alpha,b) + \sqrt{\dfrac{d\left(\log\frac{2l}{d}+1\right) - \log\frac{\delta}{4}}{l}}$
where $d \ge 0$ is called the VC dimension and is a measure of the "capacity" of the machine.
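The second term of the bound is easy to evaluate for given d, l, and δ. A minimal sketch (using natural logarithms, which is an assumption about the intended base of "log"): it shows how quickly the capacity term grows with the VC dimension for a fixed sample size.

```python
import numpy as np

def vc_confidence(d, l, delta=0.05):
    """Capacity (VC confidence) term of the risk bound above."""
    return np.sqrt((d * (np.log(2 * l / d) + 1) - np.log(delta / 4)) / l)

# The bound R <= R_emp + vc_confidence(d, l) is only informative when d/l is small.
for d in (10, 100, 1000):
    print(d, round(vc_confidence(d, l=10000), 3))
```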
  • 18. A Bound on the Generalization Performance
The second term on the right-hand side is called the VC confidence.
Three key points about the actual risk bound:
• It is independent of $D(\mathbf{x},y)$.
• It is usually not possible to compute the left-hand side.
• If we know d, we can compute the right-hand side.
The risk bound gives us a way to compare learning machines!
The VC Dimension
Definition: the VC dimension of a set of functions $H = \{h(\mathbf{x},\alpha,b)\}$ is d if and only if there exists a set of d data points such that each of the $2^{d}$ possible labelings (dichotomies) of the d data points can be realized using some member of H, and no set of $q > d$ points satisfies this property.
  • 19. The VC Dimension
The VC dimension of H is the size of the largest subset of X shattered by H.
The VC dimension measures the capacity of a set H of hypotheses (functions).
If for any number N it is possible to find N points $x_1, \ldots, x_N$ that can be separated in all the $2^{N}$ possible ways, we say that the VC dimension of the set is infinite.
The VC Dimension Example
• Suppose that the data live in $\mathbb{R}^{2}$, and the set $\{h(\mathbf{x},\alpha)\}$ consists of oriented straight lines (linear discriminants).
• It is possible to find three points that can be shattered by this set of functions.
• It is not possible to find four.
• So the VC dimension of the set of linear discriminants in $\mathbb{R}^{2}$ is three.
  • 20. The VC Dimension of Hyperplanes
Theorem 1. Consider some set of m points in $\mathbb{R}^{n}$. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent.
Corollary: The VC dimension of the set of oriented hyperplanes in $\mathbb{R}^{n}$ is $n+1$, since we can always choose $n+1$ points, and then choose one of the points as origin, such that the position vectors of the remaining points are linearly independent, but we can never choose $n+2$ such points.
VC Dimension
The VC dimension can be infinite even when the number of parameters of the set $\{h(\mathbf{x},\alpha)\}$ of hypothesis functions is low.
Example: $h(x,\alpha) \equiv \mathrm{sgn}(\sin(\alpha x))$, $x, \alpha \in \mathbb{R}$.
For any integer l and any labels $y_1, \ldots, y_l \in \{-1, 1\}$, we can find l points $x_1, \ldots, x_l$ and a parameter $\alpha$ such that those points can be shattered by $h(x,\alpha)$. Those points are $x_i = 10^{-i}$, $i = 1, \ldots, l$, and the parameter is
$\alpha = \pi\left(1 + \sum_{i=1}^{l}\dfrac{(1-y_i)\,10^{i}}{2}\right)$
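The shattering construction above can be checked numerically. This sketch (the labels chosen are arbitrary; any other labeling should work as well) computes α for a given labeling and confirms that $\mathrm{sgn}(\sin(\alpha x_i))$ reproduces it.

```python
import numpy as np

l = 6
y = np.array([1, -1, -1, 1, -1, 1])                  # arbitrary labels to reproduce
x = np.array([10.0 ** (-i) for i in range(1, l + 1)])  # x_i = 10^{-i}

# alpha = pi * (1 + sum_i (1 - y_i) * 10^i / 2), as on the slide
a = np.pi * (1 + np.sum((1 - y) * 10.0 ** np.arange(1, l + 1) / 2))

print(np.sign(np.sin(a * x)).astype(int))             # matches y
```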
  • 21. Vapnik-Chervonenkis (VC) Dimension
• Let H be a hypothesis class over an instance space X.
• Both H and X may be infinite.
• We need a way to describe the behavior of H on a finite set of points $S \subseteq X$.
• For any concept class H over X, and any $S \subseteq X$, $\Pi_H(S) = \{h \cap S : h \in H\}$.
• Equivalently, with a little abuse of notation, for $S = \{X_1, X_2, \ldots, X_m\}$ we can write $\Pi_H(S) = \{(h(X_1), \ldots, h(X_m)) : h \in H\}$.
$\Pi_H(S)$ is the set of all dichotomies or behaviors on S that are induced or realized by H.
Vapnik-Chervonenkis (VC) Dimension
• If $\Pi_H(S) = \{0,1\}^{m}$ where $|S| = m$, or equivalently $|\Pi_H(S)| = 2^{m}$, we say that S is shattered by H.
• A set S of instances is said to be shattered by a hypothesis class H if and only if for every dichotomy of S there exists a hypothesis in H that is consistent with the dichotomy.
  • 22. VC Dimension of a hypothesis class
Definition: The VC dimension V(H) of a hypothesis class H defined over an instance space X is the cardinality d of the largest subset of X that is shattered by H. If arbitrarily large finite subsets of X can be shattered by H, $V(H) = \infty$.
How can we show that V(H) is at least d?
• Find a set of cardinality at least d that is shattered by H.
How can we show that V(H) = d?
• Show that V(H) is at least d and that no set of cardinality $d+1$ can be shattered by H.
VC Dimension of a Hypothesis Class – Examples
• Example: Let the instance space X be the 2-dimensional Euclidean space. Let the hypothesis space H be the set of axis-parallel rectangles in the plane.
• V(H) = 4 (there exists a set of 4 points that can be shattered by a set of axis-parallel rectangles, but no set of 5 points can be shattered by H).
  • 23. Some Useful Properties of VC Dimension
(Proofs left as an exercise.)
• If H is a finite concept class, $V(H) \le \lg|H|$.
• If $H_1 \subseteq H_2$, then $V(H_1) \le V(H_2)$.
• If $H = \{X - h : h \in H_1\}$ (complements of concepts in $H_1$), then $V(H) = V(H_1)$.
• If $H = H_1 \cup H_2$, then $V(H) \le V(H_1) + V(H_2) + 1$.
• If $H_l$ is formed by a union or intersection of l concepts from H, then $V(H_l) = O\!\left(V(H)\,l\lg l\right)$.
• If $d = V(H)$ and $\Pi_H(m) = \max\{|\Pi_H(S)| : |S| = m\}$, then $\Pi_H(m) \le \Phi_d(m)$, where $\Phi_d(m) = 2^{m}$ if $m < d$ and $\Phi_d(m) = O(m^{d})$ if $m \ge d$.
Risk Bound
What are the VC dimension and the empirical risk of the nearest-neighbor classifier?
Any number of points, labeled arbitrarily, can be shattered by a nearest-neighbor classifier, thus $d = \infty$ and the empirical risk is 0. So the bound provides no useful information in this case.
  • 24. Minimizing the Bound by Minimizing d
$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\dfrac{d\left(\log\frac{2l}{d}+1\right) - \log\frac{\delta}{4}}{l}}$, with $\delta = 0.05$.
• The VC confidence (second term) depends on d/l
• One should choose the learning machine with minimal d
• For large values of d/l the bound is not tight
Bounds on Error of Classification
The classification error
• depends on the VC dimension of the hypothesis class
• is independent of the dimensionality of the feature space
Vapnik proved that the error $\varepsilon$ of a classification function h for separable data sets is
$\varepsilon = O\!\left(\dfrac{d}{l}\right)$
where d is the VC dimension of the hypothesis class and l is the number of training examples.
  • 25. Structural Risk Minimization
Finding a learning machine with the minimum upper bound on the actual risk
• leads us to a method of choosing an optimal machine for a given task
• is the essential idea of structural risk minimization (SRM).
Let $H_1 \subset H_2 \subset H_3 \subset \ldots$ be a sequence of nested subsets of hypotheses whose VC dimensions satisfy $d_1 < d_2 < d_3 < \ldots$
• SRM consists of finding that subset of functions which minimizes the upper bound on the actual risk.
• SRM involves training a series of classifiers and choosing the classifier from the series for which the sum of empirical risk and VC confidence is minimal.
Margin Based Bounds on Error of Classification
Margin based bound:
$\varepsilon = O\!\left(\dfrac{1}{l}\left(\dfrac{L}{\gamma}\right)^{2}\right)$, where $\gamma = \min_i \dfrac{y_i f(\mathbf{x}_i)}{\|\mathbf{w}\|}$, $f(\mathbf{x}) = \langle\mathbf{w},\mathbf{x}\rangle + b$, and $L = \max_i\|\mathbf{x}_i\|_p$.
Important insight: the error of a classifier trained on a separable data set is inversely proportional to its margin, and is independent of the dimensionality of the input space!
  • 26. Maximal Margin Classifier
The bounds on the error of classification suggest the possibility of improving generalization by maximizing the margin.
Minimize the risk of overfitting by choosing the maximal margin hyperplane in feature space.
SVMs control capacity by increasing the margin, not by reducing the number of features.
Margin
Linear separation of the input space:
$f(\mathbf{x}) = \langle\mathbf{w},\mathbf{x}\rangle + b$, $h(\mathbf{x}) = \mathrm{sign}(f(\mathbf{x}))$
[Figure: positively and negatively labeled points separated by the hyperplane defined by $\mathbf{w}$ and $b$.]
  • 27. Functional and Geometric Margin
• The functional margin of a linear discriminant (w, b) w.r.t. a labeled pattern $(\mathbf{x}_i, y_i) \in \mathbb{R}^{d} \times \{-1, 1\}$ is defined as
$\gamma_i \equiv y_i\left(\langle\mathbf{w},\mathbf{x}_i\rangle + b\right)$
• If the functional margin is negative, then the pattern is incorrectly classified; if it is positive then the classifier predicts the correct label.
• The larger $|\gamma_i|$, the further away $\mathbf{x}_i$ is from the discriminant.
• This is made more precise in the notion of the geometric margin
$\hat{\gamma}_i \equiv \dfrac{\gamma_i}{\|\mathbf{w}\|}$
which measures the Euclidean distance of a point from the decision boundary.
Geometric Margin
[Figure: the geometric margins $\gamma_i$ and $\gamma_j$ of two points $X_i \in S^{+}$ and $X_j \in S^{-}$.]
  • 28. Geometric Margin – Example
For $\mathbf{W} = (1, 1)$, $b = -1$, and the positive example $X_i = (1, 1)$ with $Y_i = 1$:
$\gamma_i = Y_i\left(\langle\mathbf{W},X_i\rangle + b\right) = 1\cdot\left((1)(1) + (1)(1) - 1\right) = 1$, $\|\mathbf{W}\| = \sqrt{2}$, so the geometric margin is $\dfrac{\gamma_i}{\|\mathbf{W}\|} = \dfrac{1}{\sqrt{2}}$.
Margin of a training set
The functional margin of a training set: $\gamma = \min_i \gamma_i$
The geometric margin of a training set: $\hat{\gamma} = \min_i \hat{\gamma}_i$
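A tiny Python sketch reproducing the worked example on this page (the numbers are taken from the example; nothing else is assumed): the functional margin is 1 and the geometric margin is $1/\sqrt{2}$.

```python
import numpy as np

w = np.array([1.0, 1.0])
b = -1.0
x_i, y_i = np.array([1.0, 1.0]), 1

functional_margin = y_i * (w @ x_i + b)
geometric_margin = functional_margin / np.linalg.norm(w)
print(functional_margin, geometric_margin)   # 1.0, 0.7071...
```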
  • 29. Maximum Margin Separating Hyperplane
$\gamma \equiv \min_i \gamma_i$ is called the (functional) margin of (w, b) w.r.t. the data set $S = \{(\mathbf{x}_i, y_i)\}$.
The margin of a training set S is the maximum geometric margin over all hyperplanes. A hyperplane realizing this maximum is a maximal margin hyperplane.
VC Dimension of $\Delta$-margin hyperplanes
• If the data samples belong to a sphere of radius R, then the set of $\Delta$-margin hyperplanes has VC dimension bounded by
$VC(H) \le \min\!\left(\dfrac{R^{2}}{\Delta^{2}},\, n\right) + 1$
• WLOG, we can make R = 1 by normalizing the data points of each class so that they lie within a sphere of radius R.
• For large-margin hyperplanes, the VC dimension is controlled independently of the dimensionality n of the feature space.
  • 30. Maximal Margin Classifier
The bounds on the error of classification suggest the possibility of improving generalization by maximizing the margin.
Minimize the risk of overfitting by choosing the maximal margin hyperplane in feature space.
SVMs control capacity by increasing the margin, not by reducing the number of features.
Maximizing the Margin ⇔ Minimizing ||w||
The definition of the hyperplane (w, b) does not change if we rescale it to $(\sigma\mathbf{w}, \sigma b)$, for $\sigma > 0$.
The functional margin depends on scaling, but the geometric margin $\gamma$ does not.
If we fix (by rescaling) the functional margin to 1, the geometric margin will be equal to $1/\|\mathbf{w}\|$.
We can therefore maximize the margin by minimizing the norm $\|\mathbf{w}\|$.
  • 31. Maximizing the Margin ⇔ Minimizing ||w||
Let $\mathbf{x}^{+}$ and $\mathbf{x}^{-}$ be the nearest data points of the two classes in the sample space. Then, fixing the functional margin at these points to be 1:
$\langle\mathbf{w},\mathbf{x}^{+}\rangle + b = +1$
$\langle\mathbf{w},\mathbf{x}^{-}\rangle + b = -1$
$\langle\mathbf{w},\mathbf{x}^{+} - \mathbf{x}^{-}\rangle = 1 - (-1) = 2$
$\gamma = \left\langle\dfrac{\mathbf{w}}{\|\mathbf{w}\|},\,\mathbf{x}^{+} - \mathbf{x}^{-}\right\rangle = \dfrac{2}{\|\mathbf{w}\|}$
Learning as optimization
Minimize $\langle\mathbf{w},\mathbf{w}\rangle$
subject to: $y_i\left(\langle\mathbf{w},\mathbf{x}_i\rangle + b\right) \ge 1$
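The optimization problem stated above can be handed to a general-purpose constrained solver. This is a minimal sketch under stated assumptions (a toy 2-D data set chosen for illustration, SciPy's SLSQP as the solver, and zero as the starting point), not the optimization method developed later in the lecture.

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data (assumed): the expected solution is w ~ (1, 0), b ~ -1.
X = np.array([[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

def objective(params):              # params = (w1, w2, b)
    w = params[:2]
    return w @ w                    # minimize <w, w>

# One inequality constraint y_i (<w, x_i> + b) - 1 >= 0 per training example
constraints = [
    {"type": "ineq", "fun": (lambda p, i=i: y[i] * (p[:2] @ X[i] + p[2]) - 1.0)}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b, 2.0 / np.linalg.norm(w))   # weight vector, bias, and margin 2/||w||
```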
  • 32. Constrained Optimization
• Primal optimization problem
• Given functions $f$, $g_i$, $i = 1, \ldots, k$; $h_j$, $j = 1, \ldots, m$; defined on a domain $\Omega \subseteq \mathbb{R}^{n}$:
minimize $f(\mathbf{w})$, $\mathbf{w} \in \Omega$ (objective function)
subject to $g_i(\mathbf{w}) \le 0$, $i = 1, \ldots, k$ (inequality constraints)
$h_j(\mathbf{w}) = 0$, $j = 1, \ldots, m$ (equality constraints)
Shorthand: $g(\mathbf{w}) \le 0$ denotes $g_i(\mathbf{w}) \le 0$ for $i = 1, \ldots, k$; $h(\mathbf{w}) = 0$ denotes $h_j(\mathbf{w}) = 0$ for $j = 1, \ldots, m$.
Feasible region: $F = \{\mathbf{w} \in \Omega : g(\mathbf{w}) \le 0,\, h(\mathbf{w}) = 0\}$
Optimization problems
• Linear program – the objective function as well as the equality and inequality constraints are linear
• Quadratic program – the objective function is quadratic, and the equality and inequality constraints are linear
• Inequality constraints $g_i(\mathbf{w}) \le 0$ can be active, i.e. $g_i(\mathbf{w}) = 0$, or inactive, i.e. $g_i(\mathbf{w}) < 0$.
• Inequality constraints are often transformed into equality constraints using slack variables: $g_i(\mathbf{w}) \le 0 \leftrightarrow g_i(\mathbf{w}) + \xi_i = 0$ with $\xi_i \ge 0$
• We will be interested primarily in convex optimization problems
  • 33. Convex optimization problem
A set $\Omega \subset \mathbb{R}^{n}$ is called convex if, for any $\mathbf{w}, \mathbf{u} \in \Omega$ and any $\theta \in (0,1)$, the point $\theta\mathbf{w} + (1-\theta)\mathbf{u} \in \Omega$.
If the function f is convex, any local minimum $\mathbf{w}^{*}$ of an unconstrained optimization problem with objective function f is also a global minimum, since for any $\mathbf{u} \ne \mathbf{w}^{*}$, $f(\mathbf{w}^{*}) \le f(\mathbf{u})$.
A convex optimization problem is one in which the set $\Omega$, the objective function, and all the constraints are convex.
Lagrangian Theory
Given an optimization problem with an objective function $f(\mathbf{w})$ and equality constraints $h_j(\mathbf{w}) = 0$, $j = 1, \ldots, m$, we define the Lagrangian as
$L(\mathbf{w},\boldsymbol{\beta}) = f(\mathbf{w}) + \sum_{j=1}^{m}\beta_j h_j(\mathbf{w})$
where the $\beta_j$ are called the Lagrange multipliers. The necessary conditions for $\mathbf{w}^{*}$ to be a minimum of $f(\mathbf{w})$ subject to the constraints $h_j(\mathbf{w}) = 0$, $j = 1, \ldots, m$, are
$\dfrac{\partial L(\mathbf{w}^{*},\boldsymbol{\beta}^{*})}{\partial\mathbf{w}} = 0, \qquad \dfrac{\partial L(\mathbf{w}^{*},\boldsymbol{\beta}^{*})}{\partial\boldsymbol{\beta}} = 0$
The condition is sufficient if $L(\mathbf{w},\boldsymbol{\beta}^{*})$ is a convex function of $\mathbf{w}$.
  • 34. Lagrangian Theory – Example
Minimize $f(x,y) = x + 2y$ subject to $x^{2} + y^{2} - 4 = 0$.
$L(x,y,\lambda) = x + 2y + \lambda\left(x^{2} + y^{2} - 4\right)$
$\dfrac{\partial L}{\partial x} = 1 + 2\lambda x = 0, \quad \dfrac{\partial L}{\partial y} = 2 + 2\lambda y = 0, \quad \dfrac{\partial L}{\partial\lambda} = x^{2} + y^{2} - 4 = 0$
Solving the above, we have $\lambda = \pm\dfrac{\sqrt{5}}{4}$, $x = \mp\dfrac{2}{\sqrt{5}}$, $y = \mp\dfrac{4}{\sqrt{5}}$.
f is minimized when $x = -\dfrac{2}{\sqrt{5}}$, $y = -\dfrac{4}{\sqrt{5}}$.
Lagrangian theory – example
Find the lengths u, v, w of the sides of the box that has the largest volume for a given surface area c:
minimize $-uvw$ subject to $uv + vw + wu = \dfrac{c}{2}$
$L = -uvw + \beta\left(uv + vw + wu - \dfrac{c}{2}\right)$
$\dfrac{\partial L}{\partial u} = -vw + \beta(v + w) = 0; \quad \dfrac{\partial L}{\partial v} = -uw + \beta(u + w) = 0; \quad \dfrac{\partial L}{\partial w} = -uv + \beta(u + v) = 0$
Solving these gives $u = v = w = \sqrt{\dfrac{c}{6}}$.
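A quick numerical cross-check of the first example above, as reconstructed (minimize $x + 2y$ on the circle $x^2 + y^2 = 4$), using SciPy's SLSQP solver; the starting point is an arbitrary assumption.

```python
import numpy as np
from scipy.optimize import minimize

res = minimize(
    lambda v: v[0] + 2 * v[1],                               # objective x + 2y
    x0=np.array([1.0, 1.0]),
    method="SLSQP",
    constraints=[{"type": "eq",
                  "fun": lambda v: v[0] ** 2 + v[1] ** 2 - 4.0}],  # x^2 + y^2 = 4
)
print(res.x, res.fun)   # approx (-2/sqrt(5), -4/sqrt(5)), f = -2*sqrt(5)
```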
  • 35. Lagrangian Optimization – Example
• The entropy of a probability distribution $\mathbf{p} = (p_1, \ldots, p_n)$ over a finite set $\{1, 2, \ldots, n\}$ is defined as
$H(\mathbf{p}) = -\sum_{i=1}^{n} p_i \log_2 p_i$
• The maximum entropy distribution can be found by minimizing $-H(\mathbf{p})$ subject to the constraints
$\sum_{i=1}^{n} p_i = 1, \qquad p_i \ge 0 \;\; \forall i$
Generalized Lagrangian Theory
Given an optimization problem with domain $\Omega \subseteq \mathbb{R}^{n}$:
minimize $f(\mathbf{w})$, $\mathbf{w} \in \Omega$ (objective function)
subject to $g_i(\mathbf{w}) \le 0$, $i = 1, \ldots, k$ (inequality constraints)
$h_j(\mathbf{w}) = 0$, $j = 1, \ldots, m$ (equality constraints)
where f is convex, and the $g_i$ and $h_j$ are affine, we can define the generalized Lagrangian function as
$L(\mathbf{w},\boldsymbol{\alpha},\boldsymbol{\beta}) = f(\mathbf{w}) + \sum_{i=1}^{k}\alpha_i g_i(\mathbf{w}) + \sum_{j=1}^{m}\beta_j h_j(\mathbf{w})$
An affine function is a linear function plus a translation: F(x) is affine if F(x) = G(x) + b, where G(x) is a linear function of x and b is a constant.
  • 36. Generalized Lagrangian Theory: KKT Conditions
  Given an optimization problem with domain Ω ⊆ ℝⁿ:
    minimize f(w), w ∈ Ω, subject to g_i(w) ≤ 0 (i = 1…k) and h_j(w) = 0 (j = 1…m),
  where f is convex and the g_i and h_j are affine, the necessary and sufficient conditions for w* to be an optimum are the existence of α* and β* such that
  $\frac{\partial L(\mathbf{w}^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)}{\partial \mathbf{w}} = 0, \qquad \frac{\partial L(\mathbf{w}^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)}{\partial \boldsymbol{\beta}} = 0$
  $\alpha_i^* g_i(\mathbf{w}^*) = 0, \quad g_i(\mathbf{w}^*) \le 0, \quad \alpha_i^* \ge 0, \qquad i = 1 \ldots k$
  Example
  Minimize f(x, y) = (x − 1)² + (y − 3)² subject to x + y ≤ 2 and x − y ≤ 0
  $L(x, y, \lambda_1, \lambda_2) = (x-1)^2 + (y-3)^2 + \lambda_1(x + y - 2) + \lambda_2(x - y)$
  $\frac{\partial L}{\partial x} = 2(x-1) + \lambda_1 + \lambda_2 = 0, \qquad \frac{\partial L}{\partial y} = 2(y-3) + \lambda_1 - \lambda_2 = 0, \qquad \lambda_1 \ge 0, \; \lambda_2 \ge 0$
  Case 1: λ₁ = 0, λ₂ = 0 gives x = 1, y = 3. This is not a feasible solution because it violates x + y ≤ 2.
  Case 2: λ₁ = 0, λ₂ ≠ 0 forces x = y, which gives x = y = 2 and λ₂ = −2. Similarly, case 2 does not yield a feasible solution.
  Case 3: λ₁ ≠ 0, λ₂ = 0 forces x + y = 2; together with 2(x − 1) + λ₁ = 0 and 2(y − 3) + λ₁ = 0 this yields x = 0, y = 2, λ₁ = 2, which is a feasible solution.
  Case 4: λ₁ ≠ 0, λ₂ ≠ 0 forces x + y = 2 and x = y, i.e. x = y = 1, which gives λ₁ = 2, λ₂ = −2 < 0. Case 4 does not yield a feasible solution.
  Hence, the solution is x = 0, y = 2 (with λ₁ = 2, λ₂ = 0).
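  A short numerical check of the example above confirms the solution found by the case analysis. A sketch assuming SciPy; not part of the original slides.

```python
# Numerical check of the KKT example: minimize (x-1)^2 + (y-3)^2
# subject to x + y <= 2 and x - y <= 0.  The case analysis gives (x, y) = (0, 2).
from scipy.optimize import minimize

f = lambda v: (v[0] - 1)**2 + (v[1] - 3)**2
cons = [{"type": "ineq", "fun": lambda v: 2 - v[0] - v[1]},   # x + y <= 2, written as g(v) >= 0
        {"type": "ineq", "fun": lambda v: v[1] - v[0]}]       # x - y <= 0, written as g(v) >= 0

print(minimize(f, x0=[0.0, 0.0], constraints=cons).x)         # approx [0., 2.]
```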
  • 37. Dual Optimization Problem
  A primal problem can be transformed into a dual by setting the derivatives of the Lagrangian with respect to the primal variables to zero and substituting the results back into the Lagrangian, thereby removing the dependence on the primal variables. The result is
  $\theta(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \inf_{\mathbf{w} \in \Omega} L(\mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\beta})$
  which contains only dual variables and needs to be maximized under simpler constraints.
  The infimum of a set S of real numbers, denoted inf(S), is defined to be the largest real number that is smaller than or equal to every number in S. If no such number exists, then inf(S) = −∞.
  Maximum Margin Hyperplane
  The problem of finding the maximal margin hyperplane is a constrained optimization problem. We use Lagrangian theory, as extended by Karush, Kuhn, and Tucker (KKT).
  Lagrangian:
  $L_p(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\langle \mathbf{w}, \mathbf{w} \rangle - \sum_i \alpha_i \left[ y_i\left(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\right) - 1 \right], \qquad \alpha_i \ge 0$
  • 38. From Primal to Dual
  Minimize L_p(w, b, α) with respect to (w, b), requiring that the derivatives of L_p with respect to w and b vanish, subject to the constraints α_i ≥ 0.
  Differentiating L_p:
  $\frac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i, \qquad \frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$
  Substituting these equality constraints back into L_p, we obtain the dual problem.
  The Dual Problem
  Maximize
  $L_D(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$
  subject to
  $\sum_i \alpha_i y_i = 0 \quad \text{and} \quad \alpha_i \ge 0$
  b does not appear in the dual and has to be solved for using the primal constraints:
  $b = -\frac{1}{2}\left[\max_{\{i:\, y_i = -1\}} \langle \mathbf{w}, \mathbf{x}_i \rangle + \min_{\{i:\, y_i = +1\}} \langle \mathbf{w}, \mathbf{x}_i \rangle\right]$
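  The pieces above can be put together on a tiny example: solve the dual with a general-purpose optimizer, then recover w = Σᵢ αᵢ yᵢ xᵢ and b from the support vectors. A minimal sketch assuming NumPy and SciPy; the toy data set and the use of a generic constrained solver are illustrative choices, not part of the slides.

```python
# Hard-margin dual on a tiny linearly separable data set, followed by recovery of w and b.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * y[None, :]) * (X @ X.T)              # G_ij = y_i y_j <x_i, x_j>

neg_dual = lambda a: 0.5 * a @ G @ a - a.sum()         # maximizing L_D  <=>  minimizing -L_D
cons = [{"type": "eq", "fun": lambda a: a @ y}]        # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * len(y)                        # alpha_i >= 0

alpha = minimize(neg_dual, np.zeros(len(y)), bounds=bounds, constraints=cons).x
w = (alpha * y) @ X                                    # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                                      # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)                         # from y_j(<w, x_j> + b) = 1 on the SVs

print(alpha)                                           # roughly [0.4, 0, 0, 0.4]
print(w, b)                                            # roughly [0.4, 0.8] and -1.4
print(1 / np.linalg.norm(w), 1 / np.sqrt(alpha[sv].sum()))  # geometric margin computed two ways
```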
  • 39. Karush–Kuhn–Tucker Conditions
  Solving for the SVM is equivalent to finding a solution to the KKT conditions:
  $L_p = \frac{1}{2}\langle \mathbf{w}, \mathbf{w} \rangle - \sum_i \alpha_i \left[ y_i\left(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\right) - 1 \right]$
  $\frac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i, \qquad \frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$
  $y_i\left(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\right) - 1 \ge 0, \qquad \alpha_i\left[ y_i\left(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\right) - 1 \right] = 0, \qquad \alpha_i \ge 0$
  Karush–Kuhn–Tucker Conditions for SVM
  The KKT conditions state that the optimal solution (α, w, b) must satisfy
  $\alpha_i\left[ y_i\left(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\right) - 1 \right] = 0$
  Only the training samples x_i for which the functional margin equals 1 can have nonzero α_i. They are called support vectors.
  The optimal hyperplane can be expressed in the dual representation in terms of this subset of training samples – the support vectors:
  $f(\mathbf{x}, \boldsymbol{\alpha}, b) = \langle \mathbf{w}, \mathbf{x} \rangle + b = \sum_{i=1}^{l} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b = \sum_{i \in \mathrm{SV}} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b$
  • 40. For any support vector x_j (j ∈ SV), the KKT conditions give $y_j\left(\sum_{i \in \mathrm{SV}} y_i \alpha_i \langle \mathbf{x}_i, \mathbf{x}_j \rangle + b\right) = 1$, so
  $b = y_j - \sum_{i \in \mathrm{SV}} y_i \alpha_i \langle \mathbf{x}_i, \mathbf{x}_j \rangle$
  Moreover,
  $\langle \mathbf{w}, \mathbf{w} \rangle = \sum_{i,j} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle = \sum_{j \in \mathrm{SV}} \alpha_j y_j \left( \sum_{i \in \mathrm{SV}} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x}_j \rangle \right) = \sum_{j \in \mathrm{SV}} \alpha_j y_j \left( y_j - b \right) = \sum_{j \in \mathrm{SV}} \alpha_j - b \sum_{j \in \mathrm{SV}} \alpha_j y_j = \sum_{j \in \mathrm{SV}} \alpha_j$
  where the last step uses the KKT condition $\sum_j \alpha_j y_j = 0$. So we have
  $\gamma = \frac{1}{\|\mathbf{w}\|} = \left(\sum_{i \in \mathrm{SV}} \alpha_i\right)^{-1/2}$
  Support Vector Machines Yield Sparse Solutions
  [Figure: a maximum margin separating hyperplane with margin γ; only the points lying on the margin – the support vectors – have nonzero α_i.]
  • 41. Example: XOR
  Training data: X = {[−1,−1], [−1,+1], [+1,−1], [+1,+1]} with labels y = {−1, +1, +1, −1}
  Kernel: $K(\mathbf{x}, \mathbf{x}_j) = (1 + \mathbf{x}^T \mathbf{x}_j)^2$, corresponding to the feature map
  $\varphi(\mathbf{x}) = \left[\,1,\; x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2\,\right]^T$
  Kernel matrix:
  $K = \begin{bmatrix} 9 & 1 & 1 & 1 \\ 1 & 9 & 1 & 1 \\ 1 & 1 & 9 & 1 \\ 1 & 1 & 1 & 9 \end{bmatrix}$
  Lagrangian and dual objective:
  $L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{N} \alpha_i \left[ y_i\left(\mathbf{w}^T \varphi(\mathbf{x}_i) + b\right) - 1 \right]$
  $Q(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$
  Example: XOR (continued)
  $Q(\boldsymbol{\alpha}) = \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \frac{1}{2}\left( 9\alpha_1^2 - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4 + 9\alpha_2^2 + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 + 9\alpha_3^2 - 2\alpha_3\alpha_4 + 9\alpha_4^2 \right)$
  Setting the derivatives of Q with respect to the dual variables to 0:
  $9\alpha_1 - \alpha_2 - \alpha_3 + \alpha_4 = 1$
  $-\alpha_1 + 9\alpha_2 + \alpha_3 - \alpha_4 = 1$
  $-\alpha_1 + \alpha_2 + 9\alpha_3 - \alpha_4 = 1$
  $\alpha_1 - \alpha_2 - \alpha_3 + 9\alpha_4 = 1$
  Solution: $\alpha^o_1 = \alpha^o_2 = \alpha^o_3 = \alpha^o_4 = \frac{1}{8}$, with $Q(\boldsymbol{\alpha}^o) = \frac{1}{4}$
  Not surprisingly, each of the training samples is a support vector.
  $\mathbf{w}_o = \sum_{i=1}^{N} \alpha_i y_i \varphi(\mathbf{x}_i) = \left[\, 0,\; 0,\; -\tfrac{1}{\sqrt{2}},\; 0,\; 0,\; 0 \,\right]^T$
  so that $\frac{1}{2}\|\mathbf{w}_o\|^2 = \frac{1}{4}$, the optimal margin is $\gamma = 1/\|\mathbf{w}_o\| = \sqrt{2}$, and the learned decision function is $f(\mathbf{x}) = \mathbf{w}_o^T \varphi(\mathbf{x}) = -x_1 x_2$.
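  The XOR solution above can be checked against a standard SVM solver: with the kernel K(x, z) = (1 + ⟨x, z⟩)², all four points come out as support vectors with α_i = 1/8 and the decision function equals −x₁x₂ on the corners. A sketch assuming scikit-learn; the choice of solver is illustrative.

```python
# XOR with the quadratic kernel (1 + <x, z>)^2; a large C approximates the hard-margin SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6)   # K(x, z) = (1 + <x, z>)^2
clf.fit(X, y)

print(clf.dual_coef_)            # y_i * alpha_i for the support vectors: all entries +/- 0.125
print(clf.intercept_)            # approx 0
print(clf.decision_function(X))  # approx [-1, 1, 1, -1], i.e. f(x) = -x1*x2 on the corners
```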
  • 42. Example – Two Spirals
  [Figure: the two-spirals classification problem.]
  Summary of the maximal margin classifier
  • Good generalization when the training data is noise free and separable in the kernel-induced feature space
  • Excessively large Lagrange multipliers often signal outliers – data points that are most difficult to classify
    – SVM can therefore be used for identifying outliers
  • Focuses on boundary points
    – If the boundary points are mislabeled (noisy training data) and the training set is still separable, the result is a terrible classifier
    – Noisy training data often makes the training set non-separable
    – Using highly nonlinear kernels to make the training set separable increases the likelihood of overfitting – despite maximizing the margin
  • 43. Non-Separable Case
  • In the case of data that is non-separable in the feature space, the objective function grows arbitrarily large
  • Solution: relax the constraint that all training data be correctly classified, but only when necessary to do so
  • Cortes and Vapnik (1995) introduced slack variables ξ_i, i = 1…l, into the constraints:
  $\xi_i = \max\left(0, \; \gamma - y_i\left(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\right)\right)$
  Non-Separable Case
  [Figure: a separating hyperplane with margin γ and two margin violations, with slacks ξ_i and ξ_j.]
  The generalization error can be shown to be bounded (up to constant factors) by
  $\varepsilon \le \frac{1}{l}\left(\frac{R^2 + \sum_i \xi_i^2}{\gamma^2}\right)$
  where R is the radius of the smallest ball containing the training data.
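  The slack variables are simply the amounts by which the margin constraints are violated, and they are easy to compute for any candidate hyperplane. A minimal sketch assuming NumPy, with the target functional margin taken to be 1 as in the slides that follow; the data and the candidate (w, b) are illustrative.

```python
# Slack variables xi_i = max(0, 1 - y_i(<w, x_i> + b)) for a candidate hyperplane (w, b).
# xi_i = 0: margin satisfied; 0 < xi_i <= 1: inside the margin; xi_i > 1: misclassified.
import numpy as np

def slacks(w, b, X, y):
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

X = np.array([[2.0, 2.0], [1.0, 0.5], [0.0, 0.0]])
y = np.array([1.0, -1.0, -1.0])
print(slacks(np.array([1.0, 1.0]), -2.0, X, y))   # [0., 0.5, 0.] -> one margin violation
```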
  • 44. Non-Separable Case
  For an error to occur, the corresponding ξ_i must exceed γ (which was chosen to be unity), so Σ_i ξ_i is an upper bound on the number of training errors.
  The objective function is therefore changed from
  $\frac{\|\mathbf{w}\|^2}{2}$ to $\frac{\|\mathbf{w}\|^2}{2} + C\left(\sum_i \xi_i^k\right)$
  where C and k are parameters (typically k = 1 or 2, and C is a large positive constant).
  The Soft-Margin Classifier
  Minimize $\frac{1}{2}\langle \mathbf{w}, \mathbf{w} \rangle + C\sum_i \xi_i$
  subject to $y_i\left(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0$
  Lagrangian:
  $L_P = \frac{\|\mathbf{w}\|^2}{2} + C\sum_i \xi_i \;-\; \sum_i \alpha_i\left[ y_i\left(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\right) - 1 + \xi_i \right] \;-\; \sum_i \mu_i \xi_i$
  (the first two terms are the objective function, the third enforces the inequality constraints, and the last enforces the non-negativity constraints on the ξ_i).
  • 45. The Soft-Margin Classifier: KKT Conditions
  $\alpha_i \ge 0, \quad \xi_i \ge 0, \quad \mu_i \ge 0$
  $\alpha_i\left[ y_i\left(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\right) - 1 + \xi_i \right] = 0, \qquad \mu_i \xi_i = 0$
  α_i > 0 only if x_i is a support vector, i.e. ⟨w, x_i⟩ + b = ±1 and hence ξ_i = 0, or x_i violates the margin (and is possibly misclassified by (w, b)), in which case ξ_i > 0.
  From primal to dual
  Equating the derivatives of L_P with respect to the primal variables to 0:
  $\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i, \qquad \frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0, \qquad \frac{\partial L_P}{\partial \xi_i} = 0 \;\Rightarrow\; C - \alpha_i - \mu_i = 0 \;\Rightarrow\; 0 \le \alpha_i \le C$
  Substituting back into L_P, we get the dual to be maximized:
  $L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$
  • 46. Soft-Margin Dual Lagrangian
  Maximize
  $L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$
  subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$
  KKT conditions: $\alpha_i\left[ y_i\left(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\right) - 1 + \xi_i \right] = 0$
  • Slack variables are nonzero only when α_i = C (margin-violating or misclassified training samples)
  • For true support vectors, 0 < α_i < C and ξ_i = 0
  • For all other (correctly classified) training samples, α_i = 0
  Implementation Techniques
  Maximizing a quadratic function, subject to linear equality and inequality constraints:
  $W(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
  $0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0$
  $\frac{\partial W(\boldsymbol{\alpha})}{\partial \alpha_i} = 1 - y_i \sum_j y_j \alpha_j K(x_i, x_j)$
  • 47. On-line algorithm for the 1-norm soft margin (bias b = 0)
  Given training set D:
    α ← 0
    repeat
      for i = 1 to l
        α_i ← α_i + η_i (1 − y_i Σ_j α_j y_j K(x_i, x_j))
        if α_i < 0 then α_i ← 0
        else if α_i > C then α_i ← C
      end for
    until stopping criterion satisfied
    return α
  where η_i = 1 / K(x_i, x_i). In the general setting b ≠ 0, and we also need to solve for b.
  Implementation Techniques
  • Use QP packages (MINOS, LOQO, quadprog from the MATLAB optimization toolbox). They are not online and require that the data be held in memory in the form of the kernel matrix.
  • Stochastic gradient ascent: sequentially update one weight at a time, so it is online. It gives an excellent approximation in most cases:
  $\hat{\alpha}_i \leftarrow \alpha_i + \frac{1}{K(x_i, x_i)}\left(1 - y_i \sum_j \alpha_j y_j K(x_j, x_i)\right)$
  A sketch implementation of this update is given below.
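  The on-line 1-norm soft-margin update above (bias b = 0) is short enough to implement directly. A minimal sketch assuming NumPy; the RBF kernel, the toy data, and the parameter values are illustrative choices, not part of the slides.

```python
# Kernel coordinate-ascent update for the 1-norm soft margin with b = 0:
# alpha_i <- alpha_i + eta_i (1 - y_i sum_j alpha_j y_j K(x_i, x_j)), then clip to [0, C].
import numpy as np

def rbf(X, Z, gamma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances
    return np.exp(-gamma * d2)

def online_soft_margin(X, y, C=10.0, epochs=50, gamma=1.0):
    K = rbf(X, X, gamma)
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        for i in range(len(y)):
            eta = 1.0 / K[i, i]                            # step size from the slide
            alpha[i] += eta * (1.0 - y[i] * np.sum(alpha * y * K[i]))
            alpha[i] = min(max(alpha[i], 0.0), C)          # clip to the box constraint [0, C]
    return alpha

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha = online_soft_margin(X, y)
pred = np.sign(rbf(X, X) @ (alpha * y))                    # decision function with b = 0
print(alpha, pred)                                         # pred should agree with y on this toy set
```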
  • 48. Chunking and Decomposition
  Given training set D:
    α ← 0
    select an arbitrary working set D̂ ⊂ D
    repeat
      solve the optimization problem on D̂
      select a new working set from the data not satisfying the KKT conditions
    until stopping criterion satisfied
    return α
  Sequential Minimal Optimization (SMO)
  • At each step, SMO optimizes two weights α_i, α_j simultaneously so as not to violate the linear constraint Σ_i α_i y_i = 0
  • Optimization of the two weights is performed analytically
  • Realizes gradient descent without leaving the linear constraint (J. Platt)
  • Online versions exist (Li, Long; Gentile)
  • 49. SVM Implementations
  • Details of optimization and quadratic programming
  • SVMlight – one of the first practical implementations of SVM (Joachims)
  • MATLAB toolboxes for SVM
  • LIBSVM – one of the best SVM implementations, by Chih-Chung Chang and Chih-Jen Lin of National Taiwan University http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  • WLSVM – LIBSVM integrated with the WEKA machine learning toolbox, by Yasser El-Manzalawy of the ISU AI Lab http://www.cs.iastate.edu/~yasser/wlsvm/
  Why does SVM Work?
  • The feature space is often very high dimensional. Why don't we run into the curse of dimensionality?
  • A classifier in a high-dimensional space has many parameters and is hard to estimate
  • Vapnik argues that the fundamental problem is not the number of parameters to be estimated; rather, the problem is the capacity of the classifier
  • Typically, a classifier with many parameters is very flexible, but there are also exceptions
    – Let x_i = 10^{-i}, with i ranging from 1 to n. The one-parameter classifier sign(sin(θx)) can classify all of the x_i correctly for every possible combination of class labels on the x_i, by a suitable choice of θ
    – This 1-parameter classifier is very flexible
  • 50. Why does SVM work?
  • Vapnik argues that the capacity of a classifier should be characterized not by the number of its parameters, but by its flexibility; this is formalized by the "VC-dimension" of the classifier
  • The minimization of ||w||² subject to the condition that the functional margin equals 1 has the effect of restricting the VC-dimension of the classifier in the feature space
  • The SVM performs structural risk minimization: the empirical risk (training error), plus a term related to the generalization ability of the classifier, is minimized
  • The SVM loss function is analogous to ridge regression: the term ½||w||² "shrinks" the parameters towards zero to avoid overfitting
  Choosing the Kernel Function
  • Probably the trickiest part of using an SVM
  • The kernel function should maximize the similarity among instances within a class while accentuating the differences between classes
  • A variety of kernels have been proposed (diffusion kernel, Fisher kernel, string kernel, …) for different types of data
  • In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try for data that live in a fixed-dimensional input space (the sketch below compares a few such candidates by cross-validation)
  • Low-order Markov kernels and their relatives are good ones to consider for structured data – strings, images, etc.
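  One simple way to act on the advice above is to compare a few candidate kernels by cross-validation. A sketch assuming scikit-learn; the data set and the parameter values are illustrative only.

```python
# Compare low-degree polynomial and RBF kernels by 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
candidates = {
    "poly deg 2":     SVC(kernel="poly", degree=2, coef0=1.0),
    "poly deg 3":     SVC(kernel="poly", degree=3, coef0=1.0),
    "rbf gamma 0.01": SVC(kernel="rbf", gamma=0.01),
    "rbf gamma 0.1":  SVC(kernel="rbf", gamma=0.1),
}
for name, clf in candidates.items():
    print(name, cross_val_score(clf, X, y, cv=5).mean())   # mean CV accuracy per kernel
```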
  • 51. Other Aspects of SVM
  • How to use SVM for multi-class classification?
    – One can change the QP formulation to be multi-class
    – More often, multiple binary classifiers are combined
    – One can train multiple one-versus-all classifiers, or combine multiple pairwise classifiers "intelligently"
  • How to interpret the SVM discriminant function value as a probability?
    – By performing logistic regression on the SVM output over a set of data that is not used for training
  • Some SVM software (like LIBSVM) has these features built in
  Recap: How to Use SVM
  • Prepare the data set
  • Select the kernel function to use
  • Select the parameters of the kernel function and the value of C
    – You can use the values suggested by the SVM software, or you can use a tuning set to tune C
  • Execute the training algorithm and obtain the α_i and b
  • Unseen data can be classified using the α_i, the support vectors, and b (see the sketch below)
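  Putting the recap into practice, the sketch below runs the full workflow with a standard SVM implementation: scale the data, select C and the kernel width on a tuning set via cross-validated grid search, train, and classify unseen data. Setting probability=True adds the logistic-regression-on-SVM-outputs calibration mentioned above. Assumes scikit-learn; the synthetic data and the parameter grid are illustrative.

```python
# End-to-end SVM workflow: scaling, model selection, training, and prediction.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X_tr, y_tr)

print(grid.best_params_)              # selected C and gamma
print(grid.score(X_te, y_te))         # accuracy on unseen data
print(grid.predict_proba(X_te[:3]))   # class probabilities via logistic calibration of SVM outputs
```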
  • 52. Strengths and Weaknesses of SVM
  • Strengths
    – Training is relatively easy: there are no local optima
    – It scales relatively well to high-dimensional data
    – The tradeoff between classifier complexity and error can be controlled explicitly
    – Non-traditional data such as strings and trees can be used as input to an SVM, instead of feature vectors
  • Weaknesses
    – Need to choose a "good" kernel function
  Recent developments
  • Better understanding of the relation between SVM and regularized discriminative classifiers (e.g., logistic regression)
  • Knowledge-based SVM – incorporating prior knowledge as constraints in the SVM optimization (Shavlik et al.)
  • A zoo of kernel functions
  • Extensions of SVM to multi-class and structured-label classification tasks
  • One-class SVM (anomaly detection) – see the sketch below
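  For the last item, a minimal example of anomaly detection with a one-class SVM: fit on nominal data only, then flag points that fall outside the learned region. A sketch assuming scikit-learn; the data and parameter values are illustrative.

```python
# One-class SVM for anomaly detection: +1 = inlier, -1 = outlier.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(200, 2))        # nominal data only
X_test = np.array([[0.1, -0.2], [6.0, 6.0]])         # one typical point, one far-away point

oc = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)
print(oc.predict(X_test))                            # expected: [ 1, -1]
```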
  • 53. Representative Applications of SVM
  • Handwritten letter classification
  • Text classification
  • Image classification – e.g., face recognition
  • Bioinformatics
    – Gene expression based tissue classification
    – Protein function classification
    – Protein structure classification
    – Protein sub-cellular localization
  • …