Summer School
“Achievements and Applications of Contemporary Informatics,
         Mathematics and Physics” (AACIMP 2011)
              August 8-20, 2011, Kiev, Ukraine




                       Classification

                                 Erik Kropat

                     University of the Bundeswehr Munich
                      Institute for Theoretical Computer Science,
                        Mathematics and Operations Research
                                Neubiberg, Germany
Examples

Clinical trials
In a clinical trial, 20 laboratory values are collected from 10,000 patients, together
with the diagnosis (ill / not ill).

   We measure the values of a new patient.
   Is he / she ill or not?

Credit ratings
An online shop collects data from its customers together with some information
about the credit rating ( good customer / bad customer ).

   We get the data of a new customer.
   Is he / she a good customer or not?
Machine Learning / Classification

[Diagram: Labeled training examples → Machine learning algorithm → Classification rule.
 A new example is fed into the classification rule, which outputs the predicted classification.]
k Nearest Neighbor Classification
            ̶ kNN ̶
k Nearest Neighbor Classification
Idea: Classify a new object with regard to a set of training examples.
      Compare the new object with the k “nearest” objects (“nearest neighbors”).

[Figure: 4-nearest neighbor example. Legend: − objects in class 1, + objects in class 2,
 and a new object to be classified.]
k Nearest Neighbor Classification

[Figure: 5-nearest neighbor example with a new object and its neighborhood.]

• Required
   −   Training set, i.e. objects and their class labels
   −   Distance measure
   −   The number k of nearest neighbors

• Classification of a new object
   −   Calculate the distance between the new object and the objects of the training set.
   −   Identify the k nearest neighbors.
   −   Use the class labels of the k nearest neighbors to determine the class of the new object
       (e.g. by majority vote; a short code sketch follows below).
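A minimal sketch of this procedure (not from the original slides): NumPy-based, with a plain Euclidean distance and a simple majority vote.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=5):
    """Classify x_new by a majority vote among its k nearest training objects."""
    # Euclidean distances between the new object and all training objects
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Majority vote over their class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage example
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([-1, -1, +1, +1])
print(knn_classify(X_train, y_train, np.array([0.8, 0.9]), k=3))   # -> 1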
k Nearest Neighbor Classification
[Figure: the same training set classified with 1, 2 and 3 nearest neighbors.
 Classification of the new object: 1-nearest neighbor gives class −, 2-nearest neighbor is
 ambiguous (class label decided by distance), 3-nearest neighbor gives class +.]
1-nearest neighbor ⇒ Voronoi diagram
kNN ̶ k Nearest Neighbor Classification

Distance

• The distance between the new object and the objects in the set of training samples
  is usually measured by the Euclidean metric or the squared Euclidean metric.

• In text mining the Hamming distance is often used.
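Hedged sketches of the distance measures mentioned above, assuming NumPy arrays of equal length (not part of the original slides):

import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def squared_euclidean(a, b):
    return np.sum((a - b) ** 2)

def hamming(a, b):
    # Number of positions in which two equal-length vectors differ
    return int(np.sum(np.asarray(a) != np.asarray(b)))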
kNN ̶ k Nearest Neighbor Classification

Class label of the new object

• The class label of the new object is determined by the list of the k nearest neighbors.

  This could be achieved by

    − Majority vote with regard to the class labels of the k nearest neighbors.
    − A distance-weighted vote of the k nearest neighbors (a sketch follows below).
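A minimal sketch of a distance-weighted vote. Inverse-distance weights are one common choice; the slides do not prescribe a particular weighting.

import numpy as np

def weighted_vote(distances, labels, eps=1e-12):
    """Return the class whose nearest neighbors carry the largest total inverse-distance weight."""
    weights = 1.0 / (np.asarray(distances) + eps)   # closer neighbors count more
    labels = np.asarray(labels)
    return max(np.unique(labels), key=lambda c: weights[labels == c].sum())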
kNN ̶ k Nearest Neighbor Classification

• The value of k has a strong influence on the classification result.

    − k too small: Noise can have a strong influence.
    − k too large:   Neighborhood can contain objects from different classes
                     (ambiguity / false classification)
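The slides do not prescribe how to choose k; a common approach is to pick it on a hold-out validation set. A minimal sketch, reusing the knn_classify function from the earlier sketch:

import numpy as np

def choose_k(X_train, y_train, X_val, y_val, candidates=(1, 3, 5, 7, 9)):
    """Pick the k with the highest accuracy on a hold-out validation set."""
    best_k, best_acc = candidates[0], -1.0
    for k in candidates:
        preds = np.array([knn_classify(X_train, y_train, x, k=k) for x in X_val])
        acc = np.mean(preds == y_val)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k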


Support Vector Machines
Support Vector Machines

A set of training samples with objects in Rn is divided into two categories:

              positive objects             and         negative objects
Support Vector Machines

Goal: “Learn” a decision rule from the training samples.
        Assign a new example into the “positive” or the “negative” category.
Idea:   Determine a separating hyperplane.




New objects are classified as
  − positive, if they are in the half space of the positive examples,
  − negative, if they are in the half space of the negative examples.
Support Vector Machines

INPUT:     Sample of training data

             T = { (x1, y1),...,(xk, yk) | xi ∈ Rn , yi ∈ { -1, +1 } },

           with   xi ∈ Rn           data          (e.g. laboratory values of patients with confirmed diagnosis)
           and    yi ∈ {-1, +1}     class label   (e.g. disease: yes / no)


Decision rule:

             f : Rn → {-1, +1}

           Applied to the input data of a new patient (laboratory values),
           it yields the decision: disease yes / no.
Separating Hyperplane

A separating hyperplane is determined by
 − a normal vector w   and
 − a parameter b,

   H = { x ∈ Rn | 〈 w, x 〉 − b = 0 }        (〈 ·, · 〉 denotes the scalar product)


Offset of the hyperplane from the origin along w:    b / ‖ w ‖

 Idea: Choose w and b such that the hyperplane separates the set of training samples
       in an optimal way.
What is a good separating hyperplane?

There exist many separating hyperplanes




                    Will this new object be in the “red” class?
Question:      What is the best separating hyperplane?
Answer:        Choose the separating hyperplane so that the distance from it
               to the nearest data point on each side is maximized.

[Figure: the maximum-margin hyperplane H; the margin is bounded by the support vectors on each side.]
Scaling of Hyperplanes

• A hyperplane can be defined in many ways:

     For c ≠ 0:   { x ∈ Rn | 〈 w, x 〉 + b = 0 } = { x ∈ Rn | 〈 cw, x 〉 + cb = 0 }


• Use the training samples to choose (w, b) such that

                   min  | 〈 w, xi 〉 + b | = 1                 (canonical hyperplane)
                    xi
Definition
A training sample T = {(x1, y1),...,(xk, yk) | xi ∈ Rn , yi ∈ {-1, +1} } is separable
by the hyperplane

                   H = { x ∈ Rn | 〈 w, x 〉 + b = 0 },

if there exists a vector w ∈ Rn and a parameter b ∈ R, such that

                   〈 w, xi 〉 + b ≥ +1 ,  if yi = +1
                   〈 w, xi 〉 + b ≤ −1 ,  if yi = −1

for all i ∈ {1,...,k}.

[Figure: hyperplane H with normal vector w and the margin hyperplanes 〈 w, x 〉 + b = 1 and 〈 w, x 〉 + b = −1.]
Maximal Margin
• The above conditions can be rewritten:

          yi · ( 〈 w, xi 〉 + b ) ≥ 1   for all i ∈ {1,...,k}


• Distance between the two margin hyperplanes 〈 w, x 〉 + b = 1 and 〈 w, x 〉 + b = −1:

             2 / ‖ w ‖


 ⇒ In order to maximize the margin we must minimize ‖ w ‖.
Optimization problem
Find a normal vector w and a parameter b, such that the distance between
the training samples and the hyperplane defined by w and b is maximized.




 Minimize    (1/2) ‖ w ‖²

 s.t.        yi · ( 〈 w, xi 〉 + b ) ≥ 1   for all i ∈ {1,...,k}


        ⇒ quadratic programming problem
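A minimal sketch of this quadratic program, assuming the cvxpy package (an assumption, not part of the slides) and linearly separable training data (otherwise the problem is infeasible):

import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve the primal hard-margin QP: min 1/2 ||w||^2  s.t.  y_i (<w, x_i> + b) >= 1."""
    n_samples, n_features = X.shape
    w = cp.Variable(n_features)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    problem.solve()
    return w.value, b.value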
Dual Form

    Find parameters α1,...,αk, such that

                k                  k
    Max         Σ  αi  −  (1/2)    Σ    αi αj yi yj 〈 xi, xj 〉
               i=1               i,j=1

    with       αi ≥ 0    for all i = 1,...,k

                k
                Σ  αi yi = 0
               i=1

    Kernel function:   k( xi, xj ) := 〈 xi, xj 〉


        The maximal margin hyperplane (= the classification problem)
 ⇒
        is only a function of the support vectors.
Dual Form

• When the optimal parameters α1*,...,αk* are known, the normal vector w*
  of the separating hyperplane is given by

                   k
            w* =   Σ   αi* yi xi              (xi, yi: training data)
                  i=1


• The parameter b* is given by

            b* = − (1/2) ( max { 〈 w*, xi 〉 | yi = −1 }  +  min { 〈 w*, xi 〉 | yi = +1 } )
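Assuming optimal dual variables α* are available from some QP solver, the formulas above for w* and b* can be evaluated directly; a minimal NumPy sketch:

import numpy as np

def primal_from_dual(alpha, X, y):
    """Recover w* and b* from optimal dual variables (shapes: alpha, y -> (k,), X -> (k, n))."""
    w = (alpha * y) @ X                     # w* = sum_i alpha_i* y_i x_i
    b = -0.5 * ((X[y == -1] @ w).max() + (X[y == +1] @ w).min())
    return w, b

def classify(x, w, b):
    # Sign decision rule; the slides state f(x) = +1 if <w*, x> + b* >= +1 and -1 if <= -1,
    # which coincides with the sign rule outside the margin.
    return +1 if x @ w + b >= 0 else -1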
Classifier

• A decision function f maps a new object x ∈ Rn to a category f(x) ∈ {-1, +1} :



               +1 , if   〈 w*, x 〉 + b* ≥ +1
     f (x) =
               −1 , if   〈 w*, x 〉 + b* ≤ −1


[Figure: hyperplane H with normal vector w, separating the +1 half space from the -1 half space.]
Support Vector Machines
     ̶ Soft Margins ̶
Soft Margin Support Vector Machines

• Until now: Hard margin SVMs
             The set of training samples can be separated by a hyperplane.

• Problem:   Some elements of the training samples may have a wrong label.
             Then the set of training samples cannot be separated by a hyperplane
             and the hard margin SVM is not applicable.
Soft Margin Support Vector Machines

• Idea: Soft margin SVMs
  Modified maximum margin method for mislabeled examples.

• Choose a hyperplane that splits the training set as cleanly as possible,
  while still maximizing the distance to the nearest cleanly split examples.

• Introduce slack variables ξ1,…, ξ n which
  measure the degree of misclassification.
Soft Margin Support Vector Machines

• Interpretation
 The slack variables measure the degree of misclassification of the training examples
 with regard to a given hyperplane H.



[Figure: hyperplane H with two misclassified training examples and their slack variables ξi and ξj.]
Soft Margin Support Vector Machines

• Replace the constraints

             yi · ( 〈 w, xi 〉 + b ) ≥ 1              for all i ∈ {1,...,n}

 by

             yi · ( 〈 w, xi 〉 + b ) ≥ 1 ̶ ξ i        for all i ∈ {1,...,n}


[Figure: hyperplane H with a training example inside the margin and its slack variable ξi.]
Soft Margin Support Vector Machines
• Idea
 If the slack variables ξ i are small, then:

         ξi = 0   ⇔     xi is correctly classified

   0 < ξi < 1     ⇔     xi is between the margins

         ξi ≥ 1   ⇔     xi is misclassified
                        [ yi · ( 〈 w, xi 〉 + b ) < 0 ]




  Constraint:             yi · ( 〈 w, xi 〉 + b ) ≥ 1 ̶ ξ i for all i ∈ {1,...,n}
Soft Margin Support Vector Machines

• The sum of all slack variables is an upper bound for the total training error:


                       n
                       Σ  ξi
                      i=1

[Figure: hyperplane H with the slack variables ξi and ξj of two training examples.]
Soft Margin Support Vector Machines

Find a hyperplane with maximal margin and minimal training error.


                                                 n
       Minimize      (1/2) ‖ w ‖²   +   C ·      Σ  ξi            (C: regularisation parameter)
                                                i=1

       s.t.          yi · ( 〈 w, xi 〉 + b ) ≥ 1 − ξi             for all i ∈ {1,...,n}

                     ξi ≥ 0                                       for all i ∈ {1,...,n}
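In practice the soft-margin problem is usually solved with a library. A minimal sketch with scikit-learn (an assumption, not part of the slides); its parameter C plays the role of the regularisation constant above:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.3, 0.2], [1.0, 1.0], [0.8, 1.2]])   # toy training data
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=1.0)   # linear soft-margin SVM; larger C penalizes slack more
clf.fit(X, y)
print(clf.predict([[0.9, 0.8]]))    # expected: +1, the point lies near the positive cluster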
Support Vector Machines
 ̶ Nonlinear Classifiers ̶
Support Vector Machines ̶ Nonlinear Separation

Question: Is it possible to create nonlinear classifiers?
Support Vector Machines ̶ Nonlinear Separation

Idea:   Map data points into a higher dimensional feature space
        where a linear separation is possible.




[Figure: the map Ф takes the data from the original space Rn to a feature space Rm.]
Nonlinear Transformation




[Figure: Ф maps the original feature space Rn to the high dimensional feature space Rm.]
Kernel Functions

Assume:     For a given set X of training examples we know a function Ф,
            such that a linear separation in the high-dimensional space is possible.


Decision:   When we have solved the corresponding optimization problem,
            we only need to evaluate a scalar product
            to decide about the class label of a new data object.


                                     n
              f(xnew) =   sign   (   Σ  αi* yi 〈 Ф(xi), Ф(xnew) 〉  +  b*   )   ∈ {-1, +1}
                                    i=1
Kernel functions


Introduce a kernel function



                          K(xi, xj) = 〈 Ф (xi), Ф(xj) 〉

The kernel function defines a similarity measure between the objects xi and xj.


It is not necessary to know the function Ф or the dimension of the feature space!
Kernel Trick

Example:   Transformation into a higher dimensional feature space

              Ф : R² → R³ ,     Ф(x1, x2) = ( x1² , √2 x1 x2 , x2² )

Input:     An element of the training sample x,
           a new object x̂

           〈 Ф(x), Ф(x̂) 〉   =   〈 ( x1² , √2 x1 x2 , x2² ) , ( x̂1² , √2 x̂1 x̂2 , x̂2² ) 〉

                              =   x1² x̂1²  +  2 x1 x̂1 x2 x̂2  +  x2² x̂2²

                              =   ( x1 x̂1 + x2 x̂2 )²

                              =   〈 x, x̂ 〉²                         =:  K( x, x̂ )


           The scalar product in the higher dimensional space (here: R³)
           can be evaluated in the low dimensional original space (here: R²).
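A quick numerical check of this identity (a sketch, assuming NumPy): the scalar product of the mapped vectors equals the squared scalar product of the original vectors.

import numpy as np

def phi(x):
    # Phi(x1, x2) = (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
x_hat = np.array([3.0, 0.5])

lhs = phi(x) @ phi(x_hat)       # scalar product in R^3
rhs = (x @ x_hat) ** 2          # squared scalar product in R^2
print(np.isclose(lhs, rhs))     # True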
Kernel Trick

It is not necessary to apply the nonlinear function Ф to transform
the set of training examples into a higher dimensional feature space.


Use a kernel function

                           K(xi, xj) = 〈 Ф (xi), Ф(xj) 〉

instead of the scalar product in the original optimization problem and the decision problem.
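A minimal sketch of the resulting kernelized decision function; alpha and b are assumed to come from solving the (kernelized) dual problem, and K is any kernel function:

def decision(x_new, X_train, y_train, alpha, b, K):
    """f(x_new) = sign( sum_i alpha_i y_i K(x_i, x_new) + b )."""
    s = sum(a * yi * K(xi, x_new) for a, yi, xi in zip(alpha, y_train, X_train)) + b
    return +1 if s >= 0 else -1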
Kernel Functions


Linear kernel                     K(xi, xj) = 〈 xi, xj 〉

Radial basis function kernel      K(xi, xj) = exp( − ‖ xi − xj ‖² / (2 σ0²) ) ,    σ0² = mean ‖ xi − xj ‖²

Polynomial kernel                 K(xi, xj) = ( s 〈 xi, xj 〉 + c )^d

Sigmoid kernel                    K(xi, xj) = tanh ( s 〈 xi, xj 〉 + c )

Convex combinations of kernels    K(xi, xj) = c1 K1(xi, xj) + c2 K2(xi, xj)

Normalization kernel              K(xi, xj) = K'(xi, xj) / √( K'(xi, xi) K'(xj, xj) )
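Hedged sketches of these kernels as plain Python functions (s, c, d and sigma are user-chosen hyperparameters; the normalization kernel wraps another kernel K'):

import numpy as np

def linear_kernel(a, b):
    return a @ b

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(a, b, s=1.0, c=1.0, d=3):
    return (s * (a @ b) + c) ** d

def sigmoid_kernel(a, b, s=1.0, c=0.0):
    return np.tanh(s * (a @ b) + c)

def normalization_kernel(a, b, K):
    # Normalizes an arbitrary kernel K to unit "self-similarity"
    return K(a, b) / np.sqrt(K(a, a) * K(b, b))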
Summary

• Support vector machines can be used for binary classification.

• We can handle misclassified data if we introduce slack variables.

• If the sets to discriminate are not linearly separable we can use kernel functions.

• Applications → binary decisions

   −   Spam filter (spam / no spam)
   −   Face recognition ( access / no access)
   −   Credit rating (good customer / bad customer)
Literature
• N. Cristianini, J. Shawe-Taylor
  An Introduction to Support Vector Machines and Other
  Kernel-based Learning Methods.
  Cambridge University Press, Cambridge, 2004.

• T. Hastie, R. Tibshirani, J. Friedman
  The Elements of Statistical Learning: Data Mining, Inference,
  and Prediction.
  Springer, New York, 2011.
Thank you very much!
