13. Learning Problem
Colon cancer, Alon et al., 1999
Unsupervised learning: is there structure in the data?
Supervised learning: predict an outcome y.
Data matrix X:
• m lines = patterns (data points, examples): samples, patients, documents, images, …
• n columns = features (attributes, input variables): genes, proteins, words, pixels, …
14. Linear Models
• f(x) = w • x + b = Σj=1:n wj xj + b
Linearity in the parameters w, NOT in the input components.
• f(x) = w • Φ(x) + b = Σj=1:n wj φj(x) + b   (Perceptron)
• f(x) = Σi=1:m αi k(xi, x) + b   (Kernel method)
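The three forms coincide for the linear kernel k(s, t) = s • t, with w = Σi αi xi. A minimal numerical sketch of this equivalence (all data, coefficients, and the bias below are made-up toy values):

```python
import numpy as np

# Toy illustration: data, dual coefficients and bias are made up.
rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.normal(size=(m, n))      # m training patterns, n features
alpha = rng.normal(size=m)       # dual coefficients alpha_i
b = 0.5
x = rng.normal(size=n)           # a new pattern

# Kernel form with the linear kernel k(s, t) = s . t:
#   f(x) = sum_i alpha_i k(x_i, x) + b
f_kernel = alpha @ (X @ x) + b

# Primal (weight-space) form with w = sum_i alpha_i x_i:
#   f(x) = w . x + b
w = X.T @ alpha
f_primal = w @ x + b

print(f_kernel, f_primal)        # same value up to rounding
```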
23. What is a Kernel?
A kernel is:
• a similarity measure
• a dot product in some feature space: k(s, t) = Φ(s) • Φ(t)
But we do not need to know the Φ representation explicitly.
Examples:
• k(s, t) = exp(−||s − t||2/σ2)   (Gaussian kernel)
• k(s, t) = (s • t)q   (Polynomial kernel)
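To illustrate the "dot product in some feature space" property, here is a small sketch (toy vectors; σ = 1 and q = 2 are arbitrary choices): for the degree-2 polynomial kernel on 2-D inputs the feature map Φ is known explicitly, and Φ(s) • Φ(t) reproduces (s • t)².

```python
import numpy as np

def gaussian_kernel(s, t, sigma=1.0):
    """k(s, t) = exp(-||s - t||^2 / sigma^2)."""
    return np.exp(-np.sum((s - t) ** 2) / sigma ** 2)

def poly_kernel(s, t, q=2):
    """k(s, t) = (s . t)^q."""
    return (s @ t) ** q

def phi_poly2(s):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D:
    Phi(s) = (s1^2, sqrt(2)*s1*s2, s2^2), so Phi(s) . Phi(t) = (s . t)^2."""
    return np.array([s[0] ** 2, np.sqrt(2) * s[0] * s[1], s[1] ** 2])

s = np.array([1.0, 2.0])
t = np.array([0.5, -1.0])
print(poly_kernel(s, t), phi_poly2(s) @ phi_poly2(t))   # both 2.25
print(gaussian_kernel(s, s))                            # 1.0 at s = t
```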
26. Tree Classifiers
CART (Breiman, 1984) or C4.5 (Quinlan, 1993).
At each step, choose the feature that "reduces entropy" most. Work towards "node purity".
[Figure: all the data are split recursively on one feature at a time, e.g. first choose f2, then choose f1.]
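The "reduces entropy most" criterion can be sketched as an information-gain computation (made-up toy data; here a split on f2 separates the classes perfectly, so a CART-style learner would choose f2 first):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values, threshold):
    """Entropy reduction from splitting on feature <= threshold."""
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    m = len(labels)
    children = (len(left) / m) * entropy(left) + (len(right) / m) * entropy(right)
    return entropy(labels) - children

# Made-up toy data: feature f2 separates the classes perfectly, f1 does not.
y  = np.array([0, 0, 0, 1, 1, 1])
f1 = np.array([1., 3., 2., 2., 1., 3.])
f2 = np.array([0., 1., 2., 5., 6., 7.])
print(information_gain(y, f1, 2.0), information_gain(y, f2, 3.0))  # f2 wins
```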
27. Iris Data (Fisher, 1936)
[Figure: decision boundaries of four classifiers on the Iris data (classes setosa, versicolor, virginica): linear discriminant, tree classifier, Gaussian mixture, and kernel method (SVM). Figure from Norbert Jankowski and Krzysztof Grabczewski.]
32. ROC Curve
For a given threshold on f(x), you get a point on the ROC curve.
[Figure: ROC plot; x-axis: 1 − negative class success rate (false alarm rate, 1 − specificity), 0 to 100%; y-axis: positive class success rate (hit rate, sensitivity), 0 to 100%. Shows the actual ROC, the random ROC (diagonal), and the ideal ROC curve.]
33. ROC Curve
For a given threshold on f(x), you get a point on the ROC curve.
0 ≤ AUC ≤ 1
[Figure: the same ROC plot (hit rate vs. false alarm rate), annotated with areas: ideal ROC curve (AUC = 1), random ROC (AUC = 0.5), and the actual ROC in between.]
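The threshold sweep described above can be sketched as follows (labels in {−1, +1} and made-up scores are assumed; the trapezoidal AUC computation is a standard choice, not from the slides). A scoring function that ranks all positives above all negatives traces the ideal ROC:

```python
import numpy as np

def roc_points(scores, y):
    """For each threshold theta on f(x), one (false alarm rate, hit rate)
    point on the ROC curve; y holds labels in {-1, +1}."""
    pos, neg = np.sum(y == 1), np.sum(y == -1)
    points = [(0.0, 0.0)]
    for theta in np.sort(scores)[::-1]:
        pred = scores >= theta                 # predict +1 above the threshold
        tp = np.sum(pred & (y == 1))
        fp = np.sum(pred & (y == -1))
        points.append((fp / neg, tp / pos))
    return np.array(points)

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    far, hit = points[:, 0], points[:, 1]
    return np.sum((hit[1:] + hit[:-1]) / 2 * np.diff(far))

# Made-up scores that rank all positives above all negatives.
y = np.array([-1, -1, 1, 1])
scores = np.array([0.1, 0.2, 0.8, 0.9])
print(auc(roc_points(scores, y)))              # ideal ranking: AUC = 1.0
```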
34. Lift Curve
Customers are ranked according to f(x); select the top-ranking customers.
Gini = 2 AUC − 1,  0 ≤ Gini ≤ 1
[Figure: lift plot; x-axis: fraction of customers selected, 0 to 100%; y-axis: hit rate among the selected good customers, 0 to 100%. Shows the ideal lift, the actual lift, and the random lift (diagonal); the Gini coefficient is the ratio O/M of the areas marked in the figure.]
35. Performance Assessment
Confusion matrix (a cost matrix of the same shape can weight each of the four outcomes):

                     Predictions F(x)
                     Class -1       Class +1       Total
Truth y   Class -1      tn             fp          neg = tn + fp
          Class +1      fn             tp          pos = fn + tp
Total                rej = tn + fn  sel = fp + tp  m = tn + fp + fn + tp

• False alarm rate = fp/neg = type I error rate = 1 − specificity
• Hit rate = tp/pos = 1 − type II error rate = sensitivity = recall = test power
• Precision = tp/sel
• Fraction selected = sel/m
• Fraction of class +1 = pos/m

Compare F(x) = sign(f(x)) to the target y, and report:
• Error rate = (fn + fp)/m
• {Hit rate, False alarm rate} or {Hit rate, Precision} or {Hit rate, Frac. selected}
• Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 − (sensitivity + specificity)/2
• F measure = 2·precision·recall/(precision + recall)
Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot:
• ROC curve: Hit rate vs. False alarm rate
• Lift curve: Hit rate vs. Fraction selected
• Precision/recall curve: Hit rate vs. Precision
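A minimal sketch computing the quantities above from predictions and targets (labels in {−1, +1}; the data and the helper name `assess` are my own illustrative choices):

```python
import numpy as np

def assess(F, y):
    """Compare predictions F(x) in {-1, +1} with targets y and return the
    metrics from the slide; names follow the confusion matrix."""
    tp = np.sum((F == 1) & (y == 1))
    fp = np.sum((F == 1) & (y == -1))
    tn = np.sum((F == -1) & (y == -1))
    fn = np.sum((F == -1) & (y == 1))
    m, pos, neg, sel = tp + fp + tn + fn, tp + fn, tn + fp, tp + fp
    hit_rate = tp / pos                     # sensitivity, recall
    precision = tp / sel
    return {
        "error rate": (fn + fp) / m,
        "BER": (fn / pos + fp / neg) / 2,
        "F measure": 2 * precision * hit_rate / (precision + hit_rate),
        "hit rate": hit_rate,
        "false alarm rate": fp / neg,       # 1 - specificity
    }

# Made-up predictions for illustration: one fn and one fp out of m = 6.
y = np.array([1, 1, 1, -1, -1, -1])
F = np.array([1, 1, -1, -1, -1, 1])
print(assess(F, y))
```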
36. What is a Risk Functional?
A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task.
Examples:
• Classification:
– Error rate: (1/m) Σi=1:m 1(F(xi) ≠ yi)
– 1 − AUC (Gini index = 2 AUC − 1)
• Regression:
– Mean square error: (1/m) Σi=1:m (f(xi) − yi)2
37. How to Train?
• Define a risk functional R[f(x,w)]
• Optimize it w.r.t. w (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.)
[Figure: risk landscape R[f(x,w)] over the parameter space w, with a minimum at w*.]
(… to be continued in the next lecture)
38. How to Train?
• Define a risk functional R[f(x,w)]
• Find a method to optimize it, typically
“gradient descent”
wj ← wj - η ∂R/∂wj
or any optimization method (mathematical
programming, simulated annealing,
genetic algorithms, etc.)
(… to be continued in the next lecture)
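The update rule wj ← wj − η ∂R/∂wj can be sketched for the mean-square-error risk of a linear model f(x, w) = w • x (made-up noiseless data; η = 0.1 and the iteration count are arbitrary choices):

```python
import numpy as np

# Gradient descent on the mean-square-error risk of a linear model.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                       # noiseless targets for a clean check

def risk(w):
    """R[f(., w)] = (1/m) sum_i (f(x_i, w) - y_i)^2."""
    return np.mean((X @ w - y) ** 2)

w = np.zeros(3)
eta = 0.1                            # learning rate
for _ in range(500):
    grad = (2 / len(X)) * X.T @ (X @ w - y)   # dR/dw
    w = w - eta * grad                        # w_j <- w_j - eta dR/dw_j

print(risk(w), w)                    # risk near 0, w near w_true
```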
39. Summary
• With linear threshold units (“neurons”) we can build:
– Linear discriminant (including Naïve Bayes)
– Kernel methods
– Neural networks
– Decision trees
• The architectural hyper-parameters may include:
– The choice of basis functions φ (features)
– The kernel
– The number of units
• Learning means fitting:
– Parameters (weights)
– Hyper-parameters
• Be aware of the fit vs. robustness tradeoff.
40. Want to Learn More?
• Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html
• The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, J. Friedman. Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/
• Linear Discriminants and Support Vector Machines, I. Guyon and D. Stork. In Smola et al., Eds., Advances in Large Margin Classifiers, pages 147–169, MIT Press, 2000. http://clopinet.com/isabelle/Papers/guyon_stork_nips98.ps.gz
• Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners with datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, teaching material. http://clopinet.com/fextract-book
Editor's notes
Problem: the background image shows in WebEx. I've reduced the image; hope it helps.
Do not mention KXEN.
Include recommendation systems here.
Best performance for each type of method, normalized by the average of these performances.
Which one is best: linear or non-linear? The decision comes when we see new data. Very often the simplest model is better. This principle is implemented in learning theory.
Explain that this is a global estimator
Proof of Gini = 2 AUC − 1:
Let L be the area under the lift curve, Hitrate = tp/pos, Farate = fp/neg.
Selected = sel/tot = (tp + fp)/tot = (pos/tot)·(tp/pos) + (neg/tot)·(fp/neg)
         = (pos/tot)·Hitrate + (neg/tot)·Farate
AUC = ∫ Hitrate d(Farate)
L = ∫ Hitrate d(Selected)
  = ∫ Hitrate d((pos/tot)·Hitrate + (neg/tot)·Farate)
  = (pos/tot) ∫ Hitrate d(Hitrate) + (neg/tot) ∫ Hitrate d(Farate)
  = (1/2)·(pos/tot) + (neg/tot)·AUC
Hence 2L − 1 = −(1 − pos/tot) + 2·(1 − pos/tot)·AUC = (1 − pos/tot)·(2·AUC − 1),
and Gini = (L − 1/2) / ((1 − pos/tot)/2) = (2L − 1)/(1 − pos/tot) = 2·AUC − 1.
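The derivation above can be checked numerically: computing both curves by a threshold sweep and the areas by trapezoidal sums, the identity Selected = (pos/tot)·Hitrate + (neg/tot)·Farate makes Gini = 2·AUC − 1 hold exactly (the scores and labels below are made up):

```python
import numpy as np

# Numerical check of the Gini = 2 AUC - 1 derivation on made-up scores.
rng = np.random.default_rng(0)
y = np.array([1] * 4 + [-1] * 6)
scores = rng.normal(size=10) + (y == 1)        # noisy ranking
pos, neg, tot = 4, 6, 10

hit, far, selected = [0.0], [0.0], [0.0]
for theta in np.sort(scores)[::-1]:            # sweep the threshold
    pred = scores >= theta
    hit.append(np.sum(pred & (y == 1)) / pos)
    far.append(np.sum(pred & (y == -1)) / neg)
    selected.append(np.sum(pred) / tot)

def trapz(ys, xs):
    """Trapezoidal-rule area of ys over xs."""
    ys, xs = np.array(ys), np.array(xs)
    return np.sum((ys[1:] + ys[:-1]) / 2 * np.diff(xs))

auc = trapz(hit, far)
L = trapz(hit, selected)
gini = (2 * L - 1) / (1 - pos / tot)
print(gini, 2 * auc - 1)                       # equal up to rounding
```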