Introduction to Machine Learning
Isabelle Guyon
isabelle@clopinet.com
What is Machine Learning?
Diagram: a learning algorithm takes TRAINING DATA and produces a trained machine; given a query, the trained machine returns an answer.
What for?
• Classification
• Time series prediction
• Regression
• Clustering
Some Learning Machines
• Linear models
• Kernel methods
• Neural networks
• Decision trees
Applications
Figure: application domains plotted by number of inputs (10 to 10^5) versus number of training examples (10 to 10^5): Bioinformatics, Ecology, OCR, HWR, Market Analysis, Text Categorization, Machine Vision, System diagnosis.
Banking / Telecom / Retail
• Identify:
– Prospective customers
– Dissatisfied customers
– Good customers
– Bad payers
• Obtain:
– More effective advertising
– Less credit risk
– Less fraud
– Decreased churn rate
Biomedical / Biometrics
• Medicine:
– Screening
– Diagnosis and prognosis
– Drug discovery
• Security:
– Face recognition
– Signature / fingerprint / iris verification
– DNA fingerprinting
Computer / Internet
• Computer interfaces:
– Troubleshooting wizards
– Handwriting and speech
– Brain waves
• Internet
– Hit ranking
– Spam filtering
– Text categorization
– Text translation
– Recommendation
Challenges
Figure: the NIPS 2003 & WCCI 2006 challenge datasets plotted by number of inputs (10 to 10^5) versus number of training examples (10 to 10^5): Arcene, Dorothea, Hiva, Sylva, Gisette, Gina, Ada, Dexter, Nova, Madelon.
Ten Classification Tasks
Figure: for each of the ten tasks (ADA, GINA, HIVA, NOVA, SYLVA, ARCENE, DEXTER, DOROTHEA, GISETTE, MADELON), a histogram of the test BER (%) achieved by challenge entries.
Challenge Winning Methods
Figure: bar chart of BER/<BER> (best BER for each method family, normalized by the average over families; scale 0 to 1.8) for Linear/Kernel, Neural Nets, Trees/RF, and Naïve Bayes on each dataset: Gisette (HWR), Gina (HWR), Dexter (Text), Nova (Text), Madelon (Artificial), Arcene (Spectral), Dorothea (Pharma), Hiva (Pharma), Ada (Marketing), Sylva (Ecology).
Conventions
• X = {xij}: data matrix, m lines × n columns
• xi: the i-th pattern (a line of X)
• y = {yj}: vector of target values, one per pattern
• w: weight vector (primal parameters)
• α: dual parameters (one per training example)
Learning problem
Figure: a data matrix (colon cancer data, Alon et al 1999).
Unsupervised learning: is there structure in the data?
Supervised learning: predict an outcome y.
Data matrix X:
• m lines = patterns (data points, examples): samples, patients, documents, images, …
• n columns = features (attributes, input variables): genes, proteins, words, pixels, …
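To make these conventions concrete, here is a minimal sketch in Python/NumPy (the array values are made up, purely for illustration):

```python
import numpy as np

# Data matrix X: m patterns (lines) x n features (columns).
# Illustration: m = 4 hypothetical patients, n = 3 hypothetical gene features.
X = np.array([[ 0.2,  1.5, -0.3],
              [ 1.1,  0.4,  0.9],
              [-0.7,  2.0,  0.1],
              [ 0.5, -1.2,  1.8]])

# Target vector y: one outcome per pattern (+1/-1 for binary classification).
y = np.array([+1, -1, +1, -1])

m, n = X.shape  # m = 4 patterns, n = 3 features
```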
Linear Models
• f(x) = w • x + b = Σj=1:n wj xj + b
Linearity in the parameters, NOT in the input components.
• f(x) = w • Φ(x) + b = Σj wj φj(x) + b (Perceptron)
• f(x) = Σi=1:m αi k(xi,x) + b (Kernel method)
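The three functional forms above translate into a few lines of Python/NumPy (a minimal sketch; the function names are illustrative, not from the slides):

```python
import numpy as np

def f_linear(x, w, b):
    # f(x) = w . x + b: linear in the parameters and in the inputs.
    return np.dot(w, x) + b

def f_perceptron(x, w, b, phi):
    # f(x) = w . Phi(x) + b: still linear in w, but non-linear in x
    # through the basis functions phi.
    return np.dot(w, phi(x)) + b

def f_kernel(x, X_train, alpha, b, k):
    # f(x) = sum_i alpha_i k(x_i, x) + b: a kernel expansion over the
    # m training examples.
    return sum(a * k(xi, x) for a, xi in zip(alpha, X_train)) + b
```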
Artificial Neurons
McCulloch and Pitts, 1943
f(x) = w • x + b
Diagram: inputs x1 … xn (plus a constant input 1) are multiplied by weights w1 … wn (and bias b) and summed (Σ) to give f(x). Biological analogy: activation of other neurons arrives through synapses and dendrites, the cell potential is summed, and an activation function drives the axon.
Linear Decision Boundary
Figure: a separating hyperplane in input space, shown in 2D (x1, x2) and in 3D (x1, x2, x3).
Perceptron
Rosenblatt, 1957
f(x) = w • Φ(x) + b
Diagram: inputs x1 … xn are mapped through basis functions φ1(x) … φN(x) (plus a constant 1), weighted by w1 … wN (and bias b), and summed (Σ) to give f(x).
NL Decision Boundary
Figure: a non-linear decision boundary in input space, shown in 2D (x1, x2) and in 3D (x1, x2, x3); the 3D axes are the genes Hs.128749, Hs.234680, and Hs.7780.
Kernel Method
Potential functions, Aizerman et al 1964
f(x) = Σi αi k(xi,x) + b
Diagram: the input x = (x1, …, xn) is compared to each training example through kernel units k(x1,x) … k(xm,x) (plus a constant 1), which are weighted by α1 … αm (and bias b) and summed (Σ).
k(. , .) is a similarity measure or “kernel”.
Hebb’s Rule
wj ← wj + yi xij
Diagram: the activation of another neuron reaches input xj through a dendrite; the synapse weight wj is reinforced in proportion to the product of the input xj and the output y (Σ), sent along the axon.
Link to “Naïve Bayes”
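As a minimal sketch (assuming +1/-1 labels and the X, y conventions above), Hebb's rule amounts to one reinforcing pass over the training set:

```python
import numpy as np

def hebb_train(X, y):
    # Hebb's rule: w_j <- w_j + y_i * x_ij for each training example i.
    m, n = X.shape
    w = np.zeros(n)
    for i in range(m):
        w += y[i] * X[i]   # reinforce the weights in the direction of class y_i
    return w               # equivalently: w = X.T @ y
```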
Kernel “Trick” (for Hebb’s rule)
• Hebb’s rule for the Perceptron:
w = Σi yi Φ(xi)
f(x) = w • Φ(x) = Σi yi Φ(xi) • Φ(x)
• Define a dot product:
k(xi,x) = Φ(xi) • Φ(x)
f(x) = Σi yi k(xi,x)
Kernel “Trick” (general)
• f(x) = Σi αi k(xi, x)
• k(xi, x) = Φ(xi) • Φ(x)
• f(x) = w • Φ(x)
• w = Σi αi Φ(xi)
Dual forms
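A quick numerical check of the kernel trick (a sketch I added, using the degree-2 polynomial kernel on R², one of the few kernels with a small explicit Φ):

```python
import numpy as np

def k(s, t):
    # Degree-2 polynomial kernel: k(s, t) = (s . t)^2
    return np.dot(s, t) ** 2

def phi(x):
    # Explicit feature map on R^2 such that Phi(s) . Phi(t) = (s . t)^2
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

s, t = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(k(s, t), np.dot(phi(s), phi(t)))  # both equal (1*3 - 2*1)^2 = 1
```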
What is a Kernel?
A kernel is:
• a similarity measure
• a dot product in some feature space: k(s, t) = Φ(s) • Φ(t)
But we do not need to know the Φ representation.
Examples:
• k(s, t) = exp(-||s-t||²/σ²) Gaussian kernel
• k(s, t) = (s • t)^q Polynomial kernel
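Both example kernels are direct to code up (a minimal sketch; σ and q are left as keyword arguments):

```python
import numpy as np

def gaussian_kernel(s, t, sigma=1.0):
    # k(s, t) = exp(-||s - t||^2 / sigma^2)
    return np.exp(-np.sum((s - t) ** 2) / sigma ** 2)

def polynomial_kernel(s, t, q=2):
    # k(s, t) = (s . t)^q
    return np.dot(s, t) ** q
```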
Multi-Layer Perceptron
Back-propagation, Rumelhart et al, 1986
Diagram: inputs xj feed a layer of “hidden units” (Σ), whose outputs feed an output unit (Σ). The hidden units are internal “latent” variables.
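A forward pass through such a network is short to write down. This sketch assumes one hidden layer with a tanh activation (a common choice; the slide does not fix one):

```python
import numpy as np

def mlp_forward(x, W1, b1, w2, b2):
    # Hidden units h are the internal "latent" variables; the output
    # unit is a linear neuron on top of them.
    h = np.tanh(W1 @ x + b1)
    return np.dot(w2, h) + b2

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                 # n = 4 inputs
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)   # 3 hidden units
w2, b2 = rng.normal(size=3), 0.0
print(mlp_forward(x, W1, b1, w2, b2))
```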
Chessboard Problem
Tree Classifiers
CART (Breiman, 1984) or C4.5 (Quinlan, 1993)
At each step, choose the feature that “reduces entropy” most. Work towards “node purity” (see the sketch below).
Diagram: a two-feature dataset (f1, f2); starting from all the data, the tree splits first on f2 (“Choose f2”), then on f1 (“Choose f1”).
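The entropy-reduction criterion can be sketched as follows (my illustration of the general idea; CART actually uses the Gini impurity and C4.5 a gain ratio, but the structure is the same):

```python
import numpy as np

def entropy(y):
    # Shannon entropy of the class labels in a node.
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, mask):
    # Entropy reduction obtained by splitting a node with the boolean
    # feature test `mask` (True -> one child, False -> the other).
    p = mask.mean()
    return entropy(y) - p * entropy(y[mask]) - (1 - p) * entropy(y[~mask])
```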
Iris Data (Fisher, 1936)
Figure: four classifiers on the iris data (classes setosa, versicolor, virginica): linear discriminant, tree classifier, Gaussian mixture, kernel method (SVM). Figure from Norbert Jankowski and Krzysztof Grabczewski.
Fit / Robustness Tradeoff
Figure: two decision boundaries on the same data in (x1, x2): a complex boundary that fits the training points closely versus a simpler, more robust one.
Performance evaluation
Figure (three builds): two classifiers in (x1, x2), each shown with a level set of f(x). First the decision boundary f(x) = 0 with its regions f(x) > 0 and f(x) < 0, then the shifted level sets f(x) = -1 and f(x) = +1 with their corresponding regions.
ROC Curve
For a given threshold on f(x), you get a point on the ROC curve.
• y-axis: positive class success rate (hit rate, sensitivity), 0 to 100%
• x-axis: 1 - negative class success rate (false alarm rate, 1-specificity), 0 to 100%
The actual ROC curve lies between the random ROC (the diagonal, AUC = 0.5) and the ideal ROC curve (AUC = 1); in general 0 ≤ AUC ≤ 1 (see the sketch below).
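Here is a minimal sketch of how such a curve and its AUC can be computed from scores f(x) and +1/-1 labels y (an O(m·m) illustration, not an efficient implementation; ties between scores are ignored):

```python
import numpy as np

def roc_points(scores, y):
    # Sweep the threshold over the observed scores; each threshold gives
    # one (false alarm rate, hit rate) point on the ROC curve.
    pos, neg = np.sum(y == +1), np.sum(y == -1)
    points = []
    for theta in np.sort(scores)[::-1]:
        predicted_pos = scores >= theta
        hit_rate = np.sum(predicted_pos & (y == +1)) / pos
        fa_rate = np.sum(predicted_pos & (y == -1)) / neg
        points.append((fa_rate, hit_rate))
    return points

def auc(scores, y):
    # AUC = probability that a random positive example is scored
    # above a random negative one.
    pos_scores, neg_scores = scores[y == +1], scores[y == -1]
    return np.mean([p > n for p in pos_scores for n in neg_scores])
```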
Lift Curve
Customers are ranked according to f(x); the top-ranking customers are selected.
• x-axis: fraction of customers selected, 0 to 100%
• y-axis: hit rate (fraction of the good customers that are selected), 0 to 100%
The actual lift curve lies between the random lift (the diagonal) and the ideal lift.
Gini = O/M, where O is the area between the actual and the random lift and M the area between the ideal and the random lift; Gini = 2 AUC - 1, and 0 ≤ Gini ≤ 1.
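The Gini/AUC relation is a one-liner (using the auc sketch above):

```python
def gini_from_auc(auc_value):
    # Gini = 2 AUC - 1: a random ranking (AUC = 0.5) gives Gini = 0,
    # a perfect ranking (AUC = 1) gives Gini = 1.
    return 2 * auc_value - 1

assert gini_from_auc(0.5) == 0.0   # random lift
assert gini_from_auc(1.0) == 1.0   # ideal lift
```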
Cost matrix

                        Predictions: F(x)
                        Class -1       Class +1       Total
Truth: y   Class -1     tn             fp             neg = tn+fp
           Class +1     fn             tp             pos = fn+tp
           Total        rej = tn+fn    sel = fp+tp    m = tn+fp+fn+tp

• False alarm rate = fp/neg = type I error rate = 1 - specificity
• Hit rate = tp/pos = 1 - type II error rate = sensitivity = recall = test power
• Precision = tp/sel
• Fraction selected = sel/m; fraction of Class +1 = pos/m
Performance Assessment
Compare F(x) = sign(f(x)) to the target y, and report (see the sketch below):
• Error rate = (fn + fp)/m
• {Hit rate, False alarm rate} or {Hit rate, Precision} or {Hit rate, Frac. selected}
• Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 - (sensitivity + specificity)/2
• F measure = 2 precision·recall / (precision + recall)
Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot:
• ROC curve: Hit rate vs. False alarm rate
• Lift curve: Hit rate vs. Fraction selected
• Precision/recall curve: Hit rate vs. Precision
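All of the reported quantities follow from the four confusion-matrix counts; a minimal sketch (plain Python, no guards against empty classes):

```python
def classification_metrics(tn, fp, fn, tp):
    pos, neg = fn + tp, tn + fp            # class totals
    sel = fp + tp                          # examples predicted positive
    m = tn + fp + fn + tp                  # total number of examples
    hit_rate = tp / pos                    # sensitivity, recall
    precision = tp / sel
    return {
        "error_rate": (fn + fp) / m,
        "hit_rate": hit_rate,
        "false_alarm_rate": fp / neg,      # 1 - specificity
        "ber": (fn / pos + fp / neg) / 2,  # balanced error rate
        "precision": precision,
        "f_measure": 2 * precision * hit_rate / (precision + hit_rate),
        "frac_selected": sel / m,
    }
```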
What is a Risk Functional?
A function of the parameters of the learning machine that measures how badly the machine is expected to perform on a given task.
Examples:
• Classification:
– Error rate: (1/m) Σi=1:m 1(F(xi)≠yi)
– 1 - AUC (Gini Index = 2 AUC - 1)
• Regression:
– Mean square error: (1/m) Σi=1:m (f(xi)-yi)²
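Both example risk functionals are direct to code up (a sketch over a finite sample, following the formulas above):

```python
import numpy as np

def classification_risk(F, X, y):
    # Error rate: (1/m) sum_i 1(F(x_i) != y_i)
    return np.mean([F(x) != yi for x, yi in zip(X, y)])

def regression_risk(f, X, y):
    # Mean square error: (1/m) sum_i (f(x_i) - y_i)^2
    return np.mean([(f(x) - yi) ** 2 for x, yi in zip(X, y)])
```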
How to Train?
• Define a risk functional R[f(x,w)]
• Find a method to optimize it with respect to w, typically “gradient descent”
wj ← wj - η ∂R/∂wj
or any other optimization method (mathematical programming, simulated annealing, genetic algorithms, etc.)
Figure: the risk R[f(x,w)] plotted over parameter space (w), with its minimum at w* (see the sketch below).
(… to be continued in the next lecture)
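As an illustration of the gradient-descent recipe (my sketch: a linear model trained on the mean-square-error risk; η and the number of epochs are arbitrary choices):

```python
import numpy as np

def train_linear_mse(X, y, eta=0.01, epochs=100):
    # Gradient descent  w_j <- w_j - eta * dR/dw_j  on the risk
    # R[f(x,w)] = (1/m) sum_i (w . x_i + b - y_i)^2.
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        residual = X @ w + b - y            # f(x_i) - y_i for every i
        w -= eta * (2.0 / m) * (X.T @ residual)
        b -= eta * (2.0 / m) * residual.sum()
    return w, b
```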
Summary
• With linear threshold units (“neurons”) we can build:
– Linear discriminant (including Naïve Bayes)
– Kernel methods
– Neural networks
– Decision trees
• The architectural hyper-parameters may include:
– The choice of basis functions φ (features)
– The kernel
– The number of units
• Learning means fitting:
– Parameters (weights)
– Hyper-parameters
– Be aware of the fit vs. robustness tradeoff
Want to Learn More?
• Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html
• The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, J. Friedman. Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/
• Linear Discriminants and Support Vector Machines, I. Guyon and D. Stork. In Smola et al, Eds., Advances in Large Margin Classifiers, pages 147-169, MIT Press, 2000. http://clopinet.com/isabelle/Papers/guyon_stork_nips98.ps.gz
• Feature Extraction: Foundations and Applications, I. Guyon et al, Eds. Book for practitioners with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, teaching material. http://clopinet.com/fextract-book
Editor’s Notes
1. Problem: the background image shows in WebEx. I’ve reduced the image; hope it helps. Do not mention KXEN.
2. Include recommendation systems here.
3. Best performance for each type of method, normalized by the average of these performances.
4-7. Which one is best: linear or non-linear? The decision comes when we see new data. Very often the simplest model is better. This principle is implemented in learning theory.
8-10. Explain that this is a global estimator.
10. Proof that Gini = 2 AUC - 1 (writing L for the lift and tot for the total m):
Hit rate = tp/pos, False alarm rate = fp/neg
Selected = sel/tot = (tp+fp)/tot = (pos/tot)(tp/pos) + (neg/tot)(fp/neg) = (pos/tot) Hitrate + (neg/tot) Farate
AUC = ∫ Hitrate d(Farate)
L = ∫ Hitrate d(Selected) = (pos/tot) ∫ Hitrate d(Hitrate) + (neg/tot) ∫ Hitrate d(Farate) = (1/2)(pos/tot) + (neg/tot) AUC
2L - 1 = -(1 - pos/tot) + 2(1 - pos/tot) AUC = (1 - pos/tot)(2 AUC - 1)
Gini = (L - 1/2) / ((1 - pos/tot)/2) = (2L - 1)/(1 - pos/tot) = 2 AUC - 1