2. Big Data
Widespread use of personal computers and wireless communication leads to “big data”
We are both producers and consumers of data
Data is not random; it has structure, e.g., customer behavior
We need “big theory” to extract that structure from data for
(a) Understanding the process
(b) Making predictions for the future
3. Why “Learn”?
Machine learning is programming computers to optimize a performance criterion using example data or past experience.
There is no need to “learn” to calculate payroll
Learning is used when:
Human expertise does not exist (navigating on Mars)
Humans are unable to explain their expertise (speech recognition)
The solution changes in time (routing on a computer network)
The solution needs to be adapted to particular cases (user biometrics)
4. What We Talk About When We Talk About “Learning”
Learning general models from data of particular examples
Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce.
Example in retail: from customer transactions to consumer behavior:
People who bought “Blink” also bought “Outliers” (www.amazon.com)
Build a model that is a good and useful approximation to the data.
6. What is Machine Learning?
Optimize a performance criterion using example data or past experience.
Role of Statistics: Inference from a sample
Role of Computer Science: Efficient algorithms to
Solve the optimization problem
Represent and evaluate the model for inference
14. Supervised Learning: Uses
Prediction of future cases: Use the rule to predict the output for future inputs
Knowledge extraction: The rule is easy to understand
Compression: The rule is simpler than the data it explains
Outlier detection: Exceptions that are not covered by the rule, e.g., fraud
16. Reinforcement Learning
Learning a policy: A sequence of outputs
No supervised output, but delayed reward
Credit assignment problem
Game playing
Robot in a maze
Multiple agents, partial observability, ...
18. Learning a Class from Examples
Class C of a “family car”
Prediction: Is car x a family car?
Knowledge extraction: What do people expect from a family car?
Output:
Positive (+) and negative (–) examples
Input representation:
x1: price, x2: engine power
19. Training set X
X = {x^t, r^t}, t = 1, ..., N
r = 1 if x is positive, 0 if x is negative
(Figure: positive and negative examples plotted in the (x1, x2) plane.)
21. Hypothesis class H
h(x) = 1 if h says x is positive, 0 if h says x is negative
Error of h on X:
E(h|X) = ∑_{t=1}^{N} 1(h(x^t) ≠ r^t)
22. S, G, and the Version Space
Most specific hypothesis, S
Most general hypothesis, G
Any h ∈ H between S and G is consistent, and together they make up the version space
(Mitchell, 1997)
24. VC Dimension
N points can be labeled in 2^N ways as +/–
H shatters N points if, for any of these labelings, there exists an h ∈ H consistent with it: VC(H) = N
An axis-aligned rectangle shatters 4 points only!
25. Probably Approximately Correct (PAC) Learning
How many training examples N should we have, such that with probability at least 1 – δ, h has error at most ε? (Blumer et al., 1989)
Each strip is at most ε/4
Pr that we miss a strip: 1 – ε/4
Pr that N instances miss a strip: (1 – ε/4)^N
Pr that N instances miss 4 strips: 4(1 – ε/4)^N
4(1 – ε/4)^N ≤ δ and (1 – x) ≤ exp(–x)
4 exp(–εN/4) ≤ δ and N ≥ (4/ε) log(4/δ)
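The final bound above can be evaluated directly; a minimal sketch (the function name is illustrative, not from the slides):

```python
import math

def pac_sample_bound(eps, delta):
    """Sample size N guaranteeing, with probability at least 1 - delta,
    error at most eps for the rectangle class: N >= (4/eps) * log(4/delta)."""
    return math.ceil((4 / eps) * math.log(4 / delta))

# e.g., 95% confidence of at most 5% error:
N = pac_sample_bound(eps=0.05, delta=0.05)
```

Note how the bound grows only logarithmically in 1/δ but linearly in 1/ε.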
26. Noise and Model Complexity
Use the simpler one because it is
Simpler to use (lower computational complexity)
Easier to train (lower space complexity)
Easier to explain (more interpretable)
Generalizes better (lower variance – Occam’s razor)
27. Multiple Classes, Ci, i = 1, ..., K
X = {x^t, r^t}, t = 1, ..., N
r_i^t = 1 if x^t ∈ C_i, 0 if x^t ∈ C_j, j ≠ i
Train hypotheses h_i(x), i = 1, ..., K:
h_i(x^t) = 1 if x^t ∈ C_i, 0 if x^t ∈ C_j, j ≠ i
28. Regression
X = {x^t, r^t}, t = 1, ..., N, with r^t ∈ ℝ
r^t = f(x^t) + ε
Linear model: g(x) = w_1 x + w_0
Quadratic model: g(x) = w_2 x² + w_1 x + w_0
E(g|X) = (1/N) ∑_{t=1}^{N} [r^t – g(x^t)]²
E(w_1, w_0|X) = (1/N) ∑_{t=1}^{N} [r^t – (w_1 x^t + w_0)]²
29. Model Selection & Generalization
Learning is an ill-posed problem; the data alone is not sufficient to find a unique solution
The need for inductive bias: assumptions about H
Generalization: How well a model performs on new data
Overfitting: H more complex than C or f
Underfitting: H less complex than C or f
30. Triple Trade-Off
There is a trade-off between three factors (Dietterich, 2003):
1. Complexity of H, c(H)
2. Training set size, N
3. Generalization error, E, on new data
As N increases, E decreases
As c(H) increases, E first decreases and then increases
31. Cross-Validation
To estimate generalization error, we need data unseen during training. We split the data as
Training set (50%)
Validation set (25%)
Test (publication) set (25%)
Resampling when there is little data
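The 50/25/25 split above can be sketched as follows (function name and seed are illustrative):

```python
import random

def split_data(data, seed=0):
    """Shuffle, then split into 50% training, 25% validation, 25% test."""
    data = list(data)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    n = len(data)
    n_train, n_val = n // 2, n // 4
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    return train, val, test

train, val, test = split_data(range(100))
```

The validation set picks the model; the test set is touched only once, for the final (publication) estimate.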
32. Dimensions of a Supervised Learner
1. Model: g(x|θ)
2. Loss function: E(θ|X) = ∑_t L(r^t, g(x^t|θ))
3. Optimization procedure: θ* = arg min_θ E(θ|X)
34. Probability and Inference
Result of tossing a coin is ∈ {Heads, Tails}
Random variable X ∈ {1, 0}
Bernoulli: P(X = x) = p_o^x (1 – p_o)^(1–x)
Sample: X = {x^t}, t = 1, ..., N
Estimation: p̂_o = #{Heads}/#{Tosses} = ∑_t x^t / N
Prediction of next toss: Heads if p̂_o > ½, Tails otherwise
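The estimate-then-predict procedure above fits in a few lines (names are illustrative):

```python
def estimate_p(tosses):
    """MLE of P(Heads): the fraction of 1s in the sample."""
    return sum(tosses) / len(tosses)

def predict_next(tosses):
    """Predict Heads if the estimated p exceeds 1/2, Tails otherwise."""
    return "Heads" if estimate_p(tosses) > 0.5 else "Tails"

sample = [1, 1, 0, 1, 0, 1, 1, 0]  # 1 = Heads, 0 = Tails
```

With this sample, p̂_o = 5/8, so the next toss is predicted to be Heads.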
35. Classification
Credit scoring: Inputs are income and savings. Output is low-risk vs. high-risk
Input: x = [x1, x2]^T, Output: C ∈ {0, 1}
Prediction:
choose C = 1 if P(C = 1|x1, x2) > 0.5, C = 0 otherwise
or
choose C = 1 if P(C = 1|x1, x2) > P(C = 0|x1, x2), C = 0 otherwise
37. Bayes’ Rule: K > 2 Classes
P(C_i|x) = p(x|C_i) P(C_i) / p(x) = p(x|C_i) P(C_i) / ∑_{k=1}^{K} p(x|C_k) P(C_k)
P(C_i) ≥ 0 and ∑_{i=1}^{K} P(C_i) = 1
choose C_i if P(C_i|x) = max_k P(C_k|x)
38. Losses and Risks
Actions: α_i
Loss of α_i when the state is C_k: λ_ik
Expected risk (Duda and Hart, 1973):
R(α_i|x) = ∑_{k=1}^{K} λ_ik P(C_k|x)
choose α_i if R(α_i|x) = min_k R(α_k|x)
39. Losses and Risks: 0/1 Loss
λ_ik = 0 if i = k, 1 if i ≠ k
R(α_i|x) = ∑_{k≠i} P(C_k|x) = 1 – P(C_i|x)
For minimum risk, choose the most probable class
40. Losses and Risks: Reject
λ_ik = 0 if i = k, λ if i = K + 1, 1 otherwise, with 0 < λ < 1
R(α_{K+1}|x) = ∑_{k=1}^{K} λ P(C_k|x) = λ
R(α_i|x) = ∑_{k≠i} P(C_k|x) = 1 – P(C_i|x)
choose C_i if P(C_i|x) > P(C_k|x) for all k ≠ i and P(C_i|x) > 1 – λ
reject otherwise
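The reject rule above reduces to a comparison against 1 – λ; a minimal sketch (function name is illustrative):

```python
def decide(posteriors, lam):
    """Choose class i if P(Ci|x) is the maximum posterior and exceeds
    1 - lam; otherwise reject (lam is the cost of rejecting, 0 < lam < 1)."""
    i = max(range(len(posteriors)), key=lambda k: posteriors[k])
    return i if posteriors[i] > 1 - lam else "reject"

decide([0.6, 0.3, 0.1], lam=0.5)  # 0.6 > 0.5, so choose class 0
decide([0.6, 0.3, 0.1], lam=0.3)  # 0.6 <= 0.7, so reject
```

A cheap reject (small λ) raises the threshold, so the classifier answers only when it is confident.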
44. Utility Theory
Probability of state k given evidence x: P(S_k|x)
Utility of α_i when the state is k: U_ik
Expected utility: EU(α_i|x) = ∑_k U_ik P(S_k|x)
Choose α_i if EU(α_i|x) = max_j EU(α_j|x)
45. Association Rules
Association rule: X → Y
People who buy/click/visit/enjoy X are also likely to buy/click/visit/enjoy Y.
A rule implies association, not necessarily causation.
48. Apriori Algorithm (Agrawal et al., 1996)
For (X, Y, Z), a 3-item set, to be frequent (have enough support), (X, Y), (X, Z), and (Y, Z) should be frequent.
If (X, Y) is not frequent, none of its supersets can be frequent.
Once we find the frequent k-item sets, we convert them to rules: X, Y → Z, ... and X → Y, Z, ...
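The level-wise pruning described above can be sketched as follows (a simplified toy implementation, not Agrawal et al.'s original code):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Frequent item sets by level-wise search: a (k+1)-set is a candidate
    only if all of its k-subsets are frequent (the Apriori property)."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(s):
        return sum(1 for t in transactions if s <= t)

    # level 1: frequent single items
    frequent = [{frozenset([i]) for i in items
                 if support(frozenset([i])) >= min_support}]
    while frequent[-1]:
        prev = frequent[-1]
        size = len(next(iter(prev))) + 1
        candidates = {a | b for a in prev for b in prev if len(a | b) == size}
        # prune candidates with an infrequent subset, then check support
        nxt = {c for c in candidates
               if all(frozenset(s) in prev
                      for s in combinations(c, size - 1))
               and support(c) >= min_support}
        frequent.append(nxt)
    return [s for level in frequent for s in level]
```

With transactions {X,Y,Z}, {X,Y}, {X,Z}, {Y,Z} and minimum support 2, all pairs are frequent but the triple (X, Y, Z) is not.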
50. Parametric Estimation
X = {x^t}_t where x^t ~ p(x)
Parametric estimation:
Assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X
e.g., N(μ, σ²) where θ = {μ, σ²}
51. Maximum Likelihood Estimation
Likelihood of θ given the sample X:
l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)
Log likelihood:
L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)
Maximum likelihood estimator (MLE):
θ* = arg max_θ L(θ|X)
52. Examples: Bernoulli/Multinomial
Bernoulli: Two states, failure/success, x ∈ {0, 1}
P(x) = p_o^x (1 – p_o)^(1–x)
L(p_o|X) = log ∏_t p_o^{x^t} (1 – p_o)^{1–x^t}
MLE: p̂_o = ∑_t x^t / N
Multinomial: K > 2 states, x_i ∈ {0, 1}
P(x_1, x_2, ..., x_K) = ∏_i p_i^{x_i}
L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^{x_i^t}
MLE: p̂_i = ∑_t x_i^t / N
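The multinomial MLE above is just per-state averaging over one-hot indicator vectors (function name is illustrative):

```python
def multinomial_mle(X):
    """MLE for multinomial parameters: p_i = sum_t x_i^t / N,
    where each x^t is a one-hot indicator vector of the observed state."""
    N = len(X)
    K = len(X[0])
    return [sum(x[i] for x in X) / N for i in range(K)]

# four draws over K = 3 states
X = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Here the estimates are simply the relative frequencies of each state, [0.5, 0.25, 0.25].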
54. Bias and Variance
Unknown parameter θ
Estimator d_i = d(X_i) on sample X_i
Bias: b_θ(d) = E[d] – θ
Variance: E[(d – E[d])²]
Mean square error:
r(d, θ) = E[(d – θ)²]
= (E[d] – θ)² + E[(d – E[d])²]
= Bias² + Variance
55. Bayes’ Estimator
Treat θ as a random variable with prior p(θ)
Bayes’ rule: p(θ|X) = p(X|θ) p(θ) / p(X)
Full: p(x|X) = ∫ p(x|θ) p(θ|X) dθ
Maximum a posteriori (MAP): θ_MAP = arg max_θ p(θ|X)
Maximum likelihood (ML): θ_ML = arg max_θ p(X|θ)
Bayes’: θ_Bayes’ = E[θ|X] = ∫ θ p(θ|X) dθ
56. Bayes’ Estimator: Example
x^t ~ N(θ, σ_o²) and θ ~ N(μ, σ²)
θ_ML = m (the sample mean)
θ_MAP = θ_Bayes’ = E[θ|X]
= [ (N/σ_o²) / (N/σ_o² + 1/σ²) ] m + [ (1/σ²) / (N/σ_o² + 1/σ²) ] μ
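The posterior mean above is a precision-weighted mix of the sample mean and the prior mean; a minimal sketch (function name is illustrative):

```python
def posterior_mean(m, N, var0, mu, var):
    """E[theta|X] for x^t ~ N(theta, var0) with prior theta ~ N(mu, var):
    weights are the precisions N/var0 (data) and 1/var (prior)."""
    w = (N / var0) / (N / var0 + 1 / var)
    return w * m + (1 - w) * mu
```

With no data (N = 0) the estimate is the prior mean μ; as N grows, the weight shifts toward the sample mean m.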
57. Parametric Classification
g_i(x) = p(x|C_i) P(C_i)
or
g_i(x) = log p(x|C_i) + log P(C_i)
p(x|C_i) = [1/(√(2π) σ_i)] exp[–(x – μ_i)² / (2σ_i²)]
g_i(x) = –(1/2) log 2π – log σ_i – (x – μ_i)² / (2σ_i²) + log P(C_i)
58. Given the sample
X = {x^t, r^t}, t = 1, ..., N
r_i^t = 1 if x^t ∈ C_i, 0 if x^t ∈ C_j, j ≠ i
ML estimates are:
P̂(C_i) = ∑_t r_i^t / N
m_i = ∑_t x^t r_i^t / ∑_t r_i^t
s_i² = ∑_t (x^t – m_i)² r_i^t / ∑_t r_i^t
Discriminant:
g_i(x) = –(1/2) log 2π – log s_i – (x – m_i)² / (2s_i²) + log P̂(C_i)
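The ML estimates and discriminant above can be sketched for one-dimensional inputs (function names are illustrative):

```python
import math

def fit_gaussian_classifier(X, r):
    """ML estimates per class: prior, mean, variance (1-D inputs).
    X: list of scalars; r: list of class labels 0..K-1."""
    K = max(r) + 1
    params = []
    for i in range(K):
        xi = [x for x, y in zip(X, r) if y == i]
        prior = len(xi) / len(X)
        m = sum(xi) / len(xi)
        s2 = sum((x - m) ** 2 for x in xi) / len(xi)
        params.append((prior, m, s2))
    return params

def discriminant(x, prior, m, s2):
    """g_i(x) = -1/2 log 2pi - log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i)."""
    return (-0.5 * math.log(2 * math.pi) - 0.5 * math.log(s2)
            - (x - m) ** 2 / (2 * s2) + math.log(prior))

def classify(x, params):
    """Choose the class with the largest discriminant value."""
    return max(range(len(params)), key=lambda i: discriminant(x, *params[i]))
```

Note –log s_i appears as –0.5·log s_i² since only the variance is stored.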
62. Regression
r = f(x) + ε, ε ~ N(0, σ²)
estimator: g(x|θ)
p(r|x) ~ N(g(x|θ), σ²)
L(θ|X) = log ∏_{t=1}^{N} p(x^t, r^t)
= log ∏_{t=1}^{N} p(r^t|x^t) + log ∏_{t=1}^{N} p(x^t)
63. Regression: From LogL to Error
L(θ|X) = log ∏_{t=1}^{N} [1/(√(2π) σ)] exp[–(r^t – g(x^t|θ))² / (2σ²)]
= –N log(√(2π) σ) – (1/(2σ²)) ∑_{t=1}^{N} [r^t – g(x^t|θ)]²
Maximizing L is therefore equivalent to minimizing
E(θ|X) = (1/2) ∑_{t=1}^{N} [r^t – g(x^t|θ)]²
64. Linear Regression
g(x^t|w_1, w_0) = w_1 x^t + w_0
Setting the derivatives of E to zero gives the normal equations:
∑_t r^t = N w_0 + w_1 ∑_t x^t
∑_t r^t x^t = w_0 ∑_t x^t + w_1 ∑_t (x^t)²
In matrix form, A w = y with
A = [ N, ∑_t x^t ; ∑_t x^t, ∑_t (x^t)² ], w = [w_0, w_1]^T, y = [∑_t r^t, ∑_t r^t x^t]^T
w = A^(–1) y
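The 2×2 system above can be solved in closed form (function name is illustrative):

```python
def fit_line(x, r):
    """Solve the 2x2 normal equations A w = y for w = (w0, w1)
    using the closed-form inverse of a 2x2 matrix."""
    N = len(x)
    sx = sum(x)
    sxx = sum(v * v for v in x)
    sr = sum(r)
    srx = sum(v * u for v, u in zip(x, r))
    det = N * sxx - sx * sx  # determinant of A
    w0 = (sxx * sr - sx * srx) / det
    w1 = (N * srx - sx * sr) / det
    return w0, w1
```

For data generated by r = 2x + 1 the fit recovers w0 = 1, w1 = 2 exactly.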
65. Polynomial Regression
g(x^t|w_k, ..., w_2, w_1, w_0) = w_k (x^t)^k + ... + w_2 (x^t)² + w_1 x^t + w_0
With the design matrix
D = [ 1, x^1, (x^1)², ..., (x^1)^k ;
      1, x^2, (x^2)², ..., (x^2)^k ;
      ... ;
      1, x^N, (x^N)², ..., (x^N)^k ]
and r = [r^1, r^2, ..., r^N]^T,
w = (D^T D)^(–1) D^T r
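The least-squares solution w = (D^T D)^(–1) D^T r can be sketched without any linear-algebra library by forming the normal equations and solving them with Gaussian elimination (function name is illustrative):

```python
def fit_poly(x, r, k):
    """Least squares w = (D^T D)^{-1} D^T r, where row t of the design
    matrix D is [1, x^t, (x^t)^2, ..., (x^t)^k]."""
    n = k + 1
    D = [[xt ** j for j in range(n)] for xt in x]
    # normal equations: (D^T D) w = D^T r
    A = [[sum(D[t][i] * D[t][j] for t in range(len(x))) for j in range(n)]
         for i in range(n)]
    b = [sum(D[t][i] * r[t] for t in range(len(x))) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        p = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[p] = A[p], A[col]
        b[col], b[p] = b[p], b[col]
        for row in range(col + 1, n):
            f = A[row][col] / A[col][col]
            for j in range(col, n):
                A[row][j] -= f * A[col][j]
            b[row] -= f * b[col]
    # back substitution
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w  # [w0, w1, ..., wk]
```

Fitting k = 2 to data generated by r = x² recovers w ≈ [0, 0, 1].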
66. Other Error Measures
Square Error: E(θ|X) = (1/N) ∑_{t=1}^{N} [r^t – g(x^t|θ)]²
Relative Square Error: E(θ|X) = ∑_{t=1}^{N} [r^t – g(x^t|θ)]² / ∑_{t=1}^{N} [r^t – r̄]²
Absolute Error: E(θ|X) = ∑_t |r^t – g(x^t|θ)|
ε-sensitive Error: E(θ|X) = ∑_t 1(|r^t – g(x^t|θ)| > ε) (|r^t – g(x^t|θ)| – ε)
67. Bias and Variance
E[(r – g(x))² | x] = E[(r – E[r|x])² | x] + (E[r|x] – g(x))²
(noise) + (squared error)
E_X[(E[r|x] – g(x))² | x] = (E[r|x] – E_X[g(x)])² + E_X[(g(x) – E_X[g(x)])²]
(bias²) + (variance)
68. Estimating Bias and Variance
M samples X_i = {x^t_i, r^t_i}, i = 1, ..., M,
are used to fit g_i(x), i = 1, ..., M
ḡ(x) = (1/M) ∑_i g_i(x)
Bias²(g) = (1/N) ∑_t [ḡ(x^t) – f(x^t)]²
Variance(g) = (1/(NM)) ∑_t ∑_i [g_i(x^t) – ḡ(x^t)]²
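The two estimates above translate directly into code (function name is illustrative):

```python
def bias2_variance(g_list, f, xs):
    """Estimate Bias^2 and Variance of estimators fit on M samples.
    g_list: the M fitted functions g_i; f: the true function; xs: the
    N evaluation points x^t."""
    M, N = len(g_list), len(xs)
    gbar = [sum(g(x) for g in g_list) / M for x in xs]  # average fit
    bias2 = sum((gbar[t] - f(xs[t])) ** 2 for t in range(N)) / N
    var = sum((g(xs[t]) - gbar[t]) ** 2
              for t in range(N) for g in g_list) / (N * M)
    return bias2, var
```

For the constant estimator g_i(x) = 2 from the next slide, the variance is exactly zero and all the error is bias.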
69. Bias/Variance Dilemma
Example: g_i(x) = 2 has no variance and high bias
g_i(x) = ∑_t r^t_i / N has lower bias with variance
As we increase complexity, bias decreases (a better fit to data) and variance increases (fit varies more with data)
Bias/Variance dilemma (Geman et al., 1992)
73. Model Selection
Cross-validation: Measure generalization accuracy by testing on data unused during training
Regularization: Penalize complex models: E' = error on data + λ · model complexity
Akaike’s information criterion (AIC), Bayesian information criterion (BIC)
Minimum description length (MDL): Kolmogorov complexity, shortest description of data
74. Bayesian Model Selection
Prior on models, p(model)
p(model|data) = p(data|model) p(model) / p(data)
Regularization, when the prior favors simpler models
Bayes: MAP of the posterior, p(model|data)
Average over a number of models with high posterior (voting, ensembles)
75. Regression example
Coefficients increase in magnitude as order increases:
1: [–0.0769, 0.0016]
2: [0.1682, –0.6657, 0.0080]
3: [0.4238, –2.5778, 3.4675, –0.0002]
4: [–0.1093, 1.4356, –5.5007, 6.0454, –0.0019]
Regularization (L2): E(w|X) = (1/2) ∑_{t=1}^{N} [r^t – g(x^t|w)]² + (λ/2) ∑_i w_i²
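The L2-regularized (augmented) error above can be sketched as follows (function names are illustrative):

```python
def ridge_error(w, X, r, lam, g):
    """Augmented error E(w|X) = 1/2 sum_t [r^t - g(x^t|w)]^2
    + (lam/2) sum_i w_i^2, penalizing large coefficients."""
    data_error = 0.5 * sum((rt - g(xt, w)) ** 2 for xt, rt in zip(X, r))
    penalty = 0.5 * lam * sum(wi ** 2 for wi in w)
    return data_error + penalty

def linear(x, w):
    """g(x|w) = w1*x + w0 with w = [w0, w1]."""
    return w[1] * x + w[0]
```

Larger λ pushes the minimizer toward small weights, which is exactly what keeps the high-order coefficients in the table above from blowing up.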