2. Big Data
Widespread use of personal computers and wireless communication leads to “big data”
We are both producers and consumers of data
Data is not random; it has structure, e.g., customer behavior
We need “big theory” to extract that structure from data for
(a) Understanding the process
(b) Making predictions for the future
3. Why “Learn”?
Machine learning is programming computers to optimize a performance criterion using example data or past experience.
There is no need to “learn” to calculate payroll
Learning is used when:
Human expertise does not exist (navigating on Mars)
Humans are unable to explain their expertise (speech recognition)
The solution changes in time (routing on a computer network)
The solution needs to be adapted to particular cases (user biometrics)
4. What We Talk About When We Talk About “Learning”
Learning general models from data of particular examples
Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce.
Example in retail: from customer transactions to consumer behavior:
People who bought “Blink” also bought “Outliers” (www.amazon.com)
Build a model that is a good and useful approximation to the data.
6. What is Machine Learning?
Optimize a performance criterion using example data or past experience.
Role of Statistics: Inference from a sample
Role of Computer Science: Efficient algorithms to
Solve the optimization problem
Represent and evaluate the model for inference
14. Supervised Learning: Uses
Prediction of future cases: Use the rule to predict the output for future inputs
Knowledge extraction: The rule is easy to understand
Compression: The rule is simpler than the data it explains
Outlier detection: Exceptions that are not covered by the rule, e.g., fraud
16. Reinforcement Learning
Learning a policy: A sequence of outputs
No supervised output, but delayed reward
Credit assignment problem
Game playing
Robot in a maze
Multiple agents, partial observability, ...
18. Learning a Class from Examples
Class C of a “family car”
Prediction: Is car x a family car?
Knowledge extraction: What do people expect from a family car?
Output:
Positive (+) and negative (–) examples
Input representation:
x1: price, x2: engine power
19. Training set X
X = {x^t, r^t}, t = 1, ..., N
r = 1 if x is positive, 0 if x is negative
(Figure: positive and negative examples plotted in the (x1, x2) plane.)
21. Hypothesis class H
h(x) = 1 if h says x is positive, 0 if h says x is negative
Error of h on X:
E(h|X) = ∑_{t=1}^{N} 1(h(x^t) ≠ r^t)
22. S, G, and the Version Space
Most specific hypothesis, S
Most general hypothesis, G
Any h ∈ H between S and G is consistent, and together they make up the version space
(Mitchell, 1997)
24. VC Dimension
N points can be labeled in 2^N ways as +/–
H shatters N points if, for any of these labelings, there exists an h ∈ H consistent with it: VC(H) = N
An axis-aligned rectangle shatters 4 points only!
25. Probably Approximately Correct (PAC) Learning
How many training examples N should we have, such that with probability at least 1 – δ, h has error at most ε? (Blumer et al., 1989)
Each strip is at most ε/4
Pr that we miss a strip: 1 – ε/4
Pr that N instances miss a strip: (1 – ε/4)^N
Pr that N instances miss 4 strips: 4(1 – ε/4)^N
4(1 – ε/4)^N ≤ δ and (1 – x) ≤ exp(–x)
4 exp(–εN/4) ≤ δ and N ≥ (4/ε) log(4/δ)
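The final bound above can be evaluated directly; a minimal sketch (the function name is illustrative, not from the slides):

```python
import math

def pac_sample_bound(eps, delta):
    """Sample size N guaranteeing, with probability at least 1 - delta,
    error at most eps for the rectangle class: N >= (4/eps) * log(4/delta)."""
    return math.ceil((4 / eps) * math.log(4 / delta))

# e.g., 95% confidence of at most 5% error:
N = pac_sample_bound(eps=0.05, delta=0.05)
```

Note how the bound grows only logarithmically in 1/δ but linearly in 1/ε.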
26. Noise and Model Complexity
Use the simpler one because it is
Simpler to use (lower computational complexity)
Easier to train (lower space complexity)
Easier to explain (more interpretable)
Generalizes better (lower variance – Occam’s razor)
27. Multiple Classes, Ci, i = 1, ..., K
X = {x^t, r^t}, t = 1, ..., N
r_i^t = 1 if x^t ∈ C_i, 0 if x^t ∈ C_j, j ≠ i
Train hypotheses h_i(x), i = 1, ..., K:
h_i(x^t) = 1 if x^t ∈ C_i, 0 if x^t ∈ C_j, j ≠ i
28. Regression
X = {x^t, r^t}, t = 1, ..., N, with r^t ∈ ℝ
r^t = f(x^t) + ε
Linear model: g(x) = w_1 x + w_0
Quadratic model: g(x) = w_2 x² + w_1 x + w_0
E(g|X) = (1/N) ∑_{t=1}^{N} [r^t – g(x^t)]²
E(w_1, w_0|X) = (1/N) ∑_{t=1}^{N} [r^t – (w_1 x^t + w_0)]²
29. Model Selection & Generalization
Learning is an ill-posed problem; the data alone is not sufficient to find a unique solution
The need for inductive bias: assumptions about H
Generalization: How well a model performs on new data
Overfitting: H more complex than C or f
Underfitting: H less complex than C or f
30. Triple Trade-Off
There is a trade-off between three factors (Dietterich, 2003):
1. Complexity of H, c(H)
2. Training set size, N
3. Generalization error, E, on new data
As N increases, E decreases
As c(H) increases, E first decreases and then increases
31. Cross-Validation
To estimate generalization error, we need data unseen during training. We split the data as
Training set (50%)
Validation set (25%)
Test (publication) set (25%)
Resampling when there is little data
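The 50/25/25 split above can be sketched as follows (function name and seed are illustrative):

```python
import random

def split_data(data, seed=0):
    """Shuffle, then split into 50% training, 25% validation, 25% test."""
    data = list(data)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    n = len(data)
    n_train, n_val = n // 2, n // 4
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    return train, val, test

train, val, test = split_data(range(100))
```

The validation set picks the model; the test set is touched only once, for the final (publication) estimate.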
32. Dimensions of a Supervised Learner
1. Model: g(x|θ)
2. Loss function: E(θ|X) = ∑_t L(r^t, g(x^t|θ))
3. Optimization procedure: θ* = arg min_θ E(θ|X)
34. Probability and Inference
Result of tossing a coin is ∈ {Heads, Tails}
Random variable X ∈ {1, 0}
Bernoulli: P(X = x) = p_o^x (1 – p_o)^(1–x)
Sample: X = {x^t}, t = 1, ..., N
Estimation: p̂_o = #{Heads}/#{Tosses} = ∑_t x^t / N
Prediction of next toss: Heads if p̂_o > ½, Tails otherwise
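The estimate-then-predict procedure above fits in a few lines (names are illustrative):

```python
def estimate_p(tosses):
    """MLE of P(Heads): the fraction of 1s in the sample."""
    return sum(tosses) / len(tosses)

def predict_next(tosses):
    """Predict Heads if the estimated p exceeds 1/2, Tails otherwise."""
    return "Heads" if estimate_p(tosses) > 0.5 else "Tails"

sample = [1, 1, 0, 1, 0, 1, 1, 0]  # 1 = Heads, 0 = Tails
```

With this sample, p̂_o = 5/8, so the next toss is predicted to be Heads.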
35. Classification
Credit scoring: Inputs are income and savings. Output is low-risk vs. high-risk
Input: x = [x1, x2]^T, Output: C ∈ {0, 1}
Prediction:
choose C = 1 if P(C = 1|x1, x2) > 0.5, C = 0 otherwise
or
choose C = 1 if P(C = 1|x1, x2) > P(C = 0|x1, x2), C = 0 otherwise
37. Bayes’ Rule: K > 2 Classes
P(C_i|x) = p(x|C_i) P(C_i) / p(x) = p(x|C_i) P(C_i) / ∑_{k=1}^{K} p(x|C_k) P(C_k)
P(C_i) ≥ 0 and ∑_{i=1}^{K} P(C_i) = 1
choose C_i if P(C_i|x) = max_k P(C_k|x)
38. Losses and Risks
Actions: α_i
Loss of α_i when the state is C_k: λ_ik
Expected risk (Duda and Hart, 1973):
R(α_i|x) = ∑_{k=1}^{K} λ_ik P(C_k|x)
choose α_i if R(α_i|x) = min_k R(α_k|x)
39. Losses and Risks: 0/1 Loss
λ_ik = 0 if i = k, 1 if i ≠ k
R(α_i|x) = ∑_{k≠i} P(C_k|x) = 1 – P(C_i|x)
For minimum risk, choose the most probable class
40. Losses and Risks: Reject
λ_ik = 0 if i = k, λ if i = K + 1, 1 otherwise, with 0 < λ < 1
R(α_{K+1}|x) = ∑_{k=1}^{K} λ P(C_k|x) = λ
R(α_i|x) = ∑_{k≠i} P(C_k|x) = 1 – P(C_i|x)
choose C_i if P(C_i|x) > P(C_k|x) for all k ≠ i and P(C_i|x) > 1 – λ
reject otherwise
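The reject rule above reduces to a comparison against 1 – λ; a minimal sketch (function name is illustrative):

```python
def decide(posteriors, lam):
    """Choose class i if P(Ci|x) is the maximum posterior and exceeds
    1 - lam; otherwise reject (lam is the cost of rejecting, 0 < lam < 1)."""
    i = max(range(len(posteriors)), key=lambda k: posteriors[k])
    return i if posteriors[i] > 1 - lam else "reject"

decide([0.6, 0.3, 0.1], lam=0.5)  # 0.6 > 0.5, so choose class 0
decide([0.6, 0.3, 0.1], lam=0.3)  # 0.6 <= 0.7, so reject
```

A cheap reject (small λ) raises the threshold, so the classifier answers only when it is confident.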
44. Utility Theory
Probability of state k given evidence x: P(S_k|x)
Utility of α_i when the state is k: U_ik
Expected utility: EU(α_i|x) = ∑_k U_ik P(S_k|x)
Choose α_i if EU(α_i|x) = max_j EU(α_j|x)
45. Association Rules
Association rule: X → Y
People who buy/click/visit/enjoy X are also likely to buy/click/visit/enjoy Y.
A rule implies association, not necessarily causation.
48. Apriori Algorithm (Agrawal et al., 1996)
For (X, Y, Z), a 3-item set, to be frequent (have enough support), (X, Y), (X, Z), and (Y, Z) should be frequent.
If (X, Y) is not frequent, none of its supersets can be frequent.
Once we find the frequent k-item sets, we convert them to rules: X, Y → Z, ... and X → Y, Z, ...
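The level-wise pruning described above can be sketched as follows (a simplified toy implementation, not Agrawal et al.'s original code):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Frequent item sets by level-wise search: a (k+1)-set is a candidate
    only if all of its k-subsets are frequent (the Apriori property)."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(s):
        return sum(1 for t in transactions if s <= t)

    # level 1: frequent single items
    frequent = [{frozenset([i]) for i in items
                 if support(frozenset([i])) >= min_support}]
    while frequent[-1]:
        prev = frequent[-1]
        size = len(next(iter(prev))) + 1
        candidates = {a | b for a in prev for b in prev if len(a | b) == size}
        # prune candidates with an infrequent subset, then check support
        nxt = {c for c in candidates
               if all(frozenset(s) in prev
                      for s in combinations(c, size - 1))
               and support(c) >= min_support}
        frequent.append(nxt)
    return [s for level in frequent for s in level]
```

With transactions {X,Y,Z}, {X,Y}, {X,Z}, {Y,Z} and minimum support 2, all pairs are frequent but the triple (X, Y, Z) is not.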
50. Parametric Estimation
X = {x^t}_t where x^t ~ p(x)
Parametric estimation:
Assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X
e.g., N(μ, σ²) where θ = {μ, σ²}
51. Maximum Likelihood Estimation
Likelihood of θ given the sample X:
l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)
Log likelihood:
L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)
Maximum likelihood estimator (MLE):
θ* = arg max_θ L(θ|X)
52. Examples: Bernoulli/Multinomial
Bernoulli: Two states, failure/success, x ∈ {0, 1}
P(x) = p_o^x (1 – p_o)^(1–x)
L(p_o|X) = log ∏_t p_o^{x^t} (1 – p_o)^{1–x^t}
MLE: p̂_o = ∑_t x^t / N
Multinomial: K > 2 states, x_i ∈ {0, 1}
P(x_1, x_2, ..., x_K) = ∏_i p_i^{x_i}
L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^{x_i^t}
MLE: p̂_i = ∑_t x_i^t / N
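The multinomial MLE above is just per-state averaging over one-hot indicator vectors (function name is illustrative):

```python
def multinomial_mle(X):
    """MLE for multinomial parameters: p_i = sum_t x_i^t / N,
    where each x^t is a one-hot indicator vector of the observed state."""
    N = len(X)
    K = len(X[0])
    return [sum(x[i] for x in X) / N for i in range(K)]

# four draws over K = 3 states
X = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Here the estimates are simply the relative frequencies of each state, [0.5, 0.25, 0.25].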
54. Bias and Variance
Unknown parameter θ
Estimator d_i = d(X_i) on sample X_i
Bias: b_θ(d) = E[d] – θ
Variance: E[(d – E[d])²]
Mean square error:
r(d, θ) = E[(d – θ)²]
= (E[d] – θ)² + E[(d – E[d])²]
= Bias² + Variance
55. Bayes’ Estimator
Treat θ as a random variable with prior p(θ)
Bayes’ rule: p(θ|X) = p(X|θ) p(θ) / p(X)
Full: p(x|X) = ∫ p(x|θ) p(θ|X) dθ
Maximum a posteriori (MAP): θ_MAP = arg max_θ p(θ|X)
Maximum likelihood (ML): θ_ML = arg max_θ p(X|θ)
Bayes’: θ_Bayes’ = E[θ|X] = ∫ θ p(θ|X) dθ
56. Bayes’ Estimator: Example
x^t ~ N(θ, σ_o²) and θ ~ N(μ, σ²)
θ_ML = m (the sample mean)
θ_MAP = θ_Bayes’ = E[θ|X]
= [ (N/σ_o²) / (N/σ_o² + 1/σ²) ] m + [ (1/σ²) / (N/σ_o² + 1/σ²) ] μ
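The posterior mean above is a precision-weighted mix of the sample mean and the prior mean; a minimal sketch (function name is illustrative):

```python
def posterior_mean(m, N, var0, mu, var):
    """E[theta|X] for x^t ~ N(theta, var0) with prior theta ~ N(mu, var):
    weights are the precisions N/var0 (data) and 1/var (prior)."""
    w = (N / var0) / (N / var0 + 1 / var)
    return w * m + (1 - w) * mu
```

With no data (N = 0) the estimate is the prior mean μ; as N grows, the weight shifts toward the sample mean m.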
57. Parametric Classification
g_i(x) = p(x|C_i) P(C_i)
or
g_i(x) = log p(x|C_i) + log P(C_i)
p(x|C_i) = [1/(√(2π) σ_i)] exp[–(x – μ_i)² / (2σ_i²)]
g_i(x) = –(1/2) log 2π – log σ_i – (x – μ_i)² / (2σ_i²) + log P(C_i)
58. Given the sample
X = {x^t, r^t}, t = 1, ..., N
r_i^t = 1 if x^t ∈ C_i, 0 if x^t ∈ C_j, j ≠ i
ML estimates are:
P̂(C_i) = ∑_t r_i^t / N
m_i = ∑_t x^t r_i^t / ∑_t r_i^t
s_i² = ∑_t (x^t – m_i)² r_i^t / ∑_t r_i^t
Discriminant:
g_i(x) = –(1/2) log 2π – log s_i – (x – m_i)² / (2s_i²) + log P̂(C_i)
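The ML estimates and discriminant above can be sketched for one-dimensional inputs (function names are illustrative):

```python
import math

def fit_gaussian_classifier(X, r):
    """ML estimates per class: prior, mean, variance (1-D inputs).
    X: list of scalars; r: list of class labels 0..K-1."""
    K = max(r) + 1
    params = []
    for i in range(K):
        xi = [x for x, y in zip(X, r) if y == i]
        prior = len(xi) / len(X)
        m = sum(xi) / len(xi)
        s2 = sum((x - m) ** 2 for x in xi) / len(xi)
        params.append((prior, m, s2))
    return params

def discriminant(x, prior, m, s2):
    """g_i(x) = -1/2 log 2pi - log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i)."""
    return (-0.5 * math.log(2 * math.pi) - 0.5 * math.log(s2)
            - (x - m) ** 2 / (2 * s2) + math.log(prior))

def classify(x, params):
    """Choose the class with the largest discriminant value."""
    return max(range(len(params)), key=lambda i: discriminant(x, *params[i]))
```

Note –log s_i appears as –0.5·log s_i² since only the variance is stored.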
62. Regression
r = f(x) + ε, ε ~ N(0, σ²)
estimator: g(x|θ)
p(r|x) ~ N(g(x|θ), σ²)
L(θ|X) = log ∏_{t=1}^{N} p(x^t, r^t)
= log ∏_{t=1}^{N} p(r^t|x^t) + log ∏_{t=1}^{N} p(x^t)
63. Regression: From LogL to Error
L(θ|X) = log ∏_{t=1}^{N} [1/(√(2π) σ)] exp[–(r^t – g(x^t|θ))² / (2σ²)]
= –N log(√(2π) σ) – (1/(2σ²)) ∑_{t=1}^{N} [r^t – g(x^t|θ)]²
Maximizing L is therefore equivalent to minimizing
E(θ|X) = (1/2) ∑_{t=1}^{N} [r^t – g(x^t|θ)]²
64. Linear Regression
g(x^t|w_1, w_0) = w_1 x^t + w_0
Setting the derivatives of E to zero gives the normal equations:
∑_t r^t = N w_0 + w_1 ∑_t x^t
∑_t r^t x^t = w_0 ∑_t x^t + w_1 ∑_t (x^t)²
In matrix form, A w = y with
A = [ N, ∑_t x^t ; ∑_t x^t, ∑_t (x^t)² ], w = [w_0, w_1]^T, y = [∑_t r^t, ∑_t r^t x^t]^T
w = A^(–1) y
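The 2×2 system above can be solved in closed form (function name is illustrative):

```python
def fit_line(x, r):
    """Solve the 2x2 normal equations A w = y for w = (w0, w1)
    using the closed-form inverse of a 2x2 matrix."""
    N = len(x)
    sx = sum(x)
    sxx = sum(v * v for v in x)
    sr = sum(r)
    srx = sum(v * u for v, u in zip(x, r))
    det = N * sxx - sx * sx  # determinant of A
    w0 = (sxx * sr - sx * srx) / det
    w1 = (N * srx - sx * sr) / det
    return w0, w1
```

For data generated by r = 2x + 1 the fit recovers w0 = 1, w1 = 2 exactly.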
65. Polynomial Regression
g(x^t|w_k, ..., w_2, w_1, w_0) = w_k (x^t)^k + ... + w_2 (x^t)² + w_1 x^t + w_0
With the design matrix
D = [ 1, x^1, (x^1)², ..., (x^1)^k ;
      1, x^2, (x^2)², ..., (x^2)^k ;
      ... ;
      1, x^N, (x^N)², ..., (x^N)^k ]
and r = [r^1, r^2, ..., r^N]^T,
w = (D^T D)^(–1) D^T r
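The least-squares solution w = (D^T D)^(–1) D^T r can be sketched without any linear-algebra library by forming the normal equations and solving them with Gaussian elimination (function name is illustrative):

```python
def fit_poly(x, r, k):
    """Least squares w = (D^T D)^{-1} D^T r, where row t of the design
    matrix D is [1, x^t, (x^t)^2, ..., (x^t)^k]."""
    n = k + 1
    D = [[xt ** j for j in range(n)] for xt in x]
    # normal equations: (D^T D) w = D^T r
    A = [[sum(D[t][i] * D[t][j] for t in range(len(x))) for j in range(n)]
         for i in range(n)]
    b = [sum(D[t][i] * r[t] for t in range(len(x))) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        p = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[p] = A[p], A[col]
        b[col], b[p] = b[p], b[col]
        for row in range(col + 1, n):
            f = A[row][col] / A[col][col]
            for j in range(col, n):
                A[row][j] -= f * A[col][j]
            b[row] -= f * b[col]
    # back substitution
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w  # [w0, w1, ..., wk]
```

Fitting k = 2 to data generated by r = x² recovers w ≈ [0, 0, 1].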
66. Other Error Measures
Square Error: E(θ|X) = (1/N) ∑_{t=1}^{N} [r^t – g(x^t|θ)]²
Relative Square Error: E(θ|X) = ∑_{t=1}^{N} [r^t – g(x^t|θ)]² / ∑_{t=1}^{N} [r^t – r̄]²
Absolute Error: E(θ|X) = ∑_t |r^t – g(x^t|θ)|
ε-sensitive Error: E(θ|X) = ∑_t 1(|r^t – g(x^t|θ)| > ε) (|r^t – g(x^t|θ)| – ε)
67. Bias and Variance
E[(r – g(x))² | x] = E[(r – E[r|x])² | x] + (E[r|x] – g(x))²
(noise) + (squared error)
E_X[(E[r|x] – g(x))² | x] = (E[r|x] – E_X[g(x)])² + E_X[(g(x) – E_X[g(x)])²]
(bias²) + (variance)
68. Estimating Bias and Variance
M samples X_i = {x^t_i, r^t_i}, i = 1, ..., M,
are used to fit g_i(x), i = 1, ..., M
ḡ(x) = (1/M) ∑_i g_i(x)
Bias²(g) = (1/N) ∑_t [ḡ(x^t) – f(x^t)]²
Variance(g) = (1/(NM)) ∑_t ∑_i [g_i(x^t) – ḡ(x^t)]²
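The two estimates above translate directly into code (function name is illustrative):

```python
def bias2_variance(g_list, f, xs):
    """Estimate Bias^2 and Variance of estimators fit on M samples.
    g_list: the M fitted functions g_i; f: the true function; xs: the
    N evaluation points x^t."""
    M, N = len(g_list), len(xs)
    gbar = [sum(g(x) for g in g_list) / M for x in xs]  # average fit
    bias2 = sum((gbar[t] - f(xs[t])) ** 2 for t in range(N)) / N
    var = sum((g(xs[t]) - gbar[t]) ** 2
              for t in range(N) for g in g_list) / (N * M)
    return bias2, var
```

For the constant estimator g_i(x) = 2 from the next slide, the variance is exactly zero and all the error is bias.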
69. Bias/Variance Dilemma
Example: g_i(x) = 2 has no variance and high bias
g_i(x) = ∑_t r^t_i / N has lower bias with variance
As we increase complexity, bias decreases (a better fit to data) and variance increases (fit varies more with data)
Bias/Variance dilemma (Geman et al., 1992)
73. Model Selection
Cross-validation: Measure generalization accuracy by testing on data unused during training
Regularization: Penalize complex models: E' = error on data + λ · model complexity
Akaike’s information criterion (AIC), Bayesian information criterion (BIC)
Minimum description length (MDL): Kolmogorov complexity, shortest description of data
74. Bayesian Model Selection
Prior on models, p(model)
p(model|data) = p(data|model) p(model) / p(data)
Regularization, when the prior favors simpler models
Bayes: MAP of the posterior, p(model|data)
Average over a number of models with high posterior (voting, ensembles)
75. Regression example
Coefficients increase in magnitude as order increases:
1: [–0.0769, 0.0016]
2: [0.1682, –0.6657, 0.0080]
3: [0.4238, –2.5778, 3.4675, –0.0002]
4: [–0.1093, 1.4356, –5.5007, 6.0454, –0.0019]
Regularization (L2): E(w|X) = (1/2) ∑_{t=1}^{N} [r^t – g(x^t|w)]² + (λ/2) ∑_i w_i²
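The L2-regularized (augmented) error above can be sketched as follows (function names are illustrative):

```python
def ridge_error(w, X, r, lam, g):
    """Augmented error E(w|X) = 1/2 sum_t [r^t - g(x^t|w)]^2
    + (lam/2) sum_i w_i^2, penalizing large coefficients."""
    data_error = 0.5 * sum((rt - g(xt, w)) ** 2 for xt, rt in zip(X, r))
    penalty = 0.5 * lam * sum(wi ** 2 for wi in w)
    return data_error + penalty

def linear(x, w):
    """g(x|w) = w1*x + w0 with w = [w0, w1]."""
    return w[1] * x + w[0]
```

Larger λ pushes the minimizer toward small weights, which is exactly what keeps the high-order coefficients in the table above from blowing up.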