4. Introduction: Data Streams
Data Streams
Sequence is potentially infinite
High amount of data: sublinear space
High speed of arrival: sublinear time per example
Once an element from a data stream has been processed, it is discarded or archived
Example
Puzzle: Finding Missing Numbers
Let π be a permutation of {1,...,n}.
Let π₋₁ be π with one element missing.
π₋₁[i] arrives in increasing order
Task: Determine the missing number
5. Introduction: Data Streams
Naive solution: use an n-bit vector to memorize all the numbers seen so far (O(n) space)
7. Introduction: Data Streams
Data stream solution: O(log n) space.
Store n(n+1)/2 − ∑_{j≤i} π₋₁[j]; when the stream ends, this value is the missing number.
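A minimal sketch of this running-sum solution (illustrative Python, not from the slides):

```python
# Streaming solution to the missing-number puzzle: keep only a running sum,
# which needs O(log n) bits, and subtract each element as it arrives.
def find_missing(stream, n):
    total = n * (n + 1) // 2          # sum of 1..n
    for x in stream:                  # each element is inspected once, then discarded
        total -= x
    return total

# Example: permutation of {1,...,5} with the value 4 missing
print(find_missing(iter([1, 2, 3, 5]), 5))   # -> 4
```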
8. Data Streams
Tools:
approximation
randomization, sampling
sketching
9. Data Streams
Approximation algorithms
Small error rate with high probability
An algorithm (ε,δ)-approximates F if it outputs F̃ for which Pr[|F̃ − F| > εF] < δ.
10. Data Streams Approximation Algorithms
Example bit stream: 1011000111 1010101
Sliding Window
We can maintain simple statistics over sliding windows, using O((1/ε) log² N) space, where
N is the length of the sliding window
ε is the accuracy parameter
M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002.
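A hedged sketch of the bucket idea behind the Datar et al. result, counting 1's approximately in a sliding window. This is a simplified exponential histogram, not the full algorithm; the class name and parameters are illustrative.

```python
from collections import deque

class BitWindowCounter:
    """Approximate count of 1's among the last N stream bits, in O(log^2 N) space."""
    def __init__(self, N, max_per_size=2):
        self.N = N                    # window length
        self.k = max_per_size         # buckets allowed per size; larger k = better accuracy
        self.buckets = deque()        # (timestamp of most recent 1, bucket size), newest first
        self.t = 0

    def add(self, bit):
        self.t += 1
        # expire the oldest bucket once it slides out of the window
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 1:
            self.buckets.appendleft((self.t, 1))
            self._merge()

    def _merge(self):
        # whenever more than k buckets share a size, merge the two oldest of that size
        merged = True
        while merged:
            merged = False
            by_size = {}
            for j, (_, s) in enumerate(self.buckets):
                by_size.setdefault(s, []).append(j)
            for s, idxs in by_size.items():
                if len(idxs) > self.k:
                    a, b = idxs[-1], idxs[-2]          # two oldest buckets of size s
                    ts = self.buckets[b][0]            # keep the more recent timestamp
                    del self.buckets[a]
                    self.buckets[b] = (ts, 2 * s)
                    merged = True
                    break

    def count_ones(self):
        # all buckets except the oldest are exact; count half of the oldest one
        if not self.buckets:
            return 0
        sizes = [s for _, s in self.buckets]
        return sum(sizes[:-1]) + (sizes[-1] + 1) // 2

counter = BitWindowCounter(N=100)
for b in [1, 0, 1, 1, 0, 1] * 30:
    counter.add(b)
print(counter.count_ones())          # approximate number of 1's in the last 100 bits
```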
17. Classification
Definition
Given nC different classes, a classifier algorithm builds a model that predicts, for every unlabelled instance I, the class C to which it belongs, as accurately as possible.
Example
A spam filter
Example
Twitter sentiment analysis: classify tweets as expressing positive or negative sentiment
18. Data stream classification cycle
1 Process one example at a time, and inspect it only once (at most)
2 Use a limited amount of memory
3 Work in a limited amount of time
4 Be ready to predict at any point
19. Classification
Data set that describes e-mail features for deciding if it is spam.

Example
Contains “Money”  Domain type  Has attach.  Time received  spam
yes               com          yes          night          yes
yes               edu          no           night          yes
no                com          yes          night          yes
no                edu          no           day            no
no                com          no           day            no
yes               cat          no           day            yes

Assume we have to classify the following new instance:
Contains “Money”  Domain type  Has attach.  Time received  spam
yes               edu          yes          day            ?
20. Bayes Classifiers
Naïve Bayes
Based on Bayes' Theorem:
P(c|d) = P(c) P(d|c) / P(d)
posterior = (prior × likelihood) / evidence
Estimates the probability of observing attribute a and the prior probability P(c)
Probability of class c given an instance d:
P(c|d) = P(c) ∏_{a∈d} P(a|c) / P(d)
21. Bayes Classifiers
Multinomial Naïve Bayes
Considers a document as a bag of words.
Estimates the probability of observing word w and the prior probability P(c)
Probability of class c given a test document d:
P(c|d) = P(c) ∏_{w∈d} P(w|c)^{n_wd} / P(d)
where n_wd is the number of times w occurs in d
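A hedged sketch (illustrative Python, not MOA) of a count-based naive Bayes learner that fits the stream setting: it stores only counts, so it can update on each example and predict at any point. It is applied to the e-mail table of slide 19; no smoothing, for brevity.

```python
from collections import defaultdict

class StreamingNaiveBayes:
    def __init__(self):
        self.class_counts = defaultdict(int)    # counts n(c)
        self.attr_counts = defaultdict(int)     # counts n(attribute index, value, c)
        self.n = 0

    def update(self, x, y):
        """x: tuple of attribute values, y: class label; one pass, O(1) per example."""
        self.n += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.attr_counts[(i, v, y)] += 1

    def predict(self, x):
        best, best_score = None, -1.0
        for c, nc in self.class_counts.items():
            score = nc / self.n                             # prior P(c)
            for i, v in enumerate(x):
                score *= self.attr_counts[(i, v, c)] / nc   # P(a|c) from counts
            if score > best_score:
                best, best_score = c, score
        return best

# the e-mail table of slide 19: (contains "money", domain, attachment, time) -> spam?
nb = StreamingNaiveBayes()
for x, y in [(("yes", "com", "yes", "night"), "yes"),
             (("yes", "edu", "no", "night"), "yes"),
             (("no", "com", "yes", "night"), "yes"),
             (("no", "edu", "no", "day"), "no"),
             (("no", "com", "no", "day"), "no"),
             (("yes", "cat", "no", "day"), "yes")]:
    nb.update(x, y)
print(nb.predict(("yes", "edu", "yes", "day")))   # -> 'yes'
```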
23. Perceptron
[Figure: a perceptron with inputs Attribute 1, ..., Attribute 5, weights w1, ..., w5, and output hw(xi).]
We use the sigmoid function hw(x) = σ(wᵀx), where
σ(x) = 1/(1 + e^(−x))
σ′(x) = σ(x)(1 − σ(x))
24. Perceptron
Minimize the mean-square error: J(w) = ½ ∑_i (yi − hw(xi))²
Stochastic Gradient Descent: w = w − η∇J_{xi}
Gradient of the error function:
∇J = −∑_i (yi − hw(xi)) ∇hw(xi)
∇hw(xi) = hw(xi)(1 − hw(xi)) xi
Weight update rule:
w = w + η ∑_i (yi − hw(xi)) hw(xi)(1 − hw(xi)) xi
25. Perceptron
PERCEPTRON LEARNING(Stream, η)
1 for each class
2   do PERCEPTRON LEARNING(Stream, class, η)

PERCEPTRON LEARNING(Stream, class, η)
1 ▷ Let w0 and w be randomly initialized
2 for each example (x, y) in Stream
3   do if class = y
4        then δ = (1 − hw(x)) · hw(x) · (1 − hw(x))
5        else δ = (0 − hw(x)) · hw(x) · (1 − hw(x))
6      w = w + η · δ · x

PERCEPTRON PREDICTION(x)
1 return arg max_class h_{w_class}(x)
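A hedged Python sketch of the multi-class sigmoid perceptron above: one weight vector per class, updated by stochastic gradient descent as each example arrives (class and parameter names are illustrative).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class StreamingPerceptron:
    """One sigmoid perceptron per class, trained by SGD on the stream."""
    def __init__(self, n_features, classes, eta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.eta = eta
        self.w = {c: rng.normal(scale=0.01, size=n_features) for c in classes}

    def update(self, x, y):
        for c, w in self.w.items():
            h = sigmoid(w @ x)
            target = 1.0 if c == y else 0.0
            delta = (target - h) * h * (1.0 - h)       # derivative of the squared error
            self.w[c] = w + self.eta * delta * x

    def predict(self, x):
        return max(self.w, key=lambda c: sigmoid(self.w[c] @ x))

# toy usage on a stream of 2-feature examples
model = StreamingPerceptron(n_features=2, classes=["a", "b"])
for x, y in [([1.0, 0.0], "a"), ([0.0, 1.0], "b")] * 100:
    model.update(np.array(x), y)
print(model.predict(np.array([1.0, 0.0])))             # likely 'a'
```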
27. Classification
Assume we have to classify the following new instance:
Contains “Money”  Domain type  Has attach.  Time received  spam
yes               edu          yes          day            ?

[Decision tree: the root tests Time; Night → YES; Day → test Contains “Money”: Yes → YES, No → NO.]
28. Decision Trees
Basic induction strategy:
A ← the “best” decision attribute for next node
Assign A as decision attribute for node
For each value of A, create new descendant of node
Sort training examples to leaf nodes
If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
29. Hoeffding Trees
Hoeffding Tree : VFDT
Pedro Domingos and Geoff Hulten.
Mining high-speed data streams. 2000
With high probability, constructs a model identical to the one a traditional (greedy) batch method would learn
With theoretical guarantees on the error rate

[Decision tree: the root tests Time; Night → YES; Day → test Contains “Money”: Yes → YES, No → NO.]
31. Hoeffding Bound Inequality
Let X = ∑_i Xi where X1,...,Xn are independent and identically distributed in [0,1]. Then
1 Chernoff: for each ε < 1,
Pr[X > (1+ε)E[X]] ≤ exp(−ε² E[X] / 3)
2 Hoeffding: for each t > 0,
Pr[X > E[X] + t] ≤ exp(−2t² / n)
3 Bernstein: let σ² = ∑_i σi² be the variance of X. If Xi − E[Xi] ≤ b for each i ∈ [n], then for each t > 0,
Pr[X > E[X] + t] ≤ exp(−t² / (2σ² + (2/3)bt))
32. Hoeffding Tree or VFDT
HT(Stream, δ)
1 ▷ Let HT be a tree with a single leaf (root)
2 ▷ Init counts nijk at root
3 for each example (x, y) in Stream
4   do HTGROW((x, y), HT, δ)

HTGROW((x, y), HT, δ)
1 ▷ Sort (x, y) to leaf l using HT
2 ▷ Update counts nijk at leaf l
3 if examples seen so far at l are not all of the same class
4   then ▷ Compute G for each attribute
5        if G(Best Attr.) − G(2nd best) > √(R² ln(1/δ) / (2n))
6          then ▷ Split leaf on best attribute
7               for each branch
8                 do ▷ Start new leaf and initialize counts
34. Hoeffding Trees
HT features
With high probability, constructs a model identical to the one a traditional (greedy) method would learn
Ties: when two attributes have similar G, split if
G(Best Attr.) − G(2nd best) < √(R² ln(1/δ) / (2n)) < τ
Compute G every nmin instances
Memory: deactivate the least promising leaves, those with the lowest pl × el, where
pl is the probability of reaching leaf l
el is the error at the leaf
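A small sketch of the split decision at the heart of the Hoeffding tree: the Hoeffding bound ε = √(R² ln(1/δ)/(2n)) and the tie-breaking rule from this slide (function names and the example numbers are illustrative).

```python
import math

def hoeffding_bound(R, delta, n):
    """R is the range of the gain metric (log2(#classes) for information gain)."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, R, delta, n, tau=0.05):
    eps = hoeffding_bound(R, delta, n)
    return (g_best - g_second > eps) or (eps < tau)   # tie-breaking as on this slide

# example: two classes (R = 1 for information gain), 500 examples at the leaf
print(hoeffding_bound(R=1.0, delta=1e-7, n=500))      # ~0.127
print(should_split(0.30, 0.12, R=1.0, delta=1e-7, n=500))   # True: split
```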
35. Hoeffding Naive Bayes Tree
Hoeffding Tree
Majority Class learner at leaves
Hoeffding Naive Bayes Tree
G. Holmes, R. Kirkby, and B. Pfahringer.
Stress-testing Hoeffding trees, 2005.
monitors accuracy of a Majority Class learner
monitors accuracy of a Naive Bayes learner
predicts using the most accurate method
36. Bagging
Example
Dataset of 4 Instances : A, B, C, D
Classifier 1: B, A, C, B
Classifier 2: D, B, A, D
Classifier 3: B, A, C, B
Classifier 4: B, C, B, B
Classifier 5: D, C, A, C
Bagging builds a set of M base models, each trained on a bootstrap sample created by drawing random samples with replacement.
38. Bagging
Example
Dataset of 4 Instances : A, B, C, D
Classifier 1: A, B, B, C: A(1) B(2) C(1) D(0)
Classifier 2: A, B, D, D: A(1) B(1) C(0) D(2)
Classifier 3: A, B, B, C: A(1) B(2) C(1) D(0)
Classifier 4: B, B, B, C: A(0) B(3) C(1) D(0)
Classifier 5: A, C, C, D: A(1) B(0) C(2) D(1)
Each base model's training set contains each of the original training examples K times, where P(K = k) follows a binomial distribution.
39. Bagging
Figure: the Poisson(1) distribution.
Each base model's training set contains each of the original training examples K times, where P(K = k) follows a binomial distribution; as the number of examples grows, this distribution tends to Poisson(1).
40. Oza and Russell's Online Bagging for M models
1: Initialize base models hm for all m ∈ {1,2,...,M}
2: for all training examples do
3:   for m = 1,2,...,M do
4:     Set w = Poisson(1)
5:     Update hm with the current example with weight w
6: anytime output:
7: return hypothesis: hfin(x) = arg max_{y∈Y} ∑_{t=1}^{T} I(ht(x) = y)
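A hedged Python sketch of this procedure: each arriving example is given to every base model Poisson(1) times, and prediction is by majority vote. `base_model` is assumed to be any object with update(x, y) and predict(x) methods (e.g., the streaming learners sketched earlier).

```python
import numpy as np
from copy import deepcopy

class OnlineBagging:
    """Oza-Russell online bagging over any base model exposing update(x, y) and predict(x)."""
    def __init__(self, base_model, M=10, seed=0):
        self.models = [deepcopy(base_model) for _ in range(M)]
        self.rng = np.random.default_rng(seed)

    def update(self, x, y):
        for m in self.models:
            k = self.rng.poisson(1.0)        # weight w ~ Poisson(1)
            for _ in range(k):               # present the example k times
                m.update(x, y)

    def predict(self, x):
        votes = {}
        for m in self.models:
            c = m.predict(x)
            votes[c] = votes.get(c, 0) + 1
        return max(votes, key=votes.get)     # majority vote (line 7 above)
```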
44. Optimal Change Detector and Predictor
High accuracy
Low false positive and false negative rates
Theoretical guarantees
Fast detection of change
Low computational cost: minimum space and time needed
No parameters needed
45. Algorithm ADaptive Sliding WINdow
Example
W = 101010110111111
ADWIN examines every split of W into W = W0 · W1 (W0 = 1, W0 = 10, W0 = 101, ...) and compares the means of the two subwindows.
ADWIN: ADAPTIVE WINDOWING ALGORITHM
1 Initialize Window W
2 for each t > 0
3   do W ← W ∪ {xt} (i.e., add xt to the head of W)
4      repeat Drop elements from the tail of W
5      until |µ̂W0 − µ̂W1| < εc holds
6            for every split of W into W = W0 · W1
7      Output µ̂W
54. Algorithm ADaptive Sliding WINdow
Example
W = 101010110111111
For the split W0 = 101010110, W1 = 111111 we have |µ̂W0 − µ̂W1| ≥ εc: CHANGE DETECTED!
ADWIN then drops elements from the tail of W (W becomes 01010110111111, and so on) until no split violates the bound.
57. Algorithm ADaptive Sliding WINdow
Theorem
At every time step we have:
1 (False positive rate bound). If µt remains constant within W, the probability that ADWIN shrinks the window at this step is at most δ.
2 (False negative rate bound). Suppose that for some partition of W into two parts W0 W1 (where W1 contains the most recent items) we have |µW0 − µW1| > 2εc. Then with probability 1 − δ, ADWIN shrinks W to W1, or shorter.
ADWIN tunes itself to the data stream at hand, with no need for the user to hardwire or precompute parameters.
58. Algorithm ADaptive Sliding WINdow
ADWIN, using a Data Stream Sliding Window Model:
can provide the exact counts of 1's in O(1) time per point
tries O(log W) cutpoints
uses O((1/ε) log W) memory words
the processing time per example is O(log W) (amortized and worst-case)

Sliding Window Model
Buckets:   1010101  101  11  1  1
Content:   4        2    2   1  1
Capacity:  7        3    2   1  1
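A hedged sketch of the ADWIN idea in its simple O(W)-per-example form: the window is stored explicitly and every split is tested after each new element. The real algorithm uses the bucket structure above to reach O(log W) time and memory; the cut threshold below follows the Hoeffding-style form of the ADWIN paper, but the exact constants are illustrative.

```python
import math
from collections import deque

class SimpleADWIN:
    def __init__(self, delta=0.01):
        self.delta = delta
        self.window = deque()

    def _eps_cut(self, n0, n1):
        # Hoeffding-style threshold with the harmonic mean of the subwindow sizes
        m = 1.0 / (1.0 / n0 + 1.0 / n1)
        return math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 * (n0 + n1) / self.delta))

    def add(self, x):
        """Add x (e.g., a 0/1 error indicator); return True if a change was detected."""
        self.window.append(x)
        change = False
        while self._has_change():
            self.window.popleft()       # drop stale elements from the (old) tail
            change = True
        return change

    def _has_change(self):
        w = list(self.window)
        total = sum(w)
        s0 = 0.0
        for i in range(1, len(w)):      # split point: w[:i] is W0, w[i:] the recent W1
            s0 += w[i - 1]
            n0, n1 = i, len(w) - i
            mu0, mu1 = s0 / n0, (total - s0) / n1
            if abs(mu0 - mu1) > self._eps_cut(n0, n1):
                return True
        return False
```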
59. VFDT / CVFDT
Concept-adapting Very Fast Decision Trees: CVFDT
G. Hulten, L. Spencer, and P. Domingos.
Mining time-changing data streams. 2001
It keeps its model consistent with a sliding window of examples
Construct “alternative branches” as preparation for changes
If the alternative branch becomes more accurate, the tree switches to that branch
60. Decision Trees: CVFDT
No theoretical guarantees on the error rate of CVFDT
CVFDT parameters:
1 W: the example window size.
2 T0: number of examples used to check at each node if the
splitting attribute is still the best.
3 T1: number of examples used to build the alternate tree.
4 T2: number of examples used to test the accuracy of the alternate
tree.
61. Decision Trees: Hoeffding Adaptive Tree
Hoeffding Adaptive Tree:
replace frequency-statistics counters by estimators
no window of examples is needed, since the estimators maintain the statistics required
change the way alternate subtrees are checked and substituted, using a change detector with theoretical guarantees
Advantages over CVFDT:
1 Theoretical guarantees
2 No Parameters
62. ADWIN Bagging (KDD’09)
ADWIN
An adaptive sliding window whose size is recomputed online
according to the rate of change observed.
ADWIN has rigorous guarantees (theorems)
on the rates of false positives and false negatives
on the relation between the size of the current window and the rate of change
ADWIN Bagging
When a change is detected, the worst classifier is removed and a new
classifier is added.
63. Leveraging Bagging for Evolving Data Streams (ECML-PKDD'10)
Randomization as a powerful tool to increase accuracy and diversity
There are three ways of using randomization:
Manipulating the input data
Manipulating the classifier algorithms
Manipulating the output targets
64. Leveraging Bagging for Evolving Data Streams
Leveraging Bagging
Using Poisson(λ)
Leveraging Bagging MC
Using Poisson(λ) and Random Output Codes
Fast Leveraging Bagging ME
if an instance is misclassified: weight = 1
if not: weight = eT/(1−eT)
65. Empirical evaluation
                        Accuracy   RAM-Hours
Hoeffding Tree          74.03%     0.01
Online Bagging          77.15%     2.98
ADWIN Bagging           79.24%     1.48
Leveraging Bagging      85.54%     20.17
Leveraging Bagging MC   85.37%     22.04
Leveraging Bagging ME   80.77%     0.87
67. Clustering
Definition
Clustering is the partitioning of a set of instances into previously unknown groups according to some common relations or affinities.
Example
Market segmentation of customers
Example
Social network communities
68. Clustering
Definition
Given
a set of instances I
a number of clusters K
an objective function cost(I)
a clustering algorithm computes an assignment of a cluster for each
instance
f : I → {1,...,K}
that minimizes the objective function cost(I)
69. Clustering
Definition
Given
a set of instances I
a number of clusters K
an objective function cost(C,I)
a clustering algorithm computes a set C of instances with |C| = K that
minimizes the objective function
cost(C, I) = ∑_{x∈I} d²(x, C)
where
d(x, c): distance function between x and c
d²(x, C) = min_{c∈C} d²(x, c): distance from x to the nearest point in C
70. k-means
1. Choose k initial centers C = {c1,...,ck}
2. while stopping criterion has not been met
For i = 1,...,N
find the closest center ck ∈ C to instance pi
assign instance pi to cluster Ck
For k = 1,...,K
set ck to be the center of mass of all points in Ck
71. k-means++
1. Choose an initial center c1
For k = 2,...,K
select ck = p ∈ I with probability d²(p,C)/cost(C,I)
2. while stopping criterion has not been met
For i = 1,...,N
find the closest center ck ∈ C to instance pi
assign instance pi to cluster Ck
For k = 1,...,K
set ck to be the center of mass of all points in Ck
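A hedged Python sketch of k-means++ seeding followed by Lloyd iterations, for points given as tuples (function names and the toy data are illustrative).

```python
import random

def d2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def d2_to_set(p, centers):
    return min(d2(p, c) for c in centers)

def kmeans_pp(points, K, iters=10, seed=0):
    rng = random.Random(seed)
    # seeding: first center uniformly, then proportional to d^2(p, C)
    centers = [rng.choice(points)]
    while len(centers) < K:
        weights = [d2_to_set(p, centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    # Lloyd iterations: assign each point to the closest center, recompute centers of mass
    for _ in range(iters):
        clusters = [[] for _ in range(K)]
        for p in points:
            k = min(range(K), key=lambda j: d2(p, centers[j]))
            clusters[k].append(p)
        for k, cl in enumerate(clusters):
            if cl:
                centers[k] = tuple(sum(col) / len(cl) for col in zip(*cl))
    return centers

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (9.0, 0.1), (8.9, 0.0)]
print(kmeans_pp(pts, K=3))
```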
72. Performance Measures
Internal Measures
Sum square distance
Dunn index D = dmin / dmax
C-Index C = (S − Smin) / (Smax − Smin)
External Measures
Rand Measure
F Measure
Jaccard
Purity
73. BIRCH
BALANCED ITERATIVE REDUCING AND CLUSTERING USING
HIERARCHIES
Clustering Features CF = (N,LS,SS)
N: number of data points
LS: linear sum of the N data points
SS: square sum of the N data points
Properties:
Additivity: CF1 +CF2 = (N1 +N2,LS1 +LS2,SS1 +SS2)
Easy to compute: average inter-cluster distance
and average intra-cluster distance
Uses CF tree
Height-balanced tree with two parameters
B: branching factor
T: leaf radius threshold
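A small sketch of a clustering feature CF = (N, LS, SS) showing the additivity property and how the centroid and radius can be derived from it (illustrative Python, not the BIRCH CF-tree itself).

```python
import numpy as np

class CF:
    def __init__(self, x=None):
        if x is None:
            self.N, self.LS, self.SS = 0, 0.0, 0.0
        else:
            x = np.asarray(x, dtype=float)
            self.N, self.LS, self.SS = 1, x.copy(), float(x @ x)

    def __add__(self, other):                 # additivity: merge two (sub)clusters
        cf = CF()
        cf.N = self.N + other.N
        cf.LS = self.LS + other.LS
        cf.SS = self.SS + other.SS
        return cf

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # RMS distance of the points to the centroid, computed from N, LS, SS only
        c = self.LS / self.N
        return float(np.sqrt(max(self.SS / self.N - c @ c, 0.0)))

points = [[1.0, 2.0], [2.0, 2.0], [1.5, 1.0]]
cluster = CF(points[0]) + CF(points[1]) + CF(points[2])
print(cluster.N, cluster.centroid(), cluster.radius())
```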
74. BIRCH
BALANCED ITERATIVE REDUCING AND CLUSTERING USING
HIERARCHIES
Phase 1: Scan all data and build an initial in-memory CF tree
Phase 2: Condense into desirable range by building a smaller CF
tree (optional)
Phase 3: Global clustering
Phase 4: Cluster refining (optional and offline, as it requires more passes over the data)
75. Clu-Stream
Clu-Stream
Uses micro-clusters to store statistics on-line
Clustering Features CF = (N,LS,SS,LT,ST)
N: number of data points
LS: linear sum of the N data points
SS: square sum of the N data points
LT: linear sum of the time stamps
ST: square sum of the time stamps
Uses pyramidal time frame
76. Clu-Stream
On-line Phase
For each new point that arrives, either:
the point is absorbed by an existing micro-cluster, or
the point starts a new micro-cluster of its own; to keep the number of micro-clusters bounded, the oldest micro-cluster is deleted or two of the oldest micro-clusters are merged
Off-line Phase
Apply k-means using microclusters as points
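A hedged sketch of the on-line phase described above: micro-clusters extend the CF vector with the time statistics (LT, ST); a point is absorbed by the nearest micro-cluster if it is close enough, otherwise it starts a new one and the oldest is dropped. The fixed radius threshold and the "delete oldest" policy here are simplifications of CluStream's actual boundary and recency rules.

```python
import numpy as np

class MicroCluster:
    def __init__(self, x, t):
        x = np.asarray(x, dtype=float)
        self.N, self.LS, self.SS = 1, x.copy(), x @ x
        self.LT, self.ST = t, t * t           # linear/square sums of timestamps

    def absorb(self, x, t):
        x = np.asarray(x, dtype=float)
        self.N += 1; self.LS += x; self.SS += x @ x
        self.LT += t; self.ST += t * t

    def centroid(self):
        return self.LS / self.N

def online_phase(stream, max_mc=10, radius=1.0):
    mcs = []
    for t, x in enumerate(stream):
        x = np.asarray(x, dtype=float)
        if mcs:
            nearest = min(mcs, key=lambda m: np.linalg.norm(m.centroid() - x))
            if np.linalg.norm(nearest.centroid() - x) <= radius:
                nearest.absorb(x, t)          # point absorbed by a micro-cluster
                continue
        mcs.append(MicroCluster(x, t))        # point starts a new micro-cluster
        if len(mcs) > max_mc:
            mcs.pop(0)                        # simplistic: delete the oldest one
    return mcs

rng = np.random.default_rng(0)
stream = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
print(len(online_phase(stream, max_mc=5, radius=2.0)))
```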
77. StreamKM++: Coresets
Coreset of a set P with respect to some problem
Small subset that approximates the original set P.
Solving the problem for the coreset provides an approximate
solution for the problem on P.
(k,ε)-coreset
A (k,ε)-coreset S of P is a (weighted) subset of P such that for each C of size k:
(1−ε) cost(P,C) ≤ costw(S,C) ≤ (1+ε) cost(P,C)
78. StreamKM++: Coresets
Coreset Tree
Choose a leaf node l at random
Choose a new sample point qt+1 from Pl according to d²
Based on ql and qt+1, split Pl into two subclusters and create two child nodes
StreamKM++
Maintain L = log₂(n/m) + 2 buckets B0, B1, ..., BL−1
80. Frequent Patterns
Suppose D is a dataset of patterns, t ∈ D, and min_sup is a constant.
Definition
Support(t): number of patterns in D that are superpatterns of t.
Definition
Pattern t is frequent if Support(t) ≥ min_sup.
Frequent Subpattern Problem
Given D and min_sup, find all frequent subpatterns of patterns in D.
87. Itemset Mining
d1  abce
d2  cde
d3  abce
d4  acde
d5  abcde
d6  bcd

Support  Frequent                        Gen     Closed  Max
6        c                               c       c
5        e, ce                           e       ce
4        a, ac, ae, ace                  a       ace
4        b, bc                           b       bc
4        d, cd                           d       cd
3        ab, abc, abe, be, bce, abce     ab, be  abce    abce
3        de, cde                         de      cde     cde
90. Itemset Mining
e → ce: the generator e has the same support as the closed itemset ce (its closure).
93. Itemset Mining
a → ace: the generator a has the same support as the closed itemset ace (its closure).
95. Closed Patterns
Usually, there are too many frequent patterns. We can compute a
smaller set, while keeping the same information.
Example
A set of 1000 items has 2^1000 ≈ 10^301 subsets, which is more than the number of atoms in the universe (≈ 10^79).
96. Closed Patterns
A priori property
If t′ is a subpattern of t, then Support(t′) ≥ Support(t).
Definition
A frequent pattern t is closed if none of its proper superpatterns has
the same support as it has.
Frequent subpatterns and their supports can be generated from closed
patterns.
97. Maximal Patterns
Definition
A frequent pattern t is maximal if none of its proper superpatterns is
frequent.
Frequent subpatterns can be generated from maximal patterns, but not
with their support.
All maximal patterns are closed, but not all closed patterns are
maximal.
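A brute-force sketch (fine only for tiny data) that recovers the frequent, closed and maximal itemsets of the dataset from slides 87-94 with min_sup = 3:

```python
from itertools import combinations

D = [set("abce"), set("cde"), set("abce"), set("acde"), set("abcde"), set("bcd")]
min_sup = 3

def support(itemset):
    return sum(1 for t in D if itemset <= t)

items = sorted(set().union(*D))
frequent = {}
for r in range(1, len(items) + 1):
    for combo in combinations(items, r):
        s = support(set(combo))
        if s >= min_sup:
            frequent[frozenset(combo)] = s

# closed: no proper frequent superset has the same support
closed = {p for p in frequent
          if not any(p < q and frequent[q] == frequent[p] for q in frequent)}
# maximal: no proper superset is frequent
maximal = {p for p in frequent if not any(p < q for q in frequent)}

print(len(frequent))                               # 19 frequent itemsets
print(sorted("".join(sorted(p)) for p in closed))  # ['abce', 'ace', 'bc', 'c', 'cd', 'cde', 'ce']
print(sorted("".join(sorted(p)) for p in maximal)) # ['abce', 'cde']
```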
98. Non-streaming frequent itemset miners
Representation:
Horizontal layout
T1: a, b, c
T2: b, c, e
T3: b, d, e
Vertical layout
a: 1 0 0
b: 1 1 1
c: 1 1 0
Search:
Breadth-first (levelwise): Apriori
Depth-first: Eclat, FP-Growth
99. Mining Patterns over Data Streams
Requirements: fast, use small amount of memory and adaptive
Type:
Exact
Approximate
Per batch, per transaction
Incremental, Sliding Window, Adaptive
Frequent, Closed, Maximal patterns
100. Moment
Computes closed frequents itemsets in a sliding window
Uses Closed Enumeration Tree
Uses 4 types of nodes:
Closed Nodes
Intermediate Nodes
Unpromising Gateway Nodes
Infrequent Gateway Nodes
Adding transactions: closed itemsets remain closed
Removing transactions: infrequent itemsets remain infrequent
101. FP-Stream
Mining Frequent Itemsets at Multiple Time Granularities
Based on FP-Growth
Maintains
pattern tree
tilted-time window
Allows answering time-sensitive queries
Keeps more detailed information about recent data
Drawback: time and memory complexity
102. Tree and Graph Mining: Dealing with time changes
Keep a window on recent stream elements
Actually, just its lattice of closed sets!
Keep track of number of closed patterns in lattice, N
Use some change detector on N
When change is detected:
Drop stale part of the window
Update lattice to reflect this deletion, using deletion rule
Alternatively, use a sliding window of some fixed size