This document presents the ClusTree, a self-adaptive clustering algorithm for streaming data. The ClusTree uses micro-clusters to represent streaming data in a hierarchical tree structure. It provides anytime results by inserting data incrementally into the tree as it arrives. The tree structure allows it to adapt to varying processing speeds and provide clustering results at different levels of granularity. Evaluation shows the ClusTree achieves high clustering purity even at fast data rates and can adapt to changing rates by varying the number of micro-clusters maintained. It provides a fine-grained representation of the streaming data suitable as input for further analysis.
1. The ClusTree: Indexing Micro-Clusters
for Anytime Stream Mining
Philipp Kranen (1), Ira Assent (2), Corinna Baldauf (1), Thomas Seidl (1)
(1) Data Management and Data Exploration Group, RWTH Aachen University, Germany
(2) Department of Computer Science, Aarhus University, Denmark
2. Motivating examples
[Figure labels: emergency, normal; pre-classifier, full classifier, professional decision]
3. Applications and tasks
[Figure labels: modeling, classification, outlier detection; data rate constant, data rate varying]
4. Agenda
I. The Anytime principle
Anytime algorithms for stream data mining
II. The ClusTree
Self-adaptive anytime stream clustering
III. The MOA Framework
An open source framework for stream mining algorithms
5. Definitions I
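Stream
A stream $S : \mathbb{N} \to D \times T$, $i \mapsto (o_i, t_i)$, is an infinite sequence of objects $o_i \in D$ from a d-dimensional input space $D$, and $t_i \in T$, $\forall i$, is the discrete arrival time of object $o_i$.
Inter-arrival time
The inter-arrival time between two consecutive objects $o_i$ and $o_{i+1}$ is denoted as $\Delta t_i$, i.e. $\Delta t_i = t_{i+1} - t_i \in T$.
Constant and varying streams
A stream is called constant $\Leftrightarrow \Delta t_i = \Delta t \;\; \forall i$; otherwise it is called varying.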
Stream algorithms
– Online algorithms – the input is given one at a time
– Budget algorithms – tailored to a specific time budget b
– Anytime algorithms – provide a result after any amount of processing time
6. Definitions II
Budget algorithms – tailored to a specific time budget
– Available time < budget → no result
– Available time > budget → idle times
How should stream processing be done?
– Little time → fast result
– More time → use it to improve the result
[Figure: result quality plotted over time]
Anytime Algorithms – provide a result after any time
For a given input an anytime algorithm can provide a first result after a very
short initialization time and it uses additional time to improve its result. The
algorithm is interruptible after any time and will deliver the best result
obtained until the point of interruption.
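The definition above can be illustrated with a minimal Python sketch. It assumes a hypothetical anytime nearest-neighbor search in which each examined candidate counts as one unit of processing time; the name `anytime_nearest` and the step-based budget `max_steps` are illustrative assumptions and do not come from the presentation.

```python
from math import dist, inf

def anytime_nearest(query, candidates, max_steps):
    """Return the best neighbor found after at most max_steps refinements.

    Interrupting early (small max_steps) still yields a valid, coarser result.
    """
    best, best_d = None, inf
    for step, c in enumerate(candidates):
        if step >= max_steps:      # interruption point
            break
        d = dist(query, c)
        if d < best_d:             # the stored result can only improve
            best, best_d = c, d
    return best, best_d

points = [(5.0, 5.0), (2.0, 1.0), (0.5, 0.0), (9.0, 9.0)]
quick = anytime_nearest((0.0, 0.0), points, max_steps=1)  # interrupted early
full = anytime_nearest((0.0, 0.0), points, max_steps=4)   # all time used
```

The early interruption returns the first candidate it saw, while the full run returns the true nearest point; the distance of the result never gets worse as more steps are granted.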
7. Anytime algorithms on constant streams
Can we do better than using all available time? Yes we can!
[Figure: constant data stream with arrival interval ta and time marks tf and td; object types 1, 2, …, m]
Distribute computation time according to confidence values
– Spend less time on confident items
– Use additional time for uncertain objects
Prerequisites
– Anytime algorithm
– Confidence measure
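One way to picture this distribution of computation time is the following Python sketch. It is not the method from the presentation: it assumes a hypothetical setting in which every refinement step raises an item's confidence by a fixed `gain`, and it always spends the next step on the currently least confident item. The names `distribute_steps`, `total_steps` and `gain` are illustrative.

```python
import heapq

def distribute_steps(confidences, total_steps, gain=0.1):
    """Spend refinement steps on the currently least confident item.

    confidences: initial confidence per item, in [0, 1]
    gain: assumed confidence increase per refinement step (hypothetical)
    Returns (steps spent per item, final confidences).
    """
    heap = [(c, i) for i, c in enumerate(confidences)]
    heapq.heapify(heap)
    steps = [0] * len(confidences)
    final = list(confidences)
    for _ in range(total_steps):
        c, i = heapq.heappop(heap)      # least confident item so far
        c = min(1.0, c + gain)          # one more refinement step
        steps[i] += 1
        final[i] = c
        heapq.heappush(heap, (c, i))
    return steps, final

steps, final = distribute_steps([0.9, 0.4, 0.65], total_steps=4)
```

In this run the confident first item receives no extra steps, while the uncertain second item receives most of the budget.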
8. Existing anytime classification approaches
Anytime support vector machines
Anytime nearest neighbor classification
Anytime Bayesian classification
– Categorical data
– Continuous data
Others
– Anytime induction of decision trees
– Anytime A* algorithm
– Anytime clustering
– Anytime outlier detection
[References on last slide.]
9. Sampling, buffering, anytime clustering
What about sampling?
Not appropriate for classification or outlier detection.
What about buffering?
Durations of bursts are unknown.
Why anytime clustering?
…
“Smart buffering”
Use micro‐clusters as input for further analysis
Provide constant (maximal) granularity at regular intervals
10. Agenda
I. The Anytime principle
Anytime algorithms for stream data mining
II. The ClusTree
Self-adaptive anytime stream clustering
III. The MOA Framework
An open source framework for stream mining algorithms
11. Problem statement
Clustering is a frequently used technique
Provides an overview, reduces amount of data, groups similar objects
Streaming scenario:
Use summaries (micro clusters) as input for further analysis
But: endless amounts of data (streams) are hard to handle
Stream clustering challenges:
– Single pass clustering; limited time, varying time allowance [Anytime]
– Limited memory, yet least information loss [Fine grained]
– Evolving data [Drift&Novelty]
– Flexible number and size of clusters [Self-adaptive]
12. Related work
Stream clustering approaches and paradigms
Convex clustering approaches (k-center)
Density-based, grid-based approaches
kernels, graphs, fractal dimensions, …
Process chunks, merge results
Maintain list, remove oldest or merge closest pair
Online and Offline component
All approaches have to restrict themselves to the worst case time
13. Goals
Anytime clustering [Anytime]
– don't miss any point, no matter at which speed
Adaptive model size [Self-adaptive]
– don't restrict model to worst case assumptions
Fine grained representation [Fine grained]
– provide more detailed input for offline component
Compatible with existing work on drift and novelty [Drift&Novelty]
– Aging / Decay
– Snapshots / Drift & Novelty
14. ClusTree – basic idea
Cluster features CF = (N, LS, SS) represent micro-clusters
– They allow computing statistics such as mean and variance
Maintain a balanced hierarchical data structure
– Insert each new object into the closest subtree
– Insertion stops if the next object arrives
– The most detailed model is stored at leaf level
– The tree (= model) grows if more time is available
[Figure: tree whose descent depth ranges from "less time" near the root to "more time" at the leaves]
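How mean and variance follow from a cluster feature can be sketched in a few lines of Python. The class name `ClusterFeature` and its field names are illustrative assumptions; the only facts taken from the slide are the three components N, LS and SS. The variance uses the standard identity Var[x] = E[x²] − E[x]², applied per dimension.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ClusterFeature:
    n: float           # N: number of summarized objects
    ls: List[float]    # LS: per-dimension linear sum
    ss: List[float]    # SS: per-dimension sum of squares

    def mean(self) -> List[float]:
        return [l / self.n for l in self.ls]

    def variance(self) -> List[float]:
        # Var[x] = E[x^2] - E[x]^2, computed per dimension
        return [s / self.n - (l / self.n) ** 2
                for l, s in zip(self.ls, self.ss)]

points = [(1.0, 2.0), (3.0, 2.0)]
cf = ClusterFeature(
    n=len(points),
    ls=[sum(p[i] for p in points) for i in range(2)],
    ss=[sum(p[i] ** 2 for p in points) for i in range(2)],
)
center = cf.mean()        # [2.0, 2.0]
spread = cf.variance()    # [1.0, 0.0]
```

The raw points are never needed again once the three sums are stored, which is what keeps the summary small.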
15. ClusTree structure and anytime insert [Fine grained, Anytime]
Hierarchy of micro-clusters CF = (N, LS, SS)
New objects $(x_1, \dots, x_d)$ are simply added to the cluster feature:
$N = N + 1$, $LS_i = LS_i + x_i$, $SS_i = SS_i + x_i^2$
Anytime insert: on interruption, the object is stored locally in a buffer CF
[Figure: layout of an inner entry and a leaf entry, shown with the components n(t), LS_1(t) … LS_d(t) and SS_1(t) … SS_d(t); the subscript b marks the components of the buffer CF]
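The update rule above makes cluster features additive, which is what allows a whole buffer CF to be handled like a single object. The following Python sketch illustrates this under stated assumptions: the class name `CF` and the methods `add` and `merge` are hypothetical names, and the sketch shows only the arithmetic, not the tree logic.

```python
class CF:
    """Cluster feature (N, LS, SS) with additive updates."""

    def __init__(self, d):
        self.n = 0
        self.ls = [0.0] * d
        self.ss = [0.0] * d

    def add(self, x):
        # N = N + 1, LS_i = LS_i + x_i, SS_i = SS_i + x_i^2
        self.n += 1
        for i, xi in enumerate(x):
            self.ls[i] += xi
            self.ss[i] += xi * xi

    def merge(self, other):
        # Two CFs are combined by component-wise addition
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

entry_cf, buffer_cf = CF(2), CF(2)
entry_cf.add((1.0, 2.0))
buffer_cf.add((3.0, 4.0))     # an object parked in a local buffer CF
entry_cf.merge(buffer_cf)     # an aggregate is inserted like one object
```

After the merge, `entry_cf` summarizes both objects exactly as if they had been added one by one.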
16. Buffer and hitchhiker [Self-adaptive]
Buffer: interrupt insertion – aggregate objects on interrupt
Hitchhiker: resume insertion – take buffer along (if same way)
Maximally two objects to descend with
Tree grows through splitting nodes starting from the leaf
Entry structure: (CF, pointer, CF_b)
[Figure: descent through the tree. Level 1: root; Level 2: hitchhike; Level 3: buffer; Level 4: insert]
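The interplay of buffer and hitchhiker can be sketched in Python under several simplifying assumptions that are not from the presentation: objects are one-dimensional, the available time is modeled as a number of levels (`max_levels`), an object reaching a leaf entry is merged into it (no new entries, no splits), and all names (`Entry`, `anytime_insert`, `closest`) are hypothetical.

```python
class Entry:
    """Entry structure (CF, pointer, CF_b); a CF is [N, LS, SS] for 1-d data."""
    def __init__(self, cf, child=None):
        self.cf = list(cf)              # summary of subtree plus buffer
        self.buffer = [0, 0.0, 0.0]     # CF_b: objects parked on interrupt
        self.child = child              # list of child entries, None at leaf level

def add(cf, other):
    for i in range(3):
        cf[i] += other[i]

def center(cf):
    return cf[1] / cf[0]

def closest(entries, cf):
    return min(entries, key=lambda e: abs(center(e.cf) - center(cf)))

def anytime_insert(root, x, max_levels):
    """Descend at most max_levels levels; return the number of levels visited."""
    obj, hitch = [1, x, x * x], None
    entries, depth = root, 0
    while True:
        target = closest(entries, obj)
        if hitch is not None:
            other = closest(entries, hitch)
            if other is not target:         # different way: drop hitchhiker here
                add(other.cf, hitch)
                if other.child is not None:
                    add(other.buffer, hitch)
                hitch = None
        add(target.cf, obj)
        if hitch is not None:
            add(target.cf, hitch)
        depth += 1
        if target.child is None:            # leaf entry reached: insert done
            return depth
        if depth >= max_levels:             # interrupted: park load in buffer
            add(target.buffer, obj)
            if hitch is not None:
                add(target.buffer, hitch)
            return depth
        if target.buffer[0] > 0:            # take buffered objects along
            hitch = hitch or [0, 0.0, 0.0]
            add(hitch, target.buffer)
            target.buffer = [0, 0.0, 0.0]
        entries = target.child

a1, a2 = Entry([1, 0.0, 0.0]), Entry([1, 2.0, 4.0])
A = Entry([2, 2.0, 4.0], child=[a1, a2])
B = Entry([1, 10.0, 100.0], child=[Entry([1, 10.0, 100.0])])
root = [A, B]

anytime_insert(root, 2.2, max_levels=1)     # interrupted below the root
buffered_after_first = A.buffer[0]
anytime_insert(root, 1.8, max_levels=2)     # picks up the buffered object
```

The first insertion is interrupted and leaves its object in the buffer of `A`; the second one has more time, takes that buffer along as a hitchhiker, and both end up in the same leaf entry. At every point the CF of `A` equals the sum of its children's CFs plus its buffer.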
17. Maintaining an up-to-date view [Drift&Novelty]
Goal: Compatible with existing work on drift and novelty
New leaf entries get a unique ID
Aging by an exponential decay function $w(\Delta t) = \beta^{-\lambda \Delta t}$
Benefits of the employed decay function
Avoid splits by reusing insignificant entries
An entry’s CF still represents exactly its subtree and its buffer
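Lemma 1 (ClusTree Invariant): For each inner entry $e_s$ with timestamp $t + \Delta t$ and decay function $w(\Delta t) = 2^{-\lambda \Delta t}$ it holds
$e_s.CF^{(t + \Delta t)} = \left( w(\Delta t) \sum_{i=1}^{\nu_s} e_{s \circ i}.CF^{(t)} \right) + e_s.buffer^{(t + \Delta t)}$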
[Proof in the paper.]
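A minimal Python sketch of the aging step, assuming that decay is applied by scaling all three CF components with the weight $w(\Delta t)$ before a new object is added. The function names and the tuple layout `(N, LS, SS)` are illustrative; `beta=2.0` matches the base used in the lemma.

```python
def decay_weight(dt, lam, beta=2.0):
    """w(dt) = beta ** (-lam * dt)."""
    return beta ** (-lam * dt)

def aged(cf, dt, lam):
    """Scale every component of CF = (N, LS, SS) by the decay weight."""
    w = decay_weight(dt, lam)
    n, ls, ss = cf
    return (w * n, [w * v for v in ls], [w * v for v in ss])

def insert_with_decay(cf, dt, x, lam):
    """Age the CF by dt, then add the new object x with full weight 1."""
    n, ls, ss = aged(cf, dt, lam)
    return (n + 1,
            [l + xi for l, xi in zip(ls, x)],
            [s + xi * xi for s, xi in zip(ss, x)])

old = (4.0, [8.0], [20.0])
half = aged(old, dt=1.0, lam=1.0)                    # weight 0.5
new = insert_with_decay(old, dt=1.0, x=[6.0], lam=1.0)
```

Because all components are scaled by the same factor, aging alone leaves the mean LS/N unchanged; with base 2, an entry loses half of its weight after a time of 1/λ.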
18. Extensions of the ClusTree
Insertion of aggregates
for extremely fast streams
Iterative depth first descent
for slower streams
Local look ahead
to reduce overlapping
Explicit noise handling
and noise to cluster events
19. Evaluation – anytime clustering and aggregation
Data set: Forest Covertype
Anytime clustering (90,000 points per second, pps)
– 88% purity on leaf level
– Purity on higher levels corresponds to faster streams
– >70% purity starting three levels under root
Aggregation (varying streams)
– Purity drops under 70% at 150,000 pps
– Aggregation significantly improves the purity on the leaf level
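The slides report purity without spelling out the formula. The Python sketch below assumes the common definition, namely the fraction of objects that carry the majority class of the cluster they were assigned to; the evaluation in the presentation may use a variant (for example, an average over clusters). The function name `purity` and the sample data are illustrative.

```python
from collections import Counter

def purity(assignments):
    """assignments: list of (cluster_id, true_class) pairs.

    Returns the fraction of objects whose class is the majority class
    of their cluster.
    """
    by_cluster = {}
    for cluster_id, label in assignments:
        by_cluster.setdefault(cluster_id, Counter())[label] += 1
    majority = sum(max(c.values()) for c in by_cluster.values())
    return majority / len(assignments)

sample = [(0, "a"), (0, "a"), (0, "b"), (1, "b"), (1, "b")]
score = purity(sample)    # (2 + 2) / 5
```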
20. Evaluation – adaptive clustering
Setup for constant streams
– ClusTree: stream speed → maintainable number of micro-clusters (#MC)
– DenStream [SDM06] & CluStream [VLDB03]: #MC → processable points per second (pps)
ClusTree results: #MC is exponential (the number of distance computations, #dists, is logarithmic)
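The relation between #MC and #dists can be made concrete with a small worked example in Python. It assumes an idealized balanced tree in which every node is full; the fanout of 3 is chosen only for illustration and `tree_capacity` is a hypothetical helper. One full descent compares the object with every entry of one node per level, so the work grows linearly with the height while the number of leaf entries grows exponentially.

```python
def tree_capacity(fanout, height):
    """Upper bounds for a full balanced tree of the given height.

    Returns (leaf entries, i.e. micro-clusters; distance computations
    for one descent from root to leaf).
    """
    max_micro_clusters = fanout ** height
    distances_per_insert = fanout * height
    return max_micro_clusters, distances_per_insert

# (height, #MC, #dists) for heights 1 to 5 with fanout 3
rows = [(h, *tree_capacity(3, h)) for h in range(1, 6)]
```

For example, going from height 2 to height 5 raises the cost of one insertion from 6 to 15 distance computations, while the number of maintainable micro-clusters rises from 9 to 243.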
21. Agenda
I. The Anytime principle
Anytime algorithms for stream data mining
II. The ClusTree
Self-adaptive anytime stream clustering
III. The MOA Framework
An open source framework for stream mining algorithms
22. The MOA framework
Extensible open source software
– Data generators, file streams
– Stream mining algorithms
– Measure collection
Supported stream mining tasks
– Stream clustering, stream classification, outlier detection, …
Repeatable/benchmark settings
In collaboration with
23. References
Anytime SVM: DeCoste: Anytime Query-Tuned Kernel Machines via Cholesky Factorization. SDM, 2003
DeCoste et al.: Fast query-optimized kernel machine classification via incremental approximate nearest support vectors. ICML, 2003
Bayes (continuous data): Seidl et al.: Indexing density models for incremental learning and anytime classification on data streams. EDBT, 2009
Bayes (categorical): Yang et al.: Classifying under computational resource constraints: anytime classification using probabilistic estimators. Machine Learning, 2007
Anytime Nearest Neighbor: Ueno et al.: Anytime Classification Using the Nearest Neighbor Algorithm with Applications to Stream Mining. ICDM, 2006
Anytime + constant: Kranen et al.: Harnessing the strengths of anytime algorithms for constant data streams. DMKD Journal, 2009
ClusTree: Kranen et al.: Self-Adaptive Anytime Stream Clustering. ICDM, 2009
A complete list of references including stream clustering, MOA, evaluation, etc.: Kranen: Anytime Algorithms for Stream Data Mining. PhD Thesis, RWTH Aachen, 2011