The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining

Philipp Kranen (1), Ira Assent (2), Corinna Baldauf (1), Thomas Seidl (1)
(1) Data Management and Data Exploration Group, RWTH Aachen University, Germany
(2) Department of Computer Science, Aarhus University, Denmark
P. Kranen, I. Assent, C. Baldauf, T. Seidl – The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining

Motivating examples




   [Figure: processing chain of pre-classifier, full classifier, and professional decision, with the labels "emergency" and "normal".]

    Applications and tasks

    [Figure: applications grouped by task (classification, modeling, outlier detection) and by data rate (constant, varying).]

Agenda



  I.      The Anytime principle
           Anytime algorithms for stream data mining


  II.     The ClusTree
           Self-adaptive anytime stream clustering


  III. The MOA Framework
           An open source framework for stream mining algorithms





Definitions I

   Stream
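         A stream $S : \mathbb{N} \to D$ with arrival times $T : \mathbb{N} \to \mathbb{R}$ is an infinite sequence
         of objects $o_i \in D$ from a d-dimensional input space $D$, and
         $t_i \in \mathbb{R}$, $t_i \le t_{i+1}\ \forall i$, is the discrete arrival time of object $o_i$.
   Inter-arrival time
         The inter-arrival time between two consecutive objects $o_i$ and $o_{i+1}$
         is denoted as $\Delta t_i = t_{i+1} - t_i$, i.e. $0 \le \Delta t_i \in \mathbb{R}$.
   Constant and varying streams
         A stream $S$ is called constant $\leftrightarrow \Delta t_i = \Delta t_j\ \forall i, j$; otherwise it is called varying.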
   Stream algorithms
         – Online algorithms – the input is given one at a time
         – Budget algorithms – tailored to a specific time budget b
         – Anytime algorithms – provide a result after any amount of processing time
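
The definitions above can be illustrated with a short Python sketch that computes inter-arrival times for a finite prefix of a stream and checks whether that prefix is constant. The function names and the tolerance parameter `tol` are assumptions made for this illustration and do not come from the slides.

```python
from typing import List, Sequence


def inter_arrival_times(arrivals: Sequence[float]) -> List[float]:
    """Return the gaps t[i+1] - t[i] between consecutive arrival times."""
    return [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]


def is_constant(arrivals: Sequence[float], tol: float = 1e-9) -> bool:
    """A finite prefix is constant if all inter-arrival times agree within tol."""
    gaps = inter_arrival_times(arrivals)
    return all(abs(gap - gaps[0]) <= tol for gap in gaps)


# Hypothetical arrival times: one evenly spaced prefix, one with a burst.
constant_prefix = [0.0, 0.5, 1.0, 1.5]
varying_prefix = [0.0, 0.5, 0.6, 1.5]

print(is_constant(constant_prefix), is_constant(varying_prefix))
```

Only a finite prefix can ever be checked in practice, so the result says nothing about how the stream behaves later.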

Definitions II

   Budget Algorithms – tailored to a specific time budget
         – Available time < budget → no result
         – Available time > budget → idle times


   How should stream processing be done?




         – Little time → fast result
         – More time → use it to improve the result
         [Figure: result quality plotted over time.]

   Anytime Algorithms – provide a result after any time
            For a given input an anytime algorithm can provide a first result after a very
            short initialization time and it uses additional time to improve its result. The
            algorithm is interruptible after any time and will deliver the best result
            obtained until the point of interruption.
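
A minimal sketch of this interruptible behaviour, assuming a toy nearest-value search as the task: the generator yields its best result after every refinement step, and the caller stops consuming whenever it is interrupted. The names `anytime_nearest` and `run_with_interrupt` and the step counter that stands in for elapsed time are illustrative assumptions, not taken from the slides.

```python
from typing import Iterator, Sequence


def anytime_nearest(query: float, data: Sequence[float]) -> Iterator[float]:
    """Yield the best match found so far after every refinement step."""
    best = data[0]            # first result after a very short initialization
    yield best
    for candidate in data[1:]:
        if abs(candidate - query) < abs(best - query):
            best = candidate  # additional time improves the result
        yield best


def run_with_interrupt(query: float, data: Sequence[float], steps: int) -> float:
    """Consume at most `steps` (>= 1) refinements and return the best result so far."""
    result = data[0]
    for used, result in enumerate(anytime_nearest(query, data), start=1):
        if used >= steps:     # interruption point, e.g. the next object arrived
            break
    return result


data = [9.0, 4.0, 6.5, 5.1]
for budget in (1, 2, 4):
    print(budget, run_with_interrupt(5.0, data, budget))
```

With more steps the returned value moves closer to the query, and an interruption at any step still yields a usable answer.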

Anytime algorithms on constant streams

   Can we do better than using all available time? 

        Yes we can!
        [Figure: a constant data stream with arrival interval ta and time marks tf and td, with the labels type 1, type 2, …, type m.]




   Distribute computation time according to confidence values
         – Spend less time on confident items
         – Use additional time for uncertain objects


   Prerequisites
         – Anytime algorithm
         – Confidence measure
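
One way to picture this distribution of computation time is the following sketch, which repeatedly gives the next refinement step to the item with the lowest current confidence. The greedy policy, the function names, and the refinement model `halve_uncertainty` are hypothetical choices for illustration; the slides do not specify how the time is divided.

```python
import heapq
from typing import Callable, Dict


def allocate(initial_conf: Dict[str, float], extra_steps: int,
             refine: Callable[[float], float]) -> Dict[str, int]:
    """Spend `extra_steps` one at a time on the currently least confident item."""
    spent = {item: 0 for item in initial_conf}
    heap = [(conf, item) for item, conf in initial_conf.items()]
    heapq.heapify(heap)                       # min-heap: least confident first
    for _ in range(extra_steps):
        conf, item = heapq.heappop(heap)
        spent[item] += 1                      # one more refinement step for this item
        heapq.heappush(heap, (refine(conf), item))
    return spent


def halve_uncertainty(conf: float) -> float:
    """Hypothetical refinement model: each step halves the remaining uncertainty."""
    return 1.0 - (1.0 - conf) / 2.0


# Confidences after a first quick pass (made-up values).
first_pass = {"a": 0.9375, "b": 0.5, "c": 0.625}
print(allocate(first_pass, extra_steps=3, refine=halve_uncertainty))
```

The confident item `a` receives no extra steps, while the uncertain items `b` and `c` share the additional time.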

Existing anytime classification approaches

   Anytime support vector machines
   Anytime nearest neighbor classification
   Anytime Bayesian classification
          Categorical data
          Continuous data
   Others
          Anytime induction of decision trees
          Anytime A* algorithm
          Anytime clustering
          Anytime outlier detection


  [References on last slide.]

Sampling, buffering, anytime clustering

   What about sampling?
          Not appropriate for classification or outlier detection.


       What about buffering?
          Durations of bursts are unknown.


   Why anytime clustering?
          …
          “Smart buffering”
                 Use micro‐clusters as input for further analysis
                 Provide constant (maximal) granularity at regular intervals

Agenda



  I.      The Anytime principle
           Anytime algorithms for stream data mining


  II.     The ClusTree
           Self-adaptive anytime stream clustering


  III. The MOA Framework
           An open source framework for stream mining algorithms





Problem statement

       Clustering is a frequently used technique
              Provides an overview, reduces amount of data, groups similar objects
              Streaming scenario:
                 Use summaries (micro clusters) as input for further analysis
                 But: endless amounts of data (streams) are hard to handle



       Stream clustering challenges:
              Single pass clustering [Anytime]
              Limited time, varying time allowance [Anytime]
              Limited memory, yet least information loss [Fine grained]
              Evolving data [Drift&Novelty]
              Flexible number and size of clusters [Self-adaptive]


Related work

       Stream clustering approaches and paradigms
              Convex clustering approaches (k-center)
              Density-based, grid-based approaches
              kernels, graphs, fractal dimensions, …
              Process chunks, merge results
              Maintain list, remove oldest or merge closest pair
              Online and Offline component


       All approaches have to restrict themselves to the worst case time





Goals

       Anytime clustering [Anytime]
              Don’t miss any point, no matter at which speed

       Adaptive model size [Self-adaptive]
              Don’t restrict the model to worst-case assumptions

       Fine grained representation [Fine grained]
              Provide more detailed input for the offline component

       Compatible with existing work on drift and novelty [Drift&Novelty]
              Aging / Decay
              Snapshots / Drift & Novelty





ClusTree – basic idea

       Cluster features CF = (N, LS, SS) represent micro-clusters
              Allow computing statistics such as mean and variance
       Maintain a balanced hierarchical data structure
              Insert each new object into the closest subtree
              Insertion stops if the next object arrives
              The most detailed model is stored at leaf level
              The tree (= model) grows if more time is available
       [Figure: tree hierarchy with the labels "less time" and "more time".]





ClusTree structure and anytime insert [Fine grained, Anytime]


   Hierarchy of micro-clusters CF = (N, LS, SS)
   New objects (x1 … xd) are simply added to the cluster feature:
              $N = N + 1,\quad LS_i = LS_i + x_i,\quad SS_i = SS_i + x_i^2$
   Anytime insert: when interrupted, buffer the object in the entry's local buffer CF
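
The update rule for cluster features can be sketched in Python as follows. The class name `ClusterFeature` and its methods are assumptions for this illustration; only the tuple (N, LS, SS) and the update N = N + 1, LSi = LSi + xi, SSi = SSi + xi² come from the slides. The variance shown is the usual per-dimension formula SS/N − (LS/N)².

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterFeature:
    """CF = (N, LS, SS) for d-dimensional points."""
    n: float
    ls: List[float]
    ss: List[float]

    @classmethod
    def empty(cls, dim: int) -> "ClusterFeature":
        return cls(0.0, [0.0] * dim, [0.0] * dim)

    def add_point(self, x: List[float]) -> None:
        """N = N + 1, LS_i = LS_i + x_i, SS_i = SS_i + x_i^2."""
        self.n += 1
        for i, xi in enumerate(x):
            self.ls[i] += xi
            self.ss[i] += xi * xi

    def merge(self, other: "ClusterFeature") -> None:
        """CFs are additive, so a merge is a component-wise sum."""
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def mean(self) -> List[float]:
        return [ls_i / self.n for ls_i in self.ls]

    def variance(self) -> List[float]:
        """Per-dimension variance SS_i / N - (LS_i / N)^2."""
        return [ss_i / self.n - (ls_i / self.n) ** 2
                for ls_i, ss_i in zip(self.ls, self.ss)]


cf = ClusterFeature.empty(2)
cf.add_point([1.0, 2.0])
cf.add_point([3.0, 6.0])
print(cf.mean(), cf.variance())
```

Because the three components are plain sums, a buffered CF can later be merged into another entry without revisiting the original points.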

   [Figure: layout of an inner entry and a leaf entry, each holding n(t), LS1(t) … LSd(t), and SS1(t) … SSd(t); buffer components carry the subscript b.]

Buffer and hitchhiker [Self-adaptive]




   Buffer: interrupt insertion – aggregate objects on interrupt
   Hitchhiker: resume insertion – take buffer along (if same way)
            At most two objects descend together
       Tree grows through splitting nodes starting from the leaf
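
The interplay of buffer and hitchhiker can be sketched as below. This is a heavily simplified illustration under stated assumptions, not the authors' implementation: the tree is fixed (no splits), data is one-dimensional, a CF holds only (n, ls), and the hypothetical parameter `levels_allowed` stands in for the time available before the next object arrives. It also assumes that a hitchhiker whose closest entry differs from the object's is left behind at that entry.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CF:
    n: int = 0
    ls: float = 0.0

    def add(self, other: "CF") -> None:
        self.n += other.n
        self.ls += other.ls

    def center(self) -> float:
        return self.ls / self.n


@dataclass
class Entry:
    """Entry structure (CF, pointer, CFb)."""
    cf: CF
    child: Optional["Node"] = None            # None marks a leaf entry
    buffer: CF = field(default_factory=CF)    # CFb


@dataclass
class Node:
    entries: List[Entry]


def closest(node: Node, item: CF) -> Entry:
    return min(node.entries, key=lambda e: abs(e.cf.center() - item.center()))


def anytime_insert(root: Node, x: float, levels_allowed: int) -> None:
    obj, node = CF(1, x), root
    hitchhiker: Optional[CF] = None
    while True:
        target = closest(node, obj)
        if hitchhiker is not None:
            h_target = closest(node, hitchhiker)
            h_target.cf.add(hitchhiker)
            if h_target is not target:        # different way: leave it here
                if h_target.child is not None:
                    h_target.buffer.add(hitchhiker)
                hitchhiker = None
        target.cf.add(obj)
        if target.child is None:              # reached leaf level: insert done
            return
        levels_allowed -= 1
        if levels_allowed <= 0:               # interrupted: aggregate in buffer
            target.buffer.add(obj)
            if hitchhiker is not None:
                target.buffer.add(hitchhiker)
            return
        if target.buffer.n > 0:               # resume: take the buffer along
            if hitchhiker is None:
                hitchhiker = CF()
            hitchhiker.add(target.buffer)
            target.buffer = CF()
        node = target.child


# Hypothetical two-level tree with four leaf entries.
leaf_a = Node([Entry(CF(1, 1.0)), Entry(CF(1, 3.0))])
leaf_b = Node([Entry(CF(1, 10.0)), Entry(CF(1, 12.0))])
root = Node([Entry(CF(2, 4.0), leaf_a), Entry(CF(2, 22.0), leaf_b)])

anytime_insert(root, 1.2, levels_allowed=1)   # interrupted at the root entry
buffered_after_first = root.entries[0].buffer.n
anytime_insert(root, 2.9, levels_allowed=2)   # picks up the buffered object
```

After the second call the buffered object has travelled down as a hitchhiker, so the root entry's buffer is empty again and its CF still equals the sum of its leaf entries.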
       Entry structure: (CF, pointer, CFb)
       [Figure: descent of an insertion through the tree. Level 1: root; Level 2: hitchhike; Level 3: buffer; Level 4: insert.]

Maintaining an up-to-date view [Drift&Novelty]




       Goal: Compatible with existing work on drift and novelty
              New leaf entries get a unique ID
              Aging by an exponential decay function $w(\Delta t) = \beta^{-\lambda \Delta t}$
       Benefits of the employed decay function
              Avoid splits by reusing insignificant entries
              An entry’s CF still represents exactly its subtree and its buffer


             Lemma 1 (ClusTree Invariant): For each inner entry $e_s$ with timestamp $t + \Delta t$
             and decay function $w(\Delta t) = 2^{-\lambda \Delta t}$ it holds that
                           $e_s.CF^{(t + \Delta t)} = \left( w(\Delta t) \sum_{i=1}^{s} e_{s_i}.CF^{(t)} \right) + e_s.\mathit{buffer}^{(t + \Delta t)}$
             [Proof in the paper.]
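
The following sketch illustrates why an exponential decay fits the invariant: scaling a cluster feature by w(Δt) is linear, so aging a parent gives the same result as aging its children and summing them, and successive decays multiply. It is a numeric illustration under assumed values, not the proof from the paper; the helper names and the one-dimensional CF tuples are hypothetical.

```python
from typing import Tuple

CF = Tuple[float, float, float]  # (N, LS, SS) for one-dimensional data


def weight(dt: float, lam: float, beta: float = 2.0) -> float:
    """Exponential decay w(dt) = beta ** (-lam * dt)."""
    return beta ** (-lam * dt)


def decay(cf: CF, dt: float, lam: float) -> CF:
    """Age every component of a cluster feature by w(dt)."""
    w = weight(dt, lam)
    return (cf[0] * w, cf[1] * w, cf[2] * w)


def add(a: CF, b: CF) -> CF:
    """Cluster features are additive component by component."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


# Two child entries at time t and the parent CF that summarises them.
children = [(2.0, 4.0, 10.0), (1.0, 3.0, 9.0)]
parent = add(children[0], children[1])

# Aging the parent equals aging each child and summing afterwards.
lam, dt = 1.0, 2.0
aged_parent = decay(parent, dt, lam)
aged_children_sum = add(decay(children[0], dt, lam), decay(children[1], dt, lam))

# An object buffered at time t + dt is added undecayed, as in the lemma.
buffer_now = (1.0, 2.0, 4.0)
parent_now = add(aged_parent, buffer_now)
print(aged_parent, aged_children_sum, parent_now)
```

Since w(a) · w(b) = w(a + b), an entry can be brought up to date in one multiplication using the time since its last update.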


Extensions of the ClusTree

       Insertion of aggregates
          for extremely fast streams


       Iterative depth first descent
           for slower streams


       Local look ahead
          to reduce overlapping


       Explicit noise handling
          and noise to cluster events
       [Figure: four panels, a) to d), illustrating noise handling and noise-to-cluster events.]

Evaluation – anytime clustering and aggregation

       Dataset: Forest Covertype

       Anytime clustering (90,000 points per second, pps)
              88% purity on leaf level
              Purity on higher levels corresponds to faster streams
              >70% purity starting three levels under the root

       Aggregation (varying streams)
              Purity drops under 70% at 150,000 pps
              Aggregation significantly improves the purity on the leaf level
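
For readers unfamiliar with the purity measure, the sketch below computes one common variant: the share of objects that carry the majority class label of their cluster. Whether the evaluation uses exactly this weighted form is an assumption; the function name and the toy labels are illustrative.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def purity(cluster_ids: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Fraction of objects whose label is the majority label of their cluster."""
    members: Dict[Hashable, Counter] = {}
    for cid, label in zip(cluster_ids, labels):
        members.setdefault(cid, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(labels)


# Toy assignment: cluster 0 holds a, a, b and cluster 1 holds b, b.
print(purity([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"]))
```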

Evaluation – adaptive clustering




       Setup for constant streams
              ClusTree: stream speed → maintainable #MC (number of micro-clusters)
              DenStream [SDM06] & CluStream [VLDB03]: #MC → processable pps
       ClusTree results: #MC is exponential (#dists is logarithmic)
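
One way to read this result, assuming a balanced tree with fanout $m$ and height $h$ and taking #dists to be the number of distance computations per insertion (the symbols $m$ and $h$ are introduced here for illustration and do not appear on the slide): an insertion follows a single root-to-leaf path and compares the object with at most $m$ entries per node, while the leaf level can hold up to $m^h$ micro-clusters.

```latex
\#\mathrm{MC} \approx m^{h}
\qquad\Longrightarrow\qquad
\#\mathrm{dists} \approx m \cdot h = m \cdot \log_{m}\left(\#\mathrm{MC}\right)
```

Under these assumptions, each additional level multiplies the number of maintainable micro-clusters by $m$ but adds only about $m$ distance computations per insertion.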

Agenda



  I.      The Anytime principle
           Anytime algorithms for stream data mining


  II.     The ClusTree
           Self-adaptive anytime stream clustering


  III. The MOA Framework
           An open source framework for stream mining algorithms





The MOA framework

       Extensible open source software
         – Data generators, file streams

         – Stream mining algorithms

         – Measure collection

       Supported stream mining tasks
         – Stream clustering, stream classification, outlier detection, …

       Repeatable/benchmark settings

       In collaboration with

References

      Anytime SVM: DeCoste: Anytime Query-Tuned Kernel Machines via Cholesky Factorization. SDM, 2003
      DeCoste et al.: Fast query-optimized kernel machine classification via incremental approximate nearest support vectors. ICML, 2003
      Bayes (continuous data): Seidl et al.: Indexing density models for incremental learning and anytime classification on data streams. EDBT, 2009
      Bayes (categorical): Yang et al.: Classifying under computational resource constraints: anytime classification using probabilistic estimators. Machine Learning, 2007
      Anytime Nearest Neighbor: Ueno et al.: Anytime Classification Using the Nearest Neighbor Algorithm with Applications to Stream Mining. ICDM, 2006
      Anytime + constant: Kranen et al.: Harnessing the strengths of anytime algorithms for constant data streams. DMKD Journal, 2009
      ClusTree: Kranen et al.: Self-Adaptive Anytime Stream Clustering. ICDM, 2009
      A complete list of references including stream clustering, MOA, evaluation, etc.: Kranen: Anytime Algorithms for Stream Data Mining. PhD Thesis, RWTH Aachen, 2011
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

Presentation ucb 2012

  • 1. The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining
       Philipp Kranen (1), Ira Assent (2), Corinna Baldauf (1), Thomas Seidl (1)
       (1) Data Management and Data Exploration Group, RWTH Aachen University, Germany
       (2) Department of Computer Science, Aarhus University, Denmark
  • 2. Motivating examples
       [Figure: pipeline of pre-classifier, full classifier, and professional decision, with branches labeled "emergency" and "normal".]
  • 3. Applications and tasks
       [Figure: the tasks modeling, classification, and outlier detection, arranged by data rate (constant vs. varying).]
  • 4. Agenda
       I. The Anytime principle: anytime algorithms for stream data mining
       II. The ClusTree: self-adaptive anytime stream clustering
       III. The MOA Framework: an open source framework for stream mining algorithms
  • 5. Definitions I
       ▪ Stream: A stream $S : \mathbb{N} \to D \times T$, $i \mapsto (o_i, t_i)$, is an infinite sequence of objects $o_i \in D$ from a d-dimensional input space $D$, where $t_i \in T$ with $t_i \le t_{i+1}\ \forall i$ is the discrete arrival time of object $o_i$.
       ▪ Inter-arrival time: The inter-arrival time between two consecutive objects $o_i$ and $o_{i+1}$ is denoted as $\Delta t_i$, i.e. $0 \le \Delta t_i = t_{i+1} - t_i \in T$.
       ▪ Constant and varying streams: A stream $S$ is called constant $\leftrightarrow \Delta t_i = \Delta t_j\ \forall i, j$.
       ▪ Stream algorithms
         – Online algorithms: the input is given one object at a time
         – Budget algorithms: tailored to a specific time budget b
         – Anytime algorithms: provide a result after any amount of processing time
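
The definitions of inter-arrival time and constant streams can be illustrated with a few lines of Python. This is a minimal sketch; the function names `inter_arrival_times` and `is_constant` are hypothetical, and a finite list of timestamps stands in for the infinite stream.

```python
from typing import List, Sequence


def inter_arrival_times(arrival_times: Sequence[float]) -> List[float]:
    """Differences between consecutive arrival times (one per pair of objects)."""
    return [later - earlier for earlier, later in zip(arrival_times, arrival_times[1:])]


def is_constant(arrival_times: Sequence[float]) -> bool:
    """True if all inter-arrival times of the observed prefix are equal."""
    deltas = inter_arrival_times(arrival_times)
    return all(delta == deltas[0] for delta in deltas)
```

For real-valued timestamps an exact equality test is fragile; a tolerance would be a reasonable substitute.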
  • 6. Definitions II
       ▪ Budget algorithms: tailored to a specific time budget
         – Available time < budget → no result
         – Available time > budget → idle times
       ▪ How should stream processing be done?
         – Little time → fast result
         – More time → use it to improve the result
         [Figure: result quality plotted over time.]
       ▪ Anytime algorithms: provide a result after any time
         For a given input, an anytime algorithm can provide a first result after a very short initialization time, and it uses additional time to improve its result. The algorithm is interruptible after any time and will deliver the best result obtained until the point of interruption.
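
The anytime property described above (a quick first result, refinement with more time, interruptible at any point) can be sketched with a toy nearest-candidate search. This is an illustration only, not an algorithm from the talk; `anytime_nearest` and the `should_stop` callback are hypothetical names, and a step counter stands in for the arrival of the next stream object.

```python
from typing import Callable, Sequence, Tuple


def anytime_nearest(query: float,
                    candidates: Sequence[float],
                    should_stop: Callable[[int], bool]) -> Tuple[float, int]:
    """Return the closest candidate found before interruption and the steps used.

    Assumes `candidates` is non-empty.
    """
    best = candidates[0]   # first result after a very short initialization
    steps = 1
    for candidate in candidates[1:]:
        if should_stop(steps):   # interrupted: deliver the best result so far
            break
        if abs(candidate - query) < abs(best - query):
            best = candidate     # additional time improves the result
        steps += 1
    return best, steps
```

With `should_stop=lambda steps: False` the search runs to completion; a stricter callback returns an earlier, coarser answer.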
  • 7. Anytime algorithms on constant streams
       ▪ Can we do better than using all available time? Yes we can!
         [Figure: constant data stream of object types 1 to m with arrival interval t_a; further labels t_f and t_d.]
       ▪ Distribute computation time according to confidence values
         – Spend less time on confident items
         – Use additional time for uncertain objects
       ▪ Prerequisites
         – Anytime algorithm
         – Confidence measure
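
One way to picture distributing computation time by confidence is the sketch below. It is an assumption-laden illustration, not the method from the cited work: each object is refined step by step, processing stops early once a confidence threshold is reached, and unused steps are carried over to later objects. The function `allocate_steps`, the threshold, and the per-object confidence curves are all hypothetical.

```python
from typing import List, Sequence


def allocate_steps(confidence_curves: Sequence[Sequence[float]],
                   steps_per_arrival: int,
                   threshold: float) -> List[int]:
    """Return the number of refinement steps spent on each object.

    confidence_curves[i][k] is the (hypothetical) confidence for object i
    after k + 1 refinement steps of an anytime algorithm.
    """
    saved = 0        # steps left over from confident objects
    spent = []
    for curve in confidence_curves:
        budget = steps_per_arrival + saved
        used = 0
        for confidence in curve[:budget]:
            used += 1
            if confidence >= threshold:   # confident enough: stop early
                break
        saved = budget - used             # carry unused time forward
        spent.append(used)
    return spent
```

In this sketch a confident first object frees one step, which the uncertain second object then uses on top of its own budget.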
  • 8. Existing anytime classification approaches
       ▪ Anytime support vector machines
       ▪ Anytime nearest neighbor classification
       ▪ Anytime Bayesian classification
         – Categorical data
         – Continuous data
       ▪ Others
         – Anytime induction of decision trees
         – Anytime A* algorithm
         – Anytime clustering
         – Anytime outlier detection
       [References on last slide.]
  • 9. Sampling, buffering, anytime clustering
       ▪ What about sampling?
         – Not appropriate for classification or outlier detection.
       ▪ What about buffering?
         – Durations of bursts are unknown.
       ▪ Why anytime clustering?
         – …
         – "Smart buffering"
         – Use micro-clusters as input for further analysis
         – Provide constant (maximal) granularity at regular intervals
  • 10. Agenda
       I. The Anytime principle: anytime algorithms for stream data mining
       II. The ClusTree: self-adaptive anytime stream clustering
       III. The MOA Framework: an open source framework for stream mining algorithms
  • 11. Problem statement
       ▪ Clustering is a frequently used technique
         – Provides an overview, reduces the amount of data, groups similar objects
       ▪ Streaming scenario:
         – Use summaries (micro-clusters) as input for further analysis
         – But: endless amounts of data (streams) are hard to handle
       ▪ Stream clustering challenges:
         – Single pass clustering [Anytime]
         – Limited time, varying time allowance
         – Limited memory, yet least information loss [Fine grained]
         – Evolving data [Drift&Novelty]
         – Flexible number and size of clusters [Self-adaptive]
  • 12. Related work
       ▪ Stream clustering approaches and paradigms
         – Convex clustering approaches (k-center)
         – Density-based, grid-based approaches
         – Kernels, graphs, fractal dimensions, …
         – Process chunks, merge results
         – Maintain list, remove oldest or merge closest pair
         – Online and offline component
       ▪ All approaches have to restrict themselves to the worst case time
  • 13. Goals
       ▪ Anytime clustering [Anytime]
         – Don't miss any point, no matter at which speed
       ▪ Adaptive model size [Self-adaptive]
         – Don't restrict the model to worst case assumptions
       ▪ Fine grained representation [Fine grained]
         – Provide more detailed input for the offline component
       ▪ Compatible with existing work on drift and novelty [Drift&Novelty]
         – Aging / decay
         – Snapshots / drift & novelty
  • 14. ClusTree: basic idea
       ▪ Cluster features CF = (N, LS, SS) represent micro-clusters
         – Allow computing statistics like mean and variance
       ▪ Maintain a balanced hierarchical data structure
         – Insert a new object into the closest subtree
         – Insertion stops if the next object arrives
       ▪ Most detailed model is stored at leaf level
       ▪ Tree (= model) grows if more time is available
       [Figure: tree annotated with "less time" and "more time".]
  • 15. ClusTree structure and anytime insert [Fine grained] [Anytime]
       ▪ Hierarchy of micro-clusters CF = (N, LS, SS)
       ▪ New objects $(x_1 \dots x_d)$ are simply added to the cluster feature
         – $N = N + 1$, $LS_i = LS_i + x_i$, $SS_i = SS_i + (x_i)^2$
       ▪ Anytime insert: buffer the object in a local buffer CF
       [Figure: layout of an inner entry (CF plus buffer CF_b) and a leaf entry, each holding n, LS_1 … LS_d and SS_1 … SS_d at time t.]
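
The cluster feature update rule above, and the statistics it supports, can be written down directly. The following is a minimal sketch; the class name `ClusterFeature` and its method names are illustrative choices, not taken from the ClusTree implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusterFeature:
    """CF = (N, LS, SS): count, per-dimension linear sum, per-dimension squared sum."""
    n: float
    ls: List[float]
    ss: List[float]

    @classmethod
    def empty(cls, dims: int) -> "ClusterFeature":
        return cls(0.0, [0.0] * dims, [0.0] * dims)

    def add_point(self, x: Sequence[float]) -> None:
        """N = N + 1, LS_i = LS_i + x_i, SS_i = SS_i + x_i^2."""
        self.n += 1
        for i, xi in enumerate(x):
            self.ls[i] += xi
            self.ss[i] += xi * xi

    def merge(self, other: "ClusterFeature") -> None:
        """Cluster features are additive, so merging is component-wise addition."""
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def mean(self) -> List[float]:
        return [ls_i / self.n for ls_i in self.ls]

    def variance(self) -> List[float]:
        """Per-dimension variance: SS_i / N - (LS_i / N)^2."""
        return [ss_i / self.n - (ls_i / self.n) ** 2
                for ls_i, ss_i in zip(self.ls, self.ss)]
```

Because only sums are stored, a micro-cluster needs constant memory regardless of how many objects it has absorbed.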
  • 16. Buffer and hitchhiker [Self-adaptive]
       ▪ Buffer: interrupt insertion; aggregate objects on interrupt
       ▪ Hitchhiker: resume insertion; take the buffer along (if same way)
         – At most two objects descend together
       ▪ Tree grows through splitting nodes, starting from the leaf
       ▪ Entry structure: (CF, pointer, CF_b)
       [Figure: example descent with Level 1: root, Level 2: hitchhike, Level 3: buffer, Level 4: insert.]
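
The buffer and hitchhiker mechanism can be sketched as follows. This is a simplified illustration under several assumptions, not the authors' implementation: objects are one-dimensional, a level budget `max_levels` stands in for "the next object arrives", leaf insertion merges into the closest leaf entry instead of creating entries or splitting nodes, and all names (`CF`, `Entry`, `Node`, `anytime_insert`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CF:
    """One-dimensional cluster feature (N, LS, SS)."""
    n: float = 0.0
    ls: float = 0.0
    ss: float = 0.0

    def add(self, other: "CF") -> None:
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def mean(self) -> float:
        return self.ls / self.n


@dataclass
class Entry:
    cf: CF
    child: Optional["Node"] = None           # None marks a leaf entry
    buffer: CF = field(default_factory=CF)   # CF_b in (CF, pointer, CF_b)


@dataclass
class Node:
    entries: List[Entry]


def closest(node: Node, cf: CF) -> Entry:
    """Entry whose mean is nearest to the mean of `cf` (entries must be non-empty)."""
    return min(node.entries, key=lambda e: abs(e.cf.mean() - cf.mean()))


def anytime_insert(root: Node, x: float, max_levels: int) -> int:
    """Insert x, descending at most max_levels levels; return the level reached."""
    obj = CF(1.0, x, x * x)
    hitch: Optional[CF] = None
    node, level = root, 1
    while True:
        entry = closest(node, obj)
        if hitch is not None:
            other = closest(node, hitch)
            if other is not entry:
                # Different way: leave the hitchhiker at its own closest entry.
                other.cf.add(hitch)
                if other.child is not None:
                    other.buffer.add(hitch)
                hitch = None
        entry.cf.add(obj)
        if hitch is not None:
            entry.cf.add(hitch)
        if entry.child is None:
            return level                     # reached a leaf entry: insertion complete
        if level >= max_levels:
            # Interrupted: aggregate object (and hitchhiker) in the local buffer.
            entry.buffer.add(obj)
            if hitch is not None:
                entry.buffer.add(hitch)
            return level
        if entry.buffer.n > 0:
            # Resume earlier insertions: take the buffer along as a hitchhiker.
            carried = hitch if hitch is not None else CF()
            carried.add(entry.buffer)
            entry.buffer = CF()
            hitch = carried
        node, level = entry.child, level + 1
```

In this sketch every entry's CF stays equal to the sum of its children's CFs plus its buffer, so an interrupted insertion loses no information; it is only stored at a coarser level until a later object carries it down.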
  • 17. Maintaining an up-to-date view [Drift&Novelty]
       ▪ Goal: compatible with existing work on drift and novelty
         – New leaf entries get a unique ID
         – Aging by an exponential decay function $w(\Delta t) = \beta^{-\lambda \Delta t}$
       ▪ Benefits of the employed decay function
         – Avoid splits by reusing insignificant entries
         – An entry's CF still represents exactly its subtree and its buffer
       ▪ Lemma 1 (ClusTree Invariant): For each inner entry $e_s$ with timestamp $t + \Delta t$ and decay function $w(\Delta t) = 2^{-\lambda \Delta t}$ it holds
         $$e_s.CF^{(t+\Delta t)} = w(\Delta t) \Big( \sum_{i=1}^{s} e_{s_i}.CF^{(t)} \Big) + e_s.buffer^{(t+\Delta t)}$$
         [Proof in the paper.]
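
The exponential decay can be applied to a cluster feature by scaling all three components with the same weight. The sketch below assumes a CF stored as a plain `(N, LS, SS)` tuple for one dimension; the helper names `decay_weight`, `decay`, and `add` are hypothetical.

```python
from typing import Tuple

CFTuple = Tuple[float, float, float]   # (N, LS, SS) for one dimension


def decay_weight(dt: float, lam: float, beta: float = 2.0) -> float:
    """w(dt) = beta^(-lam * dt); with beta = 2 the weight halves every 1 / lam time units."""
    return beta ** (-lam * dt)


def decay(cf: CFTuple, dt: float, lam: float) -> CFTuple:
    """Age a cluster feature by scaling N, LS and SS with w(dt)."""
    w = decay_weight(dt, lam)
    return (w * cf[0], w * cf[1], w * cf[2])


def add(a: CFTuple, b: CFTuple) -> CFTuple:
    """Component-wise sum of two cluster features."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])
```

As a small check of the invariant: two child CFs `(2, 4, 10)` and `(4, 12, 40)` sum to `(6, 16, 50)`; after one time unit with `lam = 1` the decayed sum is `(3, 8, 25)`, and adding a buffered point `x = 3`, i.e. `(1, 3, 9)`, gives `(4, 11, 34)` for the parent entry.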
  • 18. Extensions of the ClusTree
       ▪ Insertion of aggregates for extremely fast streams
       ▪ Iterative depth-first descent for slower streams
       ▪ Local look-ahead to reduce overlapping
       ▪ Explicit noise handling and noise-to-cluster events
       [Figure: four panels a) to d) with entries labeled e, n, and L.]
  • 19. Evaluation: anytime clustering and aggregation (Forest Covertype)
       ▪ Anytime clustering (90,000 pps, i.e. points per second)
         – 88% purity on leaf level
         – Purity on higher levels corresponds to faster streams
         – >70% purity starting three levels under the root
       ▪ Aggregation (varying streams)
         – Purity drops under 70% at 150,000 pps
         – Aggregation significantly improves the purity on the leaf level
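
For readers unfamiliar with the purity measure, the sketch below shows one common definition: the fraction of objects that carry the majority class label of their cluster. Whether the evaluation in the talk weights clusters in exactly this way is an assumption; the function name `purity` is illustrative.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def purity(assignments: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Fraction of objects whose class label is the majority label of their cluster."""
    per_cluster: Dict[Hashable, Counter] = {}
    for cluster_id, label in zip(assignments, labels):
        per_cluster.setdefault(cluster_id, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(labels)
```

For example, clusters `[0, 0, 0, 1, 1, 1]` with labels `['a', 'a', 'b', 'b', 'b', 'b']` contain 2 and 3 majority-class objects, so 5 of the 6 objects count as pure.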
  • 20. Evaluation: adaptive clustering
       ▪ Setup for constant streams
         – ClusTree: stream speed → maintainable #MC
         – DenStream [SDM06] & CluStream [VLDB03]: #MC → processable pps
       ▪ ClusTree results: #MC is exponential (#dists is logarithmic)
  • 21. Agenda
       I. The Anytime principle: anytime algorithms for stream data mining
       II. The ClusTree: self-adaptive anytime stream clustering
       III. The MOA Framework: an open source framework for stream mining algorithms
  • 22. The MOA framework
       ▪ Extensible open source software
         – Data generators, file streams
         – Stream mining algorithms
         – Measure collection
       ▪ Supported stream mining tasks
         – Stream clustering, stream classification, outlier detection, …
       ▪ Repeatable/benchmark settings
       ▪ In collaboration with [partner logos on the slide]
  • 23. References
       ▪ Anytime SVM: DeCoste: Anytime Query-Tuned Kernel Machines via Cholesky Factorization. SDM, 2003
       ▪ DeCoste et al.: Fast query-optimized kernel machine classification via incremental approximate nearest support vectors. ICML, 2003
       ▪ Bayes (continuous data): Seidl et al.: Indexing density models for incremental learning and anytime classification on data streams. EDBT, 2009
       ▪ Bayes (categorical): Yang et al.: Classifying under computational resource constraints: anytime classification using probabilistic estimators. Machine Learning, 2007
       ▪ Anytime Nearest Neighbor: Ueno et al.: Anytime Classification Using the Nearest Neighbor Algorithm with Applications to Stream Mining. ICDM, 2006
       ▪ Anytime + constant: Kranen et al.: Harnessing the strengths of anytime algorithms for constant data streams. DMKD Journal, 2009
       ▪ ClusTree: Kranen et al.: Self-Adaptive Anytime Stream Clustering. ICDM, 2009
       ▪ A complete list of references including stream clustering, MOA, evaluation, etc.: Kranen: Anytime Algorithms for Stream Data Mining. PhD Thesis, RWTH Aachen, 2011