Efficient Data Stream Classification via Probabilistic Adaptive Windows
1. Efficient Data Stream Classification via Probabilistic Adaptive Windows
Albert Bifet¹, Jesse Read², Bernhard Pfahringer³, Geoff Holmes³
¹Yahoo! Research Barcelona
²Universidad Carlos III, Madrid, Spain
³University of Waikato, Hamilton, New Zealand
SAC 2013, 19 March 2013
3. Data Streams
Data Streams
Sequence is potentially infinite
High amount of data: sublinear space
High speed of arrival: sublinear time per example
Once an element from a data stream has been processed, it is discarded or archived
4. Data Streams
Approximation algorithms
Small error rate with high probability
An algorithm (ε, δ)-approximates F if it outputs F̃ for which
Pr[|F̃ − F| > εF] < δ.
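For example, an (ε, δ)-approximation with ε = 0.1 and δ = 0.05 returns an estimate within 10% of the true value F with probability at least 95%.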
5. Data Stream Sliding Window
Sampling algorithms
Giving equal weight to old and new examples: RESERVOIR SAMPLING (sketch below)
Giving more weight to recent examples: PROBABILISTIC APPROXIMATE WINDOW
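For contrast with PAW, which appears below, here is a standard reservoir-sampling sketch in Python (Algorithm R; the function name and default size are illustrative choices, not from the paper):

import random

def reservoir_sample(stream, size=1000):
    # Keep a uniform random sample of `size` items from a stream
    # of unknown length: item t replaces a random slot with
    # probability size/t (t counted from 1).
    sample = []
    for t, item in enumerate(stream, start=1):
        if t <= size:
            sample.append(item)
        else:
            r = random.randrange(t)
            if r < size:
                sample[r] = item
    return sample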
6. 8 Bits Counter
1 0 1 0 1 0 1 0
What is the largest number we can store in 8 bits?
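(With an exact binary counter the answer is 2^8 − 1 = 255; the following slides show how approximate counting does far better with the same 8 bits.)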
12. 8 Bits Counter
MORRIS APPROXIMATE COUNTING ALGORITHM
Init counter c ← 0
for every event in the stream
    do rand = random number between 0 and 1
       if rand < p
          then c ← c + 1
What is the largest number we can store in 8 bits?
13. 8 Bits Counter
MORRIS APPROXIMATE COUNTING ALGORITHM
Init counter c ← 0
for every event in the stream
    do rand = random number between 0 and 1
       if rand < p
          then c ← c + 1
With p = 1/2 we can count up to 2 × 256 events,
with standard deviation σ = √n/2
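(Here c follows a Binomial(n, 1/2) distribution, so E[c] = n/2 and Var(c) = n/4; the estimate n̂ = 2c then has standard deviation √n.)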
14. 8 Bits Counter
MORRIS APPROXIMATE COUNTING ALGORITHM
Init counter c ← 0
for every event in the stream
    do rand = random number between 0 and 1
       if rand < p
          then c ← c + 1
With p = 2^−c we have E[2^c] = n + 2,
with variance σ² = n(n + 1)/2
15. 8 Bits Counter
MORRIS APPROXIMATE COUNTING ALGORITHM
Init counter c ← 0
for every event in the stream
    do rand = random number between 0 and 1
       if rand < p
          then c ← c + 1
If p = b^−c then E[b^c] = n(b − 1) + b,
with variance σ² = (b − 1)n(n + 1)/2
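A minimal Python sketch of the Morris counter described above (the function name, the stream argument, and the inversion of E[b^c] into an estimate of n are illustrative choices, not from the paper):

import random

def morris_count(stream, b=2.0):
    # Morris approximate counting: keep a small counter c and
    # increment it with probability b^−c on every event.
    c = 0
    for _ in stream:
        if random.random() < b ** -c:
            c += 1
    # E[b^c] = n(b − 1) + b, so invert to estimate n
    return (b ** c - b) / (b - 1)

# Example: count a million events; c itself stays small enough for 8 bits
print(morris_count(range(1_000_000)))         # roughly 1e6
print(morris_count(range(1_000_000), b=1.1))  # smaller b: lower variance, larger c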
16. PROBABILISTIC APPROXIMATE WINDOW
Init window w ← ∅
for every instance i in the stream
    do store the new instance i in window w
       for every instance j in the window
           do rand = random number between 0 and 1
              if rand > b^−1
                 then remove instance j from window w
PAW maintains a sample of instances in logarithmic memory, giving greater weight to newer instances.
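A hedged Python sketch of one PAW update, following the pseudocode above (the function name and the default value of b are illustrative assumptions; b > 1, with values closer to 1 keeping a larger, longer-memory sample):

import random

def paw_update(window, instance, b=1.001):
    # Store the new instance, then keep every stored instance
    # with probability b^−1, as in the pseudocode above.
    window.append(instance)
    return [j for j in window if random.random() <= 1.0 / b]

# Usage: feed a stream through the window; old instances decay away
window = []
for x in range(10_000):
    window = paw_update(window, x)
print(len(window))  # for this sketch, expected size is about 1/(b − 1) = 1000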
17. Experiments: Methods
Abbr.    Classifier                      Parameters
NB       Naive Bayes
HT       Hoeffding Tree
HTLB     Leveraging Bagging with HT      n = 10
kNN      k Nearest Neighbour             w = 1000, k = 10
kNNW     kNN with PAW                    w = 1000, k = 10
kNNWA    kNN with PAW + ADWIN            w = 1000, k = 10
kNNLBW   Leveraging Bagging with kNNW    n = 10
The methods we consider. Leveraging Bagging methods use n models. kNNWA empties its window (of max size w) when drift is detected (using the ADWIN drift detector).
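To connect the pieces, a self-contained sketch of kNN prediction over such a window (illustrative Python, not the authors' MOA implementation; the window holds (features, label) pairs and Euclidean distance is assumed):

import math

def knn_predict(window, x, k=10):
    # Majority vote among the k stored instances nearest to x.
    neighbours = sorted(window, key=lambda j: math.dist(j[0], x))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

# Usage: window of (features, label) pairs, e.g. maintained by PAW
window = [((0.0, 0.0), 'a'), ((1.0, 1.0), 'b'), ((0.1, 0.2), 'a')]
print(knn_predict(window, (0.0, 0.1), k=3))  # -> 'a'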
21. Experimental Evaluation
Table: Summary of Efficiency: Accuracy and RAM-Hours.
          NB     HT     HTLB     kNN    kNNW   kNNWA   kNNLBW
Accuracy  56.19  73.95   83.75   82.59  82.92  83.19    84.67
RAM-Hrs    0.02   1.57  300.02    0.36   8.08   8.80   250.98
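(RAM-Hours is the resource-cost measure used in these evaluations: one RAM-Hour corresponds to one GB of RAM deployed for one hour. By this measure, kNNW and kNNWA gain accuracy over plain kNN at moderate cost, while the Leveraging Bagging variants pay orders of magnitude more.)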
22. Conclusions
Sampling algorithms for kNN
Giving equal weight to old and new examples: RESERVOIR SAMPLING
Giving more weight to recent examples: PROBABILISTIC APPROXIMATE WINDOW