Rare Time Series Motif Discovery from Unbounded Streams
Nurjahan Begum and Eamonn Keogh
VLDB 2015
Talk Outline
• Ubiquity of time series
• What are time series motifs?
• Rare Motif Discovery
• Conclusions
Time Series is Ubiquitous
Examples of time series data:
• Unstructured audio streams
• Sensors on a machine
• Shapes
• Handwriting
• Motion capture
• Human speech
• Web clicks
• Electrocardiograms
• Insect wingbeat sounds
What are time series motifs?
- Approximately repeated subsequences
- An example: activity recognition
[Figure: an activity recognition time series (time axis 0 to 1000) with segments labeled walking (three times), stretching, and vacuuming.]
Motifs are useful as a subroutine for:
- Classification
- Clustering
- Rule discovery
- Anomaly detection
Rare Motif Discovery
• Motivation
• Algorithms
– Brute Force
– Limited cache
• Performance Improvement
– Changing Data Representation
– Sticky Cache
• Experiments
Situations where current motif-finding algorithms can perform poorly or fail:
• The motif occurrences are far apart (they occur in different data chunks)
• The motif is infrequent (computationally expensive!)
Rare Motifs: A real life example
[Figure: solar panel current (mA, 0 to 40), shown for 131 to 127 days ago and for 3 days ago to now, with about four months omitted in between.]
A never-ending time series stream from a weather station’s solar panel,
only a fraction of which we can buffer.
A pattern we are observing now seems to have also occurred about
four months ago.
Brute Force Approach
Stream of items: I1, I2, I3, I4, …, Ik
The current item (item k + 1) is a motif pattern if we find some j < k + 1 such that D(k + 1, j) < T.
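A minimal sketch of the brute force test above, assuming D is the Euclidean distance between equal-length items and T is a user-chosen threshold. The function names and the toy stream are hypothetical and only illustrate the condition D(k + 1, j) < T with j < k + 1.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (assumed choice of D)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_first_motif(stream, threshold, dist=euclidean):
    """Return (j, k) for the first item k that matches an earlier item j, else None.

    Every item seen so far is kept, so memory grows without bound.
    Indices are 0-based positions in the stream.
    """
    seen = []
    for k, item in enumerate(stream):
        for j, earlier in enumerate(seen):
            if dist(item, earlier) < threshold:
                return j, k
        seen.append(item)
    return None


# Toy stream: the fourth item nearly repeats the first one.
stream = [[0.0, 1.0], [5.0, 5.0], [9.0, 2.0], [0.1, 1.1]]
match = brute_force_first_motif(stream, threshold=0.5)
print(match)  # (0, 3)
```

Because each new item is compared with every earlier item, this is only workable when the whole history fits in memory.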
Brute Force with Limited Memory
• A cache of fixed size w
• Success metric: expected number of objects we see before we report success
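The limited-memory variant can be sketched as follows. The slides only state that the cache has a fixed size w; the uniformly random eviction used here, the Euclidean distance, and all names are assumptions made for illustration. The return value is the number of objects seen before success, which is the quantity the success metric averages.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (assumed choice of D)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def objects_seen_before_success(stream, w, threshold, rng=None):
    """Count items read until one matches a cached item; None if that never happens.

    The cache holds at most w items. When it is full, a uniformly random
    slot is overwritten (an assumed policy, not one taken from the slides).
    """
    if rng is None:
        rng = random.Random(0)
    cache = []
    for seen, item in enumerate(stream, start=1):
        if any(euclidean(item, cached) < threshold for cached in cache):
            return seen
        if len(cache) < w:
            cache.append(item)
        else:
            cache[rng.randrange(w)] = item
    return None


stream = [[0.0, 1.0], [5.0, 5.0], [9.0, 2.0], [0.1, 1.1]]
print(objects_seen_before_success(stream, w=10, threshold=0.5))  # 4
print(objects_seen_before_success(stream, w=1, threshold=0.5))   # None
```

With w=1 the first item has already been evicted when its repeat arrives, so the motif is missed. Averaging the returned count over many runs would estimate the expected number of objects seen before success.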
Changing Data Representation
Emulating a virtually large cache
- Downsampling the data
- Reducing the dimensionality of the data
- Reducing the cardinality of the data
[Figure: expected number of objects processed before success (y-axis, 0 to 12000) versus virtual cache size (x-axis), with curves for dimensionality reduction, cardinality reduction, and downsampling.]
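The three representation changes can be illustrated with a small sketch. The slides do not name specific techniques, so the choices here are assumptions: keeping every n-th point for downsampling, piecewise aggregate approximation (segment means) for dimensionality reduction, and uniform quantization into a few integer symbols for cardinality reduction. All function names are hypothetical.

```python
def downsample(series, factor):
    """Keep every `factor`-th point of the series."""
    return series[::factor]


def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each equal-width segment."""
    n = len(series)
    if n % segments != 0:
        raise ValueError("series length must be a multiple of segments")
    width = n // segments
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]


def quantize(series, levels, lo, hi):
    """Map each value to one of `levels` integer symbols spread evenly over [lo, hi]."""
    step = (hi - lo) / levels
    return [min(levels - 1, max(0, int((x - lo) / step))) for x in series]


series = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(downsample(series, 2))          # [0.0, 2.0, 4.0, 6.0]
print(paa(series, 4))                 # [0.5, 2.5, 4.5, 6.5]
print(quantize(series, 4, 0.0, 8.0))  # [0, 0, 1, 1, 2, 2, 3, 3]
```

The first two store fewer values per item and the third stores fewer bits per value, so more items fit in the same memory budget. That is the sense in which the cache becomes virtually larger.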
Sticky Cache
• A magic cache where potential motif patterns tend to remain for longer
• Biased cache replacement policy
[Figure: probability of success (y-axis) versus number of objects seen before success (x-axis, 0 to 1500), with curves P1, P50, and P100.]
P100 = probability of discarding an element from R is 100 times greater
P50 = probability of discarding an element from R is 50 times greater
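One way to picture a biased replacement policy is weighted random eviction. This sketch assumes that R denotes the cache entries tagged as random patterns and that such an entry is `bias` times more likely to be discarded than an entry tagged as a potential motif. The function name and the tag layout are hypothetical.

```python
import random


def pick_victim(is_candidate, bias, rng):
    """Choose the index of the cache entry to evict.

    is_candidate[i] is True when entry i is tagged as a potential motif.
    Entries tagged as random patterns get eviction weight `bias`, and
    potential motifs get weight 1, so bias=1 gives a uniform policy.
    """
    weights = [1.0 if candidate else float(bias) for candidate in is_candidate]
    return rng.choices(range(len(is_candidate)), weights=weights, k=1)[0]


rng = random.Random(42)
tags = [True, True, False, True]  # only entry 2 is tagged as a random pattern
evictions = [pick_victim(tags, bias=100, rng=rng) for _ in range(1000)]
share = evictions.count(2) / len(evictions)
print(f"entry 2 evicted in {share:.0%} of draws")
```

With a bias of 100 and three sticky entries, the random-pattern entry is expected to be chosen in about 100/103 of the draws, so potential motifs tend to survive much longer in the cache.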
Sticky Cache
Algorithm for detecting potential motifs
– Discretize each time series subsequence
– Query the Bloom filter for the instance in question
• If the Bloom filter saw the instance before
– Tag it as a potential motif pattern
• Else
– Tag it as a random pattern
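The tagging steps above can be sketched with a small hand-rolled Bloom filter. Everything below is illustrative. The filter size, the number of hash functions, the SHA-256 based hashing, the four-level discretization over [0, 1], and the choice to add each unseen word to the filter right after the query are all assumptions that the slides do not specify.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive several bit positions by salting the key with the hash index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = 1

    def __contains__(self, key):
        return all(self.bits[position] for position in self._positions(key))


def discretize(subsequence, levels=4, lo=0.0, hi=1.0):
    """Turn a subsequence into a short word over the symbols 'a', 'b', ..."""
    step = (hi - lo) / levels
    symbols = (min(levels - 1, max(0, int((x - lo) / step))) for x in subsequence)
    return "".join(chr(ord("a") + s) for s in symbols)


def tag(subsequence, bloom):
    """Tag a subsequence as 'potential motif' if its word was seen before."""
    word = discretize(subsequence)
    if word in bloom:
        return "potential motif"
    bloom.add(word)
    return "random"


bloom = BloomFilter()
first = tag([0.10, 0.90, 0.40], bloom)   # word "adb", not seen yet
second = tag([0.12, 0.88, 0.45], bloom)  # same word "adb"
print(first, "/", second)  # random / potential motif
```

A Bloom filter can report a word as seen when it was not, so a few random patterns may be tagged as potential motifs. In this sketch that would only make such an entry stickier in the cache.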
Which approach is best?
Comparison of all approaches in commensurate scale
[Figure: expected number of elements seen before success (y-axis) versus virtual cache size (x-axis), with curves for downsampling, dimensionality reduction, cardinality reduction, and cardinality reduction with sticky cache.]
Case Studies
Time series: Dishwasher + Refrigerator
Motif length: 160 (2 hrs 40 mins)
Sampling rate: 0.017 Hz
[Figure: dishwasher patterns on Day 11 and Day 19 (section between them omitted), and timelines from Day 70 to Day 350 comparing the ground truth with the motifs detected.]
White-crowned Sparrow (Zonotrichia leucophrys)
Time series length: 2245824 (10 hours)
Sampling frequency: 62.3 Hz
Motif length: 188 (3 sec)
[Figure: recording excerpt from 37 minutes to 140 minutes (section omitted), with three motif panels on a 0 to 200 axis: A (36 min 54 sec, 2.3 hours), B (1 min 57 sec), and C (31 min 27 sec).]
Dataset: NPR August 01, 2013
Time Series Length: 29 hr 21 min 57 sec
MFCC space length: 6596741 (about 6.6 million)
Sampling Frequency: 62.4 Hz
Motif Length: 4 sec
Conclusions
• We address the problem of detecting rare motifs
– Changing Data representation
– Sticky Cache
• All the code and data for this paper are publicly available!
Thank you!
