HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms with Concept Drift

Presented at ICDM 2017


  1. HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms with Concept Drift
     Dingqi Yang*, Bin Li†, Laura Rettig*, Philippe Cudré-Mauroux*
     *eXascale Infolab, University of Fribourg, Switzerland
     †School of Computer Science, Fudan University, Shanghai, China
  2. What kind of location is this?
     [Figure: the histogram of an unknown location is compared with the histograms of "Places I've been" (Bar, University, Museum, Supermarket) by computing their similarity.]
  3. Motivation
     • Histogram similarity: foundation for many machine learning tasks
     • Cardinality of histograms over data streams continuously increases
     • Similarity-preserving data sketches
       • Compact, fixed size
       • Preserve similarity under a certain measure
       • Are incrementally updateable
     • Concept drift: distribution of a histogram changes over time
       • If taken into account, it can improve the accuracy of histogram-based similarity techniques
       • Typical method: gradual forgetting
  4. Background
     Given a data stream of incoming elements $x_t$, each with a weight $w_t$, we compute a histogram $V$ such that $V_i$ is the weighted cumulative count of the element $i$.
     [Figure: streaming histogram elements ..., $x_{t-2}$, $x_{t-1}$, $x_t$ (each $x_t$ with weight $w_t$) and the corresponding histogram $V$.]
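The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `build_histogram` and the example stream are hypothetical, and the stream is assumed to be an iterable of `(element, weight)` pairs.

```python
from collections import defaultdict


def build_histogram(stream):
    """Accumulate weighted counts: V[i] is the sum of the weights of all occurrences of i."""
    V = defaultdict(float)
    for element, weight in stream:
        V[element] += weight
    return dict(V)


# Hypothetical stream of (element x_t, weight w_t) pairs.
stream = [("bar", 1.0), ("museum", 0.5), ("bar", 2.0)]
V = build_histogram(stream)
```

Note that this keeps one counter per distinct element, so its size grows with the stream, which is the problem the sketch is meant to address.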
  5. Problem Formulation
     • Create and maintain the similarity-preserving sketch $S$ for the full streaming histogram $V$ such that
       • each sketch has a fixed size $K$ ($K \ll |\mathcal{E}|$);
       • the collision probability between two sketches $S^a$ and $S^b$ is the normalized similarity between the histograms $V^a$ and $V^b$, so the fraction of positions at which $S^a$ and $S^b$ agree (one minus their normalized Hamming distance) approximates $Sim_{NMM}(V^a, V^b)$;
       • the sketch $S(t+1)$ can be efficiently computed from the incoming histogram element $x_{t+1}$, $S(t)$, and a weight decay factor $\lambda$.
     [Figure: incremental updating: when a new element $x_{t+1}$ is received, $S(t+1)$ is computed from $x_{t+1}$, $S(t)$, and $\lambda$.]
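As a minimal illustration of the second requirement, the similarity estimate can be read off two sketches by counting the positions where they agree. The function name `estimate_similarity` is hypothetical, and the sketches are assumed to be equal-length sequences of element identifiers.

```python
def estimate_similarity(sketch_a, sketch_b):
    """Fraction of sketch positions j where S_j^a == S_j^b (Hamming similarity)."""
    if len(sketch_a) != len(sketch_b) or len(sketch_a) == 0:
        raise ValueError("sketches must be non-empty and of equal length K")
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


# Two hypothetical sketches of length K = 5 that agree in 3 positions.
similarity = estimate_similarity([3, 1, 4, 1, 5], [3, 1, 4, 2, 6])
```

The cost of this comparison depends only on the sketch length $K$, not on the number of distinct histogram elements.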
  6. HistoSketch
     • Based on the idea of consistent weighted sampling
       • Generate samples such that the probability of drawing identical samples from two vectors is equal to their min-max similarity.
       • The method draws three random variables $r_{i,j} \sim Gamma(2,1)$, $c_{i,j} \sim Gamma(2,1)$, $\beta_{i,j} \sim Uniform(0,1)$ and then computes
         $$y_{i,j} = \exp\left(r_{i,j}\left(\left\lfloor \frac{\log V_i}{r_{i,j}} + \beta_{i,j} \right\rfloor - \beta_{i,j}\right)\right)$$
         which is used as input to the random hash value generation.
  7. HistoSketch
     • We propose a new method to compute $y_{i,j}$:
       $$y_{i,j} = \exp(\log V_i - r_{i,j} \beta_{i,j})$$
     • and show that this method is 1) correct and 2) scale-invariant.
     Sketch creation, with hash value $a_{i,j} = \frac{c_{i,j}}{y_{i,j} \exp(r_{i,j})}$:
     1. compute $y_{i,j}$
     2. compute hash value $a_{i,j}$
     3. set $S_j = \arg\min_{i \in \mathcal{E}} a_{i,j}$
     4. set $A_j = \min_{i \in \mathcal{E}} a_{i,j}$
     [Figure: computing hash values for histogram $V$ with $i = 1, \dots, 5$: $a_{i,j}$ = 0.7, 0.6, 0.14, 0.21, 0.41; the minimum gives sketch element $S_j = 3$ with the corresponding hash value $A_j = 0.14$.]
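The four sketch-creation steps can be sketched with NumPy as follows. This is an illustrative reading of the slide, not the authors' implementation: the function name `histosketch`, the dense array representation of $V$, and the use of a seeded generator to share $r$, $c$, $\beta$ between histograms are all assumptions.

```python
import numpy as np


def histosketch(V, K, seed=0):
    """Return (S, A): sketch elements and their hash values for histogram V."""
    V = np.asarray(V, dtype=float)
    n = V.size
    # Same seed => same r, c, beta for every histogram, which keeps sketches comparable.
    rng = np.random.default_rng(seed)
    r = rng.gamma(2.0, 1.0, size=(K, n))
    c = rng.gamma(2.0, 1.0, size=(K, n))
    beta = rng.uniform(0.0, 1.0, size=(K, n))

    # Elements with zero weight can never be selected, so their hash value stays +inf.
    a = np.full((K, n), np.inf)
    pos = V > 0
    # Step 1: y_ij = exp(log V_i - r_ij * beta_ij)
    y = np.exp(np.log(V[pos]) - r[:, pos] * beta[:, pos])
    # Step 2: a_ij = c_ij / (y_ij * exp(r_ij))
    a[:, pos] = c[:, pos] / (y * np.exp(r[:, pos]))
    # Steps 3 and 4: S_j = argmin_i a_ij, A_j = min_i a_ij
    S = a.argmin(axis=1)
    A = a.min(axis=1)
    return S, A


# Hypothetical histogram with five elements, one of them empty.
S, A = histosketch([0.5, 0.0, 2.0, 1.5, 3.0], K=64, seed=7)
```

Because $y_{i,j}$ is proportional to $V_i$, multiplying the whole histogram by a constant divides every $a_{i,j}$ by that constant and leaves the selected elements $S_j$ unchanged, which is the scale-invariance mentioned on the slide.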
  8. HistoSketch: Incremental Sketch Update
     Computation of sketch $S(t+1)$ relies only on $S(t)$ (with its corresponding hash values $A(t)$), an incoming element $x_{t+1}$, and the weight decay factor $\lambda$.
     1. scale existing elements in $A$ (Step I: scaling $V(t)$ by $e^{-\lambda}$ adjusts each hash value to $A_j(t) / e^{-\lambda}$)
     2. add $i'$ to the histogram (Step II: add $x_{t+1}$)
     3. recompute $a_{i',j}$
     4. update sketch $S_j$ and hash values $A_j$ with the minimum $a_j$ (Step III: update sketch)
     [Figure: example with sketch element $S_j(t) = 3$ and original hash value $A_j(t) = 0.14$, adjusted to $0.14 \times 1/e^{-\lambda} = 0.147$; the incoming element $i' = 2$ has hash value $a_{2,j} = 0.142$, which is the new minimum, so $S_j(t+1) = 2$ and $A_j(t+1) = 0.142$.]
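A minimal sketch of the incremental update, under several assumptions that are not in the slides: the class name `IncrementalSketch` and helper `hash_values` are hypothetical, element identifiers are non-negative integers, weights are positive, the per-element random variables are regenerated from a seed, and the histogram is kept exactly in a dictionary rather than in the count-min sketch the authors describe.

```python
import math

import numpy as np


def hash_values(i, v, K, seed=0):
    """Hash values a_{i,j}, j = 1..K, for element i with weight v > 0."""
    rng = np.random.default_rng([seed, i])  # reproducible r, c, beta per element
    r = rng.gamma(2.0, 1.0, K)
    c = rng.gamma(2.0, 1.0, K)
    beta = rng.uniform(0.0, 1.0, K)
    y = np.exp(math.log(v) - r * beta)
    return c / (y * np.exp(r))


class IncrementalSketch:
    def __init__(self, K, lam, seed=0):
        self.K, self.lam, self.seed = K, lam, seed
        self.V = {}                  # exact decayed histogram (assumption, see lead-in)
        self.S = np.full(K, -1)      # sketch elements S_j, -1 while empty
        self.A = np.full(K, np.inf)  # hash values A_j

    def update(self, i, w=1.0):
        decay = math.exp(-self.lam)
        # Step I: scale V(t) by e^{-lambda}; hash values scale by 1 / e^{-lambda}.
        for key in self.V:
            self.V[key] *= decay
        self.A /= decay
        # Step II: add the incoming element and recompute its hash values.
        self.V[i] = self.V.get(i, 0.0) + w
        a_new = hash_values(i, self.V[i], self.K, self.seed)
        # Step III: keep the minimum per sketch position.
        better = a_new < self.A
        self.S[better] = i
        self.A[better] = a_new[better]


sketch = IncrementalSketch(K=16, lam=0.05)
for x in [3, 1, 3, 2, 5, 1, 1, 4, 3, 2, 2, 5]:
    sketch.update(x)
```

Only the incoming element needs new hash values, because decaying all other weights by the same factor rescales their hash values uniformly and cannot change which of them is smallest.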
  9. Experimental Evaluation
     • Classification task
       • Given labeled streaming histograms, classify the unlabeled histogram instances
       • A KNN classifier with $K = 5$ neighbors takes data in the form of sketches for classification
       • KNN takes the most up-to-date training data for classification from continuously updated sketches
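For illustration, a KNN classifier over sketches can use the fraction of matching sketch positions as its similarity. The function name `knn_predict`, the tie-breaking by training order, and the toy data are assumptions made for this sketch; the slides do not describe the classifier's internals.

```python
from collections import Counter

import numpy as np


def knn_predict(train_sketches, train_labels, query, k=5):
    """Majority vote among the k training sketches most similar to the query."""
    train = np.asarray(train_sketches)
    # Similarity = fraction of positions where the sketches agree.
    sims = (train == np.asarray(query)).mean(axis=1)
    # Stable sort keeps training order among equally similar sketches.
    nearest = np.argsort(-sims, kind="stable")[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training sketches of length 4 with place-category labels.
train = [[1, 2, 3, 4], [1, 2, 3, 5], [9, 9, 9, 9], [9, 9, 9, 8], [1, 2, 0, 0]]
labels = ["bar", "bar", "museum", "museum", "bar"]
prediction = knn_predict(train, labels, [1, 2, 3, 4], k=3)
```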
  10. Experimental Evaluation
      • Synthetic dataset
        • Generated from two Gaussian distributions representing two classes
        • Simulate data streams with concept drift
          • Abrupt: one stream starts to receive all elements from the other distribution
          • Gradual: one stream starts to receive elements from the other distribution with increasing probability, and the labels change
      • Criteria:
        1. How well is the similarity approximated? (impact of sketch length $K$)
        2. How fast can it adapt to concept drift? (impact of weight decay factor $\lambda$)
  11. Experimental Evaluation
      1. Impact of sketch length $K$
      • Fix $\lambda = 0.02$ and vary $K = [20, 50, 100, 200, 500, 1000]$
      • Compare against two methods that retain the full histograms:
        • Histogram-Classical with unweighted elements
        • Histogram-Forgetting with gradual forgetting weights
      • A sketch length of $K = 500$ is sufficient to approximate Histogram-Forgetting
  12. Experimental Evaluation
      2. Impact of weight decay factor $\lambda$
      • Fix $K = 100$ and vary $\lambda = [0, 0.005, 0.01, 0.02, 0.05, 0.1]$
      • Compare against Histogram-LatestK, which builds a histogram from the latest $K = 100$ elements in the stream (unweighted)
      • Similarity computation time:
        • HistoSketch: 13 ms
        • Histogram-LatestK: 133 ms
  13. Experimental Evaluation
      • POI dataset
        • Infer a place's category from its customers' visiting pattern
        • Foursquare dataset: user check-ins for two years from NYC, TKY, IST
        • Data: user-time visit pairs discretized to the 168 hours in a week
      • Compared methods:
        • Histogram-Coarse: discretized time slots are considered as histogram elements
        • Histogram-Fine-Classical: user-time pairs are considered as histogram elements
        • Histogram-Fine-LatestK: only the latest $K$ histogram elements
        • Histogram-Fine-Forgetting: gradual forgetting weights ($\lambda = 0.01$)
        • POISketch: unweighted sketching method that approximates Histogram-Fine-Classical
        • HistoSketch: approximates Histogram-Fine-Forgetting ($\lambda = 0.01$)
      • Fix $K = 100$
  14. Experimental Evaluation
      [Figure: classification accuracy.]
  15. Experimental Evaluation
      [Figure: runtime performance (classification time).]
  16. Conclusion
      • We introduced HistoSketch, an efficient similarity-preserving sketching method for streaming histograms with concept drift.
      • We demonstrated the effectiveness in approximating normalized min-max similarity.
      • We use incremental updates to the sketches with gradual forgetting to adapt to concept drift.
      • We showed on both synthetic and real-world data sets that this method effectively and efficiently approximates similarity and adapts to concept drift.
      • We observed a speed-up of 7500x on classification with a small loss of accuracy of around 3.5%.
      Thank you!
  17. Backup: Histogram Similarity
      • Min-max similarity
        $$Sim_{MM}(V^a, V^b) = \frac{\sum_{i \in \mathcal{E}} \min(V_i^a, V_i^b)}{\sum_{i \in \mathcal{E}} \max(V_i^a, V_i^b)}$$
      • …normalized: sum-to-one normalization $\sum_{i \in \mathcal{E}} V_i^a = 1$, $\sum_{i \in \mathcal{E}} V_i^b = 1$
      • The collision probability between two sketches $S^a$, $S^b$ is exactly the normalized min-max similarity between $V^a$, $V^b$:
        $$\Pr[S_j^a = S_j^b] = Sim_{NMM}(V^a, V^b)$$
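The two similarity measures can be computed exactly from full histograms as in the sketch below. The function names and the dictionary representation of a histogram are assumptions made for illustration.

```python
def min_max_similarity(va, vb):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    elements = set(va) | set(vb)
    num = sum(min(va.get(i, 0.0), vb.get(i, 0.0)) for i in elements)
    den = sum(max(va.get(i, 0.0), vb.get(i, 0.0)) for i in elements)
    return num / den


def normalized_min_max_similarity(va, vb):
    """Min-max similarity after scaling each histogram to sum to one."""
    sum_a, sum_b = sum(va.values()), sum(vb.values())
    na = {i: v / sum_a for i, v in va.items()}
    nb = {i: v / sum_b for i, v in vb.items()}
    return min_max_similarity(na, nb)


# Hypothetical histograms over elements x, y, z.
Va = {"x": 1.0, "y": 2.0}
Vb = {"x": 2.0, "y": 1.0, "z": 1.0}
sim_mm = min_max_similarity(Va, Vb)               # (1 + 1 + 0) / (2 + 2 + 1)
sim_nmm = normalized_min_max_similarity(Va, Vb)   # (1/3 + 1/4) / (1/2 + 2/3 + 1/4)
```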
  18. Backup: HistoSketch Implementation
      • The former histogram $V(t)$ is required to compute $V(t+1)$
      • The previous histogram is maintained in a modified count-min sketch $Q$
      • We extend the count-min sketch with decay weights by scaling all counters: $Q(t) \cdot e^{-\lambda}$
      • Parameter configuration: $d = 10$, $g = 50$ guarantees an error of at most 4% with probability 0.999
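One possible shape for such a decayed count-min sketch is shown below. It is a sketch under assumptions rather than the authors' code: the class name `DecayedCountMin` is hypothetical, $d$ is read as the number of hash rows and $g$ as the number of counters per row, and salted BLAKE2b digests stand in for the hash functions.

```python
import hashlib
import math

import numpy as np


class DecayedCountMin:
    def __init__(self, d=10, g=50):
        self.d, self.g = d, g          # assumed: d hash rows, g counters per row
        self.Q = np.zeros((d, g))

    def _columns(self, item):
        """One counter index per row, derived from a row-salted hash of the item."""
        key = repr(item).encode("utf-8")
        cols = []
        for row in range(self.d):
            digest = hashlib.blake2b(
                key, digest_size=8, salt=row.to_bytes(2, "little")
            ).digest()
            cols.append(int.from_bytes(digest, "little") % self.g)
        return cols

    def decay(self, lam):
        """Gradual forgetting: scale all counters, Q(t) * e^{-lambda}."""
        self.Q *= math.exp(-lam)

    def add(self, item, weight=1.0):
        for row, col in enumerate(self._columns(item)):
            self.Q[row, col] += weight

    def query(self, item):
        """Estimated decayed weight: the minimum counter over all rows."""
        return min(self.Q[row, col] for row, col in enumerate(self._columns(item)))


cm = DecayedCountMin(d=10, g=50)
cm.add("a")
cm.decay(0.1)
cm.add("a")
estimate = cm.query("a")
```

With non-negative weights, hash collisions can only inflate a counter, so the estimate never falls below the true decayed weight.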
  19. Backup: Experimental Evaluation
      [Figure: classification accuracy over time.]
  20. Backup: Future Work
      • The way to compute $a_{i,j}$ can be further simplified
      • Applications to other domains: e.g., recommendation, community detection
