[RakutenTechConf2013] [D-3_2] Counting Big Data by Streaming Algorithms
1. Counting Big Data
by Streaming Algorithms
2013/10/26 @ Rakuten Technology Conference 2013
Rakuten Institute of Technology, Rakuten, Inc.,
Yusaku Kaneta
http://www.rakuten.co.jp/
2. Who am I?
• Yusaku Kaneta (@yusakukaneta)
– Joined Rakuten in April 2012.
– Rakuten Institute of Technology (RIT)
• Interests:
– String processing (esp., Pattern matching)
– Hardware design using FPGA
– Bitwise tricks & techniques
• Love TAOCP 7.1.3 & Hacker's Delight
3. Problem: Count Big Data
• Counting:
– Fundamental operation in data analysis.
• Simply counting big data is difficult
– Because it needs a huge amount of memory.
– E.g., 400GB+ is needed for one year of access logs.
4. Batch Processing
• Batch processing can solve this.
– E.g.,
• Two issues:
– High latency
– Requirement for a cluster of machines
[Figure: many batch jobs = High cost]
6. Our Approach
• Streaming algorithms
– Can fulfill all our goals!
– Are becoming common at Web companies.
• See the paper on Google’s PowerDrill & the code of Twitter’s Algebird for examples of how they are used.
• Keys:
– Limited memory
– Low latency
– Theoretical guarantee for accuracy
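The "theoretical guarantee" key can be made concrete with the standard textbook bounds for the two sketches discussed later in this talk. The snippet below is only a sizing sketch under those bounds: HyperLogLog with m registers has a relative standard error of roughly 1.04/sqrt(m), and a Count-Min Sketch with width ceil(e/epsilon) and depth ceil(ln(1/delta)) overestimates a count by at most epsilon times the stream length with probability at least 1 - delta. The function names are hypothetical and are not part of libsketch.

```python
import math

def hll_standard_error(num_registers):
    """Approximate relative standard error of HyperLogLog with m registers."""
    return 1.04 / math.sqrt(num_registers)

def cms_dimensions(epsilon, delta):
    """Width and depth of a Count-Min Sketch for the given error bounds.

    The estimate exceeds the true count by at most epsilon * N
    (N = total stream length) with probability at least 1 - delta.
    """
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1 / delta))
    return width, depth

# 4096 registers give roughly 1.6% relative error.
print(hll_standard_error(4096))
# Counters needed for a 0.1% additive error with 99% confidence.
print(cms_dimensions(0.001, 0.01))
```

In both cases the memory footprint depends on the target accuracy, not on the size of the input, which is what makes the "limited memory" key possible.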
7. Streaming Algorithm Library
• RIT internally provides a C library
for streaming algorithms, libsketch.
• Three advantages:
– Memory efficient
– High speed
– High accuracy
• Bindings for two languages (logos not preserved)
8. Why C?
• Our target: Python & Ruby users!
for data analysis
for stream processing
– But most existing libraries are written in Scala (algebird), Java (stream-lib), ...
• This is one reason why our library is written in C!
– It is easy to incorporate C libraries into Python & Ruby.
10. Count Query in Rakuten
• Example: We want to know...
1. How many unique users checked an item in one day (month, or year)?
2. How many products were sold in one day (month, or year)?
• Streaming algorithms for the queries
1. HyperLogLog algorithm
2. Count-Min Sketch algorithm
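For the second query, a Count-Min Sketch keeps a small fixed table of counters instead of one counter per product. The code below is a minimal pure-Python sketch of the general technique, not the libsketch implementation; the class name, the default table size, and the use of salted BLAKE2b as the row hash are all assumptions made for illustration.

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: a depth x width table of counters."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One hash function per row, obtained by salting BLAKE2b with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Collisions can only inflate a counter, so the minimum over the
        # rows is the tightest available upper bound on the true count.
        return min(self.table[row][col] for row, col in self._cells(key))

sketch = CountMinSketch()
for product in ["book", "camera", "book", "tea", "book"]:
    sketch.add(product)
print(sketch.estimate("book"))  # never below the true count of 3
```

The estimate never undercounts; it can overcount when different keys collide in every row, which becomes less likely as the table gets wider and deeper.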
12. Problem: Unique Item Count
• Naïve approach:
– Use a dict in Python: "dict[key] += 1"
– This can require a large amount of memory, because every distinct key is stored.
• Streaming algorithm: HyperLogLog
– Counts unique items approximately.
– This needs a fixed amount of memory.
• Google recently proposed an improved version of
HyperLogLog, called HyperLogLog++.
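To make the idea concrete, here is a minimal pure-Python sketch of the original HyperLogLog estimator (without the HyperLogLog++ improvements). It is an illustration under stated assumptions, not the libsketch code: the class name, the 64-bit BLAKE2b hash, and the default of 2^12 registers are choices made for this example. A plain Python list is used for the registers for readability; a real implementation would pack them into a few bits each.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog with 2**p registers and a 64-bit hash."""

    def __init__(self, p=12):
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16 in this sketch")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        index = h >> (64 - self.p)              # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)        # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw

hll = HyperLogLog()
for i in range(50000):
    hll.add("user-%d" % i)
    hll.add("user-%d" % i)  # repeat visits do not change the registers
estimate = hll.count()
print(round(estimate))      # close to 50000, within a few percent
```

The memory used is fixed by p alone: adding more items, or the same item again, never grows the register array.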
16. Performance
• Task: Count unique items in an item set.
• Results (baseline → HyperLogLog):
– Memory: 1193MB → 5MB (1%; memory efficient)
– Time: 419sec → 108sec (4x speed-up; high speed)
– Accuracy: 100% → 99% (-1%; high accuracy)
• This data set is small, but we are using HyperLogLog for bigger data.
17. Conclusion
• Streaming algorithms in Rakuten
– We are using them for data analysis.
– We have an internal C library with bindings.
• HyperLogLog, Count-Min Sketch, and so on.
– Future: Plan to implement other algorithms.
18. Reference
• HyperLogLog & HyperLogLog++
– [Flajolet et al., AOFA 2007], [Heule et al., EDBT 2013]
• Count-Min Sketch
– [Cormode, Muthukrishnan, J. Algorithms, 2005]
• Excellent slides by Alex Smola
– http://alex.smola.org/teaching/berkeley2012/slides/3_Streams.pdf
• AK TECH BLOG by Aggregate Knowledge
– http://blog.aggregateknowledge.com/
• Stream-lib by Clearspring
– https://github.com/clearspring/stream-lib