Evan Pollan talks about Bazaarvoice's Hadoop infrastructure for clickstream analytics, as well as an approach to large-scale cardinality analysis using Map/Reduce and HBase.
A magpie is a bird that suffers an irresistible urge to collect and hoard things. Sense of scale: at our current level of instrumentation and app penetration…
HBase fit the bill… given its storage model and affinity to time-series data, and given its clean, out-of-the-box integration with MapReduce.
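To make that integration concrete, here is a minimal sketch of a map-only job reading a key range straight out of HBase. The "clickstream" table, the "siteXYZ|yyyyMMdd" row-key layout, and the mapper are illustrative assumptions, not Bazaarvoice's actual schema:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ClickstreamScanJob {

  // Hypothetical mapper: emits one record per clickstream row in the scanned range.
  static class ClickMapper extends TableMapper<Text, LongWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
        throws IOException, InterruptedException {
      String key = Bytes.toString(rowKey.get(), rowKey.getOffset(), rowKey.getLength());
      context.write(new Text(key), new LongWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "clickstream-scan");
    job.setJarByClass(ClickstreamScanJob.class);

    // Time-series-friendly keys (e.g. "siteXYZ|20130131|...") make one site-day a
    // contiguous key range, so a scan with start/stop rows is a linear read.
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("siteXYZ|20130131"));
    scan.setStopRow(Bytes.toBytes("siteXYZ|20130201"));
    scan.setCaching(500);

    // The out-of-the-box MapReduce integration: the HBase table is the job's input.
    TableMapReduceUtil.initTableMapperJob(
        "clickstream", scan, ClickMapper.class, Text.class, LongWritable.class, job);
    job.setNumReduceTasks(0);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```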
I can’t, and therefore don’t, do diagrams. You’re stuck with word-dense slides. Hadoop cluster: of note, we sync S3 to HDFS for optimized job execution and to enable Oozie’s data dependency management. Job Portal: Oozie’s web UI is painful to use.
When most people get ready to deploy Hadoop to EC2, they choose between Elastic MapReduce and a custom deployment. CDH distribution: curated, so you don’t have to worry about mixing and matching versions of the various Apache components.
Non-HA NameNode: even CDH3 was not immune from this SPOF, and EC2 MTBF is iffy… The Magpie team was definitely not the first to foray into EC2 – BV had been using EC2 for quite some time at this point.
Quorum Journal Manager for edit logs (configuration sketch below):
- Doesn’t push the SPOF further upstream onto an NFS/NAS solution for shared storage of the edit logs
- This system works really well. Leader election is lightning fast, and we haven’t encountered any failures of reads or writes during our “pull the plug” testing
End-to-end automation for DR:
- And by DR, I mean AZ outages; loss of 3+ data nodes; loss of 2+ “master nodes”
- When our SLAs require it, we’ll run an HBase replica in another region, but still treat the MapReduce cluster as expendable
HBase/HDFS locality:
- Region Server and HFile blocks are not co-resident after a region has been reassigned
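For readers unfamiliar with QJM-based NameNode HA, the relevant HDFS settings look roughly like the following, shown here through Hadoop’s Configuration API. The nameservice name, hostnames, and ports are placeholders; this is a generic sketch, not Magpie’s actual configuration:

```java
import org.apache.hadoop.conf.Configuration;

public class QjmHaConfigSketch {
  public static Configuration qjmHaSettings() {
    Configuration conf = new Configuration();

    // One logical nameservice fronting two NameNodes (placeholder names).
    conf.set("dfs.nameservices", "magpie");
    conf.set("dfs.ha.namenodes.magpie", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.magpie.nn1", "namenode1.example.com:8020");
    conf.set("dfs.namenode.rpc-address.magpie.nn2", "namenode2.example.com:8020");

    // Edit logs go to a quorum of JournalNodes instead of an NFS/NAS share,
    // so shared storage is no longer a single point of failure.
    conf.set("dfs.namenode.shared.edits.dir",
        "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/magpie");

    // Automatic failover: ZooKeeper-based leader election picks the active NameNode.
    conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
    conf.set("ha.zookeeper.quorum",
        "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");

    // Clients resolve the currently active NameNode through a failover proxy provider.
    conf.set("dfs.client.failover.proxy.provider.magpie",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

    return conf;
  }
}
```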
We have a solid Hadoop infrastructure running in AWS, so let’s crunch some big data. Not tenable given… well, not tenable without a very, very large OLAP data store. We’ve got a Hadoop cluster, though… pre-calculate them, too? Large, expensive jobs re-processing the same data sets, and a lack of flexibility for the end user.
Conclusion: we need some way to calculate and persist a representation of cardinality for each incremental time period that is not prohibitive to scan over arbitrary time ranges and combine into a single representation of the cardinality of the union of all the subsets.
Bit sets are combinable… meaning you could take a bit set representation of one day’s cardinality, OR it with another day’s bit set, and get a bit set that tells you the cardinality of the union of the two days. MapReduce to build… for example, unique users at site XYZ on January 31, 2013. Scan: start and stop… HBase is very good at scans over reasonable sets of data, even without the benefit of the block cache, when rows are (a) reasonably narrow and (b) keyed so that their ordering leads to linear reads. A billion bits is a lot of bits… it’s not big data, but it can quickly become big data.
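A minimal sketch of the bit-set idea, with hand-picked bit positions standing in for user IDs:

```java
import java.util.BitSet;

public class BitSetCardinality {
  public static void main(String[] args) {
    // One bit set per day; bit N is set if user N was seen that day.
    BitSet jan30 = new BitSet();
    jan30.set(17);
    jan30.set(42);
    jan30.set(99);

    BitSet jan31 = new BitSet();
    jan31.set(42);    // same user as the day before: we want a union, not a sum
    jan31.set(1234);

    System.out.println("Jan 30 uniques: " + jan30.cardinality()); // 3
    System.out.println("Jan 31 uniques: " + jan31.cardinality()); // 2

    // Combining is just a bitwise OR; the result's cardinality is the
    // number of unique users across both days.
    BitSet union = (BitSet) jan30.clone();
    union.or(jan31);
    System.out.println("Two-day uniques: " + union.cardinality()); // 4

    // The catch: one bit per possible user, so a billion users means
    // roughly 120 MB of bit set per time increment.
  }
}
```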
Possible mitigation: compression. You still need to generate a 120 MB data structure in RAM and then compress it, and retrieval is non-trivial given decompression costs and heap pressure.
Calculate cardinality in a small RAM footprint, e.g. for stream processing. Big breakthrough in 2007: HyperLogLog, a new algorithm and representational data structure from a team of French mathematicians led by Flajolet. Timely: engineers at Google just published a refinement called HLL++ that is more accurate on the low and high end. Combinable… not unique to HyperLogLog. Analog: lossy compression… but it doesn’t require a large intermediate heap and the associated CPU cycles for compression.
I don’t peruse the proceedings of math conferences – but I do keep up with Hacker News and highscalability.com. Last April, Matt Abrams of Clearspring wrote a blog post on using HyperLogLog to merge cardinality estimators from a bunch of distributed stream-processing machines.
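A rough sketch of what that merging looks like with Clearspring’s stream-lib library. The precision parameter and user IDs are illustrative, and the API usage reflects my reading of stream-lib rather than code from the talk:

```java
import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;
import com.clearspring.analytics.stream.cardinality.ICardinality;

public class HllMergeSketch {
  public static void main(String[] args) throws CardinalityMergeException {
    // One estimator per day; log2m = 14 (~16K registers) is an illustrative precision.
    HyperLogLog jan30 = new HyperLogLog(14);
    HyperLogLog jan31 = new HyperLogLog(14);

    jan30.offer("user-17");
    jan30.offer("user-42");
    jan31.offer("user-42");   // repeat visitor: counted once in the union
    jan31.offer("user-1234");

    System.out.println("Jan 30 estimate: " + jan30.cardinality());
    System.out.println("Jan 31 estimate: " + jan31.cardinality());

    // Like the bit sets, estimators are combinable; unlike the bit sets,
    // each one stays a few kilobytes no matter how many users it has seen.
    ICardinality union = jan30.merge(jan31);
    System.out.println("Two-day estimate: " + union.cardinality());
  }
}
```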