Evan Pollan talks about Bazaarvoice's Hadoop infrastructure for clickstream analytics, as well as an approach to large-scale cardinality analysis using Map/Reduce and HBase.
A magpie is a bird that suffers an irresistible urge to collect and hoard things. Sense of scale: at our current level of instrumentation and app penetration…
HBase fit the bill… given its storage model and affinity to time-series data, and given its clean, out-of-the-box integration with MapReduce.
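To make that integration concrete, here is a minimal sketch of a map-only job reading a key range straight out of HBase. The "clickstream" table, the "siteXYZ|yyyyMMdd" row-key layout, and the mapper are illustrative assumptions, not Bazaarvoice's actual schema:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ClickstreamScanJob {

  // Hypothetical mapper: emits one record per clickstream row in the scanned range.
  static class ClickMapper extends TableMapper<Text, LongWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
        throws IOException, InterruptedException {
      String key = Bytes.toString(rowKey.get(), rowKey.getOffset(), rowKey.getLength());
      context.write(new Text(key), new LongWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "clickstream-scan");
    job.setJarByClass(ClickstreamScanJob.class);

    // Time-series-friendly keys (e.g. "siteXYZ|20130131|...") make one site-day a
    // contiguous key range, so a scan with start/stop rows is a linear read.
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("siteXYZ|20130131"));
    scan.setStopRow(Bytes.toBytes("siteXYZ|20130201"));
    scan.setCaching(500);

    // The out-of-the-box MapReduce integration: the HBase table is the job's input.
    TableMapReduceUtil.initTableMapperJob(
        "clickstream", scan, ClickMapper.class, Text.class, LongWritable.class, job);
    job.setNumReduceTasks(0);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```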
I can’t, and therefore don’t, do diagrams. You’re stuck with word-dense slides. Hadoop cluster: of note, we sync S3 to HDFS for optimized job execution and to enable Oozie’s data dependency management. Job Portal: Oozie’s web UI is painful to use.
When most people get ready to deploy Hadoop to EC2, they choose between Elastic MapReduce and a custom deployment. CDH distribution: curated, so you don’t have to worry about mixing and matching versions of the various Apache components.
Non-HA NameNode: even CDH3 was not immune from this SPOF, and EC2 MTBF is iffy… The Magpie team was definitely not the first to foray into EC2 – BV had been using EC2 for quite some time at this point.
Quorum Journal Manager for edit logs (configuration sketch below):
- Doesn’t push the SPOF further upstream onto an NFS/NAS solution for shared storage of the edit logs
- This system works really well. Leader election is lightning fast, and we haven’t encountered any failures of reads or writes during our “pull the plug” testing
End-to-end automation for DR:
- And by DR, I mean AZ outages; loss of 3+ data nodes; loss of 2+ “master nodes”
- When our SLAs require it, we’ll run an HBase replica in another region, but still treat the MapReduce cluster as expendable
HBase/HDFS locality:
- Region Server and HFile blocks are not co-resident after a region has been reassigned
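For readers unfamiliar with QJM-based NameNode HA, the relevant HDFS settings look roughly like the following, shown here through Hadoop’s Configuration API. The nameservice name, hostnames, and ports are placeholders; this is a generic sketch, not Magpie’s actual configuration:

```java
import org.apache.hadoop.conf.Configuration;

public class QjmHaConfigSketch {
  public static Configuration qjmHaSettings() {
    Configuration conf = new Configuration();

    // One logical nameservice fronting two NameNodes (placeholder names).
    conf.set("dfs.nameservices", "magpie");
    conf.set("dfs.ha.namenodes.magpie", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.magpie.nn1", "namenode1.example.com:8020");
    conf.set("dfs.namenode.rpc-address.magpie.nn2", "namenode2.example.com:8020");

    // Edit logs go to a quorum of JournalNodes instead of an NFS/NAS share,
    // so shared storage is no longer a single point of failure.
    conf.set("dfs.namenode.shared.edits.dir",
        "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/magpie");

    // Automatic failover: ZooKeeper-based leader election picks the active NameNode.
    conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
    conf.set("ha.zookeeper.quorum",
        "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");

    // Clients resolve the currently active NameNode through a failover proxy provider.
    conf.set("dfs.client.failover.proxy.provider.magpie",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

    return conf;
  }
}
```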
We have a solid Hadoop infrastructure running in AWS, so let’s crunch some big data. Not tenable given… well, not tenable without a very, very large OLAP data store. We’ve got a Hadoop cluster, though… pre-calculate them, too? Large, expensive jobs re-processing the same data sets, and a lack of flexibility for the end user.
Conclusion: we need some way to calculate and persist a representation of cardinality for each incremental time period that is not prohibitive to scan over arbitrary time ranges and combine into a single representation of the cardinality of the union of all the subsets.
Bit sets are combinable… meaning you could take a bit set representation of one day’s cardinality, OR it with another day’s bit set, and get a bit set that tells you the cardinality of the union of the two days. MapReduce to build… for example, unique users at site XYZ on January 31, 2013. Scan: start and stop… HBase is very good at scans over reasonable sets of data, even without the benefit of the block cache, when rows are (a) reasonably narrow and (b) keyed so that their ordering leads to linear reads. A billion bits is a lot of bits… it’s not big data, but it can quickly become big data.
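A minimal sketch of the bit-set idea, with hand-picked bit positions standing in for user IDs:

```java
import java.util.BitSet;

public class BitSetCardinality {
  public static void main(String[] args) {
    // One bit set per day; bit N is set if user N was seen that day.
    BitSet jan30 = new BitSet();
    jan30.set(17);
    jan30.set(42);
    jan30.set(99);

    BitSet jan31 = new BitSet();
    jan31.set(42);    // same user as the day before: we want a union, not a sum
    jan31.set(1234);

    System.out.println("Jan 30 uniques: " + jan30.cardinality()); // 3
    System.out.println("Jan 31 uniques: " + jan31.cardinality()); // 2

    // Combining is just a bitwise OR; the result's cardinality is the
    // number of unique users across both days.
    BitSet union = (BitSet) jan30.clone();
    union.or(jan31);
    System.out.println("Two-day uniques: " + union.cardinality()); // 4

    // The catch: one bit per possible user, so a billion users means
    // roughly 120 MB of bit set per time increment.
  }
}
```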
Possible mitigation: compression. You still need to generate a 120 MB data structure in RAM and then compress it, and retrieval is non-trivial given decompression costs and heap pressure.
Calculate cardinality in a small RAM footprint, e.g. for stream processing. Big breakthrough in 2007: HyperLogLog, a new algorithm and representational data structure from a team of French mathematicians led by Flajolet. Timely: engineers at Google just published a refinement called HLL++ that is more accurate on the low and high end. Combinable… not unique to HyperLogLog. Analog: lossy compression… but it doesn’t require a large intermediate heap and the associated CPU cycles for compression.
I don’t peruse the proceedings of math conferences – but I do keep up with Hacker News and highscalability.com. Last April, Matt Abrams of Clearspring wrote a blog post on using HyperLogLog to merge cardinality estimators from a bunch of distributed stream-processing machines.
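A rough sketch of what that merging looks like with Clearspring’s stream-lib library. The precision parameter and user IDs are illustrative, and the API usage reflects my reading of stream-lib rather than code from the talk:

```java
import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;
import com.clearspring.analytics.stream.cardinality.ICardinality;

public class HllMergeSketch {
  public static void main(String[] args) throws CardinalityMergeException {
    // One estimator per day; log2m = 14 (~16K registers) is an illustrative precision.
    HyperLogLog jan30 = new HyperLogLog(14);
    HyperLogLog jan31 = new HyperLogLog(14);

    jan30.offer("user-17");
    jan30.offer("user-42");
    jan31.offer("user-42");   // repeat visitor: counted once in the union
    jan31.offer("user-1234");

    System.out.println("Jan 30 estimate: " + jan30.cardinality());
    System.out.println("Jan 31 estimate: " + jan31.cardinality());

    // Like the bit sets, estimators are combinable; unlike the bit sets,
    // each one stays a few kilobytes no matter how many users it has seen.
    ICardinality union = jan30.merge(jan31);
    System.out.println("Two-day estimate: " + union.cardinality());
  }
}
```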