Clickstream Analytics at Bazaarvoice
Evan Pollan, Engineering Lead




                                 @EvanPollan
Agenda
 •    Infrastructure: lessons learned operating Hadoop in EC2
 •    Case study: uniques at scale using Hadoop and HBase




Confidential and Proprietary. © 2012 Bazaarvoice, Inc.
Project Magpie
 •    Bazaarvoice products – extremely large web surface area
 •    Client-side instrumentation to measure interactions
 •    Many event sources (apps) => one sink: Magpie
 •    Consolidated HTTP event collection
       – Network-wide event correlation
       – Network ~ many apps and many “sites” (clients)
 •    Clickstream == Topically segmented JSON event log files
 •    Sense of scale
       – 10 - 20K events per second
       – 500M – 1B impressions per day
       – 25 – 50 GB compressed event log data per day


Infrastructure Whys
 •    Why Hadoop?
       – Experience scaling brute-force log processing via Hadoop
          • Everybody’s favorite: Akamai edge request logs
          • EMR, Apache Whirr
       – Needed online analytics – HBase fit the bill
       – Apache OSS ecosystem familiar to BV

 •    Why Amazon Web Services?
       – Existing infrastructure hosting solution too inflexible and slow
       – Couldn’t scale R&D without an elastic infrastructure


High-level architecture
 •    Event collectors in auto-scale groups behind elastic load balancers
 •    Event stream compressed and uploaded hourly to S3
 •    S3: store of record
 •    Hadoop cluster:
        –   HDFS: stores raw event logs, derived file-based data sets, and HBase
            HFiles/WALs
        –   Oozie: job scheduling, data dependency management
        –   MapReduce: analytics (mix of Pig, Java => 100% Java)
        –   HBase: stores hourly/daily analytics results
 •    Job Portal: job schedule viz, gap analysis & alerting
 •    UI/API: Analytics available via JSON API and in Backbone.js UI


EMR vs. roll our own
 •    Neither
 •    Cons: EMR
       – Price premium
       – Opaque Hadoop configuration
       – No way to mitigate SPOFs
 •    Cons: Roll our own
       – Small group of engineers, no ops manpower at beginning
 •    Solution: Cloudera
       – Cloudera Manager for config management and provisioning
       – CDH 3.X distribution


Missteps
 •    Problem: non-HA NameNode
 •    Solution: EBS!
 •    Problem: EC2 MTBF iffy
 •    Solution: EBS!
 •    Reality: When something goes wrong in AWS, it is invariably an
      outage or degradation in EBS.
       – Violates the whole concept of data locality. Hadoop + SAN =
          sadness
 •    Problem: Where should HBase live?
 •    Solution: Co-resident with MapReduce!


Where we’ve ended up
 •    Moved to the latest Cloudera CDH 4.X – HA NameNode!
       – ZooKeeper for leader election
       – Quorum Journal Manager for edit logs
 •    Learn to let go
       – Mitigate SPOF where possible, but plan for failure
       – End-to-end automation for DR/migration
 •    Avoid EBS like the plague
 •    HBase and MapReduce segmentation
       – Enables different hardware step size
       – Batch processing doesn’t affect HBase response time
       – Better understanding of HBase/HDFS locality (or lack thereof)


Let’s talk sets
 •    Common problem: uniques (e.g. unique visitors, users, etc.)
 •    Naïve solution: SELECT COUNT(DISTINCT X) FROM Y
 •    Not tenable given:
       – Massive, semi-structured data set
       – Thousands of grouping axes
 •    OK: pre-calculate via MapReduce
 •    But…
       – What would you pre-calculate?
       – Daily for each grouping?
       – How would you answer queries for other time ranges? Pre-
          calculate them, too?


Set Unions
 •    Definition: cardinality of a set is the number of elements in that set
       – A = {1, 2, 3}; |A| = 3
 •    The cardinality of the union of two sets cannot be determined from
      the cardinalities of the two sets alone
       – |A ∪ B| is not necessarily equal to |A| + |B|
       – They are equal only if A and B are disjoint
       – How do you know if they’re disjoint?
       – You need both sets
 •    Imagine:
       – Set A is the visitors from yesterday
       – Set B is the visitors from today
       – To get uniques for both days, you have to look at both data sets
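
A minimal Python sketch of the point above. The visitor identifiers are made up for illustration; the only claim is that summing daily counts overstates the two-day uniques whenever the days overlap.

```python
# Hypothetical visitor identifiers for two consecutive days.
yesterday = {"u1", "u2", "u3"}
today = {"u2", "u3", "u4"}

# Adding the daily cardinalities double-counts returning visitors.
sum_of_counts = len(yesterday) + len(today)

# The true two-day uniques need the union of both sets.
two_day_uniques = len(yesterday | today)

# The two numbers agree only when the sets are disjoint.
are_disjoint = yesterday.isdisjoint(today)
```

Here the sum of the daily counts is 6 while the union holds 4 visitors, because two visitors came back on the second day.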


An entirely different set: bit sets
 •    Translate set members’ identifiers to an index in a bit set
 •    Bit sets are combinable – yahtzee!
 •    HBase is good at storing bits 
       – MapReduce to build bit set for each grouping in your smallest
           desirable unit of time
       – Persist w/ row key as a function of date and grouping
 •    Uniques for last month?
       – Scan: start and stop rows accounting for date range and grouping
       – Merge each day’s bit set with a single bit set representing the union
       – Count the number of “on” bits in the merged bit set => cardinality
 •    But…
       – # bits for items whose identifiers number in the billions?
       – A billion bits is a lot of bits
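
The bit set idea can be sketched in a few lines of Python. This assumes identifiers have already been translated to small, dense integer indices (that translation step is not shown), and it uses a Python integer as the bit set purely for illustration.

```python
def to_bitset(indices):
    """Build a bit set (stored in a Python int) with one bit per index."""
    bits = 0
    for i in indices:
        bits |= 1 << i
    return bits

# Hypothetical daily visitor indices.
day1 = to_bitset([1, 2, 3])
day2 = to_bitset([2, 3, 4, 5])

# Combining days is a bitwise OR; no raw identifiers are needed.
merged = day1 | day2

# Cardinality of the union is the number of "on" bits.
uniques = bin(merged).count("1")
```

The merge touches only the two bit sets, which is what makes per-day results combinable after the fact.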


Bit sets – solving the size problem
 •    10^9 bits is an expensive way to store a combinable cardinality
 •    Query I/O example: Uniques for last quarter
       – 120 MB/day * 90 days = 10.8 GB
       – Too much to pull out of HBase to answer an “online” query
 •    Storage example: 10K different grouping axes
       – Clients, sites, favorite colors, whatever
       – 120 MB * 10K = 1.2 TB/day of storage
 •    Possible mitigation: compression
       – Still need to generate a 120 MB data structure, then compress
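
The arithmetic behind these figures, as a small Python check. The slide rounds one billion bits to 120 MB; the exact value is 125 MB in decimal units (about 119 MiB). The variable names are illustrative only.

```python
BITS_PER_SET = 10**9

# One bit per possible identifier, eight bits per byte.
bytes_per_set = BITS_PER_SET // 8
mb_per_set = bytes_per_set / 1e6        # decimal megabytes
mib_per_set = bytes_per_set / 2**20     # binary mebibytes

# The slide's rounded figure, used for the two examples.
SLIDE_MB = 120

# Query I/O: one bit set per day for a 90-day quarter.
quarter_gb = SLIDE_MB * 90 / 1000

# Storage: one bit set per day for each of 10K grouping axes.
daily_tb = SLIDE_MB * 10_000 / 1e6
```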



Cardinality Estimation
 •    Many different approaches to estimate the cardinality of a set
       – General goal: calculate cardinality in small RAM footprint
 •    Big breakthrough in 2007: the HyperLogLog estimator
 •    What’s the big deal?
       – Tunable accuracy
       – Incredible information density
       – Combinable
 •    Analog: lossy compression of bit sets
 •    How good?
       – Estimate cardinality of 10^9 unique elements +/-2% in 1.5 KB
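
To make "combinable" concrete, here is a simplified HyperLogLog sketch in Python. It is an illustration under assumptions, not the stream-lib implementation: the SHA-1 based 64-bit hash, the default of 2^11 registers, and the class and method names are all choices made for this example. With 2^11 registers the expected standard error is roughly 1.04 / sqrt(2048), about 2.3%.

```python
import hashlib
import math

class HyperLogLog:
    """Simplified HyperLogLog: m = 2**p small registers, merge by max."""

    def __init__(self, p=11):
        if p < 7:
            raise ValueError("alpha formula below assumes m >= 128")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(item):
        # Assumption: first 8 bytes of SHA-1 as a 64-bit hash.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        x = self._hash64(item)
        width = 64 - self.p
        idx = x >> width                    # top p bits pick a register
        rest = x & ((1 << width) - 1)       # remaining bits
        # 1-based position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        if self.p != other.p:
            raise ValueError("estimators must use the same precision")
        self.registers = [max(a, b)
                          for a, b in zip(self.registers, other.registers)]

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)  # small-range correction
        return raw

# Two overlapping "days" of hypothetical visitors, plus a reference
# estimator that sees every visitor directly.
day1, day2, both = HyperLogLog(), HyperLogLog(), HyperLogLog()
for i in range(6000):
    day1.add(f"visitor-{i}")
    both.add(f"visitor-{i}")
for i in range(4000, 10000):
    day2.add(f"visitor-{i}")
    both.add(f"visitor-{i}")

day1.merge(day2)
estimate = day1.cardinality()
```

Merging the two daily estimators yields exactly the same registers as feeding all 10,000 visitors into one estimator, which is the property the rest of the talk relies on.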


Nuts & Bolts
 •    http://github.com/clearspring/stream-lib
       – Java impls of top-K, frequency, and cardinality for streams
 •    A ha moment: combining estimators from distributed counters is
      no different than combining them across different time periods!
 •    MapReduce algorithm
       – map(Event) : (key, identifier)
            • key is whatever grouping you want uniques for
       – Shuffle sorts all key, identifier tuples by key
       – reduce(key, Iterable<identifier>) : estimator bytes
 •    Reducer simply updates the estimator in-place – tiny RAM footprint
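
The map, shuffle, and reduce steps above can be simulated in plain Python. This is a sketch under assumptions: the event fields (brand, day, visitor) are hypothetical, and the estimator is stubbed with an exact set so the example stays short. In the pipeline described here, that stub would be a HyperLogLog whose serialized bytes are the reducer output.

```python
from itertools import groupby

class ExactEstimator:
    """Stand-in for a cardinality estimator (exact, not HyperLogLog)."""

    def __init__(self):
        self._seen = set()

    def offer(self, identifier):
        self._seen.add(identifier)

    def cardinality(self):
        return len(self._seen)

def map_event(event):
    # Emit (grouping key, identifier); here the key is brand plus day.
    yield (f"{event['brand']}-{event['day']}", event["visitor"])

def reduce_group(key, identifiers):
    # Update a single estimator in place; identifiers are never buffered.
    estimator = ExactEstimator()
    for identifier in identifiers:
        estimator.offer(identifier)
    return key, estimator

# Hypothetical events.
events = [
    {"brand": "brandX", "day": "20130101", "visitor": "u1"},
    {"brand": "brandX", "day": "20130101", "visitor": "u2"},
    {"brand": "brandX", "day": "20130101", "visitor": "u1"},
    {"brand": "brandX", "day": "20130102", "visitor": "u2"},
    {"brand": "brandX", "day": "20130102", "visitor": "u3"},
    {"brand": "brandY", "day": "20130101", "visitor": "u1"},
]

# Map phase.
pairs = [pair for event in events for pair in map_event(event)]

# Shuffle phase: sort by key so each key's values are contiguous.
pairs.sort(key=lambda kv: kv[0])

# Reduce phase: one estimator per grouping key.
results = {}
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    out_key, estimator = reduce_group(key, (value for _, value in group))
    results[out_key] = estimator.cardinality()
```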


Nuts & Bolts
 •    Reducer output: HBase Put
 •    HBase “schema”, e.g. daily uniques aggregated by brand:
             Row Key              Estimator
             brandX-20130101      [0100110111000]
             brandX-20130102      [0110100111000]
             brandX-20130103      [0100000101011]
             brandY-20130101      [0101100011000]
             brandY-20130102      [0100100111001]
 •    Scan: brandX, Jan 2-3
       – brandX-20130102      [0110100111000]
       – brandX-20130103      [0100000101011]
       – Merged estimator     [0110100111011]
       – Cardinality = N
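
A small Python sketch of the row-key scan and merge, using the bit strings from the slide. Assumptions: a sorted in-memory dict stands in for the HBase table, the scan treats the start row as inclusive and the stop row as exclusive, and the estimators are merged as plain bit sets with a bitwise OR (a HyperLogLog would merge register by register instead).

```python
import bisect

# Stand-in for the HBase table: row key -> estimator bits.
table = {
    "brandX-20130101": 0b0100110111000,
    "brandX-20130102": 0b0110100111000,
    "brandX-20130103": 0b0100000101011,
    "brandY-20130101": 0b0101100011000,
    "brandY-20130102": 0b0100100111001,
}

def row_key(grouping, yyyymmdd):
    # Grouping first, then date, so one grouping's days sort together.
    return f"{grouping}-{yyyymmdd}"

def scan(rows, start_row, stop_row):
    """Return rows with start_row <= key < stop_row, in key order."""
    keys = sorted(rows)
    lo = bisect.bisect_left(keys, start_row)
    hi = bisect.bisect_left(keys, stop_row)
    return [(k, rows[k]) for k in keys[lo:hi]]

# Uniques for brandX over Jan 2-3: stop row is the day after the range.
scanned = scan(table,
               row_key("brandX", "20130102"),
               row_key("brandX", "20130104"))

merged = 0
for _, estimator in scanned:
    merged |= estimator

merged_bits = format(merged, "013b")
```

The merged value matches the slide's combined estimator, and no brandY rows are read because the row key places the grouping ahead of the date.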
Nuts & Bolts
 •    HBase scan is the key to making this fast
       – First result: instantiate HyperLogLog estimator
       – Remaining results: update estimator in-place
 •    O(n) to compute result, n ~ number of bits in estimator (1.5KB)
 •    Freedom to build a data set of unique estimators that can be
      arbitrarily sliced quickly
       – Quarterly, daily, weekly, ad-hoc date ranges
       – HBase client pulls 1.5KB * number of days, returns a long
       – Perf anecdote: REST API call to get network-wide uniques for
          current month-to-date
            • 66 ms over the internet
            • 12 ms server-side latency
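
For a rough sense of why the scan is cheap, the payload sizes can be compared directly. This sketch only restates the slide's own figures (1.5 KB per daily estimator, 120 MB per daily bit set); the function and variable names are illustrative.

```python
ESTIMATOR_KB = 1.5          # per-day estimator size from the slides
BITSET_KB = 120 * 1000      # per-day bit set size from the slides, in KB

def scan_payload_kb(days, per_day_kb):
    """Data pulled by the HBase client for one grouping over `days` rows."""
    return per_day_kb * days

quarter_estimators_kb = scan_payload_kb(90, ESTIMATOR_KB)
quarter_bitsets_kb = scan_payload_kb(90, BITSET_KB)

# How many times smaller the estimator-based query payload is.
reduction = quarter_bitsets_kb / quarter_estimators_kb
```

A quarterly query reads about 135 KB of estimators instead of 10.8 GB of bit sets.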


Más contenido relacionado

La actualidad más candente

HBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon
 
Keynote: The Future of Apache HBase
Keynote: The Future of Apache HBaseKeynote: The Future of Apache HBase
Keynote: The Future of Apache HBaseHBaseCon
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetCloudera, Inc.
 
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster Cloudera, Inc.
 
Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010Cloudera, Inc.
 
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...Michael Stack
 
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon
 
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload DiversityHarmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload DiversityHBaseCon
 
Dawn of YARN @ Rocket Fuel
Dawn of YARN @ Rocket FuelDawn of YARN @ Rocket Fuel
Dawn of YARN @ Rocket FuelDataWorks Summit
 
Maintaining Low Latency While Maximizing Throughput on a Single Cluster
Maintaining Low Latency While Maximizing Throughput on a Single ClusterMaintaining Low Latency While Maximizing Throughput on a Single Cluster
Maintaining Low Latency While Maximizing Throughput on a Single ClusterMapR Technologies
 
HBaseCon 2015 General Session: State of HBase
HBaseCon 2015 General Session: State of HBaseHBaseCon 2015 General Session: State of HBase
HBaseCon 2015 General Session: State of HBaseHBaseCon
 
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBaseCon
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesHBaseCon
 
Keynote - Hosted PostgreSQL: An Objective Look
Keynote - Hosted PostgreSQL: An Objective LookKeynote - Hosted PostgreSQL: An Objective Look
Keynote - Hosted PostgreSQL: An Objective LookEDB
 
(SOV202) Choosing Among AWS Managed Database Services | AWS re:Invent 2014
(SOV202) Choosing Among AWS Managed Database Services | AWS re:Invent 2014(SOV202) Choosing Among AWS Managed Database Services | AWS re:Invent 2014
(SOV202) Choosing Among AWS Managed Database Services | AWS re:Invent 2014Amazon Web Services
 
HBaseCon 2015: State of HBase Docs and How to Contribute
HBaseCon 2015: State of HBase Docs and How to ContributeHBaseCon 2015: State of HBase Docs and How to Contribute
HBaseCon 2015: State of HBase Docs and How to ContributeHBaseCon
 
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini Cloudera, Inc.
 
HBase Backups
HBase BackupsHBase Backups
HBase BackupsHBaseCon
 
Back to School - St. Louis Hadoop Meetup September 2016
Back to School - St. Louis Hadoop Meetup September 2016Back to School - St. Louis Hadoop Meetup September 2016
Back to School - St. Louis Hadoop Meetup September 2016Adam Doyle
 

La actualidad más candente (20)

HBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and Spark
 
Keynote: The Future of Apache HBase
Keynote: The Future of Apache HBaseKeynote: The Future of Apache HBase
Keynote: The Future of Apache HBase
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
 
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
 
Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010
 
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
 
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
 
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload DiversityHarmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
 
Dawn of YARN @ Rocket Fuel
Dawn of YARN @ Rocket FuelDawn of YARN @ Rocket Fuel
Dawn of YARN @ Rocket Fuel
 
Maintaining Low Latency While Maximizing Throughput on a Single Cluster
Maintaining Low Latency While Maximizing Throughput on a Single ClusterMaintaining Low Latency While Maximizing Throughput on a Single Cluster
Maintaining Low Latency While Maximizing Throughput on a Single Cluster
 
HBaseCon 2015 General Session: State of HBase
HBaseCon 2015 General Session: State of HBaseHBaseCon 2015 General Session: State of HBase
HBaseCon 2015 General Session: State of HBase
 
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial Industry
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application Archetypes
 
Keynote - Hosted PostgreSQL: An Objective Look
Keynote - Hosted PostgreSQL: An Objective LookKeynote - Hosted PostgreSQL: An Objective Look
Keynote - Hosted PostgreSQL: An Objective Look
 
(SOV202) Choosing Among AWS Managed Database Services | AWS re:Invent 2014
(SOV202) Choosing Among AWS Managed Database Services | AWS re:Invent 2014(SOV202) Choosing Among AWS Managed Database Services | AWS re:Invent 2014
(SOV202) Choosing Among AWS Managed Database Services | AWS re:Invent 2014
 
HBaseCon 2015: State of HBase Docs and How to Contribute
HBaseCon 2015: State of HBase Docs and How to ContributeHBaseCon 2015: State of HBase Docs and How to Contribute
HBaseCon 2015: State of HBase Docs and How to Contribute
 
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
 
NoSQL: Cassadra vs. HBase
NoSQL: Cassadra vs. HBaseNoSQL: Cassadra vs. HBase
NoSQL: Cassadra vs. HBase
 
HBase Backups
HBase BackupsHBase Backups
HBase Backups
 
Back to School - St. Louis Hadoop Meetup September 2016
Back to School - St. Louis Hadoop Meetup September 2016Back to School - St. Louis Hadoop Meetup September 2016
Back to School - St. Louis Hadoop Meetup September 2016
 

Similar a Austin Scales- Clickstream Analytics at Bazaarvoice

Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...Amazon Web Services
 
HBase and Hadoop at Urban Airship
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airshipdave_revell
 
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and CloudHBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and CloudMichael Stack
 
MongoDB: What, why, when
MongoDB: What, why, whenMongoDB: What, why, when
MongoDB: What, why, whenEugenio Minardi
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestHBaseCon
 
Rigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase PerformanceRigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase PerformanceCloudera, Inc.
 
Rigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementDataWorks Summit
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924Amazon Web Services
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutionssolarisyougood
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftAmazon Web Services
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentationEdward Capriolo
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big DataJoe Alex
 
Intro to database_services_fg_aws_summit_2014
Intro to database_services_fg_aws_summit_2014Intro to database_services_fg_aws_summit_2014
Intro to database_services_fg_aws_summit_2014Amazon Web Services LATAM
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 
Machine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupMachine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupSri Ambati
 
Modern MySQL Monitoring and Dashboards.
Modern MySQL Monitoring and Dashboards.Modern MySQL Monitoring and Dashboards.
Modern MySQL Monitoring and Dashboards.Mydbops
 
PPCD_And_AmazonRDS
PPCD_And_AmazonRDSPPCD_And_AmazonRDS
PPCD_And_AmazonRDSVibhor Kumar
 

Similar a Austin Scales- Clickstream Analytics at Bazaarvoice (20)

Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
 
HBase and Hadoop at Urban Airship
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airship
 
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and CloudHBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud
 
MongoDB: What, why, when
MongoDB: What, why, whenMongoDB: What, why, when
MongoDB: What, why, when
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
 
Rigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase PerformanceRigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase Performance
 
Rigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance Measurement
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
 
Horizon for Big Data
Horizon for Big DataHorizon for Big Data
Horizon for Big Data
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutions
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentation
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
 
Intro to database_services_fg_aws_summit_2014
Intro to database_services_fg_aws_summit_2014Intro to database_services_fg_aws_summit_2014
Intro to database_services_fg_aws_summit_2014
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 
Machine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupMachine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville Meetup
 
Modern MySQL Monitoring and Dashboards.
Modern MySQL Monitoring and Dashboards.Modern MySQL Monitoring and Dashboards.
Modern MySQL Monitoring and Dashboards.
 
NoSQL_Night
NoSQL_NightNoSQL_Night
NoSQL_Night
 
PPCD_And_AmazonRDS
PPCD_And_AmazonRDSPPCD_And_AmazonRDS
PPCD_And_AmazonRDS
 
Wmware NoSQL
Wmware NoSQLWmware NoSQL
Wmware NoSQL
 

Último

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 

Último (20)

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024

Austin Scales- Clickstream Analytics at Bazaarvoice

  • 1. Clickstream Analytics at Bazaarvoice Evan Pollan, Engineering Lead @EvanPollan
  • 2. Agenda • Infrastructure: lessons learned operating Hadoop in EC2 • Case study: uniques at scale using Hadoop and HBase
  • 3. Project Magpie • Bazaarvoice products – extremely large web surface area • Client-side instrumentation to measure interactions • Many event sources (apps) => one sink: Magpie • Consolidated HTTP event collection – Network-wide event correlation – Network ~ many apps and many “sites” (clients) • Clickstream == Topically segmented JSON event log files • Sense of scale – 10 - 20K events per second – 500M – 1B impressions per day – 25 – 50 GB compressed event log data per day
  • 4. Infrastructure Whys • Why Hadoop? – Experience scaling brute-force log processing via Hadoop • Everybody’s favorite: Akamai edge request logs • EMR, Apache Whirr – Needed online analytics – HBase fit the bill – Apache OSS ecosystem familiar to BV • Why Amazon Web Services? – Existing infrastructure hosting solution too inflexible and slow – Couldn’t scale R&D without an elastic infrastructure
  • 5. High-level architecture • Event collectors in auto-scale groups behind elastic load balancers • Event stream compressed and uploaded hourly to S3 • S3: store of record • Hadoop cluster: – HDFS: stores raw event logs, derived file-based data sets, and HBase HFiles/WALs – Oozie: job scheduling, data dependency management – MapReduce: analytics (mix of Pig, Java => 100% Java) – HBase: stores hourly/daily analytics results • Job Portal: job schedule viz, gap analysis & alerting • UI/API: Analytics available via JSON API and in Backbone.js UI
  • 6. EMR vs. roll our own • Neither • Cons: EMR – Price premium – Opaque Hadoop configuration – No way to mitigate SPOFs • Cons: Roll our own – Small group of engineers, no ops manpower at beginning • Solution: Cloudera – Cloudera Manager for config management and provisioning – CDH 3.X distribution
  • 7. Missteps • Problem: non-HA NameNode • Solution: EBS! • Problem: EC2 MTBF iffy • Solution: EBS! • Reality: When something goes wrong in AWS, it is invariably an outage or degradation in EBS. – Violates the whole concept of data locality. Hadoop + SAN = sadness • Problem: Where should HBase live? • Solution: Co-resident with MapReduce!
  • 8. Where we’ve ended up • Moved to the latest Cloudera CDH 4.X – HA NameNode! – Zookeeper for leader election – Quorum Journal Manager for edit logs • Learn to let go – Mitigate SPOF where possible, but plan for failure – End-to-end automation for DR/migration • Avoid EBS like the plague • HBase and MapReduce segmentation – Enables different hardware step size – Batch processing doesn’t affect HBase response time – Better understanding of HBase/HDFS locality (or lack thereof)
  • 9. Let’s talk sets
  • 10. Let’s talk sets • Common problem: uniques (e.g. unique visitors, users, etc.) • Naïve solution: SELECT DISTINCT(X) FROM Y • Not tenable given: – Massive, semi-structured data set – Thousands of grouping axes • OK: pre-calculate via MapReduce • But… – What would you pre-calculate? – Daily for each grouping? – How would you answer queries for other time ranges? Pre-calculate them, too?
  • 11. Set Unions • Definition: cardinality of a set is the number of elements in that set – A = {1, 2, 3}; |A| = 3 • Cardinality of the union of two sets cannot be determined from the cardinality of the two sets – |A U B| not necessarily equal to |A| + |B| – Only equal if A and B are disjoint – How do you know if they’re disjoint? – You need both sets • Imagine: – Set “a” are the visitors from yesterday – Set “b” are the visitors from today – To get uniques for both days, you have to look at both data sets
  • 12. An entirely different set: bit sets • Translate set members’ identifiers to an index in a bit set • Bit sets are combinable – yahtzee! • HBase is good at storing bits – MapReduce to build bit set for each grouping in your smallest desirable unit of time – Persist w/ row key as a function of date and grouping • Uniques for last month? – Scan: start and stop rows accounting for date range and grouping – Merge each day’s bit set with a single bit set representing the union – Count the number of “on” bits in the merged bit set => cardinality • But… – # bits for items whose identifiers number in the billions? – A billion bits is a lot of bits
  • 13. Bit sets – solving the size problem • 10^9 bits is an expensive way to store a combinable cardinality • Query I/O example: Uniques for last quarter – 120 MB/day * 90 days = 10.8 GB – Too much to pull out of HBase to answer an “online” query • Storage example: 10K different grouping axes – Clients, sites, favorite colors, whatever – 120 MB * 10K = 1.2 TB/day of storage • Possible mitigation: compression – Still need to generate a 120 MB data structure, then compress
  • 14. Cardinality Estimation • Many different approaches to estimate the cardinality of a set – General goal: calculate cardinality in small RAM footprint • Big breakthrough in 2007: the HyperLogLog estimator • What’s the big deal? – Tunable accuracy – Incredible information density – Combinable • Analog: lossy compression of bit sets • How good? – Estimate cardinality of 10^9 unique elements +/-2% in 1.5 KB
  • 15. Nuts & Bolts • http://github.com/clearspring/stream-lib – Java impls of top-K, frequency, and cardinality for streams • Aha moment: combining estimators from distributed counters is no different than combining them across different time periods! • MapReduce algorithm – map(Event) : (key, identifier) • key is whatever grouping you want uniques for – Shuffle sorts all (key, identifier) tuples by key – reduce(key, Iterable<identifier>) : estimator bytes • Reducer simply updates the estimator in-place – tiny RAM footprint
  • 16. Nuts & Bolts • Reducer output: HBase Put • HBase “schema”, e.g. daily uniques aggregated by brand:
        Row Key            Estimator
        brandX-20130101    [0100110111000]
        brandX-20130102    [0110100111000]
        brandX-20130103    [0100000101011]
        brandY-20130101    [0101100011000]
        brandY-20130102    [0100100111001]
    • Scan – brandX, Jan 2-3: rows brandX-20130102 [0110100111000] and brandX-20130103 [0100000101011] merge to [0110100111011] => Cardinality = N
  • 17. Nuts & Bolts • HBase scan is the key to making this fast – First result: instantiate HyperLogLog estimator – Remaining results: update estimator in-place • O(n) to compute result, n ~ number of bits in estimator (1.5KB) • Freedom to build a data set of unique estimators that can be arbitrarily sliced quickly – Quarterly, daily, weekly, ad-hoc date ranges – HBase client pulls 1.5KB * number of days, returns a long – Perf anecdote: REST API call to get network-wide uniques for current month-to-date • 66 ms over the internet • 12 ms server-side latency
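
The exact bit set approach from the "bit sets" slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain dict with sorted keys stands in for an HBase table, Python integers stand in for bit sets, and the names `table`, `row_key`, `record_visit`, `scan` and `uniques` are hypothetical, not part of Magpie or HBase. It shows the row key as a function of grouping and date, a start/stop-row scan, the OR merge, and the final count of "on" bits.

```python
from datetime import date, timedelta

# Hypothetical stand-in for an HBase table: row key -> stored bit set.
# Python ints serve as arbitrarily long bit sets.
table = {}


def row_key(grouping: str, day: date) -> str:
    """Row key as a function of grouping and date, e.g. 'brandX-20130102'."""
    return f"{grouping}-{day:%Y%m%d}"


def record_visit(grouping: str, day: date, visitor_index: int) -> None:
    """Turn on the bit for one visitor in that grouping's bit set for the day."""
    key = row_key(grouping, day)
    table[key] = table.get(key, 0) | (1 << visitor_index)


def scan(start_row: str, stop_row: str):
    """Yield values with start_row <= key < stop_row, in key order."""
    for key in sorted(table):
        if start_row <= key < stop_row:
            yield table[key]


def uniques(grouping: str, first: date, last: date) -> int:
    """Union the daily bit sets for [first, last] and count the 'on' bits."""
    merged = 0
    stop_row = row_key(grouping, last + timedelta(days=1))  # stop row is exclusive
    for bits in scan(row_key(grouping, first), stop_row):
        merged |= bits
    return bin(merged).count("1")


jan1, jan2, jan3 = date(2013, 1, 1), date(2013, 1, 2), date(2013, 1, 3)
for day, visitors in [(jan1, [0, 1, 2]), (jan2, [1, 2, 3]), (jan3, [3, 4])]:
    for visitor in visitors:
        record_visit("brandX", day, visitor)
record_visit("brandY", jan2, 9)

print(uniques("brandX", jan2, jan3))  # visitors 1, 2, 3, 4 -> 4
```

Because the date is encoded as YYYYMMDD, lexicographic key order matches chronological order within a grouping, which is what lets a single contiguous scan cover any date range.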

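The size figures on the "solving the size problem" slide can be checked with simple arithmetic. The sketch below assumes decimal units (1 MB = 10^6 bytes) and reuses the slides' own 120 MB and 1.5 KB figures; the variable names are illustrative only. The last two values apply the same arithmetic to a 1.5 KB estimator, which is a comparison the slides imply but do not spell out.

```python
# Back-of-the-envelope sizes, in decimal units (1 MB = 10**6 bytes).
BITS_PER_DAILY_BIT_SET = 10**9  # one bit per possible identifier

# 10^9 bits is 125 MB decimal (about 119 MiB); the slides use 120 MB.
exact_bit_set_mb = BITS_PER_DAILY_BIT_SET / 8 / 10**6

SLIDE_BIT_SET_MB = 120   # per-day bit set size quoted on the slides
ESTIMATOR_KB = 1.5       # HyperLogLog estimator size quoted on the slides
DAYS_IN_QUARTER = 90
GROUPINGS = 10_000

# Bit sets: data pulled for a quarterly query, and storage written per day.
quarter_query_gb = SLIDE_BIT_SET_MB * DAYS_IN_QUARTER / 1000
daily_storage_tb = SLIDE_BIT_SET_MB * GROUPINGS / 10**6

# Same two quantities if each row holds a 1.5 KB estimator instead.
quarter_query_hll_kb = ESTIMATOR_KB * DAYS_IN_QUARTER
daily_storage_hll_mb = ESTIMATOR_KB * GROUPINGS / 1000

print(quarter_query_gb, daily_storage_tb)          # GB per query, TB per day
print(quarter_query_hll_kb, daily_storage_hll_mb)  # KB per query, MB per day
```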
Editor's notes

  1. August – 2012 Version
  2. A magpie is a bird that suffers an irresistible urge to collect and hoard things. Sense of scale: at our current level of instrumentation and app penetration.
  3. HBase fit the bill… given its storage model and affinity to time-series data, and given its clean, out-of-the-box integration with MapReduce.
  4. I can’t, and therefore don’t, do diagrams. You’re stuck with word-dense slides. Hadoop cluster: of note, we sync S3 to HDFS for optimized job execution and to enable Oozie’s data dependency management. Job Portal: the Oozie web UI is painful to use.
  5. When most people get ready to deploy Hadoop to EC2, they choose between Elastic MapReduce and a custom deployment. CDH distribution: curated, so we don’t have to worry about mixing and matching various Apache component versions.
  6. Non-HA NameNode: even CDH3 was not immune from SPOFs. EC2 MTBF iffy… The Magpie team was definitely not the first to foray into EC2; BV had been using EC2 for quite some time at this point.
  7. Quorum Journal Manager for edit logs: doesn’t push the SPOF further upstream with an NFS NAS solution for shared storage of the edit logs. This system works really well. Leader election is lightning fast, and we haven’t encountered any failures of reads or writes during our “pull the plug” testing. End-to-end automation for DR: by DR, I mean AZ outages, loss of 3+ data nodes, or loss of 2+ “master nodes”. When our SLAs require it, we’ll run an HBase replica in another region, but still treat the MapReduce cluster as expendable. HBase/HDFS locality: Region Server and HFile blocks are not co-resident after a region has been reassigned.
  8. We have a solid Hadoop infrastructure running in AWS, so let’s crunch some big data. Not tenable given… Well, not tenable without a very, very large OLAP data store. We’ve got a Hadoop cluster, though… Pre-calculate them, too? Large, expensive jobs re-processing the same data sets, and a lack of flexibility for the end user.
  9. We have a solid Hadoop infrastructure running in AWS, so let’s crunch some big data. Not tenable given… Well, not tenable without a very, very large OLAP data store. We’ve got a Hadoop cluster, though… Pre-calculate them, too? Large, expensive jobs re-processing the same data sets, and a lack of flexibility for the end user.
  10. Conclusion: need some way to calculate and persist a representation of cardinality for an incremental time period that would not be prohibitive to scan over arbitrary time ranges and combine into a single representation of the cardinality of all the subsets.
  11. Bit sets are combinable… meaning you could take a bit set representation of one day’s cardinality, OR it with another day’s bit set, and have a bit set that would tell you the cardinality of the union of the two days. MapReduce to build… For example, unique users at site XYZ on January 31, 2013. Scan: start and stop… HBase is very good at scans over reasonable sets of data, even without the benefit of block cache, when rows are (a) reasonably narrow, and (b) the ordering of the keys leads to linear reads. A billion bits is a lot of bits… It’s not big data, but it can quickly become big data.
  12. Possible mitigation: compression. Still need to generate a 120 MB data structure in RAM, then compress. Retrieval is non-trivial given decompression costs and heap pressure.
  13. Calculate cardinality in small RAM footprint: e.g. for stream processing. Big breakthrough in 2007: HyperLogLog. New algorithm and representational data structure, from a team of French mathematicians led by Flajolet. Timely: engineers at Google just published a refinement called HLL++ that is more accurate on the low and high end. Combinable… Not unique to HyperLogLog. Analog: lossy compression… BUT: doesn’t require a large intermediate heap and associated CPU cycles for compression.
  14. I don’t peruse the proceedings of math conferences, but I do keep up on Hacker News and highscalability.com. Last April, Matt Abrams of Clearspring wrote a blog post on using HyperLogLog estimators to merge cardinality estimators from a bunch of distributed stream-processing machines.
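
To make the "combinable" property concrete, here is a minimal HyperLogLog sketch in Python. It is not stream-lib's implementation and not the code used in Magpie; the class, its parameter `p`, and the choice of SHA-1 as the hash are assumptions made for illustration. With 2^11 = 2048 registers, the standard error of HyperLogLog is roughly 1.04/sqrt(2048), about 2.3%, and 2048 six-bit registers pack into 1536 bytes, which is consistent with the "+/-2% in 1.5 KB" figure on the slides, although the slides do not state which parameters were used.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog estimator (illustrative sketch only)."""

    def __init__(self, p: int = 11):
        self.p = p                # number of hash bits used as register index
        self.m = 1 << p           # number of registers (2048 for p = 11)
        # One byte per register for simplicity; 6 bits each would suffice.
        self.registers = bytearray(self.m)

    def add(self, item: str) -> None:
        # Take a 64-bit hash value from the first 8 bytes of a SHA-1 digest.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other: "HyperLogLog") -> None:
        """Union with another estimator: element-wise max of the registers."""
        if other.m != self.m:
            raise ValueError("estimators must have the same register count")
        for i, rank in enumerate(other.registers):
            if rank > self.registers[i]:
                self.registers[i] = rank

    def cardinality(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


# Two "daily" estimators with 3000 visitors in common (9000 distinct overall).
day1, day2 = HyperLogLog(), HyperLogLog()
for v in range(0, 6000):
    day1.add(f"visitor-{v}")
for v in range(3000, 9000):
    day2.add(f"visitor-{v}")

both = HyperLogLog()
both.merge(day1)
both.merge(day2)
print(both.cardinality())  # an estimate near 9000, not an exact count
```

Merging takes the per-register maximum, so the merged estimator is identical to one that had seen every identifier directly. That is why estimators built on separate machines, or for separate days, can be combined after the fact without revisiting the raw events.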