Low Latency “OLAP” with HBase
     Cosmin Lehene | Adobe




© 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
What we needed … and built


      OLAP Semantics
      Low Latency Ingestion
      High Throughput
      Real-time Query API




      Not hardcoded to web analytics or x-, y-, z-analytics, but extensible
Building Blocks


      Dimensions, Metrics
      Aggregations
      Roll-up, drill-down, slicing and dicing, sorting




OLAP 101 – Queries example




      Date         Country   City      OS        Browser   Sale
      2012-05-21   USA       NY        Windows   FF        0.0
      2012-05-21   USA       NY        Windows   FF        10.0
      2012-05-22   USA       SF        OSX       Chrome    25.0
      2012-05-22   Canada    Ontario   Linux     Chrome    0.0
      2012-05-23   USA       Chicago   OSX       Safari    15.0

      Totals
            5 visits, 3 days
            2 countries (USA: 4, Canada: 1)
            4 cities (NY: 2, SF: 1)
            3 OS-es (Win: 2, OSX: 2)
            3 browsers (FF: 2, Chrome: 2)
            50.0 in sales (3 sales)


OLAP 101 – Queries example

      Rolling up to country level:
  SELECT COUNT(visits), SUM(sales)
  GROUP BY country

      Country   visits   sales
      USA       4        $50
      Canada    1        $0


      “Slicing” by browser:
  SELECT COUNT(visits), SUM(sales)
  GROUP BY country
  HAVING browser = 'FF'

      Country   visits   sales
      USA       2        $10
      Canada    0        $0


      Top browsers by sales:
  SELECT SUM(sales), COUNT(visits)
  GROUP BY browser
  ORDER BY sales

      Browser   sales   visits
      Chrome    $25     2
      Safari    $15     1
      FF        $10     2
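The three queries above can be reproduced on the five sample visits with a short sketch. This is only an illustration in plain Python; the `VISITS` list and the `group_by` helper are names assumed here and are not part of SaasBase.

```python
from collections import defaultdict

# The five sample visits from the table: (date, country, city, os, browser, sale)
VISITS = [
    ("2012-05-21", "USA", "NY", "Windows", "FF", 0.0),
    ("2012-05-21", "USA", "NY", "Windows", "FF", 10.0),
    ("2012-05-22", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-22", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-23", "USA", "Chicago", "OSX", "Safari", 15.0),
]
COLUMNS = ("date", "country", "city", "os", "browser", "sale")


def group_by(rows, dimension, where=None):
    """Return {dimension value: (visit count, sales sum)} for matching rows."""
    idx = COLUMNS.index(dimension)
    result = defaultdict(lambda: [0, 0.0])
    for row in rows:
        if where is not None and not where(row):
            continue
        result[row[idx]][0] += 1      # COUNT(visits)
        result[row[idx]][1] += row[5]  # SUM(sales)
    return {key: tuple(value) for key, value in result.items()}


# Roll-up, slice and top-n, mirroring the three queries.
rollup = group_by(VISITS, "country")
ff_slice = group_by(VISITS, "country", where=lambda r: r[4] == "FF")
top_browsers = sorted(group_by(VISITS, "browser").items(),
                      key=lambda kv: kv[1][1], reverse=True)
```

In this sketch the filter drops groups with no matching rows, so Canada is simply absent from `ff_slice` instead of appearing with zero visits.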

OLAP – Runtime Aggregation vs. Pre-aggregation


      Aggregate at runtime
            Most flexible
            Fast – scatter gather
            Space efficient
      But
            I/O, CPU intensive
            Slow for larger data
            Low throughput

      Pre-aggregate
            Fast
            Efficient – O(1)
            High throughput
      But
            More effort to process (latency)
            Combinatorial explosion (space)
            No flexibility
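The trade-off can be made concrete with a small sketch. Everything here (the `EVENTS` sample, `runtime_count`, `pre_aggregate`) is hypothetical and only meant to show the shape of each approach: a full scan per query versus one lookup per query, paid for with one counter per dimension group.

```python
from collections import Counter
from itertools import combinations

EVENTS = [
    {"country": "USA", "browser": "FF", "os": "Windows"},
    {"country": "USA", "browser": "Chrome", "os": "OSX"},
    {"country": "Canada", "browser": "Chrome", "os": "Linux"},
]
DIMENSIONS = ("country", "browser", "os")


def runtime_count(events, dims, values):
    """Runtime aggregation: scan every event for each query."""
    return sum(1 for e in events if tuple(e[d] for d in dims) == values)


def pre_aggregate(events, dimension_groups):
    """Pre-aggregation: one counter per dimension group, built at ingestion."""
    cubes = {dims: Counter() for dims in dimension_groups}
    for e in events:
        for dims, counter in cubes.items():
            counter[tuple(e[d] for d in dims)] += 1
    return cubes


# Every non-empty subset of dimensions is a possible group: 2^n - 1 of them.
all_groups = [g for n in range(1, len(DIMENSIONS) + 1)
              for g in combinations(DIMENSIONS, n)]
cubes = pre_aggregate(EVENTS, all_groups)

# A pre-aggregated query is a single dictionary lookup.
usa_chrome = cubes[("country", "browser")][("USA", "Chrome")]
```

With 3 dimensions there are 7 groups; with 20 there would be over a million, which is the combinatorial explosion in the right-hand column.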




Pre-aggregation

      Data needs to be summarized
            Can’t visualize 1B data points (no, not even with Retina display)
            Difficult to comprehend correlations among more than 3 dimensions


      Not all dimension groups are relevant
            Index on a needed basis (view selection problem)


      Runtime aggregation == TeraSort for every query?
            Pre-aggregate to reduce cardinality




SaasBase

      We tune both:
            pre-aggregation level (ingestion speed + space)
            vs.
            runtime post-aggregation (query speed)


      Think materialized views from RDBMS




SaasBase Domain Model Mapping




SaasBase - Domain Model Mapping




SaasBase - Ingestion, Processing, Indexing, Querying




SaasBase - Ingestion, Processing, Indexing, Querying




Ingestion




Ingestion throughput vs. latency


      Historical data (large batches)
            Optimize for throughput
      Increments (latest data, smaller)
            Optimize for latency




Large, granular input strategies

      Slow listing in HDFS
            Archive processed files


      Filtering input
            FileDateFilter (log name patterns: log-YYYY-MM-dd-HH.log)
            TableInputFormat start/stop row
            File Index in HBase (track processed/new files)


      Map tasks overhead - stitching input splits
            400K files => 400K map tasks => overhead, slow reduce copy
            CombineFileInputFormat – 2GB-splits => 500 splits for 1TB
            FixedMappersTableInputFormat (e.g. 5-region splits)
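The effect of stitching small files into larger splits can be sketched as a greedy packing. This is not how Hadoop's CombineFileInputFormat is implemented (it also takes node and rack locality into account); `combine_splits` and the generated file list are assumptions made for illustration.

```python
def combine_splits(file_sizes, max_split_bytes):
    """Greedily pack files, in order, into splits no larger than max_split_bytes.

    A file larger than the limit gets a split of its own.
    """
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


GB = 1024 ** 3
# Hypothetical input: 400K equally sized log files adding up to about 1 TB.
files = [(f"log-{i:06d}.log", (1024 * GB) // 400_000) for i in range(400_000)]
splits = combine_splits(files, 2 * GB)
```

For this input the packing yields roughly 500 splits, and therefore roughly 500 map tasks instead of 400K.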
Ingestion – Bulk Import

      HFileOutputFormat (HFOF)
            100s X faster than HBase API
            No need to recover from failed jobs
            No unnecessary load on machines




  * No shuffle - global reduce order required!
            e.g. first reduce key needs to be in the first region, last one in the last region
            Watch for uneven partitions
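Global reduce order means that each reducer receives exactly one region's key range. A minimal sketch of such a range partitioner follows; in Hadoop this role is typically played by a total-order partitioner, and the `region_partition` function and split points below are hypothetical.

```python
from bisect import bisect_right


def region_partition(row_key, split_points):
    """Return the index of the region (and reducer) that owns row_key.

    split_points are the sorted start keys of regions 1..n-1; region 0
    starts at the empty key. A key equal to a split point belongs to the
    region that starts there.
    """
    return bisect_right(split_points, row_key)


# Hypothetical table with three regions split at these row keys.
SPLIT_POINTS = [b"2011-01-01", b"2012-01-01"]

keys = [b"2012-05-22/USA", b"2010-12-04/USA", b"2011-06-30/Canada"]
partitions = {k: region_partition(k, SPLIT_POINTS) for k in keys}
```

If most keys fall after the last split point, the last reducer does most of the work, which is the uneven partition problem mentioned above.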


HFOF – FileSizeDatePartitioner

      1 partition(reduce) / day for initial import
      Uneven reduce (partitions) due to data growth over time
            Reduce k: 2010-12-04 = 500MB
            Reduce n: 2012-05-22 = 5GB => slow and will result in a 5GB region




      Balance reduce buckets based on input file sizes and the reduce key
      Generate sub-partitions based on predefined size (e.g. 1GB)
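The slides do not show the partitioner itself, so the following is only a guess at the idea: derive the number of reduce buckets per day from that day's input size. `sub_partitions` is a hypothetical helper, and how keys within a day are assigned to a bucket (it would have to be by key range to preserve global order) is left out.

```python
import math


def sub_partitions(day_sizes, target_bytes):
    """Map each day to a number of reduce buckets based on its input size.

    Returns a list of (day, bucket_index) pairs, one per reducer, in key order.
    """
    buckets = []
    for day in sorted(day_sizes):
        count = max(1, math.ceil(day_sizes[day] / target_bytes))
        buckets.extend((day, i) for i in range(count))
    return buckets


MB, GB = 1024 ** 2, 1024 ** 3
# Sizes taken from the example above.
day_sizes = {"2010-12-04": 500 * MB, "2012-05-22": 5 * GB}
buckets = sub_partitions(day_sizes, 1 * GB)
```

Here the 500MB day keeps a single reducer while the 5GB day is spread over five, so no reducer (and no resulting region) is much larger than the 1GB target.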

Processing




Processing



      Processing involves reading the input (files, tables, events),
       pre-aggregating it (reducing cardinality) and generating tables
       that can be queried in real-time
            1 year: 1B events => 100B data points indexed
            Query => scan 365 data points (e.g. daily page views)




      Processing could be either MR or real-time (e.g. Storm)




Processing for OLAP semantics

            GROUP BY (process, query)
            COUNT, SUM, AVG, etc. (process, query)
            SORT (process, query)
            HAVING (mostly query, can define pre-process constraints)




SaasBase vs. SQL Views Comparison




reports.json entities definition




Processing Performance

      read, map, partition, combine, copy, sort, reduce, write


      Read:
            Scan.setCaching() (I/O ~ buffer)
            Scan.setBatch() (avoid timeouts for abnormal input, e.g. 1M hits/visit)
            Even region distribution across cluster (distributes CPU, I/O)
      Map:
            No unnecessary transformations: Bytes.toString(bytes) + Bytes.toBytes(string) (CPU)
            Avoid GC : new X() (CPU, Memory)
            Avoid system calls (context switching)
            Stripping unnecessary data (I/O)


Processing Performance

      Hot (in memory) vs. Cold (on disk, on network) data
            Minimize I/O from disk/network


      Single shot MR job: SuperProcessor
            Emit all groups from one map() call


      Incremental processing
            Data format YYYY-MM-DD prefixed rowkey (HH:mm for more granularity)
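A single-pass job combined with date-prefixed row keys could look roughly like this. The key layout, `GROUPS`, `map_record` and `process` are all assumptions for illustration; the actual SuperProcessor is not shown in the slides.

```python
from collections import Counter

# Hypothetical dimension groups to pre-aggregate.
GROUPS = [("country",), ("country", "city"), ("browser",)]


def map_record(record):
    """Emit one (row key, 1) pair per dimension group for a single record."""
    day = record["date"]  # YYYY-MM-DD prefix keeps each day's rows contiguous
    for dims in GROUPS:
        name = ",".join(dims)
        values = "/".join(record[d] for d in dims)
        yield f"{day}/{name}/{values}", 1


def process(records):
    """Stand-in for the reduce step: sum the emitted counts per row key."""
    table = Counter()
    for record in records:
        for key, value in map_record(record):
            table[key] += value
    return table


records = [
    {"date": "2012-05-21", "country": "USA", "city": "NY", "browser": "FF"},
    {"date": "2012-05-21", "country": "USA", "city": "NY", "browser": "FF"},
    {"date": "2012-05-22", "country": "USA", "city": "SF", "browser": "Chrome"},
]
table = process(records)
# An incremental run for a new day only touches keys with that day's prefix.
may_22_keys = sorted(k for k in table if k.startswith("2012-05-22/"))
```

Each input record is read once and contributes to every group, and the date prefix means a later incremental run can write new rows without rewriting older days.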




Indexing




HBase natural order: hierarchical representation




Indexing - Why

      Example: top 10 cities
            ~50K [country, city] combinations per day
            Top 10 cities for 1 year =>
            365 (days) X 50K ~= 18M data points scanned
            If you add gender => ~36M
            If you add Device, OS, Browser …


      Might compress well, but think about the environment
      How much energy would you spend for just top 10 cities?



                                                                              * Image from: http://my.neutralexistence.com/images/Green-Earth.jpg


Indexing with HBase “10” < “2”

  GROUP BY year, month, country, city ORDER BY visits DESC LIMIT 10

      Lexicographic sorting

  2012/05/USA/0000000000/
  2012/05/USA/4294961296/San Francisco                                                        = 1000 visits*
  2012/05/USA/4294961396/New York                                                             = 900 visits*
  . . .
  2012/05/USA/9999999999/

      scan 't', {STARTROW => '2012/05/USA/', LIMIT => 10}

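  * Padding numbers for lexicographic sorting: counts are subtracted from a fixed maximum and
    zero-padded, so larger counts sort first, e.g. 1000 -> Long.MAX_VALUE – 1000 = 9223372036854774807
    (the 10-digit keys above are shortened for illustration)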
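The inverted, zero-padded count can be sketched as follows. The 10-digit width mirrors the key bounds shown above, but `WIDTH`, `index_key` and `top_n` are assumptions for illustration, and `sorted()` stands in for HBase's natural row order.

```python
WIDTH = 10
MAX_COUNT = 10 ** WIDTH - 1  # 9999999999, the largest 10-digit value


def index_key(prefix, visits, city):
    """Row key whose lexicographic order is descending by visits."""
    inverted = MAX_COUNT - visits
    return f"{prefix}/{inverted:0{WIDTH}d}/{city}"


def top_n(sorted_keys, prefix, n):
    """Emulate a scan from start row `prefix` limited to n rows."""
    matches = [k for k in sorted_keys if k.startswith(prefix)]
    return [k.rsplit("/", 1)[1] for k in matches[:n]]


visits = {"New York": 900, "San Francisco": 1000, "Chicago": 10, "Boston": 2}
# HBase keeps rows sorted by key; sorted() plays that role here.
rows = sorted(index_key("2012/05/USA", v, c) for c, v in visits.items())
top_two = top_n(rows, "2012/05/USA/", 2)
```

Without the padding and inversion, the string "10" would sort before "2", and the largest counts would come last, so a limited scan could not return the top entries.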


Query Engine

      Always reads indexed, compact data
      Query parsing
      Scan strategy
            Single vs. multiple scans
            Start/stop rows (prefixes, index positions, etc.)
            Index selection (volatile indexes with incremental processing)
      Deserialization
      Post-aggregation, sorting, fuzzy-sorting etc.
      Paging
      Custom dimension/metric class loading
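One piece of the scan strategy, turning a key prefix into start/stop rows, can be sketched like this. The helper names are hypothetical and the in-memory `scan` only emulates a range scan over sorted row keys; `limit` hints at how paging could be layered on top.

```python
from bisect import bisect_left


def prefix_stop_row(prefix):
    """Smallest key greater than every key starting with prefix.

    Trailing 0xFF bytes are dropped and the last remaining byte is
    incremented. Returns b"" (scan to the end) if no such key exists.
    """
    stripped = prefix.rstrip(b"\xff")
    if not stripped:
        return b""
    return stripped[:-1] + bytes([stripped[-1] + 1])


def scan(sorted_keys, start_row, stop_row, limit=None):
    """Emulate a range scan over sorted row keys: [start_row, stop_row)."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row) if stop_row else len(sorted_keys)
    rows = sorted_keys[lo:hi]
    return rows if limit is None else rows[:limit]


KEYS = sorted([b"2012/04/USA/a", b"2012/05/Canada/a", b"2012/05/USA/a",
               b"2012/05/USA/b", b"2012/06/USA/a"])
start = b"2012/05/USA/"
usa_may = scan(KEYS, start, prefix_stop_row(start))
```

Because only the rows between the start and stop keys are read, the cost of the query depends on the size of the indexed range and not on the size of the table.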




Conclusions

      OLAP semantics on a simple data model
            Data as first class citizen
            Domain Specific “Language” for Dimensions, Metrics, Aggregations
      Tunable performance, resource allocation
      Framework for vertical analytics systems




Thank you!
                                                               Cosmin Lehene @clehene

                                                               http://hstack.org
                                                                               Credits:
                                                                              Andrei Dragomir
                                                                              Adrian Muraru
                                                                               Andrei Dulvac
                                                                              Raluca Podiuc
                                                                               Tudor Scurtu
                                                                              Bogdan Dragu
                                                                               Bogdan Drutu

OLAP 101 - Rollup

      Country   Visits   Sale
      USA       4        $50
      Canada    1        $0




      Rollup: SELECT COUNT(visits), SUM(sales) GROUP BY country




OLAP 101 - Slicing

      Date         Country   City      OS        Browser   Sale
      2012-03-02   USA       NY        Windows   FF        0.0
      2012-03-02   USA       NY        Windows   FF        10.0
      2012-03-03   USA       SF        OSX       Chrome    25.0
      2012-03-03   Canada    Ontario   Linux     Chrome    0.0
      2012-03-04   USA       Chicago   OSX       Safari    15.0

      Totals
            5 visits, 3 days
            2 countries (USA: 4, Canada: 1)
            4 cities (NY: 2, SF: 1)
            3 OS-es (Win: 2, OSX: 2)
            3 browsers (FF: 2, Chrome: 2)
            50.0 in sales (3 sales)

      Filter or Segment or Slice (WHERE or HAVING)




OLAP 101 – Sorting, TOP n

      Browser   Sale
      Chrome    $25
      Safari    $15
      Firefox   $10




      SELECT SUM(sales) as total GROUP BY browser ORDER BY total




How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 

Último (20)

A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 

Low Latency “OLAP” with HBase - HBaseCon 2012

  • 1. Low Latency “OLAP” with HBase Cosmin Lehene | Adobe © 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
  • 2. What we needed … and built
      - OLAP semantics
      - Low-latency ingestion
      - High throughput
      - Real-time query API
      - Not hardcoded to web analytics or x-, y-, z-analytics, but extensible
  • 3. Building Blocks
      - Dimensions, metrics
      - Aggregations
      - Roll-up, drill-down, slicing and dicing, sorting
  • 4. OLAP 101 – Queries example

      Date        Country  City     OS       Browser  Sale
      2012-05-21  USA      NY       Windows  FF        0.0
      2012-05-21  USA      NY       Windows  FF       10.0
      2012-05-22  USA      SF       OSX      Chrome   25.0
      2012-05-22  Canada   Ontario  Linux    Chrome    0.0
      2012-05-23  USA      Chicago  OSX      Safari   15.0

      Totals: 5 visits over 3 days; 2 countries (USA: 4, Canada: 1); 4 cities (NY: 2, SF: 1); 3 OS-es (Win: 2, OSX: 2); 3 browsers (FF: 2, Chrome: 2); sales 50.0 (3 sales)
  • 5. OLAP 101 – Queries example
      - Rolling up to country level:
            SELECT COUNT(visits), SUM(sales)
            GROUP BY country

            Country  visits  sales
            USA      4       $50
            Canada   1       0

      - “Slicing” by browser:
            SELECT COUNT(visits), SUM(sales)
            GROUP BY country
            HAVING browser = “FF”

            Country  visits  sales
            USA      2       $10
            Canada   0       0

      - Top browsers by sales:
            SELECT SUM(sales), COUNT(visits)
            GROUP BY browser
            ORDER BY sales

            Browser  sales  visits
            Chrome   $25    2
            Safari   $15    1
            FF       $10    2
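
The three queries on this slide can be reproduced over the five sample visits with a few lines of plain Java. This is only a sketch of the semantics (roll-up, slice, top n), not SaasBase code; the class and method names (`OlapQueriesSketch`, `groupBy`, `slice`, `topBySales`) are made up for illustration. One difference from the slide: a country with no matching rows (Canada in the FF slice) is simply absent from the result rather than listed with zeros.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Illustrative only: names below are hypothetical, not from the talk.
final class OlapQueriesSketch {
    /** One row of the sample table; a sale of 0.0 is a visit without a purchase. */
    static final class Visit {
        final String date, country, city, os, browser;
        final double sale;

        Visit(String date, String country, String city, String os, String browser, double sale) {
            this.date = date; this.country = country; this.city = city;
            this.os = os; this.browser = browser; this.sale = sale;
        }
    }

    /** Metrics for one group: COUNT(visits) and SUM(sales). */
    static final class Totals {
        long visits;
        double sales;
    }

    static final List<Visit> VISITS = Arrays.asList(
        new Visit("2012-05-21", "USA", "NY", "Windows", "FF", 0.0),
        new Visit("2012-05-21", "USA", "NY", "Windows", "FF", 10.0),
        new Visit("2012-05-22", "USA", "SF", "OSX", "Chrome", 25.0),
        new Visit("2012-05-22", "Canada", "Ontario", "Linux", "Chrome", 0.0),
        new Visit("2012-05-23", "USA", "Chicago", "OSX", "Safari", 15.0));

    /** Roll-up: group rows by one dimension and aggregate both metrics. */
    static Map<String, Totals> groupBy(List<Visit> visits, Function<Visit, String> dimension) {
        Map<String, Totals> groups = new LinkedHashMap<>();
        for (Visit v : visits) {
            Totals t = groups.computeIfAbsent(dimension.apply(v), k -> new Totals());
            t.visits++;
            t.sales += v.sale;
        }
        return groups;
    }

    /** Slice: keep only the rows matching a filter, before grouping. */
    static List<Visit> slice(List<Visit> visits, Predicate<Visit> filter) {
        List<Visit> kept = new ArrayList<>();
        for (Visit v : visits) {
            if (filter.test(v)) kept.add(v);
        }
        return kept;
    }

    /** Top n: group keys ordered by summed sales, largest first. */
    static List<String> topBySales(Map<String, Totals> groups, int n) {
        List<String> keys = new ArrayList<>(groups.keySet());
        keys.sort((a, b) -> Double.compare(groups.get(b).sales, groups.get(a).sales));
        return keys.subList(0, Math.min(n, keys.size()));
    }
}
```

For example, `groupBy(VISITS, v -> v.country)` is the roll-up, `groupBy(slice(VISITS, v -> v.browser.equals("FF")), v -> v.country)` is the slice, and `topBySales(groupBy(VISITS, v -> v.browser), 3)` gives the browser ranking.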
  • 6. OLAP – Runtime Aggregation vs. Pre-aggregation
      - Aggregate at runtime
          - Most flexible
          - Fast – scatter gather
          - Space efficient
          - But: I/O, CPU intensive; slow for larger data; low throughput
      - Pre-aggregate
          - Fast
          - Efficient – O(1)
          - High throughput
          - But: more effort to process (latency); combinatorial explosion (space); no flexibility
  • 7. Pre-aggregation
      - Data needs to be summarized
      - Can’t visualize 1B data points (no, not even with Retina display)
      - Difficult to comprehend correlations among more than 3 dimensions
      - Not all dimension groups are relevant
      - Index on a needed basis (view selection problem)
      - Runtime aggregation == TeraSort for every query?
      - Pre-aggregate to reduce cardinality
  • 8. SaasBase
      - We tune both
          - pre-aggregation level vs. runtime post-aggregation
          - (ingestion speed + space) vs. (query speed)
      - Think materialized views from RDBMS
  • 9. SaasBase Domain Model Mapping
  • 10. SaasBase - Domain Model Mapping
  • 11. SaasBase - Ingestion, Processing, Indexing, Querying
  • 12. SaasBase - Ingestion, Processing, Indexing, Querying
  • 13. Ingestion
  • 14. Ingestion throughput vs. latency
      - Historical data (large batches): optimize for throughput
      - Increments (latest data, smaller): optimize for latency
  • 15. Large, granular input strategies
      - Slow listing in HDFS
          - Archive processed files
      - Filtering input
          - FileDateFilter (log name patterns: log-YYYY-MM-dd-HH.log)
          - TableInputFormat start/stop row
          - File Index in HBase (track processed/new files)
      - Map tasks overhead - stitching input splits
          - 400K files => 400K map tasks => overhead, slow reduce copy
          - CombineFileInputFormat – 2GB-splits => 500 splits for 1TB
          - FixedMappersTableInputFormat (e.g. 5-region splits)
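
A rough sketch of two ideas from this slide: filtering input files by the timestamp embedded in their names, and the split arithmetic behind "500 splits for 1TB". The real FileDateFilter is not shown in the deck, so `InputSelectionSketch` below is an assumed, Hadoop-free stand-in. It relies on the fact that zero-padded `YYYY-MM-dd-HH` strings sort lexicographically in time order, and it uses decimal units (1 TB = 10^12 bytes) so the numbers match the slide.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the deck's FileDateFilter; not the original code.
final class InputSelectionSketch {
    // Matches names such as log-2012-05-22-13.log and captures the timestamp part.
    private static final Pattern LOG_NAME =
        Pattern.compile("log-(\\d{4}-\\d{2}-\\d{2}-\\d{2})\\.log");

    /**
     * Accepts a file if its embedded hour falls in [fromInclusive, toExclusive).
     * Bounds use the same YYYY-MM-dd-HH format, so plain string comparison
     * is enough: zero-padded timestamps sort in chronological order.
     */
    static boolean accept(String fileName, String fromInclusive, String toExclusive) {
        Matcher m = LOG_NAME.matcher(fileName);
        if (!m.matches()) {
            return false; // not a log file we know how to date
        }
        String hour = m.group(1);
        return hour.compareTo(fromInclusive) >= 0 && hour.compareTo(toExclusive) < 0;
    }

    /** Number of combined splits needed to cover totalBytes (rounded up). */
    static long combinedSplits(long totalBytes, long splitBytes) {
        return (totalBytes + splitBytes - 1) / splitBytes;
    }
}
```

With these assumptions, `combinedSplits(1_000_000_000_000L, 2_000_000_000L)` gives the 500 splits quoted above, compared with one map task per file when 400K small files are read individually.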
  • 16. Ingestion – Bulk Import
      - HFileOutputFormat (HFOF)
          - 100s X faster than HBase API
          - No need to recover from failed jobs
          - No unnecessary load on machines
      - * No shuffle - global reduce order required!
          - e.g. first reduce key needs to be in the first region, last one in the last region
      - Watch for uneven partitions
  • 17. HFOF – FileSizeDatePartitioner
      - 1 partition (reduce) / day for initial import
      - Uneven reduce (partitions) due to data growth over time
          - Reduce k: 2010-12-04 = 500MB
          - Reduce n: 2012-05-22 = 5GB => slow and will result in a 5GB region
      - Balance reduce buckets based on input file sizes and the reduce key
      - Generate sub-partitions based on predefined size (e.g. 1GB)
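
The sizing step of the partitioner can be sketched as follows, under stated assumptions: the input is a map from day to input bytes, the target size is 1 GB in decimal units, and each day receives a contiguous block of reduce partitions so that the global key order required by HFileOutputFormat is preserved. `DateSizePartitionPlan` is a hypothetical name; how keys are routed to a sub-partition within a day is not described in the deck and is left out here.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical planning helper; the deck's FileSizeDatePartitioner is not shown.
final class DateSizePartitionPlan {
    /** Sub-partitions for one day: ceil(dayBytes / targetBytes), at least 1. */
    static int subPartitions(long dayBytes, long targetBytes) {
        return (int) Math.max(1L, (dayBytes + targetBytes - 1) / targetBytes);
    }

    /**
     * Maps each day, in sorted order, to the index of its first reduce
     * partition. Days get contiguous index ranges, so partition order
     * follows key (date) order.
     */
    static SortedMap<String, Integer> firstPartitionPerDay(
            SortedMap<String, Long> bytesPerDay, long targetBytes) {
        SortedMap<String, Integer> plan = new TreeMap<>();
        int next = 0;
        for (Map.Entry<String, Long> day : bytesPerDay.entrySet()) {
            plan.put(day.getKey(), next);
            next += subPartitions(day.getValue(), targetBytes);
        }
        return plan;
    }

    /** Total number of reducers the plan needs. */
    static int totalPartitions(SortedMap<String, Long> bytesPerDay, long targetBytes) {
        int total = 0;
        for (long bytes : bytesPerDay.values()) {
            total += subPartitions(bytes, targetBytes);
        }
        return total;
    }
}
```

With the slide's numbers and a 1 GB target, the 500MB day keeps a single reducer while the 5GB day is spread over five, so no single reduce task (or resulting region) is ten times larger than the others.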
  • 18. Processing
  • 19. Processing
      - Processing involves reading the input (files, tables, events), pre-aggregating it (reducing cardinality) and generating tables that can be queried in real time
      - 1 year: 1B events => 100B data points indexed
      - Query => scan 365 data points (e.g. daily page views)
      - Processing could be either MR or real-time (e.g. Storm)
  • 20. Processing for OLAP semantics
      - GROUP BY (process, query)
      - COUNT, SUM, AVG, etc. (process, query)
      - SORT (process, query)
      - HAVING (mostly query, can define pre-process constraints)
  • 21. SaasBase vs. SQL Views Comparison
  • 22. reports.json entities definition
  • 23. Processing Performance
      - read, map, partition, combine, copy, sort, reduce, write
      - Read:
          - Scan.setCaching() (I/O ~ buffer)
          - Scan.setBatch() (avoid timeouts for abnormal input, e.g. 1M hits/visit)
          - Even region distribution across cluster (distributes CPU, I/O)
      - Map:
          - No unnecessary transformations: Bytes.toString(bytes) + Bytes.toBytes(string) (CPU)
          - Avoid GC: new X() (CPU, memory)
          - Avoid system calls (context switching)
          - Stripping unnecessary data (I/O)
  • 24. Processing Performance
      - Hot (in memory) vs. cold (on disk, on network) data
      - Minimize I/O from disk/network
      - Single shot MR job: SuperProcessor
      - Emit all groups from one map() call
      - Incremental processing
      - Data format YYYY-MM-DD prefixed rowkey (HH:mm for more granularity)
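
Why a date-prefixed rowkey helps incremental processing can be shown with a sorted map standing in for an HBase table: because rows are ordered by key, one day's data is a contiguous range that a start/stop row pair selects without touching older rows. This is an assumed illustration, not the SaasBase key layout; the `/` separator and the names `DatePrefixedKeys`, `rowKey` and `scanDays` are invented here. A `TreeMap<String, ...>` orders ASCII keys the same way HBase orders their bytes.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative stand-in: a TreeMap plays the role of a sorted HBase table.
final class DatePrefixedKeys {
    /** Builds a rowkey whose leading component is the day (YYYY-MM-DD). */
    static String rowKey(String day, String rest) {
        return day + "/" + rest;
    }

    /**
     * Returns the rows for days in [startDay, stopDayExclusive), analogous
     * to a scan with start and stop rows. Only the requested range is read.
     */
    static SortedMap<String, Long> scanDays(
            TreeMap<String, Long> table, String startDay, String stopDayExclusive) {
        return table.subMap(startDay, stopDayExclusive);
    }
}
```

An incremental run that only needs 2012-05-22 would call `scanDays(table, "2012-05-22", "2012-05-23")`, and adding `HH:mm` after the date narrows the range further in the same way.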
  • 25. Indexing
  • 26. HBase natural order: hierarchical representation
  • 27. Indexing - Why
      - Example: top 10 cities
      - ~50K [country, city] combinations per day
      - Top 10 cities for 1 year => 365 (days) X 50K ~= 15M data points scanned
      - If you add gender => 30M
      - If you add Device, OS, Browser …
      - Might compress well, but think about the environment
      - How much energy would you spend for just top 10 cities?
      * Image from: http://my.neutralexistence.com/images/Green-Earth.jpg
  • 28. Indexing with HBase
      - Lexicographic sorting: “10” < “2”
      - GROUP BY year, month, country, city ORDER BY visits DESC LIMIT 10
            2012/05/USA/0000000000/
            2012/05/USA/4294961296/San Francisco = 1000 visits*
            2012/05/USA/4294961396/New York = 900 visits*
            ...
            2012/05/USA/9999999999/
      - scan “t” startrow => “2012/05/USA/”, limit => 10
      * Padding numbers for lexicographic sorting: store MAX – count, zero-padded to a fixed width, so that larger counts sort first (1000 -> MAX – 1000). With MAX = Long.MAX_VALUE = 9223372036854775807 real keys are 19 digits wide; the 10-digit keys above are illustrative.
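
The inverted, padded key trick can be sketched in plain Java, again with a `TreeMap` standing in for the sorted HBase table. The key layout (`prefix + inverted count + "/" + city`) follows the slide; the class and method names are hypothetical, and the sketch uses the actual `Long.MAX_VALUE` with 19-digit padding rather than the 10-digit values printed on the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// Illustrative sketch of a "top n" index; not the SaasBase implementation.
final class TopNIndexSketch {
    /**
     * Inverts a non-negative count so that larger counts produce smaller
     * strings, zero-padded to 19 digits (the width of Long.MAX_VALUE).
     */
    static String invert(long count) {
        return String.format(Locale.ROOT, "%019d", Long.MAX_VALUE - count);
    }

    /** Index rowkey: group prefix, then inverted metric, then the dimension value. */
    static String indexKey(String prefix, long visits, String city) {
        return prefix + invert(visits) + "/" + city;
    }

    /** Reads the first `limit` cities under a prefix, i.e. the top cities by visits. */
    static List<String> top(TreeMap<String, Long> index, String prefix, int limit) {
        List<String> cities = new ArrayList<>();
        for (String key : index.tailMap(prefix).keySet()) {
            if (!key.startsWith(prefix) || cities.size() == limit) {
                break; // left the prefix range or collected enough rows
            }
            cities.add(key.substring(key.lastIndexOf('/') + 1));
        }
        return cities;
    }
}
```

Because the ordering is baked into the key at processing time, the query side reads only `limit` rows from the start of the prefix instead of scanning and sorting every city.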
  • 29. Query Engine
      - Always reads indexed, compact data
      - Query parsing
      - Scan strategy
          - Single vs. multiple scans
          - Start/stop rows (prefixes, index positions, etc.)
          - Index selection (volatile indexes with incremental processing)
      - Deserialization
      - Post-aggregation, sorting, fuzzy-sorting etc.
      - Paging
      - Custom dimension/metric class loading
  • 30. Conclusions
      - OLAP semantics on a simple data model
      - Data as first class citizen
      - Domain Specific “Language” for Dimensions, Metrics, Aggregations
      - Tunable performance, resource allocation
      - Framework for vertical analytics systems
  • 31. Thank you!
      Cosmin Lehene, @clehene, http://hstack.org
      Credits: Andrei Dragomir, Adrian Muraru, Andrei Dulvac, Raluca Podiuc, Tudor Scurtu, Bogdan Dragu, Bogdan Drutu
  • 33. OLAP 101 - Rollup

      Country  Visits  Sale
      USA      4       $50
      Canada   1       $0

      - Rollup: SELECT COUNT(visits), SUM(sales) GROUP BY country
  • 34. OLAP 101 - Slicing

      Date        Country  City     OS       Browser  Sale
      2012-03-02  USA      NY       Windows  FF        0.0
      2012-03-02  USA      NY       Windows  FF       10.0
      2012-03-03  USA      SF       OSX      Chrome   25.0
      2012-03-03  Canada   Ontario  Linux    Chrome    0.0
      2012-03-04  USA      Chicago  OSX      Safari   15.0

      Totals: 5 visits over 3 days; 2 countries (USA: 4, Canada: 1); 4 cities (NY: 2, SF: 1); 3 OS-es (Win: 2, OSX: 2); 3 browsers (FF: 2, Chrome: 2); sales 50.0 (3 sales)

      - Filter or Segment or Slice (WHERE or HAVING)
  • 35. OLAP 101 – Sorting, TOP n

      Browser  Sale
      Chrome   $25
      Safari   $15
      Firefox  $10

      - SELECT SUM(sales) as total GROUP BY browser ORDER BY total

Editor's notes

  1. How many HBase users?
  2. Data as first class citizen
  3. Check contrast on projector
  4. Just like speed vs. space in general CS/algorithms. Queries always hit indexes.
  5. Dimensions – read, transform, serialize, deserialize data attributes. Metrics – read/transform/aggregate/serialize. Constraints: ingestion filtering. Report: instrument dimensions groups + metrics with aggregations, sorting.
  6. QUERY ENGINE -> INDEX (always realtime)
  7. Initial import/process and NEW reports (not covered) on historical data
  8. 18K regions, upgrade to 0.92
  9. Diagram: HARD TO DIGEST (TOO MUCH INFO, TOO CONDENSED)
  10. Process = aggregate, generate indexes (natural). Query = uses indexes, can do extra aggregation.
  11. LEFT: report definition, NOT a QUERY. LIKE A VIEW - CREATED - THEN QUERIED
  12. Inconsistent
  13. Rowkey = dimensions group -> metrics (right)
  14. GO BACK to EXPLAIN
  15. >100K/sec/thread, REALTIME
  16. Data analysts work with familiar concepts