Low Latency “OLAP” with HBase
     Cosmin Lehene | Adobe




© 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
What we needed … and built


      OLAP Semantics
      Low Latency Ingestion
      High Throughput
      Real-time Query API




      Not hardcoded to web analytics or x-, y-, z- analytics, but extensible
Building Blocks


      Dimensions, Metrics
      Aggregations
      Roll-up, drill-down, slicing and dicing, sorting




OLAP 101 – Queries example




      Date         Country   City      OS        Browser   Sale
      2012-05-21   USA       NY        Windows   FF         0.0
      2012-05-21   USA       NY        Windows   FF        10.0
      2012-05-22   USA       SF        OSX       Chrome    25.0
      2012-05-22   Canada    Ontario   Linux     Chrome     0.0
      2012-05-23   USA       Chicago   OSX       Safari    15.0

      Summary:
            5 visits over 3 days
            2 countries: USA: 4, Canada: 1
            4 cities: NY: 2, SF: 1
            3 OS-es: Win: 2, OSX: 2
            3 browsers: FF: 2, Chrome: 2
            Sales: 50.0 total, 3 sales


OLAP 101 – Queries example

      Rolling up to country level:
  SELECT COUNT(visits), SUM(sales)
  GROUP BY country

        Country   visits   sales
        USA       4        $50
        Canada    1        $0

      “Slicing” by browser:
  SELECT COUNT(visits), SUM(sales)
  WHERE browser = 'FF'
  GROUP BY country

        Country   visits   sales
        USA       2        $10
        Canada    0        $0

      Top browsers by sales:
  SELECT SUM(sales), COUNT(visits)
  GROUP BY browser
  ORDER BY SUM(sales) DESC

        Browser   sales   visits
        Chrome    $25     2
        Safari    $15     1
        FF        $10     2
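
The three queries above can be reproduced over the five sample visits with a few lines of Python. This is only a sketch for checking the numbers on the slide: the `aggregate` helper and the column names are illustrative assumptions, not part of SaasBase.

```python
from collections import defaultdict

# The five sample visits from the "Queries example" table.
COLUMNS = ("date", "country", "city", "os", "browser", "sale")
VISITS = [
    ("2012-05-21", "USA", "NY", "Windows", "FF", 0.0),
    ("2012-05-21", "USA", "NY", "Windows", "FF", 10.0),
    ("2012-05-22", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-22", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-23", "USA", "Chicago", "OSX", "Safari", 15.0),
]


def aggregate(rows, group_by, where=None):
    """Return {group value: (visits, sales)} for a single dimension.

    `where` is an optional predicate over a row dict (the "slice").
    """
    totals = defaultdict(lambda: [0, 0.0])
    for row in rows:
        record = dict(zip(COLUMNS, row))
        if where is not None and not where(record):
            continue
        cell = totals[record[group_by]]
        cell[0] += 1               # COUNT(visits)
        cell[1] += record["sale"]  # SUM(sales)
    return {key: tuple(value) for key, value in totals.items()}


# Roll-up to country level.
rollup = aggregate(VISITS, "country")

# Slice: only visits made with Firefox.
sliced = aggregate(VISITS, "country", where=lambda r: r["browser"] == "FF")

# Top browsers by sales, highest first.
by_browser = aggregate(VISITS, "browser")
top_browsers = sorted(by_browser.items(), key=lambda kv: kv[1][1], reverse=True)
```

Unlike the slide's result table, the sliced result simply has no row for Canada, because no Canadian visit passes the filter.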

OLAP – Runtime Aggregation vs. Pre-aggregation


      Aggregate at runtime
            Most flexible
            Fast (scatter-gather)
            Space efficient
      But
            I/O, CPU intensive
            Slow for larger data
            Low throughput

      Pre-aggregate
            Fast
            Efficient: O(1)
            High throughput
      But
            More effort to process (latency)
            Combinatorial explosion (space)
            No flexibility
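
A small sketch of the pre-aggregation side of this trade-off, under the assumption that every possible dimension group is materialized (the worst case). With n dimensions there are 2^n - 1 non-empty groups, which is the combinatorial explosion; in exchange, a query becomes a single dictionary lookup. All names here are hypothetical.

```python
from itertools import combinations

DIMENSIONS = ("date", "country", "city", "os", "browser")

ROWS = [
    {"date": "2012-05-21", "country": "USA", "city": "NY", "os": "Windows", "browser": "FF", "sale": 0.0},
    {"date": "2012-05-21", "country": "USA", "city": "NY", "os": "Windows", "browser": "FF", "sale": 10.0},
    {"date": "2012-05-22", "country": "USA", "city": "SF", "os": "OSX", "browser": "Chrome", "sale": 25.0},
    {"date": "2012-05-22", "country": "Canada", "city": "Ontario", "os": "Linux", "browser": "Chrome", "sale": 0.0},
    {"date": "2012-05-23", "country": "USA", "city": "Chicago", "os": "OSX", "browser": "Safari", "sale": 15.0},
]


def dimension_groups(dimensions):
    """Every non-empty subset of the dimensions, i.e. every possible GROUP BY."""
    groups = []
    for size in range(1, len(dimensions) + 1):
        groups.extend(combinations(dimensions, size))
    return groups


def preaggregate(rows, groups):
    """Build {(group, values): [visits, sales]} once, at processing time."""
    table = {}
    for row in rows:
        for group in groups:
            key = (group, tuple(row[d] for d in group))
            cell = table.setdefault(key, [0, 0.0])
            cell[0] += 1
            cell[1] += row["sale"]
    return table


GROUPS = dimension_groups(DIMENSIONS)   # 2**5 - 1 = 31 groups
TABLE = preaggregate(ROWS, GROUPS)

# Query time: one lookup, no scan over the raw rows.
usa = TABLE[(("country",), ("USA",))]
usa_firefox = TABLE[(("country", "browser"), ("USA", "FF"))]
```

Materializing only the groups that queries actually need keeps the space cost bounded, at the price of some post-aggregation at query time.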




Pre-aggregation

      Data needs to be summarized
            Can’t visualize 1B data points (no, not even with Retina display)
            Difficult to comprehend correlations among more than 3 dimensions


      Not all dimension groups are relevant
            Index on an as-needed basis (view selection problem)


      Runtime aggregation == TeraSort for every query?
            Pre-aggregate to reduce cardinality




SaasBase

      We tune both:
            pre-aggregation level (ingestion speed + space)
            vs.
            runtime post-aggregation (query speed)


      Think materialized views from RDBMS




SaasBase Domain Model Mapping




SaasBase - Domain Model Mapping




SaasBase - Ingestion, Processing, Indexing, Querying




SaasBase - Ingestion, Processing, Indexing, Querying




Ingestion




Ingestion throughput vs. latency


      Historical data (large batches)
            Optimize for throughput
      Increments (latest data, smaller)
            Optimize for latency




Large, granular input strategies

      Slow listing in HDFS
            Archive processed files


      Filtering input
            FileDateFilter (log name patterns: log-YYYY-MM-dd-HH.log)
            TableInputFormat start/stop row
            File Index in HBase (track processed/new files)


      Map tasks overhead - stitching input splits
            400K files => 400K map tasks => overhead, slow reduce copy
            CombineFileInputFormat – 2GB-splits => 500 splits for 1TB
            FixedMappersTableInputFormat (e.g. 5-region splits)
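
The file-name filtering and the split arithmetic above can be sketched as follows. This is not the deck's FileDateFilter, whose code is not shown; it is a minimal illustration that assumes the `log-YYYY-MM-dd-HH.log` naming pattern and decimal units (1 TB = 1000 GB), with hypothetical function names.

```python
import re
from datetime import datetime

# Naming pattern from the slide: log-YYYY-MM-dd-HH.log
LOG_NAME = re.compile(r"^log-(\d{4})-(\d{2})-(\d{2})-(\d{2})\.log$")


def file_hour(name):
    """Return the hour encoded in a log file name, or None if it does not match."""
    match = LOG_NAME.match(name)
    if match is None:
        return None
    year, month, day, hour = (int(part) for part in match.groups())
    return datetime(year, month, day, hour)


def select_files(names, start, stop):
    """Keep the files whose hour falls in the half-open range [start, stop)."""
    selected = []
    for name in names:
        hour = file_hour(name)
        if hour is not None and start <= hour < stop:
            selected.append(name)
    return selected


def combined_split_count(total_bytes, split_bytes):
    """Number of combined splits needed for the input, rounding up."""
    return -(-total_bytes // split_bytes)


NAMES = [
    "log-2012-05-21-23.log",
    "log-2012-05-22-00.log",
    "log-2012-05-22-13.log",
    "log-2012-05-23-00.log",
    "README.txt",
]
SELECTED = select_files(NAMES, datetime(2012, 5, 22), datetime(2012, 5, 23))

# 1 TB of input stitched into 2 GB splits.
SPLITS = combined_split_count(10**12, 2 * 10**9)
```

Filtering on the name avoids opening or even stat-ing files outside the requested window, which matters when listing itself is slow.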
Ingestion – Bulk Import

      HFileOutputFormat (HFOF)
            100s of times faster than writing through the HBase API
            No need to recover from failed jobs
            No unnecessary load on machines




  * No shuffle: global reduce order required!
            e.g. first reduce key needs to be in the first region, last one in the last region
            Watch for uneven partitions


HFOF – FileSizeDatePartitioner

      1 partition(reduce) / day for initial import
      Uneven reduce (partitions) due to data growth over time
            Reduce k: 2010-12-04 = 500MB
            Reduce n: 2012-05-22 = 5GB => slow and will result in a 5GB region




      Balance reduce buckets based on input file sizes and the reduce key
      Generate sub-partitions based on predefined size (e.g. 1GB)
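
One way to picture the balancing step is the sketch below. It is not the actual FileSizeDatePartitioner; it assumes the input can be described as (key, size) pairs already sorted by reduce key, for example one pair per input file, and cuts them into contiguous buckets of roughly a target size so that the global key order needed for bulk import is preserved. Sizes are in MB and all names are hypothetical.

```python
def balance_buckets(sized_keys, target_size):
    """Group (key, size) pairs, sorted by key, into contiguous buckets.

    A bucket is closed when adding the next key would exceed target_size,
    so keys never move across bucket boundaries out of order.
    A single key larger than target_size still gets its own bucket.
    """
    buckets = []
    current, current_size = [], 0
    for key, size in sized_keys:
        if current and current_size + size > target_size:
            buckets.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        buckets.append(current)
    return buckets


# An old, small day (500 MB total) and a recent, large day (5 GB total).
SIZED_KEYS = [
    ("2010-12-04-00", 250),
    ("2010-12-04-12", 250),
    ("2012-05-22-00", 1000),
    ("2012-05-22-06", 1000),
    ("2012-05-22-12", 1500),
    ("2012-05-22-18", 1500),
]
BUCKETS = balance_buckets(SIZED_KEYS, target_size=1000)
```

Here the small day stays in one reducer while the large day is spread over four, instead of producing one 5 GB reducer and one 5 GB region.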

Processing




Processing



      Processing involves reading the input (files, tables, events), pre-aggregating it (reducing cardinality) and generating tables that can be queried in real time
            1 year: 1B events => 100B data points indexed
            Query => scan 365 data points (e.g. daily page views)




      Processing could be either MR or real-time (e.g. Storm)




Processing for OLAP semantics

            GROUP BY (process, query)
            COUNT, SUM, AVG, etc. (process, query)
            SORT (process, query)
            HAVING (mostly query, can define pre-process constraints)




SaasBase vs. SQL Views Comparison




reports.json entities definition




Processing Performance

      read, map, partition, combine, copy, sort, reduce, write


      Read:
            Scan.setCaching() (I/O ~ buffer)
            Scan.setBatch() (avoid timeouts for abnormal input, e.g. 1M hits/visit)
            Even region distribution across cluster (distributes CPU, I/O)
      Map:
            No unnecessary transformations: Bytes.toString(bytes) + Bytes.toBytes(string) (CPU)
            Avoid GC : new X() (CPU, Memory)
            Avoid system calls (context switching)
            Stripping unnecessary data (I/O)


Processing Performance

      Hot (in memory) vs. Cold (on disk, on network) data
            Minimize I/O from disk/network


      Single shot MR job: SuperProcessor
            Emit all groups from one map() call


      Incremental processing
            Date-prefixed rowkey, format YYYY-MM-DD (add HH:mm for more granularity)
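
The single-pass idea can be sketched as a map function that emits one pair per configured group for each event, using date-prefixed rowkeys so that a later incremental run only touches the keys of the new day. This is a hedged illustration, not the SuperProcessor itself: the group list, the rowkey layout and the function names are assumptions.

```python
# Hypothetical group definitions; the deck keeps these in reports.json,
# whose schema is not shown.
GROUPS = [
    ("country",),
    ("country", "city"),
    ("browser",),
]


def map_event(event, groups=GROUPS):
    """Emit one (rowkey, (visits, sales)) pair per group from a single call."""
    for group in groups:
        values = "/".join(event[d] for d in group)
        rowkey = "{}/{}/{}".format(event["date"], "+".join(group), values)
        yield rowkey, (1, event["sale"])


def reduce_pairs(pairs):
    """Sum visits and sales per rowkey."""
    totals = {}
    for key, (visits, sales) in pairs:
        v, s = totals.get(key, (0, 0.0))
        totals[key] = (v + visits, s + sales)
    return totals


EVENTS = [
    {"date": "2012-05-21", "country": "USA", "city": "NY", "browser": "FF", "sale": 0.0},
    {"date": "2012-05-21", "country": "USA", "city": "NY", "browser": "FF", "sale": 10.0},
]
TOTALS = reduce_pairs(pair for event in EVENTS for pair in map_event(event))
```

Reading each event once and fanning it out to all groups trades more map output for far less input I/O than running one job per group.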




Indexing




HBase natural order: hierarchical representation




Indexing - Why

      Example: top 10 cities
            ~50K [country, city] combinations per day
            Top 10 cities for 1 year =>
            365 (days) x 50K ~= 18M data points scanned
            If you add gender => ~36M
            If you add Device, OS, Browser …


      Might compress well, but think about the environment
      How much energy would you spend for just top 10 cities?



                                                                              * Image from: http://my.neutralexistence.com/images/Green-Earth.jpg


Indexing with HBase “10” < “2”

  GROUP BY year, month, country, city ORDER BY visits DESC LIMIT 10

      Lexicographic sorting

  2012/05/USA/0000000000/
  2012/05/USA/4294966296/San Francisco    = 1000 visits*
  2012/05/USA/4294966396/New York         = 900 visits*
  . . .
  2012/05/USA/9999999999/

      scan 't', {STARTROW => '2012/05/USA/', LIMIT => 10}

  * Inverting and padding numbers for lexicographic sorting:
    1000 -> 2^32 - 1000 = 4294966296
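
A minimal sketch of why the index key embeds an inverted, fixed-width count. It assumes a 10-digit field and an inversion base of 2^32; both are illustrative choices (any base larger than the biggest count that still fits the field width works), and `index_key` is a hypothetical helper.

```python
WIDTH = 10         # digits in the padded field (assumption)
CEILING = 2 ** 32  # inversion base (assumption)


def index_key(year, month, country, visits, city):
    """Build a rowkey that sorts by descending visits within year/month/country."""
    inverted = str(CEILING - visits).zfill(WIDTH)
    return "{:04d}/{:02d}/{}/{}/{}".format(year, month, country, inverted, city)


# Plain decimal strings sort lexicographically, so "10" comes before "2".
naive = sorted(["2", "10"])

KEYS = sorted([
    index_key(2012, 5, "USA", 2, "Boston"),
    index_key(2012, 5, "USA", 900, "New York"),
    index_key(2012, 5, "USA", 10, "Chicago"),
    index_key(2012, 5, "USA", 1000, "San Francisco"),
])

# A scan from the prefix with a limit returns the top N without sorting at query time.
TOP_2 = [key for key in KEYS if key.startswith("2012/05/USA/")][:2]
```

Because larger counts map to smaller padded numbers, the natural rowkey order already is the ORDER BY visits DESC order, and LIMIT becomes a bounded scan.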


Query Engine

      Always reads indexed, compact data
      Query parsing
      Scan strategy
            Single vs. multiple scans
            Start/stop rows (prefixes, index positions, etc.)
            Index selection (volatile indexes with incremental processing)
      Deserialization
      Post-aggregation, sorting, fuzzy-sorting etc.
      Paging
      Custom dimension/metric class loading
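
The start/stop-row part of the scan strategy can be illustrated with a sorted list standing in for a table. This is a sketch under that simplification, not the query engine's code: `prefix_stop_row` uses the common trick of incrementing the last byte of the prefix to get an exclusive stop row, and `scan` is a hypothetical stand-in for a range scan.

```python
import bisect


def prefix_stop_row(prefix: bytes) -> bytes:
    """Smallest key that is greater than every key starting with prefix.

    Trailing 0xFF bytes cannot be incremented, so they are dropped first.
    Returns b"" (meaning "scan to the end") if nothing is left.
    """
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def scan(sorted_keys, start_row, stop_row, limit=None):
    """Return keys in [start_row, stop_row), optionally capped at limit."""
    lo = bisect.bisect_left(sorted_keys, start_row)
    hi = bisect.bisect_left(sorted_keys, stop_row) if stop_row else len(sorted_keys)
    rows = sorted_keys[lo:hi]
    return rows if limit is None else rows[:limit]


TABLE = sorted([
    b"2012/05/Canada/a",
    b"2012/05/USA/a",
    b"2012/05/USA/b",
    b"2012/06/USA/a",
])
PREFIX = b"2012/05/USA/"
STOP = prefix_stop_row(PREFIX)
ROWS = scan(TABLE, PREFIX, STOP)
```

A query spanning several prefixes (for example several months) would issue one such scan per prefix and merge the results in post-aggregation.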




Conclusions

      OLAP semantics on a simple data model
            Data as first class citizen
            Domain Specific “Language” for Dimensions, Metrics, Aggregations
      Tunable performance, resource allocation
      Framework for vertical analytics systems




Thank you!
      Cosmin Lehene @clehene
      http://hstack.org

      Credits:
            Andrei Dragomir
            Adrian Muraru
            Andrei Dulvac
            Raluca Podiuc
            Tudor Scurtu
            Bogdan Dragu
            Bogdan Drutu

OLAP 101 - Rollup

      Country   Visits   Sale
      USA       4        $50
      Canada    1        $0




      Rollup: SELECT COUNT(visits), SUM(sales) GROUP BY country




OLAP 101 - Slicing

      Date         Country   City      OS        Browser   Sale
      2012-03-02   USA       NY        Windows   FF         0.0
      2012-03-02   USA       NY        Windows   FF        10.0
      2012-03-03   USA       SF        OSX       Chrome    25.0
      2012-03-03   Canada    Ontario   Linux     Chrome     0.0
      2012-03-04   USA       Chicago   OSX       Safari    15.0

      Summary:
            5 visits over 3 days
            2 countries: USA: 4, Canada: 1
            4 cities: NY: 2, SF: 1
            3 OS-es: Win: 2, OSX: 2
            3 browsers: FF: 2, Chrome: 2
            Sales: 50.0 total, 3 sales

      Filter or Segment or Slice (WHERE or HAVING)




OLAP 101 – Sorting, TOP n

      Browser   Sale
      Chrome    $25
      Safari    $15
      Firefox   $10

      SELECT SUM(sales) AS total GROUP BY browser ORDER BY total DESC




Low Latency “OLAP” with HBase - HBaseCon 2012

  • 1. Low Latency “OLAP” with HBase Cosmin Lehene | Adobe © 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
  • 2. What we needed … and built
      - OLAP Semantics
      - Low Latency Ingestion
      - High Throughput
      - Real-time Query API
      - Not hardcoded to web analytics or x-, y-, z-analytics, but extensible
  • 3. Building Blocks
      - Dimensions, Metrics
      - Aggregations
      - Roll-up, drill-down, slicing and dicing, sorting
  • 4. OLAP 101 – Queries example

      Date        Country  City     OS       Browser  Sale
      2012-05-21  USA      NY       Windows  FF        0.0
      2012-05-21  USA      NY       Windows  FF       10.0
      2012-05-22  USA      SF       OSX      Chrome   25.0
      2012-05-22  Canada   Ontario  Linux    Chrome    0.0
      2012-05-23  USA      Chicago  OSX      Safari   15.0

      Totals: 5 visits, 3 days; 2 countries (USA: 4, Canada: 1); 4 cities (NY: 2, SF: 1); 3 OS-es (Win: 2, OSX: 2); 3 browsers (FF: 2, Chrome: 2); 50.0 in sales, 3 sales
  • 5. OLAP 101 – Queries example
      - Rolling up to country level:
          SELECT COUNT(visits), SUM(sales) GROUP BY country

          Country  visits  sales
          USA      4       $50
          Canada   1       0
      - “Slicing” by browser:
          SELECT COUNT(visits), SUM(sales) GROUP BY country HAVING browser = "FF"

          Country  visits  sales
          USA      2       $10
          Canada   0       0
      - Top browsers by sales:
          SELECT SUM(sales), COUNT(visits) GROUP BY browser ORDER BY sales

          Browser  sales  visits
          Chrome   $25    2
          Safari   $15    1
          FF       $10    2
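
The three queries on slide 5 can be reproduced in a few lines over the five sample visits from slide 4. This is only an in-memory sketch for illustration: the `aggregate` helper and the `VISITS`/`COLUMNS` names are hypothetical and are not part of SaasBase.

```python
from collections import defaultdict

# The five visits from the sample table: (date, country, city, os, browser, sale).
VISITS = [
    ("2012-05-21", "USA", "NY", "Windows", "FF", 0.0),
    ("2012-05-21", "USA", "NY", "Windows", "FF", 10.0),
    ("2012-05-22", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-22", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-23", "USA", "Chicago", "OSX", "Safari", 15.0),
]
COLUMNS = ("date", "country", "city", "os", "browser", "sale")


def aggregate(rows, group_by, where=None):
    """Return {group value: [visit count, sales sum]} (hypothetical helper)."""
    out = defaultdict(lambda: [0, 0.0])
    for row in rows:
        record = dict(zip(COLUMNS, row))
        if where is not None and not where(record):
            continue  # filtered out by the slice predicate
        bucket = out[record[group_by]]
        bucket[0] += 1
        bucket[1] += record["sale"]
    return dict(out)


# Roll-up to country level.
rollup = aggregate(VISITS, "country")

# Slice: only visits made with Firefox.
ff_slice = aggregate(VISITS, "country", where=lambda r: r["browser"] == "FF")

# Top browsers, ordered by total sales (descending).
top_browsers = sorted(
    aggregate(VISITS, "browser").items(),
    key=lambda item: item[1][1],
    reverse=True,
)
```

Unlike the table on the slide, a country with no matching visits (Canada in the Firefox slice) is simply absent from the result rather than reported with zeros.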
  • 6. OLAP – Runtime Aggregation vs. Pre-aggregation
      - Aggregate at runtime
          - Most flexible
          - Fast – scatter gather
          - Space efficient
          - But: I/O, CPU intensive; slow for larger data; low throughput
      - Pre-aggregate
          - Fast
          - Efficient – O(1)
          - High throughput
          - But: more effort to process (latency); combinatorial explosion (space); no flexibility
  • 7. Pre-aggregation
      - Data needs to be summarized
      - Can’t visualize 1B data points (no, not even with Retina display)
      - Difficult to comprehend correlations among more than 3 dimensions
      - Not all dimension groups are relevant
      - Index on a needed basis (view selection problem)
      - Runtime aggregation == TeraSort for every query?
      - Pre-aggregate to reduce cardinality
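
To make the "combinatorial explosion" and "not all dimension groups are relevant" points concrete, the sketch below enumerates every possible GROUP BY over the five sample dimensions and then keeps only a hand-picked subset. The dimension names come from the sample table; the `RELEVANT` selection is an assumption made up for this example.

```python
from itertools import combinations

DIMENSIONS = ("date", "country", "city", "os", "browser")


def dimension_groups(dimensions):
    """All non-empty subsets of the dimensions, i.e. every possible GROUP BY."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


all_groups = dimension_groups(DIMENSIONS)  # 2**5 - 1 candidate groups

# Hypothetical selection: only the groups that some report actually needs.
RELEVANT = {("date", "country"), ("date", "country", "city"), ("date", "browser")}
selected = [group for group in all_groups if group in RELEVANT]
```

With n dimensions there are 2**n - 1 candidate groups, so pre-aggregating all of them stops being practical quickly; choosing which ones to materialize is the view selection problem mentioned on the slide.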
  • 8. SaasBase
      - We tune both
      - pre-aggregation level vs. runtime post-aggregation
      - (ingestion speed + space) vs. (query speed)
      - Think materialized views from RDBMS
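
The trade-off on slide 8 can be sketched as two steps: pre-aggregate once at a chosen level, then post-aggregate at query time. The code below is a minimal in-memory illustration that assumes the pre-aggregation level is (date, country, city); the function names are hypothetical.

```python
from collections import Counter

# (date, country, city) for the five sample visits.
EVENTS = [
    ("2012-05-21", "USA", "NY"),
    ("2012-05-21", "USA", "NY"),
    ("2012-05-22", "USA", "SF"),
    ("2012-05-22", "Canada", "Ontario"),
    ("2012-05-23", "USA", "Chicago"),
]


def pre_aggregate(events):
    """Processing time: collapse raw events into one counter per (date, country, city)."""
    return Counter(events)


def visits_for_country(index, country):
    """Query time: post-aggregate the stored counters up to country level."""
    return sum(count for (_, c, _), count in index.items() if c == country)


index = pre_aggregate(EVENTS)
usa_visits = visits_for_country(index, "USA")
```

A coarser pre-aggregation level (for example, only by date and country) would store fewer entries and answer this query with less work, but could no longer answer city-level questions. That is the space versus flexibility dial the slide describes.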
  • 9. SaasBase Domain Model Mapping
  • 10. SaasBase - Domain Model Mapping
  • 11. SaasBase - Ingestion, Processing, Indexing, Querying
  • 12. SaasBase - Ingestion, Processing, Indexing, Querying
  • 13. Ingestion
  • 14. Ingestion throughput vs. latency
      - Historical data (large batches)
          - Optimize for throughput
      - Increments (latest data, smaller)
          - Optimize for latency
  • 15. Large, granular input strategies
      - Slow listing in HDFS
          - Archive processed files
      - Filtering input
          - FileDateFilter (log name patterns: log-YYYY-MM-dd-HH.log)
          - TableInputFormat start/stop row
          - File Index in HBase (track processed/new files)
      - Map tasks overhead - stitching input splits
          - 400K files => 400K map tasks => overhead, slow reduce copy
          - CombineFileInputFormat – 2GB-splits => 500 splits for 1TB
          - FixedMappersTableInputFormat (e.g. 5-region splits)
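
As an illustration of the filtering idea on slide 15, the sketch below selects log files by the hour encoded in their names and estimates how many combined splits a given input size produces. It is not the FileDateFilter implementation from the talk: `files_in_range` and `split_count` are hypothetical helpers, and only the `log-YYYY-MM-dd-HH.log` name pattern is taken from the slide.

```python
import re
from datetime import datetime

# File name pattern from the slide: log-YYYY-MM-dd-HH.log
LOG_NAME = re.compile(r"^log-(\d{4}-\d{2}-\d{2}-\d{2})\.log$")


def files_in_range(names, start, end):
    """Keep log files whose hour stamp falls in [start, end)."""
    selected = []
    for name in names:
        match = LOG_NAME.match(name)
        if match is None:
            continue  # not a log file, ignore it
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d-%H")
        if start <= stamp < end:
            selected.append(name)
    return selected


def split_count(total_bytes, split_bytes):
    """Number of combined input splits, rounding up."""
    return -(-total_bytes // split_bytes)


names = [
    "log-2012-05-21-23.log",
    "log-2012-05-22-00.log",
    "log-2012-05-22-13.log",
    "log-2012-05-23-00.log",
    "README",
]
one_day = files_in_range(names, datetime(2012, 5, 22), datetime(2012, 5, 23))
splits = split_count(10**12, 2 * 10**9)  # 1 TB of input in 2 GB splits
```

Filtering on the name avoids opening files that are known to be out of range, and combining many small files into 2 GB splits keeps the number of map tasks in the hundreds rather than the hundreds of thousands.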
  • 16. Ingestion – Bulk Import
      - HFileOutputFormat (HFOF)
          - 100s of times faster than HBase API
          - No need to recover from failed jobs
          - No unnecessary load on machines
      - * No shuffle - global reduce order required!
          - e.g. first reduce key needs to be in the first region, last one in the last region
      - Watch for uneven partitions
  • 17. HFOF – FileSizeDatePartitioner
      - 1 partition (reduce) / day for initial import
      - Uneven reduce (partitions) due to data growth over time
          - Reduce k: 2010-12-04 = 500MB
          - Reduce n: 2012-05-22 = 5GB => slow and will result in a 5GB region
      - Balance reduce buckets based on input file sizes and the reduce key
      - Generate sub-partitions based on predefined size (e.g. 1GB)
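
The balancing step on slide 17 can be sketched as follows: give each day a number of reduce partitions proportional to its input size, and lay the partitions out in date order so the global ordering needed for bulk import is preserved. This is a sketch under those assumptions, not the FileSizeDatePartitioner itself; both helper names are hypothetical.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def sub_partition_counts(bytes_per_day, target_bytes):
    """Number of reduce partitions per day, roughly target_bytes each."""
    return {
        day: max(1, math.ceil(size / target_bytes))
        for day, size in bytes_per_day.items()
    }


def partition_ranges(counts):
    """Assign consecutive reducer indices to days in date order (keeps global order)."""
    ranges, next_index = {}, 0
    for day in sorted(counts):
        ranges[day] = range(next_index, next_index + counts[day])
        next_index += counts[day]
    return ranges


# Input sizes from the slide: an old, small day and a recent, large one.
counts = sub_partition_counts({"2010-12-04": 500 * MB, "2012-05-22": 5 * GB}, 1 * GB)
ranges = partition_ranges(counts)
```

With a 1 GB target, the 500 MB day keeps a single reducer while the 5 GB day is spread over five, so no single reducer (and no resulting region) ends up ten times larger than the others. Inside a day, keys would still have to be routed to sub-partitions by key range rather than by hash to keep the output sorted.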
  • 18. Processing
  • 19. Processing
      - Processing involves reading the input (files, tables, events), pre-aggregating it (reducing cardinality) and generating tables that can be queried in real-time
      - 1 year: 1B events => 100B data points indexed
      - Query => scan 365 data points (e.g. daily page views)
      - Processing could be either MR or real-time (e.g. Storm)
  • 20. Processing for OLAP semantics
      - GROUP BY (process, query)
      - COUNT, SUM, AVG, etc. (process, query)
      - SORT (process, query)
      - HAVING (mostly query, can define pre-process constraints)
  • 21. SaasBase vs. SQL Views Comparison
  • 22. reports.json entities definition
  • 23. Processing Performance
      - read, map, partition, combine, copy, sort, reduce, write
      - Read:
          - Scan.setCaching() (I/O ~ buffer)
          - Scan.setBatch() (avoid timeouts for abnormal input, e.g. 1M hits/visit)
          - Even region distribution across cluster (distributes CPU, I/O)
      - Map:
          - No unnecessary transformations: Bytes.toString(bytes) + Bytes.toBytes(string) (CPU)
          - Avoid GC: new X() (CPU, Memory)
          - Avoid system calls (context switching)
          - Stripping unnecessary data (I/O)
  • 24. Processing Performance
      - Hot (in memory) vs. Cold (on disk, on network) data
      - Minimize I/O from disk/network
      - Single shot MR job: SuperProcessor
      - Emit all groups from one map() call
      - Incremental processing
      - Data format YYYY-MM-DD prefixed rowkey (HH:mm for more granularity)
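
"Emit all groups from one map() call" means each input record is read once and contributes to every report in the same pass. The sketch below illustrates that idea with plain Python generators and a date-prefixed row key. The `REPORTS` layout, the `/` separator and the function names are assumptions for this example and do not describe the actual SuperProcessor.

```python
from collections import defaultdict

# Hypothetical report definitions: report name -> dimensions it groups by.
REPORTS = {
    "by_country": ("country",),
    "by_country_city": ("country", "city"),
    "by_browser": ("browser",),
}


def map_event(event):
    """Yield one (report, row key, sale) triple per report from a single read."""
    for report, dims in REPORTS.items():
        # Row key is prefixed with the date, so new days append to the key space.
        row_key = "/".join([event["date"]] + [event[d] for d in dims])
        yield report, row_key, event["sale"]


def reduce_all(events):
    """Sum visits and sales per (report, row key)."""
    totals = defaultdict(lambda: [0, 0.0])
    for event in events:
        for report, row_key, sale in map_event(event):
            totals[(report, row_key)][0] += 1
            totals[(report, row_key)][1] += sale
    return dict(totals)


events = [
    {"date": "2012-05-21", "country": "USA", "city": "NY", "browser": "FF", "sale": 0.0},
    {"date": "2012-05-21", "country": "USA", "city": "NY", "browser": "FF", "sale": 10.0},
    {"date": "2012-05-22", "country": "USA", "city": "SF", "browser": "Chrome", "sale": 25.0},
]
totals = reduce_all(events)
```

Because the date leads the row key, an incremental run over a new day only writes rows for that day and leaves earlier rows untouched.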
  • 25. Indexing
  • 26. HBase natural order: hierarchical representation
  • 27. Indexing - Why
      - Example: top 10 cities
      - ~50K [country, city] combinations per day
      - Top 10 cities for 1 year =>
      - 365 (days) X 50K ~= 18M data points scanned
      - If you add gender => ~36M
      - If you add Device, OS, Browser …
      - Might compress well, but think about the environment
      - How much energy would you spend for just top 10 cities?
      * Image from: http://my.neutralexistence.com/images/Green-Earth.jpg
  • 28. Indexing with HBase
      - “10” < “2”
      - GROUP BY year, month, country, city ORDER BY visits DESC LIMIT 10
      - Lexicographic sorting
          2012/05/USA/0000000000/
          2012/05/USA/4294961296/San Francisco = 1000 visits*
          2012/05/USA/4294961396/New York = 900 visits*
          . . .
          2012/05/USA/9999999999/
      - scan “t” startrow => “2012/05/USA/”, limit => 10
      * Padding numbers for lexicographic sorting: 1000 -> Long.MAX_VALUE – 1000 = 4294961296
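
The key layout on slide 28 relies on two tricks: zero-padding, so that string order matches numeric order, and subtracting the count from a fixed ceiling, so that larger counts sort first. The sketch below shows both with a 10-digit field. The ceiling `MAX_COUNT = 9999999999` is an assumption chosen for the example, so the generated keys do not reproduce the exact constants shown on the slide.

```python
WIDTH = 10
MAX_COUNT = 10 ** WIDTH - 1  # assumed ceiling: 9999999999


def index_key(prefix, visits, city):
    """Build a row key whose lexicographic order is descending by visits."""
    inverted = MAX_COUNT - visits  # bigger counts become smaller numbers
    return f"{prefix}/{inverted:0{WIDTH}d}/{city}"


visits_by_city = {"New York": 900, "San Francisco": 1000, "Chicago": 15}
keys = sorted(index_key("2012/05/USA", v, c) for c, v in visits_by_city.items())

# "Top N" is now just the first N keys under the prefix.
top_2 = keys[:2]

# Why padding is needed: unpadded numbers sort as strings.
unpadded_problem = "10" < "2"
```

Sorting a Python list stands in for HBase's row ordering here. With keys built this way, a scan that starts at the `2012/05/USA/` prefix and stops after 10 rows returns the top 10 cities without reading the rest.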
  • 29. Query Engine
      - Always reads indexed, compact data
      - Query parsing
      - Scan strategy
          - Single vs. multiple scans
          - Start/stop rows (prefixes, index positions, etc.)
          - Index selection (volatile indexes with incremental processing)
      - Deserialization
      - Post-aggregation, sorting, fuzzy-sorting etc.
      - Paging
      - Custom dimension/metric class loading
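
The start/stop row part of the scan strategy can be illustrated on a sorted list of keys: a prefix query becomes a range scan from the prefix itself to the smallest string that sorts after every key with that prefix. The sketch uses `bisect` on an in-memory list and is not the HBase client API; `scan` and `prefix_stop_row` are hypothetical helpers, and Python compares strings by code point where HBase compares raw bytes.

```python
from bisect import bisect_left


def prefix_stop_row(prefix):
    """Smallest string that sorts after every key starting with prefix."""
    # Bump the last character by one; assumes it is not the maximum code point.
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def scan(sorted_keys, start_row, stop_row=None, limit=None):
    """Return keys in [start_row, stop_row), optionally capped at limit rows."""
    lo = bisect_left(sorted_keys, start_row)
    hi = len(sorted_keys) if stop_row is None else bisect_left(sorted_keys, stop_row)
    rows = sorted_keys[lo:hi]
    return rows if limit is None else rows[:limit]


table = sorted([
    "2012/04/USA/9999999000/Boston",
    "2012/05/Canada/9999999500/Ontario",
    "2012/05/USA/9999998999/San Francisco",
    "2012/05/USA/9999999099/New York",
    "2012/06/USA/9999999900/Chicago",
])

prefix = "2012/05/USA/"
usa_may = scan(table, prefix, prefix_stop_row(prefix))
first_page = scan(table, prefix, prefix_stop_row(prefix), limit=1)
```

Paging follows the same pattern: the next page starts just after the last key returned by the previous one.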
  • 30. Conclusions
      - OLAP semantics on a simple data model
      - Data as first class citizen
      - Domain Specific “Language” for Dimensions, Metrics, Aggregations
      - Tunable performance, resource allocation
      - Framework for vertical analytics systems
  • 31. Thank you!
      Cosmin Lehene @clehene http://hstack.org
      Credits: Andrei Dragomir, Adrian Muraru, Andrei Dulvac, Raluca Podiuc, Tudor Scurtu, Bogdan Dragu, Bogdan Drutu
  • 33. OLAP 101 - Rollup

      Country  Visits  Sale
      USA      4       $50
      Canada   1       $0

      - Rollup: SELECT COUNT(visits), SUM(sales) GROUP BY country
  • 34. OLAP 101 - Slicing

      Date        Country  City     OS       Browser  Sale
      2012-03-02  USA      NY       Windows  FF        0.0
      2012-03-02  USA      NY       Windows  FF       10.0
      2012-03-03  USA      SF       OSX      Chrome   25.0
      2012-03-03  Canada   Ontario  Linux    Chrome    0.0
      2012-03-04  USA      Chicago  OSX      Safari   15.0

      Totals: 5 visits, 3 days; 2 countries (USA: 4, Canada: 1); 4 cities (NY: 2, SF: 1); 3 OS-es (Win: 2, OSX: 2); 3 browsers (FF: 2, Chrome: 2); 50.0 in sales, 3 sales

      - Filter or Segment or Slice (WHERE or HAVING)
  • 35. OLAP 101 – Sorting, TOP n

      Date  Country  City  OS  Browser  Sale
                               Chrome   $25
                               Safari   $15
                               Firefox  $10

      - SELECT SUM(sales) as total GROUP BY browser ORDER BY total

Editor's notes

  1. How many HBase users?
  2. Data as first class citizen
  3. Check contrast on projector
  4. Just like speed vs. space in general CS/algo. Queries always hit indexes.
  5. Dimensions – read/transform/serialize/deserialize data attributes. Metrics – read/transform/aggregate/serialize. Constraints: ingestion filtering. Report: instrument dimensions groups + metrics with aggregations, sorting.
  6. QUERY ENGINE -> INDEX (always realtime)
  7. Initial import/process and NEW reports (not covered) on historical data
  8. 18K regions, upgrade to 0.92
  9. Diagram HARD TO DIGEST (TOO MUCH INFO, TOO CONDENSED)
  10. Process = aggregate, generate indexes (natural). Query = uses indexes, can do extra aggregation.
  11. LEFT: report definition, NOT a QUERY. LIKE A VIEW - CREATED - THEN QUERIED
  12. Inconsistent
  13. Rowkey = dimensions group -> metrics (right)
  14. GO BACK to EXPLAIN
  15. >100K/sec/thread REALTIME
  16. Data analysts work with familiar concepts