HBase and Hadoop at Urban Airship
April 25, 2012

Dave Revell
dave@urbanairship.com
@dave_revell
Who are we?

•   Who am I?
     •   Airshipper for 10 months, Hadoop user for 1.5 years
     •   Database Engineer on Core Data team: we collect
         events from mobile devices and create reports
•   What is Urban Airship?
     •   SaaS for mobile developers. Features that devs
         shouldn’t build themselves.
     •   Mostly push notifications
     •   No airships :(
Goals
•   “Near real time” reporting
      •   Counters: messages sent and received, app opens, in
          various time slices
      •   More complex analyses: time-in-app, uniques,
          conversions
•   Scale
      •   Billions of “events” per month, ~100 bytes each
      •   40 billion events so far, looking exponential.
      •   Event arrival rate varies wildly, ~10K/sec (?)
Enter Hadoop

•   An Apache project with HDFS, MapReduce, and Common
      •   Open source, Apache license
•   In common usage: platform, framework, ecosystem
      •   HBase, Hive, Pig, ZooKeeper, Mahout, Oozie ....
•   It’s in Java
•   History: early 2000s, originally a clone of Google’s GFS and
    MapReduce
Enter HBase

•   HBase is a database that uses HDFS for storage
•   Based on Google’s BigTable. Not relational or SQL.
•   Solves the problem “how do I query my Hadoop data?”
      •   Operations typically take a few milliseconds
      •   MapReduce is not suitable for real time queries
•   Scales well by adding servers (if you do everything right)
•   Not highly available or multi-datacenter
UA’s basic architecture

  Events in:   Mobile devices → Queue (Kafka) → HBase (on HDFS)
  Reports out: HBase → Web service → Reports user

  (not shown: analysis code that reads events from HBase and puts derived
  data back into HBase)
Analyzing events

•   Queue of incoming events
      •   Absorbs traffic spikes
      •   Partially decouples database from internet
      •   Pub/sub, groups of consumers share work

•   UA proprietary Java code
      •   Consumes event queue
      •   Does simple streaming analyses (counters)
      •   Stages data in HBase tables for more complex analyses that
          come later

•   Incremental batch jobs
      •   Calculations that are difficult or inefficient to compute as
          data streams through
      •   Read from HBase, write back to HBase
HBase data model

•   The abstraction offered by HBase for reading and writing
•   As useful as possible without limiting scalability too much
•   Data is in rows, rows are in tables, ordered by row key


      myApp:1335139200       OPENS_COUNT: 3987 SENDS_COUNT: 28832

      myApp:1335142800       OPENS_COUNT: 4230 SENDS_COUNT: 38990



       (not shown: column families)
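The row keys above look like an app key plus an epoch-seconds timestamp rounded down to a time slice (1335139200 and 1335142800 are one hour apart). A minimal sketch of building such a compound key — the method name and the hourly bucket size are illustrative assumptions, not UA's actual code:

```java
// Sketch: compound row key "appKey:hourBucket", where the bucket is the
// epoch-seconds timestamp rounded down to the hour. Because the app key is
// the key prefix, all of one app's hourly rows sort contiguously in HBase.
public class RowKeys {
    static final long HOUR_SECONDS = 3600;

    static String hourlyRowKey(String appKey, long epochSeconds) {
        long bucket = (epochSeconds / HOUR_SECONDS) * HOUR_SECONDS;
        return appKey + ":" + bucket;
    }

    public static void main(String[] args) {
        // An event at 1335141000 lands in the 1335139200 bucket
        System.out.println(hourlyRowKey("myApp", 1335141000L));
    }
}
```

Keeping the timestamp at a fixed digit width matters because rows sort by raw byte order; 10-digit epoch seconds stay fixed-width until the year 2286.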
The HBase data model, cont.

•   This is a nested map/dictionary
•   Scannable in lexicographic key order
•   Interface is very simple:
      •   get, put, delete, scan, increment
•   Bytes only

      {"myRowKey1": {
          "myColFam": {
              "myQualifierX": "foo",
              "myQualifierY": "bar"}},
       "rowKey2": {
          "myColFam": {
              "myQualifierA": "baz",
              "myQualifierB": ""}}}
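The "sorted nested map" model can be imitated in a few lines of plain Java: a `TreeMap` keyed by row key, where a scan is just an ordered sub-map view. This is a toy illustration (column families omitted, strings instead of bytes), not how HBase is implemented:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of HBase's abstraction: rowKey -> (qualifier -> value), with
// rows kept in lexicographic order so a "scan" is an ordered range view.
public class SortedMapModel {
    static final NavigableMap<String, TreeMap<String, String>> table = new TreeMap<>();

    static void put(String row, String qualifier, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>()).put(qualifier, value);
    }

    // Like an HBase scan: all rows with keys in [startRow, stopRow)
    static NavigableMap<String, TreeMap<String, String>> scan(String startRow, String stopRow) {
        return table.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        put("myApp:1335139200", "OPENS_COUNT", "3987");
        put("myApp:1335142800", "OPENS_COUNT", "4230");
        put("otherApp:1335139200", "OPENS_COUNT", "12");
        // Prefix scan of one app's rows: ';' is the byte after ':', a
        // common stop-key trick for prefix scans
        System.out.println(scan("myApp:", "myApp;").keySet());
    }
}
```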
HBase API example

byte[] firstNameQualifier = "fname".getBytes();
byte[] lastNameQualifier = "lname".getBytes();
byte[] personalInfoColFam = "personalInfo".getBytes();

HTable hTable = new HTable("users");
Put put = new Put("dave".getBytes());
put.add(personalInfoColFam, firstNameQualifier, "Dave".getBytes());
put.add(personalInfoColFam, lastNameQualifier, "Revell".getBytes());
hTable.put(put);
How to not fail at HBase

•   Things you should have done initially, but now it’s too late
    and you’re irretrievably screwed
      •   Keep table count and column family count low
      •   Keep rows narrow, use compound keys
      •   Scale by adding more rows
      •   Tune your flush threshold and memstore sizes
      •   It’s OK to store complex objects as Protobuf/Thrift/etc.
      •   Always try for sequential IO over random IO
MapReduce, briefly
•   The original use case for Hadoop
•   Mappers take in a large data set and send (key, value) pairs to
    reducers. Reducers aggregate input pairs and generate output.

                     My      input    data     items

                   Mapper   Mapper   Mapper   Mapper

                     Reducer            Reducer

                     Output             Output
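The flow in the diagram can be simulated in plain Java (this is an in-memory sketch of the concept, not the Hadoop API): mappers emit `(word, 1)` pairs, the shuffle groups values by key, and reducers sum each group — the classic word-count job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory simulation of map -> shuffle -> reduce, as word count.
public class MiniMapReduce {
    static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit (word, 1); shuffle: group the 1s by word
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                grouped.computeIfAbsent(word, w -> new ArrayList<>()).add(1);
        // Reduce phase: aggregate each key's values into one output value
        Map<String, Integer> out = new TreeMap<>();
        grouped.forEach((word, ones) ->
                out.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("my input data", "my data")));
    }
}
```

In real Hadoop the map and reduce phases run on different machines and the shuffle moves data over the network, which is where the latency discussed on the next slide comes from.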
MapReduce issues

•   Hard to process incrementally (efficiently)
•   Hard to achieve low latency
•   Can’t have too many jobs
•   Requires elaborate workflow automation


•   Urban Airship uses MapReduce over HBase data for:
      •   Ad-hoc analysis
      •   Monthly billing
Live demo




 (Jump to web browser for HBase and MR status pages)
Batch processing at UA

•   Quartz scheduler, distributed over 3 nodes
      •   Time-in-app, audience count, conversions


•   General pattern
      •   Arriving events set a low water mark for their app
      •   Batch jobs reprocess events starting at the low water
          mark
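A minimal sketch of that low-water-mark pattern — all names here are illustrative, not UA's code: arriving events record the earliest unprocessed timestamp per app, and the batch job later claims that mark and reprocesses from it forward.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the low-water-mark pattern: events lower a per-app mark,
// batch jobs reprocess from the mark and clear it.
public class LowWaterMark {
    // appKey -> earliest event timestamp not yet covered by a batch run
    static final Map<String, Long> marks = new ConcurrentHashMap<>();

    static void onEventArrived(String appKey, long eventTs) {
        // Keep the minimum: only an older event can lower the mark
        marks.merge(appKey, eventTs, Math::min);
    }

    // Batch job: claim the start of the range to reprocess (or -1 if
    // nothing arrived), clearing the mark so later events reset it.
    static long claimReprocessStart(String appKey) {
        Long mark = marks.remove(appKey);
        return mark == null ? -1 : mark;
    }

    public static void main(String[] args) {
        onEventArrived("myApp", 1335142800L);
        onEventArrived("myApp", 1335139200L); // older event lowers the mark
        System.out.println(claimReprocessStart("myApp"));
    }
}
```

The appeal of the pattern is idempotence: reprocessing a range twice produces the same derived data, so a crashed batch run can simply be retried.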
Strengths

•   Uptime
      •   We know all the ways to crash by now
•   Schema design, throughput, and scaling
      •   There are many subtle mistakes to avoid
•   Writing custom tools (statshtable, hbackup, gclogtailer)
•   “Real time most of the time”
Weaknesses of our design


•   Shipping features quickly
•   Hardware efficiency
•   Infrastructure automation
•   Writing custom tools, getting bogged down at low levels,
    leaky abstractions
•   Serious operational Java skills required
Reading



•   Hadoop: The Definitive Guide by Tom White
•   HBase: The Definitive Guide by Lars George
•   http://hbase.apache.org/book.html
Questions?




•   #hbase on Freenode
•   hbase-dev, hbase-user Apache mailing lists
