SlideShare una empresa de Scribd logo
1 de 16
Apache Hadoop and HBase in the Real World



              Joey Echeverria
                  @fwiffo

                #novahug
The Plug

We're Training!

                Developer Training July 25 to 27
                 Admin Training July 28 to 29

                http://www.cloudera.com/training

We’re Hiring!

                  Solution Architects, Trainers,
                 Distributed Systems Engineers

                http://www.cloudera.com/careers

           Copyright 2011 Cloudera Inc. All rights reserved   2
1 Minute Hadoop Recap

HDFS

Distributed file system
Optimized for streaming reads and writes
Block-level replication


MapReduce

Distributed processing framework
Reads/writes data in HDFS (typically)
Operates over (key, value) view of data


           Copyright 2011 Cloudera Inc. All rights reserved   3
Where does HBase come in?

Google

Google invented GFS and MapReduce
GFS optimized for streaming reads and writes



BigTable

Google's answer to random read/write workloads




           Copyright 2011 Cloudera Inc. All rights reserved   4
HBase: BigTable-like storage (for Hadoop)




          Copyright 2011 Cloudera Inc. All rights reserved   5
What is HBase?

Key/value column family store

Data stored in HDFS

ZooKeeper for coordination

Access model is get/put/del

Plus range scans and versions




           Copyright 2011 Cloudera Inc. All rights reserved   6
Architecture




Image courtsey Lars George, Licensed uner Creative Commons Attribution-Noncomm

                  Copyright 2011 Cloudera Inc. All rights reserved   7
Tables and Column Families

Static part of the schema

Column families also form locality groups

One Store per family
Multiple HFiles per Store

Tables split into regions

Continuous range of row keys
Unit of distribution
Automatically split
Pre-split for performance

            Copyright 2011 Cloudera Inc. All rights reserved   8
Why use HBase?

Variable schema in each record

Collections of data for each key

Atomic control of per-key data

Row access to each column family




            Copyright 2011 Cloudera Inc. All rights reserved   9
HBase Applications


“Smart Data, at Scale, made Easy”
   http://www.lilyproject.org




“Distributed, scalable Time Series Database (TSDB)”
   http://opentsdb.net




           Copyright 2011 Cloudera Inc. All rights reserved   10
Real-time ad optimizations

Capturing impressions and serving ads


HBase front-end – to serve models (via memcached)


HBase back-end – to serve pixels and capture cookies


MapReduce to compute models between the two




           Copyright 2011 Cloudera Inc. All rights reserved   11
Click stream sessionization

Key on userid and time


Seperate table for significant events (e.g. purchase)


Load data using HBase importtsv tool


Sessionization performed by simple scans




            Copyright 2011 Cloudera Inc. All rights reserved   12
Mozilla - Soccorro

When Firefox crashes, where do reports go?


The Mozilla team gathers those crashes in HBase


Crashes varry widely and change format often


Processors take each individually and parse it out

   http://code.google.com/p/socorro


           Copyright 2011 Cloudera Inc. All rights reserved   13
Navteq

Location based content serving


All served out of HBase, location makes a great key


Content is variable – Maps, POI, User Data


Preprocessing is all done via MR jobs




           Copyright 2011 Cloudera Inc. All rights reserved   14
Cloudera

Gathers data about customer clusters


Each customer node is a key with Avro values


Easy to browse, quick to find issues on Nodes


Dump to HDFS and process with Pig




           Copyright 2011 Cloudera Inc. All rights reserved   15
Copyright 2011 Cloudera Inc. All rights reserved   16

Más contenido relacionado

La actualidad más candente

Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Cloudera, Inc.
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
Oracle Cloud DBaaS
Oracle Cloud DBaaSOracle Cloud DBaaS
Oracle Cloud DBaaSArush Jain
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosEuangelos Linardos
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitSaptak Sen
 
Hadoop Infrastructure (Oct. 3rd, 2012)
Hadoop Infrastructure (Oct. 3rd, 2012)Hadoop Infrastructure (Oct. 3rd, 2012)
Hadoop Infrastructure (Oct. 3rd, 2012)John Dougherty
 
Delivering Pluggable Database as a Service
Delivering Pluggable Database as a ServiceDelivering Pluggable Database as a Service
Delivering Pluggable Database as a ServicePete Sharman
 
Is Cloud a right Companion for Hadoop
Is Cloud a right Companion for HadoopIs Cloud a right Companion for Hadoop
Is Cloud a right Companion for HadoopDataWorks Summit
 
Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]
Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]
Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]Accumulo Summit
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...Data Con LA
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopCloudera, Inc.
 
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...EMC
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopCloudera, Inc.
 
Impala use case @ Zoosk
Impala use case @ ZooskImpala use case @ Zoosk
Impala use case @ ZooskCloudera, Inc.
 
Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]
Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]
Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]Accumulo Summit
 
Big Data: SQL query federation for Hadoop and RDBMS data
Big Data:  SQL query federation for Hadoop and RDBMS dataBig Data:  SQL query federation for Hadoop and RDBMS data
Big Data: SQL query federation for Hadoop and RDBMS dataCynthia Saracco
 
Delivering Schema as a Service
Delivering Schema as a ServiceDelivering Schema as a Service
Delivering Schema as a ServicePete Sharman
 
Big Data Processing with Hadoop-MapReduce in Cloud Systems
Big Data Processing with Hadoop-MapReduce in Cloud SystemsBig Data Processing with Hadoop-MapReduce in Cloud Systems
Big Data Processing with Hadoop-MapReduce in Cloud SystemsIntellipaat
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Solution Use Case Demo: The Power of Relationships in Your Big Data
Solution Use Case Demo: The Power of Relationships in Your Big DataSolution Use Case Demo: The Power of Relationships in Your Big Data
Solution Use Case Demo: The Power of Relationships in Your Big DataInfiniteGraph
 

La actualidad más candente (20)

Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Oracle Cloud DBaaS
Oracle Cloud DBaaSOracle Cloud DBaaS
Oracle Cloud DBaaS
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
 
Hadoop Infrastructure (Oct. 3rd, 2012)
Hadoop Infrastructure (Oct. 3rd, 2012)Hadoop Infrastructure (Oct. 3rd, 2012)
Hadoop Infrastructure (Oct. 3rd, 2012)
 
Delivering Pluggable Database as a Service
Delivering Pluggable Database as a ServiceDelivering Pluggable Database as a Service
Delivering Pluggable Database as a Service
 
Is Cloud a right Companion for Hadoop
Is Cloud a right Companion for HadoopIs Cloud a right Companion for Hadoop
Is Cloud a right Companion for Hadoop
 
Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]
Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]
Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
 
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
 
Impala use case @ Zoosk
Impala use case @ ZooskImpala use case @ Zoosk
Impala use case @ Zoosk
 
Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]
Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]
Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]
 
Big Data: SQL query federation for Hadoop and RDBMS data
Big Data:  SQL query federation for Hadoop and RDBMS dataBig Data:  SQL query federation for Hadoop and RDBMS data
Big Data: SQL query federation for Hadoop and RDBMS data
 
Delivering Schema as a Service
Delivering Schema as a ServiceDelivering Schema as a Service
Delivering Schema as a Service
 
Big Data Processing with Hadoop-MapReduce in Cloud Systems
Big Data Processing with Hadoop-MapReduce in Cloud SystemsBig Data Processing with Hadoop-MapReduce in Cloud Systems
Big Data Processing with Hadoop-MapReduce in Cloud Systems
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Solution Use Case Demo: The Power of Relationships in Your Big Data
Solution Use Case Demo: The Power of Relationships in Your Big DataSolution Use Case Demo: The Power of Relationships in Your Big Data
Solution Use Case Demo: The Power of Relationships in Your Big Data
 

Destacado (6)

Chapter #5
Chapter #5Chapter #5
Chapter #5
 
Chapter 2 mktg
Chapter 2 mktgChapter 2 mktg
Chapter 2 mktg
 
Chapter 3 marketing
Chapter 3 marketingChapter 3 marketing
Chapter 3 marketing
 
Chapter 1
Chapter   1Chapter   1
Chapter 1
 
Principles Of Marketing 1
Principles Of  Marketing 1Principles Of  Marketing 1
Principles Of Marketing 1
 
Marketing - Philip Kotler Ch 1
Marketing - Philip Kotler Ch 1Marketing - Philip Kotler Ch 1
Marketing - Philip Kotler Ch 1
 

Similar a Hadoop and h base in the real world

Hadoop and HBase in the Real World
Hadoop and HBase in the Real WorldHadoop and HBase in the Real World
Hadoop and HBase in the Real WorldCloudera, Inc.
 
Hadoop World 2010: Productionizing Hadoop: Lessons Learned
Hadoop World 2010: Productionizing Hadoop: Lessons LearnedHadoop World 2010: Productionizing Hadoop: Lessons Learned
Hadoop World 2010: Productionizing Hadoop: Lessons LearnedCloudera, Inc.
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony NguyenThanh Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...DataWorks Summit
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championAmeet Paranjape
 
Built-In Security for the Cloud
Built-In Security for the CloudBuilt-In Security for the Cloud
Built-In Security for the CloudDataWorks Summit
 
Hadoop in adtech
Hadoop in adtechHadoop in adtech
Hadoop in adtechYuta Imai
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceChris Nauroth
 
Protecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersProtecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersDataWorks Summit
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big dealeduarderwee
 
Oracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by ExampleOracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by ExampleHarald Erb
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...DataWorks Summit
 
Cloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerationsCloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerationsDataWorks Summit
 
Hortonworks and Red Hat Webinar - Part 2
Hortonworks and Red Hat Webinar - Part 2Hortonworks and Red Hat Webinar - Part 2
Hortonworks and Red Hat Webinar - Part 2Hortonworks
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?Hortonworks
 
Big data and hadoop product page
Big data and hadoop product pageBig data and hadoop product page
Big data and hadoop product pageJanu Jahnavi
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
 

Similar a Hadoop and h base in the real world (20)

Hadoop and HBase in the Real World
Hadoop and HBase in the Real WorldHadoop and HBase in the Real World
Hadoop and HBase in the Real World
 
Hadoop World 2010: Productionizing Hadoop: Lessons Learned
Hadoop World 2010: Productionizing Hadoop: Lessons LearnedHadoop World 2010: Productionizing Hadoop: Lessons Learned
Hadoop World 2010: Productionizing Hadoop: Lessons Learned
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a champion
 
Built-In Security for the Cloud
Built-In Security for the CloudBuilt-In Security for the Cloud
Built-In Security for the Cloud
 
Hadoop in adtech
Hadoop in adtechHadoop in adtech
Hadoop in adtech
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduce
 
Protecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersProtecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against Disasters
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big deal
 
Oracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by ExampleOracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by Example
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...
 
Cloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerationsCloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerations
 
Hortonworks and Red Hat Webinar - Part 2
Hortonworks and Red Hat Webinar - Part 2Hortonworks and Red Hat Webinar - Part 2
Hortonworks and Red Hat Webinar - Part 2
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?
 
Big data and hadoop product page
Big data and hadoop product pageBig data and hadoop product page
Big data and hadoop product page
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 

Más de Joey Echeverria

Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applicationsJoey Echeverria
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streamsJoey Echeverria
 
The Future of Apache Hadoop Security
The Future of Apache Hadoop SecurityThe Future of Apache Hadoop Security
The Future of Apache Hadoop SecurityJoey Echeverria
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
Apache Accumulo and Cloudera
Apache Accumulo and ClouderaApache Accumulo and Cloudera
Apache Accumulo and ClouderaJoey Echeverria
 
Analyzing twitter data with hadoop
Analyzing twitter data with hadoopAnalyzing twitter data with hadoop
Analyzing twitter data with hadoopJoey Echeverria
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use casesJoey Echeverria
 
Scratching your own itch
Scratching your own itchScratching your own itch
Scratching your own itchJoey Echeverria
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computingJoey Echeverria
 

Más de Joey Echeverria (12)

Debugging Apache Spark
Debugging Apache SparkDebugging Apache Spark
Debugging Apache Spark
 
Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applications
 
Streaming ETL for All
Streaming ETL for AllStreaming ETL for All
Streaming ETL for All
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
 
The Future of Apache Hadoop Security
The Future of Apache Hadoop SecurityThe Future of Apache Hadoop Security
The Future of Apache Hadoop Security
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
Apache Accumulo and Cloudera
Apache Accumulo and ClouderaApache Accumulo and Cloudera
Apache Accumulo and Cloudera
 
Analyzing twitter data with hadoop
Analyzing twitter data with hadoopAnalyzing twitter data with hadoop
Analyzing twitter data with hadoop
 
Big data security
Big data securityBig data security
Big data security
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use cases
 
Scratching your own itch
Scratching your own itchScratching your own itch
Scratching your own itch
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computing
 

Hadoop and h base in the real world

  • 1. Apache Hadoop and HBase in the Real World Joey Echeverria @fwiffo #novahug
  • 2. The Plug We're Training! Developer Training July 25 to 27 Admin Training July 28 to 29 http://www.cloudera.com/training We’re Hiring! Solution Architects, Trainers, Distributed Systems Engineers http://www.cloudera.com/careers Copyright 2011 Cloudera Inc. All rights reserved 2
  • 3. 1 Minute Hadoop Recap HDFS Distributed file system Optimized for streaming reads and writes Block-level replication MapReduce Distributed processing framework Reads/writes data in HDFS (typically) Operates over (key, value) view of data Copyright 2011 Cloudera Inc. All rights reserved 3
  • 4. Where does HBase come in? Google Google invented GFS and MapReduce GFS optimized for streaming reads and writes BigTable Google's answer to random read/write workloads Copyright 2011 Cloudera Inc. All rights reserved 4
  • 5. HBase: BigTable-like storage (for Hadoop) Copyright 2011 Cloudera Inc. All rights reserved 5
  • 6. What is HBase? Key/value column family store Data stored in HDFS ZooKeeper for coordination Access model is get/put/del Plus range scans and versions Copyright 2011 Cloudera Inc. All rights reserved 6
  • 7. Architecture Image courtsey Lars George, Licensed uner Creative Commons Attribution-Noncomm Copyright 2011 Cloudera Inc. All rights reserved 7
  • 8. Tables and Column Families Static part of the schema Column families also form locality groups One Store per family Multiple HFiles per Store Tables split into regions Continuous range of row keys Unit of distribution Automatically split Pre-split for performance Copyright 2011 Cloudera Inc. All rights reserved 8
  • 9. Why use HBase? Variable schema in each record Collections of data for each key Atomic control of per-key data Row access to each column family Copyright 2011 Cloudera Inc. All rights reserved 9
  • 10. HBase Applications “Smart Data, at Scale, made Easy” http://www.lilyproject.org “Distributed, scalable Time Series Database (TSDB)” http://opentsdb.net Copyright 2011 Cloudera Inc. All rights reserved 10
  • 11. Real-time ad optimizations Capturing impressions and serving ads HBase front-end – to serve models (via memcached) HBase back-end – to serve pixels and capture cookies MapReduce to compute models between the two Copyright 2011 Cloudera Inc. All rights reserved 11
  • 12. Click stream sessionization Key on userid and time Seperate table for significant events (e.g. purchase) Load data using HBase importtsv tool Sessionization performed by simple scans Copyright 2011 Cloudera Inc. All rights reserved 12
  • 13. Mozilla - Soccorro When Firefox crashes, where do reports go? The Mozilla team gathers those crashes in HBase Crashes varry widely and change format often Processors take each individually and parse it out http://code.google.com/p/socorro Copyright 2011 Cloudera Inc. All rights reserved 13
  • 14. Navteq Location based content serving All served out of HBase, location makes a great key Content is variable – Maps, POI, User Data Preprocessing is all done via MR jobs Copyright 2011 Cloudera Inc. All rights reserved 14
  • 15. Cloudera Gathers data about customer clusters Each customer node is a key with Avro values Easy to browse, quick to find issues on Nodes Dump to HDFS and process with Pig Copyright 2011 Cloudera Inc. All rights reserved 15
  • 16. Copyright 2011 Cloudera Inc. All rights reserved 16