SlideShare una empresa de Scribd logo
1 de 33
Advanced HBase Schema
       Design
   Berlin Buzzwords, June 2012
            Lars George
        lars@cloudera.com
About Me
• Solutions Architect @ Cloudera
• Apache HBase & Whirr Committer
• Author of
   HBase – The Definitive Guide
• Working with HBase since end
  of 2007
• Organizer of the Munich OpenHUG
Agenda
1   Overview of HBase
2   Schema Design
3   Examples
4   Wrap up




                                 3
HBase Tables
HBase Tables
HBase Tables
HBase Tables
HBase Tables
HBase Tables
HBase Tables
Physical Model:
        HBase Tables and Regions
• Table is made up of any number if regions
• Region is specified by its startKey and endKey
• Each region may live on a different node and is
  made up of several HDFS files and blocks, each
  of which is replicated by Hadoop
Distribution
Storage Separation
• Column Families allow for separation of data
  – Used by Columnar Databases for fast analytical
    queries, but on column level only
  – Allows different or no compression depending on
    the content type
• Segregate information based on access
  pattern
• Data is stored in one or more storage file,
  called HFiles
Fold, Store, and Shift
Fold, Store, and Shift
• Logical layout does not match physical one
• All values are stored with the full coordinates,
  including: Row Key, Column Family, Column
  Qualifier, and Timestamp
• Folds columns into “row per column”
• NULLs are cost free as nothing is stored
• Versions are multiple “rows” in folded table
Logical Model:
                  HBase Tables
• Tables are sorted by Row in lexicographical order
• Table schema only defines its column families
  – Each family consists of any number of columns
  – Each column consists of any number of versions
  – Columns only exist when inserted, NULLs are free
  – Columns within a family are sorted and stored
    together
  – Everything except table names are byte[]

 (Table, Row, Family:Column, Timestamp) -> Value
Column Family vs. Column
• Use only a few column families
  – Causes many files that need to stay open per
    region plus class overhead per family
• Best used when logical separation between
  data and meta columns
• Sorting per family can be used to convey
  application logic or access pattern
Key Cardinality
Key Cardinality
• The best performance is gained from using
  row keys
• Time range bound reads can skip store files
  – So can Bloom Filters
• Selecting column families reduces the amount
  of data to be scanned
• Pure value based filtering is a full table scan
  – Filters often are too, but reduce network traffic
Key/Table Design
• Crucial to gain best performance
  – Why do I need to know? Well, you also need to
    know that RDBMS is only working well when
    columns are indexed and query plan is OK
• Absence of secondary indexes forces use of
  row key or column name sorting
• Transfer multiple indexes into one
  – Generate large table -> Good since fits
    architecture and spreads across cluster
DDI
• Stands for Denormalization, Duplication and
  Intelligent Keys
• Needed to overcome shortcomings of
  architecture
• Denormalization -> Replacement for JOINs
• Duplication -> Design for reads
• Intelligent Keys -> Implement indexing and
  sorting, optimize reads
Key Design
Key Design Summary
• Based on access pattern, either use sequential
  or random keys
• Often a combination of both is needed
  – Overcome architectural limitations
• Neither is necessarily bad
  – Use bulk import for sequential keys and reads
  – Random keys are good for random access patterns
Pre-materialize Everything
• Achieve one read per customer request if
  possible
• Otherwise keep at lowest number
• Reads between 10ms (cache miss) and 1ms
  (cache hit)
• Use MapReduce to compute exacts in batch
• Store and merge updates live
• Use incrementColumnValue

           Motto: “Design for Reads”
Relational Model
Muddled Up!
Remodeling
WTH?
Example: OpenTSDB




• Metric Type, Tags are stored as IDs
• Periodically rolled up
Summary
• Design for Use-Case
    – Read, Write, or Both?
•   Avoid Hotspotting
•   Consider using IDs instead of full text
•   Leverage Column Family to HFile relation
•   Shift details to appropriate position
    – Composite Keys
    – Column Qualifiers
Summary (cont.)
• Schema design is a combination of
  – Designing the keys (row and column)
  – Segregate data into column families
  – Choose compression and block sizes
• Similar techniques are needed to scale most
  systems
  – Add indexes, partition data, consistent hashing
• Denormalization, Duplication, and Intelligent
  Keys (DDI)
Questions?




Email: lars@cloudera.com
Twitter: @larsgeorge

Más contenido relacionado

La actualidad más candente

HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practicelarsgeorge
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBaseAnil Gupta
 
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...Cloudera, Inc.
 
HBase Read High Availability Using Timeline Consistent Region Replicas
HBase  Read High Availability Using Timeline Consistent Region ReplicasHBase  Read High Availability Using Timeline Consistent Region Replicas
HBase Read High Availability Using Timeline Consistent Region Replicasenissoz
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)alexbaranau
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBaseCloudera, Inc.
 
Apache HBase - Just the Basics
Apache HBase - Just the BasicsApache HBase - Just the Basics
Apache HBase - Just the BasicsHBaseCon
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionCloudera, Inc.
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...Cloudera, Inc.
 
Apache HBase 1.0 Release
Apache HBase 1.0 ReleaseApache HBase 1.0 Release
Apache HBase 1.0 ReleaseNick Dimiduk
 
HBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBaseHBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBaseCloudera, Inc.
 
HBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ FlipboardHBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ FlipboardMatthew Blair
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceCloudera, Inc.
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBaseCon
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for ArchitectsNick Dimiduk
 

La actualidad más candente (20)

Hbase: an introduction
Hbase: an introductionHbase: an introduction
Hbase: an introduction
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practice
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBase
 
Hadoop - Apache Hbase
Hadoop - Apache HbaseHadoop - Apache Hbase
Hadoop - Apache Hbase
 
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
 
HBase Read High Availability Using Timeline Consistent Region Replicas
HBase  Read High Availability Using Timeline Consistent Region ReplicasHBase  Read High Availability Using Timeline Consistent Region Replicas
HBase Read High Availability Using Timeline Consistent Region Replicas
 
NoSQL: Cassadra vs. HBase
NoSQL: Cassadra vs. HBaseNoSQL: Cassadra vs. HBase
NoSQL: Cassadra vs. HBase
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
Apache HBase - Just the Basics
Apache HBase - Just the BasicsApache HBase - Just the Basics
Apache HBase - Just the Basics
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An Introduction
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
 
Apache HBase 1.0 Release
Apache HBase 1.0 ReleaseApache HBase 1.0 Release
Apache HBase 1.0 Release
 
HBase
HBaseHBase
HBase
 
HBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBaseHBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBase
 
HBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ FlipboardHBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ Flipboard
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region Replicas
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for Architects
 

Destacado

HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars GeorgeJAX London
 
Web intelligence and big data
Web intelligence and big dataWeb intelligence and big data
Web intelligence and big dataRafael Mendes
 
Apache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesApache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesJimmy Lai
 
Ysance conference - cloud computing - aws - 3 mai 2010
Ysance   conference - cloud computing - aws - 3 mai 2010Ysance   conference - cloud computing - aws - 3 mai 2010
Ysance conference - cloud computing - aws - 3 mai 2010Ysance
 
Social Networks and the Richness of Data
Social Networks and the Richness of DataSocial Networks and the Richness of Data
Social Networks and the Richness of Datalarsgeorge
 
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 GenoaHadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoalarsgeorge
 
An introduction to Apache Thrift
An introduction to Apache ThriftAn introduction to Apache Thrift
An introduction to Apache ThriftMike Frampton
 
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012larsgeorge
 
Introduction sur les problématiques d'une architecture distribuée
Introduction sur les problématiques d'une architecture distribuéeIntroduction sur les problématiques d'une architecture distribuée
Introduction sur les problématiques d'une architecture distribuéeKhanh Maudoux
 
Apache Thrift, a brief introduction
Apache Thrift, a brief introductionApache Thrift, a brief introduction
Apache Thrift, a brief introductionRandy Abernethy
 
3d Construction Card
3d Construction Card3d Construction Card
3d Construction CardJanet Ellis
 
HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014larsgeorge
 
HBase Sizing Notes
HBase Sizing NotesHBase Sizing Notes
HBase Sizing Noteslarsgeorge
 
Big Data is not Rocket Science
Big Data is not Rocket ScienceBig Data is not Rocket Science
Big Data is not Rocket Sciencelarsgeorge
 
Big Data – HBase, integrando hadoop, bi e dw; Montando o seu big data Cloude...
Big Data – HBase, integrando hadoop, bi e dw; Montando o seu big data  Cloude...Big Data – HBase, integrando hadoop, bi e dw; Montando o seu big data  Cloude...
Big Data – HBase, integrando hadoop, bi e dw; Montando o seu big data Cloude...Flavio Fonte, PMP, ITIL
 

Destacado (20)

HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars George
 
H base key design
H base key designH base key design
H base key design
 
Valerii Moisieienko Apache hbase workshop
Valerii Moisieienko	Apache hbase workshopValerii Moisieienko	Apache hbase workshop
Valerii Moisieienko Apache hbase workshop
 
Web intelligence and big data
Web intelligence and big dataWeb intelligence and big data
Web intelligence and big data
 
The analytics edge
The analytics edgeThe analytics edge
The analytics edge
 
Apache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesApache thrift-RPC service cross languages
Apache thrift-RPC service cross languages
 
Ysance conference - cloud computing - aws - 3 mai 2010
Ysance   conference - cloud computing - aws - 3 mai 2010Ysance   conference - cloud computing - aws - 3 mai 2010
Ysance conference - cloud computing - aws - 3 mai 2010
 
Hadoop unit
Hadoop unitHadoop unit
Hadoop unit
 
Social Networks and the Richness of Data
Social Networks and the Richness of DataSocial Networks and the Richness of Data
Social Networks and the Richness of Data
 
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 GenoaHadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
 
An introduction to Apache Thrift
An introduction to Apache ThriftAn introduction to Apache Thrift
An introduction to Apache Thrift
 
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012
 
Introduction sur les problématiques d'une architecture distribuée
Introduction sur les problématiques d'une architecture distribuéeIntroduction sur les problématiques d'une architecture distribuée
Introduction sur les problématiques d'une architecture distribuée
 
Présentation Club STORM
Présentation Club STORMPrésentation Club STORM
Présentation Club STORM
 
Apache Thrift, a brief introduction
Apache Thrift, a brief introductionApache Thrift, a brief introduction
Apache Thrift, a brief introduction
 
3d Construction Card
3d Construction Card3d Construction Card
3d Construction Card
 
HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014
 
HBase Sizing Notes
HBase Sizing NotesHBase Sizing Notes
HBase Sizing Notes
 
Big Data is not Rocket Science
Big Data is not Rocket ScienceBig Data is not Rocket Science
Big Data is not Rocket Science
 
Big Data – HBase, integrando hadoop, bi e dw; Montando o seu big data Cloude...
Big Data – HBase, integrando hadoop, bi e dw; Montando o seu big data  Cloude...Big Data – HBase, integrando hadoop, bi e dw; Montando o seu big data  Cloude...
Big Data – HBase, integrando hadoop, bi e dw; Montando o seu big data Cloude...
 

Similar a HBase Advanced Schema Design - Berlin Buzzwords - June 2012

Schema Design
Schema DesignSchema Design
Schema DesignQBurst
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxRahul Borate
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxRahul Borate
 
Conhecendo o Apache HBase
Conhecendo o Apache HBaseConhecendo o Apache HBase
Conhecendo o Apache HBaseFelipe Ferreira
 
Nyc hadoop meetup introduction to h base
Nyc hadoop meetup   introduction to h baseNyc hadoop meetup   introduction to h base
Nyc hadoop meetup introduction to h base智杰 付
 
Hbase schema design and sizing apache-con europe - nov 2012
Hbase schema design and sizing   apache-con europe - nov 2012Hbase schema design and sizing   apache-con europe - nov 2012
Hbase schema design and sizing apache-con europe - nov 2012Chris Huang
 
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"Inhacking
 
NoSql - mayank singh
NoSql - mayank singhNoSql - mayank singh
NoSql - mayank singhMayank Singh
 
The No SQL Principles and Basic Application Of Casandra Model
The No SQL Principles and Basic Application Of Casandra ModelThe No SQL Principles and Basic Application Of Casandra Model
The No SQL Principles and Basic Application Of Casandra ModelRishikese MR
 
Introduction to Apache HBase
Introduction to Apache HBaseIntroduction to Apache HBase
Introduction to Apache HBaseGokuldas Pillai
 
CCS334 BIG DATA ANALYTICS UNIT 5 PPT ELECTIVE PAPER
CCS334 BIG DATA ANALYTICS UNIT 5 PPT  ELECTIVE PAPERCCS334 BIG DATA ANALYTICS UNIT 5 PPT  ELECTIVE PAPER
CCS334 BIG DATA ANALYTICS UNIT 5 PPT ELECTIVE PAPERKrishnaVeni451953
 
Cassandra an overview
Cassandra an overviewCassandra an overview
Cassandra an overviewPritamKathar
 

Similar a HBase Advanced Schema Design - Berlin Buzzwords - June 2012 (20)

HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
Schema Design
Schema DesignSchema Design
Schema Design
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptx
 
Apache HBase Workshop
Apache HBase WorkshopApache HBase Workshop
Apache HBase Workshop
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptx
 
01 hbase
01 hbase01 hbase
01 hbase
 
No sql databases
No sql databasesNo sql databases
No sql databases
 
Conhecendo o Apache HBase
Conhecendo o Apache HBaseConhecendo o Apache HBase
Conhecendo o Apache HBase
 
Nyc hadoop meetup introduction to h base
Nyc hadoop meetup   introduction to h baseNyc hadoop meetup   introduction to h base
Nyc hadoop meetup introduction to h base
 
NoSql
NoSqlNoSql
NoSql
 
Hbase schema design and sizing apache-con europe - nov 2012
Hbase schema design and sizing   apache-con europe - nov 2012Hbase schema design and sizing   apache-con europe - nov 2012
Hbase schema design and sizing apache-con europe - nov 2012
 
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
 
NoSql - mayank singh
NoSql - mayank singhNoSql - mayank singh
NoSql - mayank singh
 
Apache hive
Apache hiveApache hive
Apache hive
 
The No SQL Principles and Basic Application Of Casandra Model
The No SQL Principles and Basic Application Of Casandra ModelThe No SQL Principles and Basic Application Of Casandra Model
The No SQL Principles and Basic Application Of Casandra Model
 
Introduction to Apache HBase
Introduction to Apache HBaseIntroduction to Apache HBase
Introduction to Apache HBase
 
Hbase 20141003
Hbase 20141003Hbase 20141003
Hbase 20141003
 
CCS334 BIG DATA ANALYTICS UNIT 5 PPT ELECTIVE PAPER
CCS334 BIG DATA ANALYTICS UNIT 5 PPT  ELECTIVE PAPERCCS334 BIG DATA ANALYTICS UNIT 5 PPT  ELECTIVE PAPER
CCS334 BIG DATA ANALYTICS UNIT 5 PPT ELECTIVE PAPER
 
Cassandra an overview
Cassandra an overviewCassandra an overview
Cassandra an overview
 
4. hbase overview
4. hbase overview4. hbase overview
4. hbase overview
 

Más de larsgeorge

Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadooplarsgeorge
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv larsgeorge
 
HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014larsgeorge
 
HBase Sizing Guide
HBase Sizing GuideHBase Sizing Guide
HBase Sizing Guidelarsgeorge
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013larsgeorge
 
HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017larsgeorge
 
Realtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaseRealtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaselarsgeorge
 

Más de larsgeorge (7)

Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
 
HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014
 
HBase Sizing Guide
HBase Sizing GuideHBase Sizing Guide
HBase Sizing Guide
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013
 
HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017
 
Realtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaseRealtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBase
 

Último

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 

Último (20)

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

HBase Advanced Schema Design - Berlin Buzzwords - June 2012

  • 1. Advanced HBase Schema Design Berlin Buzzwords, June 2012 Lars George lars@cloudera.com
  • 2. About Me • Solutions Architect @ Cloudera • Apache HBase & Whirr Committer • Author of HBase – The Definitive Guide • Working with HBase since end of 2007 • Organizer of the Munich OpenHUG
  • 3. Agenda 1 Overview of HBase 2 Schema Design 3 Examples 4 Wrap up 3
  • 11. Physical Model: HBase Tables and Regions • Table is made up of any number if regions • Region is specified by its startKey and endKey • Each region may live on a different node and is made up of several HDFS files and blocks, each of which is replicated by Hadoop
  • 13. Storage Separation • Column Families allow for separation of data – Used by Columnar Databases for fast analytical queries, but on column level only – Allows different or no compression depending on the content type • Segregate information based on access pattern • Data is stored in one or more storage file, called HFiles
  • 15. Fold, Store, and Shift • Logical layout does not match physical one • All values are stored with the full coordinates, including: Row Key, Column Family, Column Qualifier, and Timestamp • Folds columns into “row per column” • NULLs are cost free as nothing is stored • Versions are multiple “rows” in folded table
  • 16. Logical Model: HBase Tables • Tables are sorted by Row in lexicographical order • Table schema only defines its column families – Each family consists of any number of columns – Each column consists of any number of versions – Columns only exist when inserted, NULLs are free – Columns within a family are sorted and stored together – Everything except table names are byte[] (Table, Row, Family:Column, Timestamp) -> Value
  • 17. Column Family vs. Column • Use only a few column families – Causes many files that need to stay open per region plus class overhead per family • Best used when logical separation between data and meta columns • Sorting per family can be used to convey application logic or access pattern
  • 19. Key Cardinality • The best performance is gained from using row keys • Time range bound reads can skip store files – So can Bloom Filters • Selecting column families reduces the amount of data to be scanned • Pure value based filtering is a full table scan – Filters often are too, but reduce network traffic
  • 20. Key/Table Design • Crucial to gain best performance – Why do I need to know? Well, you also need to know that RDBMS is only working well when columns are indexed and query plan is OK • Absence of secondary indexes forces use of row key or column name sorting • Transfer multiple indexes into one – Generate large table -> Good since fits architecture and spreads across cluster
  • 21. DDI • Stands for Denormalization, Duplication and Intelligent Keys • Needed to overcome shortcomings of architecture • Denormalization -> Replacement for JOINs • Duplication -> Design for reads • Intelligent Keys -> Implement indexing and sorting, optimize reads
  • 23. Key Design Summary • Based on access pattern, either use sequential or random keys • Often a combination of both is needed – Overcome architectural limitations • Neither is necessarily bad – Use bulk import for sequential keys and reads – Random keys are good for random access patterns
  • 24. Pre-materialize Everything • Achieve one read per customer request if possible • Otherwise keep at lowest number • Reads between 10ms (cache miss) and 1ms (cache hit) • Use MapReduce to compute exacts in batch • Store and merge updates live • Use incrementColumnValue Motto: “Design for Reads”
  • 28. WTH?
  • 29.
  • 30. Example: OpenTSDB • Metric Type, Tags are stored as IDs • Periodically rolled up
  • 31. Summary • Design for Use-Case – Read, Write, or Both? • Avoid Hotspotting • Consider using IDs instead of full text • Leverage Column Family to HFile relation • Shift details to appropriate position – Composite Keys – Column Qualifiers
  • 32. Summary (cont.) • Schema design is a combination of – Designing the keys (row and column) – Segregate data into column families – Choose compression and block sizes • Similar techniques are needed to scale most systems – Add indexes, partition data, consistent hashing • Denormalization, Duplication, and Intelligent Keys (DDI)