HBase
In Practice




Ravi Veeramachaneni
Topics
           Why HBase?
           HBase Usecases – HBase @Navteq
                Design Considerations
                Hardware/Deployment Considerations
                Practical Tips (Tuning/Optimization)
                Wanted Features




Ravi Veeramachaneni                    HBase – In Practice   2
Hadoop Benefits
          • Stores (HDFS) and processes (MR) large amounts of data
          • Scales (100s and 1000s of nodes)
          • Inexpensive (no license cost, low cost hardware)
          • Fast (1TB sort in 62s, 1PB in 16.25h*)
          • Availability (failover built into the platform)
          • Data Recoverability (failure should not result in any data
            loss)
          • Replication (out-of-the-box 3-way replication and
            configurable)
          • Better Throughput (Time to read the whole dataset is more
                 important than latency in reading the first record)
          • Write once and read-many-times pattern
          • Works well with structured, unstructured or semi-structured
            data
              *YDN Blog: Jim Gray’s Benchmark @ http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/

But …
           Not so good or does not support
                      • Random access
                      • Updating the data and/or file (writes are always
                        at the EOF)
                      • Apps that require low latency access to data
                      • Does not support lots of small files
                      • Does not support multiple writers
                      • Not a solution for every Data problem



Featuring HBase
           HBase Scales (runs on top of Hadoop)
           HBase provides fast table scans for time ranges and
            fast key based lookups
           HBase stores null values for free
                      • Saves both disk space and disk IO time
           HBase supports unstructured/semi-structured data through
            column families
           HBase has built-in version management
           Map Reduce data input
                      • Tables are sorted and have unique keys
                          • Reducer is often optional
                          • Combiner not needed
           Strong community support and wider adoption
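
The data-model points above (free nulls, built-in versions, sorted unique keys, key-based lookups and range scans) can be illustrated with a toy in-memory table. This is only a sketch: `ToyTable` and its methods are hypothetical names, not the HBase client API, and the row keys are made up.

```python
import bisect


class ToyTable:
    """Toy stand-in for an HBase table (illustration only, not the HBase API)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._row_keys = []   # row keys kept in byte-sorted order
        self._rows = {}       # row key -> {column: [(timestamp, value), ...]}

    def put(self, row, column, value, ts):
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        versions = self._rows[row].setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest first
        del versions[self.max_versions:]                 # drop old versions

    def get(self, row, column):
        """Newest value of a cell, or None if the cell was never written."""
        versions = self._rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start, stop):
        """Rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._row_keys, start)
        hi = bisect.bisect_left(self._row_keys, stop)
        return [(key, self._rows[key]) for key in self._row_keys[lo:hi]]

    def cell_count(self):
        """Number of stored cell versions; unwritten (null) cells cost nothing."""
        return sum(len(v) for cols in self._rows.values() for v in cols.values())


table = ToyTable(max_versions=2)
table.put(b"place#0002", b"attr:name", b"Cafe B", ts=1)
table.put(b"place#0001", b"attr:name", b"Cafe A", ts=1)
table.put(b"place#0001", b"attr:name", b"Cafe A2", ts=2)
table.put(b"place#0001", b"attr:name", b"Cafe A3", ts=3)
```

Only cells that were written occupy space, the oldest version of the `place#0001` name is dropped once the version limit is reached, and a scan returns rows in key order regardless of insertion order.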

HBase Usecases
                To solve Big Data problems
                Sparse data (un- or semi-structured)
                Cost-effectively scalable
                Versioned data
                Some other features that may interest you
                         Linear distribution of data across the data nodes
                         Rows are stored in byte-lexicographic sorted order
                         Atomic Read/Write/Update
                         Data Access – Random, Sequential reads and writes
                         Automatic replication of Data for HA
           But not for every Data problem

Navteq’s Usecase
           Content is
                      –   Constantly growing (in higher TB)
                      –   Sparse and unstructured
                      –   Provided in multiple data formats
                      –   Ingested, processed and delivered in transactional and batch mode


           Content Breadth
                      – 100s of millions of content records
                      – 100s of content suppliers + community input


           Content Depth
                      – On average, a content record has 120 attributes
                      – Certain types of content have more than 400 attributes
                      – Content classified across 270+ categories

Content Processing High-level Overview
           [Diagram; labels recovered from the slide]
                      • Batch and Transactional API
                      • Bulk Content Sources; Customer and Community UGC
                      • Merchant Data; Community, User and Merchant Media
                      • Place ID from Place Registry; Location ID from Location Referencing
                      • Source & Blended Record Management; Tiered Quality System
                      • Publishing (Place ID, Location ID): real-time, on-demand; Bulk Content
                        delivery; Search, and other mobile devices



HBase @ NAVTEQ
           Started in 2009, hbase 0.19.x (apache)
                      • 8-node VMWare Sandbox Cluster
                      • Flaky, unstable, RS Failures
                      • Switched to CDH
           Early 2010, hbase 0.20.x (CDH2)
                      • 10-node Physical Sandbox Cluster
                      • Still had a lot of challenges, RS Failures, META corruption
                      • Cluster expanded significantly with multiple environments
           Current (hbase 0.90.3)
                      • Moved to CDH3u1 official release
                      • Multiple teams/projects using Hadoop/HBase implementation
                      • Working on Hive/HBase integration, Oozie, Lucene/Solr
                        integration, Cloudera Enterprise and a few others


Measured Business Value
           Scalability & Deployment
                      •   Spikes are handled by simply adding nodes
                      •   No code changes or deployment needed
                      •   From 15 to 30 to 60 nodes and more, as data grows
                      •   Deployments are well managed and controlled (from 12-16
                          hours to < 2 hours)
           Speed to Market
                      • Real-time transactions are supported (instead of quarterly
                        updates)
                      • Batch updates are handled more efficiently (from days to
                        hours)
           Faster Supplier On-boarding
                      • Flexible and externally managed Business Rules
           Cheaper than the existing solution
                       <$2m vs. $12m (based on projected growth)
HBase & Zookeeper
           ZK – Distributed coordination service
                      • Coordinates messages sent across the network between nodes
                        (network fails, etc.)
           HBase depends on ZK and delegates cluster state management to it

           HBase hosts key info on ZK
                      • Location of root catalog table
                      • Address of the current cluster master
                      • Bootstrapping a client connection to an HBase cluster

           Client connects to ZK quorum first
                      • To learn the location of -ROOT-
                      • Clients consult -ROOT- to find the location of the .META. region
                      • The client then looks up .META. to find the user-space region
                        hosting the row, and its location
                      • Clients cache all of the above for future lookups
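
The lookup chain above (ZK, then -ROOT-, then .META., then the user region, with caching) can be sketched as a small simulation. Everything here is an assumption for illustration: `ToyClient`, the `"root-location"` key, the server names and the dictionary-based catalogs are hypothetical and do not mirror the real client classes.

```python
import bisect


class ToyClient:
    """Simulates the three-step region lookup and the client-side cache."""

    def __init__(self, zk, root_table, meta_table):
        self._zk = zk                  # stand-in for the ZK quorum
        self._root_table = root_table  # stand-in for -ROOT-
        self._meta_table = meta_table  # stand-in for .META.
        self._root_location = None
        self._meta_location = None
        self._regions = {}             # table -> [(start_key, end_key, server)]
        self.round_trips = 0           # remote lookups performed so far

    def _cached(self, table, row):
        for start, end, server in self._regions.get(table, []):
            if start <= row and (end is None or row < end):
                return server
        return None

    def locate(self, table, row):
        server = self._cached(table, row)
        if server is not None:
            return server              # cache hit: no remote lookup
        if self._root_location is None:
            self._root_location = self._zk["root-location"]   # step 1: ask ZK
            self.round_trips += 1
        if self._meta_location is None:
            self._meta_location = self._root_table[".META."]  # step 2: ask -ROOT-
            self.round_trips += 1
        regions = self._meta_table[table]                     # step 3: ask .META.
        starts = [start for start, _ in regions]
        i = bisect.bisect_right(starts, row) - 1
        start, server = regions[i]
        end = regions[i + 1][0] if i + 1 < len(regions) else None
        self._regions.setdefault(table, []).append((start, end, server))
        self.round_trips += 1
        return server


client = ToyClient(
    zk={"root-location": "rs1"},
    root_table={".META.": "rs2"},
    meta_table={"places": [(b"", "rs3"), (b"m", "rs4")]},  # sorted by start key
)
```

The first lookup costs three round trips; a second row in the same region costs none, and a row in a different region costs only the .META. lookup because the catalog locations are already cached.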

Design Considerations
           Database/schema design
                      • Transition to Column-oriented or flat schema
           Understand your access pattern
           Row-key design/implementation
                      • Sequential keys
                          • Load is poorly distributed (hot regions), but the block cache is used well
                          • Can be addressed by pre-splitting the regions
                      • Randomize keys to get better distribution
                          • Achieved through hashing on Key Attributes – SHA1 or MD5
                          • Range scans suffer
           Too many Column Families (NOT Good)
                      • Initially we had about 30 or so, now reduced to 8
           Compression
                      • LZO or Snappy (20% better than LZO) – Block (default)
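
The row-key trade-off on this slide can be made concrete with a short sketch. It assumes a hex SHA1 prefix is prepended to the natural key and that the table is pre-split on evenly spaced hex prefixes; `hashed_key`, `split_points`, `region_for` and the `place-…` keys are hypothetical helpers and sample data, not anything from the talk.

```python
import bisect
import hashlib


def hashed_key(natural_key, prefix_len=4):
    """Prefix the natural key with the first hex characters of its SHA1 digest."""
    digest = hashlib.sha1(natural_key).hexdigest().encode("ascii")
    return digest[:prefix_len] + b"|" + natural_key


def split_points(num_regions, prefix_len=4):
    """Evenly spaced hex prefixes usable as pre-split boundaries."""
    space = 16 ** prefix_len
    return [
        format(i * space // num_regions, "0{}x".format(prefix_len)).encode("ascii")
        for i in range(1, num_regions)
    ]


def region_for(key, splits):
    """Index of the region whose [start, end) key range contains the key."""
    return bisect.bisect_right(splits, key)


# Sequential natural keys, as a batch load might produce them.
keys = [("place-%06d" % i).encode("ascii") for i in range(1000)]
splits = split_points(4)

sequential_regions = {region_for(k, splits) for k in keys}
hashed_regions = {region_for(hashed_key(k), splits) for k in keys}
```

With these sample keys the sequential form lands in a single region while the hashed form spreads over all four, which is the load-distribution benefit. The cost is the one the slide names: hashed keys no longer sort in natural order, so a range scan over consecutive natural keys has to touch every region.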

Design Considerations
           Serialization
                      • Avro didn’t work well – deserialization issue
                      • Developed a configurable serialization mechanism that uses JSON
                        for everything except the Date type
           Secondary Indexes
                      • Used ITHBase and IHBase from contrib; they did not work well
                      • Redesigned the schema so that no index was needed
                      • We still need it though
           Performance
                      • Several tunable parameters
                          • Hadoop, HBase, OS, JVM, Networking, Hardware
           Scalability
                      • Interfacing with real-time (interactive) systems from batch oriented
                        system


Hadoop/HBase Processes




Hardware/Deployment Considerations
           Hardware (Hadoop+HBase)
                      • Data Node - 24GB RAM, 8 Cores, 4x1TB (64GB, 24 Cores, 8x2TB)
                      • 6 mappers and 6 reducers per node (16 mappers, 4 reducers)
                      • Memory allocation by process
                          •   Data Node – 1GB (2GB)
                          •   Task Tracker – 1GB (2GB)
                          •   Map Tasks – 6x1GB (16x1.5GB)
                          •   Reduce Tasks – 6x1GB (4x1.5GB)
                          •   Region Server – 8GB (24GB)
                          •   Total Allocation: 24GB (64GB)
           Deployment
                      • Do not run ZK instances on DN, have a separate ZK quorum (3
                        minimum)
                      • Do not run HMaster on NN
                      • Avoid SPOF for HMaster (run additional master(s))
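
The per-process memory figures on this slide can be tallied as a quick check. The sketch below only adds up the numbers given; reading the difference from the node's RAM as headroom for the OS and other processes is an assumption, and the names `SMALL_NODE`, `LARGE_NODE`, `total_gb` and `headroom_gb` are made up for the example.

```python
# (process count, GB per process) for each profile on the slide.
SMALL_NODE = {
    "DataNode": (1, 1.0),
    "TaskTracker": (1, 1.0),
    "Map tasks": (6, 1.0),
    "Reduce tasks": (6, 1.0),
    "RegionServer": (1, 8.0),
}
LARGE_NODE = {
    "DataNode": (1, 2.0),
    "TaskTracker": (1, 2.0),
    "Map tasks": (16, 1.5),
    "Reduce tasks": (4, 1.5),
    "RegionServer": (1, 24.0),
}


def total_gb(budget):
    """Sum of heap allocated to all listed processes, in GB."""
    return sum(count * gb_each for count, gb_each in budget.values())


def headroom_gb(budget, node_ram_gb):
    """RAM left over after the listed processes (assumed to go to the OS)."""
    return node_ram_gb - total_gb(budget)
```

The listed processes add up to 22GB on the 24GB node and 58GB on the 64GB node, so the "Total Allocation" line appears to quote the node's physical RAM rather than the sum of the heaps.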


HBase Configuration/Tuning
           Configuring HBase
                      • Configuration is the key
                      • Many moving parts – typos, out of synchronization
                      • Operating System
                          • Raise the number of open files (ulimit) to 32K or even higher (/etc/security/limits.conf)
                          • Set vm.swappiness to a low value or 0
                      • HDFS
                          • Adjust block size based on the use case
                          • Increase xceivers to 2047 (dfs.datanode.max.xcievers; the property name is misspelled in Hadoop itself)
                          • Set the socket write timeout to 0 (dfs.datanode.socket.write.timeout)
                      • HBase
                          • Needs more memory
                          • No swapping – JVM hates it
                          • GC pauses could cause timeouts or RS failures (read article posted by
                            Todd Lipcon on avoiding full GC)
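
The two operating-system settings above can be checked from a script before starting a node. This is a minimal sketch for Linux, assuming the standard `/proc/sys/vm/swappiness` file and Python's `resource` module; the 32768 target mirrors the "32K" on the slide, and the function names are hypothetical.

```python
import resource


def nofile_ok(soft_limit, target=32768):
    """True if the open-file soft limit is unlimited or at least the target."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= target


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Current vm.swappiness as an int, or None if it cannot be read."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


def os_report(target_nofile=32768):
    """Collect the two settings for the current process and host."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {
        "nofile_soft": soft,
        "nofile_ok": nofile_ok(soft, target_nofile),
        "swappiness": read_swappiness(),
    }
```

The limit reported is the one inherited by the calling process, so the check is only meaningful when run as the same user, and through the same login path, as the Hadoop and HBase daemons.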


HBase Configuration/Tuning
           HBase
                      • Per-cluster
                          • Turn off the block cache if the hit ratio is low (hfile.block.cache.size, default
                            0.2, i.e. 20% of heap)
                      • Per-table
                          • MemStore flush Size (hbase.hregion.memstore.flush.size, default 64MB and
                            hbase.hregion.memstore.block.multiplier, default 2)
                          • Max File Size (hbase.hregion.max.filesize, default 256MB)
                      • Per-CF
                          • Compression
                          • Bloom Filter
                      • Per-RS
                          • Amount of heap in each RS to reserve for all MemStores
                            (hbase.regionserver.global.memstore.upperLimit, default 0.4)
                          • MemStore flush size
                          • Max file size
                      • Per-SF
                          • Maximum number of SFs per store to allow
                            (hbase.hstore.blockingStoreFiles, default 7)
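
How the MemStore settings above interact is easier to see with numbers. The sketch below uses the defaults quoted on the slide and the 8GB Region Server heap from the hardware slide; `memstore_limits` is a hypothetical helper, and the region count assumes one column family (one MemStore) per region.

```python
MB = 1024 * 1024
GB = 1024 * MB


def memstore_limits(heap_bytes, flush_size=64 * MB, block_multiplier=2,
                    global_upper=0.4):
    """Derive the thresholds implied by the MemStore settings."""
    global_cap = int(heap_bytes * global_upper)
    return {
        # A MemStore is flushed to a store file once it reaches this size.
        "flush_at": flush_size,
        # Updates to a region are blocked if its MemStore grows this large.
        "block_updates_at": flush_size * block_multiplier,
        # Heap that all MemStores on the Region Server may use together.
        "global_memstore_cap": global_cap,
        # Regions that can each hold a full MemStore before the cap is hit.
        "regions_at_full_memstore": global_cap // flush_size,
    }


defaults_8gb = memstore_limits(8 * GB)
```

With an 8GB heap and the defaults, updates to a region block at 128MB, and roughly 51 regions can fill their MemStores before the 40% global cap forces flushes. More regions per server, or several column families per region, reach that cap sooner.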

HBase Configuration/Tuning
                      • HBase
                          • Write (puts) optimization (Ryan Rawson HUG8 presentation – HBase
                            importing)
                               –   hbase.regionserver.global.memstore.upperLimit=0.3
                               –   hbase.regionserver.global.memstore.lowerLimit=0.15
                               –   hbase.regionserver.handler.count=256
                               –   hbase.hregion.memstore.block.multiplier=8
                               –   hbase.hstore.blockingStoreFiles=25
                          • Control number of store files (hbase.hregion.max.filesize)
           Security
                      • Still in flux, need robust RBAC
           Reliability
                      • Name Node is SPOF
                      • HBase is sensitive
                          • Region Server Failures

Desired Features
           Better operational tools for using Hadoop and HBase
                      • Job management, backup, restore, user provisioning, general
                        administrative tasks, etc.
                Support for Secondary Indexes
                Full-text Indexes and Searching (Lucene/Solr integration?)
                HA support for Name Node
                Need Data Replication for HA & DR
                Security at Table, CF and Row level
                Good documentation (it’s getting better, though, now that Lars
                 George’s book is out)



Thank you




The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computing
 
HBase and Hadoop at Urban Airship
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airship
 
hive.pptx
hive.pptxhive.pptx
hive.pptx
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 

Último

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Último (20)

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Practical HBase - Hadoop World 2011

  • 2. Topics
      • Why HBase?
      • HBase Use Cases – HBase @Navteq
          – Design Considerations
          – Hardware/Deployment Considerations
          – Practical Tips (Tuning/Optimization)
          – Wanted Features
  • 3. Hadoop Benefits
      • Stores (HDFS) and processes (MR) large amounts of data
      • Scales (100s and 1000s of nodes)
      • Inexpensive (no license cost, low-cost hardware)
      • Fast (1TB sort in 62s, 1PB in 16.25h*)
      • Availability (failover built into the platform)
      • Data recoverability (failure should not result in any data loss)
      • Replication (3-way out of the box, and configurable)
      • Better throughput (time to read the whole dataset is more important than latency in reading the first record)
      • Write-once, read-many-times pattern
      • Works well with structured, unstructured or semi-structured data
      *YDN Blog: Jim Gray’s Benchmark @ http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/
  • 4. But …
      • Not so good at, or does not support:
          – Random access
          – Updating the data and/or file (writes are always at the EOF)
          – Apps that require low-latency access to data
          – Does not support lots of small files
          – Does not support multiple writers
          – Not a solution for every data problem
  • 5. Featuring HBase
      • HBase scales (runs on top of Hadoop)
      • HBase provides fast table scans for time ranges and fast key-based lookups
      • HBase stores null values for free
          – Saves both disk space and disk IO time
      • HBase supports unstructured/semi-structured data through column families
      • HBase has built-in version management
      • MapReduce data input
          – Tables are sorted and have unique keys
          – Reducer is often optional
          – Combiner not needed
      • Strong community support and wider adoption
  • 6. HBase Use Cases
      • To solve Big Data problems
      • Sparse data (un- or semi-structured)
      • Cost-effectively scalable
      • Versioned data
      • Some other features that may interest you
      • Linear distribution of data across the data nodes
      • Rows are stored in byte-lexicographic sorted order
      • Atomic read/write/update
      • Data access: random and sequential reads and writes
      • Automatic replication of data for HA
      • But not for every data problem
  • 7. Navteq’s Use Case
      • Content is
          – Constantly growing (in higher TB)
          – Sparse and unstructured
          – Provided in multiple data formats
          – Ingested, processed and delivered in transactional and batch mode
      • Content Breadth
          – 100s of millions of content records
          – 100s of content suppliers + community input
      • Content Depth
          – On average, a content record has 120 attributes
          – Certain types of content have more than 400 attributes
          – Content classified across 270+ categories
  • 8. Content Processing High-level Overview
      [Diagram; recoverable labels: Batch and Transactional API; Bulk Content; Community UGC; Place ID from Place Registry; Location ID from Location Referencing; Source & Blended Record Management; Tiered Quality System; Publishing (real-time, on-demand delivery; search and other mobile devices)]
  • 9. HBase @ NAVTEQ
      • Started in 2009, HBase 0.19.x (Apache)
          – 8-node VMware sandbox cluster
          – Flaky, unstable, RS failures
          – Switched to CDH
      • Early 2010, HBase 0.20.x (CDH2)
          – 10-node physical sandbox cluster
          – Still had a lot of challenges: RS failures, META corruption
          – Cluster expanded significantly with multiple environments
      • Current (HBase 0.90.3)
          – Moved to CDH3u1 official release
          – Multiple teams/projects using the Hadoop/HBase implementation
          – Working on Hive/HBase integration, Oozie, Lucene/Solr integration, Cloudera Enterprise and a few others
  • 10. Measured Business Value
      • Scalability & Deployment
          – Spikes are handled by simply adding nodes
          – No code changes or deployment needed
          – From 15 to 30 to 60 nodes and more, as data grows
          – Deployments are well managed and controlled (from 12-16 hours to < 2 hours)
      • Speed to Market
          – By supporting real-time transactions (instead of a quarterly update)
          – Batch updates are handled more efficiently (from days to hours)
      • Faster Supplier On-boarding
          – Flexible and externally managed business rules
      • Cheaper than the existing solution
          – <$2m vs. $12m (based on projected growth)
  • 11. HBase & ZooKeeper
      • ZK: distributed coordination service
          – Coordinates messages sent across the network between nodes (network failures, etc.)
      • HBase depends on ZK and authorizes ZK to manage the state
      • HBase hosts key info on ZK
          – Location of the root catalog table
          – Address of the current cluster master
          – Bootstrapping a client connection to an HBase cluster
      • Client connects to the ZK quorum first
          – To learn the location of -ROOT-
          – Clients consult -ROOT- to find the location of the .META. region
          – The client then does a lookup against that .META. region to find the hosting user-space region and its location
          – Clients cache all of the above for future lookups
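The three-step lookup on this slide can be illustrated with a toy model. This is only a sketch: the dictionaries `ZK`, `ROOT` and `META`, the server names and the `Client` class are hypothetical stand-ins, not HBase or ZooKeeper APIs, and the cache is simplified to hold the whole region list after one traversal.

```python
from bisect import bisect_right

# Hypothetical stand-ins for cluster state (not real HBase/ZooKeeper APIs).
ZK = {"root-region-server": "rs1"}            # ZK knows where -ROOT- lives
ROOT = {"rs1": {".META.": "rs2"}}             # -ROOT- points at the .META. region
META = {"rs2": [("", "rs3"), ("g", "rs4"), ("p", "rs5")]}  # sorted (start key, server)


class Client:
    def __init__(self):
        self.regions = None    # cached region list, filled on first lookup
        self.traversals = 0    # how many times the full ZK -> ROOT -> META walk ran

    def _load_regions(self):
        root_server = ZK["root-region-server"]      # step 1: ask ZK for -ROOT-
        meta_server = ROOT[root_server][".META."]   # step 2: -ROOT- gives .META.
        self.traversals += 1
        return META[meta_server]                    # step 3: .META. lists user regions

    def locate(self, row_key):
        """Return the server hosting the region that contains row_key."""
        if self.regions is None:                    # only walk the chain on a cache miss
            self.regions = self._load_regions()
        starts = [start for start, _ in self.regions]
        index = bisect_right(starts, row_key) - 1   # last region whose start <= row_key
        return self.regions[index][1]


client = Client()
for key in ("apple", "hello", "zebra"):
    print(key, "->", client.locate(key))
```

After the first call, later lookups are answered from the client-side cache, which is the point of the last bullet on the slide.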
  • 12. Design Considerations
      • Database/schema design
          – Transition to a column-oriented or flat schema
      • Understand your access pattern
      • Row-key design/implementation
          – Sequential keys
              · Suffer from uneven load distribution, but make good use of the block cache
              · Can be addressed by pre-splitting the regions
          – Randomize keys to get better distribution
              · Achieved through hashing on key attributes (SHA1 or MD5)
              · Range scans suffer
      • Too many column families (NOT good)
          – Initially we had about 30 or so, now reduced to 8
      • Compression
          – LZO or Snappy (20% better than LZO), block (default)
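A minimal sketch of the row-key trade-off described above, assuming an MD5 hex prefix is prepended to the natural key and the table is pre-split on that prefix. The helper names (`hashed_row_key`, `split_points`, `region_for`), the 4-character prefix and the `place-…` key format are illustrative assumptions, not the scheme used at Navteq.

```python
import hashlib
from bisect import bisect_right
from collections import Counter


def hashed_row_key(natural_key, prefix_len=4):
    """Prefix the natural key with part of its MD5 hex digest to spread writes."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    return (digest[:prefix_len] + "-" + natural_key).encode("utf-8")


def split_points(num_regions, prefix_len=4):
    """Evenly spaced hex boundaries for pre-splitting on the hash prefix."""
    space = 16 ** prefix_len
    return [format(i * space // num_regions, "0%dx" % prefix_len).encode("ascii")
            for i in range(1, num_regions)]


def region_for(row_key, splits):
    """Index of the region whose key range contains row_key."""
    return bisect_right(splits, row_key)


splits = split_points(4)
sequential = ["place-%06d" % i for i in range(1000)]

# Sequential keys all sort into one region; hashed keys spread across regions.
plain = Counter(region_for(k.encode("utf-8"), splits) for k in sequential)
hashed = Counter(region_for(hashed_row_key(k), splits) for k in sequential)
print("plain keys per region: ", dict(plain))
print("hashed keys per region:", dict(hashed))
```

The cost noted on the slide shows up here too: once keys are hashed, adjacent natural keys no longer sit next to each other, so a range scan over `place-000100` to `place-000200` would have to touch every region.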
  • 13. Design Considerations
      • Serialization
          – Avro didn’t work well (deserialization issue)
          – Developed a configurable serialization mechanism that uses JSON, except for the Date type
      • Secondary Indexes
          – Were using ITHBase and IHBase from contrib; they don’t work well
          – Redesigned the schema to avoid the need for an index
          – We still need it, though
      • Performance
          – Several tunable parameters
          – Hadoop, HBase, OS, JVM, networking, hardware
      • Scalability
          – Interfacing with real-time (interactive) systems from a batch-oriented system
  • 14. Hadoop/HBase Processes
  • 15. Hardware/Deployment Considerations
      • Hardware (Hadoop+HBase)
          – Data Node: 24GB RAM, 8 cores, 4x1TB (64GB, 24 cores, 8x2TB)
          – 6 mappers and 6 reducers per node (16 mappers, 4 reducers)
          – Memory allocation by process
              · Data Node: 1GB (2GB)
              · Task Tracker: 1GB (2GB)
              · Map Tasks: 6x1GB (16x1.5GB)
              · Reduce Tasks: 6x1GB (4x1.5GB)
              · Region Server: 8GB (24GB)
              · Total Allocation: 24GB (64GB)
      • Deployment
          – Do not run ZK instances on DN; have a separate ZK quorum (3 minimum)
          – Do not run HMaster on NN
          – Avoid SPOF for HMaster (run additional master(s))
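The per-process figures above can be added up as a quick sanity check. This sketch reads the parenthesized numbers as a second, larger node profile, and treats whatever is left of the node's RAM as headroom for the OS and other daemons; both readings are assumptions, since the slide itself only lists the totals as 24GB (64GB).

```python
# Per-process heap figures in GB, copied from the slide.
SMALL = {
    "node_ram_gb": 24,
    "processes": {
        "DataNode": 1,
        "TaskTracker": 1,
        "Map tasks": 6 * 1,
        "Reduce tasks": 6 * 1,
        "RegionServer": 8,
    },
}
LARGE = {
    "node_ram_gb": 64,
    "processes": {
        "DataNode": 2,
        "TaskTracker": 2,
        "Map tasks": 16 * 1.5,
        "Reduce tasks": 4 * 1.5,
        "RegionServer": 24,
    },
}


def budget(profile):
    """Return (GB allocated to the listed processes, GB left on the node)."""
    allocated = sum(profile["processes"].values())
    return allocated, profile["node_ram_gb"] - allocated


for name, profile in (("24GB node", SMALL), ("64GB node", LARGE)):
    allocated, headroom = budget(profile)
    print("%s: %.0f GB allocated, %.0f GB headroom" % (name, allocated, headroom))
```

The listed processes come to 22 GB and 58 GB respectively, so a few gigabytes remain unassigned on each profile.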
  • 16. HBase Configuration/Tuning
      • Configuring HBase
          – Configuration is the key
          – Many moving parts: typos, files out of synchronization
          – Operating System
              · Raise the open-file limit (ulimit) to 32K or even higher (/etc/security/limits.conf)
              · Lower vm.swappiness, or set it to 0
          – HDFS
              · Adjust block size based on the use case
              · Increase xceivers to 2047 (dfs.datanode.max.xcievers; the property name is spelled this way in Hadoop)
              · Set socket timeout to 0 (dfs.datanode.socket.write.timeout)
          – HBase
              · Needs more memory
              · No swapping; the JVM hates it
              · GC pauses could cause timeouts or RS failures (read the article posted by Todd Lipcon on avoiding full GC)
  • 17. HBase Configuration/Tuning
      • HBase
          – Per-cluster
              · Turn off the block cache if the hit ratio is low (hfile.block.cache.size, default 20%)
          – Per-table
              · MemStore flush size (hbase.hregion.memstore.flush.size, default 64MB, and hbase.hregion.memstore.block.multiplier, default 2)
              · Max file size (hbase.hregion.max.filesize, default 256MB)
          – Per-CF
              · Compression
              · Bloom filter
          – Per-RS
              · Amount of heap in each RS to reserve for all MemStores (hbase.regionserver.global.memstore.upperLimit, default 0.4)
              · MemStore flush size
              · Max file size
          – Per-SF
              · Maximum number of SFs per store to allow (hbase.hstore.blockingStoreFiles, default 7)
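To make the MemStore settings above concrete, the sketch below works out the thresholds they imply. It assumes the usual reading of these properties: a region flushes at the flush size, updates to it are blocked at flush size times the block multiplier, and all MemStores on a region server together are capped at a fraction of its heap. The function name and the 8GB heap (taken from the hardware slide) are illustrative choices.

```python
MB = 1024 * 1024


def memstore_limits(heap_bytes, flush_size=64 * MB, block_multiplier=2,
                    global_upper_limit=0.4):
    """Derive MemStore thresholds (in bytes) from the slide's defaults."""
    global_cap = int(heap_bytes * global_upper_limit)
    return {
        "flush_at": flush_size,                           # per-region flush trigger
        "block_updates_at": flush_size * block_multiplier,  # per-region write blocking
        "global_memstore_cap": global_cap,                # all MemStores on one RS
        # Roughly how many regions can sit at the flush threshold at once.
        "regions_at_flush_size": global_cap // flush_size,
    }


limits = memstore_limits(heap_bytes=8 * 1024 * MB)
for name, value in limits.items():
    print(name, value if name.startswith("regions") else "%.1f MB" % (value / MB))
```

With the defaults, a region blocks updates at 128MB, and an 8GB region server caps total MemStore usage at about 3.2GB.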
  • 18. HBase Configuration/Tuning
      • HBase
          – Write (puts) optimization (Ryan Rawson HUG8 presentation, HBase importing)
              · hbase.regionserver.global.memstore.upperLimit=0.3
              · hbase.regionserver.global.memstore.lowerLimit=0.15
              · hbase.regionserver.handler.count=256
              · hbase.hregion.memstore.block.multiplier=8
              · hbase.hstore.blockingStoreFiles=25
          – Control number of store files (hbase.hregion.max.filesize)
      • Security
          – Still in flux, need robust RBAC
      • Reliability
          – Name Node is SPOF
          – HBase is sensitive
          – Region Server failures
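One way to keep the write-tuning values above consistent across nodes is to generate the `hbase-site.xml` fragment from a single source rather than editing it by hand. The sketch below does that with Python's standard library; the property names and values are the ones on the slide, while the `WRITE_TUNING` dictionary and `render_site_xml` helper are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Write-optimization settings as listed on the slide.
WRITE_TUNING = {
    "hbase.regionserver.global.memstore.upperLimit": "0.3",
    "hbase.regionserver.global.memstore.lowerLimit": "0.15",
    "hbase.regionserver.handler.count": "256",
    "hbase.hregion.memstore.block.multiplier": "8",
    "hbase.hstore.blockingStoreFiles": "25",
}


def render_site_xml(properties):
    """Render name/value pairs in the <configuration>/<property> site-file layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


print(render_site_xml(WRITE_TUNING))
```

Generating the file this way addresses the "typos, out of synchronization" risk called out on slide 16, since every node receives the same rendered output.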
  • 19. Desired Features
      • Better operational tools for using Hadoop and HBase
          – Job management, backup, restore, user provisioning, general administrative tasks, etc.
      • Support for Secondary Indexes
      • Full-text indexes and searching (Lucene/Solr integration?)
      • HA support for Name Node
      • Need data replication for HA & DR
      • Security at table, CF and row level
      • Good documentation (it’s getting better, though, now that Lars’s book is out)
  • 20. Thank you