dwivedishashwat@gmail.com
QUERIES
Flow of Hbase read/write
QUERY
About Meter Logs:
HBase is best suited for persistent data that grows day by
day. As your data grows you can keep adding nodes, and
you have many tools for processing the data, such as
Pig, Hive and MapReduce.
You can run MapReduce jobs and output datasets
that can be indexed and used for faster searches over
that large volume of data.
QUERY
Usability
• HBase isn't suitable for every problem.
• First, make sure you have enough data. If you have hundreds of millions or
billions of rows, then HBase is a good candidate. If you only have a few
thousand/million rows, then a traditional RDBMS might be a better
choice, because all of your data might wind up on a single node
(or two) while the rest of the cluster sits idle.
• Second, make sure you can live without all the extra features that an
RDBMS provides (e.g., typed columns, secondary indexes, transactions,
advanced query languages, etc.). An application built against an RDBMS
cannot be "ported" to HBase by simply changing a JDBC driver, for
example. Consider moving from an RDBMS to HBase as a complete
redesign as opposed to a port.
• Third, make sure you have enough hardware. Even HDFS doesn't do well
with anything less than 5 DataNodes (due to things such as HDFS block
replication, which has a default of 3), plus a NameNode.
• HBase can run quite well stand-alone on a laptop, but this should be
considered a development configuration only.
QUERY
Difference Between NoSQL DB and RDBMS
• NoSQL is a kind of database that doesn't have a fixed schema like a
traditional RDBMS does. With NoSQL databases the schema is
defined by the developer at run time. Developers don't write normal SQL
statements against the database, but instead use an API to get the data
that they need. NoSQL databases can usually scale across
different physical servers easily, without the client needing to know which
server holds the data it is looking for.
• Why we should consider using it:
  • Durability
  • Scalability on the fly
  • Distributed data
  • Persistence, etc.
QUERY
Row-Oriented and Column-Oriented DBs
A column-oriented DBMS is a database
management system (DBMS) that stores data
tables as sections of columns of data rather than
as rows of data, as most relational DBMSs do.
Row-oriented layout:
1,Smith,Joe,40000;
2,Jones,Mary,50000;
3,Johnson,Cathy,44000;
Column-oriented layout:
1,2,3; Smith,Jones,Johnson;
Joe,Mary,Cathy;
40000,50000,44000;
A column-oriented database differs from a
traditional row-oriented database in
how it stores data. By storing a whole column
together instead of a row, it can minimize disk
access when selecting a few columns from
rows containing many columns. In row-oriented
databases it makes no difference to disk access whether
you select just one field or all fields from a row.
ID   Name     Age
1    Bhavin   29
2    Roger    30
This would be persisted in a conventional RDBMS as
follows -
1,Bhavin,29|2,Roger,30
In a column oriented DBMS this would be persisted as -
1,2|Bhavin,Roger|29,30
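The two layouts above can be reproduced with a minimal Python sketch. The function names are hypothetical and chosen only for illustration; the sketch just joins the same records row by row or column by column, and says nothing about how any real DBMS lays out bytes on disk.

```python
# The small table from the slide: (ID, Name, Age)
rows = [
    (1, "Bhavin", 29),
    (2, "Roger", 30),
]

def serialize_row_oriented(records):
    """Keep each record together: all fields of row 1, then row 2, ..."""
    return "|".join(",".join(str(v) for v in rec) for rec in records)

def serialize_column_oriented(records):
    """Keep each column together: all IDs, then all names, then all ages."""
    columns = zip(*records)  # transpose rows into columns
    return "|".join(",".join(str(v) for v in col) for col in columns)

print(serialize_row_oriented(rows))     # 1,Bhavin,29|2,Roger,30
print(serialize_column_oriented(rows))  # 1,2|Bhavin,Roger|29,30
```

Reading only the ages touches a single contiguous section in the column-oriented string, while the row-oriented string has to be scanned record by record.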
MORE DEPTH OF HBASE CONCEPTS
MODES OF HBASE OPERATION
Standalone:
In standalone mode HBase does not use a
distributed file system; it uses the local
filesystem instead, and all HBase daemons
run inside a single Java VM.
This mode is best suited for testing
and for experimenting with
HBase.
MODES OF HBASE OPERATION…
Pseudo-Distributed Mode
In pseudo-distributed mode, the HBase daemons run as
separate processes on a
single machine. HBase writes all files to the Hadoop
Distributed File System (HDFS), and all services and
daemons communicate over local TCP sockets for
inter-process communication.
Pseudo-distributed mode is simply distributed
mode run on a single host. Use this configuration
for testing and prototyping on HBase. Do not use this
configuration for production or for evaluating
HBase performance.
MODES OF HBASE OPERATION…
Distributed:
In distributed mode the daemons are
spread across all nodes in the cluster.
Distributed mode requires an instance
of the Hadoop Distributed File System
(HDFS).
BASIC PREREQUISITES
Java
SSH
DNS
NTP
ulimit and nproc
Hadoop for Distributed mode
BASIC PREREQUISITES IN DETAIL
Java: Just like Hadoop, HBase requires at least Java 6
from Oracle.
SSH: ssh must be installed and sshd must be running
to use Hadoop's scripts to manage remote
Hadoop and HBase daemons. You must be
able to ssh to all nodes, including your local
node, using passwordless login.
DNS: HBase uses the local hostname to self-report its
IP address. Both forward and reverse DNS
resolution must work.
NTP: The clocks on cluster members should be in basic
alignment. Some skew is tolerable, but wild skew can
generate odd behavior. Run NTP on your cluster, or an
equivalent.
ulimit and nproc: HBase uses a lot of files all at the same time. The default ulimit
-n (i.e. the per-user open file limit) of 1024 on most *nix systems is
insufficient. Any significant amount of loading will lead
you to "java.io.IOException...(Too many open files)".
Hadoop: needed for the distributed file system and for MapReduce
processing.
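The open-file limit mentioned above can be checked before starting HBase. The following is a minimal sketch for Linux and other *nix systems using Python's standard `resource` module. The threshold of 10240 and both function names are assumptions made for illustration; pick whatever minimum your own deployment guidelines call for.

```python
import resource

def limit_is_sufficient(soft_limit, minimum):
    """True if the soft limit is unlimited or at least `minimum`."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= minimum

def open_file_limit_ok(minimum=10240):
    """Check the current process's `ulimit -n` (soft limit).

    The default minimum of 10240 is an assumed value, not taken
    from the slides.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return limit_is_sufficient(soft, minimum)

if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("open files: soft=%s hard=%s" % (soft, hard))
    print("sufficient for HBase (assumed minimum 10240):", open_file_limit_ok())
```

Run it as the same user that will start the HBase daemons, since the limit is per user.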
HBASE CONFIGURATION FILES
hbase-site.xml
hbase-default.xml
hbase-env.sh
log4j.properties
regionservers
HBASE-DEFAULT.XML
Not all configuration options make it out
to hbase-default.xml. Settings that
are rarely expected to
change may exist only in code; the
only way to discover such
settings is to read the
source code itself.
hbase.rootdir
• The directory shared by region servers and into which HBase persists.
• Default: file:///tmp/hbase-${user.name}/hbase
• For distributed mode, e.g.: hdfs://namenode:9000/hbase
hbase.master.port
• The port the HBase Master should bind to. Default: 60000
hbase.cluster.distributed
• The mode the cluster will be in. Possible values are false for standalone
mode and true for distributed mode.
hbase.tmp.dir
• Temporary directory on the local filesystem.
• Default: /tmp/hbase-${user.name}
hbase.regionserver.port
• The port the HBase RegionServer binds to.
• Default: 60020
Many more parameters can be changed to build a more
customized and optimized HBase cluster.
HBASE-SITE.XML
Just as in Hadoop, where you add site-specific
HDFS configuration to the
hdfs-site.xml file, for HBase
site-specific customizations go into the file
conf/hbase-site.xml. For the list of
configurable properties, see hbase-default.xml.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>node1,node2,node3</value>
    <description>Comma-separated list of servers in the ZooKeeper quorum.
    </description>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/export/zookeeper</value>
    <description>Property from ZooKeeper's config zoo.cfg.
    The directory where the snapshot is stored.
    </description>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://node0:8020/hbase</value>
    <description>The directory shared by RegionServers.
    </description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
    false: standalone and pseudo-distributed setups with managed
    ZooKeeper
    true: fully-distributed with unmanaged ZooKeeper Quorum (see hbase-env.sh)
    </description>
  </property>
</configuration>
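A file with this name/value structure can also be generated rather than typed by hand, which avoids malformed tags and misspelled URI schemes. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `build_hbase_site` is hypothetical, and the host names and paths are just the example values from the slide. It omits the XML declaration, the stylesheet line and the descriptions.

```python
import xml.etree.ElementTree as ET

def build_hbase_site(properties):
    """Return hbase-site.xml content for a dict of property name -> value."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

# Example values copied from the slide; adjust for your own cluster.
site_xml = build_hbase_site({
    "hbase.zookeeper.quorum": "node1,node2,node3",
    "hbase.zookeeper.property.dataDir": "/export/zookeeper",
    "hbase.rootdir": "hdfs://node0:8020/hbase",
    "hbase.cluster.distributed": "true",
})
print(site_xml)
```

Because the tree is serialized by the library, every opening tag is guaranteed to have its matching closing tag.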
HBASE-ENV.SH
Set HBase environment variables in this file.
Examples include options to pass to the JVM at
the start of an HBase daemon, such as heap size and
garbage collector settings. You can also set
the locations of the HBase configuration and log
directories, niceness, ssh options, where to put
process pid files, etc. Open the file at conf/hbase-env.sh
and peruse its content. Each option is fairly
well documented. Add your own environment
variables here if you want them read by HBase
daemons on startup.
export JAVA_HOME=/usr/lib/jvm/java-6-sun/
export HBASE_CLASSPATH=
export HBASE_HEAPSIZE=1000
LOG4J.PROPERTIES
Edit this file to change the rate at which
HBase log files are rolled and to
change the level at which HBase
logs messages.
Changes here require a cluster
restart for HBase to notice them.
REGIONSERVERS
In this file you list the nodes that will run
RegionServers, one host per line.
E.g.:
regionservernode1
regionservernode2
regionservernode3
CONFIGURATIONS
Required Configurations
• Java
• SSH
• DNS
• NTP
• ulimit and nproc
• Hadoop for Distributed mode
Recommended Configurations
• zookeeper.session.timeout
  The default timeout is three minutes (specified in milliseconds). This means
  that if a server crashes, it will be three minutes before the Master notices the
  crash and starts recovery.
• Number of ZooKeeper Instances
• Compression
• Bigger Regions
• Balancer
  The balancer is a periodic operation which is run on the Master to redistribute
  regions on the cluster. It is configured via hbase.balancer.period and defaults to
  300000 (milliseconds, i.e. 5 minutes).
There are more settings than these; read further to get a better-tuned cluster.
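Both values above are given in milliseconds, which is easy to misread. A small arithmetic sketch follows; the helper name is hypothetical, and 180000 is simply three minutes written out in milliseconds.

```python
def ms_to_minutes(milliseconds):
    """Convert a millisecond setting into minutes."""
    return milliseconds / 60_000

session_timeout_ms = 3 * 60 * 1000  # three minutes = 180000 ms
balancer_period_ms = 300000         # default quoted for hbase.balancer.period

print(ms_to_minutes(session_timeout_ms))  # 3.0
print(ms_to_minutes(balancer_period_ms))  # 5.0
```

Lowering the session timeout makes the Master notice a crashed server sooner, at the cost of tolerating shorter pauses on a healthy server.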
GETTING STARTED WITH HBASE
Start HBase:
$ ./bin/start-hbase.sh
starting Master, logging to logs/hbase-user-master-example.org.out
Connect to your running HBase via the
shell:
$ ./bin/hbase shell
HBase Shell
hbase(main):001:0>
In this shell you can type the shell
commands that HBase provides to
perform various operations.
I am here:
dwivedishashwat@gmail.com Twitter: shashwat_2010
Facebook: shriparv@gmail.com Skype: shriparv
Search, read, research and share to build a better
understanding.
http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf
https://cs.uwaterloo.ca/~aelhelw/papers/dolap11.pdf
https://ccp.cloudera.com/download/attachments/14549380/CDH2_Installation_Guide.pdf?version=1&modificationDate=1348871340000
http://hbase.apache.org/
https://ccp.cloudera.com/display/CDHDOC/HBase+Installation
Last and the best one:
www.google.com :)
NEED MORE CLARIFICATION ON QUERIES?
