Hortonworks Technical Workshop: HBase and Apache Phoenix 1. Page 1 © Hortonworks Inc. 2014
SQL on HBase with Phoenix
2. Agenda
What Is Apache HBase
• High Level Overview.
• Technical Detail.
What Is Apache Phoenix
• Overview.
• What’s New.
• Secondary Index Demo.
3. New Data Requires a New Data Architecture
Source: IDC
• 2.8 ZB of data in 2012; 85% from new data types.
• 15x machine data by 2020; 40 ZB total by 2020.
• Data types: OLTP, ERP, CRM systems; unstructured documents, emails; clickstream; server logs; sentiment, web data; sensor, machine data; geolocation.
• A modern database needs to be more scalable, handle new data types, and be intelligent and predictive.
4. What Is Apache HBase?
• 100% Open Source.
• Store and process petabytes of data.
• Flexible schema.
• Scale out on commodity servers.
• High performance, high availability.
• Integrated with YARN.
• SQL and NoSQL interfaces.
[Diagram: HBase RegionServers 1 through N run on YARN, the data operating system, with HDFS as permanent data storage.]
• Dynamic schema.
• Scales horizontally to PB of data.
• Directly integrated with Hadoop.
5. Kinds of Apps Built with HBase
Interested? See HBase Case Studies later in this document.
• Write-heavy, low-latency apps.
• Search / indexing.
• Messaging.
• Audit / log archive.
• Advertising.
• Data cubes.
• Time series.
• Sensor / device data.
6. HBase is Deeply Integrated with Hadoop
• Data is stored in HDFS. You can store more data and re-use existing HDFS expertise.
• HBase is integrated with YARN.
• Analytics in-place using Hive, Pig, Spark and more.
7. Who’s Using HBase?
8. HBase Technical Details
Spring 2014
Version 1.0
9. HBase Technical Details
Based on Google BigTable
• Dynamic schema.
• Good for very sparse datasets.
• All data is range-partitioned for trivial horizontal scaling across commodity hardware.
Directly integrated with HDFS and Hadoop
• Analyze data in HBase with any Hadoop ecosystem tools (Hive, Pig, MapReduce, Tez, etc.)
• Re-use existing Hadoop skills to run HBase.
11. Logical Architecture
Distributed, persistent partitions of a BigTable
[Diagram: Table A, whose rowkeys a through p are range-partitioned into Regions 1-4, with regions spread across Region Servers:
Region Server 7: Table A Region 1; Table A Region 2; Table G Region 1070; Table L Region 25.
Region Server 86: Table A Region 3; Table C Region 30; Table F Region 160; Table F Region 776.
Region Server 367: Table A Region 4; Table C Region 17; Table E Region 52; Table P Region 1116.]
Legend:
- A single table is partitioned into Regions of roughly equal size.
- Regions are assigned to Region Servers across the cluster.
- Region Servers host roughly the same number of regions.
12. Logical Data Model
A sparse, multi-dimensional, sorted map
Legend:
- Rows are sorted by rowkey.
- Within a row, values are located by column family and qualifier.
- Values also carry a timestamp; there can be multiple versions of a value.
- Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.
Table A (rowkey, column family, column qualifier, timestamp, value):

rowkey  family  qualifier     timestamp   value
a       cf1     "bar"         1368394583  7
a       cf1     "bar"         1368394261  "hello"
a       cf1     "foo"         1368394583  22
a       cf1     "foo"         1368394925  13.6
a       cf1     "foo"         1368393847  "world"
b       cf2     "thumb"       1368387247  [3.6 kb png data]
b       cf2     1.0001        1368387684  "almost the loneliest number"
b       cf2     "2011-07-04"  1368396302  "fourth of July"
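This map structure can be exercised directly from Java with the HBase client API. A minimal sketch, assuming a running cluster and the hbase-client jar on the classpath; the table, family, and qualifier names mirror the example above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class SparseMapExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("A"))) {
      // Write: rowkey "a", family cf1, qualifier "foo", with an explicit timestamp.
      Put put = new Put(Bytes.toBytes("a"));
      put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("foo"),
                    1368394925L, Bytes.toBytes("13.6"));
      table.put(put);
      // Read: a value is located by (rowkey, family, qualifier);
      // by default the newest version (highest timestamp) is returned.
      Get get = new Get(Bytes.toBytes("a"));
      get.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("foo"));
      Result r = table.get(get);
      System.out.println(Bytes.toString(
          r.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("foo"))));
    }
  }
}
```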
13. HBase HA Overview (Introduced in HDP 2.1)
[Diagram: Clients connect through ZooKeeper and the HMaster to two HBase RegionServers. Regions 0-99, 100-199, and 200-299 each have a Primary on one RegionServer and a Standby on the other, with HFiles persisted in HDFS.]
• HBase HA: real-time replication between primary and standby regions.
• Low-latency reads and writes, served from in-memory caches.
• Data stored to HDFS; read or write directly from Hadoop tools (Hive, Pig, MapReduce).
• Cluster topology, data placement.
14. Apache Phoenix
Spring 2014
Version 1.0
The SQL Skin for HBase
15. Apache Phoenix
A SQL Skin for HBase
• Provides a SQL interface for managing data in HBase.
• Supports a large subset of the SQL:1999 mandatory feature set.
• Create tables, insert and update data and perform low-latency point lookups through JDBC.
• Phoenix JDBC driver easily embeddable in any app that supports JDBC.
Phoenix Makes HBase Better
• Oriented toward online / semi-transactional apps.
• If HBase is a good fit for your app, Phoenix makes it even better.
• Phoenix gets you out of the “one table per query” model many other NoSQL stores force you into.
16. Apache Phoenix: Current Capabilities
Feature Supported?
Common SQL Datatypes Yes
Inserts and Updates Yes
SELECT, DISTINCT, GROUP BY, HAVING Yes
NOT NULL and Primary Key constraints Yes
Inner and Outer JOINs Yes
Views Yes
Subqueries HDP 2.2
Robust Secondary Indexes HDP 2.2
17. Apache Phoenix: Future Capabilities
Feature Supported?
Multi-Table Transactions Future
Scalable Joins (Fact-to-Fact) Future
Analytics, Windowing Functions Future
18. Phoenix Provides Familiar SQL Constructs
Compare: Phoenix versus Native API
Code Notes
// HBase Native API.
HBaseAdmin hbase = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("us_population");
HColumnDescriptor state = new HColumnDescriptor("state".getBytes());
HColumnDescriptor city = new HColumnDescriptor("city".getBytes());
HColumnDescriptor population = new HColumnDescriptor("population".getBytes());
desc.addFamily(state);
desc.addFamily(city);
desc.addFamily(population);
hbase.createTable(desc);

-- Phoenix DDL.
CREATE TABLE us_population (
  state CHAR(2) NOT NULL,
  city VARCHAR NOT NULL,
  population BIGINT
  CONSTRAINT my_pk PRIMARY KEY (state, city));
• Familiar SQL syntax.
• Provides additional constraint
checking.
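Because Phoenix is exposed through JDBC, the DDL above can be driven from plain Java. A sketch under assumptions (the Phoenix client jar is on the classpath and ZooKeeper runs at localhost:2181; the sample row is hypothetical):

```java
import java.sql.*;

public class PhoenixJdbcExample {
  public static void main(String[] args) throws Exception {
    // The Phoenix JDBC URL names the ZooKeeper quorum.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181");
         Statement stmt = conn.createStatement()) {
      // Phoenix uses UPSERT in place of INSERT/UPDATE.
      stmt.executeUpdate(
          "UPSERT INTO us_population VALUES ('CA', 'San Jose', 945942)");
      conn.commit();  // Phoenix connections do not auto-commit by default.
      // Low-latency point lookup through the primary key.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT city, population FROM us_population WHERE state = 'CA'")) {
        while (rs.next()) {
          System.out.println(rs.getString("city") + ": " + rs.getLong("population"));
        }
      }
    }
  }
}
```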
19. Phoenix: Architecture
[Diagram: Java Application → Phoenix JDBC Driver → HBase Cluster, with a Phoenix Coprocessor running inside each RegionServer.]
20. Phoenix Performance
Phoenix Performance Characterization:
• Suitable for 10s of thousands of point-lookups per second.
• Suitable for thousands of aggregations / filtered searches per second.
• Supports extremely high concurrency.
Phoenix Performance Optimizations
• Column skipping.
• Table salting.
• Skip scans.
Performance characteristics:
• Index point lookups in milliseconds.
• Aggregation and Top-N queries in a few seconds over large datasets.
21. Phoenix Use Cases
Phoenix is for:
• Rapidly and easily building an application backed by HBase.
• Making use of your existing SQL skills and investment.
• High performing aggregations of moderately-sized datasets inside HBase.
Phoenix is not for:
• Sophisticated SQL queries involving large joins or advanced SQL features.
• Queries requiring large scans that do not use indexes.
• ETL.
22. Phoenix: Futures
Short-term focus:
• Transactions.
• Scalable joins.
• Analytical capabilities.
Long-term focus: Primary interface for HBase.
• Build HBase applications using Phoenix.
• Configure cluster security and replication using Phoenix.
• Integration with BI tools like Microstrategy.
23. What’s New in Apache Phoenix
24. What’s New in Apache Phoenix
Phoenix in HDP 2.2
• Based on Apache Phoenix 4.2.
• 8 new features, 143 total improvements and fixes.
Notable new features.
• Robust secondary indexes.
• Sub-joins.
• Basic window functions.
• Bulk loader improvements.
25. Robust Secondary Index
Background / Refresher
• Phoenix supports local and global secondary indexes.
• Updating a global index may require coordination with another RegionServer.
• See Phoenix docs if you need info on which to use when.
Before Phoenix 4.1 (HDP 2.1):
• With global indexes, if the RegionServer serving the index key was down, the RegionServers attempting the index update would abort.
• Note: Does not affect local indexes.
Phoenix 4.1+:
• If the global index cannot be updated:
• The index is temporarily disabled.
• Background job is launched to rebuild the index.
• Reads will go directly to base tables rather than accessing the index.
• Writes will continue to update the index.
• Controlled by: phoenix.index.failure.handling.rebuild
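The rebuild behavior is toggled through that property in hbase-site.xml on the region servers. A sketch of the entry; the value shown is an assumption, so check the Phoenix docs for your version's default:

```xml
<property>
  <!-- Re-enable and rebuild a disabled global index in the background. -->
  <name>phoenix.index.failure.handling.rebuild</name>
  <value>true</value>
</property>
```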
26. Improved SQL: Sub Joins
Example:
select * from A
  left join (B join C on B.bc_id = C.bc_id)
  on A.ab_id = B.ab_id and A.ac_id = C.ac_id;
Caveats related to joins still apply:
• Still broadcast joins only.
27. Phoenix: Basic Window Functions
FIRST_VALUE, LAST_VALUE, NTH_VALUE
• No OVER or PARTITION BY.
• Function applied to each group based on GROUP BY.
Example:
SELECT FIRST_VALUE("column1") WITHIN GROUP (ORDER BY column2 ASC)
FROM table GROUP BY column3;
28. ENCODE, DECODE
DECODE
• Supports hexadecimal format.
DECODE('000000008512af277ffffff8', 'hex')
ENCODE
• Supports hexadecimal and Base62.
ENCODE(1, 'base62')
What is Base62?
• Used to encode data using only letters and numbers.
• Commonly used for things like URL shorteners.
29. Demo
Phoenix Secondary Indexes
30. Secondary Index Recap
Index Management via JDBC:
• CREATE INDEX my_index ON my_table (v1);
• DROP INDEX my_index ON my_table;
• ALTER INDEX my_index ON my_table DISABLE / REBUILD;
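These management statements can be issued through the Phoenix JDBC driver like any other SQL. A minimal sketch, assuming a reachable cluster; the table, column, and index names are the illustrative ones above:

```java
import java.sql.*;

public class IndexManagementExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181");
         Statement stmt = conn.createStatement()) {
      // Create a secondary index on column v1.
      stmt.execute("CREATE INDEX my_index ON my_table (v1)");
      // EXPLAIN shows whether the optimizer will serve a query from the index.
      try (ResultSet rs = stmt.executeQuery(
          "EXPLAIN SELECT v1 FROM my_table WHERE v1 = 'x'")) {
        while (rs.next()) {
          System.out.println(rs.getString(1));
        }
      }
    }
  }
}
```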
Index population during bulk import:
• Uses the CsvBulkLoadTool utility (not psql.py).
• Adds the --index-table argument to specify your target index.
HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar phoenix-4.0.0.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv