Hortonworks Technical Workshop: HBase and Apache Phoenix 1. Page 1 © Hortonworks Inc. 2014
SQL on HBase with Phoenix
2. Agenda
What Is Apache HBase
• High Level Overview.
• Technical Detail.
What Is Apache Phoenix
• Overview.
• What’s New.
• Secondary Index Demo.
3. New Data Requires a New Data Architecture
Source: IDC
• 2.8 ZB of data in 2012; 85% from new data types.
• 15x machine data by 2020; 40 ZB total by 2020.
• Data types: OLTP, ERP, CRM systems; unstructured documents, emails; clickstream; server logs; sentiment, web data; sensor, machine data; geolocation.
• A modern database needs to be more scalable, handle new data types, and be intelligent and predictive.
4. What Is Apache HBase?
• 100% Open Source.
• Store and process petabytes of data.
• Flexible schema.
• Scale out on commodity servers.
• High performance, high availability.
• Integrated with YARN.
• SQL and NoSQL interfaces.
[Diagram: HBase RegionServers 1 through N run on YARN, the data operating system, with HDFS as permanent data storage.]
• Dynamic schema.
• Scales horizontally to PB of data.
• Directly integrated with Hadoop.
5. Kinds of Apps Built with HBase
Interested? See HBase Case Studies later in this document.
• Write-heavy, low-latency apps.
• Search / indexing.
• Messaging.
• Audit / log archive.
• Advertising.
• Data cubes.
• Time series.
• Sensor / device data.
6. HBase is Deeply Integrated with Hadoop
• Data is stored in HDFS. You can store more data and re-use existing HDFS expertise.
• HBase is integrated with YARN.
• Analytics in-place using Hive, Pig, Spark and more.
7. Who’s Using HBase?
8. HBase Technical Details
Spring 2014
Version 1.0
9. HBase Technical Details
Based on Google BigTable
• Dynamic schema.
• Good for very sparse datasets.
• All data is range-partitioned for trivial horizontal scaling across commodity hardware.
Directly integrated with HDFS and Hadoop
• Analyze data in HBase with any Hadoop ecosystem tools (Hive, Pig, MapReduce, Tez, etc.)
• Re-use existing Hadoop skills to run HBase.
11. Logical Architecture
Distributed, persistent partitions of a BigTable
[Diagram: Table A, whose rowkeys a through p are range-partitioned into Regions 1-4, with regions spread across Region Servers:
Region Server 7: Table A Region 1; Table A Region 2; Table G Region 1070; Table L Region 25.
Region Server 86: Table A Region 3; Table C Region 30; Table F Region 160; Table F Region 776.
Region Server 367: Table A Region 4; Table C Region 17; Table E Region 52; Table P Region 1116.]
Legend:
- A single table is partitioned into Regions of roughly equal size.
- Regions are assigned to Region Servers across the cluster.
- Region Servers host roughly the same number of regions.
12. Logical Data Model
A sparse, multi-dimensional, sorted map
Legend:
- Rows are sorted by rowkey.
- Within a row, values are located by column family and qualifier.
- Values also carry a timestamp; there can be multiple versions of a value.
- Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.
Table A (rowkey, column family, column qualifier, timestamp, value):

rowkey  family  qualifier     timestamp   value
a       cf1     "bar"         1368394583  7
a       cf1     "bar"         1368394261  "hello"
a       cf1     "foo"         1368394583  22
a       cf1     "foo"         1368394925  13.6
a       cf1     "foo"         1368393847  "world"
b       cf2     "thumb"       1368387247  [3.6 kb png data]
b       cf2     1.0001        1368387684  "almost the loneliest number"
b       cf2     "2011-07-04"  1368396302  "fourth of July"
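This map structure can be exercised directly from Java with the HBase client API. A minimal sketch, assuming a running cluster and the hbase-client jar on the classpath; the table, family, and qualifier names mirror the example above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class SparseMapExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("A"))) {
      // Write: rowkey "a", family cf1, qualifier "foo", with an explicit timestamp.
      Put put = new Put(Bytes.toBytes("a"));
      put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("foo"),
                    1368394925L, Bytes.toBytes("13.6"));
      table.put(put);
      // Read: a value is located by (rowkey, family, qualifier);
      // by default the newest version (highest timestamp) is returned.
      Get get = new Get(Bytes.toBytes("a"));
      get.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("foo"));
      Result r = table.get(get);
      System.out.println(Bytes.toString(
          r.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("foo"))));
    }
  }
}
```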
13. HBase HA Overview (Introduced in HDP 2.1)
[Diagram: Clients connect through ZooKeeper and the HMaster to two HBase RegionServers. Regions 0-99, 100-199, and 200-299 each have a Primary on one RegionServer and a Standby on the other, with HFiles persisted in HDFS.]
• HBase HA: real-time replication between primary and standby regions.
• Low-latency reads and writes, served from in-memory caches.
• Data stored to HDFS; read or write directly from Hadoop tools (Hive, Pig, MapReduce).
• Cluster topology, data placement.
14. Apache Phoenix
Spring 2014
Version 1.0
The SQL Skin for HBase
15. Apache Phoenix
A SQL Skin for HBase
• Provides a SQL interface for managing data in HBase.
• Supports a large subset of the SQL:1999 mandatory feature set.
• Create tables, insert and update data and perform low-latency point lookups through JDBC.
• Phoenix JDBC driver easily embeddable in any app that supports JDBC.
Phoenix Makes HBase Better
• Oriented toward online / semi-transactional apps.
• If HBase is a good fit for your app, Phoenix makes it even better.
• Phoenix gets you out of the “one table per query” model many other NoSQL stores force you into.
16. Apache Phoenix: Current Capabilities
Feature Supported?
Common SQL Datatypes Yes
Inserts and Updates Yes
SELECT, DISTINCT, GROUP BY, HAVING Yes
NOT NULL and Primary Key constraints Yes
Inner and Outer JOINs Yes
Views Yes
Subqueries HDP 2.2
Robust Secondary Indexes HDP 2.2
17. Apache Phoenix: Future Capabilities
Feature Supported?
Multi-Table Transactions Future
Scalable Joins (Fact-to-Fact) Future
Analytics, Windowing Functions Future
18. Phoenix Provides Familiar SQL Constructs
Compare: Phoenix versus Native API
Code Notes
// HBase Native API.
HBaseAdmin hbase = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("us_population");
HColumnDescriptor state = new HColumnDescriptor("state".getBytes());
HColumnDescriptor city = new HColumnDescriptor("city".getBytes());
HColumnDescriptor population = new HColumnDescriptor("population".getBytes());
desc.addFamily(state);
desc.addFamily(city);
desc.addFamily(population);
hbase.createTable(desc);

-- Phoenix DDL.
CREATE TABLE us_population (
  state CHAR(2) NOT NULL,
  city VARCHAR NOT NULL,
  population BIGINT
  CONSTRAINT my_pk PRIMARY KEY (state, city));
• Familiar SQL syntax.
• Provides additional constraint
checking.
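Because Phoenix is exposed through JDBC, the DDL above can be driven from plain Java. A sketch under assumptions (the Phoenix client jar is on the classpath and ZooKeeper runs at localhost:2181; the sample row is hypothetical):

```java
import java.sql.*;

public class PhoenixJdbcExample {
  public static void main(String[] args) throws Exception {
    // The Phoenix JDBC URL names the ZooKeeper quorum.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181");
         Statement stmt = conn.createStatement()) {
      // Phoenix uses UPSERT in place of INSERT/UPDATE.
      stmt.executeUpdate(
          "UPSERT INTO us_population VALUES ('CA', 'San Jose', 945942)");
      conn.commit();  // Phoenix connections do not auto-commit by default.
      // Low-latency point lookup through the primary key.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT city, population FROM us_population WHERE state = 'CA'")) {
        while (rs.next()) {
          System.out.println(rs.getString("city") + ": " + rs.getLong("population"));
        }
      }
    }
  }
}
```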
19. Phoenix: Architecture
[Diagram: Java Application → Phoenix JDBC Driver → HBase Cluster, with a Phoenix Coprocessor running inside each RegionServer.]
20. Phoenix Performance
Phoenix Performance Characterization:
• Suitable for 10s of thousands of point-lookups per second.
• Suitable for thousands of aggregations / filtered searches per second.
• Supports extremely high concurrency.
Phoenix Performance Optimizations
• Column skipping.
• Table salting.
• Skip scans.
Performance characteristics:
• Index point lookups in milliseconds.
• Aggregation and Top-N queries in a few seconds over large datasets.
21. Phoenix Use Cases
Phoenix is for:
• Rapidly and easily building an application backed by HBase.
• Making use of your existing SQL skills and investment.
• High performing aggregations of moderately-sized datasets inside HBase.
Phoenix is not for:
• Sophisticated SQL queries involving large joins or advanced SQL features.
• Queries requiring large scans that do not use indexes.
• ETL.
22. Phoenix: Futures
Short-term focus:
• Transactions.
• Scalable joins.
• Analytical capabilities.
Long-term focus: Primary interface for HBase.
• Build HBase applications using Phoenix.
• Configure cluster security and replication using Phoenix.
• Integration with BI tools like Microstrategy.
23. What’s New in Apache Phoenix
24. What’s New in Apache Phoenix
Phoenix in HDP 2.2
• Based on Apache Phoenix 4.2.
• 8 new features, 143 total improvements and fixes.
Notable new features.
• Robust secondary indexes.
• Sub-joins.
• Basic window functions.
• Bulk loader improvements.
25. Robust Secondary Index
Background / Refresher
• Phoenix supports local and global secondary indexes.
• Updating a global index may require coordination with another RegionServer.
• See Phoenix docs if you need info on which to use when.
Before Phoenix 4.1 (HDP 2.1):
• With global indexes, if the RegionServer serving the index key was down, the RegionServers attempting the index update would abort.
• Note: Does not affect local indexes.
Phoenix 4.1+:
• If the global index cannot be updated:
• The index is temporarily disabled.
• Background job is launched to rebuild the index.
• Reads will go directly to base tables rather than accessing the index.
• Writes will continue to update the index.
• Controlled by: phoenix.index.failure.handling.rebuild
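The rebuild behavior is toggled through that property in hbase-site.xml on the region servers. A sketch of the entry; the value shown is an assumption, so check the Phoenix docs for your version's default:

```xml
<property>
  <!-- Re-enable and rebuild a disabled global index in the background. -->
  <name>phoenix.index.failure.handling.rebuild</name>
  <value>true</value>
</property>
```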
26. Improved SQL: Sub Joins
Example:
select * from A
  left join (B join C on B.bc_id = C.bc_id)
  on A.ab_id = B.ab_id and A.ac_id = C.ac_id;
Caveats related to joins still apply:
• Still broadcast joins only.
27. Phoenix: Basic Window Functions
FIRST_VALUE, LAST_VALUE, NTH_VALUE
• No OVER or PARTITION BY.
• Function applied to each group based on GROUP BY.
Example:
SELECT FIRST_VALUE("column1") WITHIN GROUP (ORDER BY column2 ASC)
FROM table GROUP BY column3;
28. ENCODE, DECODE
DECODE
• Supports hexadecimal format.
DECODE('000000008512af277ffffff8', 'hex')
ENCODE
• Supports hexadecimal and Base62.
ENCODE(1, 'base62')
What is Base62?
• Used to encode data using only letters and numbers.
• Commonly used for things like URL shorteners.
29. Demo
Phoenix Secondary Indexes
30. Secondary Index Recap
Index Management via JDBC:
• CREATE INDEX my_index ON my_table (v1);
• DROP INDEX my_index ON my_table;
• ALTER INDEX my_index ON my_table DISABLE / REBUILD;
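These management statements can be issued through the Phoenix JDBC driver like any other SQL. A minimal sketch, assuming a reachable cluster; the table, column, and index names are the illustrative ones above:

```java
import java.sql.*;

public class IndexManagementExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181");
         Statement stmt = conn.createStatement()) {
      // Create a secondary index on column v1.
      stmt.execute("CREATE INDEX my_index ON my_table (v1)");
      // EXPLAIN shows whether the optimizer will serve a query from the index.
      try (ResultSet rs = stmt.executeQuery(
          "EXPLAIN SELECT v1 FROM my_table WHERE v1 = 'x'")) {
        while (rs.next()) {
          System.out.println(rs.getString(1));
        }
      }
    }
  }
}
```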
Index population during bulk import:
• Uses the CsvBulkLoadTool utility (not psql.py).
• Adds the --index-table argument to specify your target index.
HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar phoenix-4.0.0.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv