Apache HBase – Where we've been and what's upcoming

Jonathan Hsieh | @jmhsieh
Tech Lead / Software Engineer at Cloudera | HBase PMC Member

Hadoop Users Group UK
April 10, 2014
Who Am I?

•  Cloudera:
   •  Tech Lead, HBase Team
   •  Software Engineer
   •  Apache HBase committer / PMC
   •  Apache Flume founder / PMC
•  U of Washington:
   •  Research in Distributed Systems
What is Apache HBase?

Apache HBase is a reliable, column-oriented data store that provides consistent, low-latency, random read/write access.

(Diagram: applications and MapReduce jobs talking to HBase, which sits on ZooKeeper and HDFS.)
HBase Provides Low-latency Random Access

•  Writes:
   •  1-3 ms; 1k-20k writes/sec per node
•  Reads:
   •  0-3 ms cached, 10-30 ms from disk
   •  10k-40k reads/sec per node from cache
•  Cell size:
   •  0 B - 3 MB
•  Read, write, and insert data anywhere in the table (see the client sketch below)
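To make the random read/write path concrete, here is a minimal sketch using the 0.96-era Java client; the table name, row key, family, and column are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomAccessSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    HTable table = new HTable(conf, "usertable");       // hypothetical table with family "d"
    try {
      // Random write: lands in the WAL and memstore of whichever region owns the row.
      Put put = new Put(Bytes.toBytes("user-00000042"));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
      table.put(put);

      // Random read of the same row, typically a few milliseconds when cached.
      Result result = table.get(new Get(Bytes.toBytes("user-00000042")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("d"), Bytes.toBytes("name"))));
    } finally {
      table.close();
    }
  }
}
```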
(Diagram: a table's sorted keyspace, e.g. rows 0000000000 through 7777777777, split into regions 1-5.)

Core Properties

•  ACID guarantees on a row
   •  Writes are durable
   •  Strong consistency first, then availability
   •  After a failure, recover and return the current value instead of returning a stale value
   •  CAS and atomic increments can be efficient (see the sketch below)
•  Sorted by primary key
   •  Short scans are efficient
•  Partitioned by primary key
•  Log-structured merge tree
   •  Writes are extremely efficient
   •  Reads are efficient
   •  Periodic layout optimizations ("compactions") are required to keep reads fast
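A short sketch of the row-level CAS and atomic increment operations mentioned above, again using the 0.96-era client API; the table and column names are made up.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RowAtomicitySketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "accounts");  // hypothetical table
    try {
      byte[] row = Bytes.toBytes("alice");
      byte[] cf  = Bytes.toBytes("d");

      // Atomic increment, applied by the region server that owns the row.
      long logins = table.incrementColumnValue(row, cf, Bytes.toBytes("logins"), 1L);

      // Compare-and-swap: apply the Put only if "status" still equals "open".
      Put close = new Put(row);
      close.add(cf, Bytes.toBytes("status"), Bytes.toBytes("closed"));
      boolean applied = table.checkAndPut(row, cf, Bytes.toBytes("status"),
          Bytes.toBytes("open"), close);

      System.out.println("logins=" + logins + " cas-applied=" + applied);
    } finally {
      table.close();
    }
  }
}
```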
An HBase History
Where We've Been
Apache HBase Timeline
•  Nov '06: Google BigTable paper at OSDI '06
•  Apr '07: First Apache HBase commit, as a Hadoop contrib project
•  Jan '08: Promoted to Hadoop subproject
•  Summer '09: StumbleUpon goes production on HBase ~0.20
•  Apr '10: Apache HBase becomes a top-level project
•  Apr '11: CDH3 GA with HBase 0.90.1
•  Summer '11: Messages on HBase; Web Crawl Cache
•  Nov '11: Cassini on HBase
•  Jan '12: 0.92.0
•  May '12: 0.94.0; HBaseCon 2012
•  Jan '13: Phoenix on HBase
•  Jun '13: HBaseCon 2013
•  Oct '13: 0.96.0
•  Feb '14: 0.98.0

Developer Community

•  Active community!
•  Diverse committers from many organizations
Apache HBase "Nascar" Slide
Apache HBase Core Development

•  Vendors
•  Self service

Apache HBase Sample Users
•  Inbox
•  Storage
•  Web
•  Search
•  Analytics
•  Monitoring

Apache HBase Ecosystem Projects
What's Here and New Today?
Today: Apache 0.96.2 / 0.98.1
Critical Features

Disaster Recovery
•  Cluster replication
•  Table snapshots (see the sketch below)
•  CopyTable
•  Import / Export of tables
•  Metadata corruption repair tool (hbck)

Administrative and Continuity
•  Kerberos-based authentication
•  ACL-based authorization
•  Config change via rolling restart
•  Within-version rolling upgrade
•  Protobuf-based wire protocol for RPC future-proofing
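As an illustration of the snapshot-based disaster-recovery tooling, a sketch using the era's HBaseAdmin API; the table and snapshot names are made up, and the String-based overloads are assumed here.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SnapshotSketch {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    try {
      // Point-in-time snapshot of an online table.
      admin.snapshot("orders_snap_20140410", "orders");

      // Materialize the snapshot as a separate table, e.g. for recovery or testing.
      admin.cloneSnapshot("orders_snap_20140410", "orders_restored");
    } finally {
      admin.close();
    }
  }
}
```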
Hardened for 0.96

Table Administration
•  Online schema change
•  Online region merging
•  Continuous fault-injection testing with "Chaos Monkey"

Performance Tuning
•  Alternate key encodings for efficient memory usage (see the sketch below)
•  Exploring compaction policy minimizes compaction storms
•  Smart, adaptive stochastic region load balancer
•  Fast split policy for new tables
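For the alternate key encodings, here is a hedged sketch of switching an existing column family to FAST_DIFF encoding via an online schema change; the table and family names are hypothetical, and the Admin overloads assumed are the 0.94/0.96-era ones.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

public class KeyEncodingSketch {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    try {
      // Re-declare the "d" family of a hypothetical table with FAST_DIFF key encoding;
      // online schema change rolls the new descriptor out without disabling the table.
      HColumnDescriptor family = new HColumnDescriptor("d");
      family.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);
      admin.modifyColumn("webtable", family);
    } finally {
      admin.close();
    }
  }
}
```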
MR over Table Snapshots (0.98, CDH5.0)

•  Previously, MapReduce jobs over HBase required an online full-table scan
•  Idea: take a snapshot and run the MR job over the snapshot files (see the sketch below)
   •  Doesn't use the HBase client
   •  Avoids affecting HBase caches
•  3-5x performance boost
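A sketch of wiring a job to a snapshot through TableMapReduceUtil, as the 0.98/CDH5 API looks to me; the snapshot name, scratch directory, and mapper are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SnapshotScanJob {

  // Receives rows rebuilt from the snapshot's HFiles, not from region servers.
  public static class CountingMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx) {
      ctx.getCounter("snapshot", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan-from-snapshot");
    job.setJarByClass(SnapshotScanJob.class);

    TableMapReduceUtil.initTableSnapshotMapperJob(
        "orders_snap",                       // existing snapshot (hypothetical)
        new Scan(),                          // which rows/columns to read
        CountingMapper.class,
        NullWritable.class, NullWritable.class,
        job,
        true,                                // ship HBase jars with the job
        new Path("/tmp/snapshot-restore"));  // scratch dir for the restored snapshot layout

    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```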
(Diagram: map and reduce tasks reading directly from the snapshot files instead of from the live table.)

Mean Time to Recovery (MTTR)

•  Machine failures happen in distributed systems
•  MTTR: the average unavailability when automatically recovering from a failure
•  Recovery time for an unclean data-center power cycle
(Recovery timeline: detect, repair, notify, recovered. The region is unavailable, then available with clients unaware, then available with clients aware.)

Fast Notification and Detection (0.96)

•  Proactive notification of HMaster failure (0.96)
•  Proactive notification of region server failure (0.96)
•  Notify clients on recovery (0.96)
•  Fast server failover (hardware)
(Recovery timeline: detect, split, assign, replay, recovered; HDFS is involved at each stage, and the region only becomes available for reads and writes at the end.)

Distributed Log Replay (experimental in 0.96)

•  Previously, recovery had two IO-intensive passes:
   •  Log splitting to intermediate files
   •  Assign and log replay
•  Now just one IO-heavy pass: assign first, then split+replay
•  Improves read and write recovery times
•  Off by default currently* (see the configuration note below)
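Since the feature is off by default, enabling it is a configuration change. The property name below is my best recollection of the 0.96 switch and should be checked against your release; in practice it belongs in hbase-site.xml on the master and region servers, and is shown programmatically only to make the knob concrete.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class DistributedLogReplaySwitch {
  public static void main(String[] args) {
    // Assumed property name for the 0.96-era distributed log replay switch.
    Configuration conf = HBaseConfiguration.create();
    conf.setBoolean("hbase.master.distributed.log.replay", true);
    System.out.println(conf.get("hbase.master.distributed.log.replay"));
  }
}
```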
(Recovery timeline with distributed log replay: detect, assign, split + replay, recovered; the region becomes available for replay writes early, and for full read/write once replay finishes.)

*Caveat: if you override timestamps you could see READ REPEATED isolation violations (use tags to fix this).

Cell Tags (0.98, experimental)

•  Mechanism for attaching arbitrary metadata to cells
•  Motivation: finer-grained isolation
•  Used for Accumulo-style cell-level visibility (see the sketch below)
   •  Main feature for 0.98
•  Other uses:
   •  Add sequence numbers to enable correct fast read/write recovery
   •  Potential for schema tags
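One concrete use of tags in 0.98 is cell-level visibility. The sketch below assumes the VisibilityController coprocessor is enabled and that the labels have already been defined by an administrator; the table, column, and label names are hypothetical.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.security.visibility.Authorizations;
import org.apache.hadoop.hbase.security.visibility.CellVisibility;
import org.apache.hadoop.hbase.util.Bytes;

public class CellVisibilitySketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "records");  // hypothetical table

    // The visibility expression travels with the cell as a tag.
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("note"), Bytes.toBytes("sensitive value"));
    put.setCellVisibility(new CellVisibility("secret & clinician"));
    table.put(put);

    // Only readers presenting matching authorizations see the cell.
    Scan scan = new Scan();
    scan.setAuthorizations(new Authorizations("secret", "clinician"));
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      System.out.println(r);
    }
    scanner.close();
    table.close();
  }
}
```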
HTrace (0.96, experimental)

•  Problem: where is time being spent inside HBase?
•  Solution: the HTrace framework
   •  Inspired by Google's Dapper
   •  Threaded through HBase and HDFS
   •  Tracks time spent in calls in a distributed system by tracking spans* on different machines (see the sketch below)

*Some assembly still required.
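A sketch of wrapping a client call in a trace span. HBase 0.96 bundled a pre-Apache HTrace (the org.cloudera.htrace packages); the class and method names below are from memory of that era's API and may differ in your build, and the table/span names are made up.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;
import org.cloudera.htrace.Sampler;
import org.cloudera.htrace.Trace;
import org.cloudera.htrace.TraceScope;

public class TracedGet {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "usertable");  // hypothetical table
    // Everything between startSpan() and close() is recorded as one span; the
    // client propagates the span over its RPCs, so server-side work shows up
    // as child spans (collected by a span receiver such as the Zipkin one).
    TraceScope scope = Trace.startSpan("get-user-row", Sampler.ALWAYS);
    try {
      table.get(new Get(Bytes.toBytes("user-00000042")));
    } finally {
      scope.close();
      table.close();
    }
  }
}
```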
HTrace: Distributed Tracing in HBase and HDFS

•  Framework inspired by Google's Dapper
•  Tracks time spent in RPC calls across different machines
•  Threaded through HBase (0.96) and, in the future, HDFS
(Diagram: an HBase client call fanning out as RPCs to hbase:meta, a region server, ZooKeeper, and the HDFS NameNode and DataNodes; each hop is recorded as a span.)

Zipkin – Visualizing Spans

•  UI + visualization system
•  Written by Twitter
•  Zipkin HBase storage backend
•  Zipkin HTrace integration
•  View where the time for a specific call is spent in HBase, HDFS, and ZK
A Future HBase
What's Upcoming
Outline

•  Improved mean time to recovery (MTTR)
•  Improved predictability
•  Improved usability
•  Improved multitenancy
Faster Read Recovery
Improving MTTR Further
Distributed Log Replay (experimental in 0.96)

•  Previously, recovery had two IO-intensive passes:
   •  Log splitting to intermediate files
   •  Assign and log replay
•  Now just one IO-heavy pass: assign first, then split+replay
•  Improves read and write recovery times
•  Off by default currently*
(Recovery timeline with distributed log replay: detect, assign, split + replay, recovered; the region becomes available for replay writes early, and for full read/write once replay finishes.)

*Caveat: if you override timestamps you could see READ REPEATED isolation violations (use tags to fix this).

Distributed Log Replay with Fast Write Recovery
•  Writes in HBase do not incur reads
•  With distributed log replay, regions are already open for writes during recovery
•  So allow fresh writes while replaying the old logs*

(Recovery timeline: detect, assign, split + replay, recovered; the region becomes available for all writes as soon as it is assigned, and for reads and writes once replay finishes.)

*Caveat: if you override timestamps you could see READ REPEATED isolation violations (use tags to fix this).

Fast Read Recovery (proposed)

•  Idea: pristine-region fast read recovery
   •  If a region has not been edited, it is consistent and can recover read/write immediately
•  Idea: shadow regions for fast read recovery
   •  A shadow region tails the WAL of the primary region
   •  The shadow memstore is one HDFS block behind; catch up, then recover read/write
•  Currently some progress toward trunk
(Recovery timeline: detect, assign, recovered. If we can guarantee a region has no new edits, or that we already have all of its edits, it becomes available for all reads and writes immediately.)

Improving the 99th Percentile
Improving Predictability
Common Causes of Performance Variability

•  Locality loss → favored nodes, HDFS block affinity
•  Compaction → exploring compaction policy
•  GC → off-heap cache
•  Hardware hiccups → multi-WAL, HDFS speculative (hedged) reads (see the sketch below)
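The "HDFS speculative read" mitigation is the HDFS hedged-read feature covered a couple of slides further on. A sketch of the two client-side knobs involved; in practice they belong in the region servers' hbase-site.xml (the region server is the HDFS client doing the reads), and they are shown programmatically only to make the property names concrete.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HedgedReadSettings {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("dfs.client.hedged.read.threadpool.size", 20);    // > 0 enables hedged reads
    conf.setLong("dfs.client.hedged.read.threshold.millis", 10L); // start a second read after 10 ms
    System.out.println(conf.get("dfs.client.hedged.read.threadpool.size"));
  }
}
```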
Performance Degraded After Recovery

•  After recovery, reads suffer a performance hit
   •  Regions have lost locality
•  To maintain performance after failover, we need to regain locality
   •  Compact the region to regain locality
   •  We can do better by using HDFS features
(Timeline: after recovery the service is back but read performance is degraded; performance recovers only once compaction restores locality.)

Read Throughput: Favored Nodes (experimental in 0.96)

•  Control and track where block replicas are placed
   •  All files for a region are created so that their blocks go to the same set of favored nodes
   •  When failing over, assign the region to one of those favored nodes
•  Currently a preview feature in 0.96
   •  Disabled by default because it doesn't work well with the latest balancer or with splits
   •  Will likely use upcoming HDFS block affinity for better operability
•  Originally on Facebook's 0.89 branch, ported to 0.96
(Timeline: the service recovers and performance is sustained because the region is assigned to a favored node.)

Read Latency: HDFS Hedged Reads (CDH5.0)

•  HBase's region servers use the HDFS client to read one of the three HDFS block replicas
•  If you chose the slow node, your reads are slow
•  If a read is taking too long, speculatively go to another replica that may be faster
(Diagram: a region server reading one of three HDFS replicas; when the chosen replica is slow, a hedged read goes to another replica.)

Read Latency: Read Replicas (in progress)

•  The HBase client reads from the primary region server for a region
•  If you chose a slow node, your reads are slow
•  Idea: read replicas of the region assigned to other region servers; replicas periodically catch up (via snapshots or shadow region memstores)
•  The client specifies whether a stale read is OK; if a read is taking too long, speculatively go to another replica that may be faster
(Diagram: an HBase client reading from the primary region replica; when it is slow, the client reads a possibly stale secondary replica.)

Write Latency: Multiple WALs (in progress)

•  HBase's HDFS client writes three replicas of the write-ahead log
•  Minimum write latency is bounded by the slowest of the three replicas
•  Idea: if a write is taking too long, duplicate it on another set of replicas that may be faster
(Diagram: a region server writing WAL blocks to three HDFS replicas; when the pipeline is slow, the write is duplicated to another set of replicas.)

Improving Usability
Autotuning, Tracing, and SQL
Making HBase Easier to Use and Tune

•  It is difficult to see what is happening inside HBase
•  It is easy to make poor design decisions early without realizing it
•  New developments:
   •  Memory auto-tuning
   •  HTrace + Zipkin
   •  Frameworks for schema design
Memory Use Auto-tuning (trunk)

•  Region server memory is divided between:
   •  the memstore (used for serving recent writes)
   •  the block cache (used for read hot spots)
•  Need to choose the balance for the workload (see the sketch below)
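Today that balance is fixed by hand; the sketch below shows the 0.96-era properties that the auto-tuning work would adjust at runtime. The property names are my recollection and normally live in hbase-site.xml rather than code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class MemoryBalanceSettings {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Write-heavy tilt: more heap for memstores, less for the read block cache.
    conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.45f);
    conf.setFloat("hbase.regionserver.global.memstore.lowerLimit", 0.40f);
    conf.setFloat("hfile.block.cache.size", 0.30f);
    System.out.println(conf.get("hfile.block.cache.size"));
  }
}
```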
(Diagram: three example heap splits between memstore and block cache, corresponding to read-heavy, balanced, and write-heavy workloads.)

HBase Schemas

•  HBase application developers must iterate to find a suitable HBase schema
•  Schema is critical for performance at scale
•  How can we make this easier?
•  How can we reduce the expertise required to do this?
•  Today:
   •  Lots of tuning knobs
   •  Developers need to understand column families, rowkey design, data encoding, …
   •  Some choices are expensive to change after the fact
How Should I Arrange My Data?

•  Isomorphic data representations! (A client-side sketch follows the tables below.)
Tall skinny table with a compound rowkey:

  rowkey   | d:
  bob-col1 | aaaa
  bob-col2 | bbbb
  bob-col3 | cccc
  bob-col4 | dddd
  jon-col1 | eeee
  jon-col2 | ffff
  jon-col3 | gggg
  jon-col4 | hhhh

Short fat table using column qualifiers:

  rowkey | d:col1 | d:col2 | d:col3 | d:col4
  bob    | aaaa   | bbbb   | cccc   | dddd
  jon    | eeee   | ffff   | gggg   | hhhh

Short fat table using column families:

  rowkey | col1: | col2: | col3: | col4:
  bob    | aaaa  | bbbb  | cccc  | dddd
  jon    | eeee  | ffff  | gggg  | hhhh
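A sketch of what the isomorphism in the tables above looks like from the client side: the same two attributes of "bob" written once as a short fat row and once as tall skinny rows with a compound rowkey. The table names and the "-" rowkey separator are arbitrary choices for illustration.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SchemaShapes {
  public static void main(String[] args) throws Exception {
    byte[] d = Bytes.toBytes("d");

    // Short fat: one row per entity, one qualifier per attribute.
    HTable fat = new HTable(HBaseConfiguration.create(), "users_fat");   // hypothetical table
    Put bobFat = new Put(Bytes.toBytes("bob"));
    bobFat.add(d, Bytes.toBytes("col1"), Bytes.toBytes("aaaa"));
    bobFat.add(d, Bytes.toBytes("col2"), Bytes.toBytes("bbbb"));
    fat.put(bobFat);
    fat.close();

    // Tall skinny: the attribute name moves into a compound rowkey, so each
    // attribute becomes its own row stored under an empty qualifier in "d".
    HTable tall = new HTable(HBaseConfiguration.create(), "users_tall"); // hypothetical table
    Put bobCol1 = new Put(Bytes.toBytes("bob-col1"));
    bobCol1.add(d, Bytes.toBytes(""), Bytes.toBytes("aaaa"));
    Put bobCol2 = new Put(Bytes.toBytes("bob-col2"));
    bobCol2.add(d, Bytes.toBytes(""), Bytes.toBytes("bbbb"));
    tall.put(bobCol1);
    tall.put(bobCol2);
    tall.close();
  }
}
```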
With great power comes great responsibility!
How can we make this easier for users?

Impala

•  Scalable, low-latency SQL querying for HDFS (and HBase!)
•  ODBC/JDBC driver interface
•  Highlights:
   •  Uses the Hive metastore and its Hive-HBase connector configuration conventions
   •  Native-code implementation; uses JIT compilation for query execution optimization
   •  Kerberos-based authentication support
•  Open sourced by Cloudera
•  https://github.com/cloudera/impala
Phoenix

•  A SQL skin over HBase targeting low-latency queries
•  JDBC SQL interface (see the sketch below)
•  Highlights:
   •  Adds types
   •  Handles compound row key encoding
   •  Secondary indexes in development
   •  Provides some pushdown aggregations (via coprocessors)
•  Open sourced by Salesforce.com
   •  Work from James Taylor, Jesse Yates, et al.
•  https://github.com/forcedotcom/phoenix
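A sketch of the JDBC surface: typed columns, a compound primary key, and aggregation pushed down to the region servers through plain SQL. It assumes the Phoenix client jar is on the classpath and that "localhost" stands in for the cluster's ZooKeeper quorum; the table and data are made up.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixSketch {
  public static void main(String[] args) throws Exception {
    Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
    Statement stmt = conn.createStatement();

    // Compound, typed primary key encoded into the HBase rowkey by Phoenix.
    stmt.execute("CREATE TABLE IF NOT EXISTS metrics ("
        + " host VARCHAR NOT NULL, ts DATE NOT NULL, value DOUBLE"
        + " CONSTRAINT pk PRIMARY KEY (host, ts))");

    stmt.executeUpdate("UPSERT INTO metrics VALUES ('web01', CURRENT_DATE(), 0.75)");
    conn.commit();  // Phoenix connections do not auto-commit by default

    // The GROUP BY aggregation runs server-side via coprocessors.
    ResultSet rs = stmt.executeQuery("SELECT host, AVG(value) FROM metrics GROUP BY host");
    while (rs.next()) {
      System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
    }
    conn.close();
  }
}
```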
Kite (née Cloudera Development Kit / CDK)

•  APIs that provide a Dataset abstraction
   •  Provides a get/put/delete API over Avro objects
   •  HBase support in progress
•  Highlights:
   •  Supports multiple components of the Hadoop distros (Flume, Morphlines, Hive, Crunch, HCatalog)
   •  Provides types, using Avro and Parquet formats for encoding entities
   •  Manages schema evolution
•  Open sourced by Cloudera
•  https://github.com/kite-sdk/kite
Multi-tenancy
Many Apps and Users in a Single Cluster
Growing HBase

•  Pre-0.96.0: scaling up HBase for single HBase applications
   •  Essentially a single user running a single app
   •  Ex: Facebook Messages — one application, many HBase clusters
   •  Shard users to different pods
•  Focused on continuity and disaster-recovery features:
   •  Cross-cluster replication
   •  Table snapshots
   •  Rolling upgrades
(Diagram: scalability plotted as number of clusters vs. number of isolated applications — one giant application spread over multiple clusters.)

Growing HBase

•  In 0.96 we introduce primitives for supporting multitenancy
•  Many users, many applications, one HBase cluster
•  Need some control over the interactions different users cause
•  Ex: manage MR analytics and low-latency serving in one cluster
(Diagram: the same scalability axes, now with a multitenancy region — many applications in one shared cluster.)

Namespaces (0.96)

•  Namespaces provide an abstraction for multiple tenants to create and manage their own tables within a large HBase instance (see the sketch below)
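A sketch of the tenant-facing side of this in the 0.96 Java Admin API: create a namespace, then create tables addressed by their namespace-qualified name. The namespace, table, and family names are hypothetical.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.NamespaceDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class NamespaceSketch {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    try {
      // Each tenant gets its own namespace...
      admin.createNamespace(NamespaceDescriptor.create("blue").build());

      // ...and creates tables inside it, addressed as "<namespace>:<table>".
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("blue", "orders"));
      desc.addFamily(new HColumnDescriptor("d"));
      admin.createTable(desc);
    } finally {
      admin.close();
    }
  }
}
```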
(Diagram: tables grouped into namespaces "blue", "green", and "orange".)

Multitenancy Goals

•  Security (0.96)
   •  Separate admin ACLs for different sets of tables
•  Quotas (in progress)
   •  Max tables, max regions
•  Performance isolation (in progress)
   •  Limit the performance impact that load on one table has on others
•  Priority (future)
   •  Prioritize some workloads/tables/users before others
Isolation with Region Server Groups (in progress)
(Diagram: region assignment distribution without region server groups — regions from the "blue", "green", and "orange" namespaces are spread across all region servers.)

Isolation with Region Server Groups (in progress)
(Diagram: region assignment distribution with region server groups (RSG) — the "blue" namespace is pinned to its own group, while "green" and "orange" share another.)

Conclusions
Summary by Version

  Area         | 0.90 (CDH3)               | 0.92 / 0.94 (CDH4)           | 0.96 (CDH5)                                 | Next (0.98 / 1.0.0)
  New features | Stability                 | Reliability                  | Continuity                                  | Multitenancy
  MTTR         | Recovery in hours         | Recovery in minutes          | Writes in seconds, reads in 10s of seconds  | Recovery in seconds (reads + writes)
  Perf         | Baseline                  | Better throughput            | Optimizing performance                      | Predictable performance
  Usability    | HBase developer expertise | HBase operational experience | Distributed-systems admin experience        | Application developer experience
Questions?
@jmhsieh

Más contenido relacionado

La actualidad más candente

High Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and FutureHigh Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and FutureDataWorks Summit
 
Meet HBase 1.0
Meet HBase 1.0Meet HBase 1.0
Meet HBase 1.0enissoz
 
Mapreduce over snapshots
Mapreduce over snapshotsMapreduce over snapshots
Mapreduce over snapshotsenissoz
 
Operating and supporting HBase Clusters
Operating and supporting HBase ClustersOperating and supporting HBase Clusters
Operating and supporting HBase Clustersenissoz
 
HBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBaseHBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBaseCloudera, Inc.
 
HBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon
 
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBaseCon
 
HBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon
 
HBase Sizing Guide
HBase Sizing GuideHBase Sizing Guide
HBase Sizing Guidelarsgeorge
 
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...Cloudera, Inc.
 
HBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsHBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsCloudera, Inc.
 
CBlocks - Posix compliant files systems for HDFS
CBlocks - Posix compliant files systems for HDFSCBlocks - Posix compliant files systems for HDFS
CBlocks - Posix compliant files systems for HDFSDataWorks Summit
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon
 
HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars GeorgeJAX London
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practicelarsgeorge
 
HBase Backups
HBase BackupsHBase Backups
HBase BackupsHBaseCon
 
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.Cloudera, Inc.
 
State of HBase: Meet the Release Managers
State of HBase: Meet the Release ManagersState of HBase: Meet the Release Managers
State of HBase: Meet the Release ManagersHBaseCon
 

La actualidad más candente (19)

HBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and CompactionHBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and Compaction
 
High Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and FutureHigh Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and Future
 
Meet HBase 1.0
Meet HBase 1.0Meet HBase 1.0
Meet HBase 1.0
 
Mapreduce over snapshots
Mapreduce over snapshotsMapreduce over snapshots
Mapreduce over snapshots
 
Operating and supporting HBase Clusters
Operating and supporting HBase ClustersOperating and supporting HBase Clusters
Operating and supporting HBase Clusters
 
HBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBaseHBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBase
 
HBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on Mesos
 
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial Industry
 
HBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and Spark
 
HBase Sizing Guide
HBase Sizing GuideHBase Sizing Guide
HBase Sizing Guide
 
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
 
HBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsHBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table Snapshots
 
CBlocks - Posix compliant files systems for HDFS
CBlocks - Posix compliant files systems for HDFSCBlocks - Posix compliant files systems for HDFS
CBlocks - Posix compliant files systems for HDFS
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ Salesforce
 
HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars George
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practice
 
HBase Backups
HBase BackupsHBase Backups
HBase Backups
 
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
 
State of HBase: Meet the Release Managers
State of HBase: Meet the Release ManagersState of HBase: Meet the Release Managers
State of HBase: Meet the Release Managers
 

Similar a Apache HBase: Where We've Been and What's Upcoming

HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and FutureDataWorks Summit
 
A glimpse into the Future of Hadoop & Big Data
A glimpse into the Future of Hadoop & Big DataA glimpse into the Future of Hadoop & Big Data
A glimpse into the Future of Hadoop & Big DataSaurav Kumar Sinha
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestHBaseCon
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDYVenneladonthireddy1
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsEsther Kundin
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceChris Nauroth
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsEsther Kundin
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaData Con LA
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
Hbase status quo apache-con europe - nov 2012
Hbase status quo   apache-con europe - nov 2012Hbase status quo   apache-con europe - nov 2012
Hbase status quo apache-con europe - nov 2012Chris Huang
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop OverviewBrian Enochson
 
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014Kevin Crocker
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem pptsunera pathan
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
hadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxhadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxraghavanand36
 

Similar a Apache HBase: Where We've Been and What's Upcoming (20)

HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
 
A glimpse into the Future of Hadoop & Big Data
A glimpse into the Future of Hadoop & Big DataA glimpse into the Future of Hadoop & Big Data
A glimpse into the Future of Hadoop & Big Data
 
Hadoop pycon2011uk
Hadoop pycon2011ukHadoop pycon2011uk
Hadoop pycon2011uk
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduce
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jha
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Hbase status quo apache-con europe - nov 2012
Hbase status quo   apache-con europe - nov 2012Hbase status quo   apache-con europe - nov 2012
Hbase status quo apache-con europe - nov 2012
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
hadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxhadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptx
 

Más de huguk

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifactahuguk
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introhuguk
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoophuguk
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...huguk
 
Extracting maximum value from data while protecting consumer privacy. Jason ...
Extracting maximum value from data while protecting consumer privacy.  Jason ...Extracting maximum value from data while protecting consumer privacy.  Jason ...
Extracting maximum value from data while protecting consumer privacy. Jason ...huguk
 
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM WatsonIntelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watsonhuguk
 
Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink huguk
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...huguk
 
Jonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & PitchingJonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & Pitchinghuguk
 
Signal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News MonitoringSignal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News Monitoringhuguk
 
Dean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your StartupDean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your Startuphuguk
 
Peter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapultPeter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapulthuguk
 
Cytora: Real-Time Political Risk Analysis
Cytora:  Real-Time Political Risk AnalysisCytora:  Real-Time Political Risk Analysis
Cytora: Real-Time Political Risk Analysishuguk
 
Cubitic: Predictive Analytics
Cubitic: Predictive AnalyticsCubitic: Predictive Analytics
Cubitic: Predictive Analyticshuguk
 
Bird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made SocialBird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made Socialhuguk
 
Aiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine IntelligenceAiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine Intelligencehuguk
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive huguk
 
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...huguk
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthyhuguk
 

Más de huguk (20)

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp intro
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
 
Extracting maximum value from data while protecting consumer privacy. Jason ...
Extracting maximum value from data while protecting consumer privacy.  Jason ...Extracting maximum value from data while protecting consumer privacy.  Jason ...
Extracting maximum value from data while protecting consumer privacy. Jason ...
 
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM WatsonIntelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
 
Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
 
Jonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & PitchingJonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & Pitching
 
Signal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News MonitoringSignal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News Monitoring
 
Dean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your StartupDean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your Startup
 
Peter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapultPeter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapult
 
Cytora: Real-Time Political Risk Analysis
Cytora:  Real-Time Political Risk AnalysisCytora:  Real-Time Political Risk Analysis
Cytora: Real-Time Political Risk Analysis
 
Cubitic: Predictive Analytics
Cubitic: Predictive AnalyticsCubitic: Predictive Analytics
Cubitic: Predictive Analytics
 
Bird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made SocialBird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made Social
 
Aiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine IntelligenceAiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine Intelligence
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
 
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthy
 

Último

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 

Último (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 

Apache HBase: Where We've Been and What's Upcoming

• 1. Apache HBase – Where we've been and what's upcoming. Jonathan Hsieh | @jmhsieh. Tech lead / Software Engineer at Cloudera | HBase PMC Member. Hadoop Users Group UK, April 10, 2014
• 2. Who Am I? • Cloudera: Tech Lead, HBase Team; Software Engineer • Apache HBase committer / PMC • Apache Flume founder / PMC • U of Washington: research in distributed systems
• 3. What is Apache HBase? Apache HBase is a reliable, column-oriented data store that provides consistent, low-latency, random read/write access. [Architecture diagram: application and MapReduce clients on top of HBase, backed by ZooKeeper and HDFS]
• 4. HBase provides low-latency random access • Writes: 1-3ms, 1k-20k writes/sec per node • Reads: 0-3ms cached, 10-30ms from disk; 10k-40k reads/sec per node from cache • Cell size: 0B-3MB • Read, write, and insert data anywhere in the table. [Diagram: sorted row keys partitioned across regions 1-5]
• 5. Core Properties • ACID guarantees on a row • Writes are durable • Strong consistency first, then availability: after a failure, recover and return the current value instead of a stale value • CAS and atomic increments can be efficient • Sorted by primary key: short scans are efficient • Partitioned by primary key • Log-structured merge tree: writes are extremely efficient, reads are efficient, but periodic layout optimization for reads ("compactions") is required
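To make the row-level CAS and atomic increment claims concrete, here is a minimal client-side sketch against the 0.94/0.96-era Java API; the table, family, and qualifier names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RowAtomicityExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "counters");   // hypothetical table name

    byte[] row = Bytes.toBytes("user-42");
    byte[] cf  = Bytes.toBytes("d");

    // Atomic increment: no read-modify-write round trip in the client.
    long hits = table.incrementColumnValue(row, cf, Bytes.toBytes("hits"), 1L);

    // Check-and-put (CAS): applies the Put only if "state" still equals "open".
    Put close = new Put(row);
    close.add(cf, Bytes.toBytes("state"), Bytes.toBytes("closed"));
    boolean applied = table.checkAndPut(row, cf, Bytes.toBytes("state"),
        Bytes.toBytes("open"), close);

    System.out.println("hits=" + hits + " closed=" + applied);
    table.close();
  }
}
```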
• 6. An HBase History: Where We've Been
• 7. Apache HBase Timeline • Nov '06: Google BigTable paper (OSDI '06) • Apr '07: First Apache HBase commit as a Hadoop contrib project • Jan '08: Promoted to Hadoop subproject • Summer '09: StumbleUpon goes production on HBase ~0.20 • Apr '10: Apache HBase becomes a top-level project • Apr '11: CDH3 GA with HBase 0.90.1 • Summer '11: Messages on HBase; Web Crawl Cache • Nov '11: Cassini on HBase • Jan '12: 0.92.0 • May '12: 0.94.0; HBaseCon 2012 • Jan '13: Phoenix on HBase • Jun '13: HBaseCon 2013 • Oct '13: 0.96.0 • Feb '14: 0.98.0
• 8. Developer Community • Active community! • Diverse committers from many organizations
• 9. Apache HBase "NASCAR" Slide
• 10. Apache HBase Core Development • Vendors • Self-service
• 11. Apache HBase Sample Users • Inbox • Storage • Web • Search • Analytics • Monitoring
• 12. Apache HBase Ecosystem Projects
• 13. What's here and new today? Today: Apache HBase 0.96.2 / 0.98.1
• 14. Critical Features • Disaster recovery: cluster replication; table snapshots; copy table; import/export tables; metadata corruption repair tool (hbck) • Administrative and continuity: Kerberos-based authentication; ACL-based authorization; config change via rolling restart; within-version rolling upgrade; protobuf-based wire protocol for RPC future-proofing
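To make the snapshot workflow from the list above concrete, here is a minimal sketch using the HBaseAdmin API of that era; the table and snapshot names are made up, and the exact method overloads differ slightly between 0.96 and 0.98.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SnapshotExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Take a point-in-time snapshot of an (assumed) online table.
    admin.snapshot("orders_snap_20140410", TableName.valueOf("orders"));

    // Materialize the snapshot as a new table without copying data up front.
    admin.cloneSnapshot("orders_snap_20140410", TableName.valueOf("orders_restored"));

    admin.close();
  }
}
```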
• 15. Hardened for 0.96 • Table administration: online schema change; online region merging; continuous fault-injection testing with "Chaos Monkey" • Performance tuning: alternate key encodings for efficient memory usage; exploring-compactor policy minimizes compaction storms; smart, adaptive stochastic region load balancer; fast split policy for new tables
• 16. MR over Table Snapshots (0.98, CDH5.0) • Previously, MapReduce jobs over HBase required an online full table scan • Idea: take a snapshot and run the MR job over the snapshot files • Doesn't use the HBase client • Avoids affecting HBase caches • 3-5x performance boost. [Diagram: map and reduce tasks reading snapshot files directly rather than scanning through region servers]
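A sketch of wiring this up with TableSnapshotInputFormat via TableMapReduceUtil (available from 0.98 / CDH5.0); the snapshot name, restore directory, and mapper are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SnapshotScanJob {
  // Trivial mapper: counts rows read straight from the snapshot's HFiles.
  static class RowCounter extends TableMapper<NullWritable, LongWritable> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context ctx) {
      ctx.getCounter("snapshot", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan-over-snapshot");
    job.setJarByClass(SnapshotScanJob.class);

    // Reads go against snapshot files in HDFS, bypassing region servers.
    TableMapReduceUtil.initTableSnapshotMapperJob(
        "orders_snap_20140410",             // assumed snapshot name
        new Scan(),                         // full scan of the snapshot
        RowCounter.class,
        NullWritable.class, LongWritable.class,
        job, true,
        new Path("/tmp/snapshot-restore")); // scratch dir for restored refs

    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```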
• 17. Mean Time to Recovery (MTTR) • Machine failures happen in distributed systems • MTTR is the average unavailability while automatically recovering from a failure • Also: recovery time from an unclean data-center power cycle. [Timeline diagram: detect → repair → notify → recovered; the region is unavailable, then available with the client unaware, then available with the client aware]
• 18. Fast notification and detection (0.96) • Proactive notification of HMaster failure (0.96) • Proactive notification of RS failure (0.96) • Notify the client on recovery (0.96) • Fast server failover (hardware). [Timeline diagram: detect → split → assign → replay → recovered, each stage backed by HDFS; the region is unavailable until it is available for read/write]
• 19. Distributed log replay (experimental in 0.96) • Previously there were two IO-intensive passes: log splitting to intermediate files, then assign and log replay • Now just one IO-heavy pass: assign first, then split + replay • Improves read and write recovery times • Off by default currently*. [Timeline diagram: detect → assign → split + replay → recovered; the region becomes available for replay writes before it is available for read/write] *Caveat: if you override timestamps you could see repeatable-read isolation violations (use tags to fix this)
• 20. Cell Tags (experimental in 0.98) • A mechanism for attaching arbitrary metadata to cells • Motivation: finer-grained isolation • Used for Accumulo-style cell-level visibility (the main feature of 0.98) • Other uses: adding sequence numbers to enable correct fast read/write recovery; potential for schema tags
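As a sketch of how the tag-backed visibility feature surfaces in the 0.98 client API, assuming the VisibilityController coprocessor is enabled and the labels below have already been defined by an admin; the table, labels, and expression are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.security.visibility.Authorizations;
import org.apache.hadoop.hbase.security.visibility.CellVisibility;
import org.apache.hadoop.hbase.util.Bytes;

public class CellVisibilityExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "patients");   // hypothetical table

    // Write a cell carrying a visibility expression, stored as a cell tag.
    Put put = new Put(Bytes.toBytes("row1"));
    put.setCellVisibility(new CellVisibility("MEDICAL & !INTERN"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("diagnosis"), Bytes.toBytes("..."));
    table.put(put);

    // Read back only the cells visible to the supplied authorizations.
    Scan scan = new Scan();
    scan.setAuthorizations(new Authorizations("MEDICAL"));
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      System.out.println(r);
    }
    scanner.close();
    table.close();
  }
}
```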
• 21. HTrace (experimental in 0.96) • Problem: where is time being spent inside HBase? • Solution: the HTrace framework • Inspired by Google Dapper • Threaded through HBase and HDFS • Tracks time spent in calls in a distributed system by tracking spans* on different machines. *Some assembly still required.
• 22. HTrace: Distributed Tracing in HBase and HDFS • Framework inspired by Google Dapper • Tracks time spent in RPC calls across different machines • Threaded through HBase (0.96) and, in the future, HDFS. [Diagram: a span following RPC calls from the HBase client through the meta table, a region server, ZooKeeper, and the HDFS NameNode/DataNode]
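A rough sketch of opening a client-side root span around an HBase call. Note that the HTrace package name changed across releases (org.cloudera.htrace in the 0.96 era, org.htrace later, and eventually org.apache.htrace), so the imports below are an assumption; the span name and table are illustrative, and a span receiver (e.g. one that feeds Zipkin) still has to be configured separately.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;
import org.htrace.Sampler;
import org.htrace.Trace;
import org.htrace.TraceScope;

public class TracedGetExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "orders");   // hypothetical table

    // Open a root span; child spans created inside the HBase/HDFS client
    // RPCs are stitched to it and can be shipped to a configured receiver.
    TraceScope scope = Trace.startSpan("get-order-42", Sampler.ALWAYS);
    try {
      table.get(new Get(Bytes.toBytes("order-42")));
    } finally {
      scope.close();   // closes the span and records its duration
    }
    table.close();
  }
}
```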
• 23. Zipkin – Visualizing Spans • UI + visualization system • Written by Twitter • Zipkin HBase storage backend • Zipkin/HTrace integration • View where the time for a specific call is spent in HBase, HDFS, and ZK.
• 24. A Future HBase: What's Upcoming
• 25. Outline • Improved mean time to recovery (MTTR) • Improved predictability • Improved usability • Improved multitenancy
• 26. Faster read recovery: Improving MTTR Further
• 27. Distributed log replay (experimental in 0.96) • Previously there were two IO-intensive passes: log splitting to intermediate files, then assign and log replay • Now just one IO-heavy pass: assign first, then split + replay • Improves read and write recovery times • Off by default currently*. [Timeline diagram: detect → assign → split + replay → recovered; the region becomes available for replay writes before it is available for read/write] *Caveat: if you override timestamps you could see repeatable-read isolation violations (use tags to fix this)
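Since the feature is off by default, it has to be switched on explicitly. To the best of my recollection the flag is hbase.master.distributed.log.replay; a minimal sketch of setting it (in practice this goes into hbase-site.xml on the master and region servers rather than into code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class DistributedLogReplayConfig {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Assumed key for the 0.96-era experimental feature; normally placed in
    // hbase-site.xml, shown programmatically here only for illustration.
    conf.setBoolean("hbase.master.distributed.log.replay", true);
    System.out.println("distributed log replay enabled: "
        + conf.getBoolean("hbase.master.distributed.log.replay", false));
  }
}
```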
• 28. Distributed log replay with fast write recovery • Writes in HBase do not incur reads • With distributed log replay, regions are already open for write • So allow fresh writes while replaying old logs*. [Timeline diagram: detect → assign → split + replay → recovered; the region becomes available for all writes before it is available for read/write] *Caveat: if you override timestamps you could see repeatable-read isolation violations (use tags to fix this)
• 29. Fast Read Recovery (proposed) • Idea: pristine-region fast read recovery: if the region has not been edited it is consistent and can recover read/write immediately • Idea: shadow regions for fast read recovery: a shadow region tails the WAL of the primary region; the shadow memstore is one HDFS block behind, so it catches up and recovers read/write • Currently some progress in trunk. [Timeline diagram: detect → assign → recovered; if we can guarantee there are no new edits, or that we have all edits, the region is available for all read/write]
• 30. Improving the 99th percentile: Improving Predictability
• 31. Common causes of performance variability (and mitigations) • Locality loss → favored nodes, HDFS block affinity • Compaction → exploring compactor • GC* → off-heap cache • Hardware hiccups → multi-WAL, HDFS speculative read
• 32. Performance degraded after recovery • After recovery, reads suffer a performance hit because regions have lost locality • To maintain performance after failover, we need to regain locality • Compacting the region regains locality • We can do better by using HDFS features. [Timeline diagram: service recovered with degraded performance; performance recovers once compaction restores locality]
• 33. Read Throughput: Favored Nodes (experimental in 0.96) • Control and track where block replicas are placed • All files for a region are created such that their blocks go to the same set of favored nodes • When failing over, assign the region to one of those favored nodes • Currently a preview feature in 0.96 • Disabled by default because it doesn't work well with the latest balancer or with splits • Will likely use upcoming HDFS block affinity for better operability • Originally on Facebook's 0.89, ported to 0.96. [Timeline diagram: service recovered with performance sustained because the region is assigned to a favored node]
• 34. Read latency: HDFS hedged reads (CDH5.0) • HBase region servers use the HDFS client to read one of the three HDFS block replicas • If you chose the slow node, your reads are slow • If a read is taking too long, speculatively go to another replica that may be faster. [Diagram: a region server's slow read against one replica is hedged by reading another replica]
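The knobs behind this live in the HDFS client embedded in the region servers; as I recall they are dfs.client.hedged.read.threadpool.size and dfs.client.hedged.read.threshold.millis. A sketch with illustrative values (in practice these go into the region servers' hbase-site.xml / hdfs-site.xml):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HedgedReadConfig {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Threads dedicated to hedged (speculative) reads; 0 disables the feature.
    conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
    // Wait this long on the first replica before hedging to another one.
    conf.setLong("dfs.client.hedged.read.threshold.millis", 10);
    System.out.println("hedged read threads = "
        + conf.getInt("dfs.client.hedged.read.threadpool.size", 0));
  }
}
```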
• 35. Read latency: Read Replicas (in progress) • The HBase client reads from the primary region server • If you chose the slow node, your reads are slow • Idea: read replicas assigned to other region servers; replicas periodically catch up (via snapshots or shadow-region memstores) • The client specifies whether a stale read is OK; if a read is taking too long, speculatively go to another replica that may be faster. [Diagram: a slow read against the primary region is hedged by a stale read against a region replica]
• 36. Write latency: Multiple WALs (in progress) • HBase's HDFS client writes 3 replicas • Minimum write latency is bounded by the slowest of the 3 replicas • Idea: if a write is taking too long, duplicate it on another set of replicas that may be faster. [Diagram: a slow WAL write to one replica set is hedged by writing to a second replica set]
• 37. Improving Usability: Autotuning, Tracing, and SQL
• 38. Making HBase easier to use and tune • It is difficult to see what is happening inside HBase • It is easy to make poor design decisions early without realizing it • New developments: memory auto-tuning; HTrace + Zipkin; frameworks for schema design
• 39. Memory Use Auto-tuning (trunk) • Memory is divided between the memstore (used for serving recent writes) and the block cache (used for read hot spots) • Need to choose the balance for the workload. [Diagram: memstore vs. block cache split for read-heavy, balanced, and write-heavy workloads]
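Until auto-tuning lands, that split is set statically. A sketch using the configuration keys I believe were current in the 0.96/0.98 era (hfile.block.cache.size and hbase.regionserver.global.memstore.upperLimit), with illustrative values for a read-heavy region server:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HeapBalanceConfig {
  public static void main(String[] args) {
    // Normally set in hbase-site.xml; shown programmatically for illustration.
    Configuration conf = HBaseConfiguration.create();
    // Read-heavy profile: larger block cache, smaller global memstore pool.
    conf.setFloat("hfile.block.cache.size", 0.5f);                          // ~50% of heap
    conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.3f);   // ~30% of heap
    System.out.println("block cache fraction = "
        + conf.getFloat("hfile.block.cache.size", 0.25f));
  }
}
```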
• 40. HBase Schemas • HBase application developers must iterate to find a suitable HBase schema • Schema is critical for performance at scale • How can we make this easier? • How can we reduce the expertise required to do this? • Today: lots of tuning knobs; developers need to understand column families, rowkey design, data encoding, …; some choices are expensive to change after the fact
• 41. How should I arrange my data? • Isomorphic data representations! The same data can be laid out as: a tall, skinny table with a compound rowkey (rows bob-col1 … jon-col4 in a single column d:); a short, fat table using column qualifiers (rows bob and jon with columns d:col1–d:col4); or a short, fat table using column families (rows bob and jon with families col1: – col4:)
• 42. How should I arrange my data? • The same three isomorphic layouts (tall skinny with a compound rowkey; short fat using column qualifiers; short fat using column families) • With great power comes great responsibility! How can we make this easier for users?
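To ground the two row-shape choices, a small client-side sketch of writing the same logical data both ways; the table, family, and qualifier names are made up.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SchemaLayoutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    byte[] d = Bytes.toBytes("d");

    // Tall-skinny: the column name is folded into a compound rowkey,
    // so each (user, column) pair becomes its own row.
    HTable tall = new HTable(conf, "tall_table");   // hypothetical
    Put p1 = new Put(Bytes.toBytes("bob-col1"));
    p1.add(d, Bytes.toBytes(""), Bytes.toBytes("aaaa"));
    tall.put(p1);
    tall.close();

    // Short-fat: one row per user, one qualifier per column.
    HTable fat = new HTable(conf, "fat_table");     // hypothetical
    Put p2 = new Put(Bytes.toBytes("bob"));
    p2.add(d, Bytes.toBytes("col1"), Bytes.toBytes("aaaa"));
    p2.add(d, Bytes.toBytes("col2"), Bytes.toBytes("bbbb"));
    fat.put(p2);
    fat.close();
  }
}
```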
• 43. Impala • Scalable, low-latency SQL querying for HDFS (and HBase!) • ODBC/JDBC driver interface • Highlights: uses the Hive metastore and its Hive-HBase connector configuration conventions; native-code implementation, uses JIT for query-execution optimization; authorization via Kerberos support • Open sourced by Cloudera • https://github.com/cloudera/impala
• 44. Phoenix • A SQL skin over HBase targeting low-latency queries • JDBC SQL interface • Highlights: adds types; handles compound row-key encoding; secondary indices in development; provides some pushdown aggregations (via coprocessors) • Open sourced by Salesforce.com • Work from James Taylor, Jesse Yates, et al. • https://github.com/forcedotcom/phoenix
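A short sketch of what the Phoenix JDBC surface looks like; the ZooKeeper quorum, table, and columns are invented for illustration, and the DDL syntax is from memory of that era's releases. Note how the compound primary key maps onto an HBase compound rowkey, and how the GROUP BY aggregation is the kind of work pushed down via coprocessors.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixQueryExample {
  public static void main(String[] args) throws Exception {
    // Phoenix connection URL form: jdbc:phoenix:<zookeeper quorum>
    Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3");
    Statement stmt = conn.createStatement();

    // A typed table whose compound primary key becomes the HBase rowkey.
    stmt.execute("CREATE TABLE IF NOT EXISTS WEB_STAT ("
        + " HOST VARCHAR NOT NULL, DATE DATE NOT NULL, HITS BIGINT"
        + " CONSTRAINT PK PRIMARY KEY (HOST, DATE))");

    // Aggregation query; much of the work runs server-side in coprocessors.
    ResultSet rs = stmt.executeQuery(
        "SELECT HOST, SUM(HITS) FROM WEB_STAT GROUP BY HOST");
    while (rs.next()) {
      System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
    }
    conn.close();
  }
}
```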
• 45. Kite (née Cloudera Development Kit / CDK) • APIs that provide a Dataset abstraction • Provides a get/put/delete API over Avro objects • HBase support in progress • Highlights: supports multiple components of the Hadoop distros (Flume, Morphlines, Hive, Crunch, HCatalog); provides types using Avro and Parquet formats for encoding entities; manages schema evolution • Open sourced by Cloudera • https://github.com/kite-sdk/kite
• 46. Multi-tenancy: Many apps and users in a single cluster
• 47. Growing HBase • Pre-0.96.0: scaling up HBase for single HBase applications • Essentially a single user for a single app • Example: Facebook Messages is one application over many HBase clusters, with users sharded to different pods • Focused on continuity and disaster-recovery features: cross-cluster replication, table snapshots, rolling upgrades. [Diagram: scalability plotted as number of clusters vs. number of isolated applications; one giant application across multiple clusters]
• 48. Growing HBase • In 0.96 we introduce primitives for supporting multitenancy • Many users, many applications, one HBase cluster • Need some control over the interactions different users cause • Example: manage MR analytics and low-latency serving in one cluster. [Diagram: the multitenancy axis adds many applications in one shared cluster]
• 49. Namespaces (0.96) • Namespaces provide an abstraction for multiple tenants to create and manage their own tables within a large HBase instance. [Diagram: a cluster partitioned into namespaces blue, green, and orange]
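A minimal sketch of the 0.96-era admin calls behind this; the namespace, table, and family names are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.NamespaceDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class NamespaceExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Each tenant gets its own namespace...
    admin.createNamespace(NamespaceDescriptor.create("blue").build());

    // ...and creates tables inside it, addressed as "namespace:table".
    HTableDescriptor desc =
        new HTableDescriptor(TableName.valueOf("blue", "orders"));
    desc.addFamily(new HColumnDescriptor("d"));
    admin.createTable(desc);

    admin.close();
  }
}
```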
• 50. Multitenancy goals • Security (0.96): separate admin ACLs for different sets of tables • Quotas (in progress): max tables, max regions • Performance isolation (in progress): limit the performance impact load on one table has on others • Priority (future): prioritize some workloads/tables/users over others
• 51. Isolation with Region Server Groups (in progress). [Diagram: region assignment distribution without region server groups; regions from namespaces blue, green, and orange are mixed across all region servers]
• 52. Isolation with Region Server Groups (in progress). [Diagram: region assignment distribution with Region Server Groups (RSG); namespace blue maps to RSG blue, while namespaces green and orange share the green/orange RSG]
• 54. Summary by Version (columns: 0.90/CDH3, 0.92-0.94/CDH4, 0.96/CDH5, Next 0.98/1.0.0) • New features: Stability | Reliability | Continuity | Multitenancy • MTTR: Recovery in hours | Recovery in minutes | Recovery of writes in seconds, reads in tens of seconds | Recovery in seconds (reads + writes) • Perf: Baseline | Better throughput | Optimizing performance | Predictable performance • Usability: HBase developer expertise | HBase operational experience | Distributed-systems admin experience | Application developer experience
• 55. Questions? @jmhsieh