Slide 1: HBase and M7 Technical Overview

Jim Fiori
Senior Solutions Architect
MapR Technologies
April 2013
Slide 2: Who Am I?

§  Background
§  "a 3-hour tour"
§  Early Hadoop fire-fight
§  Big Data
Slide 3: Agenda

§  Apache HBase
§  MapR
§  M7
Slide 4: HBase

Google BigTable paper (2006):

A sparse, distributed, persistent, indexed, and sorted map

OR

A NoSQL database

OR

A columnar data store
Slide 5: Key-Value Store

§  Row key
   –  Binary, sortable value
§  Row content key (analogous to a column)
   –  Column family (string)
   –  Column qualifier (binary)
   –  Version/timestamp (number)
§  A row key, column family, column qualifier, and version uniquely identify a particular cell (modeled in the sketch below)
   –  A cell contains a single binary value
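To make that addressing model concrete, here is a minimal, illustrative Java sketch (not HBase code) that models a table as nested sorted maps keyed by row key, column family, column qualifier, and version. It uses String keys for readability even though HBase keys are binary; all names in it are hypothetical:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// A cell address is (row key, column family, column qualifier, version);
// the value at that address is a single binary value (byte[]).
public class SortedCellMap {
    // row -> family -> qualifier -> version -> value, all kept in sorted order
    private final NavigableMap<String,
        NavigableMap<String,
            NavigableMap<String,
                NavigableMap<Long, byte[]>>>> table = new TreeMap<>();

    public void put(String row, String family, String qualifier,
                    long version, byte[] value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(family, f -> new TreeMap<>())
             .computeIfAbsent(qualifier, q -> new TreeMap<>())
             .put(version, value);
    }

    // "Largest version wins" mirrors HBase's default of returning the
    // most recent version when no version is specified.
    public byte[] getLatest(String row, String family, String qualifier) {
        NavigableMap<Long, byte[]> versions =
            table.getOrDefault(row, new TreeMap<>())
                 .getOrDefault(family, new TreeMap<>())
                 .getOrDefault(qualifier, new TreeMap<>());
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }
}
```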
  
Slide 6: A Row

[Diagram: a row holds values Value1…ValueN under columns C0…CN; each value is
addressed by the tuple (row key, column family, column qualifier, version).]
Slide 7: Not a Traditional RDBMS

§  Weakly typed and schema-less (unstructured, or perhaps semi-structured)
   –  Almost everything is binary
§  No constraints
   –  You can put any binary value in any cell
   –  You can even put incompatible types in two different instances of the same
      column family:column qualifier
§  Columns (qualifiers) are created implicitly
§  Different rows can have different columns
§  No transactions/no ACID
   –  The only unit of atomic operation is a single row
Slide 8: API

§  APIs for querying (get), scanning, and updating (put) — see the client sketch below
   –  Operate on row key, column family, qualifier, version, and values
   –  Can partially specify and will retrieve the union of results
      •  If you specify just a row key, you get all values for it (with column family and qualifier)
   –  By default only the largest version (most recent, if a timestamp) is returned
      •  Specifying a row key and column family to get retrieves all values for that row and
         column family
   –  Scanning is just a get over a range of row keys
§  Version
   –  While it defaults to a timestamp, any integer is acceptable
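As a concrete illustration of these calls, here is a small client sketch using the circa-2013 HBase Java API (HTable, Put, Get, Scan); the table, row, and column names are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // hypothetical table

        // put: row key + column family + qualifier (+ implicit timestamp version)
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
        table.put(put);

        // get: specifying only the row key returns all cells for that row;
        // by default only the most recent version of each cell comes back
        Result row = table.get(new Get(Bytes.toBytes("row1")));
        byte[] value = row.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("col1"));

        // scan: a get over a range of row keys [startRow, stopRow)
        Scan scan = new Scan(Bytes.toBytes("row0"), Bytes.toBytes("row9"));
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
        }
        scanner.close();
        table.close();
    }
}
```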
  
Slide 9: Columnar

§  Rather than storing table rows linearly on disk, with each row stored
   as a single byte range with fixed-size fields, store the columns of a
   row separately
   –  Very efficient storage for sparse data sets (NULL is free)
   –  Compression works better on similar data
   –  Fetches of only subsets of a row are very efficient (less disk IO)
   –  No fixed size on column values
   –  No requirement to even define columns
§  Columns are grouped together into column families
   –  Basically a file on disk
   –  A unit of optimization
   –  In HBase, adding a column is implicit; adding a column family is explicit
      (see the sketch below)
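A brief sketch of that asymmetry with the circa-2013 admin API: column families are declared up front when the table is created, while qualifiers appear simply by writing to them. The table, family, and qualifier names here are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FamiliesVsQualifiers {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Explicit: column families are part of the table definition
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("mytable"); // hypothetical
        desc.addFamily(new HColumnDescriptor("cf1"));
        admin.createTable(desc);

        // Implicit: any qualifier can be written without declaring it first
        HTable table = new HTable(conf, "mytable");
        Put p = new Put(Bytes.toBytes("row1"));
        p.add(Bytes.toBytes("cf1"), Bytes.toBytes("brand-new-qualifier"),
              Bytes.toBytes("value"));
        table.put(p);
        table.close();
        admin.close();
    }
}
```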
  
Slide 10: HBase Table Architecture

§  Tables are divided into key ranges (regions)
§  Regions are served by nodes (RegionServers)
§  Columns are divided into access groups (column families)

[Diagram: a table laid out as row ranges R1–R4 against column families CF1–CF5.]
Slide 11: HBase Architecture

[Architecture diagram]
Slide 12: Storage Model Highlights

§  Data is stored in sorted order
   –  A table contains rows
   –  A sequence of rows is grouped together into a region
      •  A region consists of various files related to those rows and is loaded into a region
         server
      •  Regions are stored in HDFS for high availability
   –  A single region server manages multiple regions
      •  Region assignment can change – load balancing, failures, etc.
§  Clients connect to tables
   –  The HBase runtime transparently determines the region (based on key ranges)
      and contacts the appropriate region server (see the lookup sketch below)
§  At any given time exactly one region server provides access to a region
   –  Master region servers (with ZooKeeper) manage that
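Conceptually, that client-side region lookup is a floor search over sorted region start keys. A minimal, illustrative sketch (not the actual HBase client code; the server names and key ranges are hypothetical):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class RegionLookup {
    // region start key -> region server address; hypothetical example data
    private final NavigableMap<String, String> regionsByStartKey = new TreeMap<>();

    public RegionLookup() {
        regionsByStartKey.put("",  "rs1.example.com");  // ["", "g")
        regionsByStartKey.put("g", "rs2.example.com");  // ["g", "p")
        regionsByStartKey.put("p", "rs3.example.com");  // ["p", +inf)
    }

    // The region covering a row key is the one with the greatest
    // start key <= the row key.
    public String serverFor(String rowKey) {
        return regionsByStartKey.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        RegionLookup lookup = new RegionLookup();
        System.out.println(lookup.serverFor("hbase"));  // rs2.example.com
    }
}
```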
  
Slide 13: What's Great About This?

§  Very scalable
§  Easy to add region servers
§  Easy to move regions around
§  Scans are efficient
   –  Unlike hashing-based models
§  Access via row key is very efficient
   –  Note: there are no secondary indexes
§  No schema; can store whatever you want, when you want
§  Strong consistency
§  Integrated with Hadoop
   –  MapReduce on HBase is straightforward
   –  HDFS/MapR-FS provides data replication
Slide 14: Data Storage Architecture

§  Data from a region column family is stored in an HFile
   – An HFile contains row key:column qualifier:version:value entries
   – Index at the end into the data – 64KB "blocks" by default
§  Update
   – New value is written persistently to the Write Ahead Log (WAL)
   – Cached in memory (MemStore)
   – When memory fills, write out a new HFile
§  Read
   – Checks in memory, then all of the HFiles
   – Read data is cached in memory
§  Delete
   – Create a tombstone record (purged at major compaction)

(A minimal sketch of this write/read path follows.)
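To make the update and read paths concrete, here is a minimal, single-node sketch of the WAL + MemStore + immutable-file pattern. It is illustrative only (real HFiles are block-indexed, compressed files on HDFS) and every name in it is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class MiniLsmStore {
    private final List<String> wal = new ArrayList<>();          // stand-in for the WAL
    private NavigableMap<String, String> memStore = new TreeMap<>();
    private final List<NavigableMap<String, String>> hFiles = new ArrayList<>(); // newest last
    private final int flushThreshold = 4;

    public void put(String key, String value) {
        wal.add(key + "=" + value);      // 1. persist to the log first
        memStore.put(key, value);        // 2. then cache in memory
        if (memStore.size() >= flushThreshold) {
            hFiles.add(memStore);        // 3. when memory fills, write an immutable "HFile"
            memStore = new TreeMap<>();  //    and start a fresh MemStore
        }
    }

    // Read path: check memory first, then the files from newest to oldest.
    public String get(String key) {
        String v = memStore.get(key);
        if (v != null) return v;
        for (int i = hFiles.size() - 1; i >= 0; i--) {
            v = hFiles.get(i).get(key);
            if (v != null) return v;
        }
        return null;
    }
}
```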
  
Slide 15: Apache HBase HFile Structure

[Diagram callouts:]
–  64Kbyte blocks are compressed
–  An index into the compressed blocks is created as a btree
–  Key-value pairs are laid out in increasing order
–  Each cell is an individual key + value
   –  a row repeats the key for each column
Slide 16: HBase Region Operation

§  Typical region size is a few GB, sometimes even 10G or 20G
§  A RegionServer holds data in memory in a MemStore until full, then
   writes a new HFile
   –  The logical view of the database is constructed by layering these files, with the
      latest on top

[Diagram: a stack of HFiles, newest to oldest, covering the key range represented by this region.]
Slide 17: HBase Read Amplification

§  When a get/scan comes in, all the files have to be examined
   –  Schema-less, so where is the column?
   –  Done in memory and does not change what's on disk
      •  Bloom filters do not help in scans

With 7 files, a 1K-record get() potentially takes about 30 seeks,
7 block fetches, and decompressions, from HDFS. Even with the index in memory,
7 seeks and 7 block fetches are required.
Slide 18: HBase Write Amplification

§  To reduce the read amplification, HBase merges the HFiles periodically
   –  A process called compaction
   –  Runs automatically when there are too many files
   –  Usually turned off due to I/O storms which interfere with client access
   –  And kicked off manually on weekends

Major compaction reads all files and merges them into a single HFile
(see the merge sketch below).
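At its core, a major compaction is a merge of several sorted files into one sorted file, with the newest value for each key winning. A minimal, illustrative sketch that ignores versions, deletes/tombstones, and block layout; all names are hypothetical:

```java
import java.util.*;

public class CompactionSketch {
    // Merge several sorted "HFiles" (newest first) into one sorted file.
    // When the same key appears in several files, the newest value wins.
    static NavigableMap<String, String> majorCompact(
            List<NavigableMap<String, String>> filesNewestFirst) {
        NavigableMap<String, String> merged = new TreeMap<>();
        // Iterate oldest -> newest so newer entries overwrite older ones.
        for (int i = filesNewestFirst.size() - 1; i >= 0; i--) {
            merged.putAll(filesNewestFirst.get(i));
        }
        return merged;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> newer = new TreeMap<>(Map.of("a", "2", "c", "9"));
        NavigableMap<String, String> older = new TreeMap<>(Map.of("a", "1", "b", "5"));
        // {a=2, b=5, c=9}: reads now touch one file instead of two
        System.out.println(majorCompact(List.of(newer, older)));
    }
}
```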
  
Slide 20: WAL File

§  A persistent record of every update/insert in sequence order
   –  Shared by all regions on one region server
   –  WAL files are periodically rolled to limit size, but older WALs are still needed
   –  A WAL file is no longer needed once every region with updates in that WAL file has
      flushed those from memory to an HFile
      •  Remember that more HFiles slow the read path!
§  Must be replayed as part of the recovery process, since in-memory
   updates are "lost" (a replay sketch follows)
   –  This is very expensive and delays bringing a region back online
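Recovery conceptually re-applies every logged edit that had not yet been flushed to an HFile. A minimal, illustrative sketch, reusing the hypothetical "key=value" log format from the MiniLsmStore sketch above:

```java
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class WalReplaySketch {
    // Rebuild the lost MemStore by replaying "key=value" log entries in
    // sequence order; later entries overwrite earlier ones.
    static NavigableMap<String, String> replay(List<String> walEntries) {
        NavigableMap<String, String> memStore = new TreeMap<>();
        for (String entry : walEntries) {
            int eq = entry.indexOf('=');
            memStore.put(entry.substring(0, eq), entry.substring(eq + 1));
        }
        return memStore;
    }

    public static void main(String[] args) {
        // {a=3, b=2}: the region cannot serve reads until replay completes,
        // which is why long WALs mean slow recovery
        System.out.println(replay(List.of("a=1", "b=2", "a=3")));
    }
}
```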
  
Slide 21: What's Not So Good

Reliability
•  Complex coordination between ZK, HDFS, the HBase
   Master, and the Region Server during region movement
•  Compactions disrupt operations
•  Very slow crash recovery because of
   •  Coordination complexity
   •  WAL log reading (one log/server)

Business continuity
•  Many administrative actions require downtime
•  Not well integrated into MapR-FS mirroring and
   snapshot functionality
Slide 22: What's Not So Good

Performance
•  Very long read/write path
•  Significant read and write amplification
•  Multiple JVMs in the read/write path – GC delays!

Manageability
•  Compactions, splits, and merges must be done manually (in reality)
•  Lots of "well known" problems maintaining a reliable
   cluster – splitting, compactions, region assignment, etc.
•  Practical limits on the number of regions per region server and
   on the size of regions – can make it hard to fully utilize hardware
Slide 23: Region Assignment in Apache HBase

[Diagram]
Slide 24: Apache HBase on MapR

Limited data management, data protection, and disaster recovery for tables.
Slide 25: Agenda

§  HBase
§  MapR
§  M7
§  Containers
Slide 27: MapR Distribution for Apache Hadoop

§  Complete Hadoop distribution
§  Comprehensive management suite
§  Industry-standard interfaces
§  Enterprise-grade dependability
§  Higher performance
Slide 28: MapR: The Enterprise Grade Distribution

[Diagram]
Slide 29: One Platform for Big Data

[Diagram: one platform supporting a broad range of applications (MapReduce,
file-based applications, SQL, database, search, stream processing) across
batch, interactive, and real-time workloads, on a base of 99.999% HA, data
protection, disaster recovery, scalability & performance, enterprise
integration, and multi-tenancy. Example applications: recommendation engines,
fraud detection, billing, logistics, risk modeling, market segmentation,
inventory forecasting.]
Slide 32: The Cloud Leaders Pick MapR

§  Google chose MapR to provide Hadoop on Google Compute Engine
§  Amazon EMR is the largest Hadoop provider in revenue and # of clusters
§  MinuteSort record: 1.5 TB in 60 seconds, on 2103 nodes
Slide 34: MapR Editions

M5 Edition
§  Control System
§  NFS Access
§  Performance
§  High Availability
§  Snapshots & Mirroring
§  24 x 7 Support
§  Annual Subscription

M3 Edition
§  Control System
§  NFS Access
§  Performance
§  Unlimited Nodes
§  Free

M7 Edition
§  All the Features of M5
§  Simplified Administration for HBase
§  Increased Performance
§  Consistent Low Latency
§  Unified Snapshots, Mirroring

Also available through: Google Compute Engine
Slide 35: Agenda

§  HBase
§  MapR
§  M7
Slide 37: Introducing MapR M7

§  An integrated system
   –  Unified namespace for files and tables
   –  Built-in data management & protection
   –  No extra administration
§  Architected for reliability and performance
   –  Fewer layers
   –  Single hop to data
   –  No compactions, low I/O amplification
   –  Seamless splits, automatic merges
   –  Instant recovery
Slide 38: M7: Remove Layers, Simplify

[Diagram: the MapR M7 stack. Take note! No JVM!]
Slide 39: Binary Compatible with HBase APIs

§  HBase applications work "as is" with M7
   –  No need to recompile (binary compatible)
§  Can run M7 and HBase side by side on the same cluster
   –  e.g., during a migration
   –  Can access both an M7 table and an HBase table in the same program
      (see the sketch below)
§  Use the standard Apache HBase CopyTable tool to copy a table
   from HBase to M7 or vice versa

   % hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=/user/srivas/mytable oldtable
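Since M7 tables live in the file namespace (see slide 42), the only visible difference to HBase client code is the table name. A hedged sketch of the "both in one program" point, assuming an M7 table named by path and an ordinary HBase table name (both hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SideBySide {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // M7 table, addressed by its path in the unified namespace
        HTable m7Table = new HTable(conf, "/user/dave/table3");
        // Apache HBase table on the same cluster, addressed by plain name
        HTable hbaseTable = new HTable(conf, "oldtable");

        // Same API, same program, two different table implementations
        Result fromM7 = m7Table.get(new Get(Bytes.toBytes("row1")));
        Result fromHBase = hbaseTable.get(new Get(Bytes.toBytes("row1")));
        System.out.println(fromM7.isEmpty() + " " + fromHBase.isEmpty());

        m7Table.close();
        hbaseTable.close();
    }
}
```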
  
Slide 40: M7: No Master and No RegionServers

§  No extra daemons to manage
§  One hop to data
§  Unified cache
§  No JVM problems
Slide 41: Region Assignment in Apache HBase

None of this complexity is present in MapR M7.
Slide 42: Unified Namespace for Files and Tables

$ pwd
/mapr/default/user/dave

$ ls
file1  file2  table1  table2

$ hbase shell
hbase(main):003:0> create '/user/dave/table3', 'cf1', 'cf2', 'cf3'
0 row(s) in 0.1570 seconds

$ ls
file1  file2  table1  table2  table3

$ hadoop fs -ls /user/dave
Found 5 items
-rw-r--r--   3 mapr mapr         16 2012-09-28 08:34 /user/dave/file1
-rw-r--r--   3 mapr mapr         22 2012-09-28 08:34 /user/dave/file2
trwxr-xr-x   3 mapr mapr          2 2012-09-28 08:32 /user/dave/table1
trwxr-xr-x   3 mapr mapr          2 2012-09-28 08:33 /user/dave/table2
trwxr-xr-x   3 mapr mapr          2 2012-09-28 08:38 /user/dave/table3
Slide 43: Tables for End Users

§  Users can create and manage their own tables
   –  Unlimited # of tables
§  Tables can be created in any directory
   –  Tables count towards volume and user quotas
§  No admin intervention needed
   –  I can create a file or a directory without opening a ticket with the
      admin team, so why not a table?
   –  Do stuff on the fly; no stopping/restarting servers
§  Automatic data protection and disaster recovery
   –  Users can recover from snapshots/mirrors on their own
Slide 44: M7 – An Integrated System

[Diagram]
Slide 45: M7

Comparative analysis with Apache HBase, LevelDB, and a BTree
Slide 46: HBase Write Amplification Analysis

§  Assume 10G per region, write 10% per day, grow 10% per week
   –  1G of writes per day
   –  After 7 days: 7 files of 1G and 1 file of 10G (only 1G of the 7G written is growth)
§  IO cost
   –  Wrote 7G to the WAL + 7G to HFiles
   –  Compaction adds still more
      •  read: 17G (= 7 x 1G + 1 x 10G)
      •  write: 11G written to the new HFile
   –  Write amplification: wrote 7G "for real," but the actual disk IO after compaction
      is 17G read + 25G written, and that's assuming no application reads!
      (See the arithmetic below.)
§  IO cost of 1000 regions is similar to the above
   –  read 17T, write 25T → major impact on the node
§  Best practice is to limit the # of regions per node → can't fully utilize storage
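A small Java snippet reproducing the slide's arithmetic makes the amplification explicit; all quantities are in GB and come straight from the assumptions above:

```java
public class WriteAmplification {
    public static void main(String[] args) {
        double regionSize  = 10.0;   // GB, the pre-existing HFile
        double dailyWrites = 1.0;    // GB/day = 10% of the region
        int days = 7;

        double walWrites   = days * dailyWrites;   // 7G to the WAL
        double flushWrites = days * dailyWrites;   // 7G across 7 flushed HFiles

        // Major compaction: read every file, write one merged file.
        double compactionRead  = days * dailyWrites + regionSize;  // 17G
        double compactionWrite = regionSize + 1.0;                 // 11G (only 1G is net growth)

        double totalWritten = walWrites + flushWrites + compactionWrite; // 25G
        System.out.printf("logical writes: %.0fG%n", walWrites);
        System.out.printf("disk read: %.0fG, disk written: %.0fG%n",
                          compactionRead, totalWritten);
        // => 7G of "real" data costs 17G of reads and 25G of writes on disk
    }
}
```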
  
Slide 47: Alternative: LevelDB

§  Tiered, logarithmic increase
   –  L1: 2 x 1M files
   –  L2: 10 x 1M
   –  L3: 100 x 1M
   –  L4: 1,000 x 1M, etc.
§  Compaction overhead
   –  Avoids IO storms (I/O done in smaller increments of ~10M)
   –  But significantly more bandwidth compared to HBase
§  Read overhead is still high
   –  10-15 seeks, perhaps more if the lowest level is very large
   –  40K-60K read from disk to retrieve a 1K record
Slide 48: BTree Analysis

§  A read finds data directly; proven to be fastest
   –  Interior nodes only hold keys
   –  Very large branching factor
   –  Values only at leaves
   –  Thus index caches work
   –  R = logN seeks, if no caching
   –  A 1K record read will transfer about logN blocks from disk
§  Writes are slow on inserts
   –  Inserted into the correct place right away
   –  Otherwise a read will not find it
   –  Requires the btree to be continuously rebalanced
   –  Causes extreme random I/O in the insert path
   –  W = 2.5x + logN seeks if no caching
   (the cost model is summarized below)
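Restating the cost model from this slide and the next in one place (a sketch in the deck's own notation, where N is the number of keys and the 2.5x term is unpacked on the next slide as log write + data write + metadata update):

```latex
\begin{align*}
R_{\mathrm{BTree}} &\approx \log N \ \text{seeks (uncached)}\\
W_{\mathrm{BTree}} &\approx 2.5x + \log N \ \text{seeks (uncached)}\\
W_{\min} &= 2.5x \quad \text{(write log + write data + update metadata)}
\end{align*}
```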
  
Slide 49: Log-Structured Merge Trees

§  LSM trees reduce insert cost by deferring and batching index changes
   –  If you don't compact often, read performance is impacted
   –  If you compact too often, write performance is impacted
§  B-trees are great for reads
   –  But expensive to update in real time

[Diagram: writes go to a log plus an in-memory index; reads consult the
in-memory index and the on-disk index.]

Can we combine both ideas?

Writes cannot be done better than W = 2.5x:
write to log + write data to somewhere + update meta-data
Slide 50: M7 from MapR

§  Twisting BTrees
   –  Leaves are variable size (8K-8M or larger)
   –  Can stay unbalanced for long periods of time
      •  More inserts will balance it eventually
      •  Automatically throttles updates to interior btree nodes
   –  M7 inserts "close to" where the data is supposed to go
§  Reads
   –  Uses the BTree structure to get "close" very fast
      •  Very high branching with key-prefix compression
   –  Utilizes a separate lower-level index to find it exactly
      •  Updated "in place": bloom filters for gets, range maps for scans
§  Overhead
   –  A 1K record read will transfer about 32K from disk in logN seeks
Slide 51: M7 Provides Instant Recovery

§  Instead of having one WAL per region server, or even one per region,
   we have many micro-WALs per region
§  0-40 microWALs per region
   –  Idle WALs are "compacted," so most are empty
   –  The region is up before all microWALs are recovered
   –  Recovers the region in the background, in parallel
   –  When a key is accessed, that microWAL is recovered inline
   –  1000-10000x faster recovery
§  Never performs the equivalent of an HBase major or minor compaction
§  Why doesn't HBase do this? M7 uses MapR-FS, not HDFS
   –  No limit to the # of files on disk
   –  No limit to the # of open files
   –  The I/O path translates random writes to sequential writes on disk
Slide 53: M7: Fileservers Serve Regions

§  A region lives entirely inside a container
   –  Does not coordinate through ZooKeeper
§  Containers support distributed transactions
   –  With replication built in
§  The only coordination in the system is for splits
   –  Between the region-map and the data-container
   –  Already solved this problem for files and their chunks
Slide 57: M7 Containers

§  A container holds many files
   – regular, dir, symlink, btree, chunk-map, region-map, …
   – all random-write capable
§  A container is replicated to servers
   – the unit of resynchronization
§  A region lives entirely inside 1 container
   – all files + WALs + btrees + bloom filters + range maps
Slide 63: Other M7 Features

§  Smaller disk footprint
   – M7 never repeats the key or column name
§  Columnar layout
   – M7 supports 64 column families
   – in-memory column families
§  Online admin
   – M7 schema changes on the fly
   – delete/rename/redistribute tables
§  Run MapReduce and tables on the same cluster
§  UI: hbase shell, MCS GUI, maprcli
Slide 64: Thank You!

Questions?
