The document describes Accordion, a novel in-memory compaction algorithm that improves HBase's write performance. Accordion reapplies the log-structured merge (LSM) tree design, which already governs HBase's on-disk organization, to the MemStore, HBase's in-memory data structure. Compacting data while it is still in RAM keeps it in memory longer and transforms random I/O into sequential I/O. As a result, Accordion achieves higher write throughput, lower disk I/O, and reduced read latency compared to the default MemStore. Accordion has been integrated into HBase 2.0 as the new default MemStore implementation.
4. What is Accordion?
Novel Write-Path Algorithm
Better Performance of Write-Intensive Workloads
Write Throughput ↑, Read Latency ↓
Better Disk Use
Write amplification ↓
GA in HBase 2.0 (becomes default MemStore implementation)
5. In a Nutshell
Inspired by Log-Structured-Merge (LSM) Tree Design
Transforms random I/O to sequential I/O (efficient!)
Governs the HBase storage organization
Accordion reapplies the LSM Tree design to RAM data
→ Efficient resource use – data lives in memory longer
→ Less disk I/O
→ Ultimately, higher speed
6. How LSM Trees Work
Diagram: Puts are absorbed by the MemStore in RAM; Get/Scan reads merge the MemStore with the on-disk HFiles; a flush writes the MemStore contents to a new HFile; compaction merges HFiles within an HRegion. Data updates are stored as versions; compaction eliminates the redundancies. (A minimal write-path sketch follows.)
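To make the flow concrete, here is a minimal, self-contained Java sketch of an LSM-style write path. It is a toy model, not HBase code: writes land in an in-memory sorted map, a full map is flushed as an immutable sorted "file", reads consult memory first and then the files from newest to oldest, and compaction merges files while dropping redundant versions.

import java.util.*;
import java.util.concurrent.ConcurrentSkipListMap;

/** Toy LSM store; illustrative only, not the HBase implementation. */
class ToyLsmStore {
    private static final int FLUSH_THRESHOLD = 4;          // flush after this many entries
    private NavigableMap<String, String> memStore = new ConcurrentSkipListMap<>();
    private final Deque<NavigableMap<String, String>> hFiles = new ArrayDeque<>(); // newest first

    void put(String key, String value) {
        memStore.put(key, value);                           // a random write becomes an in-memory insert
        if (memStore.size() >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    String get(String key) {
        String v = memStore.get(key);                       // newest data first
        if (v != null) return v;
        for (NavigableMap<String, String> file : hFiles) {  // then on-"disk" files, newest to oldest
            v = file.get(key);
            if (v != null) return v;
        }
        return null;
    }

    private void flush() {                                  // sequential write of a sorted snapshot
        hFiles.addFirst(Collections.unmodifiableNavigableMap(memStore));
        memStore = new ConcurrentSkipListMap<>();
    }

    void compact() {                                        // merge files, dropping redundant versions
        NavigableMap<String, String> merged = new TreeMap<>();
        for (Iterator<NavigableMap<String, String>> it = hFiles.descendingIterator(); it.hasNext(); ) {
            merged.putAll(it.next());                       // newer files overwrite older entries
        }
        hFiles.clear();
        hFiles.addFirst(merged);
    }

    public static void main(String[] args) {
        ToyLsmStore store = new ToyLsmStore();
        for (int i = 0; i < 10; i++) store.put("row" + i, "v" + i);   // triggers a few flushes
        store.put("row1", "v1-updated");                              // newer version of row1
        store.compact();
        System.out.println(store.get("row1"));                        // prints v1-updated
    }
}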
11. Redundancy Elimination
In-Memory Compaction merges the pipelined segments
Keeps Get access latency under control (fewer segments to scan)
BASIC compaction
Multiple indexes merged into one, cell data remains in place
EAGER compaction
Redundant data versions eliminated (via a ScanQueryMatcher (SQM) scan)
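A toy Java contrast of the two modes (illustrative, not the HBase classes): BASIC-style compaction merges the per-segment indexes into a single index and keeps every cell version (in HBase the cell data itself is not copied), while EAGER-style compaction also walks the merged view and retains only the newest version of each cell, the work the SQM-based scan performs.

import java.util.*;

/** Toy model of in-memory segment compaction; not the HBase implementation. */
class ToySegmentCompaction {
    /** A "cell": (rowKey, seqId) -> value. Higher seqId = newer version. */
    record Cell(String rowKey, long seqId, String value) {}

    /** BASIC-style: merge the segment indexes into one, keep every version. */
    static List<Cell> basicCompact(List<List<Cell>> segments) {
        List<Cell> merged = new ArrayList<>();
        segments.forEach(merged::addAll);
        merged.sort(Comparator.comparing(Cell::rowKey)
                              .thenComparing(Comparator.comparingLong(Cell::seqId).reversed()));
        return merged;                       // one index over all versions
    }

    /** EAGER-style: merge and drop redundant versions, keeping only the newest per row key. */
    static List<Cell> eagerCompact(List<List<Cell>> segments) {
        Map<String, Cell> newest = new TreeMap<>();
        for (List<Cell> segment : segments) {
            for (Cell c : segment) {
                newest.merge(c.rowKey(), c, (a, b) -> a.seqId() >= b.seqId() ? a : b);
            }
        }
        return new ArrayList<>(newest.values());
    }

    public static void main(String[] args) {
        List<List<Cell>> pipeline = List.of(
            List.of(new Cell("r1", 1, "a"), new Cell("r2", 2, "b")),   // older segment
            List.of(new Cell("r1", 3, "a'"))                           // newer segment updates r1
        );
        System.out.println(basicCompact(pipeline).size());  // 3 cells: all versions kept
        System.out.println(eagerCompact(pipeline).size());  // 2 cells: old r1 version dropped
    }
}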
12. BASIC vs EAGER
BASIC: universal optimization, avoids physical data copy
EAGER: high value for highly redundant workloads
SQM scan is expensive
Data relocation cost may be high (think MSLAB!)
Configuration
BASIC is the default; EAGER may be configured (a configuration sketch follows this list)
A future implementation may select the right mode automatically
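Assuming the HBase 2.x client API, a sketch of enabling EAGER per column family might look like the following; the table name usertable and family cf are placeholders, and the cluster-wide default can alternatively be set via hbase.hregion.compacting.memstore.type in hbase-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.MemoryCompactionPolicy;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateEagerTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Column family 'cf' uses EAGER in-memory compaction; families that do not
            // override the policy keep the default (BASIC).
            ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("cf"))
                .setInMemoryCompaction(MemoryCompactionPolicy.EAGER)
                .build();
            TableDescriptor table = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("usertable"))
                .setColumnFamily(cf)
                .build();
            admin.createTable(table);
        }
    }
}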
13. Compaction Pipeline: Correctness & Performance
Shared Data Structure
Read access: Get, Scan, in-memory compaction
Write access: in-memory flush, in-memory compaction, disk flush
Design Choice: Non-Blocking Reads
Read-Only pipeline clone – no synchronization upon read access
Copy-on-Write upon modification
Versioning prevents a compaction from racing with concurrent updates (see the sketch below)
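The following toy Java sketch illustrates the design choice (it is not the actual CompactionPipeline class): readers grab an immutable snapshot of the segment list without locking, an in-memory flush replaces the list copy-on-write, and a version counter lets an in-memory compaction detect that the pipeline changed underneath it and abandon its swap.

import java.util.*;
import java.util.concurrent.atomic.AtomicReference;

/** Toy copy-on-write segment pipeline with optimistic, versioned compaction. */
class ToyCompactionPipeline<S> {
    /** Immutable snapshot: the segment list plus the version it was taken at. */
    record Snapshot<T>(List<T> segments, long version) {}

    private final AtomicReference<Snapshot<S>> state =
        new AtomicReference<>(new Snapshot<S>(List.of(), 0L));

    /** Readers (Get/Scan) take a snapshot; no locks, never blocked by writers. */
    Snapshot<S> read() {
        return state.get();
    }

    /** In-memory flush: prepend a new immutable segment (copy-on-write). */
    void pushHead(S segment) {
        state.updateAndGet(s -> {
            List<S> copy = new ArrayList<>(s.segments());
            copy.add(0, segment);
            return new Snapshot<S>(List.copyOf(copy), s.version() + 1);
        });
    }

    /**
     * In-memory compaction: replace the segments observed in {@code observed} with
     * {@code compacted}. Returns false if the pipeline changed since the snapshot
     * was taken, i.e., a concurrent flush or compaction won the race.
     */
    boolean swap(Snapshot<S> observed, S compacted) {
        Snapshot<S> next = new Snapshot<S>(List.of(compacted), observed.version() + 1);
        return state.compareAndSet(observed, next);
    }
}

A compaction that loses the race simply discards its result; that is the trade-off that keeps the read path free of synchronization.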
15. The Software Side: What’s New?
CompactingMemStore: BASIC and EAGER configurations
DefaultMemStore: NONE configuration
Segment Class Hierarchy: Mutable, Immutable, Composite
NavigableMap Implementations: CellArrayMap, CellChunkMap
MemStoreCompactor: compaction algorithms implementation
16. CellChunkMap Support (Experimental)
Cell objects embedded directly into CellChunkMap (CCM)
New cell type – references its data by a unique ChunkID
ChunkCreator: chunk allocation + ChunkID management (sketched below)
Stores a mapping of ChunkIDs to Chunk references
Strong references to chunks managed by CCMs, weak references to the rest
The CCMs themselves are allocated via the same mechanism
Some exotic use cases
E.g., jumbo cells allocated in one-time chunks outside chunk pools
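To illustrate the ChunkID indirection, here is a toy Java registry (not the actual ChunkCreator): cells carry a small integer ChunkID instead of an object reference, the registry resolves IDs back to chunks, and it pins index-referenced chunks with strong references while holding the rest weakly so the garbage collector may reclaim them.

import java.lang.ref.WeakReference;
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy ChunkID -> Chunk registry; illustrative only, not HBase's ChunkCreator. */
class ToyChunkRegistry {
    record Chunk(int id, ByteBuffer data) {}

    private final AtomicInteger nextId = new AtomicInteger();
    // Chunks referenced by a CellChunkMap index are held strongly (must stay resolvable)...
    private final Map<Integer, Chunk> strongRefs = new ConcurrentHashMap<>();
    // ...all other chunks are held weakly, so the GC may reclaim them.
    private final Map<Integer, WeakReference<Chunk>> weakRefs = new ConcurrentHashMap<>();

    /** Allocate a chunk and register it; cells store only the returned ChunkID. */
    Chunk allocate(int sizeBytes, boolean referencedByIndex) {
        Chunk chunk = new Chunk(nextId.incrementAndGet(), ByteBuffer.allocate(sizeBytes));
        if (referencedByIndex) {
            strongRefs.put(chunk.id(), chunk);
        } else {
            weakRefs.put(chunk.id(), new WeakReference<>(chunk));
        }
        return chunk;
    }

    /** Resolve a ChunkID back to its chunk (null if it was weakly held and collected). */
    Chunk resolve(int chunkId) {
        Chunk chunk = strongRefs.get(chunkId);
        if (chunk != null) return chunk;
        WeakReference<Chunk> ref = weakRefs.get(chunkId);
        return ref == null ? null : ref.get();
    }
}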
17. Evaluation Setup
System
2-node HBase on top of 3-node HDFS, 1Gbps interconnect
Intel Xeon E5620 (12-core), 2.8TB SSD storage, 48GB RAM
RS config: 16GB RAM (40% Cache/40% MemStore), on-heap, no MSLAB
Data
1 table (100 regions, 50 columns), 30GB-100GB
Workload Driver
YCSB (1 node, 12 threads)
Batched (async) writes (10KB buffer)
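For reference, batched (asynchronous) client writes with a 10KB write buffer can be set up roughly as follows with the HBase 2.x client API; the table and column names are placeholders (YCSB's HBase binding provides equivalent client-side buffering).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Buffer Puts client-side and flush them in batches once ~10KB have accumulated.
        BufferedMutatorParams params =
            new BufferedMutatorParams(TableName.valueOf("usertable")).writeBufferSize(10 * 1024);
        try (Connection conn = ConnectionFactory.createConnection(conf);
             BufferedMutator mutator = conn.getBufferedMutator(params)) {
            for (int i = 0; i < 100_000; i++) {
                Put put = new Put(Bytes.toBytes("row-" + i));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value-" + i));
                mutator.mutate(put);   // queued; sent asynchronously when the buffer fills
            }
            mutator.flush();           // push any remaining buffered mutations
        }
    }
}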
18. Experiments
Metrics
Write throughput, read latency (distribution), disk footprint/amplification
Workloads (varied at client side)
Write-Only (100% Put) vs Mixed (50% Put/50% Get)
Uniform vs Zipfian Key Distributions
Small Values (100B) vs Big Values (1KB)
Configurations (varied at server side)
Most experiments exercise Async WAL
19. Write Throughput
Chart: write throughput (ops/sec, 0–160,000) under Zipf and Uniform key distributions for the NONE, BASIC, and EAGER configurations. Setup: 100GB dataset, 100% writes, 100B values; every write updates a single column. Annotated gains: +25% and +44%. Gains are less pronounced with big values (1KB): +11% (why?).
22. Disk Footprint/Write Amplification
Chart: number of flushes, number of on-disk compactions, and total data written (GB, 0–1200) for the NONE, BASIC, and EAGER configurations. Setup: 100GB dataset, Zipf distribution, 100% writes, 100B values. Annotated result: 29% less data written to disk.
23. Status
In-Memory Compaction GA in HBase 2.0
Master JIRA HBASE-14918 complete (~20 subtasks)
Major refactoring/extension of the MemStore code
Many details in Apache HBase blog posts
CellChunkMap Index, Off-Heap support in progress
Master JIRA HBASE-16421
24. Summary
Accordion = a leaner and faster write path
Space-Efficient Index + Redundancy Elimination → less I/O
Less Frequent Flushes → increased write throughput
Less On-Disk Compaction → reduced write amplification
Data stays longer in RAM → reduced tail read latency
Edging Closer to In-Memory Database Performance