Anoop Sam John and Ramkrishna Vasudevan (Intel)
HBase provides an LRU-based on-heap cache, but its size (and so the total data size that can be cached) is limited by Java's max heap space. This talk highlights our work under HBASE-11425 to allow the HBase read path to work directly from the off-heap area.
1. Off Heaping HBase Read path
HBASE-11425
Anoop Sam John
Ramkrishna S Vasudevan
Intel BigData Team – Bangalore, India
2. L2 off-heap cache can give a large cache size
Not constrained by Java max heap size or its possible GC issues.
4 MB physical memory buffers.
Different sized buckets: 5 KB, 9 KB, ... 513 KB. Each bucket has at least 4 slots.
HFile blocks are placed in the appropriately sized bucket.
One block may span across 2 ByteBuffers.
The read path assumes data is in a byte array.
Cells assume their data parts (i.e. row key, family, value etc.) are in a byte array.
A read hitting a block in cache needs an on-heap copy of that block.
A temp array of 64 KB is created and copied into. More garbage.
[Diagram: a 4 MB physical buffer divided into fixed-size buckets, e.g. 513 KB buckets]
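The bucket placement above can be sketched as picking the smallest bucket that fits a block. This is a hypothetical illustration (the size list and method names are not BucketCache's actual code):

```java
// Illustrative sketch of bucket-size selection: an HFile block goes into
// the smallest bucket that can hold it. Sizes here are examples only.
public class BucketPick {
    // Illustrative bucket sizes in bytes (5 KB, 9 KB, ..., 513 KB).
    static final int[] BUCKET_SIZES = {
        5 * 1024, 9 * 1024, 17 * 1024, 33 * 1024,
        65 * 1024, 129 * 1024, 257 * 1024, 513 * 1024
    };

    // Return the smallest bucket size that fits the block, or -1 if none does.
    static int pickBucket(int blockSize) {
        for (int size : BUCKET_SIZES) {
            if (blockSize <= size) {
                return size;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A 64 KB HFile block (65536 bytes) lands in the 65 KB bucket.
        System.out.println(pickBucket(64 * 1024)); // prints 66560
    }
}
```

A default 64 KB block thus wastes only 1 KB of slack in its 65 KB slot; the fixed slot sizes are what keep bucket allocation fragmentation-free.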
3. Read from Bucket Cache
[Diagram: HRegionServer hosting Region1 and Region2; read requests pass through the scanner layers, which copy HFile blocks from the off-heap Bucket Cache into on-heap HFileBlock objects before building the read responses.]
4. Off-Heap Read Path from Bucket Cache
[Diagram: HRegionServer hosting Region1 and Region2; the scanner layers serve read requests and build read responses directly from the off-heap Bucket Cache, with no on-heap block copy.]
End to end off heap - from bucket cache till RPC
5. Selection of data structure for off-heap storage
During reads, individual Cell components are parsed multiple times
Cells are frequently compared for proper ordering
Bucket cache uses NIO DirectByteBuffer for the off-heap cache
JMH benchmark: NIO vs Netty
Test doing reads of int, long and bytes from NIO ByteBuffer and Netty ByteBuf
Test repeated with Unsafe based reads
Conclusion: continue with the existing NIO DBB based buckets in BucketCache
Off Heap Data Structure
Plain reads:

Benchmark      Mode    Score          Error            Units
nettyOffheap   thrpt   57366360.944   ± 11533933.769   ops/s
nioOffheap     thrpt   60089837.738   ± 14171768.229   ops/s

Unsafe based reads:

Benchmark      Mode    Score          Error           Units
nettyOffheap   thrpt   83613659.416   ± 535211.991    ops/s
nioOffheap     thrpt   84514777.734   ± 1199369.976   ops/s
6. Cellify the read path: HBASE-7320, HBASE-11871, HBASE-11805
Cells flow in the read path
Move away from the KeyValue assumption
HFile block backed by ByteBuffer rather than byte[]
Remove all byte[] assumptions in seeking, encoding etc.
Cell extension
Support ByteBuffer backed getXXX APIs.
Added a Cell extension, ByteBufferedCell, exposed within the server only
Create off-heap backed ByteBufferedCells when reading blocks from the off-heap bucket cache
getXXXArray() calls on off-heap buffer backed Cells work with a temp byte[] copy. More garbage
CellUtil APIs for operations like equals and copy check for ByteBufferedCell
CPs and custom filters are suggested to use these APIs.
Note
Filter#filterRowKey(byte[] buffer, int offset, int length) is deprecated in favor of filterRowKey(Cell firstRowCell)
RegionObserver#postScannerFilterRow(ObserverContext<RegionCoprocessorEnvironment>, InternalScanner, byte[], int, short, boolean) is deprecated in favor of postScannerFilterRow(ObserverContext<RegionCoprocessorEnvironment>, InternalScanner, Cell, boolean)
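A minimal sketch of the ByteBufferedCell idea described above. The interface names mirror HBase's, but the bodies are simplified illustrations written for this deck, not the actual HBase implementation:

```java
import java.nio.ByteBuffer;

// Sketch: a CellUtil-style helper checks for the ByteBuffer-backed Cell
// variant first, so off-heap cells are read in place with no temp copy.
public class BBCellSketch {

    interface Cell {
        byte[] getRowArray();   // on-heap access; off-heap cells must copy
        int getRowOffset();
        short getRowLength();
    }

    // Server-only extension exposing the backing ByteBuffer directly.
    interface ByteBufferedCell extends Cell {
        ByteBuffer getRowByteBuffer();
        int getRowPosition();
    }

    // Row-key equality that avoids the temp byte[] path when it can.
    static boolean matchingRow(Cell cell, byte[] row) {
        if (cell.getRowLength() != row.length) {
            return false;
        }
        if (cell instanceof ByteBufferedCell) {
            ByteBufferedCell bbCell = (ByteBufferedCell) cell;
            ByteBuffer buf = bbCell.getRowByteBuffer();
            int pos = bbCell.getRowPosition();
            for (int i = 0; i < row.length; i++) {
                if (buf.get(pos + i) != row[i]) return false;
            }
            return true;
        }
        byte[] arr = cell.getRowArray();   // fallback: on-heap byte[] path
        int off = cell.getRowOffset();
        for (int i = 0; i < row.length; i++) {
            if (arr[off + i] != row[i]) return false;
        }
        return true;
    }

    // A toy off-heap backed cell over a direct ByteBuffer.
    static class DirectCell implements ByteBufferedCell {
        private final ByteBuffer buf;
        private final short len;

        DirectCell(byte[] row) {
            this.buf = ByteBuffer.allocateDirect(row.length);
            for (int i = 0; i < row.length; i++) buf.put(i, row[i]);
            this.len = (short) row.length;
        }

        public ByteBuffer getRowByteBuffer() { return buf; }
        public int getRowPosition() { return 0; }
        public short getRowLength() { return len; }
        public int getRowOffset() { return 0; }

        // getRowArray() needs a temp copy: the garbage the slide warns about.
        public byte[] getRowArray() {
            byte[] copy = new byte[len];
            for (int i = 0; i < len; i++) copy[i] = buf.get(i);
            return copy;
        }
    }

    public static void main(String[] args) {
        Cell cell = new DirectCell("row-1".getBytes());
        System.out.println(matchingRow(cell, "row-1".getBytes())); // prints true
        System.out.println(matchingRow(cell, "row-2".getBytes())); // prints false
    }
}
```

This is why coprocessors and custom filters should go through the CellUtil-style APIs: the dispatch on the cell's backing type happens once, inside the utility.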
Building Blocks for Off Heaping
7. KVComparator -> CellComparator: HBASE-10800, HBASE-13500
JMH benchmark with off-heap buffer compare vs byte[] compare
Using the Unsafe way of comparing
Each buffer with 135 bytes
Both buffers equal
No performance overhead when comparing off-heap backed cells
Benchmark        Mode    Score          Error          Units
offheapCompare   thrpt   38205893.545   ± 265309.769   ops/s
onheapCompare    thrpt   37166847.740   ± 430242.970   ops/s
Building Blocks for Off Heaping
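The comparison the benchmark above exercises is, at its core, an unsigned lexicographic compare over buffer contents. A plain-Java sketch (the real code path can use Unsafe for wider word-at-a-time reads; names here are illustrative):

```java
import java.nio.ByteBuffer;

// Illustrative lexicographic compare over ByteBuffers, byte by byte,
// treating bytes as unsigned like HBase's Bytes.compareTo does.
public class BufCompare {
    static int compare(ByteBuffer a, int aOff, int aLen,
                       ByteBuffer b, int bOff, int bLen) {
        int min = Math.min(aLen, bLen);
        for (int i = 0; i < min; i++) {
            int x = a.get(aOff + i) & 0xff;   // unsigned byte from buffer a
            int y = b.get(bOff + i) & 0xff;   // unsigned byte from buffer b
            if (x != y) return x - y;
        }
        // Equal prefix: the shorter key sorts first.
        return aLen - bLen;
    }

    public static void main(String[] args) {
        ByteBuffer p = ByteBuffer.wrap("row-001".getBytes());
        ByteBuffer q = ByteBuffer.wrap("row-002".getBytes());
        System.out.println(compare(p, 0, 7, q, 0, 7) < 0); // prints true
    }
}
```

Because `ByteBuffer.get(int)` is an absolute read, the same routine works for heap-backed and direct (off-heap) buffers alike, which is what makes a single CellComparator feasible for both cases.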
8. HFile block data might split across 2 ByteBuffers
Avoid copy
Need a single data structure which backs N ByteBuffers
Java NIO ByteBuffer is not extendable
Wrapper class org.apache.hadoop.hbase.nio.ByteBuff
org.apache.hadoop.hbase.nio.SingleByteBuff
org.apache.hadoop.hbase.nio.MultiByteBuff
HFile block's data structure type changed to ByteBuff
[Diagram: NIO ByteBuffer wrapper hierarchy, ByteBuff with SingleByteBuff and MultiByteBuff implementations]
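A toy version of the MultiByteBuff idea, showing how N ByteBuffers can be presented as one logical buffer so a block spanning two buckets is read without copying (this miniature is illustrative; the real class is org.apache.hadoop.hbase.nio.MultiByteBuff):

```java
import java.nio.ByteBuffer;

// Miniature MultiByteBuff: absolute reads that cross the boundary
// between backing ByteBuffers transparently, with no data copy.
public class MiniMultiBuff {
    private final ByteBuffer[] items;
    private final int[] starts;   // absolute start index of each item

    MiniMultiBuff(ByteBuffer... items) {
        this.items = items;
        this.starts = new int[items.length];
        int total = 0;
        for (int i = 0; i < items.length; i++) {
            starts[i] = total;
            total += items[i].remaining();
        }
    }

    // Absolute get: locate the backing buffer, then read in place.
    byte get(int index) {
        int i = items.length - 1;
        while (starts[i] > index) i--;
        return items[i].get(items[i].position() + (index - starts[i]));
    }

    public static void main(String[] args) {
        ByteBuffer a = ByteBuffer.wrap(new byte[]{1, 2, 3});
        ByteBuffer b = ByteBuffer.wrap(new byte[]{4, 5});
        MiniMultiBuff mbb = new MiniMultiBuff(a, b);
        // Index 4 falls past the first buffer's 3 bytes, into the second.
        System.out.println(mbb.get(4)); // prints 5
    }
}
```

SingleByteBuff exists so the common one-buffer case avoids this index lookup entirely; callers only see the shared ByteBuff API.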
9. BucketCache evicts blocks and frees the buckets when out of space
Any block can be evicted. Readers copy block data to a temp byte[]
After HBASE-11425, readers refer to the bucket memory area directly
Can evict only unreferenced blocks
Bucket Cache Block Eviction
Call#setResponse -> RpcCallback#run -> RegionScanner#shipped -> KeyValueHeap#shipped -> StoreScanner#shipped -> KeyValueHeap#shipped -> StoreFileScanner#shipped -> HFileScanner#shipped -> HFile.Reader#returnBlock -> BlockCache#returnBlock (decrement ref count)
Ref count based block cache and block eviction
Increment the ref count when a reader hits a block in the L2 cache
Decrement once the response is created for the RPC
Evict if/when ref count = 0
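The ref-count lifecycle above can be sketched as follows (an illustration, not HBase's actual eviction code; class and method names are hypothetical):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of reference-counted cache blocks: a reader increments the count
// on a cache hit, the RPC layer decrements it once the response is shipped,
// and the evictor may free a bucket only when the count is back to zero.
public class RefCountedBlock {
    private final AtomicInteger refCount = new AtomicInteger(0);

    void retain() {           // reader hits the block in the L2 cache
        refCount.incrementAndGet();
    }

    void release() {          // response created for the RPC; block returned
        refCount.decrementAndGet();
    }

    boolean tryEvict() {      // evictor frees the bucket only if unreferenced
        return refCount.get() == 0;
    }

    public static void main(String[] args) {
        RefCountedBlock block = new RefCountedBlock();
        block.retain();
        System.out.println(block.tryEvict()); // prints false: a reader holds it
        block.release();
        System.out.println(block.tryEvict()); // prints true: safe to free
    }
}
```

This is why the shipped()/returnBlock chain matters: every layer that held a block must pass the release down, or the bucket can never be reclaimed.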
10. Complete Picture
[Diagram: HRegionServer hosting Region1 and Region2; the scanner layers read SingleByteBuff/MultiByteBuff blocks directly from the off-heap Bucket Cache, incrementing the block ref count on a hit (Refcount++); the RPC response callback decrements it (Refcount--) once the response is shipped.]
End to end off heap - from bucket cache till RPC
11. Performance Test Results
PerformanceEvaluation Tool (PE)
Table with one CF and one cell per row. 100 GB total data. Each row with 1K value size
Entire data is loaded into Bucket cache
Single node cluster
CPU : Intel(R) Xeon(R) CPU with 8 cores. RAM : 150 GB
JDK : 1.8
HBase configuration
HBASE_HEAPSIZE = 9 GB
HBASE_OFFHEAPSIZE = 105 GB
hbase.bucketcache.size = 102GB
GC – Default HBase GC setting (CMS)
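As a sketch, the bucket cache sizing above corresponds to hbase-site.xml entries like the following. `hbase.bucketcache.ioengine` and `hbase.bucketcache.size` are real HBase properties (the size value is in MB here, 102 GB ≈ 104448 MB); the heap sizes themselves are set via HBASE_HEAPSIZE/HBASE_OFFHEAPSIZE in hbase-env.sh, and this fragment is illustrative of the test setup rather than a complete configuration:

```xml
<!-- hbase-site.xml: enable the off-heap bucket cache for this test setup. -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>   <!-- DirectByteBuffer-backed buckets -->
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>104448</value>    <!-- MB; roughly the 102 GB used above -->
</property>
```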
Multi get with 100 rows
Every thread doing 100 K operations = 10 million rows read per thread
Measured: average completion time of each thread (in secs)
Converted to throughput: a gain of 102% - 460%
HBase Random GET Average Completion Time (s) (the lower the better):

Threads   Before HBASE-11425   After HBASE-11425
5         89.38                44.04
10        139.81               50.55
20        285.66               70.23
25        361.23               88.6
50        817.91               165.4
75        1372.81              244.72
12. Performance Test Results
PerformanceEvaluation Tool (PE)
Random range scan with a 10 K range
With a filterAll filter (no data returned to the client)
Each thread doing the range scan 1000 times
Entire data is loaded into Bucket cache
Range Scan (server side only) Average Completion Time (s) (the lower the better):

Threads   Before HBASE-11425   After HBASE-11425
10        449.1                319.87
20        728.64               451.58
25        908.26               560.46
50        1904.93              1158
13. Performance Test Results
PerformanceEvaluation Tool (PE)
Random range scan with a 10 K range
Returning 10% of rows back to the client
Each thread doing the range scan 1000 times
Entire data is loaded into Bucket cache
HBase Range Scan with filter Average Completion Time (s) (the lower the better):

Threads   Before HBASE-11425   After HBASE-11425
10        449.1                319.87
20        728.64               451.58
25        908.26               560.46
50        1904.93              1158
14. Performance Test Results
YCSB Test
Table with one CF and 10 columns per row. Each row with 1K value. 90 GB total data
Entire data is loaded into Bucket cache
Single node cluster
CPU : Intel(R) Xeon(R) CPU with 8 cores. RAM : 150 GB
JDK : 1.8
HBase configuration
HBASE_HEAPSIZE = 9 GB
HBASE_OFFHEAPSIZE = 105 GB
hbase.bucketcache.size = 102GB
YCSB Random GET Throughput (ops/s) (the higher the better):

Threads   Before HBASE-11425   After HBASE-11425
10        23277.97             28045.53
25        25922.18             45767.99
50        24558.72             58904.03
75        24316.74             63280.86

Multi get with 100 rows
Every thread doing 5 million operations
20 - 160% improvement
15. PE test comparing L1 cache vs off-heap L2 cache with 20 GB data
Multi get with 100 rows
Entire data fits in the cache under test
Each thread doing 10 million operations = 1 billion rows read per thread
L1 test: Max heap 32 GB. L2 test: Max heap 12 GB
Performance Test Results
HBase Random GET Average Completion Time (s) (the lower the better):

Threads   L1 cache   L2 cache
10        300.5      307.6
25        559.3      523.9
50        1195.9     1144.2
75        1793.1     1707.6
16. GC Graphs
[GC graphs: MultiGets before HBASE-11425 (25 threads) vs MultiGets after HBASE-11425 (25 threads)]
17. GC Graphs
[GC graphs: ScanRange10000 before HBASE-11425 (20 threads) vs ScanRange10000 after HBASE-11425 (20 threads)]
18. The feature will be available in the HBase 2.0 release
Make the bucket cache the default in HBase 2.0 – refer HBASE-11323
Rocketfuel started using this for a random read workload
Backported to a 1.x based version
More details:
https://blogs.apache.org/hbase/entry/offheaping_the_read_path_in
Feature Availability
19. Future work
Off heaping write path – HBASE-11579
Off heap MSLAB pool
Read request bytes into off heap buffer pool
Lazy creation of ByteBuffer pools
Fixed sized off heap ByteBuffers from pool
Protobuf changes to handle off heap ByteBuffers
In-memory flushes/compaction (HBASE-14918) from Yahoo
Questions?
Future work & QA