2. Overview
[Diagram: a Ceph cluster. Storage Nodes hold disks, each served by an OSD with its own ObjectStore; Monitor Nodes run the monitors (M).]
• ObjectStore stores data on the local node.
• BlueStore is the default object store.
3. BlueStore
• BlueStore = Block + NewStore.
• Data is written directly to the raw block device.
• Metadata is written to a key/value database (RocksDB):
Ø object data
Ø omap
Ø deferred logs
Ø others
• RocksDB sits on top of the lightweight file system BlueFS.
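As a rough illustration of this split (a toy model with made-up names, not Ceph code), object data can be thought of as going straight to the raw device while the metadata describing it lands in a key/value store:

```python
# Toy model of BlueStore's data/metadata split. All names here are
# illustrative; the real BlueStore is C++ inside Ceph.

class ToyBlueStore:
    def __init__(self, device_size):
        self.device = bytearray(device_size)  # raw block device
        self.kv = {}                          # stands in for RocksDB

    def write(self, obj, offset, data):
        # Data is written directly to the raw block device ...
        self.device[offset:offset + len(data)] = data
        # ... while metadata (the object -> extent mapping) goes to KV.
        self.kv[f"onode/{obj}"] = {"offset": offset, "length": len(data)}

    def read(self, obj):
        meta = self.kv[f"onode/{obj}"]
        return bytes(self.device[meta["offset"]:meta["offset"] + meta["length"]])

store = ToyBlueStore(device_size=4096)
store.write("obj1", offset=0, data=b"hello")
print(store.read("obj1"))  # b'hello'
```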
5. BlueStore – small write
• RocksDB acts as the journal (deferred IO).
• Client data is written to RocksDB, and the write returns to the OSD.
• Later, the client data is written into the block device.
• The deferred-IO entry is then deleted from RocksDB.
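The deferred path above can be sketched as follows (a toy model with made-up names; real BlueStore implements this in C++):

```python
# Toy deferred (small) write path: the KV store doubles as a journal.
# Illustrative only.

class ToySmallWrite:
    def __init__(self, device_size):
        self.device = bytearray(device_size)
        self.kv = {}     # RocksDB stand-in; deferred entries live here
        self.acked = []

    def write(self, txid, offset, data):
        # 1. Client data is written into RocksDB as a deferred entry ...
        self.kv[f"deferred/{txid}"] = (offset, bytes(data))
        # 2. ... and the write is acknowledged to the OSD immediately.
        self.acked.append(txid)

    def flush_deferred(self):
        for key in list(self.kv):
            if key.startswith("deferred/"):
                offset, data = self.kv[key]
                # 3. Later, the data is applied to the block device ...
                self.device[offset:offset + len(data)] = data
                # 4. ... and the deferred entry is deleted from RocksDB.
                del self.kv[key]

w = ToySmallWrite(4096)
w.write(txid=1, offset=0, data=b"abcd")
assert 1 in w.acked and "deferred/1" in w.kv  # acked before hitting device
w.flush_deferred()
print(bytes(w.device[:4]))  # b'abcd'
```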
STATE_PREPARE
  ↓ _txc_add_transaction (add deferred IO to the KV write batch)
STATE_AIO_WAIT (no AIOs)
  ↓
STATE_IO_DONE
  ↓ _txc_finish_io
STATE_KV_QUEUED
  ↓ put on kv_queue, kv_cond.notify
STATE_KV_SUBMITTED
  ↓ _kv_sync_thread / _kv_finalize_thread
STATE_KV_DONE
  ↓ _txc_committed_kv
STATE_DEFERRED_QUEUED
  ↓ _deferred_queue / _kv_finalize_thread
STATE_DEFERRED_CLEANUP
  ↓ _txc_finish
STATE_DONE
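The state sequence above can be captured as a simple ordered table (the state and handler names come from the diagram; the driver itself is illustrative):

```python
# Small-write transaction states, walked in order. Handler names from the
# BlueStore diagram appear as comments; the walk function is illustrative.

SMALL_WRITE_STATES = [
    "STATE_PREPARE",
    "STATE_AIO_WAIT",         # no AIOs for a deferred small write
    "STATE_IO_DONE",          # _txc_finish_io
    "STATE_KV_QUEUED",        # put on kv_queue, kv_cond.notify
    "STATE_KV_SUBMITTED",     # _kv_sync_thread / _kv_finalize_thread
    "STATE_KV_DONE",          # _txc_committed_kv
    "STATE_DEFERRED_QUEUED",  # _deferred_queue / _kv_finalize_thread
    "STATE_DEFERRED_CLEANUP", # _txc_finish
    "STATE_DONE",
]

def next_state(state):
    i = SMALL_WRITE_STATES.index(state)
    return SMALL_WRITE_STATES[i + 1] if i + 1 < len(SMALL_WRITE_STATES) else None

print(next_state("STATE_KV_DONE"))  # STATE_DEFERRED_QUEUED
```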
6. BlueStore – big write
• No journal is needed.
• Client data is written into newly allocated space.
• The write returns to the OSD once the metadata is written into RocksDB.
• The old space is then released.
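A toy sketch of the big-write path (hypothetical names, and a much simplified allocator): write to fresh space, commit metadata, then free the old extent.

```python
# Toy big (non-deferred) write: data goes to freshly allocated space,
# the write is acked once metadata commits, then the old extent is freed.
# Illustrative only.

class ToyBigWrite:
    def __init__(self, device_size):
        self.device = bytearray(device_size)
        self.kv = {}
        self.free = [(0, device_size)]  # (offset, length) free extents

    def _alloc(self, length):
        off, avail = self.free.pop(0)
        assert avail >= length
        if avail > length:
            self.free.insert(0, (off + length, avail - length))
        return off

    def write(self, obj, data):
        # 1. Write client data into newly allocated space (no journal).
        new_off = self._alloc(len(data))
        self.device[new_off:new_off + len(data)] = data
        # 2. Commit metadata to RocksDB; this is when the OSD gets the ack.
        old = self.kv.get(f"onode/{obj}")
        self.kv[f"onode/{obj}"] = (new_off, len(data))
        # 3. Release the old space.
        if old is not None:
            self.free.append(old)

s = ToyBigWrite(1024)
s.write("obj1", b"x" * 64)  # first write: extent at offset 0
s.write("obj1", b"y" * 64)  # overwrite: new extent, old one freed
print(s.kv["onode/obj1"])   # (64, 64)
print(s.free[-1])           # (0, 64), the released old extent
```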
STATE_PREPARE
  ↓ _txc_add_transaction
STATE_AIO_WAIT
  ↓ _txc_aio_submit / txc_aio_finish
STATE_IO_DONE
  ↓ _txc_finish_io
STATE_KV_QUEUED
  ↓ put on kv_queue, kv_cond.notify
STATE_KV_SUBMITTED
  ↓ _kv_sync_thread / _kv_finalize_thread
STATE_KV_DONE
  ↓ _txc_committed_kv
STATE_FINISHING
  ↓ _txc_finish
STATE_DONE
7. OSD tuning 1
• 4k random writes.
• With the default config (perf dump data): top chart.
• The BlueStore finisher is single-threaded by default.
• After setting bluestore_shard_finishers: bottom charts.
[Chart: OSD latency breakdown, bluestore_lat vs. others]
[Chart: IOPS(k), 4k_randw vs. 4k_finisher_false]
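If the option is set through ceph.conf, the fragment might look like this (a sketch: the option name comes from the slide, while the boolean value and the section are assumptions):

```ini
# Assumed ceph.conf fragment; bluestore_shard_finishers is the option
# named on the slide, the value and the [osd] section are assumptions.
[osd]
bluestore_shard_finishers = true
```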
10. Random write tuning – cont.
• 4k random writes.
• Total memory usage is kept consistent.
• RocksDB options:
Ø max_write_buffer_number (default 8)
• The improvement is small.
[Chart: IOPS(k)]
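A sketch of how that RocksDB option might be raised (the option name and its default of 8 come from the slide; the `bluestore_rocksdb_options` key and the value 16 are assumptions):

```ini
# Assumed ceph.conf fragment; max_write_buffer_number (default 8) is
# from the slide, the chosen value 16 is illustrative.
[osd]
bluestore_rocksdb_options = max_write_buffer_number=16
```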
11. Random write – 4k
• When data is written into RocksDB, it is first added to memtables.
• Once a flush condition is triggered, the data in the memtables is flushed into L0 SST files.
• Deferred logs are the main data in every memtable, while object data is the main data in the L0 files.
[Chart: data written into the DB over time, broken down into plogs, ologs, dlogs, mlogs, and others]
[Chart: data written into L0 SST files over time, broken down into plogs, ologs, dlogs, and others]
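One way to see why deferred logs dominate the memtables yet mostly vanish from L0: a deferred entry is written and then deleted again within the memtable's lifetime, so the put/delete pair cancels at flush. A toy sketch (a simplification: real RocksDB may still keep tombstones at flush when older versions of a key can exist in lower levels):

```python
# Toy memtable flush: put/delete pairs inside one memtable cancel, so
# mostly object data survives into the L0 SST. Illustrative only.

def flush_to_l0(ops):
    """ops: list of ('put'|'delete', key, value) applied in order."""
    memtable = {}
    for op, key, value in ops:
        if op == "put":
            memtable[key] = value
        else:                     # a delete inside the same memtable
            memtable[key] = None  # tombstone overwrites the put
    # At flush time, cancelled pairs drop; only live keys reach L0.
    return {k: v for k, v in memtable.items() if v is not None}

ops = [
    ("put", "deferred/1", b"small-write payload"),
    ("put", "onode/obj1", b"object metadata"),
    ("delete", "deferred/1", None),  # deferred IO applied, entry removed
]
print(flush_to_l0(ops))  # {'onode/obj1': b'object metadata'}
```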
13. RocksDB – flush data recursively
• Random writes, 4k/16k.
• A new flush style is added: duplicated entries are deleted recursively.
• Performance is similar to merge num = 2, but the data written into L0 is decreased (5 GB per 10 minutes).
[Chart: IOPS(k) for 4k and 16k, Default vs. dedup_2]
[Diagram: old flush writes memtable1–memtable4 into SST1–SST4; the new flush merges and dedups memtable1–memtable4 into a single SST1]
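The idea can be sketched like this (a toy model with made-up names, not the actual RocksDB patch): merging several memtables at flush time and dropping duplicate keys and put/delete pairs means fewer bytes reach L0.

```python
# Toy "dedup" flush: instead of one L0 file per memtable, several
# memtables are merged and duplicates/tombstoned keys dropped.
# Illustrative only.

def flush_each(memtables):
    """Old style: one L0 file per memtable (tombstones included)."""
    return [dict(m) for m in memtables]

def flush_dedup(memtables):
    """New style: merge all memtables, newest value wins, tombstones drop."""
    merged = {}
    for m in memtables:  # memtables ordered oldest -> newest
        merged.update(m)
    live = {k: v for k, v in merged.items() if v is not None}
    return [live]        # a single, smaller L0 file

mts = [
    {"k1": b"v1", "deferred/1": b"payload"},
    {"k1": b"v2", "deferred/1": None},  # overwrite + delete tombstone
]
old = flush_each(mts)
new = flush_dedup(mts)
print(sum(len(f) for f in old), sum(len(f) for f in new))  # 4 1
```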
14. Future work
• RocksDB is still heavy for pg logs and object data.
Ø pg log entries are written once, and read only when the OSD node goes through recovery.
Ø For object data, a one-time journal may be enough.