2. Overview
[Diagram: a Ceph cluster. Storage Nodes hold disks, each served by an OSD with its own ObjectStore; Monitor Nodes run the monitors (M).]
• ObjectStore stores data on the local node.
• BlueStore is the default object store.
3. BlueStore
• BlueStore = Block + NewStore.
• Data is written directly to the raw block device.
• Metadata is written to a key/value database (RocksDB):
Ø object data
Ø omap
Ø deferred logs
Ø others
• RocksDB sits on top of the lightweight file system BlueFS.
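As a rough illustration of this split (a toy model with made-up names, not Ceph code), object data can be thought of as going straight to the raw device while the metadata describing it lands in a key/value store:

```python
# Toy model of BlueStore's data/metadata split. All names here are
# illustrative; the real BlueStore is C++ inside Ceph.

class ToyBlueStore:
    def __init__(self, device_size):
        self.device = bytearray(device_size)  # raw block device
        self.kv = {}                          # stands in for RocksDB

    def write(self, obj, offset, data):
        # Data is written directly to the raw block device ...
        self.device[offset:offset + len(data)] = data
        # ... while metadata (the object -> extent mapping) goes to KV.
        self.kv[f"onode/{obj}"] = {"offset": offset, "length": len(data)}

    def read(self, obj):
        meta = self.kv[f"onode/{obj}"]
        return bytes(self.device[meta["offset"]:meta["offset"] + meta["length"]])

store = ToyBlueStore(device_size=4096)
store.write("obj1", offset=0, data=b"hello")
print(store.read("obj1"))  # b'hello'
```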
5. BlueStore – small write
• RocksDB acts as the journal (deferred IO).
• Client data is written to RocksDB, and the write returns to the OSD.
• Later, the client data is written into the block device.
• The deferred-IO entry is then deleted from RocksDB.
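The deferred path above can be sketched as follows (a toy model with made-up names; real BlueStore implements this in C++):

```python
# Toy deferred (small) write path: the KV store doubles as a journal.
# Illustrative only.

class ToySmallWrite:
    def __init__(self, device_size):
        self.device = bytearray(device_size)
        self.kv = {}     # RocksDB stand-in; deferred entries live here
        self.acked = []

    def write(self, txid, offset, data):
        # 1. Client data is written into RocksDB as a deferred entry ...
        self.kv[f"deferred/{txid}"] = (offset, bytes(data))
        # 2. ... and the write is acknowledged to the OSD immediately.
        self.acked.append(txid)

    def flush_deferred(self):
        for key in list(self.kv):
            if key.startswith("deferred/"):
                offset, data = self.kv[key]
                # 3. Later, the data is applied to the block device ...
                self.device[offset:offset + len(data)] = data
                # 4. ... and the deferred entry is deleted from RocksDB.
                del self.kv[key]

w = ToySmallWrite(4096)
w.write(txid=1, offset=0, data=b"abcd")
assert 1 in w.acked and "deferred/1" in w.kv  # acked before hitting device
w.flush_deferred()
print(bytes(w.device[:4]))  # b'abcd'
```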
STATE_PREPARE
  ↓ _txc_add_transaction (add deferred IO to the KV write batch)
STATE_AIO_WAIT (no AIOs)
  ↓
STATE_IO_DONE
  ↓ _txc_finish_io
STATE_KV_QUEUED
  ↓ put on kv_queue, kv_cond.notify
STATE_KV_SUBMITTED
  ↓ _kv_sync_thread / _kv_finalize_thread
STATE_KV_DONE
  ↓ _txc_committed_kv
STATE_DEFERRED_QUEUED
  ↓ _deferred_queue / _kv_finalize_thread
STATE_DEFERRED_CLEANUP
  ↓ _txc_finish
STATE_DONE
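The state sequence above can be captured as a simple ordered table (the state and handler names come from the diagram; the driver itself is illustrative):

```python
# Small-write transaction states, walked in order. Handler names from the
# BlueStore diagram appear as comments; the walk function is illustrative.

SMALL_WRITE_STATES = [
    "STATE_PREPARE",
    "STATE_AIO_WAIT",         # no AIOs for a deferred small write
    "STATE_IO_DONE",          # _txc_finish_io
    "STATE_KV_QUEUED",        # put on kv_queue, kv_cond.notify
    "STATE_KV_SUBMITTED",     # _kv_sync_thread / _kv_finalize_thread
    "STATE_KV_DONE",          # _txc_committed_kv
    "STATE_DEFERRED_QUEUED",  # _deferred_queue / _kv_finalize_thread
    "STATE_DEFERRED_CLEANUP", # _txc_finish
    "STATE_DONE",
]

def next_state(state):
    i = SMALL_WRITE_STATES.index(state)
    return SMALL_WRITE_STATES[i + 1] if i + 1 < len(SMALL_WRITE_STATES) else None

print(next_state("STATE_KV_DONE"))  # STATE_DEFERRED_QUEUED
```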
6. BlueStore – big write
• No journal is needed.
• Client data is written into newly allocated space.
• The write returns to the OSD once the metadata is written into RocksDB.
• The old space is then released.
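A toy sketch of the big-write path (hypothetical names, and a much simplified allocator): write to fresh space, commit metadata, then free the old extent.

```python
# Toy big (non-deferred) write: data goes to freshly allocated space,
# the write is acked once metadata commits, then the old extent is freed.
# Illustrative only.

class ToyBigWrite:
    def __init__(self, device_size):
        self.device = bytearray(device_size)
        self.kv = {}
        self.free = [(0, device_size)]  # (offset, length) free extents

    def _alloc(self, length):
        off, avail = self.free.pop(0)
        assert avail >= length
        if avail > length:
            self.free.insert(0, (off + length, avail - length))
        return off

    def write(self, obj, data):
        # 1. Write client data into newly allocated space (no journal).
        new_off = self._alloc(len(data))
        self.device[new_off:new_off + len(data)] = data
        # 2. Commit metadata to RocksDB; this is when the OSD gets the ack.
        old = self.kv.get(f"onode/{obj}")
        self.kv[f"onode/{obj}"] = (new_off, len(data))
        # 3. Release the old space.
        if old is not None:
            self.free.append(old)

s = ToyBigWrite(1024)
s.write("obj1", b"x" * 64)  # first write: extent at offset 0
s.write("obj1", b"y" * 64)  # overwrite: new extent, old one freed
print(s.kv["onode/obj1"])   # (64, 64)
print(s.free[-1])           # (0, 64), the released old extent
```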
STATE_PREPARE
  ↓ _txc_add_transaction
STATE_AIO_WAIT
  ↓ _txc_aio_submit / txc_aio_finish
STATE_IO_DONE
  ↓ _txc_finish_io
STATE_KV_QUEUED
  ↓ put on kv_queue, kv_cond.notify
STATE_KV_SUBMITTED
  ↓ _kv_sync_thread / _kv_finalize_thread
STATE_KV_DONE
  ↓ _txc_committed_kv
STATE_FINISHING
  ↓ _txc_finish
STATE_DONE
7. OSD tuning 1
• 4k random writes.
• With the default config (perf dump data): top chart.
• The BlueStore finisher is single-threaded by default.
• After setting bluestore_shard_finishers: bottom charts.
[Chart: OSD latency breakdown, bluestore_lat vs. others]
[Chart: IOPS(k), 4k_randw vs. 4k_finisher_false]
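If the option is set through ceph.conf, the fragment might look like this (a sketch: the option name comes from the slide, while the boolean value and the section are assumptions):

```ini
# Assumed ceph.conf fragment; bluestore_shard_finishers is the option
# named on the slide, the value and the [osd] section are assumptions.
[osd]
bluestore_shard_finishers = true
```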
10. Random write tuning – cont.
• 4k random writes.
• Total memory usage is kept consistent.
• RocksDB options:
Ø max_write_buffer_number (default 8)
• The improvement is small.
[Chart: IOPS(k)]
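A sketch of how that RocksDB option might be raised (the option name and its default of 8 come from the slide; the `bluestore_rocksdb_options` key and the value 16 are assumptions):

```ini
# Assumed ceph.conf fragment; max_write_buffer_number (default 8) is
# from the slide, the chosen value 16 is illustrative.
[osd]
bluestore_rocksdb_options = max_write_buffer_number=16
```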
11. Random write – 4k
• When data is written into RocksDB, it is first added to memtables.
• Once a flush condition is triggered, the data in the memtables is flushed into L0 SST files.
• Deferred logs are the main data in every memtable, while object data is the main data in the L0 files.
[Chart: data written into the DB over time, broken down into plogs, ologs, dlogs, mlogs, and others]
[Chart: data written into L0 SST files over time, broken down into plogs, ologs, dlogs, and others]
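One way to see why deferred logs dominate the memtables yet mostly vanish from L0: a deferred entry is written and then deleted again within the memtable's lifetime, so the put/delete pair cancels at flush. A toy sketch (a simplification: real RocksDB may still keep tombstones at flush when older versions of a key can exist in lower levels):

```python
# Toy memtable flush: put/delete pairs inside one memtable cancel, so
# mostly object data survives into the L0 SST. Illustrative only.

def flush_to_l0(ops):
    """ops: list of ('put'|'delete', key, value) applied in order."""
    memtable = {}
    for op, key, value in ops:
        if op == "put":
            memtable[key] = value
        else:                     # a delete inside the same memtable
            memtable[key] = None  # tombstone overwrites the put
    # At flush time, cancelled pairs drop; only live keys reach L0.
    return {k: v for k, v in memtable.items() if v is not None}

ops = [
    ("put", "deferred/1", b"small-write payload"),
    ("put", "onode/obj1", b"object metadata"),
    ("delete", "deferred/1", None),  # deferred IO applied, entry removed
]
print(flush_to_l0(ops))  # {'onode/obj1': b'object metadata'}
```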
13. RocksDB – flush data recursively
• Random writes, 4k/16k.
• A new flush style is added: duplicated entries are deleted recursively.
• Performance is similar to merge num = 2, but the data written into L0 is decreased (5 GB per 10 minutes).
[Chart: IOPS(k) for 4k and 16k, Default vs. dedup_2]
[Diagram: old flush writes memtable1–memtable4 into SST1–SST4; the new flush merges and dedups memtable1–memtable4 into a single SST1]
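The idea can be sketched like this (a toy model with made-up names, not the actual RocksDB patch): merging several memtables at flush time and dropping duplicate keys and put/delete pairs means fewer bytes reach L0.

```python
# Toy "dedup" flush: instead of one L0 file per memtable, several
# memtables are merged and duplicates/tombstoned keys dropped.
# Illustrative only.

def flush_each(memtables):
    """Old style: one L0 file per memtable (tombstones included)."""
    return [dict(m) for m in memtables]

def flush_dedup(memtables):
    """New style: merge all memtables, newest value wins, tombstones drop."""
    merged = {}
    for m in memtables:  # memtables ordered oldest -> newest
        merged.update(m)
    live = {k: v for k, v in merged.items() if v is not None}
    return [live]        # a single, smaller L0 file

mts = [
    {"k1": b"v1", "deferred/1": b"payload"},
    {"k1": b"v2", "deferred/1": None},  # overwrite + delete tombstone
]
old = flush_each(mts)
new = flush_dedup(mts)
print(sum(len(f) for f in old), sum(len(f) for f in new))  # 4 1
```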
14. Future work
• RocksDB is still heavy for pg logs and object data.
Ø pg log entries are written once, and read only when the OSD node goes through recovery.
Ø For object data, a one-time journal may be enough.