2. Hardware and config
Hardware:
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
RAM: 128GB
Intel S5230 SSD 800GB (SATA)
Intel P2700 SSD 400GB (NVMe)
IO workload:
random write, 4k or 16k
fio + librbd (sample job file below)
10 jobs on 10 2GB RBD images
2 pools
iodepth=64
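For reference, a minimal fio job file matching this workload could look like the sketch below. The pool name, image names, and client name are illustrative assumptions, not taken from the original setup; one [jobN] section per RBD image gives the 10 jobs on 10 images.

  [global]
  ; assumed cephx client and pool name, illustrative only
  ioengine=rbd
  clientname=admin
  pool=rbd_pool_0
  rw=randwrite
  ; bs=16k for the 16k runs
  bs=4k
  iodepth=64
  time_based=1
  ; one-hour runs, per sections 3 and 4
  runtime=3600

  [job0]
  rbdname=image_0

  ; ... one [jobN] section per image, through [job9]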
Ceph cluster with one OSD. Three BlueStore device layouts are tested (a ceph.conf sketch follows):

rbd10_sata_dev_nvme_db_* (data on SATA SSD, DB/WAL on NVMe):
bluestore_block_path = /dev/sdb1
bluestore_block_db_path = /dev/nvme0n1p2
bluestore_block_wal_path = /dev/nvme0n1p3

rbd10_nvme_all_* (data, DB, and WAL all on NVMe):
bluestore_block_path = /dev/nvme0n1p1
bluestore_block_db_path = /dev/nvme0n1p2
bluestore_block_wal_path = /dev/nvme0n1p3

rbd10_sata_all_* (data, DB, and WAL all on SATA SSD):
bluestore_block_path = /dev/sdb1
bluestore_block_db_path = /dev/sdb2
bluestore_block_wal_path = /dev/sdb3
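In ceph.conf terms, each layout is just the corresponding trio of options in the [osd] section; a minimal sketch for the mixed SATA/NVMe case (assuming BlueStore is selected via osd_objectstore, which the original config may set elsewhere):

  [osd]
  osd_objectstore = bluestore
  bluestore_block_path = /dev/sdb1
  bluestore_block_db_path = /dev/nvme0n1p2
  bluestore_block_wal_path = /dev/nvme0n1p3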
3. How BlueStore affects IO latency – 4k RW
[Charts: BlueStore's share of OSD IO latency over time for rbd10_sata_dev_nvme_db_4k, rbd10_sata_all_4k, and rbd10_nvme_all_4k. Y-axis: 0% to 100%, split into bluestore_percentage and other_percentage; X-axis: samples 1 to 20.]
• Tests
• BlueStore with data (dev) and DB on SATA SSD or NVMe SSD, per the configs in section 2.
• Do 4k random writes for one hour.
• The data comes from perf counters, dumped and reset every 3 minutes (example commands below).
• The charts show, for each test, what share of the total OSD IO latency is spent in BlueStore.
Vertical axis: (bluestore commit_lat) / op_latency
Horizontal axis: time, in 3-minute intervals
• Conclusion: With the NVMe SSD as the data and DB device, the common OSD code path is the bottleneck, while with the SATA SSD as the data and DB device, BlueStore is the bottleneck.
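For reference, each 3-minute sample can be collected through the OSD admin socket; a sketch, assuming osd.0 and the default admin socket setup:

  # dump all perf counters as JSON, then reset them for the next interval
  ceph daemon osd.0 perf dump > sample_$(date +%s).json
  ceph daemon osd.0 perf reset all

The plotted ratio is then the bluestore commit_lat value divided by the osd op_latency value from each dump.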
4. How BlueStore affects IO latency – 16k RW
[Charts: BlueStore's share of OSD IO latency over time for rbd10_sata_dev_nvme_db_16k, rbd10_nvme_all_16k, and rbd10_sata_all_16k. Y-axis: 0% to 100%, split into bluestore_percentage and other_percentage; X-axis: samples 1 to 20.]
• Tests
• BlueStore with data (dev) and DB on SATA SSD or NVMe SSD, per the configs in section 2.
• Do 16k random writes for one hour.
• The data comes from perf counters, dumped and reset every 3 minutes.
• The charts show, for each test, what share of the total OSD IO latency is spent in BlueStore.
Vertical axis: (bluestore commit_lat) / op_latency
Horizontal axis: time, in 3-minute intervals
• Conclusion: Same as 4k RW. The only difference is in the scenario that uses the SATA SSD as the data device and the NVMe SSD for the DB (rbd10_sata_dev_nvme_db_16k).
5. RocksDB
• Test steps:
Create a Ceph cluster with one OSD.
Run fio to do sequential 4k or 16k writes to fill the whole RBD images.
Do random writes for 30 minutes (4k or 16k).
Capture the KV operation sequences (the Ceph code was modified to print out the KV sequences).
Create a new RocksDB on ext4 with the same options as below, and inject the KV operation sequences into the DB (a sketch of this replay step follows the options).
• RocksDB options:
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=2,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
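A minimal C++ sketch of the replay step, assuming the captured sequence is a plain-text log with one operation per line ("PUT <key> <value>" or "DEL <key>"); the log format, file names, and DB path are illustrative assumptions, while the RocksDB options mirror the bluestore_rocksdb_options above:

  #include <fstream>
  #include <iostream>
  #include <sstream>
  #include <string>
  #include "rocksdb/db.h"
  #include "rocksdb/options.h"

  int main() {
    rocksdb::Options opts;
    opts.create_if_missing = true;
    // Mirror bluestore_rocksdb_options from the OSD tests.
    opts.compression = rocksdb::kNoCompression;
    opts.max_write_buffer_number = 4;
    opts.min_write_buffer_number_to_merge = 2;
    opts.recycle_log_file_num = 4;
    opts.write_buffer_size = 268435456;        // 256MB memtables
    opts.writable_file_max_buffer_size = 0;
    opts.compaction_readahead_size = 2097152;  // 2MB

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(opts, "/mnt/ext4/replay_db", &db);
    if (!s.ok()) { std::cerr << s.ToString() << std::endl; return 1; }

    // Replay the captured KV sequence; the "PUT"/"DEL" log format is assumed.
    std::ifstream log("kv_sequence.log");
    std::string line;
    rocksdb::WriteOptions wopts;
    while (std::getline(log, line)) {
      std::istringstream iss(line);
      std::string op, key, value;
      iss >> op >> key;
      if (op == "PUT") {
        std::getline(iss, value);  // rest of the line is the value
        db->Put(wopts, key, value);
      } else if (op == "DEL") {
        db->Delete(wopts, key);
      }
    }
    delete db;
    return 0;
  }

Replaying against a standalone DB on ext4 isolates RocksDB's write behavior (memtable flushes, L0 SSTs) from the rest of the OSD, which is what sections 6 through 9 measure.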
6. Data written into each memtable – 4k RW
• The chart shows what kinds of data are written into each memtable.
• In every 256MB memtable (after stabilizing): omap 21MB, onodes 47MB, deferred 186MB, others 0.
• In total, 1760 256MB memtables are generated.
[Chart: "Data written into memtables" (4k RW). X-axis: memtable index, 1 to ~1760; Y-axis: Size (MB), 0 to 200. Series: omap, onodes, deferred, other.]
7. Data written into L0 SST – 4k RW
[Charts: "Data written into L0 sst" (4k RW). Top: default flush; bottom: with flush_style=kFlushStyleDedup. X-axis: L0 SST file index; Y-axis: Size (MB). Series: omap, onodes, deferred, others.]
• The charts show what kinds of data are written into each L0 SST file.
• Top chart:
Data written into each L0 SST file: omap 24MB, onodes 15~27MB, deferred 3MB, others 0.
In total, 923 L0 SST files are created, total size 46653MB.
• Bottom chart:
Added option: flush_style=kFlushStyleDedup (see the snippet below).
omap 12MB, onodes 4~8MB, deferred 0, others 0.
In total, 36313MB written into 1840 L0 SST files.
• Conclusion: When using the rocksdb dedup package, about 10GB less data is written to L0 SST files in 30 minutes.
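For the dedup runs, the flush style is presumably enabled by appending it to the options string from section 5; a sketch (the spelling flush_style=kFlushStyleDedup is taken from the chart annotation, and it comes from the rocksdb dedup PR used in these tests, not from upstream RocksDB):

  bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=2,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,flush_style=kFlushStyleDedup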
8. Data written into each memtable – 16k RW
• The chart shows what kinds of data are written into each memtable.
• In every 256MB memtable (after stabilizing): onodes 169MB, omap 78MB, deferred 0, others 9MB.
• In total, 434 256MB memtables are generated.
[Chart: "Data written into memtables" (16k RW). X-axis: memtable index, 1 to ~434; Y-axis: Size (MB), 0 to 200. Series: omap, onodes, deferred, other.]
9. Data written into L0 SST – 16k RW
[Charts: "Data written into L0 sst" (16k RW). Top: default flush; bottom: with flush_style=kFlushStyleDedup. X-axis: L0 SST file index; Y-axis: Size (MB). Series: omap, onodes, deferred, others.]
• The charts show what kinds of data are written into each L0 SST file.
• Top chart:
Data written into each L0 SST file: omap 80MB, onodes 35MB, deferred 0, others 9MB.
In total, 243 L0 SST files are generated, total size 29404MB.
• Bottom chart:
With flush_style=kFlushStyleDedup:
omap 40MB, onodes 4~8MB, deferred 0, others 5MB.
In total, 24275MB written into 485 L0 SST files.
• Conclusion: When using the rocksdb dedup package, about 5GB less data is written to L0 SST files in 30 minutes.
10. Other conclusions:
• Omap is the main kind of data written into the L0 SSTs, and the rocksdb dedup PR provides little benefit for omap.