Some analysis of BlueStore & RocksDB
Li Xiaoyan (Intel)
Hardware and config
Hardware:
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
RAM: 128GB
Intel S5230 SSD 800G (SATA)
Intel P2700 SSD 400G (NVMe)
IO workload:
Random write, 4k or 16k
Fio + librbd
10 jobs on 10 x 2GB rbd images
2 pools
iodepth=64
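A minimal fio job sketch matching the workload above (not the exact file used: the client name, pool name, and image name are placeholders, and only one of the ten per-image jobs is written out):

```ini
[global]
ioengine=rbd
clientname=admin        ; assumed cephx client
pool=rbd_pool1          ; placeholder; the tests spread 10 images over 2 pools
rw=randwrite
bs=4k                   ; 16k for the 16k runs
iodepth=64
direct=1
time_based=1
runtime=3600            ; 1 hour for the latency tests, 1800s for the RocksDB capture runs

[image0]
rbdname=image0          ; placeholder; one such job per 2GB rbd image, 10 in total
```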
Ceph cluster with one OSD
rbd10_sata_dev_nvme_db_*
bluestore_block_path = /dev/sdb1
bluestore_block_db_path = /dev/nvme0n1p2
bluestore_block_wal_path = /dev/nvme0n1p3
rbd10_nvme_all_*
bluestore_block_path = /dev/nvme0n1p1
bluestore_block_db_path = /dev/nvme0n1p2
bluestore_block_wal_path = /dev/nvme0n1p3
rbd10_sata_all_*
bluestore_block_path = /dev/sdb1
bluestore_block_db_path = /dev/sdb2
bluestore_block_wal_path = /dev/sdb3
How BlueStore affects IO latency – 4k RW
[Charts: rbd10_sata_dev_nvme_db_4k, rbd10_sata_all_4k, rbd10_nvme_all_4k; stacked bars of bluestore_percentage vs. other_percentage (0-100%) over 20 three-minute intervals]
• Tests
• BlueStore with dev and db on different device combinations (see the configs above).
• Run 4k random writes for one hour.
• The data comes from perf counters, dumped and reset every 3 minutes.
• The charts show, for each test, what percentage of the total OSD IO latency is spent in BlueStore.
Vertical axis: (bluestore commit_lat) / op_latency
Horizontal axis: time, in 3-minute intervals
• Conclusion: With an NVMe SSD as the dev & db device, the common OSD path is the bottleneck, while with a SATA SSD as the dev & db device, BlueStore is the bottleneck.
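The per-interval percentage can be reproduced from the OSD admin socket. Below is a minimal sketch, assuming a local osd.0 admin socket and that commit_lat / op_latency are exposed as avgcount/sum pairs under the bluestore and osd sections of `ceph daemon osd.N perf dump` (the counter layout can differ between Ceph releases):

```python
import json
import subprocess

def perf_dump_and_reset(osd_id=0):
    # Dump the OSD perf counters via the admin socket, then reset them so the
    # next dump covers only the next 3-minute interval.
    out = subprocess.check_output(["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"])
    subprocess.check_call(["ceph", "daemon", f"osd.{osd_id}", "perf", "reset", "all"])
    return json.loads(out)

def bluestore_percentage(dump):
    # Average latency over the interval = sum of latencies / number of ops.
    commit = dump["bluestore"]["commit_lat"]
    op = dump["osd"]["op_latency"]
    commit_avg = commit["sum"] / max(commit["avgcount"], 1)
    op_avg = op["sum"] / max(op["avgcount"], 1)
    return commit_avg / op_avg  # fraction of OSD op latency spent in the BlueStore commit

if __name__ == "__main__":
    print(f"bluestore_percentage = {bluestore_percentage(perf_dump_and_reset(0)):.1%}")
```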
How BlueStore affects IO latency – 16k RW
[Charts: rbd10_sata_dev_nvme_db_16k, rbd10_nvme_all_16k, rbd10_sata_all_16k; stacked bars of bluestore_percentage vs. other_percentage (0-100%) over 20 three-minute intervals]
• Tests
• BlueStore with dev and db on different device combinations (see the configs above).
• Run 16k random writes for one hour.
• The data comes from perf counters, dumped and reset every 3 minutes.
• The charts show, for each test, what percentage of the total OSD IO latency is spent in BlueStore.
Vertical axis: (bluestore commit_lat) / op_latency
Horizontal axis: time, in 3-minute intervals
• Conclusion: Same as 4k RW. The only difference is in the scenario that uses a SATA SSD as the dev device and an NVMe SSD as the db device.
RocksDB
• Test steps:
Create a Ceph cluster with one OSD.
Run fio to do sequential 4k or 16k writes to fill the whole rbd images.
Do random writes (4k or 16k) for 30 minutes.
Capture the KV operation sequences (Ceph code was modified to print out the KV sequences).
Create a new RocksDB on ext4 with the same options (listed below), and inject the captured KV operation sequences into the DB.
• RocksDB options:
bluestore_rocksdb_options=compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=2,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
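A minimal sketch of the injection step using the python-rocksdb bindings (Ceph itself embeds RocksDB via C++). The trace file name and its line format are hypothetical, and only the bluestore_rocksdb_options that map directly onto python-rocksdb Options attributes are set here:

```python
import rocksdb  # python-rocksdb bindings; attribute names may differ slightly by version

# Mirror the bluestore_rocksdb_options above as closely as the bindings allow.
opts = rocksdb.Options()
opts.create_if_missing = True
opts.compression = rocksdb.CompressionType.no_compression
opts.max_write_buffer_number = 4
opts.write_buffer_size = 268435456  # 256MB memtables, matching the memtable charts below

db = rocksdb.DB("/mnt/ext4/replay_db", opts)

# Replay the captured KV sequence. Hypothetical trace format: one operation per
# line, "PUT <key_hex> <value_hex>" or "DELETE <key_hex>", with a "COMMIT" line
# closing each batch (one batch per BlueStore transaction).
batch = rocksdb.WriteBatch()
with open("kv_trace.log") as trace:
    for line in trace:
        fields = line.split()
        if fields[0] == "PUT":
            batch.put(bytes.fromhex(fields[1]), bytes.fromhex(fields[2]))
        elif fields[0] == "DELETE":
            batch.delete(bytes.fromhex(fields[1]))
        elif fields[0] == "COMMIT":
            db.write(batch, sync=True)
            batch = rocksdb.WriteBatch()
```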
Data written into each memtable – 4k RW
• The chart shows what kinds of data are written into each memtable.
• In every 256MB memtable (once stable): omap 21MB, onodes 47MB, deferred 186MB, others 0.
• In total, 1760 256MB memtables are generated.
[Chart: data written into each of the 1760 memtables, broken down into omap / onodes / deferred / other, in MB]
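The omap / onodes / deferred split is done by BlueStore key prefix. A minimal classifier sketch is shown below; it assumes the single-character prefixes BlueStore uses (O for onodes/objects, M for omap, L for deferred writes) and reads the same hypothetical trace format as the replay sketch above:

```python
from collections import defaultdict

# Assumed BlueStore key prefixes; these three types dominate the memtable charts.
PREFIX_NAMES = {"O": "onodes", "M": "omap", "L": "deferred"}

def classify(key: bytes) -> str:
    # The first byte of the key selects the BlueStore prefix; anything else is "other".
    return PREFIX_NAMES.get(chr(key[0]), "other")

def account(trace_path="kv_trace.log"):
    # Accumulate bytes written per data type over the whole trace.
    totals = defaultdict(int)
    with open(trace_path) as trace:
        for line in trace:
            fields = line.split()
            if fields[0] != "PUT":
                continue
            key, value = bytes.fromhex(fields[1]), bytes.fromhex(fields[2])
            totals[classify(key)] += len(key) + len(value)
    return {kind: size / 2**20 for kind, size in totals.items()}  # MB per type

if __name__ == "__main__":
    for kind, mb in sorted(account().items()):
        print(f"{kind}: {mb:.1f} MB")
```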
Data written into L0 SST – 4k RW
[Charts: data written into each L0 SST file, with and without flush_style=kFlushStyleDedup, broken down into omap / onodes / deferred / others, in MB]
• The charts show what kinds of data are written into every L0 SST file.
• Top chart (default flush):
◦ Data written into every L0 SST file: omap 24MB, onodes 15~27MB, deferred 3MB, others 0.
◦ In total, 923 L0 SST files are created, with a total size of 46653MB.
• Bottom chart (added option flush_style=kFlushStyleDedup):
◦ omap 12MB, onodes 4~8MB, deferred 0, others 0.
◦ In total, 36313MB is written into 1840 L0 SST files.
• Conclusion: With the RocksDB dedup flush, about 10GB less data is written to L0 SST files in 30 minutes.
Data written into each memtable – 16k RW
• The chart shows what kinds of data are written into each memtable.
• In every 256MB memtable (once stable): onodes 169MB, omap 78MB, deferred 0, others 9MB.
• In total, 434 256MB memtables are generated.
[Chart: data written into each of the 434 memtables, broken down into omap / onodes / deferred / other, in MB]
Data written into L0 SST – 16k RW
[Charts: data written into each L0 SST file, with and without flush_style=kFlushStyleDedup, broken down into omap / onodes / deferred / others, in MB]
• The charts show what kinds of data are written into every L0 SST file.
• Top chart (default flush):
◦ Data written into every L0 SST file: omap 80MB, onodes 35MB, deferred 0, others 9MB.
◦ In total, 243 L0 SST files are generated, with a total size of 29404MB.
• Bottom chart (with flush_style=kFlushStyleDedup):
◦ omap 40MB, onodes 4~8MB, deferred 0, others 5MB.
◦ In total, 24275MB is written into 485 L0 SST files.
• Conclusion: With the RocksDB dedup flush, about 5GB less data is written to L0 SST files in 30 minutes.
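A quick arithmetic check of the savings above, and of the per-type totals behind the conclusion on the next slide (estimated as per-file size multiplied by the number of L0 files):

```python
# L0 SST write savings from the dedup flush, from the totals above.
saving_4k = 46653 - 36313    # = 10340 MB, i.e. about 10GB less in 30 minutes
saving_16k = 29404 - 24275   # = 5129 MB, i.e. about 5GB less in 30 minutes

# Estimated omap volume written to L0 (per-file omap x number of L0 files):
# the per-file omap halves but the file count roughly doubles, so the total
# omap volume barely changes.
omap_4k_default = 24 * 923    # ~22152 MB
omap_4k_dedup = 12 * 1840     # ~22080 MB
omap_16k_default = 80 * 243   # ~19440 MB
omap_16k_dedup = 40 * 485     # ~19400 MB

print(saving_4k, saving_16k)
print(omap_4k_default, omap_4k_dedup, omap_16k_default, omap_16k_dedup)
```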
Other conclusions:
• Omap data is the main data written into L0 SSTs, and the RocksDB dedup PR benefits omap little: the per-file omap size halves, but the number of L0 files roughly doubles, so the total omap volume written stays about the same.