SlideShare una empresa de Scribd logo
1 de 26
Brought to you by
Seastore: Next Generation
Backing Store for Ceph
Samuel Just
Senior Principal Software Engineer at
Sam Just
Senior Principal Software Engineer
■ Currently Senior Principal Software Engineer at Red Hat
■ Crimson Lead
■ Veteran Ceph Contributor
Crimson!
What
■ Rewrite IO path in Seastar
● Preallocate cores
● One thread per core
● Explicitly shard all data structures
and work over cores
● No locks and no blocking
● Message passing between cores
● Polling for all IO
■ DPDK, SPDK
● Kernel bypass for network and
storage IO
■ Multi-year effort
Why
■ Not just about how many IOPS we do…
■ More about IOPS per CPU core
■ Current Ceph is based on traditional
multi-threaded programming model
■ Context switching is too expensive
when storage is almost as fast as
memory
Classic-OSD Threading Architecture
Messenger
Worker Thread
Worker Thread
Worker Thread
PG
Worker Thread
Worker Thread
Worker Thread
ObjectStore
Worker Thread
Worker Thread
Worker Thread
osd op queue kv queue
Messenger
Worker Thread
Worker Thread
Worker Thread
out_queue
Crimson-OSD Threading Architecture
Seastar Reactor Thread
Messenger PG ObjectStore
async wait async wait
Messenger
async wait
Seastar Reactor Thread
Messenger PG ObjectStore
async wait async wait
Messenger
async wait
Seastar Reactor Thread
Messenger PG ObjectStore
async wait async wait
Messenger
async wait
Seastar Reactor Thread
Messenger PG ObjectStore
async wait async wait
Messenger
async wait
Seastar Reactor Thread
Messenger PG ObjectStore
async wait async wait
Messenger
async wait
Classic vs Crimson Programming Model
for (;;) {
auto req = connection.read_request();
auto result = handle_request(req);
connection.send_response(result);
}
repeat([conn, this] {
return conn.read_request().then([this](auto req) {
return handle_request(req);
}).then([conn] (auto result) {
return conn.send_response(result);
});
});
Seastore
ObjectStore
■ Transactional
■ Composed of flat object namespace
■ Object names may be large (>1k)
■ Each object contains a key->value mapping (string->bytes) as well as a data payload.
■ Supports COW object clones
■ Supports ordered listing of both the omap and the object namespace
SeaStore
■ New ObjectStore implementation designed natively for crimson’s threading/callback model
■ Avoids cpu-heavy metadata designs like RocksDB
■ Intended to exploit emerging technologies like ZNS, fast nvme, and persistent memory.
Seastore - ZNS
■ New NVMe Specification
■ Intended to address challenges with conventional FTL designs
● High writes amplification, bad for qlc flash
● Background garbage collection tends to impact tail latencies
■ Different interface
● Drive divided into zones
● Zones can only be opened, written sequentially, closed, and released.
■ As it happens, this kind of write pattern tends to be good for conventional ssds as well.
Seastore - Persistent Memory
■ Characteristics:
● Almost DRAM like read latencies
● Write latency drastically lower than flash
● Very high write endurance
● Seems like a good fit for persistently caching data and metadata!
● Reads from persistent memory can simply return a ready future without waiting at all.
■ SeaStore approach currently being discussed:
● Keep caching layer in persistent memory.
● Update by copy-on-write with a paddr->extent mapping maintained via a write-ahead journal.
Design
Seastore - Logical Structure
Root
onode_by_hobject lba_tree
onode
block_info
omap_tree extents (contiguous
block of LBAs)
lba_t -> block_info_t
hobject_t -> onode_t
Seastore - Why use an LBA indirection?
A
B
C C’
When GCing node C, we need to do 2 things:
1. Find all incoming references (B in this case)
2. Write out a transaction updating (and dirtying) those
references as well as writing out the new block C’
Using direct references means we still need to maintain some
means of finding the parent references, and we pay the cost of
updating the relatively low fanout onode and omap trees.
By contrast, using an LBA indirection requires extra reads in the
lookup path, but potentially decreases read and write
amplification during gc.
It also makes refcounting and subtree sharing for COW clones
easier, and we can use it to do sparse mapping for object data
extents.
Seastore - Layout
Journal Segments
header delta D’ ...
delta E’ (logical) block B ... ...
record
(physical) block A
E
D
A
E’
D’
B
LBA Tree LBA Tree
Seastore - ZNS
Journal Segments
header delta D’ ...
delta E’ (logical) block B ... ...
record
(physical) block A
E
D
A
E’
D’
B
LBA Tree LBA Tree
Architecture
Cache Journal LBAManager SegmentCleaner
TransactionManager
OnodeManager OmapManager ObjectDataHandler ...
SeaStore
RootBlock
■ Broadly, SeaStore is divided into components above and below TransactionManager:
● TransactionManager supplies a transactional interface in terms of logically addressed blocks used by
data extents and metadata structures like the ghobject->onode index and omap trees.
● Components under TransactionManager (mainly the logical address tree) deal in terms of physically
addressed extents.
OnodeManager -- FLTree
■ crimson/os/seastore/onode_manager/staged-fltree/
■ Btree based structure for mapping ghobject_t -> onode
■ Splits ghobject_t keys into a fixed prefix (shard, pool, key), variable middle (object name and
namespace), and fixed suffix (snapshot, generation)
■ Internal nodes drop components that are uniform over the whole node to reduce space overhead.
■ Internal nodes can also drop components that differ between adjacent keys.
● Allows nodes close to root to avoid recording name and namespace improving density.
■ Contributed mainly by Yingxin Cheng <yingxin.cheng@intel.com>
OmapManager -- BtreeOmapManager
■ crimson/os/seastore/omap_manager/btree/
■ Implements omap storage for each object.
■ Fairly straightforward btree mapping string keys to string values.
■ Contributed mainly by chunmei-liu <chunmei.liu@intel.com>
ObjectDataHandler
■ crimson/os/seastore/object_data_handler.h/cc
■ Each onode maps a contiguous, sparse range of logical addresses to that object’s offsets
■ Leverages the LBAManager to avoid a secondary extent map.
■ Clone support requires work to enable relocating logical extent addresses (TODO)
TransactionManager
■ crimson/os/seastore/TransactionManager.h/cc
■ Implements a uniform transactional interface for allocating, reading, and mutating logically
addressed extents.
■ Mutations to extents can be expressed as a compact type dependent “delta” which will be included
transparently in the commit journal record.
● For example, BtreeOmapManager represents the insertion of a key into a block by encoding
the key/value pair rather than needing to rewrite the full block.
■ Components using TransactionManager can ignore extent relocation entirely.
LBAManager -- BtreeLBAManager
■ crimson/os/seastore/lba_manager/btree/
■ Btree mapping logical addresses to physical offsets
■ Includes reference counts to be used with clone.
■ Will include extent checksums (TODO)
Cache
■ crimson/os/seastore/lba_manager/cache.h/cc
■ Manages in-memory extents, keeps dirty extents pinned in memory
■ Detects conflicts in transactions to be committed
SegmentCleaner
■ crimson/os/seastore/lba_manager/segment_cleaner
■ Tracks usage status of segments.
■ Runs a background process (within the same reactor) to choose a segment, relocate any live
extents, and release it.
● Logical extents are simply remapped within the LBAManager
● Physical extents (mainly BtreeLBAManager extents) need any references updated as well
(mainly btree parents)
■ Also responsible for throttling foreground work based on pending gc work -- goal is to avoid abrupt
gc pauses and smooth out latency.
Current Status and Future Work
■ Can boot an osd via vstart without snapshots and complete upwards of several minutes of IO before
crashing!
■ Performance/stability: stabilize, add to ceph upstream testing infrastructure
■ Add ability to remap location of logical extents
■ In place mutation support for fast nvme devices
■ Persistent memory support in Cache
■ Tiering
Brought to you by
Samuel Just
Senior Principal Software Engineer at Red Hat
Email: sjust@redhat.com

Más contenido relacionado

La actualidad más candente

Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Odinot Stanislas
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephSage Weil
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Performance tuning in BlueStore & RocksDB - Li Xiaoyan
Performance tuning in BlueStore & RocksDB - Li XiaoyanPerformance tuning in BlueStore & RocksDB - Li Xiaoyan
Performance tuning in BlueStore & RocksDB - Li XiaoyanCeph Community
 
Revisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS SchedulerRevisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS SchedulerYongseok Oh
 
Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph Community
 
Analyze Virtual Machine Overhead Compared to Bare Metal with Tracing
Analyze Virtual Machine Overhead Compared to Bare Metal with TracingAnalyze Virtual Machine Overhead Compared to Bare Metal with Tracing
Analyze Virtual Machine Overhead Compared to Bare Metal with TracingScyllaDB
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing GuideJose De La Rosa
 
Ceph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsCeph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsKaran Singh
 
Ceph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing GuideCeph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing GuideKaran Singh
 
Ceph - A distributed storage system
Ceph - A distributed storage systemCeph - A distributed storage system
Ceph - A distributed storage systemItalo Santos
 
A crash course in CRUSH
A crash course in CRUSHA crash course in CRUSH
A crash course in CRUSHSage Weil
 
QEMU Disk IO Which performs Better: Native or threads?
QEMU Disk IO Which performs Better: Native or threads?QEMU Disk IO Which performs Better: Native or threads?
QEMU Disk IO Which performs Better: Native or threads?Pradeep Kumar
 
Ceph Month 2021: RADOS Update
Ceph Month 2021: RADOS UpdateCeph Month 2021: RADOS Update
Ceph Month 2021: RADOS UpdateCeph Community
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Julien Le Dem
 
JavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient JavaJavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient JavaChris Bailey
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephSage Weil
 

La actualidad más candente (20)

Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
 
MyRocks Deep Dive
MyRocks Deep DiveMyRocks Deep Dive
MyRocks Deep Dive
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Performance tuning in BlueStore & RocksDB - Li Xiaoyan
Performance tuning in BlueStore & RocksDB - Li XiaoyanPerformance tuning in BlueStore & RocksDB - Li Xiaoyan
Performance tuning in BlueStore & RocksDB - Li Xiaoyan
 
Bluestore
BluestoreBluestore
Bluestore
 
Revisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS SchedulerRevisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS Scheduler
 
Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph RBD Update - June 2021
Ceph RBD Update - June 2021
 
Analyze Virtual Machine Overhead Compared to Bare Metal with Tracing
Analyze Virtual Machine Overhead Compared to Bare Metal with TracingAnalyze Virtual Machine Overhead Compared to Bare Metal with Tracing
Analyze Virtual Machine Overhead Compared to Bare Metal with Tracing
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing Guide
 
Ceph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsCeph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion Objects
 
Ceph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing GuideCeph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing Guide
 
Ceph - A distributed storage system
Ceph - A distributed storage systemCeph - A distributed storage system
Ceph - A distributed storage system
 
A crash course in CRUSH
A crash course in CRUSHA crash course in CRUSH
A crash course in CRUSH
 
QEMU Disk IO Which performs Better: Native or threads?
QEMU Disk IO Which performs Better: Native or threads?QEMU Disk IO Which performs Better: Native or threads?
QEMU Disk IO Which performs Better: Native or threads?
 
Ceph Month 2021: RADOS Update
Ceph Month 2021: RADOS UpdateCeph Month 2021: RADOS Update
Ceph Month 2021: RADOS Update
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
JavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient JavaJavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient Java
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
 

Similar a Next Generation Backing Store for Ceph - Seastore

Everything We Learned About In-Memory Data Layout While Building VoltDB
Everything We Learned About In-Memory Data Layout While Building VoltDBEverything We Learned About In-Memory Data Layout While Building VoltDB
Everything We Learned About In-Memory Data Layout While Building VoltDBjhugg
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Application Caching: The Hidden Microservice
Application Caching: The Hidden MicroserviceApplication Caching: The Hidden Microservice
Application Caching: The Hidden MicroserviceScott Mansfield
 
To Serverless and Beyond
To Serverless and BeyondTo Serverless and Beyond
To Serverless and BeyondScyllaDB
 
Ceph Performance: Projects Leading up to Jewel
Ceph Performance: Projects Leading up to JewelCeph Performance: Projects Leading up to Jewel
Ceph Performance: Projects Leading up to JewelColleen Corrice
 
Ceph Performance: Projects Leading Up to Jewel
Ceph Performance: Projects Leading Up to JewelCeph Performance: Projects Leading Up to Jewel
Ceph Performance: Projects Leading Up to JewelRed_Hat_Storage
 
Migrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at FacebookMigrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at FacebookMariaDB plc
 
Introduction to NoSql
Introduction to NoSqlIntroduction to NoSql
Introduction to NoSqlOmid Vahdaty
 
Application Caching: The Hidden Microservice (SAConf)
Application Caching: The Hidden Microservice (SAConf)Application Caching: The Hidden Microservice (SAConf)
Application Caching: The Hidden Microservice (SAConf)Scott Mansfield
 
A Journey into Hexagon: Dissecting Qualcomm Basebands
A Journey into Hexagon: Dissecting Qualcomm BasebandsA Journey into Hexagon: Dissecting Qualcomm Basebands
A Journey into Hexagon: Dissecting Qualcomm BasebandsPriyanka Aash
 
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...JAXLondon2014
 
SSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonSSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonUri Cohen
 
EVCache: Lowering Costs for a Low Latency Cache with RocksDB
EVCache: Lowering Costs for a Low Latency Cache with RocksDBEVCache: Lowering Costs for a Low Latency Cache with RocksDB
EVCache: Lowering Costs for a Low Latency Cache with RocksDBScott Mansfield
 
Inside CynosDB: MariaDB optimized for the cloud at Tencent
Inside CynosDB: MariaDB optimized for the cloud at TencentInside CynosDB: MariaDB optimized for the cloud at Tencent
Inside CynosDB: MariaDB optimized for the cloud at TencentMariaDB plc
 
TriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in ProductionTriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in Productiontrihug
 
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters Ceph Community
 
Community Update at OpenStack Summit Boston
Community Update at OpenStack Summit BostonCommunity Update at OpenStack Summit Boston
Community Update at OpenStack Summit BostonSage Weil
 
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)Lars Marowsky-Brée
 
SUSE Storage: Sizing and Performance (Ceph)
SUSE Storage: Sizing and Performance (Ceph)SUSE Storage: Sizing and Performance (Ceph)
SUSE Storage: Sizing and Performance (Ceph)Lars Marowsky-Brée
 

Similar a Next Generation Backing Store for Ceph - Seastore (20)

Everything We Learned About In-Memory Data Layout While Building VoltDB
Everything We Learned About In-Memory Data Layout While Building VoltDBEverything We Learned About In-Memory Data Layout While Building VoltDB
Everything We Learned About In-Memory Data Layout While Building VoltDB
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Application Caching: The Hidden Microservice
Application Caching: The Hidden MicroserviceApplication Caching: The Hidden Microservice
Application Caching: The Hidden Microservice
 
To Serverless and Beyond
To Serverless and BeyondTo Serverless and Beyond
To Serverless and Beyond
 
Ceph Performance: Projects Leading up to Jewel
Ceph Performance: Projects Leading up to JewelCeph Performance: Projects Leading up to Jewel
Ceph Performance: Projects Leading up to Jewel
 
Ceph Performance: Projects Leading Up to Jewel
Ceph Performance: Projects Leading Up to JewelCeph Performance: Projects Leading Up to Jewel
Ceph Performance: Projects Leading Up to Jewel
 
Migrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at FacebookMigrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at Facebook
 
Linux Huge Pages
Linux Huge PagesLinux Huge Pages
Linux Huge Pages
 
Introduction to NoSql
Introduction to NoSqlIntroduction to NoSql
Introduction to NoSql
 
Application Caching: The Hidden Microservice (SAConf)
Application Caching: The Hidden Microservice (SAConf)Application Caching: The Hidden Microservice (SAConf)
Application Caching: The Hidden Microservice (SAConf)
 
A Journey into Hexagon: Dissecting Qualcomm Basebands
A Journey into Hexagon: Dissecting Qualcomm BasebandsA Journey into Hexagon: Dissecting Qualcomm Basebands
A Journey into Hexagon: Dissecting Qualcomm Basebands
 
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
 
SSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonSSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax London
 
EVCache: Lowering Costs for a Low Latency Cache with RocksDB
EVCache: Lowering Costs for a Low Latency Cache with RocksDBEVCache: Lowering Costs for a Low Latency Cache with RocksDB
EVCache: Lowering Costs for a Low Latency Cache with RocksDB
 
Inside CynosDB: MariaDB optimized for the cloud at Tencent
Inside CynosDB: MariaDB optimized for the cloud at TencentInside CynosDB: MariaDB optimized for the cloud at Tencent
Inside CynosDB: MariaDB optimized for the cloud at Tencent
 
TriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in ProductionTriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in Production
 
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
 
Community Update at OpenStack Summit Boston
Community Update at OpenStack Summit BostonCommunity Update at OpenStack Summit Boston
Community Update at OpenStack Summit Boston
 
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
 
SUSE Storage: Sizing and Performance (Ceph)
SUSE Storage: Sizing and Performance (Ceph)SUSE Storage: Sizing and Performance (Ceph)
SUSE Storage: Sizing and Performance (Ceph)
 

Más de ScyllaDB

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLScyllaDB
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasScyllaDB
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasScyllaDB
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...ScyllaDB
 
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...ScyllaDB
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaScyllaDB
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityScyllaDB
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptxScyllaDB
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDBScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationScyllaDB
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsScyllaDB
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesScyllaDB
 
ScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB
 
DBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsDBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsScyllaDB
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBScyllaDB
 
NoSQL Data Modeling 101
NoSQL Data Modeling 101NoSQL Data Modeling 101
NoSQL Data Modeling 101ScyllaDB
 

Más de ScyllaDB (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQL
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & Pitfalls
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
 
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
 
ScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB Virtual Workshop
ScyllaDB Virtual Workshop
 
DBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsDBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & Tradeoffs
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
 
NoSQL Data Modeling 101
NoSQL Data Modeling 101NoSQL Data Modeling 101
NoSQL Data Modeling 101
 

Último

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 

Último (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 

Next Generation Backing Store for Ceph - Seastore

  • 1. Brought to you by Seastore: Next Generation Backing Store for Ceph Samuel Just Senior Principal Software Engineer at
  • 2. Sam Just Senior Principal Software Engineer ■ Currently Senior Principal Software Engineer at Red Hat ■ Crimson Lead ■ Veteran Ceph Contributor
  • 3. Crimson! What ■ Rewrite IO path in Seastar ● Preallocate cores ● One thread per core ● Explicitly shard all data structures and work over cores ● No locks and no blocking ● Message passing between cores ● Polling for all IO ■ DPDK, SPDK ● Kernel bypass for network and storage IO ■ Multi-year effort Why ■ Not just about how many IOPS we do… ■ More about IOPS per CPU core ■ Current Ceph is based on traditional multi-threaded programming model ■ Context switching is too expensive when storage is almost as fast as memory
  • 4. Classic-OSD Threading Architecture Messenger Worker Thread Worker Thread Worker Thread PG Worker Thread Worker Thread Worker Thread ObjectStore Worker Thread Worker Thread Worker Thread osd op queue kv queue Messenger Worker Thread Worker Thread Worker Thread out_queue
  • 5. Crimson-OSD Threading Architecture Seastar Reactor Thread Messenger PG ObjectStore async wait async wait Messenger async wait Seastar Reactor Thread Messenger PG ObjectStore async wait async wait Messenger async wait Seastar Reactor Thread Messenger PG ObjectStore async wait async wait Messenger async wait Seastar Reactor Thread Messenger PG ObjectStore async wait async wait Messenger async wait Seastar Reactor Thread Messenger PG ObjectStore async wait async wait Messenger async wait
  • 6. Classic vs Crimson Programming Model for (;;) { auto req = connection.read_request(); auto result = handle_request(req); connection.send_response(result); } repeat([conn, this] { return conn.read_request().then([this](auto req) { return handle_request(req); }).then([conn] (auto result) { return conn.send_response(result); }); });
  • 8. ObjectStore ■ Transactional ■ Composed of flat object namespace ■ Object names may be large (>1k) ■ Each object contains a key->value mapping (string->bytes) as well as a data payload. ■ Supports COW object clones ■ Supports ordered listing of both the omap and the object namespace
  • 9. SeaStore ■ New ObjectStore implementation designed natively for crimson’s threading/callback model ■ Avoids cpu-heavy metadata designs like RocksDB ■ Intended to exploit emerging technologies like ZNS, fast nvme, and persistent memory.
  • 10. Seastore - ZNS ■ New NVMe Specification ■ Intended to address challenges with conventional FTL designs ● High writes amplification, bad for qlc flash ● Background garbage collection tends to impact tail latencies ■ Different interface ● Drive divided into zones ● Zones can only be opened, written sequentially, closed, and released. ■ As it happens, this kind of write pattern tends to be good for conventional ssds as well.
  • 11. Seastore - Persistent Memory ■ Characteristics: ● Almost DRAM like read latencies ● Write latency drastically lower than flash ● Very high write endurance ● Seems like a good fit for persistently caching data and metadata! ● Reads from persistent memory can simply return a ready future without waiting at all. ■ SeaStore approach currently being discussed: ● Keep caching layer in persistent memory. ● Update by copy-on-write with a paddr->extent mapping maintained via a write-ahead journal.
  • 13. Seastore - Logical Structure Root onode_by_hobject lba_tree onode block_info omap_tree extents (contiguous block of LBAs) lba_t -> block_info_t hobject_t -> onode_t
  • 14. Seastore - Why use an LBA indirection? A B C C’ When GCing node C, we need to do 2 things: 1. Find all incoming references (B in this case) 2. Write out a transaction updating (and dirtying) those references as well as writing out the new block C’ Using direct references means we still need to maintain some means of finding the parent references, and we pay the cost of updating the relatively low fanout onode and omap trees. By contrast, using an LBA indirection requires extra reads in the lookup path, but potentially decreases read and write amplification during gc. It also makes refcounting and subtree sharing for COW clones easier, and we can use it to do sparse mapping for object data extents.
  • 15. Seastore - Layout Journal Segments header delta D’ ... delta E’ (logical) block B ... ... record (physical) block A E D A E’ D’ B LBA Tree LBA Tree
  • 16. Seastore - ZNS Journal Segments header delta D’ ... delta E’ (logical) block B ... ... record (physical) block A E D A E’ D’ B LBA Tree LBA Tree
  • 17. Architecture Cache Journal LBAManager SegmentCleaner TransactionManager OnodeManager OmapManager ObjectDataHandler ... SeaStore RootBlock ■ Broadly, SeaStore is divided into components above and below TransactionManager: ● TransactionManager supplies a transactional interface in terms of logically addressed blocks used by data extents and metadata structures like the ghobject->onode index and omap trees. ● Components under TransactionManager (mainly the logical address tree) deal in terms of physically addressed extents.
  • 18. OnodeManager -- FLTree ■ crimson/os/seastore/onode_manager/staged-fltree/ ■ Btree based structure for mapping ghobject_t -> onode ■ Splits ghobject_t keys into a fixed prefix (shard, pool, key), variable middle (object name and namespace), and fixed suffix (snapshot, generation) ■ Internal nodes drop components that are uniform over the whole node to reduce space overhead. ■ Internal nodes can also drop components that differ between adjacent keys. ● Allows nodes close to root to avoid recording name and namespace improving density. ■ Contributed mainly by Yingxin Cheng <yingxin.cheng@intel.com>
  • 19. OmapManager -- BtreeOmapManager ■ crimson/os/seastore/omap_manager/btree/ ■ Implements omap storage for each object. ■ Fairly straightforward btree mapping string keys to string values. ■ Contributed mainly by chunmei-liu <chunmei.liu@intel.com>
  • 20. ObjectDataHandler ■ crimson/os/seastore/object_data_handler.h/cc ■ Each onode maps a contiguous, sparse range of logical addresses to that object’s offsets ■ Leverages the LBAManager to avoid a secondary extent map. ■ Clone support requires work to enable relocating logical extent addresses (TODO)
  • 21. TransactionManager ■ crimson/os/seastore/TransactionManager.h/cc ■ Implements a uniform transactional interface for allocating, reading, and mutating logically addressed extents. ■ Mutations to extents can be expressed as a compact type dependent “delta” which will be included transparently in the commit journal record. ● For example, BtreeOmapManager represents the insertion of a key into a block by encoding the key/value pair rather than needing to rewrite the full block. ■ Components using TransactionManager can ignore extent relocation entirely.
  • 22. LBAManager -- BtreeLBAManager ■ crimson/os/seastore/lba_manager/btree/ ■ Btree mapping logical addresses to physical offsets ■ Includes reference counts to be used with clone. ■ Will include extent checksums (TODO)
  • 23. Cache ■ crimson/os/seastore/lba_manager/cache.h/cc ■ Manages in-memory extents, keeps dirty extents pinned in memory ■ Detects conflicts in transactions to be committed
  • 24. SegmentCleaner ■ crimson/os/seastore/lba_manager/segment_cleaner ■ Tracks usage status of segments. ■ Runs a background process (within the same reactor) to choose a segment, relocate any live extents, and release it. ● Logical extents are simply remapped within the LBAManager ● Physical extents (mainly BtreeLBAManager extents) need any references updated as well (mainly btree parents) ■ Also responsible for throttling foreground work based on pending gc work -- goal is to avoid abrupt gc pauses and smooth out latency.
  • 25. Current Status and Future Work ■ Can boot an osd via vstart without snapshots and complete upwards of several minutes of IO before crashing! ■ Performance/stability: stabilize, add to ceph upstream testing infrastructure ■ Add ability to remap location of logical extents ■ In place mutation support for fast nvme devices ■ Persistent memory support in Cache ■ Tiering
  • 26. Brought to you by Samuel Just Senior Principal Software Engineer at Red Hat Email: sjust@redhat.com