This document introduces Ceph's multi-site mirroring feature for disaster recovery of RBD block storage. It uses a journal-based approach to log all image modifications and asynchronously replicate the journal across data centers. The rbd-mirror daemon is responsible for replaying the journal on remote clusters to achieve consistency. This allows RBD workloads to fail over and fail back seamlessly between sites, providing protection against data center failures.
Disaster Recovery with Ceph Block Storage Multi-Site Mirroring
1. Disaster Recovery and Ceph Block Storage
Introducing Multi-Site Mirroring
Jason Dillaman
RBD Project Technical Lead
Vault 2017
2. WHAT IS CEPH ALL ABOUT
▪ Software-defined distributed storage
▪ All components scale horizontally
▪ No single point of failure
▪ Self-manage whenever possible
▪ Hardware agnostic, commodity hardware
▪ Object, block, and file in a single cluster
▪ Open source (LGPL)
3. CEPH COMPONENTS
▪ RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
▪ LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
▪ RGW (object): a web services gateway for object storage, compatible with S3 and Swift
▪ RBD (block): a reliable, fully-distributed block device with cloud platform integration
▪ CEPHFS (file): a distributed file system with POSIX semantics and scale-out metadata management
6. MIRRORING MOTIVATION
▪ Massively scalable and fault tolerant design
▪ What about data center failures?
▪ Data is the “special sauce”
▪ Failure to plan is planning to fail
▪ Snapshot-based incremental backups (see the example after this list)
▪ Desire online, continuous backups
▪ a la RBD Mirroring
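For contrast, a minimal sketch of the snapshot-based incremental backup approach mentioned above. The pool, image, snapshot, and cluster names are hypothetical, and exact flags may vary by release:
$ rbd snap create rbd/vm-disk1@backup1                                     # take an initial snapshot
$ rbd export-diff rbd/vm-disk1@backup1 backup1.diff                        # export everything up to the first snapshot
$ rbd snap create rbd/vm-disk1@backup2                                     # later, take another snapshot
$ rbd export-diff --from-snap backup1 rbd/vm-disk1@backup2 backup2.diff    # export only the changes between snapshots
$ rbd --cluster backup import-diff backup2.diff rbd/vm-disk1               # replay the delta onto the copy on a backup cluster
This works, but it is periodic rather than continuous; mirroring is the online alternative the rest of the deck describes.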
7. MIRRORING DESIGN PRINCIPLES
▪ Replication needs to be asynchronous
▪ IO shouldn’t be blocked by a slow WAN link
▪ Tolerate transient connectivity issues
▪ Replication needs to be crash consistent
▪ Respect write barriers
▪ Easy management
▪ Expect that failure can happen at any time
8. MIRRORING DESIGN FOUNDATION
▪ Journal-based approach to log all modifications
▪ Supports access by multiple clients
▪ Client-side operation
▪ Event logs are appended to journal objects
▪ All image modifications are delayed until the event is safe
▪ Journal events are committed once the image modification is safe
▪ Provides an ordered view of all updates to an image
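The journal is visible through the rbd CLI. A minimal sketch for inspecting an image's journal, using hypothetical pool and image names (exact flags may vary by release):
$ rbd journal info --pool rbd --image vm-disk1               # basic journal metadata for the image
$ rbd journal status --pool rbd --image vm-disk1             # journal positions and registered clients
$ rbd journal inspect --pool rbd --image vm-disk1 --verbose  # walk the individual journal entries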
9. JOURNAL DESIGN
▪ Typical IO path with journaling (in-memory cache enabled):
a. Create an event to describe the update
b. Asynchronously append the event to a journal object
c. Update the in-memory image cache
d. Block cache writeback for the affected extent
e. Complete the IO to the client
f. Unblock writeback once the event is safe
10. JOURNAL DESIGN
▪ IO path with journaling, without the cache:
a. Create an event to describe the update
b. Asynchronously append the event to a journal object
c. Asynchronously update the image once the event is safe
d. Complete the IO to the client once the update is safe
12. MIRRORING OVERVIEW
▪ Mirroring peers are configured per pool
▪ Enabling mirroring is a per-image property
▪ Requires that the journaling image feature is enabled
▪ An image is either primary (R/W) or non-primary (R/O)
▪ Replication is handled by the rbd-mirror daemon
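A quick way to see these properties, assuming a pool named "rbd" and an image named "vm-disk1" (both hypothetical):
$ rbd mirror pool info rbd               # mirroring mode and configured peers for the pool
$ rbd info rbd/vm-disk1                  # feature list should include "journaling"; recent releases also report mirroring state and whether the image is primary
$ rbd mirror image status rbd/vm-disk1   # per-image replication state as seen by rbd-mirror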
13. RBD-MIRROR DAEMON
▪ Requires access to remote and local clusters
▪ Responsible for image (re)sync
▪ Replays journal to achieve consistency
▪ Pulls journal events from the remote cluster
▪ Applies events to the local cluster
▪ Commits journal events
▪ Trims the journal
▪ Transparently handles failover/failback
▪ Two-way replication between two sites
▪ One-way replication between N sites
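A sketch of a planned failover and failback at the image level, assuming clusters named "site-a" (current primary) and "site-b", and an image "rbd/vm-disk1" (names are illustrative):
$ rbd --cluster site-a mirror image demote rbd/vm-disk1     # demote the image on the old primary site
$ rbd --cluster site-b mirror image promote rbd/vm-disk1    # promote it on the peer; it becomes R/W there
# If site-a is unreachable, force-promote on site-b instead:
$ rbd --cluster site-b mirror image promote --force rbd/vm-disk1
# After a forced promotion, the old primary must be resynchronized before failing back:
$ rbd --cluster site-a mirror image resync rbd/vm-disk1
There are also pool-level "rbd mirror pool promote" and "rbd mirror pool demote" commands for whole-pool failover.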
15. MIRRORING SETUP
▪ Deploy rbd-mirror daemon on each cluster
▪ Jewel/Kraken only support a single active daemon per cluster
▪ Provide a uniquely named “ceph.conf” for each remote cluster
▪ Create pools with same name on each cluster
▪ Enable pool mirroring (rbd mirror pool enable)
▪ Specify peer cluster via rbd CLI (rbd mirror pool peer add)
▪ Enable image journaling feature (rbd feature enable journaling)
▪ Enable image mirroring (rbd mirror image enable)
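Putting the steps above together, a minimal sketch using hypothetical cluster names "site-a" and "site-b", the default "rbd" pool, and an image "vm-disk1" (daemon deployment details vary by distribution):
# 1. Start an rbd-mirror daemon on each cluster (e.g. via the distribution's ceph-rbd-mirror service)
# 2. Enable mirroring on the pool, in per-image mode, on both sides:
$ rbd --cluster site-a mirror pool enable rbd image
$ rbd --cluster site-b mirror pool enable rbd image
# 3. Register each cluster as the other's peer:
$ rbd --cluster site-a mirror pool peer add rbd client.admin@site-b
$ rbd --cluster site-b mirror pool peer add rbd client.admin@site-a
# 4. Turn on journaling (requires exclusive-lock) and mirroring for the image:
$ rbd --cluster site-a feature enable rbd/vm-disk1 exclusive-lock journaling
$ rbd --cluster site-a mirror image enable rbd/vm-disk1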
17. CAVEATS
▪ Write IOs have a worst-case 2x performance hit
▪ Journal event append
▪ Image object write
▪ The in-memory cache can mask the hit if the working set fits
▪ Only supported by librbd-based clients
18. MITIGATION
▪ Use a small SSD/NVMe-backed pool for journals
▪ ‘rbd journal pool = <fast pool name>’
▪ Batch multiple events into a single journal append
▪ ‘rbd journal object flush age = <seconds>’
▪ Increase journal data width to match queue depth
▪ ‘rbd journal splay width = <number of objects>’
▪ Future work: potentially parallelize journal append + image write between write barriers
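These options live in ceph.conf on the librbd client side. A sketch with example values (the pool name and numbers are illustrative, not recommendations):
[client]
    rbd journal pool = fast-nvme-pool       # hypothetical SSD/NVMe-backed pool holding the journal objects
    rbd journal object flush age = 0.5      # batch events and flush roughly every half second
    rbd journal splay width = 4             # number of journal objects appended to in parallel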
19. FUTURE FEATURES
▪ Active/Active rbd-mirror daemons
▪ Deferred replication and deletion
▪ “Deep Scrub” of replicated images
▪ Smarter image resynchronization
▪ Improved health status reporting
▪ Improved pool promotion process
20. BE PREPARED
▪ Design for failure
▪ Ceph now provides additional tools for full-scale disaster recovery
▪ RBD workloads can seamlessly relocate between geographic sites
▪ Feedback is welcome