Ceph, being a distributed storage system, is highly reliant on the network for resiliency and performance. In addition, it is crucial that the network topology beneath a Ceph cluster be designed to facilitate easy scaling without service disruption. After an introduction to Ceph itself, this talk dives into the design of Ceph client and cluster network topologies.
4. THE PROBLEM
Existing systems don't scale
Increasing cost and complexity
Need to invest in new platforms ahead of time
[Chart: growth of data vs. IT storage budget, 2010 to 2020]
7. INTRO TO CEPH
Distributed storage system
Horizontally scalable
No single point of failure
Self-healing and self-managing
Runs on commodity hardware
LGPLv2.1 license
9. SERVICE COMPONENTS (PART 1)
MONITOR
PAXOS for consensus
Maintain cluster state
Typically 3-5 nodes
NOT in write path
OSD
Object storage interface
Gossips with peers
Data lives here
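A minimal sketch of this split using the python-rados bindings: management queries such as cluster status go to the monitors, while object reads and writes go straight to OSDs. The conffile path and the JSON "status" command are assumptions for illustration.

    import json
    import rados

    # Connect using a local ceph.conf (path is an assumption for this sketch).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Ask the monitors for cluster status; monitors answer management
    # queries like this but are not in the object read/write path.
    cmd = json.dumps({"prefix": "status", "format": "json"})
    ret, out, errs = cluster.mon_command(cmd, b'')
    print(json.loads(out) if ret == 0 else errs)

    cluster.shutdown()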
10. SERVICE COMPONENTS (PART 2)
RADOS GATEWAY
Provides S3/Swift compatibility
Scale out
METADATA
Manages metadata for CephFS
Metadata stored in RADOS; not in the data path
Dynamic subtree partitioning
11. CRUSH
Ceph uses CRUSH for data placement
Aware of cluster topology
Statistically even distribution across pool
Supports asymmetric nodes and devices
Hierarchical weighting (see the sketch below)
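A toy, CRUSH-inspired weighted placement function in Python; the hashing, host names, and weights are invented for illustration, and this is not the real CRUSH/straw2 code:

    import hashlib
    import math

    def weighted_choose(object_name, candidates):
        # Toy CRUSH-style selection: each candidate gets a deterministic
        # pseudo-random draw scaled by its weight and the highest draw wins.
        # Selection probability is proportional to weight, and changing one
        # weight only remaps the objects that candidate wins or loses.
        best, best_draw = None, None
        for name, weight in candidates.items():
            h = hashlib.sha256(f"{object_name}/{name}".encode()).hexdigest()
            u = (int(h, 16) % 2**32 + 1) / (2**32 + 1)   # uniform in (0, 1)
            draw = math.log(u) / weight                   # exponential race
            if best_draw is None or draw > best_draw:
                best, best_draw = name, draw
        return best

    # Asymmetric weights, e.g. host-c has twice the drive capacity (assumed).
    hosts = {"host-a": 1.0, "host-b": 1.0, "host-c": 2.0}
    print(weighted_choose("rbd_data.1234", hosts))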
13. POOLS
Groupings of OSDs
Both physical and logical
Volumes / Images
Hot SSD pool
Cold SATA pool
DMCrypt pool
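A minimal python-rados sketch of creating and listing pools; the pool name 'hot-ssd' is only an example, and steering a pool to SSD or SATA devices is done separately with CRUSH rules:

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Create a pool if it does not already exist (name is an example only).
    if not cluster.pool_exists('hot-ssd'):
        cluster.create_pool('hot-ssd')

    print(cluster.list_pools())
    cluster.shutdown()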
14. REPLICATION
Original data durability mechanism
Ceph creates N replicas of each RADOS object
Uses CRUSH to determine replica placement
Required for mutable objects (RBD, CephFS)
More reasonable for smaller installations
15. ERASURE CODING
(8:4) MDS (maximum distance separable) code in this example
1.5x overhead
8 units of client data to write
4 parity units generated using FEC
All 12 units placed with CRUSH
Any 8 of the 12 units can satisfy a read
Arrives with the Firefly release
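The overhead figures follow directly from the chunk counts; a quick worked check in Python using the numbers above:

    def storage_overhead(data_units, coding_units):
        # Raw bytes stored per byte of usable data.
        return (data_units + coding_units) / data_units

    print(storage_overhead(8, 4))   # 8:4 erasure coding -> 1.5x
    print(storage_overhead(1, 2))   # 3-way replication  -> 3.0x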
16. CLIENT COMPONENTS
Native API
Mutable object store
Many language bindings
Object classes
CephFS
Linux Kernel CephFS client since 2.6.34
FUSE client
Hadoop JNI bindings
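A minimal sketch of the native API via the python-rados bindings; the pool and object names are placeholders:

    import rados

    # Connect with the local ceph.conf and default keyring.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on a pool and do a simple mutable-object round trip.
    ioctx = cluster.open_ioctx('rbd')          # pool name is a placeholder
    ioctx.write_full('hello-object', b'hello from librados')
    print(ioctx.read('hello-object'))

    ioctx.close()
    cluster.shutdown()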
17. CLIENT COMPONENTS
Block Storage
Linux Kernel RBD client since 2.6.37
KVM/QEMU integration
Xen integration
S3/Swift
RESTful interfaces (HTTP)
CRUD operations
Usage accounting for billing
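Because the gateway speaks plain S3, stock S3 clients work against it; a sketch using the boto library, with a placeholder endpoint and credentials:

    import boto
    import boto.s3.connection

    # Endpoint and credentials are placeholders for a RADOS Gateway instance.
    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='rgw.example.com',
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    # Basic CRUD: create a bucket, write an object, read it back.
    bucket = conn.create_bucket('demo-bucket')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('hello from radosgw')
    print(key.get_contents_as_string())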
19. INFINIBAND
Currently only supported via IPoIB
Accelio (libxio) integration in Ceph is in early stages
Accelio supports multiple transports: RDMA, TCP, and shared memory
Accelio supports multiple RDMA transports (IB, RoCE, iWARP)
20. ETHERNET
Tried and true
Proven at scale
Economical
Many suitable vendors
21. 10GbE or 1GbE
Cost of 10GbE trending downward
White box switches turning up heat on vendors
Twinax relatively inexpensive and low power
SFP+ is versatile with respect to distance
Single 10GbE for object
Dual 10GbE for block storage (public/cluster)
Bonding many 1GbE links adds lots of complexity
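The public/cluster split mentioned above maps to two ceph.conf settings; a small Python sketch that writes an example [global] section with configparser (the subnets are assumptions):

    import configparser

    conf = configparser.ConfigParser()
    conf['global'] = {
        # Client-facing traffic (first 10GbE link); subnet is an assumption.
        'public network': '10.1.0.0/24',
        # Replication, recovery and backfill (second 10GbE link).
        'cluster network': '10.2.0.0/24',
    }

    # Writes "public network = 10.1.0.0/24" style lines, as ceph.conf expects.
    with open('ceph.conf.example', 'w') as f:
        conf.write(f)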
22. IPv4 or IPv6 Native
It’s 2014, is this really a question?
Ceph fully supports both modes of operation
Hierarchical allocation models allow “roll up” of routes
Keeps the RIB (routing information base) efficient
Some tools believe the earth is flat (i.e., assume a flat network)
23. LAYER 2
Spanning tree
Switch table size
Broadcast domains (ARP)
MAC frame checksum
Storage protocols (FCoE, ATAoE)
TRILL, MLAG
Layer 2 DCI (data center interconnect) is crazy pants
Layer 2 tunneled over internet is super crazy pants
24. LAYER 3
Address and subnet planning
Proven scale at big web shops
Error detection in flight covers only the packet header (payload relies on the end-to-end TCP checksum)
Equal cost multi-path (ECMP)
Reasonable for inter-site connectivity
26. CLIENT TOPOLOGIES
Path diversity for resiliency
Minimize network diameter
Consistent hop count to minimize long-tail network latency
Ease of scaling
Tolerate adversarial traffic patterns (fan-in/fan-out)
27. FOLDED CLOS
Sometimes called Fat Tree or Spine and Leaf
Minimum 4 fixed switches, grows to 10k+ node fabrics
Rack or cluster oversubscription possible
Non-blocking also possible
[Diagram: folded Clos fabric, spine switches (S) connected to every leaf, hosts 1..N per leaf, illustrating path diversity]
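A rough sizing sketch for a two-tier folded Clos built from fixed switches; the port counts here are examples, not recommendations:

    def leaf_spine(leaf_ports, spine_ports, uplinks_per_leaf):
        # Every leaf connects to every spine, so spine port count caps the
        # number of leaves and the uplink count sets the number of spines.
        leaves = spine_ports
        spines = uplinks_per_leaf
        hosts_per_leaf = leaf_ports - uplinks_per_leaf
        hosts = leaves * hosts_per_leaf
        oversubscription = hosts_per_leaf / uplinks_per_leaf
        return hosts, leaves, spines, oversubscription

    print(leaf_spine(48, 32, 6))    # oversubscribed: 1344 hosts at 7:1
    print(leaf_spine(48, 32, 24))   # non-blocking:    768 hosts at 1:1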
29. REPLICA TOPOLOGIES
Replica and erasure fan-out
Recovery and remap impact on cluster bandwidth
OSD peering
Backfill served from primary
Tune backfills to avoid large fan-in
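A back-of-the-envelope sketch of how replica fan-out loads the cluster network (the 1 GB/s client write rate is an arbitrary example):

    def cluster_network_load(client_write_rate, replicas):
        # The primary OSD forwards (replicas - 1) copies over the cluster
        # network, so east-west traffic is a multiple of the client writes.
        return client_write_rate * (replicas - 1)

    print(cluster_network_load(1.0, 3))   # 1 GB/s of writes -> ~2 GB/s east-west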
30. FOLDED CLOS
Sometimes called Fat Tree or Spine and Leaf
Minimum 4, grows to 10k+ node fabrics
Rack or cluster oversubscription possible
Non-blocking also possible
[Diagram: folded Clos fabric (repeated), spine switches (S) connected to every leaf, hosts 1..N per leaf]
34. FEATURES
Buffer sizes
Cut through vs store and forward
Oversubscribed vs non-blocking
Automation and monitoring
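The cut-through versus store-and-forward trade-off is easy to quantify: store-and-forward pays one full frame serialization delay per hop. A quick sketch with example frame sizes:

    def serialization_delay_us(frame_bytes, link_gbps):
        # Time to clock one frame onto the wire; store-and-forward adds this
        # per hop, cut-through forwards after reading only the header.
        return frame_bytes * 8 / (link_gbps * 1000)

    print(serialization_delay_us(1500, 10))   # ~1.2 us per hop at 10GbE
    print(serialization_delay_us(9000, 10))   # ~7.2 us per hop, jumbo frames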
35. FIXED
Fixed switches can easily build large clusters
Easier to source
Smaller failure domains
Fixed designs have many control planes
Virtual chassis... L3 split-brain hilarity?
36. FEWER SKUs
Utilize as few vendor SKUs as possible
If permitted, use same fixed switch for spine and leaf
More affordable to keep spares (or more of them) on site
Quicker MTTR when replacement gear is ready to go