Ceph, being a distributed storage system, is highly reliant on the network for resiliency and performance. In addition, it is crucial that the network topology beneath a Ceph cluster be designed to facilitate easy scaling without service disruption. After an introduction to Ceph itself, this talk dives into the design of Ceph client and cluster network topologies.
4. THE PROBLEM
Existing systems don't scale
Increasing cost and complexity
Need to invest in new platforms ahead of time
[Chart: growth of data vs. IT storage budget, 2010 to 2020]
7. INTRO TO CEPH
Distributed storage system
Horizontally scalable
No single point of failure
Self-healing and self-managing
Runs on commodity hardware
LGPLv2.1 license
9. SERVICE COMPONENTS (PART 1)
MONITOR
PAXOS for consensus
Maintain cluster state
Typically 3-5 nodes
NOT in write path
OSD
Object storage interface
Gossips with peers
Data lives here
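A minimal sketch of this split using the python-rados bindings: management queries such as cluster status go to the monitors, while object reads and writes go straight to OSDs. The conffile path and the JSON "status" command are assumptions for illustration.

    import json
    import rados

    # Connect using a local ceph.conf (path is an assumption for this sketch).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Ask the monitors for cluster status; monitors answer management
    # queries like this but are not in the object read/write path.
    cmd = json.dumps({"prefix": "status", "format": "json"})
    ret, out, errs = cluster.mon_command(cmd, b'')
    print(json.loads(out) if ret == 0 else errs)

    cluster.shutdown()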
10. SERVICE COMPONENTS (PART 2)
RADOS GATEWAY
Provides S3/Swift compatibility
Scale out
METADATA
Manages metadata for CephFS
Metadata stored in RADOS; not in the data path
Dynamic subtree partitioning
11. CRUSH
Ceph uses CRUSH for data placement
Aware of cluster topology
Statistically even distribution across pool
Supports asymmetric nodes and devices
Hierarchical weighting (see the sketch below)
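A toy, CRUSH-inspired weighted placement function in Python; the hashing, host names, and weights are invented for illustration, and this is not the real CRUSH/straw2 code:

    import hashlib
    import math

    def weighted_choose(object_name, candidates):
        # Toy CRUSH-style selection: each candidate gets a deterministic
        # pseudo-random draw scaled by its weight and the highest draw wins.
        # Selection probability is proportional to weight, and changing one
        # weight only remaps the objects that candidate wins or loses.
        best, best_draw = None, None
        for name, weight in candidates.items():
            h = hashlib.sha256(f"{object_name}/{name}".encode()).hexdigest()
            u = (int(h, 16) % 2**32 + 1) / (2**32 + 1)   # uniform in (0, 1)
            draw = math.log(u) / weight                   # exponential race
            if best_draw is None or draw > best_draw:
                best, best_draw = name, draw
        return best

    # Asymmetric weights, e.g. host-c has twice the drive capacity (assumed).
    hosts = {"host-a": 1.0, "host-b": 1.0, "host-c": 2.0}
    print(weighted_choose("rbd_data.1234", hosts))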
13. POOLS
Groupings of OSDs
Both physical and logical
Volumes / Images
Hot SSD pool
Cold SATA pool
DMCrypt pool
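A minimal python-rados sketch of creating and listing pools; the pool name 'hot-ssd' is only an example, and steering a pool to SSD or SATA devices is done separately with CRUSH rules:

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Create a pool if it does not already exist (name is an example only).
    if not cluster.pool_exists('hot-ssd'):
        cluster.create_pool('hot-ssd')

    print(cluster.list_pools())
    cluster.shutdown()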
14. REPLICATION
Original data durability mechanism
Ceph creates N replicas of each RADOS object
Uses CRUSH to determine replica placement
Required for mutable objects (RBD, CephFS)
More reasonable for smaller installations
15. ERASURE CODING
(8:4) MDS (maximum distance separable) code in this example
1.5x overhead
8 units of client data to write
4 parity units generated using FEC
All 12 units placed with CRUSH
Any 8 of the 12 units can satisfy a read
Arrives with the Firefly release
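The overhead figures follow directly from the chunk counts; a quick worked check in Python using the numbers above:

    def storage_overhead(data_units, coding_units):
        # Raw bytes stored per byte of usable data.
        return (data_units + coding_units) / data_units

    print(storage_overhead(8, 4))   # 8:4 erasure coding -> 1.5x
    print(storage_overhead(1, 2))   # 3-way replication  -> 3.0x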
16. CLIENT COMPONENTS
Native API
Mutable object store
Many language bindings
Object classes
CephFS
Linux Kernel CephFS client since 2.6.34
FUSE client
Hadoop JNI bindings
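A minimal sketch of the native API via the python-rados bindings; the pool and object names are placeholders:

    import rados

    # Connect with the local ceph.conf and default keyring.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on a pool and do a simple mutable-object round trip.
    ioctx = cluster.open_ioctx('rbd')          # pool name is a placeholder
    ioctx.write_full('hello-object', b'hello from librados')
    print(ioctx.read('hello-object'))

    ioctx.close()
    cluster.shutdown()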
17. CLIENT COMPONENTS
Block Storage
Linux Kernel RBD client since 2.6.37
KVM/QEMU integration
Xen integration
S3/Swift
RESTful interfaces (HTTP)
CRUD operations
Usage accounting for billing
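Because the gateway speaks plain S3, stock S3 clients work against it; a sketch using the boto library, with a placeholder endpoint and credentials:

    import boto
    import boto.s3.connection

    # Endpoint and credentials are placeholders for a RADOS Gateway instance.
    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='rgw.example.com',
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    # Basic CRUD: create a bucket, write an object, read it back.
    bucket = conn.create_bucket('demo-bucket')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('hello from radosgw')
    print(key.get_contents_as_string())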
19. INFINIBAND
Currently only supported via IPoIB
Accelio (libxio) integration in Ceph is in early stages
Accelio supports multiple transports: RDMA, TCP, and shared memory
Accelio supports multiple RDMA transports (IB, RoCE, iWARP)
20. ETHERNET
Tried and true
Proven at scale
Economical
Many suitable vendors
21. 10GbE or 1GbE
Cost of 10GbE trending downward
White box switches turning up heat on vendors
Twinax relatively inexpensive and low power
SFP+ is versatile with respect to distance
Single 10GbE for object
Dual 10GbE for block storage (public/cluster)
Bonding many 1GbE links adds lots of complexity
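The public/cluster split mentioned above maps to two ceph.conf settings; a small Python sketch that writes an example [global] section with configparser (the subnets are assumptions):

    import configparser

    conf = configparser.ConfigParser()
    conf['global'] = {
        # Client-facing traffic (first 10GbE link); subnet is an assumption.
        'public network': '10.1.0.0/24',
        # Replication, recovery and backfill (second 10GbE link).
        'cluster network': '10.2.0.0/24',
    }

    # Writes "public network = 10.1.0.0/24" style lines, as ceph.conf expects.
    with open('ceph.conf.example', 'w') as f:
        conf.write(f)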
22. IPv4 or IPv6 Native
It’s 2014, is this really a question?
Ceph fully supports both modes of operation
Hierarchical allocation models allow “roll up” of routes
Keeps the RIB (routing information base) efficient
Some tools believe the earth is flat (i.e., assume a flat network)
23. LAYER 2
Spanning tree
Switch table size
Broadcast domains (ARP)
MAC frame checksum
Storage protocols (FCoE, ATAoE)
TRILL, MLAG
Layer 2 DCI (data center interconnect) is crazy pants
Layer 2 tunneled over internet is super crazy pants
24. LAYER 3
Address and subnet planning
Proven scale at big web shops
Error detection in flight covers only the packet header (payload relies on the end-to-end TCP checksum)
Equal cost multi-path (ECMP)
Reasonable for inter-site connectivity
26. CLIENT TOPOLOGIES
Path diversity for resiliency
Minimize network diameter
Consistent hop count to minimize long-tail network latency
Ease of scaling
Tolerate adversarial traffic patterns (fan-in/fan-out)
27. FOLDED CLOS
Sometimes called Fat Tree or Spine and Leaf
Minimum 4 fixed switches, grows to 10k+ node fabrics
Rack or cluster oversubscription possible
Non-blocking also possible
[Diagram: folded Clos fabric, spine switches (S) connected to every leaf, hosts 1..N per leaf, illustrating path diversity]
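A rough sizing sketch for a two-tier folded Clos built from fixed switches; the port counts here are examples, not recommendations:

    def leaf_spine(leaf_ports, spine_ports, uplinks_per_leaf):
        # Every leaf connects to every spine, so spine port count caps the
        # number of leaves and the uplink count sets the number of spines.
        leaves = spine_ports
        spines = uplinks_per_leaf
        hosts_per_leaf = leaf_ports - uplinks_per_leaf
        hosts = leaves * hosts_per_leaf
        oversubscription = hosts_per_leaf / uplinks_per_leaf
        return hosts, leaves, spines, oversubscription

    print(leaf_spine(48, 32, 6))    # oversubscribed: 1344 hosts at 7:1
    print(leaf_spine(48, 32, 24))   # non-blocking:    768 hosts at 1:1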
29. REPLICA TOPOLOGIES
Replica and erasure fan-out
Recovery and remap impact on cluster bandwidth
OSD peering
Backfill served from primary
Tune backfills to avoid large fan-in
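A back-of-the-envelope sketch of how replica fan-out loads the cluster network (the 1 GB/s client write rate is an arbitrary example):

    def cluster_network_load(client_write_rate, replicas):
        # The primary OSD forwards (replicas - 1) copies over the cluster
        # network, so east-west traffic is a multiple of the client writes.
        return client_write_rate * (replicas - 1)

    print(cluster_network_load(1.0, 3))   # 1 GB/s of writes -> ~2 GB/s east-west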
30. FOLDED CLOS
Sometimes called Fat Tree or Spine and Leaf
Minimum 4, grows to 10k+ node fabrics
Rack or cluster oversubscription possible
Non-blocking also possible
[Diagram: folded Clos fabric (repeated), spine switches (S) connected to every leaf, hosts 1..N per leaf]
34. FEATURES
Buffer sizes
Cut through vs store and forward
Oversubscribed vs non-blocking
Automation and monitoring
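The cut-through versus store-and-forward trade-off is easy to quantify: store-and-forward pays one full frame serialization delay per hop. A quick sketch with example frame sizes:

    def serialization_delay_us(frame_bytes, link_gbps):
        # Time to clock one frame onto the wire; store-and-forward adds this
        # per hop, cut-through forwards after reading only the header.
        return frame_bytes * 8 / (link_gbps * 1000)

    print(serialization_delay_us(1500, 10))   # ~1.2 us per hop at 10GbE
    print(serialization_delay_us(9000, 10))   # ~7.2 us per hop, jumbo frames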
35. FIXED
Fixed switches can easily build large clusters
Easier to source
Smaller failure domains
Fixed designs have many control planes
Virtual chassis... L3 split-brain hilarity?
36. FEWER SKUs
Utilize as few vendor SKUs as possible
If permitted, use same fixed switch for spine and leaf
More affordable to keep spares (or more of them) on site
Quicker MTTR when replacement gear is ready to go