This document discusses the future of CephFS, the distributed file system component of Ceph. It describes plans to improve dynamic subtree partitioning so that metadata load is balanced across servers, to enhance failure recovery, and to scale the metadata cluster. It also covers improvements to the client protocol, snapshot and recursive accounting capabilities, and support for multiple client implementations, such as the Linux kernel client and ceph-fuse. The goal is to test these enhancements and continue expanding CephFS integrations and features.
2. RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT
5. Metadata Server
● Manages metadata for a POSIX-compliant shared filesystem
  – Directory hierarchy
  – File metadata (owner, timestamps, mode, etc.)
● Stores metadata in RADOS
● Does not serve file data to clients
● Only required for shared filesystem
6. legacy metadata storage
● a scaling disaster
● name → inode → block list → data
● no inode table locality
● fragmentation
  – inode table
  – directory
● many seeks
● difficult to partition
[diagram: directory tree — usr, etc, var, home, … containing files such as vmlinuz, passwd, mtab, hosts, lib, include, bin]
7. ceph fs metadata storage
● block lists unnecessary
● inode table mostly useless
  – APIs are path-based, not inode-based
  – no random table access, sloppy caching
● embed inodes inside directories
  – good locality, prefetching
  – leverage key/value object
[diagram: directory objects with embedded inodes (e.g., 1, 100, 102) for usr, etc, var, home, … and their files]
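The idea above can be sketched in a few lines. This is a toy model (not ceph-mds code; the class and field names are illustrative): a directory is one key/value object mapping each name to an embedded inode record, so a single object read serves both the dentry lookup and the stat, with no separate inode-table seek.

```python
# Toy model of an embedded-inode directory object (names are illustrative).
class DirObject:
    """Directory stored as one key/value object: name -> inode record."""
    def __init__(self):
        self.entries = {}  # key/value pairs held inside the object

    def link(self, name, ino, mode, size=0):
        self.entries[name] = {"ino": ino, "mode": mode, "size": size}

    def lookup(self, name):
        # One key fetch returns the full inode: no inode-table access.
        return self.entries.get(name)

    def readdir(self):
        # Reading the object prefetches every child's metadata at once.
        return sorted(self.entries.items())

etc_dir = DirObject()
etc_dir.link("passwd", ino=100, mode=0o644, size=1024)
etc_dir.link("hosts", ino=102, mode=0o644, size=158)

print(etc_dir.lookup("passwd")["ino"])              # 100
print([name for name, _ in etc_dir.readdir()])      # ['hosts', 'passwd']
```

Because children are stored next to their names, a `readdir` followed by `stat` of each file (the classic `ls -al` pattern) needs no extra reads.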
8. controlling metadata io
● view ceph-mds as cache
● reduce reads
  – dir+inode prefetching
● reduce writes
  – consolidate multiple writes
● large journal or log
  – stripe over objects
● two tiers
  – journal for short term
  – per-directory for long term
● fast failure recovery
[diagram: journal and per-directory objects]
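The two-tier write path can be sketched as follows. This is an assumed simplification, not ceph-mds internals: updates commit cheaply to an append-only journal, and repeated updates to the same metadata are consolidated when flushed to the long-term per-directory store, so many journal entries can become one directory write.

```python
# Sketch of a two-tier metadata write path (assumed behavior).
journal = []       # short term: append-only log, striped over objects
directories = {}   # long term: per-directory objects

def update(path, field, value):
    # Fast commit: a cheap sequential append, no random I/O.
    journal.append((path, field, value))

def flush():
    # Consolidate: only the latest value per (path, field) is written
    # back, turning N journal entries into one directory update.
    latest = {}
    for path, field, value in journal:
        latest[(path, field)] = value
    for (path, field), value in latest.items():
        directories.setdefault(path, {})[field] = value
    journal.clear()

update("/etc/passwd", "mtime", 1)
update("/etc/passwd", "mtime", 2)
update("/etc/passwd", "mtime", 3)
flush()
print(directories["/etc/passwd"]["mtime"])  # 3 -- three updates, one write
```

Replaying the journal after a crash rebuilds the same consolidated state, which is what makes failure recovery fast.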
10. load distribution
● coarse (static subtree)
  – preserves locality
  – high management overhead
● fine (hash)
  – always balanced
  – less vulnerable to hot spots
  – destroys hierarchy, locality
● can a dynamic approach capture the benefits of both extremes?
[diagram: spectrum from static subtree (good locality) through hash directories to hash files (good balance)]
21. client protocol
● highly stateful
  – consistent, fine-grained caching
● seamless hand-off between ceph-mds daemons
  – when a client traverses the hierarchy
  – when metadata is migrated between servers
● direct access to OSDs for file I/O
22. an example
● mount -t ceph 1.2.3.4:/ /mnt
  – 3 ceph-mon RT
  – 2 ceph-mds RT (1 ceph-mds to ceph-osd RT)
● cd /mnt/foo/bar
  – 2 ceph-mds RT (2 ceph-mds to ceph-osd RT)
● ls -al
  – open
  – readdir: 1 ceph-mds RT (1 ceph-mds to ceph-osd RT)
  – stat each file
  – close
● cp * /tmp
  – N ceph-osd RT
[diagram: client round trips to ceph-mon, ceph-mds, and ceph-osd]
26. RADOS — AWESOME
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS — AWESOME
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD — AWESOME
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RADOSGW — AWESOME
A bucket-based REST gateway, compatible with S3 and Swift

CEPH FS — NEARLY AWESOME
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

APP APP HOST/VM CLIENT
27. Path forward
● Testing
  – Various workloads
  – Multiple active MDSs
● Test automation
  – Simple workload generator scripts
  – Bug reproducers
● Hacking
  – Bug squashing
  – Long-tail features
● Integrations
  – Ganesha, Samba, *stacks
30. object model
● pools
  – 1s to 100s
  – independent namespaces or object collections
  – replication level, placement policy
● objects
  – bazillions
  – blob of data (bytes to gigabytes)
  – attributes (e.g., “version=12”; bytes to kilobytes)
  – key/value bundle (bytes to gigabytes)
31. atomic transactions
● client operations are sent to the OSD cluster
  – operate on a single object
  – can contain a sequence of operations, e.g.:
    truncate object, write new object data, set attribute
● atomicity
  – all operations commit or do not commit atomically
● conditional
  – 'guard' operations can control whether the operation is performed
    (verify an xattr has a specific value; assert the object is a specific version)
  – allows atomic compare-and-swap, etc.
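The guard-then-commit semantics above can be sketched without a cluster. This is not the librados API; `Obj` and `apply_txn` are hypothetical stand-ins showing how guards abort a multi-op transaction before any side effect, yielding compare-and-swap behavior.

```python
import copy

class Obj:
    """Hypothetical single RADOS object: data, xattrs, and a version."""
    def __init__(self):
        self.data = b""
        self.xattrs = {}
        self.version = 0

def apply_txn(obj, guards, ops):
    # Guards run first: any failure aborts with no side effects.
    for guard in guards:
        if not guard(obj):
            return False
    # Stage the op sequence on a shadow copy so the commit is
    # all-or-nothing, then install the result atomically.
    shadow = copy.deepcopy(obj)
    for op in ops:
        op(shadow)
    shadow.version += 1
    obj.__dict__.update(shadow.__dict__)
    return True

o = Obj()
ok = apply_txn(
    o,
    guards=[lambda x: x.version == 0],           # assert object version
    ops=[lambda x: setattr(x, "data", b"new"),   # truncate + write data
         lambda x: x.xattrs.update(v="12")],     # set attribute
)
print(ok, o.data, o.version)  # True b'new' 1
```

Re-running the same transaction now fails its version guard, which is exactly the compare-and-swap pattern the slide describes.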
32. key/value storage
● store key/value pairs in an object
  – independent from object attrs or byte data payload
● based on google's leveldb
  – efficient random and range insert/query/removal
  – based on BigTable SSTable design
● exposed via key/value API
  – insert, update, remove
  – individual keys or ranges of keys
● avoid read/modify/write cycle for updating complex objects
  – e.g., file system directory objects
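A minimal sketch of those semantics, assuming a sorted key space as in leveldb (this is not the actual omap API; `OmapObject` is illustrative): single-key updates touch one entry instead of rewriting the object, and range queries scan keys in order.

```python
import bisect

class OmapObject:
    """Toy sorted key/value bundle attached to one object."""
    def __init__(self):
        self._keys = []   # kept sorted, as in an SSTable
        self._vals = {}

    def set(self, key, value):
        # Touches one key: no read/modify/write of the whole listing.
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def remove(self, key):
        if key in self._vals:
            del self._vals[key]
            self._keys.remove(key)

    def get_range(self, start, end):
        # Range query over [start, end) via binary search on sorted keys.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._vals[k]) for k in self._keys[lo:hi]]

d = OmapObject()
for ino, name in enumerate(["bin", "etc", "home", "usr", "var"], start=100):
    d.set(name, {"ino": ino})
print([k for k, _ in d.get_range("e", "u")])  # ['etc', 'home']
```

This is why directory objects fit so well: inserting one dentry into a million-entry directory costs one key insert, not a megabyte rewrite.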
33. watch/notify
● establish stateful 'watch' on an object
  – client interest persistently registered with object
  – client keeps session to OSD open
● send 'notify' messages to all watchers
  – notify message (and payload) is distributed to all watchers
  – variable timeout
  – notification on completion: all watchers got and acknowledged the notify
● use any object as a communication/synchronization channel
  – locking, distributed coordination (à la ZooKeeper), etc.
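The mechanism can be sketched as a tiny in-process model (not the librados watch/notify API; class and method names are illustrative): watchers register persistent interest in an object, and a notify fans its payload out to all of them and completes once every watcher has acknowledged.

```python
class WatchableObject:
    """Toy object used as a communication channel."""
    def __init__(self):
        self.watchers = {}              # watcher id -> callback

    def watch(self, watcher_id, callback):
        # Persistent registration; a real client would also keep its
        # OSD session open so the watch stays live.
        self.watchers[watcher_id] = callback

    def notify(self, payload):
        # Deliver the payload to every watcher, collecting acks;
        # completion means all watchers got and acknowledged it.
        acks = []
        for wid, cb in self.watchers.items():
            cb(payload)
            acks.append(wid)
        return acks

obj = WatchableObject()
seen = []
obj.watch("gw-1", seen.append)
obj.watch("gw-2", seen.append)
acks = obj.notify("invalidate bucket 'photos'")
print(len(acks), seen[0])  # 2 invalidate bucket 'photos'
```

Any object can serve this role, which is what makes the radosgw cache-consistency pattern on the next slide a one-object design.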
35. watch/notify example
● radosgw cache consistency
  – radosgw instances watch a single object (.rgw/notify)
  – locally cache bucket metadata
  – on bucket metadata changes (removal, ACL changes):
    write change to relevant bucket object, then
    send notify with bucket name to other radosgw instances
  – on receipt of notify:
    invalidate relevant portion of cache
36. rados classes
● dynamically loaded .so
  – /var/lib/rados-classes/*
  – implement new object “methods” using existing methods
  – part of I/O pipeline
  – simple internal API
● reads
  – can call existing native or class methods
  – do whatever processing is appropriate
  – return data
● writes
  – can call existing native or class methods
  – do whatever processing is appropriate
  – generates a resulting transaction to be applied atomically
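The read/write split above can be modeled in miniature. This is not the real cls API (which is C/C++ loaded as an .so); the registry and handlers below are hypothetical, showing how a read method transforms data in the I/O path while a write method computes the resulting object state to be applied as one transaction.

```python
methods = {}   # name -> handler, standing in for a loaded class .so

def register(name):
    def deco(fn):
        methods[name] = fn
        return fn
    return deco

@register("grep")
def cls_grep(obj_data, pattern):
    # Read-side method: filter the object's records, return only matches.
    return [line for line in obj_data.splitlines() if pattern in line]

@register("append_upper")
def cls_append_upper(obj_data, extra):
    # Write-side method: compute the resulting object contents, which
    # the store would then apply atomically as a transaction.
    return obj_data + extra.upper()

store = {"obj1": "error: disk full\nok: healthy\nerror: timeout"}
print(methods["grep"](store["obj1"], "error"))  # the two 'error' records
store["obj1"] = methods["append_upper"](store["obj1"], "\ndone")
```

The point of running these next to the data is that only the result (two matching lines, not the whole object) crosses the network.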
37. class examples
● grep
  – read an object, filter out individual records, and return those
● sha1
  – read object, generate fingerprint, return that
● images
  – rotate, resize, crop image stored in object
  – remove red-eye
● crypto
  – encrypt/decrypt object data with provided key
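The sha1 example is the simplest to sketch. This is the idea only, not the real class implementation: the fingerprint is computed next to the data on the OSD, so only 40 hex characters go over the wire instead of the object's bytes.

```python
import hashlib

def cls_sha1(obj_data: bytes) -> str:
    # Would run server-side in a rados class; the client receives
    # only the digest, never the object data itself.
    return hashlib.sha1(obj_data).hexdigest()

print(cls_sha1(b"hello"))  # aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
```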
39. ideas
● lua rados class
  – embed a lua interpreter in a rados class
  – ship semi-arbitrary code for operations
● json class
  – parse, manipulate json structures
40. ideas
● rados mailbox (RMB?)
  – plug a librados backend into dovecot, postfix, etc.
  – key/value object for each mailbox
    (key = message id, value = headers)
  – object for each message or attachment
  – watch/notify for delivery notification
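A sketch of how the pieces would fit (the names are illustrative; no such backend exists in the slide beyond the idea): one key/value object indexes a mailbox by message id, each message body lives in its own object, and delivery would trigger a notify on the mailbox object.

```python
mailbox_index = {}    # key/value object for the mailbox: id -> headers
message_objects = {}  # one object per message or attachment

def deliver(msg_id, headers, body):
    message_objects[msg_id] = body    # message body in its own object
    mailbox_index[msg_id] = headers   # index entry: key = id, value = headers
    # a watch/notify on the mailbox object would signal delivery here

deliver("m-001", {"From": "a@example.com", "Subject": "hi"}, b"hello")
print(mailbox_index["m-001"]["Subject"], len(message_objects))  # hi 1
```

Listing a mailbox then becomes a single range query on the index object, without touching any message bodies.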
42. hard links?
● rare
● useful locality properties
  – intra-directory
  – parallel inter-directory
● on a cache miss, file objects provide per-file backpointers
● degenerates to log(n) lookups
● optimistic read complexity
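The backpointer idea can be sketched as follows (assumed mechanics; the data layout is illustrative): with inodes embedded in directories, a hard link's inode lives under one parent, and the file object carries a backpointer naming a containing directory, so a lookup that misses the cache can still recover a valid path with one extra read.

```python
# Toy layout: the inode for 100 is embedded under /a/b, but a second
# link in /c also refers to it.
files = {100: {"data": b"...", "backpointer": "/a/b"}}
dirs = {"/a/b": {"link1": 100}, "/c": {"link2": 100}}

def lookup_by_ino(ino):
    # Cache miss: the file object itself names a containing directory,
    # so one extra read recovers a valid path to the embedded inode.
    parent = files[ino]["backpointer"]
    name = next(n for n, i in dirs[parent].items() if i == ino)
    return parent + "/" + name

print(lookup_by_ino(100))  # /a/b/link1
```

The common (optimistic) case never consults the backpointer at all; it only pays the extra lookups on a miss.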
Editor's notes
Finally, let’s talk about Ceph FS. Ceph FS is a parallel filesystem that provides a massively scalable, single-hierarchy, shared disk. If you use a shared drive at work, this is the same thing except that the same drive could be shared by everyone you’ve ever met (and everyone they’ve ever met).
Remember all that metadata we talked about in the beginning? Feels so long ago. It has to be stored somewhere! Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Enter the MDS, the Ceph Metadata Server. Clients accessing Ceph FS data first make a request to an MDS, which provides what they need to get files from the right OSDs.
There are multiple MDSs!
If you aren’t running Ceph FS, you don’t need to deploy metadata servers.
So how do you have one tree and multiple servers?
If there’s just one MDS (which is a terrible idea), it manages metadata for the entire tree.
When the second one comes along, it will intelligently partition the work by taking a subtree.
When the third MDS arrives, it will attempt to split the tree again.
Same with the fourth.
An MDS can even take just a single directory or file, if the load is high enough. This all happens dynamically based on load and the structure of the data, and it's called “dynamic subtree partitioning”.
Ceph FS is feature-complete but still lacks the testing, quality assurance, and benchmarking work we feel it needs before we can recommend it for production use.