2. [Ceph architecture diagram]
● APP → LIBRADOS – a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP (see the librados sketch below)
● APP → RADOSGW – a bucket-based REST gateway, compatible with S3 and Swift
● HOST/VM → RBD – a reliable and fully distributed block device, with a Linux kernel client and a QEMU/KVM driver
● CLIENT → CEPH FS – a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
● RADOS – a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
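The LIBRADOS layer can be exercised directly from an application. A minimal sketch using the Python rados binding; the conffile path and the pool name 'data' are assumptions for illustration, and a running cluster is required:

    import rados

    # Connect to the cluster described by the local ceph.conf (assumed path).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on an existing pool ('data' is an assumed name).
    ioctx = cluster.open_ioctx('data')

    # Write and read back a single RADOS object, bypassing RBD, RGW, and Ceph FS entirely.
    ioctx.write_full('hello-object', b'hello from librados')
    print(ioctx.read('hello-object'))

    ioctx.close()
    cluster.shutdown()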
5. Metadata Server
• Manages metadata for a POSIX-compliant shared filesystem
  • Directory hierarchy
  • File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for the shared filesystem
6. legacy metadata storage
● a scaling disaster
● name → inode → block list → data (see the toy model below)
● no inode table locality
● fragmentation
  – inode table
  – directory
● many seeks
● difficult to partition
[figure: example tree – /etc (hosts, mtab, passwd, …), /home, /usr (bin, include, lib, …), /var, /vmlinuz, …]
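A toy model of the legacy chain above (purely illustrative, not Ceph code, all names and numbers invented): each path component forces a dependent lookup in a global inode table, and file data needs one more hop through a block list, which is where the seeks and the partitioning pain come from.

    # Hypothetical in-memory model of a traditional filesystem metadata layout.
    inode_table = {
        1:   {'type': 'dir',  'entries': {'etc': 100, 'home': 101}},
        100: {'type': 'dir',  'entries': {'passwd': 102}},
        102: {'type': 'file', 'blocks': [7, 8, 9]},          # block list
    }
    block_store = {7: b'root:', 8: b'x:0:0', 9: b':/root\n'}

    def read_path(path):
        ino = 1                                       # start at the root inode
        for name in path.strip('/').split('/'):
            ino = inode_table[ino]['entries'][name]   # name -> inode (one "seek" each)
        blocks = inode_table[ino]['blocks']           # inode -> block list
        return b''.join(block_store[b] for b in blocks)   # block list -> data

    print(read_path('/etc/passwd'))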
7. ceph fs metadata storage
● block lists unnecessary
● inode table mostly useless
  – APIs are path-based, not inode-based
  – no random table access, sloppy caching
● embed inodes inside directories
  – good locality, prefetching
  – leverage key/value object
[figure: the same tree with inodes (1, 100, 102) embedded inside the directory objects]
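By contrast, a sketch of the idea on this slide (again a toy model, not the actual MDS encoding; directory contents are invented): each directory is one object whose key/value entries embed the child inodes, so a single directory read prefetches the stat information for every entry.

    # Hypothetical model: one object per directory; child inodes are embedded
    # as key/value entries, so readdir plus stat of every child costs one object read.
    directory_objects = {
        '/':    {'etc':    {'type': 'dir'},
                 'home':   {'type': 'dir'}},
        '/etc': {'passwd': {'type': 'file', 'mode': 0o644, 'size': 42},
                 'hosts':  {'type': 'file', 'mode': 0o644, 'size': 12}},
    }

    def stat_all(dirpath):
        # One "read" of the directory object returns every child's metadata:
        # good locality, natural prefetching, and no global inode table at all.
        return directory_objects[dirpath]

    for name, inode in stat_all('/etc').items():
        print(name, inode)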
8. controlling metadata io
● view ceph-mds as cache
  – reduce reads: dir+inode prefetching
  – reduce writes: consolidate multiple writes into the journal (sketch below)
● large journal or log
  – stripe over objects
● two tiers
  – journal for short term
  – per-directory for long term
● fast failure recovery
[figure: journal and directories stored as objects in RADOS]
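A sketch of the write-consolidation idea (toy code, not the ceph-mds journal format): updates stream into a large journal for fast, sequential recovery, and repeated updates to the same entry collapse into one long-term write-back to its directory object.

    # Hypothetical write path: journal first (short term), then one consolidated
    # write-back per directory entry (long term).
    journal = []                 # striped over objects in the real system
    dirty = {}                   # latest state per (directory, name)

    def update(dirpath, name, **attrs):
        journal.append((dirpath, name, attrs))        # cheap, sequential, replayable
        dirty.setdefault((dirpath, name), {}).update(attrs)

    update('/etc', 'passwd', size=42)
    update('/etc', 'passwd', size=43)
    update('/etc', 'passwd', mtime=1700000000)

    print(len(journal), 'journal entries')            # 3 journaled updates...
    print(dirty[('/etc', 'passwd')])                  # ...one consolidated write-back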
10. load distribution
● coarse (static subtree)
  – preserve locality
  – high management overhead
● fine (hash)
  – always balanced
  – less vulnerable to hot spots
  – destroys hierarchy, locality
● can a dynamic approach capture benefits of both extremes? (see the sketch below)
[figure: spectrum from static subtree (good locality) through hash directories to hash files (good balance)]
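The two extremes on this slide can be sketched in a few lines (hypothetical code for intuition only; the subtree map and rank count are invented): subtree assignment keeps related paths on one server, while hashing spreads every directory evenly but scatters a hierarchy across servers.

    import hashlib

    MDS_RANKS = 4

    # Coarse: static subtree assignment, whole subtrees pinned to a rank.
    SUBTREE_MAP = {'/home': 0, '/usr': 1, '/var': 2, '/etc': 3}

    def mds_for_subtree(path):
        # Longest matching prefix wins; unlisted paths fall back to the root's rank.
        matches = [p for p in SUBTREE_MAP if path.startswith(p)]
        return SUBTREE_MAP[max(matches, key=len)] if matches else 0

    # Fine: hash each directory, always balanced, but neighbours land anywhere.
    def mds_for_hash(path):
        return int(hashlib.sha1(path.encode()).hexdigest(), 16) % MDS_RANKS

    for p in ('/home/alice', '/home/alice/src', '/home/bob'):
        print(p, 'subtree->', mds_for_subtree(p), 'hash->', mds_for_hash(p))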
21. client protocol
●
highly stateful
●
●
consistent, fine-grained caching
seamless hand-off between ceph-mds daemons
●
●
●
when client traverses hierarchy
when metadata is migrated between servers
direct access to OSDs for file I/O
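On a mounted Ceph FS, a client can see the striping layout it uses for that direct OSD access. A small example, assuming a Linux client that exposes the ceph.file.layout virtual extended attribute and a mount at /mnt as in the example on the next slide; the file path is a placeholder:

    import os

    # With the layout in hand, the kernel client computes object names and talks
    # to the ceph-osd daemons directly; no file data flows through the MDS.
    layout = os.getxattr('/mnt/foo/bar/somefile', 'ceph.file.layout')
    print(layout.decode())   # stripe unit, stripe count, object size, pool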
22. an example
● mount -t ceph 1.2.3.4:/ /mnt
  – 3 ceph-mon RT
  – 2 ceph-mds RT (1 ceph-mds to -osd RT)
● cd /mnt/foo/bar
  – 2 ceph-mds RT (2 ceph-mds to -osd RT)
● ls -al
  – open
  – readdir
    – 1 ceph-mds RT (1 ceph-mds to -osd RT)
  – stat each file
  – close
● cp * /tmp
  – N ceph-osd RT
[figure: client round trips to ceph-mon, ceph-mds, and ceph-osd]
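The same sequence can be driven through the already-mounted kernel client with the standard library. A sketch only; /mnt/foo/bar and /tmp come from the slide's example, and the round-trip attributions in the comments restate the slide rather than measure anything:

    import os
    import shutil

    base = '/mnt/foo/bar'                    # cd /mnt/foo/bar: ceph-mds round trips

    names = os.listdir(base)                 # ls: open + readdir on the directory
    for name in names:
        # stat each file; typically answered from the MDS readdir prefetch.
        print(name, os.stat(os.path.join(base, name)))

    for name in names:                       # cp * /tmp: file data is read directly
        src = os.path.join(base, name)       # from the ceph-osd daemons (N round trips)
        if os.path.isfile(src):
            shutil.copy(src, '/tmp')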
26. [Ceph architecture diagram, revisited]
● APP → LIBRADOS – a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP – AWESOME
● APP → RADOSGW – a bucket-based REST gateway, compatible with S3 and Swift – AWESOME
● HOST/VM → RBD – a reliable and fully distributed block device, with a Linux kernel client and a QEMU/KVM driver – AWESOME
● CLIENT → CEPH FS – a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE – NEARLY AWESOME
● RADOS – a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes – AWESOME
27. Path forward
● Testing
  – Various workloads
  – Multiple active MDSs
● Test automation
  – Simple workload generator scripts (example below)
  – Bug reproducers
● Hacking
  – Bug squashing
  – Long-tail features
● Integrations
  – Ganesha, Samba, *stacks
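A minimal example of the kind of workload generator script meant above (hypothetical, deliberately metadata-heavy): it creates, stats, and removes many small files under a Ceph FS mount to exercise the MDS; the mountpoint and counts are placeholders.

    import os

    def churn(root, ndirs=10, nfiles=100):
        # Metadata-heavy loop: create, stat, and unlink lots of small files.
        for d in range(ndirs):
            dirpath = os.path.join(root, f'dir{d}')
            os.makedirs(dirpath, exist_ok=True)
            for f in range(nfiles):
                path = os.path.join(dirpath, f'file{f}')
                with open(path, 'w') as fh:
                    fh.write('x')
                os.stat(path)
                os.unlink(path)
            os.rmdir(dirpath)

    churn('/mnt/churn-test')    # assumed Ceph FS mountpoint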
29. hard links?
● rare
● useful locality properties
  – intra-directory
  – parallel inter-directory
● on miss, file objects provide per-file backpointers (toy model below)
  – degenerates to log(n) lookups
  – optimistic read complexity
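A toy illustration of the backpointer idea (not the real MDS encoding; inode numbers and paths are invented): when a non-primary link is hit and its primary dentry is not in cache, the file's object carries a backpointer naming a directory that links it, and following such pointers resolves the inode in a bounded number of extra lookups.

    # Hypothetical model of per-file backpointers.
    file_objects = {
        1234: {'data': b'...', 'backpointer': '/home/alice/project'},
    }
    directory_objects = {
        '/home/alice/project': {'report.txt': 1234},
        '/home/bob':           {'report-link': 1234},   # hard link, non-primary dentry
    }

    def resolve_via_backpointer(ino):
        # On a cache miss for a non-primary link, follow the stored backpointer
        # to a directory that holds the primary dentry for this inode.
        dirpath = file_objects[ino]['backpointer']
        entries = directory_objects[dirpath]
        name = next(n for n, i in entries.items() if i == ino)
        return dirpath, name

    print(resolve_via_backpointer(1234))    # ('/home/alice/project', 'report.txt')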
30. what is journaled
● lots of state
  – journaling is expensive up-front, cheap to recover
  – non-journaled state is cheap, but complex (and somewhat expensive) to recover
● yes
  – client sessions
  – actual fs metadata modifications
● no
  – cache provenance
  – open files
● lazy flush
  – client modifications may not be durable until fsync() or visible by another client (example below)
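The lazy-flush point is the usual POSIX contract; a small standard-library example of forcing durability and cross-client visibility with fsync(). The path is a placeholder under an assumed Ceph FS mount:

    import os

    fd = os.open('/mnt/foo/bar/important.log',
                 os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    os.write(fd, b'checkpoint reached\n')
    # Until fsync() returns, the write may live only in this client's cache and
    # may be neither durable nor visible to clients on other hosts.
    os.fsync(fd)
    os.close(fd)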
Editor's notes
● Slide 2: Finally, let's talk about Ceph FS. Ceph FS is a parallel filesystem that provides a massively scalable, single-hierarchy, shared disk. If you use a shared drive at work, this is the same thing except that the same drive could be shared by everyone you've ever met (and everyone they've ever met).
● Slide 3: Remember all that metadata we talked about in the beginning? Feels so long ago. It has to be stored somewhere! Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Enter MDS, the Ceph Metadata Server. Clients accessing Ceph FS data first make a request to an MDS, which provides what they need to get files from the right OSDs.
● Slide 4: There are multiple MDSs!
● Slide 5: If you aren't running Ceph FS, you don't need to deploy metadata servers.
● Slide 9: So how do you have one tree and multiple servers?
● Slide 11: If there's just one MDS (which is a terrible idea), it manages metadata for the entire tree.
● Slide 12: When the second one comes along, it will intelligently partition the work by taking a subtree.
● Slide 13: When the third MDS arrives, it will attempt to split the tree again.
● Slide 14: Same with the fourth.
● Slide 15: An MDS can actually even just take a single directory or file, if the load is high enough. This all happens dynamically based on load and the structure of the data, and it's called "dynamic subtree partitioning".
● Slide 26: Ceph FS is feature-complete but still lacks the testing, quality assurance, and benchmarking work we feel it needs to recommend it for production use.