This document discusses the future of CephFS, the distributed file system component of Ceph. It describes plans to improve dynamic subtree partitioning so that metadata load is balanced across servers, to enhance failure recovery, and to scale the metadata cluster. It also covers improvements to the client protocol, snapshot and recursive accounting capabilities, and support for multiple client implementations, such as the Linux kernel client and ceph-fuse. The goal is to test these enhancements and continue expanding CephFS integrations and features.
2. RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT
5. Metadata Server
● Manages metadata for a POSIX-compliant shared filesystem
  – Directory hierarchy
  – File metadata (owner, timestamps, mode, etc.)
● Stores metadata in RADOS
● Does not serve file data to clients
● Only required for shared filesystem
6. legacy metadata storage
● a scaling disaster
● name → inode → block list → data
● no inode table locality
● fragmentation
  – inode table
  – directory
● many seeks
● difficult to partition
[diagram: directory tree — usr, etc, var, home, … containing files such as vmlinuz, passwd, mtab, hosts, lib, include, bin]
7. ceph fs metadata storage
● block lists unnecessary
● inode table mostly useless
  – APIs are path-based, not inode-based
  – no random table access, sloppy caching
● embed inodes inside directories
  – good locality, prefetching
  – leverage key/value object
[diagram: directory objects with embedded inodes (e.g., 1, 100, 102) for usr, etc, var, home, … and their files]
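The idea above can be sketched in a few lines. This is a toy model (not ceph-mds code; the class and field names are illustrative): a directory is one key/value object mapping each name to an embedded inode record, so a single object read serves both the dentry lookup and the stat, with no separate inode-table seek.

```python
# Toy model of an embedded-inode directory object (names are illustrative).
class DirObject:
    """Directory stored as one key/value object: name -> inode record."""
    def __init__(self):
        self.entries = {}  # key/value pairs held inside the object

    def link(self, name, ino, mode, size=0):
        self.entries[name] = {"ino": ino, "mode": mode, "size": size}

    def lookup(self, name):
        # One key fetch returns the full inode: no inode-table access.
        return self.entries.get(name)

    def readdir(self):
        # Reading the object prefetches every child's metadata at once.
        return sorted(self.entries.items())

etc_dir = DirObject()
etc_dir.link("passwd", ino=100, mode=0o644, size=1024)
etc_dir.link("hosts", ino=102, mode=0o644, size=158)

print(etc_dir.lookup("passwd")["ino"])              # 100
print([name for name, _ in etc_dir.readdir()])      # ['hosts', 'passwd']
```

Because children are stored next to their names, a `readdir` followed by `stat` of each file (the classic `ls -al` pattern) needs no extra reads.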
8. controlling metadata io
● view ceph-mds as cache
● reduce reads
  – dir+inode prefetching
● reduce writes
  – consolidate multiple writes
● large journal or log
  – stripe over objects
● two tiers
  – journal for short term
  – per-directory for long term
● fast failure recovery
[diagram: journal and per-directory objects]
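The two-tier write path can be sketched as follows. This is an assumed simplification, not ceph-mds internals: updates commit cheaply to an append-only journal, and repeated updates to the same metadata are consolidated when flushed to the long-term per-directory store, so many journal entries can become one directory write.

```python
# Sketch of a two-tier metadata write path (assumed behavior).
journal = []       # short term: append-only log, striped over objects
directories = {}   # long term: per-directory objects

def update(path, field, value):
    # Fast commit: a cheap sequential append, no random I/O.
    journal.append((path, field, value))

def flush():
    # Consolidate: only the latest value per (path, field) is written
    # back, turning N journal entries into one directory update.
    latest = {}
    for path, field, value in journal:
        latest[(path, field)] = value
    for (path, field), value in latest.items():
        directories.setdefault(path, {})[field] = value
    journal.clear()

update("/etc/passwd", "mtime", 1)
update("/etc/passwd", "mtime", 2)
update("/etc/passwd", "mtime", 3)
flush()
print(directories["/etc/passwd"]["mtime"])  # 3 -- three updates, one write
```

Replaying the journal after a crash rebuilds the same consolidated state, which is what makes failure recovery fast.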
10. load distribution
● coarse (static subtree)
  – preserves locality
  – high management overhead
● fine (hash)
  – always balanced
  – less vulnerable to hot spots
  – destroys hierarchy, locality
● can a dynamic approach capture the benefits of both extremes?
[diagram: spectrum from static subtree (good locality) through hash directories to hash files (good balance)]
21. client protocol
● highly stateful
  – consistent, fine-grained caching
● seamless hand-off between ceph-mds daemons
  – when a client traverses the hierarchy
  – when metadata is migrated between servers
● direct access to OSDs for file I/O
22. an example
● mount -t ceph 1.2.3.4:/ /mnt
  – 3 ceph-mon RT
  – 2 ceph-mds RT (1 ceph-mds to ceph-osd RT)
● cd /mnt/foo/bar
  – 2 ceph-mds RT (2 ceph-mds to ceph-osd RT)
● ls -al
  – open
  – readdir: 1 ceph-mds RT (1 ceph-mds to ceph-osd RT)
  – stat each file
  – close
● cp * /tmp
  – N ceph-osd RT
[diagram: client round trips to ceph-mon, ceph-mds, and ceph-osd]
26. RADOS — AWESOME
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS — AWESOME
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD — AWESOME
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RADOSGW — AWESOME
A bucket-based REST gateway, compatible with S3 and Swift

CEPH FS — NEARLY AWESOME
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

APP APP HOST/VM CLIENT
27. Path forward
● Testing
  – Various workloads
  – Multiple active MDSs
● Test automation
  – Simple workload generator scripts
  – Bug reproducers
● Hacking
  – Bug squashing
  – Long-tail features
● Integrations
  – Ganesha, Samba, *stacks
30. object model
● pools
  – 1s to 100s
  – independent namespaces or object collections
  – replication level, placement policy
● objects
  – bazillions
  – blob of data (bytes to gigabytes)
  – attributes (e.g., “version=12”; bytes to kilobytes)
  – key/value bundle (bytes to gigabytes)
31. atomic transactions
● client operations are sent to the OSD cluster
  – operate on a single object
  – can contain a sequence of operations, e.g.:
    truncate object, write new object data, set attribute
● atomicity
  – all operations commit or do not commit atomically
● conditional
  – 'guard' operations can control whether the operation is performed
    (verify an xattr has a specific value; assert the object is a specific version)
  – allows atomic compare-and-swap, etc.
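The guard-then-commit semantics above can be sketched without a cluster. This is not the librados API; `Obj` and `apply_txn` are hypothetical stand-ins showing how guards abort a multi-op transaction before any side effect, yielding compare-and-swap behavior.

```python
import copy

class Obj:
    """Hypothetical single RADOS object: data, xattrs, and a version."""
    def __init__(self):
        self.data = b""
        self.xattrs = {}
        self.version = 0

def apply_txn(obj, guards, ops):
    # Guards run first: any failure aborts with no side effects.
    for guard in guards:
        if not guard(obj):
            return False
    # Stage the op sequence on a shadow copy so the commit is
    # all-or-nothing, then install the result atomically.
    shadow = copy.deepcopy(obj)
    for op in ops:
        op(shadow)
    shadow.version += 1
    obj.__dict__.update(shadow.__dict__)
    return True

o = Obj()
ok = apply_txn(
    o,
    guards=[lambda x: x.version == 0],           # assert object version
    ops=[lambda x: setattr(x, "data", b"new"),   # truncate + write data
         lambda x: x.xattrs.update(v="12")],     # set attribute
)
print(ok, o.data, o.version)  # True b'new' 1
```

Re-running the same transaction now fails its version guard, which is exactly the compare-and-swap pattern the slide describes.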
32. key/value storage
● store key/value pairs in an object
  – independent from object attrs or byte data payload
● based on google's leveldb
  – efficient random and range insert/query/removal
  – based on BigTable SSTable design
● exposed via key/value API
  – insert, update, remove
  – individual keys or ranges of keys
● avoid read/modify/write cycle for updating complex objects
  – e.g., file system directory objects
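A minimal sketch of those semantics, assuming a sorted key space as in leveldb (this is not the actual omap API; `OmapObject` is illustrative): single-key updates touch one entry instead of rewriting the object, and range queries scan keys in order.

```python
import bisect

class OmapObject:
    """Toy sorted key/value bundle attached to one object."""
    def __init__(self):
        self._keys = []   # kept sorted, as in an SSTable
        self._vals = {}

    def set(self, key, value):
        # Touches one key: no read/modify/write of the whole listing.
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def remove(self, key):
        if key in self._vals:
            del self._vals[key]
            self._keys.remove(key)

    def get_range(self, start, end):
        # Range query over [start, end) via binary search on sorted keys.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._vals[k]) for k in self._keys[lo:hi]]

d = OmapObject()
for ino, name in enumerate(["bin", "etc", "home", "usr", "var"], start=100):
    d.set(name, {"ino": ino})
print([k for k, _ in d.get_range("e", "u")])  # ['etc', 'home']
```

This is why directory objects fit so well: inserting one dentry into a million-entry directory costs one key insert, not a megabyte rewrite.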
33. watch/notify
● establish stateful 'watch' on an object
  – client interest persistently registered with object
  – client keeps session to OSD open
● send 'notify' messages to all watchers
  – notify message (and payload) is distributed to all watchers
  – variable timeout
  – notification on completion: all watchers got and acknowledged the notify
● use any object as a communication/synchronization channel
  – locking, distributed coordination (à la ZooKeeper), etc.
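The mechanism can be sketched as a tiny in-process model (not the librados watch/notify API; class and method names are illustrative): watchers register persistent interest in an object, and a notify fans its payload out to all of them and completes once every watcher has acknowledged.

```python
class WatchableObject:
    """Toy object used as a communication channel."""
    def __init__(self):
        self.watchers = {}              # watcher id -> callback

    def watch(self, watcher_id, callback):
        # Persistent registration; a real client would also keep its
        # OSD session open so the watch stays live.
        self.watchers[watcher_id] = callback

    def notify(self, payload):
        # Deliver the payload to every watcher, collecting acks;
        # completion means all watchers got and acknowledged it.
        acks = []
        for wid, cb in self.watchers.items():
            cb(payload)
            acks.append(wid)
        return acks

obj = WatchableObject()
seen = []
obj.watch("gw-1", seen.append)
obj.watch("gw-2", seen.append)
acks = obj.notify("invalidate bucket 'photos'")
print(len(acks), seen[0])  # 2 invalidate bucket 'photos'
```

Any object can serve this role, which is what makes the radosgw cache-consistency pattern on the next slide a one-object design.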
35. watch/notify example
● radosgw cache consistency
  – radosgw instances watch a single object (.rgw/notify)
  – locally cache bucket metadata
  – on bucket metadata changes (removal, ACL changes):
    write change to relevant bucket object, then
    send notify with bucket name to other radosgw instances
  – on receipt of notify:
    invalidate relevant portion of cache
36. rados classes
● dynamically loaded .so
  – /var/lib/rados-classes/*
  – implement new object “methods” using existing methods
  – part of I/O pipeline
  – simple internal API
● reads
  – can call existing native or class methods
  – do whatever processing is appropriate
  – return data
● writes
  – can call existing native or class methods
  – do whatever processing is appropriate
  – generates a resulting transaction to be applied atomically
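The read/write split above can be modeled in miniature. This is not the real cls API (which is C/C++ loaded as an .so); the registry and handlers below are hypothetical, showing how a read method transforms data in the I/O path while a write method computes the resulting object state to be applied as one transaction.

```python
methods = {}   # name -> handler, standing in for a loaded class .so

def register(name):
    def deco(fn):
        methods[name] = fn
        return fn
    return deco

@register("grep")
def cls_grep(obj_data, pattern):
    # Read-side method: filter the object's records, return only matches.
    return [line for line in obj_data.splitlines() if pattern in line]

@register("append_upper")
def cls_append_upper(obj_data, extra):
    # Write-side method: compute the resulting object contents, which
    # the store would then apply atomically as a transaction.
    return obj_data + extra.upper()

store = {"obj1": "error: disk full\nok: healthy\nerror: timeout"}
print(methods["grep"](store["obj1"], "error"))  # the two 'error' records
store["obj1"] = methods["append_upper"](store["obj1"], "\ndone")
```

The point of running these next to the data is that only the result (two matching lines, not the whole object) crosses the network.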
37. class examples
● grep
  – read an object, filter out individual records, and return those
● sha1
  – read object, generate fingerprint, return that
● images
  – rotate, resize, crop image stored in object
  – remove red-eye
● crypto
  – encrypt/decrypt object data with provided key
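The sha1 example is the simplest to sketch. This is the idea only, not the real class implementation: the fingerprint is computed next to the data on the OSD, so only 40 hex characters go over the wire instead of the object's bytes.

```python
import hashlib

def cls_sha1(obj_data: bytes) -> str:
    # Would run server-side in a rados class; the client receives
    # only the digest, never the object data itself.
    return hashlib.sha1(obj_data).hexdigest()

print(cls_sha1(b"hello"))  # aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
```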
39. ideas
● lua rados class
  – embed a lua interpreter in a rados class
  – ship semi-arbitrary code for operations
● json class
  – parse, manipulate json structures
40. ideas
● rados mailbox (RMB?)
  – plug a librados backend into dovecot, postfix, etc.
  – key/value object for each mailbox
    (key = message id, value = headers)
  – object for each message or attachment
  – watch/notify for delivery notification
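A sketch of how the pieces would fit (the names are illustrative; no such backend exists in the slide beyond the idea): one key/value object indexes a mailbox by message id, each message body lives in its own object, and delivery would trigger a notify on the mailbox object.

```python
mailbox_index = {}    # key/value object for the mailbox: id -> headers
message_objects = {}  # one object per message or attachment

def deliver(msg_id, headers, body):
    message_objects[msg_id] = body    # message body in its own object
    mailbox_index[msg_id] = headers   # index entry: key = id, value = headers
    # a watch/notify on the mailbox object would signal delivery here

deliver("m-001", {"From": "a@example.com", "Subject": "hi"}, b"hello")
print(mailbox_index["m-001"]["Subject"], len(message_objects))  # hi 1
```

Listing a mailbox then becomes a single range query on the index object, without touching any message bodies.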
42. hard links?
● rare
● useful locality properties
  – intra-directory
  – parallel inter-directory
● on a cache miss, file objects provide per-file backpointers
● degenerates to log(n) lookups
● optimistic read complexity
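The backpointer idea can be sketched as follows (assumed mechanics; the data layout is illustrative): with inodes embedded in directories, a hard link's inode lives under one parent, and the file object carries a backpointer naming a containing directory, so a lookup that misses the cache can still recover a valid path with one extra read.

```python
# Toy layout: the inode for 100 is embedded under /a/b, but a second
# link in /c also refers to it.
files = {100: {"data": b"...", "backpointer": "/a/b"}}
dirs = {"/a/b": {"link1": 100}, "/c": {"link2": 100}}

def lookup_by_ino(ino):
    # Cache miss: the file object itself names a containing directory,
    # so one extra read recovers a valid path to the embedded inode.
    parent = files[ino]["backpointer"]
    name = next(n for n, i in dirs[parent].items() if i == ino)
    return parent + "/" + name

print(lookup_by_ino(100))  # /a/b/link1
```

The common (optimistic) case never consults the backpointer at all; it only pays the extra lookups on a miss.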
Editor's notes
Finally, let’s talk about Ceph FS. Ceph FS is a parallel filesystem that provides a massively scalable, single-hierarchy, shared disk. If you use a shared drive at work, this is the same thing except that the same drive could be shared by everyone you’ve ever met (and everyone they’ve ever met).
Remember all that metadata we talked about in the beginning? Feels so long ago. It has to be stored somewhere! Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Enter the MDS, the Ceph Metadata Server. Clients accessing Ceph FS data first make a request to an MDS, which provides what they need to get files from the right OSDs.
There are multiple MDSs!
If you aren’t running Ceph FS, you don’t need to deploy metadata servers.
So how do you have one tree and multiple servers?
If there’s just one MDS (which is a terrible idea), it manages metadata for the entire tree.
When the second one comes along, it will intelligently partition the work by taking a subtree.
When the third MDS arrives, it will attempt to split the tree again.
Same with the fourth.
An MDS can even take just a single directory or file, if the load is high enough. This all happens dynamically based on load and the structure of the data, and it's called “dynamic subtree partitioning”.
Ceph FS is feature-complete but still lacks the testing, quality assurance, and benchmarking work we feel it needs before we can recommend it for production use.