2. [Ceph architecture diagram]
● APP → LIBRADOS – a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP (see the librados sketch below)
● APP → RADOSGW – a bucket-based REST gateway, compatible with S3 and Swift
● HOST/VM → RBD – a reliable and fully distributed block device, with a Linux kernel client and a QEMU/KVM driver
● CLIENT → CEPH FS – a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
● RADOS – a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
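The LIBRADOS layer can be exercised directly from an application. A minimal sketch using the Python rados binding; the conffile path and the pool name 'data' are assumptions for illustration, and a running cluster is required:

    import rados

    # Connect to the cluster described by the local ceph.conf (assumed path).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on an existing pool ('data' is an assumed name).
    ioctx = cluster.open_ioctx('data')

    # Write and read back a single RADOS object, bypassing RBD, RGW, and Ceph FS entirely.
    ioctx.write_full('hello-object', b'hello from librados')
    print(ioctx.read('hello-object'))

    ioctx.close()
    cluster.shutdown()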
5. Metadata Server
• Manages metadata for a POSIX-compliant shared filesystem
  • Directory hierarchy
  • File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for the shared filesystem
6. legacy metadata storage
● a scaling disaster
● name → inode → block list → data (see the toy model below)
● no inode table locality
● fragmentation
  – inode table
  – directory
● many seeks
● difficult to partition
[figure: example tree – /etc (hosts, mtab, passwd, …), /home, /usr (bin, include, lib, …), /var, /vmlinuz, …]
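A toy model of the legacy chain above (purely illustrative, not Ceph code, all names and numbers invented): each path component forces a dependent lookup in a global inode table, and file data needs one more hop through a block list, which is where the seeks and the partitioning pain come from.

    # Hypothetical in-memory model of a traditional filesystem metadata layout.
    inode_table = {
        1:   {'type': 'dir',  'entries': {'etc': 100, 'home': 101}},
        100: {'type': 'dir',  'entries': {'passwd': 102}},
        102: {'type': 'file', 'blocks': [7, 8, 9]},          # block list
    }
    block_store = {7: b'root:', 8: b'x:0:0', 9: b':/root\n'}

    def read_path(path):
        ino = 1                                       # start at the root inode
        for name in path.strip('/').split('/'):
            ino = inode_table[ino]['entries'][name]   # name -> inode (one "seek" each)
        blocks = inode_table[ino]['blocks']           # inode -> block list
        return b''.join(block_store[b] for b in blocks)   # block list -> data

    print(read_path('/etc/passwd'))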
7. ceph fs metadata storage
● block lists unnecessary
● inode table mostly useless
  – APIs are path-based, not inode-based
  – no random table access, sloppy caching
● embed inodes inside directories
  – good locality, prefetching
  – leverage key/value object
[figure: the same tree with inodes (1, 100, 102) embedded inside the directory objects]
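By contrast, a sketch of the idea on this slide (again a toy model, not the actual MDS encoding; directory contents are invented): each directory is one object whose key/value entries embed the child inodes, so a single directory read prefetches the stat information for every entry.

    # Hypothetical model: one object per directory; child inodes are embedded
    # as key/value entries, so readdir plus stat of every child costs one object read.
    directory_objects = {
        '/':    {'etc':    {'type': 'dir'},
                 'home':   {'type': 'dir'}},
        '/etc': {'passwd': {'type': 'file', 'mode': 0o644, 'size': 42},
                 'hosts':  {'type': 'file', 'mode': 0o644, 'size': 12}},
    }

    def stat_all(dirpath):
        # One "read" of the directory object returns every child's metadata:
        # good locality, natural prefetching, and no global inode table at all.
        return directory_objects[dirpath]

    for name, inode in stat_all('/etc').items():
        print(name, inode)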
8. controlling metadata io
● view ceph-mds as cache
  – reduce reads: dir+inode prefetching
  – reduce writes: consolidate multiple writes into the journal (sketch below)
● large journal or log
  – stripe over objects
● two tiers
  – journal for short term
  – per-directory for long term
● fast failure recovery
[figure: journal and directories stored as objects in RADOS]
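A sketch of the write-consolidation idea (toy code, not the ceph-mds journal format): updates stream into a large journal for fast, sequential recovery, and repeated updates to the same entry collapse into one long-term write-back to its directory object.

    # Hypothetical write path: journal first (short term), then one consolidated
    # write-back per directory entry (long term).
    journal = []                 # striped over objects in the real system
    dirty = {}                   # latest state per (directory, name)

    def update(dirpath, name, **attrs):
        journal.append((dirpath, name, attrs))        # cheap, sequential, replayable
        dirty.setdefault((dirpath, name), {}).update(attrs)

    update('/etc', 'passwd', size=42)
    update('/etc', 'passwd', size=43)
    update('/etc', 'passwd', mtime=1700000000)

    print(len(journal), 'journal entries')            # 3 journaled updates...
    print(dirty[('/etc', 'passwd')])                  # ...one consolidated write-back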
10. load distribution
● coarse (static subtree)
  – preserve locality
  – high management overhead
● fine (hash)
  – always balanced
  – less vulnerable to hot spots
  – destroys hierarchy, locality
● can a dynamic approach capture benefits of both extremes? (see the sketch below)
[figure: spectrum from static subtree (good locality) through hash directories to hash files (good balance)]
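The two extremes on this slide can be sketched in a few lines (hypothetical code for intuition only; the subtree map and rank count are invented): subtree assignment keeps related paths on one server, while hashing spreads every directory evenly but scatters a hierarchy across servers.

    import hashlib

    MDS_RANKS = 4

    # Coarse: static subtree assignment, whole subtrees pinned to a rank.
    SUBTREE_MAP = {'/home': 0, '/usr': 1, '/var': 2, '/etc': 3}

    def mds_for_subtree(path):
        # Longest matching prefix wins; unlisted paths fall back to the root's rank.
        matches = [p for p in SUBTREE_MAP if path.startswith(p)]
        return SUBTREE_MAP[max(matches, key=len)] if matches else 0

    # Fine: hash each directory, always balanced, but neighbours land anywhere.
    def mds_for_hash(path):
        return int(hashlib.sha1(path.encode()).hexdigest(), 16) % MDS_RANKS

    for p in ('/home/alice', '/home/alice/src', '/home/bob'):
        print(p, 'subtree->', mds_for_subtree(p), 'hash->', mds_for_hash(p))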
21. client protocol
●
highly stateful
●
●
consistent, fine-grained caching
seamless hand-off between ceph-mds daemons
●
●
●
when client traverses hierarchy
when metadata is migrated between servers
direct access to OSDs for file I/O
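On a mounted Ceph FS, a client can see the striping layout it uses for that direct OSD access. A small example, assuming a Linux client that exposes the ceph.file.layout virtual extended attribute and a mount at /mnt as in the example on the next slide; the file path is a placeholder:

    import os

    # With the layout in hand, the kernel client computes object names and talks
    # to the ceph-osd daemons directly; no file data flows through the MDS.
    layout = os.getxattr('/mnt/foo/bar/somefile', 'ceph.file.layout')
    print(layout.decode())   # stripe unit, stripe count, object size, pool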
22. an example
● mount -t ceph 1.2.3.4:/ /mnt
  – 3 ceph-mon RT
  – 2 ceph-mds RT (1 ceph-mds to -osd RT)
● cd /mnt/foo/bar
  – 2 ceph-mds RT (2 ceph-mds to -osd RT)
● ls -al
  – open
  – readdir
    – 1 ceph-mds RT (1 ceph-mds to -osd RT)
  – stat each file
  – close
● cp * /tmp
  – N ceph-osd RT
[figure: client round trips to ceph-mon, ceph-mds, and ceph-osd]
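The same sequence can be driven through the already-mounted kernel client with the standard library. A sketch only; /mnt/foo/bar and /tmp come from the slide's example, and the round-trip attributions in the comments restate the slide rather than measure anything:

    import os
    import shutil

    base = '/mnt/foo/bar'                    # cd /mnt/foo/bar: ceph-mds round trips

    names = os.listdir(base)                 # ls: open + readdir on the directory
    for name in names:
        # stat each file; typically answered from the MDS readdir prefetch.
        print(name, os.stat(os.path.join(base, name)))

    for name in names:                       # cp * /tmp: file data is read directly
        src = os.path.join(base, name)       # from the ceph-osd daemons (N round trips)
        if os.path.isfile(src):
            shutil.copy(src, '/tmp')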
26. [Ceph architecture diagram, revisited]
● APP → LIBRADOS – a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP – AWESOME
● APP → RADOSGW – a bucket-based REST gateway, compatible with S3 and Swift – AWESOME
● HOST/VM → RBD – a reliable and fully distributed block device, with a Linux kernel client and a QEMU/KVM driver – AWESOME
● CLIENT → CEPH FS – a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE – NEARLY AWESOME
● RADOS – a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes – AWESOME
27. Path forward
● Testing
  – Various workloads
  – Multiple active MDSs
● Test automation
  – Simple workload generator scripts (example below)
  – Bug reproducers
● Hacking
  – Bug squashing
  – Long-tail features
● Integrations
  – Ganesha, Samba, *stacks
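A minimal example of the kind of workload generator script meant above (hypothetical, deliberately metadata-heavy): it creates, stats, and removes many small files under a Ceph FS mount to exercise the MDS; the mountpoint and counts are placeholders.

    import os

    def churn(root, ndirs=10, nfiles=100):
        # Metadata-heavy loop: create, stat, and unlink lots of small files.
        for d in range(ndirs):
            dirpath = os.path.join(root, f'dir{d}')
            os.makedirs(dirpath, exist_ok=True)
            for f in range(nfiles):
                path = os.path.join(dirpath, f'file{f}')
                with open(path, 'w') as fh:
                    fh.write('x')
                os.stat(path)
                os.unlink(path)
            os.rmdir(dirpath)

    churn('/mnt/churn-test')    # assumed Ceph FS mountpoint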
29. hard links?
● rare
● useful locality properties
  – intra-directory
  – parallel inter-directory
● on miss, file objects provide per-file backpointers (toy model below)
  – degenerates to log(n) lookups
  – optimistic read complexity
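A toy illustration of the backpointer idea (not the real MDS encoding; inode numbers and paths are invented): when a non-primary link is hit and its primary dentry is not in cache, the file's object carries a backpointer naming a directory that links it, and following such pointers resolves the inode in a bounded number of extra lookups.

    # Hypothetical model of per-file backpointers.
    file_objects = {
        1234: {'data': b'...', 'backpointer': '/home/alice/project'},
    }
    directory_objects = {
        '/home/alice/project': {'report.txt': 1234},
        '/home/bob':           {'report-link': 1234},   # hard link, non-primary dentry
    }

    def resolve_via_backpointer(ino):
        # On a cache miss for a non-primary link, follow the stored backpointer
        # to a directory that holds the primary dentry for this inode.
        dirpath = file_objects[ino]['backpointer']
        entries = directory_objects[dirpath]
        name = next(n for n, i in entries.items() if i == ino)
        return dirpath, name

    print(resolve_via_backpointer(1234))    # ('/home/alice/project', 'report.txt')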
30. what is journaled
● lots of state
  – journaling is expensive up-front, cheap to recover
  – non-journaled state is cheap, but complex (and somewhat expensive) to recover
● yes
  – client sessions
  – actual fs metadata modifications
● no
  – cache provenance
  – open files
● lazy flush
  – client modifications may not be durable until fsync() or visible by another client (example below)
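The lazy-flush point is the usual POSIX contract; a small standard-library example of forcing durability and cross-client visibility with fsync(). The path is a placeholder under an assumed Ceph FS mount:

    import os

    fd = os.open('/mnt/foo/bar/important.log',
                 os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    os.write(fd, b'checkpoint reached\n')
    # Until fsync() returns, the write may live only in this client's cache and
    # may be neither durable nor visible to clients on other hosts.
    os.fsync(fd)
    os.close(fd)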
Editor's notes
● Slide 2: Finally, let's talk about Ceph FS. Ceph FS is a parallel filesystem that provides a massively scalable, single-hierarchy, shared disk. If you use a shared drive at work, this is the same thing except that the same drive could be shared by everyone you've ever met (and everyone they've ever met).
● Slide 3: Remember all that metadata we talked about in the beginning? Feels so long ago. It has to be stored somewhere! Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Enter MDS, the Ceph Metadata Server. Clients accessing Ceph FS data first make a request to an MDS, which provides what they need to get files from the right OSDs.
● Slide 4: There are multiple MDSs!
● Slide 5: If you aren't running Ceph FS, you don't need to deploy metadata servers.
● Slide 9: So how do you have one tree and multiple servers?
● Slide 11: If there's just one MDS (which is a terrible idea), it manages metadata for the entire tree.
● Slide 12: When the second one comes along, it will intelligently partition the work by taking a subtree.
● Slide 13: When the third MDS arrives, it will attempt to split the tree again.
● Slide 14: Same with the fourth.
● Slide 15: An MDS can actually even just take a single directory or file, if the load is high enough. This all happens dynamically based on load and the structure of the data, and it's called "dynamic subtree partitioning".
● Slide 26: Ceph FS is feature-complete but still lacks the testing, quality assurance, and benchmarking work we feel it needs to recommend it for production use.