SlideShare una empresa de Scribd logo
1 de 42
Future of CephFS
Sage Weil
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing,
intelligent storage nodes
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing,
intelligent storage nodes
LIBRADOS
A library allowing
apps to directly
access RADOS,
with support for
C, C++, Java,
Python, Ruby,
and PHP
LIBRADOS
A library allowing
apps to directly
access RADOS,
with support for
C, C++, Java,
Python, Ruby,
and PHP
RBD
A reliable and fully-
distributed block
device, with a Linux
kernel client and a
QEMU/KVM driver
RBD
A reliable and fully-
distributed block
device, with a Linux
kernel client and a
QEMU/KVM driver
CEPH FS
A POSIX-compliant
distributed file system,
with a Linux kernel
client and support for
FUSE
RADOSGW
A bucket-based
REST gateway,
compatible with S3
and Swift
RADOSGW
A bucket-based
REST gateway,
compatible with S3
and Swift
APPAPP APPAPP HOST/VMHOST/VM CLIENTCLIENT
MM
MM
MM
CLIENTCLIENT
01
10
01
10
data
metadata
MM
MM
MM
Metadata Server
• Manages metadata for a
POSIX-compliant shared
filesystem
• Directory hierarchy
• File metadata (owner,
timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to
clients
• Only required for shared
filesystem
legacy metadata storage
●
a scaling disaster
●
name inode block list→ → →
data
●
no inode table locality
●
fragmentation
– inode table
– directory
● many seeks
●
difficult to partition
usr
etc
var
home
vmlinuz
passwd
mtab
hosts
lib
…
…
…
include
bin
ceph fs metadata storage
●
block lists unnecessary
● inode table mostly useless
●
APIs are path-based, not
inode-based
●
no random table access,
sloppy caching
● embed inodes inside
directories
●
good locality, prefetching
●
leverage key/value object
102
100
1
usr
etc
var
home
vmlinuz
passwd
mtab
hosts
lib
include
bin
…
…
…
controlling metadata io
● view ceph-mds as cache
●
reduce reads
– dir+inode prefetching
●
reduce writes
– consolidate multiple writes
●
large journal or log
●
stripe over objects
●
two tiers
– journal for short term
– per-directory for long term
●
fast failure recovery
journal
directories
one tree
three metadata servers
??
load distribution
●
coarse (static subtree)
●
preserve locality
●
high management overhead
●
fine (hash)
●
always balanced
●
less vulnerable to hot spots
●
destroy hierarchy, locality
●
can a dynamic approach
capture benefits of both
extremes?
static subtree
hash directories
hash files
good locality
good balance
DYNAMIC SUBTREE PARTITIONING
●
scalable
●
arbitrarily partition
metadata
● adaptive
●
move work from busy to
idle servers
●
replicate hot metadata
●
efficient
●
hierarchical partition
preserve locality
● dynamic
●
daemons can join/leave
●
take over for failed nodes
dynamic subtree partitioning
Dynamic partitioning
many directories same directory
Failure recovery
Metadata replication and availability
Metadata cluster scaling
client protocol
●
highly stateful
●
consistent, fine-grained caching
● seamless hand-off between ceph-mds daemons
●
when client traverses hierarchy
●
when metadata is migrated between servers
● direct access to OSDs for file I/O
an example
● mount -t ceph 1.2.3.4:/ /mnt
●
3 ceph-mon RT
●
2 ceph-mds RT (1 ceph-mds to -osd RT)
● cd /mnt/foo/bar
●
2 ceph-mds RT (2 ceph-mds to -osd RT)
● ls -al
●
open
●
readdir
– 1 ceph-mds RT (1 ceph-mds to -osd RT)
●
stat each file
●
close
● cp * /tmp
●
N ceph-osd RT
ceph-mon
ceph-mds
ceph-osd
recursive accounting
●
ceph-mds tracks recursive directory stats
●
file sizes
●
file and directory counts
●
modification time
●
virtual xattrs present full stats
● efficient
$ ls ­alSh | head
total 0
drwxr­xr­x 1 root            root      9.7T 2011­02­04 15:51 .
drwxr­xr­x 1 root            root      9.7T 2010­12­16 15:06 ..
drwxr­xr­x 1 pomceph         pg4194980 9.6T 2011­02­24 08:25 pomceph
drwxr­xr­x 1 mcg_test1       pg2419992  23G 2011­02­02 08:57 mcg_test1
drwx­­x­­­ 1 luko            adm        19G 2011­01­21 12:17 luko
drwx­­x­­­ 1 eest            adm        14G 2011­02­04 16:29 eest
drwxr­xr­x 1 mcg_test2       pg2419992 3.0G 2011­02­02 09:34 mcg_test2
drwx­­x­­­ 1 fuzyceph        adm       1.5G 2011­01­18 10:46 fuzyceph
drwxr­xr­x 1 dallasceph      pg275     596M 2011­01­14 10:06 dallasceph
snapshots
●
volume or subvolume snapshots unusable at petabyte
scale
●
snapshot arbitrary subdirectories
●
simple interface
●
hidden '.snap' directory
●
no special tools
$ mkdir foo/.snap/one # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776 # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile bar/
$ rmdir foo/.snap/one # remove snapshot
multiple client implementations
●
Linux kernel client
●
mount -t ceph 1.2.3.4:/
/mnt
●
export (NFS), Samba (CIFS)
● ceph-fuse
●
libcephfs.so
●
your app
●
Samba (CIFS)
●
Ganesha (NFS)
●
Hadoop (map/reduce)
kernel
libcephfs
ceph fuse
ceph-fuse
your app
libcephfs
Samba
libcephfs
Ganesha
NFS SMB/CIFS
libcephfs
Hadoop
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing,
intelligent storage nodes
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing,
intelligent storage nodes
LIBRADOS
A library allowing
apps to directly
access RADOS,
with support for
C, C++, Java,
Python, Ruby,
and PHP
LIBRADOS
A library allowing
apps to directly
access RADOS,
with support for
C, C++, Java,
Python, Ruby,
and PHP
RBD
A reliable and fully-
distributed block
device, with a Linux
kernel client and a
QEMU/KVM driver
RBD
A reliable and fully-
distributed block
device, with a Linux
kernel client and a
QEMU/KVM driver
RADOSGW
A bucket-based
REST gateway,
compatible with S3
and Swift
RADOSGW
A bucket-based
REST gateway,
compatible with S3
and Swift
APPAPP APPAPP HOST/VMHOST/VM CLIENTCLIENT
CEPH FS
A POSIX-compliant
distributed file system,
with a Linux kernel
client and support for
FUSE
CEPH FS
A POSIX-compliant
distributed file system,
with a Linux kernel
client and support for
FUSE
NEARLY
AWESOME
AWESOMEAWESOME
AWESOME
AWESOME
Path forward
●
Testing
●
Various workloads
●
Multiple active MDSs
●
Test automation
●
Simple workload generator scripts
●
Bug reproducers
●
Hacking
●
Bug squashing
●
Long-tail features
●
Integrations
●
Ganesha, Samba, *stacks
librados
object model
●
pools
●
1s to 100s
●
independent namespaces or object collections
●
replication level, placement policy
●
objects
●
bazillions
●
blob of data (bytes to gigabytes)
●
attributes (e.g., “version=12”; bytes to kilobytes)
●
key/value bundle (bytes to gigabytes)
atomic transactions
●
client operations send to the OSD cluster
●
operate on a single object
●
can contain a sequence of operations, e.g.
– truncate object
– write new object data
– set attribute
●
atomicity
●
all operations commit or do not commit atomically
● conditional
●
'guard' operations can control whether operation is performed
– verify xattr has specific value
– assert object is a specific version
●
allows atomic compare-and-swap etc.
key/value storage
●
store key/value pairs in an object
●
independent from object attrs or byte data payload
● based on google's leveldb
●
efficient random and range insert/query/removal
●
based on BigTable SSTable design
● exposed via key/value API
●
insert, update, remove
●
individual keys or ranges of keys
● avoid read/modify/write cycle for updating complex
objects
●
e.g., file system directory objects
watch/notify
●
establish stateful 'watch' on an object
●
client interest persistently registered with object
●
client keeps session to OSD open
●
send 'notify' messages to all watchers
●
notify message (and payload) is distributed to all watchers
●
variable timeout
●
notification on completion
– all watchers got and acknowledged the notify
● use any object as a communication/synchronization
channel
●
locking, distributed coordination (ala ZooKeeper), etc.
CLIENT
#1
CLIENT
#2
CLIENT
#3
OSD
watch
ack/commit
ack/commit
watch
ack/commit
watch
notify
notify
notify
notify
ack
ack
ack
complete
watch/notify example
●
radosgw cache consistency
●
radosgw instances watch a single object (.rgw/notify)
●
locally cache bucket metadata
●
on bucket metadata changes (removal, ACL changes)
– write change to relevant bucket object
– send notify with bucket name to other radosgw instances
●
on receipt of notify
– invalidate relevant portion of cache
rados classes
●
dynamically loaded .so
●
/var/lib/rados-classes/*
●
implement new object “methods” using existing methods
●
part of I/O pipeline
●
simple internal API
● reads
●
can call existing native or class methods
●
do whatever processing is appropriate
●
return data
● writes
●
can call existing native or class methods
●
do whatever processing is appropriate
●
generates a resulting transaction to be applied atomically
class examples
●
grep
●
read an object, filter out individual records, and return those
● sha1
●
read object, generate fingerprint, return that
● images
●
rotate, resize, crop image stored in object
●
remove red-eye
● crypto
●
encrypt/decrypt object data with provided key
ideas
●
distributed key/value table
●
aggregate many k/v objects into one big 'table'
●
working prototype exists (thanks, Eleanor!)
ideas
●
lua rados class
●
embed lua interpreter in a rados class
●
ship semi-arbitrary code for operations
●
json class
●
parse, manipulate json structures
ideas
●
rados mailbox (RMB?)
●
plug librados backend into dovecot, postfix, etc.
●
key/value object for each mailbox
– key = message id
– value = headers
●
object for each message or attachment
●
watch/notify to delivery notification
hard links?
● rare
● useful locality properties
●
intra-directory
●
parallel inter-directory
● on miss, file objects provide per-file
backpointers
● degenerates to log(n) lookups
● optimistic read complexity

Más contenido relacionado

La actualidad más candente

What's new in Luminous and Beyond
What's new in Luminous and BeyondWhat's new in Luminous and Beyond
What's new in Luminous and BeyondSage Weil
 
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud StorageCeph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud StorageSage Weil
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephSage Weil
 
GlusterFS Native driver for Openstack Manila at GlusterNight Paris @ Openstac...
GlusterFS Native driver for Openstack Manila at GlusterNight Paris @ Openstac...GlusterFS Native driver for Openstack Manila at GlusterNight Paris @ Openstac...
GlusterFS Native driver for Openstack Manila at GlusterNight Paris @ Openstac...Deepak Shetty
 
Dude where's my volume, open stack summit vancouver 2015
Dude where's my volume, open stack summit vancouver 2015Dude where's my volume, open stack summit vancouver 2015
Dude where's my volume, open stack summit vancouver 2015Sean Cohen
 
A crash course in CRUSH
A crash course in CRUSHA crash course in CRUSH
A crash course in CRUSHSage Weil
 
BlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InBlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InSage Weil
 
Debugging with-wireshark-niels-de-vos
Debugging with-wireshark-niels-de-vosDebugging with-wireshark-niels-de-vos
Debugging with-wireshark-niels-de-vosGluster.org
 
XSKY - ceph luminous update
XSKY - ceph luminous updateXSKY - ceph luminous update
XSKY - ceph luminous updateinwin stack
 
Sdc 2012-challenges
Sdc 2012-challengesSdc 2012-challenges
Sdc 2012-challengesGluster.org
 
2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph2019.06.27 Intro to Ceph
2019.06.27 Intro to CephCeph Community
 
Gluster technical overview
Gluster technical overviewGluster technical overview
Gluster technical overviewGluster.org
 
20160401 guster-roadmap
20160401 guster-roadmap20160401 guster-roadmap
20160401 guster-roadmapGluster.org
 
20160401 Gluster-roadmap
20160401 Gluster-roadmap20160401 Gluster-roadmap
20160401 Gluster-roadmapGluster.org
 
OpenVZ, Virtuozzo and Docker
OpenVZ, Virtuozzo and DockerOpenVZ, Virtuozzo and Docker
OpenVZ, Virtuozzo and DockerKirill Kolyshkin
 

La actualidad más candente (20)

What's new in Luminous and Beyond
What's new in Luminous and BeyondWhat's new in Luminous and Beyond
What's new in Luminous and Beyond
 
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud StorageCeph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
 
GlusterFS Native driver for Openstack Manila at GlusterNight Paris @ Openstac...
GlusterFS Native driver for Openstack Manila at GlusterNight Paris @ Openstac...GlusterFS Native driver for Openstack Manila at GlusterNight Paris @ Openstac...
GlusterFS Native driver for Openstack Manila at GlusterNight Paris @ Openstac...
 
Dude where's my volume, open stack summit vancouver 2015
Dude where's my volume, open stack summit vancouver 2015Dude where's my volume, open stack summit vancouver 2015
Dude where's my volume, open stack summit vancouver 2015
 
Bluestore
BluestoreBluestore
Bluestore
 
A crash course in CRUSH
A crash course in CRUSHA crash course in CRUSH
A crash course in CRUSH
 
BlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InBlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year In
 
Redis at LINE
Redis at LINERedis at LINE
Redis at LINE
 
Debugging with-wireshark-niels-de-vos
Debugging with-wireshark-niels-de-vosDebugging with-wireshark-niels-de-vos
Debugging with-wireshark-niels-de-vos
 
Strata - 03/31/2012
Strata - 03/31/2012Strata - 03/31/2012
Strata - 03/31/2012
 
XSKY - ceph luminous update
XSKY - ceph luminous updateXSKY - ceph luminous update
XSKY - ceph luminous update
 
Sdc 2012-challenges
Sdc 2012-challengesSdc 2012-challenges
Sdc 2012-challenges
 
Scale 10x 01:22:12
Scale 10x 01:22:12Scale 10x 01:22:12
Scale 10x 01:22:12
 
2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph
 
librados
libradoslibrados
librados
 
Gluster technical overview
Gluster technical overviewGluster technical overview
Gluster technical overview
 
20160401 guster-roadmap
20160401 guster-roadmap20160401 guster-roadmap
20160401 guster-roadmap
 
20160401 Gluster-roadmap
20160401 Gluster-roadmap20160401 Gluster-roadmap
20160401 Gluster-roadmap
 
OpenVZ, Virtuozzo and Docker
OpenVZ, Virtuozzo and DockerOpenVZ, Virtuozzo and Docker
OpenVZ, Virtuozzo and Docker
 

Similar a Ceph Day Santa Clara: The Future of CephFS + Developing with Librados

London Ceph Day: The Future of CephFS
London Ceph Day: The Future of CephFSLondon Ceph Day: The Future of CephFS
London Ceph Day: The Future of CephFSCeph Community
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices: A Deep DiveCeph Block Devices: A Deep Dive
Ceph Block Devices: A Deep Divejoshdurgin
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices:  A Deep DiveCeph Block Devices:  A Deep Dive
Ceph Block Devices: A Deep DiveRed_Hat_Storage
 
OSDC 2015: John Spray | The Ceph Storage System
OSDC 2015: John Spray | The Ceph Storage SystemOSDC 2015: John Spray | The Ceph Storage System
OSDC 2015: John Spray | The Ceph Storage SystemNETWAYS
 
Open stack HA - Theory to Reality
Open stack HA -  Theory to RealityOpen stack HA -  Theory to Reality
Open stack HA - Theory to RealitySriram Subramanian
 
Ceph Day London 2014 - The current state of CephFS development
Ceph Day London 2014 - The current state of CephFS development Ceph Day London 2014 - The current state of CephFS development
Ceph Day London 2014 - The current state of CephFS development Ceph Community
 
Storage tiering and erasure coding in Ceph (SCaLE13x)
Storage tiering and erasure coding in Ceph (SCaLE13x)Storage tiering and erasure coding in Ceph (SCaLE13x)
Storage tiering and erasure coding in Ceph (SCaLE13x)Sage Weil
 
CephFS in Jewel: Stable at Last
CephFS in Jewel: Stable at LastCephFS in Jewel: Stable at Last
CephFS in Jewel: Stable at LastCeph Community
 
New features for Ceph with Cinder and Beyond
New features for Ceph with Cinder and BeyondNew features for Ceph with Cinder and Beyond
New features for Ceph with Cinder and BeyondCeph Community
 
New Features for Ceph with Cinder and Beyond
New Features for Ceph with Cinder and BeyondNew Features for Ceph with Cinder and Beyond
New Features for Ceph with Cinder and BeyondOpenStack Foundation
 
Open Source Storage at Scale: Ceph @ GRNET
Open Source Storage at Scale: Ceph @ GRNETOpen Source Storage at Scale: Ceph @ GRNET
Open Source Storage at Scale: Ceph @ GRNETNikos Kormpakis
 
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...Timothy Spann
 
Ceph - A distributed storage system
Ceph - A distributed storage systemCeph - A distributed storage system
Ceph - A distributed storage systemItalo Santos
 
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store Ceph Community
 
Openstack with ceph
Openstack with cephOpenstack with ceph
Openstack with cephIan Colle
 
Ippevent : openshift Introduction
Ippevent : openshift IntroductionIppevent : openshift Introduction
Ippevent : openshift Introductionkanedafromparis
 
Ceph Overview for Distributed Computing Denver Meetup
Ceph Overview for Distributed Computing Denver MeetupCeph Overview for Distributed Computing Denver Meetup
Ceph Overview for Distributed Computing Denver Meetupktdreyer
 
What you need to know about ceph
What you need to know about cephWhat you need to know about ceph
What you need to know about cephEmma Haruka Iwao
 

Similar a Ceph Day Santa Clara: The Future of CephFS + Developing with Librados (20)

London Ceph Day: The Future of CephFS
London Ceph Day: The Future of CephFSLondon Ceph Day: The Future of CephFS
London Ceph Day: The Future of CephFS
 
Block Storage For VMs With Ceph
Block Storage For VMs With CephBlock Storage For VMs With Ceph
Block Storage For VMs With Ceph
 
XenSummit - 08/28/2012
XenSummit - 08/28/2012XenSummit - 08/28/2012
XenSummit - 08/28/2012
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices: A Deep DiveCeph Block Devices: A Deep Dive
Ceph Block Devices: A Deep Dive
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices:  A Deep DiveCeph Block Devices:  A Deep Dive
Ceph Block Devices: A Deep Dive
 
OSDC 2015: John Spray | The Ceph Storage System
OSDC 2015: John Spray | The Ceph Storage SystemOSDC 2015: John Spray | The Ceph Storage System
OSDC 2015: John Spray | The Ceph Storage System
 
Open stack HA - Theory to Reality
Open stack HA -  Theory to RealityOpen stack HA -  Theory to Reality
Open stack HA - Theory to Reality
 
Ceph Day London 2014 - The current state of CephFS development
Ceph Day London 2014 - The current state of CephFS development Ceph Day London 2014 - The current state of CephFS development
Ceph Day London 2014 - The current state of CephFS development
 
Storage tiering and erasure coding in Ceph (SCaLE13x)
Storage tiering and erasure coding in Ceph (SCaLE13x)Storage tiering and erasure coding in Ceph (SCaLE13x)
Storage tiering and erasure coding in Ceph (SCaLE13x)
 
CephFS in Jewel: Stable at Last
CephFS in Jewel: Stable at LastCephFS in Jewel: Stable at Last
CephFS in Jewel: Stable at Last
 
New features for Ceph with Cinder and Beyond
New features for Ceph with Cinder and BeyondNew features for Ceph with Cinder and Beyond
New features for Ceph with Cinder and Beyond
 
New Features for Ceph with Cinder and Beyond
New Features for Ceph with Cinder and BeyondNew Features for Ceph with Cinder and Beyond
New Features for Ceph with Cinder and Beyond
 
Open Source Storage at Scale: Ceph @ GRNET
Open Source Storage at Scale: Ceph @ GRNETOpen Source Storage at Scale: Ceph @ GRNET
Open Source Storage at Scale: Ceph @ GRNET
 
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...
 
Ceph - A distributed storage system
Ceph - A distributed storage systemCeph - A distributed storage system
Ceph - A distributed storage system
 
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
 
Openstack with ceph
Openstack with cephOpenstack with ceph
Openstack with ceph
 
Ippevent : openshift Introduction
Ippevent : openshift IntroductionIppevent : openshift Introduction
Ippevent : openshift Introduction
 
Ceph Overview for Distributed Computing Denver Meetup
Ceph Overview for Distributed Computing Denver MeetupCeph Overview for Distributed Computing Denver Meetup
Ceph Overview for Distributed Computing Denver Meetup
 
What you need to know about ceph
What you need to know about cephWhat you need to know about ceph
What you need to know about ceph
 

Último

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 

Último (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 

Ceph Day Santa Clara: The Future of CephFS + Developing with Librados

  • 2. RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP RBD A reliable and fully- distributed block device, with a Linux kernel client and a QEMU/KVM driver RBD A reliable and fully- distributed block device, with a Linux kernel client and a QEMU/KVM driver CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE RADOSGW A bucket-based REST gateway, compatible with S3 and Swift RADOSGW A bucket-based REST gateway, compatible with S3 and Swift APPAPP APPAPP HOST/VMHOST/VM CLIENTCLIENT
  • 5. Metadata Server • Manages metadata for a POSIX-compliant shared filesystem • Directory hierarchy • File metadata (owner, timestamps, mode, etc.) • Stores metadata in RADOS • Does not serve file data to clients • Only required for shared filesystem
  • 6. legacy metadata storage ● a scaling disaster ● name inode block list→ → → data ● no inode table locality ● fragmentation – inode table – directory ● many seeks ● difficult to partition usr etc var home vmlinuz passwd mtab hosts lib … … … include bin
  • 7. ceph fs metadata storage ● block lists unnecessary ● inode table mostly useless ● APIs are path-based, not inode-based ● no random table access, sloppy caching ● embed inodes inside directories ● good locality, prefetching ● leverage key/value object 102 100 1 usr etc var home vmlinuz passwd mtab hosts lib include bin … … …
  • 8. controlling metadata io ● view ceph-mds as cache ● reduce reads – dir+inode prefetching ● reduce writes – consolidate multiple writes ● large journal or log ● stripe over objects ● two tiers – journal for short term – per-directory for long term ● fast failure recovery journal directories
  • 10. load distribution ● coarse (static subtree) ● preserve locality ● high management overhead ● fine (hash) ● always balanced ● less vulnerable to hot spots ● destroy hierarchy, locality ● can a dynamic approach capture benefits of both extremes? static subtree hash directories hash files good locality good balance
  • 11.
  • 12.
  • 13.
  • 14.
  • 16. ● scalable ● arbitrarily partition metadata ● adaptive ● move work from busy to idle servers ● replicate hot metadata ● efficient ● hierarchical partition preserve locality ● dynamic ● daemons can join/leave ● take over for failed nodes dynamic subtree partitioning
  • 19. Metadata replication and availability
  • 21. client protocol ● highly stateful ● consistent, fine-grained caching ● seamless hand-off between ceph-mds daemons ● when client traverses hierarchy ● when metadata is migrated between servers ● direct access to OSDs for file I/O
  • 22. an example ● mount -t ceph 1.2.3.4:/ /mnt ● 3 ceph-mon RT ● 2 ceph-mds RT (1 ceph-mds to -osd RT) ● cd /mnt/foo/bar ● 2 ceph-mds RT (2 ceph-mds to -osd RT) ● ls -al ● open ● readdir – 1 ceph-mds RT (1 ceph-mds to -osd RT) ● stat each file ● close ● cp * /tmp ● N ceph-osd RT ceph-mon ceph-mds ceph-osd
  • 23. recursive accounting ● ceph-mds tracks recursive directory stats ● file sizes ● file and directory counts ● modification time ● virtual xattrs present full stats ● efficient $ ls ­alSh | head total 0 drwxr­xr­x 1 root            root      9.7T 2011­02­04 15:51 . drwxr­xr­x 1 root            root      9.7T 2010­12­16 15:06 .. drwxr­xr­x 1 pomceph         pg4194980 9.6T 2011­02­24 08:25 pomceph drwxr­xr­x 1 mcg_test1       pg2419992  23G 2011­02­02 08:57 mcg_test1 drwx­­x­­­ 1 luko            adm        19G 2011­01­21 12:17 luko drwx­­x­­­ 1 eest            adm        14G 2011­02­04 16:29 eest drwxr­xr­x 1 mcg_test2       pg2419992 3.0G 2011­02­02 09:34 mcg_test2 drwx­­x­­­ 1 fuzyceph        adm       1.5G 2011­01­18 10:46 fuzyceph drwxr­xr­x 1 dallasceph      pg275     596M 2011­01­14 10:06 dallasceph
  • 24. snapshots ● volume or subvolume snapshots unusable at petabyte scale ● snapshot arbitrary subdirectories ● simple interface ● hidden '.snap' directory ● no special tools $ mkdir foo/.snap/one # create snapshot $ ls foo/.snap one $ ls foo/bar/.snap _one_1099511627776 # parent's snap name is mangled $ rm foo/myfile $ ls -F foo bar/ $ ls -F foo/.snap/one myfile bar/ $ rmdir foo/.snap/one # remove snapshot
  • 25. multiple client implementations ● Linux kernel client ● mount -t ceph 1.2.3.4:/ /mnt ● export (NFS), Samba (CIFS) ● ceph-fuse ● libcephfs.so ● your app ● Samba (CIFS) ● Ganesha (NFS) ● Hadoop (map/reduce) kernel libcephfs ceph fuse ceph-fuse your app libcephfs Samba libcephfs Ganesha NFS SMB/CIFS libcephfs Hadoop
  • 26. RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP RBD A reliable and fully- distributed block device, with a Linux kernel client and a QEMU/KVM driver RBD A reliable and fully- distributed block device, with a Linux kernel client and a QEMU/KVM driver RADOSGW A bucket-based REST gateway, compatible with S3 and Swift RADOSGW A bucket-based REST gateway, compatible with S3 and Swift APPAPP APPAPP HOST/VMHOST/VM CLIENTCLIENT CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE NEARLY AWESOME AWESOMEAWESOME AWESOME AWESOME
  • 27. Path forward ● Testing ● Various workloads ● Multiple active MDSs ● Test automation ● Simple workload generator scripts ● Bug reproducers ● Hacking ● Bug squashing ● Long-tail features ● Integrations ● Ganesha, Samba, *stacks
  • 28.
  • 30. object model ● pools ● 1s to 100s ● independent namespaces or object collections ● replication level, placement policy ● objects ● bazillions ● blob of data (bytes to gigabytes) ● attributes (e.g., “version=12”; bytes to kilobytes) ● key/value bundle (bytes to gigabytes)
  • 31. atomic transactions ● client operations send to the OSD cluster ● operate on a single object ● can contain a sequence of operations, e.g. – truncate object – write new object data – set attribute ● atomicity ● all operations commit or do not commit atomically ● conditional ● 'guard' operations can control whether operation is performed – verify xattr has specific value – assert object is a specific version ● allows atomic compare-and-swap etc.
  • 32. key/value storage ● store key/value pairs in an object ● independent from object attrs or byte data payload ● based on google's leveldb ● efficient random and range insert/query/removal ● based on BigTable SSTable design ● exposed via key/value API ● insert, update, remove ● individual keys or ranges of keys ● avoid read/modify/write cycle for updating complex objects ● e.g., file system directory objects
  • 33. watch/notify ● establish stateful 'watch' on an object ● client interest persistently registered with object ● client keeps session to OSD open ● send 'notify' messages to all watchers ● notify message (and payload) is distributed to all watchers ● variable timeout ● notification on completion – all watchers got and acknowledged the notify ● use any object as a communication/synchronization channel ● locking, distributed coordination (ala ZooKeeper), etc.
  • 35. watch/notify example ● radosgw cache consistency ● radosgw instances watch a single object (.rgw/notify) ● locally cache bucket metadata ● on bucket metadata changes (removal, ACL changes) – write change to relevant bucket object – send notify with bucket name to other radosgw instances ● on receipt of notify – invalidate relevant portion of cache
  • 36. rados classes ● dynamically loaded .so ● /var/lib/rados-classes/* ● implement new object “methods” using existing methods ● part of I/O pipeline ● simple internal API ● reads ● can call existing native or class methods ● do whatever processing is appropriate ● return data ● writes ● can call existing native or class methods ● do whatever processing is appropriate ● generates a resulting transaction to be applied atomically
  • 37. class examples ● grep ● read an object, filter out individual records, and return those ● sha1 ● read object, generate fingerprint, return that ● images ● rotate, resize, crop image stored in object ● remove red-eye ● crypto ● encrypt/decrypt object data with provided key
  • 38. ideas ● distributed key/value table ● aggregate many k/v objects into one big 'table' ● working prototype exists (thanks, Eleanor!)
  • 39. ideas ● lua rados class ● embed lua interpreter in a rados class ● ship semi-arbitrary code for operations ● json class ● parse, manipulate json structures
  • 40. ideas ● rados mailbox (RMB?) ● plug librados backend into dovecot, postfix, etc. ● key/value object for each mailbox – key = message id – value = headers ● object for each message or attachment ● watch/notify to delivery notification
  • 41.
  • 42. hard links? ● rare ● useful locality properties ● intra-directory ● parallel inter-directory ● on miss, file objects provide per-file backpointers ● degenerates to log(n) lookups ● optimistic read complexity

Notas del editor

  1. Finally, let’s talk about Ceph FS. Ceph FS is a parallel filesystem that provides a massively scalable, single-hierarchy, shared disk. If you use a shared drive at work, this is the same thing except that the same drive could be shared by everyone you’ve ever met (and everyone they’ve ever met).
  2. Remember all that meta-data we talked about in the beginning? Feels so long ago. It has to be stored somewhere! Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Enter MDS, the Ceph Metadata Server. Clients accessing Ceph FS data first make a request to an MDS, which provides what they need to get files from the right OSDs.
  3. There are multiple MDSs!
  4. If you aren’t running Ceph FS, you don’t need to deploy metadata servers.
  5. So how do you have one tree and multiple servers?
  6. If there’s just one MDS (which is a terrible idea), it manages metadata for the entire tree.
  7. When the second one comes along, it will intelligently partition the work by taking a subtree.
  8. When the third MDS arrives, it will attempt to split the tree again.
  9. Same with the fourth.
  10. A MDS can actually even just take a single directory or file, if the load is high enough. This all happens dynamically based on load and the structure of the data, and it’s called “dynamic subtree partitioning”.
  11. Ceph FS is feature-complete but still lacks the testing, quality assurance, and benchmarking work we feel it needs to recommend it for production use.