The conventional approach to deploying applications on OpenStack uses virtual machines (usually KVM) backed by block devices (usually Ceph RBD). As interest increases in container-based application deployment models like Docker, it is worth looking at what alternatives exist for combining compute and storage (both shared and non-shared). Mapping RBD block devices directly to host kernels trades isolation for performance and may be appropriate for many private clouds without significant changes to the infrastructure. More importantly, moving away from virtualization allows for non-block interfaces and a range of alternative models based on file or object storage.
Attendees will leave this talk with a basic understanding of the storage components and services available to both virtual machines and Linux containers; a view of several ways they can be combined, along with the performance, reliability, and security trade-offs of each; and several proposals for how the relevant OpenStack projects (Nova, Cinder, Manila) can work together to make it easy.
6. 6
WHY NOT CONTAINERS?
Technology
● Security
– Shared kernel
– Limited isolation
● OS flexibility
– Shared kernel limits OS choices
● Inertia
Ecosystem
● New models don't capture many legacy services
7. 7
WHY CEPH?
● All components scale horizontally
● No single point of failure
● Hardware agnostic, commodity hardware
● Self-manage whenever possible
● Open source (LGPL)
● Move beyond legacy approaches
– client/cluster instead of client/server
– avoid ad hoc HA
8. 8
CEPH COMPONENTS
● RADOS – a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● LIBRADOS – a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RGW – a web services gateway for object storage, compatible with S3 and Swift
● RBD – a reliable, fully-distributed block device with cloud platform integration
● CEPHFS – a distributed file system with POSIX semantics and scale-out metadata management
10. 10
EXISTING BLOCK STORAGE MODEL
● VMs are the unit of cloud compute
● Block devices are the unit of VM storage
– ephemeral: not redundant, discarded when VM dies
– persistent volumes: durable, (re)attached to any VM
● Block devices are single-user
● For shared storage,
– use objects (e.g., Swift or S3)
– use a database (e.g., Trove)
– ...
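As a concrete illustration of this model, the typical workflow with the CLI clients of the time looks roughly like the following (volume name, size, server name, and device path are examples):

    # create a 10 GB persistent volume (Cinder)
    cinder create 10 --display-name data-vol
    # attach it to a running VM (Nova); it appears in the guest as a block device
    nova volume-attach my-vm <volume-uuid> /dev/vdb
    # inside the guest: a single-user filesystem on top of the block device
    mkfs -t ext4 /dev/vdb && mount /dev/vdb /mnt/data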
11. 11
KVM + LIBRBD.SO
● Model
– Nova → libvirt → KVM → librbd.so
– Cinder → rbd.py → librbd.so
– Glance → rbd.py → librbd.so
● Pros
– proven
– decent performance
– good security
● Cons
– performance could be better
● Status
– most common deployment model today (~44% in latest survey)
[Diagram: Nova and Cinder manage a VM running under QEMU/KVM; librbd inside the QEMU process talks to the RADOS cluster (monitors)]
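For reference, the Cinder side of this model is mostly configuration. A minimal sketch of the relevant cinder.conf settings (pool and user names are examples; the backend must also be listed in enabled_backends, and rbd_secret_uuid must match the libvirt secret configured on the compute nodes):

    # add an RBD backend to cinder.conf (values are illustrative)
    cat >> /etc/cinder/cinder.conf <<'EOF'
    [ceph]
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool = volumes
    rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_user = cinder
    rbd_secret_uuid = <libvirt secret uuid>
    EOF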
12. 12
MULTIPLE CEPH DRIVERS
● librbd.so
– qemu-kvm
– rbd-fuse (experimental)
● rbd.ko (Linux kernel)
– /dev/rbd*
– stable and well-supported on modern kernels and distros
– some feature gap
● no client-side caching
● no “fancy striping”
– performance delta
● more efficient → more IOPS
● no client-side cache → higher latency for some workloads
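A minimal sketch of the kernel (rbd.ko) path, for comparison (pool and image names are examples):

    # create an RBD image and map it through the kernel driver
    rbd create volumes/test-img --size 10240     # 10 GB
    rbd map volumes/test-img --id admin          # prints a device, e.g. /dev/rbd0
    mkfs -t xfs /dev/rbd0 && mount /dev/rbd0 /mnt
    # tear down
    umount /mnt && rbd unmap /dev/rbd0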
13. 13
LXC + RBD.KO
● The model
– libvirt-based lxc containers
– map kernel RBD on host
– pass host device via libvirt into the container
● Pros
– fast and efficient
– implement existing Nova API
● Cons
– weaker security than VM
● Status
– lxc is maintained
– lxc is less widely used
– no prototype
[Diagram: Nova manages a container on a Linux host; rbd.ko on the host maps block devices from the RADOS cluster (monitors) into the container]
14. 14
NOVA-DOCKER + RBD.KO
● The model
– docker container as mini-host
– map kernel RBD on host
– pass RBD device to container, or
– mount RBD, bind dir to container
● Pros
– buzzword-compliant
– fast and efficient
● Cons
– different image format
– different app model
– only a subset of docker feature set
● Status
– no prototype
– nova-docker is out of tree
https://wiki.openstack.org/wiki/Docker
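A rough sketch of the second variant (mount the RBD image on the host, bind the directory into the container); image name, paths, and container image are examples:

    # host: map and mount the RBD image
    rbd map volumes/app-data                     # e.g. /dev/rbd0
    mount /dev/rbd0 /var/lib/app-data
    # bind the mounted directory into the container
    docker run -v /var/lib/app-data:/data myapp
    # or pass the raw block device through instead
    docker run --device /dev/rbd0:/dev/rbd0 myapp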
15. 15
IRONIC + RBD.KO
● The model
– bare metal provisioning
– map kernel RBD directly from guest image
● Pros
– fast and efficient
– traditional app deployment model
● Cons
– guest OS must support rbd.ko
– requires agent
– boot-from-volume tricky
● Status
– Cinder and Ironic integration is a hot topic at summit
● 5:20p Wednesday (cinder)
– no prototype
● References
– https://wiki.openstack.org/wiki/Ironic/blueprints/cinder-integration
[Diagram: a bare-metal Linux host uses rbd.ko to map block devices directly from the RADOS cluster (monitors)]
16. 16
BLOCK - SUMMARY
● But
– block storage is the same old boring model
– volumes are only semi-elastic (grow, not shrink; tedious to resize)
– storage is not shared between guests
model | performance | efficiency | VM | client cache | striping | same images? | exists
kvm + librbd.so | best | good | X | X | X | yes | X
lxc + rbd.ko | good | best | | | | close |
nova-docker + rbd.ko | good | best | | | | no |
ironic + rbd.ko | good | best | | | | close? | planned!
18. 18
MANILA FILE STORAGE
● Manila manages file volumes
– create/delete, share/unshare
– tenant network connectivity
– snapshot management
● Why file storage?
– familiar POSIX semantics
– fully shared volume – many clients can mount and share data
– elastic storage – amount of data can grow/shrink without explicit provisioning
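To illustrate the management operations, a sketch with the manila CLI (protocol, size, names, and the client network are examples; exact flags vary by release):

    # create a 1 GB NFS share and grant a tenant subnet access to it
    manila create NFS 1 --name my-share
    manila access-allow my-share ip 10.0.0.0/24
    # snapshot and clean up
    manila snapshot-create my-share --name my-share-snap
    manila delete my-share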
19. 19
MANILA CAVEATS
● Last mile problem
– must connect storage to guest network
– somewhat limited options (focus on Neutron)
● Mount problem
– Manila makes it possible for guest to mount
– guest is responsible for actual mount
– ongoing discussion around a guest agent …
● Current baked-in assumptions about both of these
20. 20
APPLIANCE DRIVERS
● Appliance drivers
– tell an appliance to export NFS to guests
– map appliance IP into tenant network (Neutron)
– boring (closed, proprietary, expensive, etc.)
● Status
– several drivers from usual suspects
– security punted to vendor
21. 21
GANESHA DRIVER
● Model
– service VM running nfs-ganesha server
– mount file system on storage network
– export NFS to tenant network
– map IP into tenant network
● Status
– in-tree, well-supported
[Diagram: Manila drives an nfs-ganesha service VM (KVM) that exports NFS to the tenant; the file system behind Ganesha is left unspecified (???)]
22. 22
KVM + GANESHA + LIBCEPHFS
● Model
– existing Ganesha driver, backed by Ganesha's libcephfs FSAL
● Pros
– simple, existing model
– security
● Cons
– extra hop → higher latency
– service VM is SpoF
– service VM consumes resources
● Status
– Manila Ganesha driver exists
– untested with CephFS
[Diagram: Manila drives a Ganesha service VM that speaks native Ceph to the RADOS cluster (monitors) via libcephfs and exports NFS to the tenant KVM guest (nfs.ko)]
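A rough sketch of what the Ganesha side could look like with the Ceph FSAL; this pairing was untested at the time (see above), and the export ID, paths, and options here are illustrative:

    # a minimal nfs-ganesha export backed by libcephfs (FSAL_CEPH)
    cat >> /etc/ganesha/ganesha.conf <<'EOF'
    EXPORT {
        Export_ID = 100;
        Path = "/";              # path within CephFS
        Pseudo = "/cephfs";      # NFSv4 pseudo-root
        Access_Type = RW;
        FSAL {
            Name = CEPH;
        }
    }
    EOF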
23. 23
KVM + CEPH.KO (CEPH-NATIVE)
● Model
– allow tenant access to storage network
– mount CephFS directly from tenant VM
● Pros
– best performance
– access to full CephFS feature set
– simple
● Cons
– guest must have modern distro/kernel
– exposes tenant to Ceph cluster
– must deliver mount secret to client
● Status
– no prototype
– CephFS isolation/security is work-in-progress
[Diagram: ceph.ko inside the tenant KVM guest speaks native Ceph directly to the RADOS cluster (monitors) for the Manila-provisioned share]
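The guest-side mount in this model is a plain kernel CephFS mount; a sketch (monitor address, share path, client name, and secret file are examples, and delivering that secret to the tenant is the open issue noted above):

    # inside the tenant VM: mount the share directly from the Ceph cluster
    mount -t ceph 192.168.0.10:6789:/shares/share-uuid /mnt/share \
        -o name=tenant,secretfile=/etc/ceph/tenant.secret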
24. 24
NETWORK-ONLY MODEL IS LIMITING
● Current assumption of NFS or CIFS sucks
● Always relying on guest mount support sucks
– mount -t ceph -o what?
● Even assuming storage connectivity is via the network sucks
● There are other options!
– KVM virtfs/9p
● fs pass-through to host
● 9p protocol
● virtio for fast data transfer
● upstream; not widely used
– NFS re-export from host
● mount and export fs on host
● private host/guest net
● avoid network hop from NFS service VM
– containers and 'mount --bind'
25. 25
NOVA “ATTACH FS” API
● The mount problem is under ongoing discussion by the Manila team
– discussed this morning
– simple prototype using cloud-init
– Manila agent? leverage Zaqar tenant messaging service?
● A different proposal
– expand Nova to include “attach/detach file system” API
– analogous to current attach/detach volume for block
– each Nova driver may implement function differently
– “plumb” storage to tenant VM or container
● Open question
– Would API do the final “mount” step as well? (I say yes!)
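To make the proposal concrete, a purely hypothetical CLI shape by analogy with today's block attach (the fs-attach command and its arguments do not exist; they are invented here for illustration):

    # existing block analog (real command)
    nova volume-attach my-vm <volume-uuid> /dev/vdb
    # hypothetical file-system analog of the proposed API
    nova fs-attach my-vm <manila-share-uuid> /mnt/share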
26. 26
KVM + VIRTFS/9P + CEPHFS.KO
● Model
– mount kernel CephFS on host
– pass-through to guest via virtfs/9p
● Pros
– security: tenant remains isolated from storage net + locked inside a directory
● Cons
– require modern Linux guests
– 9p not supported on some distros
– “virtfs is ~50% slower than a native mount?”
● Status
– Prototype from Haomai Wang
[Diagram: the Nova-managed host mounts the Manila share with ceph.ko (native Ceph to the RADOS cluster); KVM virtfs exposes it to the VM over 9p]
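A rough sketch of the plumbing in this model (mount point, tag, and security model are examples; in practice Nova/libvirt would generate the equivalent configuration):

    # host: mount the share with the kernel CephFS client
    mount -t ceph 192.168.0.10:6789:/shares/share-uuid /srv/share \
        -o name=manila,secretfile=/etc/ceph/manila.secret
    # host: expose the directory to the guest over virtfs/9p (QEMU option)
    qemu-system-x86_64 ... \
        -virtfs local,path=/srv/share,mount_tag=share0,security_model=passthrough
    # guest: mount the 9p filesystem by its tag
    mount -t 9p -o trans=virtio,version=9p2000.L share0 /mnt/share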
27. 27
KVM + NFS + CEPHFS.KO
● Model
– mount kernel CephFS on host
– pass-through to guest via NFS
● Pros
– security: tenant remains isolated from storage net + locked inside a directory
– NFS is more standard
● Cons
– NFS has weak caching consistency
– NFS is slower
● Status
– no prototype
[Diagram: the Nova-managed host mounts the Manila share with ceph.ko (native Ceph to the RADOS cluster) and re-exports it to the KVM guest over NFS]
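A sketch of the host-side re-export over a private host/guest network (addresses and paths are examples):

    # host: mount the share with ceph.ko, then export it over NFS
    mount -t ceph 192.168.0.10:6789:/shares/share-uuid /srv/share \
        -o name=manila,secretfile=/etc/ceph/manila.secret
    echo '/srv/share 169.254.0.0/16(rw,sync,no_root_squash)' >> /etc/exports
    exportfs -ra
    # guest: mount the host's NFS export
    mount -t nfs 169.254.0.1:/srv/share /mnt/share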
28. 28
(LXC, NOVA-DOCKER) + CEPHFS.KO
● Model
– host mounts CephFS directly
– mount --bind share into container namespace
● Pros
– best performance
– full CephFS semantics
● Cons
– rely on container for security
● Status
– no prototype
[Diagram: the Nova-managed host mounts the Manila share with ceph.ko (native Ceph to the RADOS cluster) and bind-mounts it into the container]
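A sketch of the host-side steps (paths and container names are examples; nova-docker or libvirt-lxc would do the equivalent when creating the container):

    # host: mount the share with the kernel CephFS client
    mount -t ceph 192.168.0.10:6789:/shares/share-uuid /srv/share \
        -o name=manila,secretfile=/etc/ceph/manila.secret
    # docker: bind the directory in when the container starts
    docker run -v /srv/share:/data myapp
    # lxc: the equivalent bind-mount entry in the container config
    echo 'lxc.mount.entry = /srv/share data none bind,create=dir 0 0' \
        >> /var/lib/lxc/mycontainer/config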
29. 29
IRONIC + CEPHFS.KO
● Model
– mount CephFS directly from bare metal “guest”
● Pros
– best performance
– full feature set
● Cons
– rely on CephFS security
– networking?
– agent to do the mount?
● Status
– no prototype
– no suitable (ironic) agent (yet)
[Diagram: the Nova/Ironic bare-metal host mounts the Manila share directly with ceph.ko (native Ceph to the RADOS cluster)]
30. 30
THE MOUNT PROBLEM
● Containers may break the current 'network fs' assumption
– mounting becomes driver-dependent; harder for tenant to do the right thing
● Nova “attach fs” API could provide the needed entry point
– KVM: qemu-guest-agent
– Ironic: no guest agent yet...
– containers (lxc, nova-docker): use mount --bind from host
● Or, make tenant do the final mount?
– Manila API to provide command (template) to perform the mount
● e.g., “mount -t ceph $cephmonip:/manila/$uuid $PATH -o ...”
– Nova lxc and docker
● bind share to a “dummy” device /dev/manila/$uuid
● API mount command is 'mount --bind /dev/manila/$uuid $PATH'
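A sketch of how the container-side template could work in practice (the /dev/manila location and the paths are part of the proposal above, not an existing interface):

    # host: bind the share to a well-known "dummy" path visible in the container
    mkdir -p /dev/manila
    mount --bind /srv/share /dev/manila/$uuid
    # mount command (template) returned by the proposed Manila API;
    # the tenant runs it inside the container
    mount --bind /dev/manila/$uuid $PATH
    # in the KVM case the template would instead be a native mount, e.g.
    mount -t ceph $cephmonip:/manila/$uuid $PATH -o name=...,secretfile=...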
31. 31
SECURITY: NO FREE LUNCH
● (KVM, Ironic) + ceph.ko
– access to storage network relies on Ceph security
● KVM + (virtfs/9p, NFS) + ceph.ko
– better security, but
– pass-through/proxy limits performance
● (by how much?)
● Containers
– security (vs a VM) is weak at baseline, but
– host performs the mount; tenant locked into their share directory
32. 32
PERFORMANCE
● 2 nodes
– Intel E5-2660
– 96GB RAM
– 10 GbE NIC
● Server
– 3 OSD (Intel S3500)
– 1 MON
– 1 MDS
● Client VMs
– 4 cores
– 2GB RAM
● iozone, 2x available RAM
● CephFS native
– VM ceph.ko → server
● CephFS 9p/virtfs
– VM 9p → host ceph.ko → server
● CephFS NFS
– VM NFS → server ceph.ko → server
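For reference, the kind of iozone invocation implied (the exact flags and file size are not given in the slides; the values here are assumptions, sized at roughly 2x the client VM's 2GB of RAM):

    # automatic iozone run against the mounted share with a 4 GB test file
    iozone -a -s 4g -f /mnt/share/iozone.tmp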
35. 35
SUMMARY MATRIX
model | performance | consistency | VM | gateway | net hops | security | agent | mount agent | prototype
kvm + ganesha + libcephfs | slower (?) | weak (nfs) | X | X | 2 | host | | X | X
kvm + virtfs + ceph.ko | good | good | X | X | 1 | host | | X | X
kvm + nfs + ceph.ko | good | weak (nfs) | X | X | 1 | host | | X |
kvm + ceph.ko | better | best | X | | 1 | ceph | | X |
lxc + ceph.ko | best | best | | | 1 | ceph | | |
nova-docker + ceph.ko | best | best | | | 1 | ceph | | | IBM talk - Thurs 9am
ironic + ceph.ko | best | best | | | 1 | ceph | X | X |
37. 37
CONTAINERS ARE DIFFERENT
● nova-docker implements a Nova view of a (Docker) container
– treats container like a standalone system
– does not leverage most of what Docker has to offer
– Nova == IaaS abstraction
● Kubernetes is the new hotness
– higher-level orchestration for containers
– draws on years of Google experience running containers at scale
– vibrant open source community
38. 38
KUBERNETES SHARED STORAGE
● Pure Kubernetes – no OpenStack
● Volume drivers
– Local
● hostPath, emptyDir
– Unshared
● iSCSI, GCEPersistentDisk, Amazon EBS, Ceph RBD – local fs on top of existing device
– Shared
● NFS, GlusterFS, Amazon EFS, CephFS
● Status
– Ceph drivers under review
● Finalizing model for secret storage, cluster parameters (e.g., mon IPs)
– Drivers expect pre-existing volumes
● recycled; missing REST API to create/destroy volumes
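For a sense of the model, a sketch of a pod that mounts a pre-existing CephFS share. The Ceph drivers were still under review at the time of this talk, so the exact field names are an assumption (as are the monitor address, secret name, and paths):

    kubectl create -f - <<'EOF'
    apiVersion: v1
    kind: Pod
    metadata:
      name: cephfs-demo
    spec:
      containers:
      - name: app
        image: nginx
        volumeMounts:
        - name: share
          mountPath: /data
      volumes:
      - name: share
        cephfs:
          monitors:
          - 192.168.0.10:6789
          path: /shares/share-uuid
          user: admin
          secretRef:
            name: ceph-secret
    EOF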
39. 39
KUBERNETES ON OPENSTACK
● Provision Nova VMs
– KVM or ironic
– Atomic or CoreOS
● Kubernetes per tenant
● Provision storage devices
– Cinder for volumes
– Manila for shares
● Kubernetes binds into pod/container
● Status
– Prototype Cinder plugin for Kubernetes
https://github.com/spothanis/kubernetes/tree/cinder-vol-plugin
[Diagram: Nova provisions KVM instances as Kube nodes (each running nginx and mysql pods) and a Kube master with a volume controller; Cinder and Manila provide volumes and shares]
40. 40
WHAT NEXT?
● Ironic agent
– enable Cinder (and Manila?) on bare metal
– Cinder + Ironic
● 5:20p Wednesday (Cinder)
● Expand breadth of Manila drivers
– virtfs/9p, ceph-native, NFS proxy via host, etc.
– the last mile is not always the tenant network!
● Nova “attach fs” API (or equivalent)
– simplify tenant experience
– paper over VM vs container vs bare metal differences
41. THANK YOU!
Sage Weil
CEPH PRINCIPAL ARCHITECT
Haomai Wang
FREE AGENT
sage@redhat.com
haomaiwang@gmail.com
@liewegas
42. 42
FOR MORE INFORMATION
● http://ceph.com
● http://github.com/ceph
● http://tracker.ceph.com
● Mailing lists
– ceph-users@ceph.com
– ceph-devel@vger.kernel.org
● irc.oftc.net
– #ceph
– #ceph-devel
● Twitter
– @ceph