An intro to Ceph and big data - CERN Big Data Workshop
1. an intro to ceph and big data
patrick mcgarry – inktank
Big Data Workshop – 27 JUN 2013
2. what is ceph?
distributed storage system
− reliable system built with unreliable components
− fault tolerant, no SPoF
commodity hardware
− expensive arrays, controllers, and specialized networks not required
large scale (10s to 10,000s of nodes)
− heterogeneous hardware (no fork-lift upgrades)
− incremental expansion (or contraction)
dynamic cluster
3. what is ceph?
unified storage platform
− scalable object + compute storage platform
− RESTful object storage (e.g., S3, Swift)
− block storage
− distributed file system
open source
− LGPL server-side
− client support in mainline Linux kernel
4. RADOS – the Ceph object store
A reliable, autonomous, distributed object store comprised of self-healing, self-managing,
intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
14. rich librados API
efficient key/value storage inside an object
atomic single-object transactions
− update data, attr, keys together
− atomic compare-and-swap
object-granularity snapshot infrastructure
inter-client communication via object
embed code in ceph-osd daemon via plugin API
− arbitrary atomic object mutations, processing
15. Data and compute
RADOS Embedded Object Classes
Moves compute directly adjacent to data
C++ by default
Lua bindings available
16. die, POSIX, die
successful exascale architectures will replace
or transcend POSIX
− hierarchical model does not distribute
line between compute and storage will blur
− some processing is data-local, some is not
fault tolerance will be first-class property of
architecture
− for both computation and storage
17. POSIX – I'm not dead yet!
CephFS builds POSIX namespace on top of
RADOS
− metadata managed by ceph-mds daemons
− stored in objects
strong consistency, stateful client protocol
− heavy prefetching, embedded inodes
architected for HPC workloads
− distribute namespace across cluster of MDSs
− mitigate bursty workloads
− adapt distribution as workloads shift over time
27. snapshots
snapshot arbitrary subdirectories
simple interface
− hidden '.snap' directory
− no special tools
$ mkdir foo/.snap/one # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776 # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile bar/
$ rmdir foo/.snap/one # remove snapshot
28. how can you help?
try ceph and tell us what you think
− http://ceph.com/resources/downloads
http://ceph.com/resources/mailing-list-irc/
− ask if you need help
ask your organization to start dedicating
resources to the project http://github.com/ceph
find a bug (http://tracker.ceph.com) and fix it
participate in our ceph developer summit
− http://ceph.com/events/ceph-developer-summit
RADOS is a distributed object store, and it’s the foundation for Ceph. On top of RADOS, the Ceph team has built three applications that allow you to store data and do fantastic things. But before we get into all of that, let’s start at the beginning of the story.
Let’s start with RADOS, Reliable Autonomic Distributed Object Storage. In this example, you’ve got five disks in a computer. You have initialized each disk with a filesystem (btrfs is the right filesystem to use someday, but until it’s stable we recommend XFS). On each filesystem, you deploy a Ceph OSD (Object Storage Daemon). That computer, with its five disks and five object storage daemons, becomes a single node in a RADOS cluster. Alongside these nodes are monitor nodes, which keep track of the current state of the cluster and provide users with an entry point into the cluster (although they do not serve any data themselves).
With CRUSH, the data is first split into a certain number of sections. These are called “placement groups”. The number of placement groups is configurable. Then, the CRUSH algorithm runs, having received the latest cluster map and a set of placement rules, and it determines where the placement group belongs in the cluster. This is a pseudo-random calculation, but it’s also repeatable; given the same cluster state and rule set, it will always return the same results.
Each placement group is run through CRUSH and stored in the cluster. Notice how no node has received more than one copy of a placement group, and no two nodes contain the same information? That’s important.
When it comes time to store an object in the cluster (or retrieve one), the client calculates where it belongs.
What happens, though, when a node goes down? The OSDs are always talking to each other (and the monitors), and they know when something is amiss. The third and fifth node on the top row have noticed that the second node on the bottom row is gone, and they are also aware that they have replicas of the missing data.
The OSDs collectively use the CRUSH algorithm to determine how the cluster should look based on its new state, and move the data to where clients running CRUSH expect it to be.
Because of the way placement is calculated instead of centrally controlled, node failures are transparent to clients.
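The placement flow described above can be sketched in a few lines of Python. This is not the real CRUSH algorithm (which works over a hierarchy of buckets with placement rules); it is a toy stand-in that shows the key property: placement is a repeatable, pseudo-random function of the object name and the cluster map, so any client can compute it independently.

```python
import hashlib

def stable_hash(s: str) -> int:
    # Repeatable hash (Python's built-in hash() is salted per process).
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def object_to_pg(obj_name: str, pg_num: int) -> int:
    # Step 1: object name -> placement group.
    return stable_hash(obj_name) % pg_num

def pg_to_osds(pg: int, osds: list, replicas: int = 3) -> list:
    # Step 2 (toy stand-in for CRUSH): rank OSDs by a pseudo-random
    # score derived from (pg, osd) and take the top `replicas`.
    ranked = sorted(osds, key=lambda osd: stable_hash(f"{pg}:{osd}"))
    return ranked[:replicas]

cluster = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]
pg = object_to_pg("myfile", pg_num=128)
print(pg_to_osds(pg, cluster))       # same inputs -> same answer, every time

# If an OSD dies, every client recomputes the same new mapping
# from the updated cluster map; no central table needs updating.
survivors = [o for o in cluster if o != "osd.2"]
print(pg_to_osds(pg, survivors))
```

Because every client runs the same calculation against the same cluster map, there is no central lookup table to query or keep consistent, which is exactly why failures stay transparent to clients.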
Most people will default to discussions about CephFS when confronted with either Big Data or HPC applications. This can mean using CephFS by itself, or perhaps as a drop-in replacement for HDFS. [NOT READY ARGUMENT] There are a couple of other options, however. You can use librados to talk directly to the object store; one user I know actually plugged Hadoop in at this level, instead of using CephFS. Ceph also has a pretty decent key-value store proof-of-concept, done by an intern last year. It's based on a B-tree structure but uses a fixed height of two levels instead of a true tree, drawing from both a normal B-tree and Google BigTable. Would love to see someone do more with it.
I mentioned librados; this is the low-level library that allows you to directly access a RADOS cluster from your application. It has native language bindings for C, C++, Python, etc. This is obviously the fastest way to get at your data, and it comes with no inherent overhead or translation layer.
For most object systems an object is just a bunch of bytes, maybe with some extended attributes. In Ceph you can store a lot more than that. You can store key/value pairs inside an object: think BerkeleyDB or SQL, where each object is a logical container. It supports atomic transactions, so you can do things like atomic compare-and-swap, or update the bytes and the keys/values together in one atomic operation, and it will be consistently distributed and replicated across the cluster in a safe way. There is object-granularity snapshot infrastructure, and inter-client communication via objects for locking and the like. The really exciting part is the ability to implement your own functionality on the OSD.....
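To make the object-as-container model concrete, here is a minimal local simulation of the semantics just described: an object holding both byte data and key/value pairs, mutated atomically, with a compare-and-swap primitive. The class and method names here are invented for illustration; the real interface is librados's write operations and omap calls, not this code.

```python
import threading

class FakeRadosObject:
    """Toy model of a RADOS object: a byte payload plus key/value
    pairs, mutated under one lock so each operation is atomic."""
    def __init__(self):
        self._lock = threading.Lock()
        self.data = b""
        self.omap = {}

    def write_full(self, data: bytes, kv: dict):
        # Update the bytes and the key/value pairs together,
        # as a single atomic transaction.
        with self._lock:
            self.data = data
            self.omap.update(kv)

    def compare_and_swap(self, key: str, expected, new) -> bool:
        # Atomic compare-and-swap on one key: only write if the
        # current value matches what the caller expected to see.
        with self._lock:
            if self.omap.get(key) != expected:
                return False
            self.omap[key] = new
            return True

obj = FakeRadosObject()
obj.write_full(b"payload", {"owner": "alice", "version": 1})
print(obj.compare_and_swap("version", 1, 2))   # True: expectation matched
print(obj.compare_and_swap("version", 1, 3))   # False: value already moved on
```

In a real cluster the same pattern runs server-side on the OSD that owns the object, which is what makes it safe for many clients to coordinate through a single object.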
These embedded object classes allow you to send an object method call to the cluster, and it will perform that computation without having to pull the data over the network. The downside is how new functionality gets injected into the system: a compiled C++ plugin has to be delivered and dynamically loaded into each OSD process. This becomes more complicated if a cluster is composed of multiple architecture targets, and it makes it difficult to update functionality on the fly. One approach to addressing these problems is to embed a language runtime within the OSD; Noah Watkins, one of our engineers, tackled this with the Lua bindings that are available.
One of the more contentious assertions that Sage likes to make is that as we move towards exascale computing and beyond, we'll need to transcend or replace POSIX. The hierarchical model just doesn't scale well beyond a certain level. Future models are going to have to blur the line between compute and storage, recognizing when data is local enough to operate on in place versus when it needs to be gathered from multiple sources for an operation. And finally, fault tolerance needs to become a first-class property of these architectures, for both computation and storage. As we push the scale of our existing architectures, building things like burst buffers to deal with huge checkpoints across millions of cores just doesn't make a whole lot of sense.
Having said all that, there are too many things (both people and code) built using POSIX mentality to ditch it any time soon. CephFS is designed to provide that POSIX layer on top of RADOS. [read slide] Now, as we've said there is certainly some work to be done on CephFS, but I want to share a bit about how it works since it (and similar thinking) will play a big part of Ceph's HPC and Big Data applications going forward.
CephFS adds a metadata server (or MDS) to the list of node types in your Ceph cluster. Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Clients accessing CephFS data first make a request to an MDS, which provides what they need to get files from the right OSDs.
There are multiple MDSs!
So how do you have one tree and multiple servers?
If there’s just one MDS (which is a terrible idea), it manages metadata for the entire tree.
When the second one comes along, it will intelligently partition the work by taking a subtree.
When the third MDS arrives, it will attempt to split the tree again.
Same with the fourth.
An MDS can actually take even just a single directory or file, if the load is high enough. This all happens dynamically based on load and the structure of the data, and it's called "dynamic subtree partitioning". It is done as a periodic load-balance exchange: the transfer just ships cache contents between MDSs and lets the clients continue transparently.
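The shape of that idea can be illustrated with a toy balancer: given per-subtree load figures, hand the busiest subtrees to the least-loaded MDS ranks. The real MDS balancer is far more sophisticated (it exchanges load periodically and migrates cache contents, as described above); this sketch is purely illustrative and every name in it is made up.

```python
def partition(subtree_load: dict, num_mds: int) -> dict:
    """Greedy assignment of subtrees to MDS ranks.
    subtree_load maps a directory path to a load figure,
    e.g. {"/home": 120, "/scratch": 300} (requests/sec, say)."""
    assignment = {rank: [] for rank in range(num_mds)}
    totals = {rank: 0 for rank in range(num_mds)}
    # Busiest subtrees first, each handed to the currently
    # least-loaded MDS rank.
    for path, load in sorted(subtree_load.items(), key=lambda kv: -kv[1]):
        rank = min(totals, key=totals.get)
        assignment[rank].append(path)
        totals[rank] += load
    return assignment

load = {"/home": 120, "/scratch": 300, "/projects": 150, "/archive": 30}
print(partition(load, num_mds=2))   # hot /scratch ends up alone on one MDS
```

Re-running this as the load numbers shift is the essence of "dynamic": the partition follows the workload rather than being fixed at mount time.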
CephFS has some neat features that you don't find in most file systems. Because the filesystem namespace was built from the ground up, these features could be built into the infrastructure. One of them is recursive accounting: the MDSs keep track of directory stats for every directory in the file system. For instance, when you do an 'ls -al', the size shown for a directory is actually the total number of bytes stored recursively beneath it, the same thing you'd get from a 'du' but in real time.
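Recursive accounting is easy to emulate with a bottom-up pass over a directory tree, which is roughly the total the MDS maintains, except that the MDS updates its stats incrementally as files change rather than rescanning. This sketch recomputes the totals from scratch over a small in-memory tree; the tree contents are invented for illustration.

```python
# Toy directory tree: a dict subtree is a directory,
# an int leaf is a file's size in bytes.
tree = {
    "home": {
        "alice": {"thesis.pdf": 4_000_000, "data.csv": 1_500_000},
        "bob": {"notes.txt": 2_000},
    },
    "scratch": {"sim.out": 9_000_000},
}

def rsize(node) -> int:
    """Recursive byte count: what CephFS reports as a directory's size."""
    if isinstance(node, int):      # a file: just its own size
        return node
    return sum(rsize(child) for child in node.values())

print(rsize(tree["home"]))   # total bytes under /home, like a real-time 'du'
```

In CephFS the equivalent figure comes back instantly from MDS-maintained stats, with no tree walk at all.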
[Provides snapshots] The motivation here is once you start talking about petabytes and exabytes it doesn't make much sense to try to snapshot the entire tree. You need to be able to snapshot different directories and different data sets. You can add and remove snapshots for any directory with standard bash-type commands.
Also, next Ceph developer summit coming soon to plan for the Emperor release. Would love to see some blueprints submitted for CephFS work.