Ceph Storage and Penguin Computing on Demand
- 2. Copyright © 2013 Penguin Computing, Inc. All rights reserved
Who is Penguin Computing?
● Founded in 1997, with a focus on custom Linux systems
● Core markets: HPC, enterprise/data-center, HPC cloud services
– We see roughly a 50/50 mix between HPC and enterprise orders
– Offer turn-key clusters and a full range of Linux servers
● Now the largest private system integrator in North America
● Stable, profitable, growing...
What is Penguin Computing On Demand (POD)?
● POD launched in 2009 as an HPC-as-a-Service offering
● Purpose-built HPC cluster for on-demand customers
– Offers low-latency interconnects, high core counts, and plentiful RAM for processing
– Non-virtualized compute resources, focused on absolute compute performance
– Tuned MPI/cluster stack available “out of the box”
● “Pay as you go” – customers pay only for what they use, charged per core-hour
● Customizable, persistent user environment
● Over 50 million commercial jobs run
Original POD designs
● Original clusters used standalone DAS NFS servers
● Login nodes ran on VMware, then KVM, with VM disks stored locally on the host
Original POD limitations
● Disparate NFS servers led to a non-global namespace
– Users were unable to take advantage of all installed storage
– Not all disks could contribute to performance (no scale-out effect)
– A full NFS server affected co-resident users
– Each NFS server's RAID card was a SPoF (single point of failure)
● We never lost data, but there were times when data was inaccessible
● VM login nodes were handled by a standalone set of hardware
– Storage servers were not leveraged for hosting VM disks
POD New Architecture
● Time for something different
– More expandable
– More fault tolerant
– More flexible
● OpenStack & Ceph
POD Ceph Usage – OpenStack
● Ceph's OpenStack integration is a big plus
– Store disk images in Ceph (Glance)
– Store volumes in Ceph (Cinder)
– Boot VMs straight from Ceph (boot from volume)
– Leverage copy-on-write (COW) semantics for boot-volume creation
– Live migration
● No immediate need for RADOSGW
– Nice to know it's there if we need it
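The Glance and Cinder integration above is driven by each service's config file. A minimal sketch for a Grizzly-era OpenStack deployment follows; the pool names, user names, and secret UUID are illustrative assumptions, not POD's actual values:

```ini
; glance-api.conf – store disk images in a Ceph pool (assumed pool "images")
default_store = rbd
rbd_store_ceph_conf = /etc/ceph/ceph.conf
rbd_store_user = glance
rbd_store_pool = images

; cinder.conf – back volumes with RBD (assumed pool "volumes")
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
rbd_user = cinder
rbd_secret_uuid = <libvirt-secret-uuid>
```

With `rbd_secret_uuid` registered as a libvirt secret on the compute nodes, Nova can attach and boot from these volumes directly over RBD, which is what enables the COW boot-volume creation and live migration mentioned above.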
POD Ceph Usage - RBD
● The same storage system hosts RBDs for us
● Each POD user has their $HOME in an RBD
– To make it visible to all compute nodes and customer-accessible login nodes, we mount the RBD on one of several NFS servers and export it from there
– We aren't quite ready to throw our full weight behind CephFS, but early testing has started
– We know this creates a performance bottleneck, but the pros outweigh the cons
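The per-user $HOME flow above can be sketched as shell commands run on one of the NFS gateway servers; the pool name, image name, size, and export subnet here are illustrative assumptions:

```shell
# Create and map a per-user RBD image (assumed pool "homes", user "alice")
rbd create homes/alice --size 10240        # 10 GB, thin-provisioned
rbd map homes/alice                        # exposes a block device under /dev/rbd/

# Put a filesystem on it and mount it on the NFS server
mkfs.xfs /dev/rbd/homes/alice
mkdir -p /export/home/alice
mount /dev/rbd/homes/alice /export/home/alice

# Re-export it to the compute and login nodes (illustrative subnet)
exportfs -o rw,no_root_squash 10.0.0.0/16:/export/home/alice
```

Every compute node then mounts the NFS export rather than talking to Ceph directly, which is the single-gateway bottleneck the slide acknowledges.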
POD Ceph Usage – RBD Pros and Cons
● Pros
– Thin provisioning
– Per-user backups and snapshots
– A clean 1:1 mapping of block device to NFS export
● Cons
– The NFS server is a SPoF and a bottleneck
– Loss of parallel access to the OSDs
– Slow-ish resize
POD Storage Hardware
● Started with 5x Penguin Computing IB2712 chassis
– Dual Xeon 5600-series CPUs
– 48GB RAM
– Dual 10GbE
– 12x hot-swap 3.5” SATA drives
– 2x internal SSDs for the OS and OSD journals
● 6 journals on each SSD
● 60x 2TB drives → 120TB raw storage
– 109TB available in Ceph
● XFS on the OSDs
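The gap between 120TB raw and 109TB available is consistent with the usual decimal-vs-binary unit difference: drive vendors count in decimal terabytes, while Ceph reports in binary tebibytes. A quick check of that interpretation:

```python
# 60 drives at 2 TB (decimal, as drive vendors count) each
raw_bytes = 60 * 2 * 10**12          # 120 TB raw
tib = raw_bytes / 2**40              # convert to binary tebibytes (TiB)
print(f"{raw_bytes / 10**12:.0f} TB raw = {tib:.1f} TiB")  # → 120 TB raw = 109.1 TiB
```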
POD Ceph Storage Config
● Running 3 monitors
– On the same chassis as the OSDs (not recommended by Inktank)
● Running 2 MDS processes
– On the same chassis as the OSDs
– 1 active, 1 standby
● Each chassis has a 2-port 10GbE LAG to the ToR switch
● 2 replicas
● Separate pools for Glance, Cinder, and user $HOMEs
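Per-service pools like these are created with the `ceph` CLI; a sketch, assuming pool names that match the services above (the placement-group count is an illustrative choice, not POD's actual value):

```shell
# One pool per service, each with 2 replicas to match the cluster-wide setting
for pool in images volumes homes; do
    ceph osd pool create "$pool" 128     # 128 placement groups (illustrative)
    ceph osd pool set "$pool" size 2     # 2 replicas of each object
done
```

Separate pools keep the Glance, Cinder, and $HOME workloads independently tunable (replica count, PG count) and independently quotable.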
CephFS on POD
● The primary use case for storage on POD is users reading and writing data in their $HOME directories
● On our HPC clusters the workload tends to be sequential writes, but we also see sequential reads and some random I/O
● Running VMs also produces random I/O
● Since users can run jobs spanning dozens of compute nodes, potentially all hitting the same folder(s), it would be nice to use CephFS rather than NFS
● Testing on a scratch space is a good way to start
● Using ceph-fuse, as the cluster runs CentOS 6.3
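Because the CentOS 6.3 kernel is too old for the in-kernel CephFS client, the scratch space is mounted with the FUSE client instead. A minimal sketch; the monitor address and mount point are illustrative assumptions:

```shell
# Mount CephFS via FUSE on a compute node (assumed monitor at 10.0.0.1)
mkdir -p /mnt/ceph-scratch
ceph-fuse -m 10.0.0.1:6789 /mnt/ceph-scratch
```

Unlike the NFS-gateway setup for $HOME, every node mounting CephFS this way talks to the OSDs in parallel, which is the scale-out property being tested.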
CephFS initial benchmarks
● Simple dd benchmark: 1GB file, 4MB blocks
– dd if=/dev/zero of=[dir] bs=4M count=256 conv=fdatasync
Ceph Lessons Learned
● Our 3rd production Ceph cluster
– The 1st has been decommissioned; it ran Argonaut and Bobtail and used IPoIB
– The 2nd is being decommissioned; still running Bobtail
– The 3rd is the primary workhorse for a production POD cluster; launched on Bobtail, now running the latest Cuttlefish
● For RBD, a very recent Linux kernel is a must if using the kernel client (kclient)
– Pre-3.10 kernels had kernel-panic issues when using cephx
● SSDs are nice, but may not be the best bang for the buck
– 3-4 OSD journals per SSD is ideal, but adds significant cost
– We've seen promising results using higher-end RAID controllers in lieu of SSDs, thanks to their write-back cache, at an overall lower cost
– We still need more testing to determine how this behavior carries over across sequential vs. random and small vs. large I/O
● Need to work hard to balance density against manageable failure domains
– Density is very popular, but it leads to a lot of recovery traffic if a server fails
Thanks!
@off_rhoden
trhoden@penguincomputing.com
@PenguinHPC