Ceph Storage and Penguin Computing on Demand
- 2. Copyright © 2013 Penguin Computing, Inc. All rights reserved
Who is Penguin Computing?
● Founded in 1997, with a focus on custom Linux systems
● Core markets: HPC, enterprise/data-center, HPC cloud services
– We see roughly a 50/50 mix between HPC and enterprise orders
– Offer turn-key clusters and a full range of Linux servers
● Now the largest private system integrator in North America
● Stable, profitable, growing...
What is Penguin Computing On Demand (POD)?
● POD launched in 2009 as an HPC-as-a-Service offering
● Purpose-built HPC cluster for on-demand customers
– Offers low-latency interconnects, high core counts, and plentiful RAM for processing
– Non-virtualized compute resources, focused on absolute compute performance
– Tuned MPI/cluster stack available “out of the box”
● “Pay as you go” – customers pay only for what they use, charged per core-hour
● Customizable, persistent user environment
● Over 50 million commercial jobs run
Original POD designs
● Original clusters used standalone DAS NFS servers
● Login nodes ran on VMware, then KVM, with VM disks stored locally on the host
Original POD limitations
● Disparate NFS servers led to a non-global namespace
– Users were unable to take advantage of all installed storage
– Not all disks could contribute to performance (no scale-out effect)
– A full NFS server affected co-resident users
– Each NFS server's RAID card was a SPoF (single point of failure)
● We never lost data, but there were times when data was inaccessible
● VM login nodes were handled by a standalone set of hardware
– Storage servers were not leveraged for hosting VM disks
POD New Architecture
● Time for something different
– More expandable
– More fault tolerant
– More flexible
● OpenStack & Ceph
POD Ceph Usage – OpenStack
● Ceph's OpenStack integration is a big plus
– Store disk images in Ceph (Glance)
– Store volumes in Ceph (Cinder)
– Boot VMs straight from Ceph (boot from volume)
– Leverage copy-on-write (COW) semantics for boot-volume creation
– Live migration
● No immediate need for RADOSGW
– Nice to know it's there if we need it
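The Glance and Cinder integration above is driven by each service's config file. A minimal sketch for a Grizzly-era OpenStack deployment follows; the pool names, user names, and secret UUID are illustrative assumptions, not POD's actual values:

```ini
; glance-api.conf – store disk images in a Ceph pool (assumed pool "images")
default_store = rbd
rbd_store_ceph_conf = /etc/ceph/ceph.conf
rbd_store_user = glance
rbd_store_pool = images

; cinder.conf – back volumes with RBD (assumed pool "volumes")
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
rbd_user = cinder
rbd_secret_uuid = <libvirt-secret-uuid>
```

With `rbd_secret_uuid` registered as a libvirt secret on the compute nodes, Nova can attach and boot from these volumes directly over RBD, which is what enables the COW boot-volume creation and live migration mentioned above.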
POD Ceph Usage - RBD
● The same storage system hosts RBDs for us
● Each POD user has their $HOME in an RBD
– To make it visible to all compute nodes and customer-accessible login nodes, we mount the RBD on one of several NFS servers and export it from there
– We aren't quite ready to throw our full weight behind CephFS, but early testing has started
– We know this creates a performance bottleneck, but the pros outweigh the cons
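The per-user $HOME flow above can be sketched as shell commands run on one of the NFS gateway servers; the pool name, image name, size, and export subnet here are illustrative assumptions:

```shell
# Create and map a per-user RBD image (assumed pool "homes", user "alice")
rbd create homes/alice --size 10240        # 10 GB, thin-provisioned
rbd map homes/alice                        # exposes a block device under /dev/rbd/

# Put a filesystem on it and mount it on the NFS server
mkfs.xfs /dev/rbd/homes/alice
mkdir -p /export/home/alice
mount /dev/rbd/homes/alice /export/home/alice

# Re-export it to the compute and login nodes (illustrative subnet)
exportfs -o rw,no_root_squash 10.0.0.0/16:/export/home/alice
```

Every compute node then mounts the NFS export rather than talking to Ceph directly, which is the single-gateway bottleneck the slide acknowledges.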
POD Ceph Usage – RBD Pros and Cons
● Pros
– Thin provisioning
– Per-user backups and snapshots
– A clean 1:1 mapping of block device to NFS export
● Cons
– The NFS server is a SPoF and a bottleneck
– Loss of parallel access to the OSDs
– Slow-ish resize
POD Storage Hardware
● Started with 5x Penguin Computing IB2712 chassis
– Dual Xeon 5600-series CPUs
– 48GB RAM
– Dual 10GbE
– 12x hot-swap 3.5” SATA drives
– 2x internal SSDs for the OS and OSD journals
● 6 journals on each SSD
● 60x 2TB drives → 120TB raw storage
– 109TB available in Ceph
● XFS on the OSDs
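The gap between 120TB raw and 109TB available is consistent with the usual decimal-vs-binary unit difference: drive vendors count in decimal terabytes, while Ceph reports in binary tebibytes. A quick check of that interpretation:

```python
# 60 drives at 2 TB (decimal, as drive vendors count) each
raw_bytes = 60 * 2 * 10**12          # 120 TB raw
tib = raw_bytes / 2**40              # convert to binary tebibytes (TiB)
print(f"{raw_bytes / 10**12:.0f} TB raw = {tib:.1f} TiB")  # → 120 TB raw = 109.1 TiB
```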
POD Ceph Storage Config
● Running 3 monitors
– On the same chassis as the OSDs (not recommended by Inktank)
● Running 2 MDS processes
– On the same chassis as the OSDs
– 1 active, 1 standby
● Each chassis has a 2-port 10GbE LAG to the ToR switch
● 2 replicas
● Separate pools for Glance, Cinder, and user $HOMEs
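Per-service pools like these are created with the `ceph` CLI; a sketch, assuming pool names that match the services above (the placement-group count is an illustrative choice, not POD's actual value):

```shell
# One pool per service, each with 2 replicas to match the cluster-wide setting
for pool in images volumes homes; do
    ceph osd pool create "$pool" 128     # 128 placement groups (illustrative)
    ceph osd pool set "$pool" size 2     # 2 replicas of each object
done
```

Separate pools keep the Glance, Cinder, and $HOME workloads independently tunable (replica count, PG count) and independently quotable.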
CephFS on POD
● The primary use case for storage on POD is users reading and writing data in their $HOME directories
● On our HPC clusters the workload tends to be sequential writes, but we also see sequential reads and some random I/O
● Running VMs also produces random I/O
● Since users can run jobs spanning dozens of compute nodes, potentially all hitting the same folder(s), it would be nice to use CephFS rather than NFS
● Testing on a scratch space is a good way to start
● Using ceph-fuse, as the cluster runs CentOS 6.3
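Because the CentOS 6.3 kernel is too old for the in-kernel CephFS client, the scratch space is mounted with the FUSE client instead. A minimal sketch; the monitor address and mount point are illustrative assumptions:

```shell
# Mount CephFS via FUSE on a compute node (assumed monitor at 10.0.0.1)
mkdir -p /mnt/ceph-scratch
ceph-fuse -m 10.0.0.1:6789 /mnt/ceph-scratch
```

Unlike the NFS-gateway setup for $HOME, every node mounting CephFS this way talks to the OSDs in parallel, which is the scale-out property being tested.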
CephFS initial benchmarks
● Simple dd benchmark: 1GB file, 4MB blocks
– dd if=/dev/zero of=[dir] bs=4M count=256 conv=fdatasync
Ceph Lessons Learned
● Our 3rd production Ceph cluster
– The 1st has been decommissioned; it ran Argonaut and Bobtail and used IPoIB
– The 2nd is being decommissioned; still running Bobtail
– The 3rd is the primary workhorse for a production POD cluster; launched on Bobtail, now running the latest Cuttlefish
● For RBD, a very recent Linux kernel is a must if using the kernel client (kclient)
– Pre-3.10 kernels had kernel-panic issues when using cephx
● SSDs are nice, but may not be the best bang for the buck
– 3-4 OSD journals per SSD is ideal, but adds significant cost
– We've seen promising results using higher-end RAID controllers in lieu of SSDs, thanks to their write-back cache, at an overall lower cost
– We still need more testing to determine how this behavior carries over across sequential vs. random and small vs. large I/O
● Need to work hard to balance density against manageable failure domains
– Density is very popular, but it leads to a lot of recovery traffic if a server fails
Thanks!
@off_rhoden
trhoden@penguincomputing.com
@PenguinHPC