This document summarizes a project between UC San Francisco and VMware to test running high-performance computing (HPC) workloads in a private cloud environment. The project aimed to show that certain life sciences workloads could run virtually without significant performance degradation compared to dedicated hardware. An initial private cloud was set up using Dell servers, EMC Isilon and DDN storage, and Mellanox switches. Benchmarking of applications such as BLAST, Bowtie, and R was planned to compare performance between bare-metal and virtualized environments. The results would assess whether the private cloud could provide benefits such as self-service provisioning, multi-tenancy, and isolation of workloads.
VMworld 2013: How UC San Francisco Delivered ‘Science as a Service’ with Private Cloud for HPC
1. How UC San Francisco Delivered ‘Science as a Service’ with Private Cloud for HPC
Brad Dispensa, University of California
Andrew Nelson, VMware
VSVC5272
#VSVC5272
2. Agenda
Who we are
Motivation
Project
Architecture
Next steps
3. Who We Are
Andrew Nelson
Staff Systems Engineer
• VMware
• VCDX#33
4. Who We Are
Brad Dispensa
Director of IT/IS, UCSF
• Department of Anesthesia
• Institute for Human Genetics
8. What This Is Not…
We are not launching a new product
This is about a collaboration to determine the use cases and limitations of running, virtually, workloads that have historically run on dedicated HPC clusters
We will share what we find so you can make your own choices
9. Motivation
A need to deploy HPC as a service
• *Where the use case makes sense
Where could it make sense?
• Jobs that are not dependent on saturating all I/O
• Jobs that don’t require all available resources
• Jobs that require bleeding edge packages
• Users want to run as root (Really?!)
• Users want to run an unsupported OS
• Development / QA
• Job integrity more important than run time
• Funding issues (grant-based)
11. Bias?
Why VM people think they wouldn’t do this
• “You will saturate my servers and cause slowdowns in production systems”
• “I don’t have HPC Fabric”
• “VM sprawl would take over my datacenter”
• “How would I begin to scope for a use case that does not fit the usual 20%
utilization model?”
Why HPC people think they wouldn’t do this
• “It’s not high performance”
• “It will be slow and unwieldy”
• “My app has to be run on dedicated hardware”
• “Latency introduced by the hypervisor”
• “That won’t work for my weird use case”
12. Motivation
Here’s the thing…
• Most life science jobs are single-threaded
• Most “programmers” are grad students
• HPC in life sciences is not the same as HPC for oil and gas or other engineering users
• We are not “critical”; it’s research, so five nines is not our deal
• When do most long runs start? Friday. It’s nice to use hardware that was otherwise going to idle all weekend
• How is this any different from any other discussion in HPC?
• We often debate which file system, chipset, or controller is better
• It’s never been one size fits all
• We spend more time sizing than just running
• The hardware should really be agnostic
• Should we buy …. or ….
http://frabz.com/meme-generator/caption/10406-Willy-Wonka/
13. Run Any Software Stacks
[Diagram: App A on OS A and App B on OS B running side by side across three hosts, each with a virtualization layer on its own hardware]
Support groups with disparate software requirements
Including root access
15. Use Resources More Efficiently
[Diagram: VMs (Apps A, B, C on OS A and OS B) consolidated onto fewer virtualized hosts]
Avoid killing or pausing jobs
Increase overall throughput
17. Multi-tenancy with Resource Guarantees
Define policies to manage resource sharing between groups (sketched below)
[Diagram: VMs from multiple groups (Apps A, B, C on OS A and OS B) sharing three virtualized hosts under group resource policies]
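The deck does not show the policy mechanics; as a minimal sketch, per-group guarantees could be expressed as vSphere resource pools with proportional shares via pyVmomi. The vCenter host, credentials, inventory path, group names, and share values below are illustrative assumptions, not the project's actual configuration:

```python
# Minimal sketch: one vSphere resource pool per research group, with
# proportional CPU/memory shares. Host, credentials, inventory path,
# group names, and share values are all illustrative assumptions.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def alloc(shares):
    """ResourceAllocationInfo with custom proportional shares and no hard cap."""
    return vim.ResourceAllocationInfo(
        reservation=0,               # no hard guarantee in this sketch
        expandableReservation=True,  # may borrow from the parent pool
        limit=-1,                    # uncapped; contention settled by shares
        shares=vim.SharesInfo(level='custom', shares=shares))

ctx = ssl._create_unverified_context()  # lab shortcut; validate certs in production
si = SmartConnect(host='vcenter.example.org', user='admin',
                  pwd='secret', sslContext=ctx)
try:
    cluster = si.content.searchIndex.FindByInventoryPath(
        'UCSF-DC/host/HPC-Cluster')  # hypothetical datacenter/cluster path
    # Under contention, group 1 gets twice group 2's share of CPU and memory.
    for name, shares in (('research-group-1', 8000), ('research-group-2', 4000)):
        spec = vim.ResourceConfigSpec(cpuAllocation=alloc(shares),
                                      memoryAllocation=alloc(shares))
        cluster.resourcePool.CreateResourcePool(name=name, spec=spec)
finally:
    Disconnect(si)
```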
18. Protect Applications from Hardware Failures
Reactive Fault Tolerance: “Fail and Recover”
[Diagram: a VM running App A fails along with its host and is restarted on another virtualized host]
19. Protect Applications from Hardware Failures
Proactive Fault Tolerance: “Move and Continue”
[Diagram: an MPI job (ranks MPI-0, MPI-1, MPI-2, each in its own VM) is moved off a degrading virtualized host and continues running elsewhere]
20. Elastic Application Layers
Ability to decouple compute and data and size each appropriately
Multi-threading vs. multi-VMs
[Diagram: application stacks (Apps A, B, C on OS A and OS B) refactored into separate compute and data VMs, each sized independently across virtualized hosts]
24. Project Overview
Collaborative research effort between UCSF and the VMware Field and CTO Office
• Additional participation by NVIDIA, EMC/Isilon, and DDN
Prove out the value of a private cloud solution for HPC life sciences workloads
Stand up a small private cloud on customer-supplied hardware
• Dell M1000e blade chassis
• Dell M610 blades
• FDR InfiniBand
• EqualLogic VMDK storage
• DDN GPFS store
• EMC/Isilon store (NFS)
Testing to include an array of life sciences applications important to UCSF, including some testing of the use of VMware VDI to move science desktop workloads into the private cloud
25. Project Overview
Desktop visualization
• Could we also replace expensive desktops with thin-client-like devices for users who need to visualize complex imaging datasets or 3D instrument datasets?
27. Project Overview – Success Factors
Doesn’t have to be as fast as bare metal, but can’t be significantly slower
The end product must allow a user to self-provision a host from a vetted list of options
• “I want 10 Ubuntu machines that I can run as root, with X packages installed” (see the sketch after this list)
Environment must be agile, allowing different workloads to cohabit a single hardware environment
• E.g., you can run an R workload on the same blade that is running a desktop visualization job
Whatever you could do on metal, you have to be able to do in virtualization (*)
Users must be fully sandboxed to prevent “bad stuff” from leaving their workloads
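In the project this request flows through the vCloud Automation Center catalog; purely as an illustration of the underlying vSphere operation, the “10 Ubuntu machines” request might reduce to cloning a vetted template ten times. Template name, inventory paths, and credentials here are hypothetical:

```python
# Sketch of the "10 Ubuntu machines" request as ten clones of a vetted
# template. In the deck this is fronted by the vCAC catalog; all names
# and inventory paths here are hypothetical.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab shortcut
si = SmartConnect(host='vcenter.example.org', user='admin',
                  pwd='secret', sslContext=ctx)
try:
    idx = si.content.searchIndex
    template = idx.FindByInventoryPath('UCSF-DC/vm/templates/ubuntu-vetted')
    pool = idx.FindByInventoryPath(
        'UCSF-DC/host/HPC-Cluster/Resources/research-group-1')
    # Place each clone in the requesting group's resource pool and power it on.
    spec = vim.vm.CloneSpec(location=vim.vm.RelocateSpec(pool=pool),
                            powerOn=True)
    tasks = [template.CloneVM_Task(folder=template.parent,
                                   name='ubuntu-node-%02d' % i, spec=spec)
             for i in range(10)]
finally:
    Disconnect(si)
```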
28. Agenda
Who we are
Motivation
Project
Architecture
Next steps
29. VMware vCAC
[Diagram: “Secure Private Cloud for HPC”. VMware vCloud Automation Center (vCAC) fronts users and IT across Research Groups 1…m with user portals, catalogs, security (VMware vCNS), and programmatic control and integrations; behind it, Research Clusters 1…n each run VMware vSphere under a VMware vCenter Server, with the option to reach public clouds]
30. Architecture
The components are “off the shelf”
• Standard Dell servers
• Mellanox FDR switches
• Isilon and DDN are tuned as normal
No custom workflows
• We tune the nodes the same way you normally would in your virtual and HPC environments (one example is sketched below).
There is no “next-gen” black-box appliance here; what we have, you can have.
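The deck does not enumerate the tunings. One common vSphere-side example, offered only as a sketch, is raising a worker VM's latency sensitivity and backing it with a full memory reservation; the VM name and inventory path are hypothetical:

```python
# Sketch: one common vSphere-side HPC tuning, raising a VM's latency
# sensitivity and reserving all of its memory. This illustrates "tune
# the nodes as you normally would"; the deck does not name specific
# settings, so treat these as assumptions.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab shortcut
si = SmartConnect(host='vcenter.example.org', user='admin',
                  pwd='secret', sslContext=ctx)
try:
    vm = si.content.searchIndex.FindByInventoryPath('UCSF-DC/vm/hpc-node-01')
    spec = vim.vm.ConfigSpec(
        # High latency sensitivity (vSphere 5.5+) reduces virtualization jitter.
        latencySensitivity=vim.LatencySensitivity(level='high'),
        # High latency sensitivity expects the VM's memory to be fully reserved.
        memoryAllocation=vim.ResourceAllocationInfo(
            reservation=vm.config.hardware.memoryMB))
    vm.ReconfigVM_Task(spec=spec)
finally:
    Disconnect(si)
```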
31. Architecture
Why blades?
• It’s what we have...
• The chassis allows us to isolate hardware more easily for initial testing, and blades are commonly deployed both in dense virtualization environments and in HPC.
32. Agenda
Who we are
Motivation
Project
Architecture
Next steps
33. Next Steps
The results will report performance comparisons between bare metal and virtualized for a set of life sciences applications important to UCSF:
• BLAST – running a synthetic dataset
• Bowtie
• Affymetrix and Illumina genomics pipelines (both with vendor-supplied test datasets)
• R – with a stem-cell dataset (likely) or a hypertension dataset (possibly)
• Desktop virtualization
The results will also report on the use of VDI to move current workstation science applications onto the proof-of-concept server cluster
An important part of this will be an assessment of the hypothesized value propositions: self-service provisioning, multi-tenancy, etc.
34. Next Steps
Complete initial benchmarking
• Capture core metrics on the physical hardware, then capture the same data on a virtualized host (a timing sketch follows this list).
Does it work?
• What happens when we start to scale upwards? Does performance stay linear?
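As a minimal sketch of that capture step: time repeated runs of the same command on the physical host, repeat inside a VM, and compare the distributions. The blastn invocation, database, and file names below are placeholders, not the project's actual dataset:

```python
# Sketch of the bare-metal vs. virtual comparison: time repeated runs of
# a benchmark command and report mean/stdev wall-clock time. Run the same
# script on the physical host and in the VM, then compare the numbers.
# The blastn arguments and file names are placeholder assumptions.
import statistics
import subprocess
import time

CMD = ['blastn', '-query', 'synthetic.fa', '-db', 'nt_subset',
       '-num_threads', '1', '-out', '/dev/null']

def run_once(cmd):
    """Run one benchmark iteration and return its wall-clock time in seconds."""
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - t0

times = [run_once(CMD) for _ in range(5)]
print('mean %.1fs  stdev %.1fs  runs %s' %
      (statistics.mean(times), statistics.stdev(times),
       ['%.1f' % t for t in times]))
```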
39. How UC San Francisco Delivered ‘Science as a Service’ with Private Cloud for HPC
Brad Dispensa, University of California
Andrew Nelson, VMware
VSVC5272
#VSVC5272